-
Notifications
You must be signed in to change notification settings - Fork 639
fixtures: add function to resolve sample shortened name by MD5 #2802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This adds get_sample_short_name_by_md5() to provide reverse lookup functionality for test fixtures. The new function is the inverse of the existing get_sample_md5_by_name() function. This is useful when you have an MD5 hash and need to determine which sample fixture it corresponds to, particularly for test infrastructure that validates sample identity. Changes: - Add get_sample_short_name_by_md5() function in tests/fixtures.py - Add tests/test_fixtures.py with unit tests for the new function - Tests verify both direct lookup and roundtrip behavior - Update CHANGELOG.md closes mandiant#1743
Summary of ChangesHello @kami922, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a gap in the test fixture infrastructure by implementing a reverse lookup mechanism. Previously, it was possible to find an MD5 hash by sample name, but not the other way around. The new function Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a useful reverse lookup function for test fixtures. The implementation is functional, but I've provided suggestions to improve maintainability and performance, particularly by refactoring a long if/elif chain into a dictionary lookup. The new tests are comprehensive, and I've suggested using pytest.mark.parametrize to enhance their readability and extensibility. Lastly, a minor correction is needed in the CHANGELOG.md to align with project conventions and protect privacy.
| def get_sample_short_name_by_md5(md5): | ||
| """reverse lookup: given an MD5 hash, return the sample's shortened name""" | ||
| if md5 == "5f66b82558ca92e54e77f216ef4c066c": | ||
| return "mimikatz" | ||
| elif md5 == "e80758cf485db142fca1ee03a34ead05": | ||
| return "kernel32" | ||
| elif md5 == "a8565440629ac87f6fef7d588fe3ff0f": | ||
| return "kernel32-64" | ||
| elif md5 == "56bed8249e7c2982a90e54e1e55391a2": | ||
| return "pma12-04" | ||
| elif md5 == "7faafc7e4a5c736ebfee6abbbc812d80": | ||
| return "pma16-01" | ||
| elif md5 == "290934c61de9176ad682ffdd65f0a669": | ||
| return "pma01-01" | ||
| elif md5 == "c8403fb05244e23a7931c766409b5e22": | ||
| return "pma21-01" | ||
| elif md5 == "db648cd247281954344f1d810c6fd590": | ||
| return "al-khaser x86" | ||
| elif md5 == "3cb21ae76ff3da4b7e02d77ff76e82be": | ||
| return "al-khaser x64" | ||
| elif md5 == "b7841b9d5dc1f511a93cc7576672ec0c": | ||
| return "39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41" | ||
| elif md5 == "499c2a85f6e8142c3f48d4251c9c7cd6": | ||
| return "499c2a85f6e8142c3f48d4251c9c7cd6" | ||
| elif md5 == "9324d1a8ae37a36ae560c37448c9705a": | ||
| return "9324d1a8ae37a36ae560c37448c9705a" | ||
| elif md5 == "a198216798ca38f280dc413f8c57f2c2": | ||
| return "a198216798ca38f280dc413f8c57f2c2" | ||
| elif md5 == "a933a1a402775cfa94b6bee0963f4b46": | ||
| return "a933a1a402775cfa94b6bee0963f4b46" | ||
| elif md5 == "bfb9b5391a13d0afd787e87ab90f14f5": | ||
| return "bfb9b5391a13d0afd787e87ab90f14f5" | ||
| elif md5 == "c91887d861d9bd4a5872249b641bc9f9": | ||
| return "c91887d861d9bd4a5872249b641bc9f9" | ||
| elif md5 == "64d9f7d96b99467f36e22fada623c3bb": | ||
| return "64d9f7d96b99467f36e22fada623c3bb" | ||
| elif md5 == "82bf6347acf15e5d883715dc289d8a2b": | ||
| return "82bf6347acf15e5d883715dc289d8a2b" | ||
| elif md5 == "773290480d5445f11d3dc1b800728966": | ||
| return "773290480d5445f11d3dc1b800728966" | ||
| elif md5 == "56a6ffe6a02941028cc8235204eef31d": | ||
| return "3b13b6ba338bc08b0ddcd24515a99faf211da0e7296798162c5c4f9de8af9e2c" | ||
| elif md5 == "7351f8a40c5450557b24622417fc478d": | ||
| return "7351f8a40c5450557b24622417fc478d" | ||
| elif md5 == "79abd17391adc6251ecdc58d13d76baf": | ||
| return "79abd17391adc6251ecdc58d13d76baf" | ||
| elif md5 == "946a99f36a46d335dec080d9a4371940": | ||
| return "946a99f36a46d335dec080d9a4371940" | ||
| elif md5 == "b9f5bd514485fb06da39beff051b9fdc": | ||
| return "b9f5bd514485fb06da39beff051b9fdc" | ||
| elif md5 == "3db3e55b16a7b1b1afb970d5e77c5d98": | ||
| return "294b8db1f2702b60fb2e42fdc50c2cee6a5046112da9a5703a548a4fa50477bc" | ||
| elif md5 == "2bf18d0403677378adad9001b1243211": | ||
| return "2bf18d0403677378adad9001b1243211" | ||
| elif md5 == "76fa734236daa023444dec26863401dc": | ||
| return "ea28761b7b6aa3b4111b8e0f17f83e40" | ||
| else: | ||
| raise ValueError(f"unexpected sample MD5: {md5}") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This long if/elif chain is inefficient and hard to maintain. Using a dictionary for lookups would be much better, providing O(1) average time complexity instead of O(n). It also makes the code more readable and less prone to errors when adding new samples.
For even better maintainability, you might consider creating a single source of truth for the sample mappings to avoid data duplication between get_sample_md5_by_name and this new function. This would prevent potential inconsistencies when adding or updating samples in the future.
_MD5_TO_SHORT_NAME = {
"5f66b82558ca92e54e77f216ef4c066c": "mimikatz",
"e80758cf485db142fca1ee03a34ead05": "kernel32",
"a8565440629ac87f6fef7d588fe3ff0f": "kernel32-64",
"56bed8249e7c2982a90e54e1e55391a2": "pma12-04",
"7faafc7e4a5c736ebfee6abbbc812d80": "pma16-01",
"290934c61de9176ad682ffdd65f0a669": "pma01-01",
"c8403fb05244e23a7931c766409b5e22": "pma21-01",
"db648cd247281954344f1d810c6fd590": "al-khaser x86",
"3cb21ae76ff3da4b7e02d77ff76e82be": "al-khaser x64",
"b7841b9d5dc1f511a93cc7576672ec0c": "39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41",
"499c2a85f6e8142c3f48d4251c9c7cd6": "499c2a85f6e8142c3f48d4251c9c7cd6",
"9324d1a8ae37a36ae560c37448c9705a": "9324d1a8ae37a36ae560c37448c9705a",
"a198216798ca38f280dc413f8c57f2c2": "a198216798ca38f280dc413f8c57f2c2",
"a933a1a402775cfa94b6bee0963f4b46": "a933a1a402775cfa94b6bee0963f4b46",
"bfb9b5391a13d0afd787e87ab90f14f5": "bfb9b5391a13d0afd787e87ab90f14f5",
"c91887d861d9bd4a5872249b641bc9f9": "c91887d861d9bd4a5872249b641bc9f9",
"64d9f7d96b99467f36e22fada623c3bb": "64d9f7d96b99467f36e22fada623c3bb",
"82bf6347acf15e5d883715dc289d8a2b": "82bf6347acf15e5d883715dc289d8a2b",
"773290480d5445f11d3dc1b800728966": "773290480d5445f11d3dc1b800728966",
"56a6ffe6a02941028cc8235204eef31d": "3b13b6ba338bc08b0ddcd24515a99faf211da0e7296798162c5c4f9de8af9e2c",
"7351f8a40c5450557b24622417fc478d": "7351f8a40c5450557b24622417fc478d",
"79abd17391adc6251ecdc58d13d76baf": "79abd17391adc6251ecdc58d13d76baf",
"946a99f36a46d335dec080d9a4371940": "946a99f36a46d335dec080d9a4371940",
"b9f5bd514485fb06da39beff051b9fdc": "b9f5bd514485fb06da39beff051b9fdc",
"3db3e55b16a7b1b1afb970d5e77c5d98": "294b8db1f2702b60fb2e42fdc50c2cee6a5046112da9a5703a548a4fa50477bc",
"2bf18d0403677378adad9001b1243211": "2bf18d0403677378adad9001b1243211",
"76fa734236daa023444dec26863401dc": "ea28761b7b6aa3b4111b8e0f17f83e40",
}
def get_sample_short_name_by_md5(md5):
"""reverse lookup: given an MD5 hash, return the sample's shortened name"""
try:
return _MD5_TO_SHORT_NAME[md5]
except KeyError:
raise ValueError(f"unexpected sample MD5: {md5}")
CHANGELOG.md
Outdated
|
|
||
| - ci: remove redundant "test_run" action from build workflow @mike-hunhoff #2692 | ||
| - dev: add bumpmyversion to bump and sync versions across the project @mr-tz | ||
| - fixtures: add function to resolve sample shortened name by MD5 [email protected] #TBD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace your email address with your GitHub handle to maintain privacy and consistency with other entries in this file. Also, the #TBD should be replaced with the pull request number once it's known.
| - fixtures: add function to resolve sample shortened name by MD5 [email protected] #TBD | |
| - fixtures: add function to resolve sample shortened name by MD5 @kamranulhaq2002 #TBD |
tests/test_fixtures.py
Outdated
| def test_get_sample_short_name_by_md5(): | ||
| """Test that get_sample_short_name_by_md5 correctly returns sample names""" | ||
| # Test known samples | ||
| assert fixtures.get_sample_short_name_by_md5("5f66b82558ca92e54e77f216ef4c066c") == "mimikatz" | ||
| assert fixtures.get_sample_short_name_by_md5("e80758cf485db142fca1ee03a34ead05") == "kernel32" | ||
| assert fixtures.get_sample_short_name_by_md5("a8565440629ac87f6fef7d588fe3ff0f") == "kernel32-64" | ||
| assert fixtures.get_sample_short_name_by_md5("db648cd247281954344f1d810c6fd590") == "al-khaser x86" | ||
| assert fixtures.get_sample_short_name_by_md5("3cb21ae76ff3da4b7e02d77ff76e82be") == "al-khaser x64" | ||
|
|
||
| # Test that unknown MD5 raises ValueError | ||
| with pytest.raises(ValueError, match="unexpected sample MD5"): | ||
| fixtures.get_sample_short_name_by_md5("0000000000000000000000000000000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test function handles two separate concerns: testing known values and testing error handling for unknown values. It's a good practice to separate these into distinct test functions for clarity. Additionally, using pytest.mark.parametrize for the known values will make the test more readable and easier to extend with new test cases.
| def test_get_sample_short_name_by_md5(): | |
| """Test that get_sample_short_name_by_md5 correctly returns sample names""" | |
| # Test known samples | |
| assert fixtures.get_sample_short_name_by_md5("5f66b82558ca92e54e77f216ef4c066c") == "mimikatz" | |
| assert fixtures.get_sample_short_name_by_md5("e80758cf485db142fca1ee03a34ead05") == "kernel32" | |
| assert fixtures.get_sample_short_name_by_md5("a8565440629ac87f6fef7d588fe3ff0f") == "kernel32-64" | |
| assert fixtures.get_sample_short_name_by_md5("db648cd247281954344f1d810c6fd590") == "al-khaser x86" | |
| assert fixtures.get_sample_short_name_by_md5("3cb21ae76ff3da4b7e02d77ff76e82be") == "al-khaser x64" | |
| # Test that unknown MD5 raises ValueError | |
| with pytest.raises(ValueError, match="unexpected sample MD5"): | |
| fixtures.get_sample_short_name_by_md5("0000000000000000000000000000000") | |
| @pytest.mark.parametrize( | |
| "md5, name", | |
| [ | |
| ("5f66b82558ca92e54e77f216ef4c066c", "mimikatz"), | |
| ("e80758cf485db142fca1ee03a34ead05", "kernel32"), | |
| ("a8565440629ac87f6fef7d588fe3ff0f", "kernel32-64"), | |
| ("db648cd247281954344f1d810c6fd590", "al-khaser x86"), | |
| ("3cb21ae76ff3da4b7e02d77ff76e82be", "al-khaser x64"), | |
| ], | |
| ) | |
| def test_get_sample_short_name_by_md5(md5, name): | |
| """Test that get_sample_short_name_by_md5 correctly returns sample names""" | |
| assert fixtures.get_sample_short_name_by_md5(md5) == name | |
| def test_get_sample_short_name_by_md5_unknown(): | |
| """Test that get_sample_short_name_by_md5 raises ValueError for an unknown MD5""" | |
| with pytest.raises(ValueError, match="unexpected sample MD5"): | |
| fixtures.get_sample_short_name_by_md5("0000000000000000000000000000000") |
tests/test_fixtures.py
Outdated
| def test_md5_name_lookup_roundtrip(): | ||
| """Test that get_sample_md5_by_name and get_sample_short_name_by_md5 are inverses""" | ||
| # Test samples with simple names (not hash-based names) | ||
| simple_names = ["mimikatz", "kernel32", "kernel32-64", "pma12-04", "pma16-01", "pma01-01", "pma21-01"] | ||
|
|
||
| for name in simple_names: | ||
| md5 = fixtures.get_sample_md5_by_name(name) | ||
| assert fixtures.get_sample_short_name_by_md5(md5) == name | ||
|
|
||
| # Test samples with hash-based names that use startswith() logic | ||
| hash_based_samples = [ | ||
| ("499c2a85f6e8142c3f48d4251c9c7cd6", "499c2a85f6e8142c3f48d4251c9c7cd6"), | ||
| ("9324d1a8ae37a36ae560c37448c9705a", "9324d1a8ae37a36ae560c37448c9705a"), | ||
| ("64d9f7d96b99467f36e22fada623c3bb", "64d9f7d96b99467f36e22fada623c3bb"), | ||
| ] | ||
|
|
||
| for md5_hash, expected_name in hash_based_samples: | ||
| assert fixtures.get_sample_short_name_by_md5(fixtures.get_sample_md5_by_name(expected_name)) == expected_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test can be made more readable and maintainable by using pytest.mark.parametrize. This separates the test data from the test logic, making it easier to add new test cases and understand the intent of the test. I've also split it into two separate tests for simple and hash-based names for clarity.
@pytest.mark.parametrize(
"name",
["mimikatz", "kernel32", "kernel32-64", "pma12-04", "pma16-01", "pma01-01", "pma21-01"],
)
def test_md5_name_lookup_roundtrip_simple_names(name):
"""Test that get_sample_md5_by_name and get_sample_short_name_by_md5 are inverses for simple names."""
md5 = fixtures.get_sample_md5_by_name(name)
assert fixtures.get_sample_short_name_by_md5(md5) == name
@pytest.mark.parametrize(
"name",
[
"499c2a85f6e8142c3f48d4251c9c7cd6",
"9324d1a8ae37a36ae560c37448c9705a",
"64d9f7d96b99467f36e22fada623c3bb",
],
)
def test_md5_name_lookup_roundtrip_hash_names(name):
"""Test that get_sample_md5_by_name and get_sample_short_name_by_md5 are inverses for hash-based names."""
assert fixtures.get_sample_short_name_by_md5(fixtures.get_sample_md5_by_name(name)) == name…ests - Refactor get_sample_short_name_by_md5() to use dictionary lookup for O(1) performance - Add _MD5_TO_SHORT_NAME module-level constant for efficient lookups - Refactor tests to use pytest.mark.parametrize for better readability - Split tests into separate functions for different concerns - Fix CHANGELOG: use @kamranulhaq2002 instead of email - Update CHANGELOG: use PR mandiant#2802 instead of #TBD - Fix import order per isort requirements - Apply black and ruff formatting Addresses feedback from Gemini Code Assist bot review.
…ests - Refactor get_sample_short_name_by_md5() to use dictionary lookup for O(1) performance - Add _MD5_TO_SHORT_NAME module-level constant for efficient lookups - Refactor tests to use pytest.mark.parametrize for better readability - Split tests into separate functions for different concerns - Fix CHANGELOG: use @kamranulhaq2002 instead of email - Update CHANGELOG: use PR mandiant#2802 instead of #TBD - Apply code formatting (isort, black, ruff) Addresses feedback from Gemini Code Assist bot review.
18c192f to
703ff5d
Compare
williballenthin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please review the associated issue for strategies that we'd like to use when refactoring this code. We don't want to maintain long lists of file names and hashes manually.
"## Summary
This PR adds a reverse lookup function for test fixtures, addressing issue #1743.
Problem
The codebase has
get_sample_md5_by_name()to look up MD5 hashes by sample name, but there was no way to do the reverse - find the sample name given an MD5 hash.Solution
Added
get_sample_short_name_by_md5()function that takes an MD5 hash and returns the corresponding sample's shortened name. This is useful for test infrastructure that needs to validate sample identity.Changes
get_sample_short_name_by_md5()function (inverse of existingget_sample_md5_by_name())Testing