Skip to content

Conversation

@kami922
Copy link
Contributor

@kami922 kami922 commented Jan 1, 2026

"## Summary

This PR adds a reverse lookup function for test fixtures, addressing issue #1743.

Problem

The codebase has get_sample_md5_by_name() to look up MD5 hashes by sample name, but there was no way to do the reverse - find the sample name given an MD5 hash.

Solution

Added get_sample_short_name_by_md5() function that takes an MD5 hash and returns the corresponding sample's shortened name. This is useful for test infrastructure that needs to validate sample identity.

Changes

  • tests/fixtures.py: Added get_sample_short_name_by_md5() function (inverse of existing get_sample_md5_by_name())
  • tests/test_fixtures.py: Added unit tests verifying both direct lookup and roundtrip behavior
  • CHANGELOG.md: Added entry under Development section

Testing

.venv/bin/python3 -m pytest tests/test_fixtures.py -v

This adds get_sample_short_name_by_md5() to provide reverse lookup
functionality for test fixtures. The new function is the inverse of
the existing get_sample_md5_by_name() function.

This is useful when you have an MD5 hash and need to determine which
sample fixture it corresponds to, particularly for test infrastructure
that validates sample identity.

Changes:
- Add get_sample_short_name_by_md5() function in tests/fixtures.py
- Add tests/test_fixtures.py with unit tests for the new function
- Tests verify both direct lookup and roundtrip behavior
- Update CHANGELOG.md

closes mandiant#1743
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @kami922, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a gap in the test fixture infrastructure by implementing a reverse lookup mechanism. Previously, it was possible to find an MD5 hash by sample name, but not the other way around. The new function get_sample_short_name_by_md5() resolves this, enabling test systems to more robustly validate sample identities by converting MD5 hashes back to their corresponding short names.

Highlights

  • New Functionality: Introduced get_sample_short_name_by_md5() in tests/fixtures.py to provide a reverse lookup for test samples, allowing retrieval of a sample's shortened name given its MD5 hash.
  • Test Coverage: Added a new test file tests/test_fixtures.py with unit tests for get_sample_short_name_by_md5(), including checks for known samples, error handling for unknown MD5s, and roundtrip verification with get_sample_md5_by_name().
  • Changelog Update: An entry has been added to CHANGELOG.md under the Development section to document the new fixture function.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a useful reverse lookup function for test fixtures. The implementation is functional, but I've provided suggestions to improve maintainability and performance, particularly by refactoring a long if/elif chain into a dictionary lookup. The new tests are comprehensive, and I've suggested using pytest.mark.parametrize to enhance their readability and extensibility. Lastly, a minor correction is needed in the CHANGELOG.md to align with project conventions and protect privacy.

Comment on lines 553 to 611
def get_sample_short_name_by_md5(md5):
"""reverse lookup: given an MD5 hash, return the sample's shortened name"""
if md5 == "5f66b82558ca92e54e77f216ef4c066c":
return "mimikatz"
elif md5 == "e80758cf485db142fca1ee03a34ead05":
return "kernel32"
elif md5 == "a8565440629ac87f6fef7d588fe3ff0f":
return "kernel32-64"
elif md5 == "56bed8249e7c2982a90e54e1e55391a2":
return "pma12-04"
elif md5 == "7faafc7e4a5c736ebfee6abbbc812d80":
return "pma16-01"
elif md5 == "290934c61de9176ad682ffdd65f0a669":
return "pma01-01"
elif md5 == "c8403fb05244e23a7931c766409b5e22":
return "pma21-01"
elif md5 == "db648cd247281954344f1d810c6fd590":
return "al-khaser x86"
elif md5 == "3cb21ae76ff3da4b7e02d77ff76e82be":
return "al-khaser x64"
elif md5 == "b7841b9d5dc1f511a93cc7576672ec0c":
return "39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41"
elif md5 == "499c2a85f6e8142c3f48d4251c9c7cd6":
return "499c2a85f6e8142c3f48d4251c9c7cd6"
elif md5 == "9324d1a8ae37a36ae560c37448c9705a":
return "9324d1a8ae37a36ae560c37448c9705a"
elif md5 == "a198216798ca38f280dc413f8c57f2c2":
return "a198216798ca38f280dc413f8c57f2c2"
elif md5 == "a933a1a402775cfa94b6bee0963f4b46":
return "a933a1a402775cfa94b6bee0963f4b46"
elif md5 == "bfb9b5391a13d0afd787e87ab90f14f5":
return "bfb9b5391a13d0afd787e87ab90f14f5"
elif md5 == "c91887d861d9bd4a5872249b641bc9f9":
return "c91887d861d9bd4a5872249b641bc9f9"
elif md5 == "64d9f7d96b99467f36e22fada623c3bb":
return "64d9f7d96b99467f36e22fada623c3bb"
elif md5 == "82bf6347acf15e5d883715dc289d8a2b":
return "82bf6347acf15e5d883715dc289d8a2b"
elif md5 == "773290480d5445f11d3dc1b800728966":
return "773290480d5445f11d3dc1b800728966"
elif md5 == "56a6ffe6a02941028cc8235204eef31d":
return "3b13b6ba338bc08b0ddcd24515a99faf211da0e7296798162c5c4f9de8af9e2c"
elif md5 == "7351f8a40c5450557b24622417fc478d":
return "7351f8a40c5450557b24622417fc478d"
elif md5 == "79abd17391adc6251ecdc58d13d76baf":
return "79abd17391adc6251ecdc58d13d76baf"
elif md5 == "946a99f36a46d335dec080d9a4371940":
return "946a99f36a46d335dec080d9a4371940"
elif md5 == "b9f5bd514485fb06da39beff051b9fdc":
return "b9f5bd514485fb06da39beff051b9fdc"
elif md5 == "3db3e55b16a7b1b1afb970d5e77c5d98":
return "294b8db1f2702b60fb2e42fdc50c2cee6a5046112da9a5703a548a4fa50477bc"
elif md5 == "2bf18d0403677378adad9001b1243211":
return "2bf18d0403677378adad9001b1243211"
elif md5 == "76fa734236daa023444dec26863401dc":
return "ea28761b7b6aa3b4111b8e0f17f83e40"
else:
raise ValueError(f"unexpected sample MD5: {md5}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This long if/elif chain is inefficient and hard to maintain. Using a dictionary for lookups would be much better, providing O(1) average time complexity instead of O(n). It also makes the code more readable and less prone to errors when adding new samples.

For even better maintainability, you might consider creating a single source of truth for the sample mappings to avoid data duplication between get_sample_md5_by_name and this new function. This would prevent potential inconsistencies when adding or updating samples in the future.

_MD5_TO_SHORT_NAME = {
    "5f66b82558ca92e54e77f216ef4c066c": "mimikatz",
    "e80758cf485db142fca1ee03a34ead05": "kernel32",
    "a8565440629ac87f6fef7d588fe3ff0f": "kernel32-64",
    "56bed8249e7c2982a90e54e1e55391a2": "pma12-04",
    "7faafc7e4a5c736ebfee6abbbc812d80": "pma16-01",
    "290934c61de9176ad682ffdd65f0a669": "pma01-01",
    "c8403fb05244e23a7931c766409b5e22": "pma21-01",
    "db648cd247281954344f1d810c6fd590": "al-khaser x86",
    "3cb21ae76ff3da4b7e02d77ff76e82be": "al-khaser x64",
    "b7841b9d5dc1f511a93cc7576672ec0c": "39c05b15e9834ac93f206bc114d0a00c357c888db567ba8f5345da0529cbed41",
    "499c2a85f6e8142c3f48d4251c9c7cd6": "499c2a85f6e8142c3f48d4251c9c7cd6",
    "9324d1a8ae37a36ae560c37448c9705a": "9324d1a8ae37a36ae560c37448c9705a",
    "a198216798ca38f280dc413f8c57f2c2": "a198216798ca38f280dc413f8c57f2c2",
    "a933a1a402775cfa94b6bee0963f4b46": "a933a1a402775cfa94b6bee0963f4b46",
    "bfb9b5391a13d0afd787e87ab90f14f5": "bfb9b5391a13d0afd787e87ab90f14f5",
    "c91887d861d9bd4a5872249b641bc9f9": "c91887d861d9bd4a5872249b641bc9f9",
    "64d9f7d96b99467f36e22fada623c3bb": "64d9f7d96b99467f36e22fada623c3bb",
    "82bf6347acf15e5d883715dc289d8a2b": "82bf6347acf15e5d883715dc289d8a2b",
    "773290480d5445f11d3dc1b800728966": "773290480d5445f11d3dc1b800728966",
    "56a6ffe6a02941028cc8235204eef31d": "3b13b6ba338bc08b0ddcd24515a99faf211da0e7296798162c5c4f9de8af9e2c",
    "7351f8a40c5450557b24622417fc478d": "7351f8a40c5450557b24622417fc478d",
    "79abd17391adc6251ecdc58d13d76baf": "79abd17391adc6251ecdc58d13d76baf",
    "946a99f36a46d335dec080d9a4371940": "946a99f36a46d335dec080d9a4371940",
    "b9f5bd514485fb06da39beff051b9fdc": "b9f5bd514485fb06da39beff051b9fdc",
    "3db3e55b16a7b1b1afb970d5e77c5d98": "294b8db1f2702b60fb2e42fdc50c2cee6a5046112da9a5703a548a4fa50477bc",
    "2bf18d0403677378adad9001b1243211": "2bf18d0403677378adad9001b1243211",
    "76fa734236daa023444dec26863401dc": "ea28761b7b6aa3b4111b8e0f17f83e40",
}


def get_sample_short_name_by_md5(md5):
    """reverse lookup: given an MD5 hash, return the sample's shortened name"""
    try:
        return _MD5_TO_SHORT_NAME[md5]
    except KeyError:
        raise ValueError(f"unexpected sample MD5: {md5}")

CHANGELOG.md Outdated

- ci: remove redundant "test_run" action from build workflow @mike-hunhoff #2692
- dev: add bumpmyversion to bump and sync versions across the project @mr-tz
- fixtures: add function to resolve sample shortened name by MD5 [email protected] #TBD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Please replace your email address with your GitHub handle to maintain privacy and consistency with other entries in this file. Also, the #TBD should be replaced with the pull request number once it's known.

Suggested change
- fixtures: add function to resolve sample shortened name by MD5 [email protected] #TBD
- fixtures: add function to resolve sample shortened name by MD5 @kamranulhaq2002 #TBD

Comment on lines 20 to 31
def test_get_sample_short_name_by_md5():
"""Test that get_sample_short_name_by_md5 correctly returns sample names"""
# Test known samples
assert fixtures.get_sample_short_name_by_md5("5f66b82558ca92e54e77f216ef4c066c") == "mimikatz"
assert fixtures.get_sample_short_name_by_md5("e80758cf485db142fca1ee03a34ead05") == "kernel32"
assert fixtures.get_sample_short_name_by_md5("a8565440629ac87f6fef7d588fe3ff0f") == "kernel32-64"
assert fixtures.get_sample_short_name_by_md5("db648cd247281954344f1d810c6fd590") == "al-khaser x86"
assert fixtures.get_sample_short_name_by_md5("3cb21ae76ff3da4b7e02d77ff76e82be") == "al-khaser x64"

# Test that unknown MD5 raises ValueError
with pytest.raises(ValueError, match="unexpected sample MD5"):
fixtures.get_sample_short_name_by_md5("0000000000000000000000000000000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test function handles two separate concerns: testing known values and testing error handling for unknown values. It's a good practice to separate these into distinct test functions for clarity. Additionally, using pytest.mark.parametrize for the known values will make the test more readable and easier to extend with new test cases.

Suggested change
def test_get_sample_short_name_by_md5():
"""Test that get_sample_short_name_by_md5 correctly returns sample names"""
# Test known samples
assert fixtures.get_sample_short_name_by_md5("5f66b82558ca92e54e77f216ef4c066c") == "mimikatz"
assert fixtures.get_sample_short_name_by_md5("e80758cf485db142fca1ee03a34ead05") == "kernel32"
assert fixtures.get_sample_short_name_by_md5("a8565440629ac87f6fef7d588fe3ff0f") == "kernel32-64"
assert fixtures.get_sample_short_name_by_md5("db648cd247281954344f1d810c6fd590") == "al-khaser x86"
assert fixtures.get_sample_short_name_by_md5("3cb21ae76ff3da4b7e02d77ff76e82be") == "al-khaser x64"
# Test that unknown MD5 raises ValueError
with pytest.raises(ValueError, match="unexpected sample MD5"):
fixtures.get_sample_short_name_by_md5("0000000000000000000000000000000")
@pytest.mark.parametrize(
"md5, name",
[
("5f66b82558ca92e54e77f216ef4c066c", "mimikatz"),
("e80758cf485db142fca1ee03a34ead05", "kernel32"),
("a8565440629ac87f6fef7d588fe3ff0f", "kernel32-64"),
("db648cd247281954344f1d810c6fd590", "al-khaser x86"),
("3cb21ae76ff3da4b7e02d77ff76e82be", "al-khaser x64"),
],
)
def test_get_sample_short_name_by_md5(md5, name):
"""Test that get_sample_short_name_by_md5 correctly returns sample names"""
assert fixtures.get_sample_short_name_by_md5(md5) == name
def test_get_sample_short_name_by_md5_unknown():
"""Test that get_sample_short_name_by_md5 raises ValueError for an unknown MD5"""
with pytest.raises(ValueError, match="unexpected sample MD5"):
fixtures.get_sample_short_name_by_md5("0000000000000000000000000000000")

Comment on lines 34 to 51
def test_md5_name_lookup_roundtrip():
"""Test that get_sample_md5_by_name and get_sample_short_name_by_md5 are inverses"""
# Test samples with simple names (not hash-based names)
simple_names = ["mimikatz", "kernel32", "kernel32-64", "pma12-04", "pma16-01", "pma01-01", "pma21-01"]

for name in simple_names:
md5 = fixtures.get_sample_md5_by_name(name)
assert fixtures.get_sample_short_name_by_md5(md5) == name

# Test samples with hash-based names that use startswith() logic
hash_based_samples = [
("499c2a85f6e8142c3f48d4251c9c7cd6", "499c2a85f6e8142c3f48d4251c9c7cd6"),
("9324d1a8ae37a36ae560c37448c9705a", "9324d1a8ae37a36ae560c37448c9705a"),
("64d9f7d96b99467f36e22fada623c3bb", "64d9f7d96b99467f36e22fada623c3bb"),
]

for md5_hash, expected_name in hash_based_samples:
assert fixtures.get_sample_short_name_by_md5(fixtures.get_sample_md5_by_name(expected_name)) == expected_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test can be made more readable and maintainable by using pytest.mark.parametrize. This separates the test data from the test logic, making it easier to add new test cases and understand the intent of the test. I've also split it into two separate tests for simple and hash-based names for clarity.

@pytest.mark.parametrize(
    "name",
    ["mimikatz", "kernel32", "kernel32-64", "pma12-04", "pma16-01", "pma01-01", "pma21-01"],
)
def test_md5_name_lookup_roundtrip_simple_names(name):
    """Test that get_sample_md5_by_name and get_sample_short_name_by_md5 are inverses for simple names."""
    md5 = fixtures.get_sample_md5_by_name(name)
    assert fixtures.get_sample_short_name_by_md5(md5) == name


@pytest.mark.parametrize(
    "name",
    [
        "499c2a85f6e8142c3f48d4251c9c7cd6",
        "9324d1a8ae37a36ae560c37448c9705a",
        "64d9f7d96b99467f36e22fada623c3bb",
    ],
)
def test_md5_name_lookup_roundtrip_hash_names(name):
    """Test that get_sample_md5_by_name and get_sample_short_name_by_md5 are inverses for hash-based names."""
    assert fixtures.get_sample_short_name_by_md5(fixtures.get_sample_md5_by_name(name)) == name

kami922 added a commit to kami922/capa that referenced this pull request Jan 2, 2026
…ests

- Refactor get_sample_short_name_by_md5() to use dictionary lookup for O(1) performance
- Add _MD5_TO_SHORT_NAME module-level constant for efficient lookups
- Refactor tests to use pytest.mark.parametrize for better readability
- Split tests into separate functions for different concerns
- Fix CHANGELOG: use @kamranulhaq2002 instead of email
- Update CHANGELOG: use PR mandiant#2802 instead of #TBD
- Fix import order per isort requirements
- Apply black and ruff formatting

Addresses feedback from Gemini Code Assist bot review.
…ests

- Refactor get_sample_short_name_by_md5() to use dictionary lookup for O(1) performance
- Add _MD5_TO_SHORT_NAME module-level constant for efficient lookups
- Refactor tests to use pytest.mark.parametrize for better readability
- Split tests into separate functions for different concerns
- Fix CHANGELOG: use @kamranulhaq2002 instead of email
- Update CHANGELOG: use PR mandiant#2802 instead of #TBD
- Apply code formatting (isort, black, ruff)

Addresses feedback from Gemini Code Assist bot review.
@kami922 kami922 force-pushed the fix-1743-add-md5-to-name-lookup branch from 18c192f to 703ff5d Compare January 2, 2026 16:05
Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please review the associated issue for strategies that we'd like to use when refactoring this code. We don't want to maintain long lists of file names and hashes manually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants