
Conversation

@jstjohn (Collaborator) commented Jan 14, 2026

Description

This PR adds back the prediction and inference scripts, along with the fine-tuning and BRCA1 zero-shot fine-tuning notebooks. It also adds support for batched vortex generation.

Usage

See examples/*.ipynb in the evo2_megatron recipe for usage.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. Use this label to force the full bionemo2 test suite.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). Use this label to force tests for every recipe.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features

    • Full inference engine and CLI for text generation, prediction workflow, and a simple token-generation helper.
    • New training flag to disable FP8 weight gradients.
  • Bug Fixes

    • Corrected multi-batch decoding behavior.
    • Ensured final normalization is always defined and preprocessing/postprocessing flags honor explicit disables.
  • Improvements

    • Tokenizer decoding switched to a Fuse decoder.
    • Enhanced GPU cleanup and robust test utilities; expanded, runnable test suites.


copy-pr-bot bot commented Jan 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jstjohn (Collaborator, Author) commented Jan 15, 2026

/ok to test bbc9e0a

@jstjohn (Collaborator, Author) commented Jan 15, 2026

/ok to test 5ea1971

Signed-off-by: John St. John <[email protected]>
Signed-off-by: John St. John <[email protected]>
Signed-off-by: John St. John <[email protected]>
@jstjohn force-pushed the jstjohn/evo2_predict_and_inference branch from a68fcb5 to d8d55e6 on January 15, 2026 at 19:01
@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py`:
- Around lines 750-771: calculate_sequence_identity can return None, but the code that
builds match_percents and matchperc_print assumes a non-None float. When appending the
result of calculate_sequence_identity(target, gen_seq) inside the loop, coerce None to a
float (e.g. 0.0) or record a separate flag; restrict the matchperc_print formatting
(currently f"{mp:.1f}%") to floats; and update the final mp >= 0.90 * ep assertion to
skip or specially handle None-normalized values. Update every place that assumes a
non-None float: calculate_sequence_identity, match_percents, matchperc_print, and
expected_matchpercents.
- Around lines 779-826: test_batch_generate_mbridge parametrizes checkpoints without
"bf16" (ckpt_name values "evo2/1b-8k:1.0" and "evo2/7b-8k:1.0"), but the test
unconditionally skips any ckpt_name lacking "bf16", making those cases dead. Either
remove the non-bf16 entries from the param list, or change the skip logic around the
"if 'bf16' not in ckpt_name" check to mirror test_batch_generate_coding_sequences
(decide the skip from ckpt_name, fp8, and is_fp8_supported) so that non-bf16 cases are
skipped only when appropriate.
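The None-coercion part of the first fix above can be sketched as follows; normalize_identity and format_identities are hypothetical helper names, not code from the PR:

```python
def normalize_identity(mp):
    """Coerce a None identity (undefined comparison) to 0.0 so that percentage
    formatting and numeric comparisons never raise."""
    return 0.0 if mp is None else float(mp)


def format_identities(match_percents):
    """Format already-normalized identities, mirroring the f"{mp:.1f}%" usage."""
    return [f"{normalize_identity(mp):.1f}%" for mp in match_percents]
```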
♻️ Duplicate comments (1)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (1)

38-45: Duplicate utility function.

This find_free_network_port is identical to the one in test_predict.py and other test files. As mentioned earlier, consider extracting to a shared test utilities module.

🧹 Nitpick comments (10)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (1)

764-767: Consider extracting the complex assertion logic for readability.

The assertion on lines 764-767 is dense and hard to parse. Consider extracting it into a helper function or breaking it into multiple checks with clearer variable names and error messages.

Suggested improvement
def validate_cds_length(cds_len: int | None, original_len: int, prompt_len: int, target_len: int) -> bool:
    """Validate CDS length meets expectations (stop codon not too early, length preserved)."""
    if cds_len is None:
        return True
    min_expected_generated = 96 * 3  # Random chance of stop codon within 96 codons
    generated_len = cds_len - prompt_len
    # Allow short targets OR require meaningful generation beyond random chance
    length_ok = generated_len > min_expected_generated or target_len < min_expected_generated
    preservation_ok = cds_len >= 0.90 * original_len
    return length_ok and preservation_ok

# Then use:
assert all(
    validate_cds_length(pcl, ocl, len(pmpt), len(tgt))
    for pcl, ocl, (pmpt, tgt) in zip(cds_lengths, original_cds_lengths, seq_prompts)
), f"CDS length validation failed: {list(zip(cds_lengths, original_cds_lengths))}"
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (3)

41-47: Consider extracting to shared test utilities.

This function is duplicated in test_predict.py. Consider moving GPU detection utilities to a shared test helper module to reduce duplication and ensure consistent behavior across tests.


297-304: Duplicated utility and socket option ordering.

  1. This function is duplicated in test_evo2.py and test_predict.py. Consider extracting to a shared test utilities module.

  2. SO_REUSEADDR is set after bind(), which has no effect. To allow address reuse, set it before binding:

♻️ Proposed fix
 def find_free_network_port(address: str = "localhost") -> int:
     """Find a free port on localhost for distributed testing."""
     import socket

     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
         s.bind((address, 0))
-        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
         return s.getsockname()[1]

517-523: String replacement for command modification is fragile.

Using .replace() on the command string could cause issues if the values appear multiple times or in unexpected locations. Consider building cmd2 from scratch or using a command builder function:

♻️ Alternative approach
# Instead of string replacement, build the second command explicitly
# or use a dictionary of parameters that can be modified:
def _build_train_cmd(*, result_dir, world_size, tp_size, **kwargs) -> str:
    # ... build command from parameters
    pass

# Then call with modified parameters for cmd2
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py (1)

66-73: Correct port discovery implementation, but consider extracting to shared utility.

The context manager ensures proper socket cleanup, and the implementation is correct. However, this exact function is duplicated across 5 test files (test_infer.py, test_evo2.py, test_predict.py, test_train.py, test_stop_and_go.py). Consider extracting to a shared test utility in conftest.py or a dedicated test utils module.

♻️ Suggested consolidation in conftest.py
# In conftest.py
def find_free_network_port(address: str = "localhost") -> int:
    """Find a free port on localhost for distributed testing."""
    import socket

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((address, 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return s.getsockname()[1]

Then import from conftest.py in all test files that need it.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py (1)

1103-1103: Consider deferring logging configuration to the caller.

Calling logging.basicConfig(level=logging.INFO) inside predict() may override logging configuration set by the caller. Consider moving this to main() or making it conditional.

♻️ Suggested refactor
-    logging.basicConfig(level=logging.INFO)
+    if not logging.getLogger().handlers:
+        logging.basicConfig(level=logging.INFO)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py (2)

41-48: Duplicate utility function - consider extracting to shared module.

find_free_network_port is duplicated across multiple test files (test_predict.py, test_infer.py, test_stop_and_go.py, test_evo2.py). Consider extracting to a shared test utilities module.

Note: The function correctly binds to localhost (not ''), so the past security advisory about binding to all interfaces does not apply here.


519-521: Hardcoded layer count could become stale.

The original_num_layers = 25 is hardcoded for the 1b model. If the model configuration changes, this test will fail with a confusing error. Consider extracting this from the checkpoint configuration or documenting it more prominently.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py (1)

128-144: Edge case: temperature=0 not handled explicitly.

When temperature=0, the division at line 131 would cause a division by zero. However, the condition if temperature > 0 at line 130 prevents this. Consider clarifying behavior when temperature=0 in the docstring.

The current logic treats temperature <= 0 as "skip scaling", which combined with top_k != 1 would still sample. This may not be the intended behavior.

♻️ Suggested clarification
         temperature: Sampling temperature. Higher values (e.g., 1.0) make output
             more random, lower values make it more deterministic. Default is 1.0.
+            Note: When temperature <= 0, temperature scaling is skipped.
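The reviewed edge case can be seen in a toy sampler (illustrative only, not the recipe's actual code): when temperature <= 0, scaling is skipped, so determinism still depends on passing top_k=1.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=0):
    """Toy next-token sampler. Note the edge case: temperature <= 0 only skips
    scaling; combined with top_k != 1 the call still samples stochastically."""
    if temperature > 0:
        logits = [x / temperature for x in logits]
    if top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    if top_k > 1:
        keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
        logits = [x if i in keep else float("-inf") for i, x in enumerate(logits)]
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(probs)
    return random.choices(range(len(logits)), weights=[p / total for p in probs])[0]
```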
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (1)

554-554: Same logging configuration concern as predict.py.

Similar to predict.py, calling logging.basicConfig(level=logging.INFO) inside infer() may override external logging configuration. Consider making it conditional or moving to main().

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e53b726 and d8d55e6.

📒 Files selected for processing (26)
  • bionemo-recipes/recipes/evo2_megatron/examples/.gitignore
  • bionemo-recipes/recipes/evo2_megatron/examples/fine-tuning-tutorial.ipynb
  • bionemo-recipes/recipes/evo2_megatron/examples/zeroshot_brca1.ipynb
  • bionemo-recipes/recipes/evo2_megatron/pyproject.toml
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/evo2_provider.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_block.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_utils.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/recipes/evo2.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/train.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/conftest.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/data/test_tokenizer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_utils.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/common.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_finetune.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_inference.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py
  • bionemo-recipes/recipes/evo2_megatron/tokenizers/nucleotide_fast_tokenizer_256/tokenizer.json
  • bionemo-recipes/recipes/evo2_megatron/tokenizers/nucleotide_fast_tokenizer_512/tokenizer.json
💤 Files with no reviewable changes (2)
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_finetune.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_inference.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursorrules)

**/*.py: Fix Python linter errors immediately using Ruff for linting and formatting (configured with line-length: 119 in pyproject.toml), and verify all auto-fixes are appropriate
Ensure all Python files follow Google-style docstrings (pydocstyle convention)
Follow import sorting configuration as per isort with 2 lines after imports
Use Pyright for type checking as configured in pyproject.toml

Files:

  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_utils.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/common.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_utils.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/data/test_tokenizer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_block.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/recipes/evo2.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/train.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/evo2_provider.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/conftest.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py
{**/*test*.py,**/__init__.py}

📄 CodeRabbit inference engine (.cursorrules)

Ensure test files and __init__.py files respect relaxed linting rules as configured in pyproject.toml

Files:

  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_utils.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/data/test_tokenizer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/conftest.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: jstjohn
Repo: NVIDIA/bionemo-framework PR: 1123
File: sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py:321-325
Timestamp: 2025-09-10T22:16:55.450Z
Learning: In pipeline parallelism for Evo2 prediction, tokens are only required on the last pipeline stage for multi-token prediction scenarios, not for single-token prediction. The hyena_predict_data_step function in sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py is designed for single-token prediction where tokens aren't needed on the final stage.
📚 Learning: 2025-09-10T22:16:55.450Z
Learnt from: jstjohn
Repo: NVIDIA/bionemo-framework PR: 1123
File: sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py:321-325
Timestamp: 2025-09-10T22:16:55.450Z
Learning: In pipeline parallelism for Evo2 prediction, tokens are only required on the last pipeline stage for multi-token prediction scenarios, not for single-token prediction. The hyena_predict_data_step function in sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py is designed for single-token prediction where tokens aren't needed on the final stage.

Applied to files:

  • bionemo-recipes/recipes/evo2_megatron/pyproject.toml
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/common.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
📚 Learning: 2025-08-29T05:42:21.618Z
Learnt from: nvmvle
Repo: NVIDIA/bionemo-framework PR: 1028
File: ci/benchmarks/partial-conv/evo2_finetuning.yaml:56-56
Timestamp: 2025-08-29T05:42:21.618Z
Learning: In BioNeMo framework CI/CD benchmarking YAML configurations, some parameters like `precision` may be required for pipeline validation checks even if they're not directly used by the training scripts themselves.

Applied to files:

  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/recipes/evo2.py
📚 Learning: 2025-11-10T23:19:53.726Z
Learnt from: ohadmo
Repo: NVIDIA/bionemo-framework PR: 1139
File: bionemo-recipes/models/geneformer/tests/test_checkpoints_modeling_bert.py:306-353
Timestamp: 2025-11-10T23:19:53.726Z
Learning: In `models/geneformer/tests/test_checkpoints_modeling_bert.py`, the test code intentionally accesses the `.transform` attribute of `_unpack_qkv_weight` and `_unpack_qkv_bias` functions to bypass the `io.state_transform` decorator. This allows the test to use the underlying tensor transformation functions directly without requiring a full NeMo `TransformCTX` with complete source-target state dictionaries.

Applied to files:

  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
🧬 Code graph analysis (5)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/data/test_tokenizer.py (1)
bionemo-recipes/recipes/llama3_native_te/tests/conftest.py (1)
  • tokenizer_path (33-35)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py (4)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (1)
  • find_free_network_port (38-45)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (1)
  • find_free_network_port (64-71)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py (1)
  • find_free_network_port (41-48)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (1)
  • find_free_network_port (297-304)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (2)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (1)
  • find_free_network_port (64-71)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/utils/checkpoint/nemo2_to_mbridge.py (2)
  • nemo2_to_mbridge (152-214)
  • run_nemo2_to_mbridge (217-247)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py (4)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py (2)
  • predict (1035-1386)
  • batch_collator (184-291)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (2)
  • find_free_network_port (64-71)
  • check_fp8_support (181-193)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (1)
  • find_free_network_port (297-304)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/utils/checkpoint/nemo2_to_mbridge.py (2)
  • nemo2_to_mbridge (152-214)
  • run_nemo2_to_mbridge (217-247)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (4)
bionemo-recipes/recipes/codonfm_ptl_te/src/data/mutation_dataset.py (1)
  • copy (163-164)
bionemo-recipes/models/amplify/tests/test_encoder_block.py (1)
  • data (50-54)
sub-packages/bionemo-noodles/src/bionemo/noodles/nvfaidx.py (2)
  • items (235-237)
  • keys (228-229)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (1)
  • find_free_network_port (64-71)
🔇 Additional comments (58)
bionemo-recipes/recipes/evo2_megatron/examples/.gitignore (1)

11-11: LGTM!

The addition of evo2_1b_bf16_mbridge/ to the .gitignore is appropriate. It correctly follows the pattern of other directory entries and properly excludes notebook-generated model artifacts from version control.

bionemo-recipes/recipes/evo2_megatron/tokenizers/nucleotide_fast_tokenizer_256/tokenizer.json (1)

139-141: Appropriate decoder configuration for character-level tokenization.

The Fuse decoder correctly concatenates tokens without inserting separators, which is essential for character-level DNA sequence tokenizers where roundtrip consistency (encode → decode = original) is required.
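A toy model of the behavior (plain Python, not the HuggingFace tokenizers API): with a character-level vocabulary, Fuse-style decoding is bare concatenation, which is exactly what makes the encode → decode roundtrip the identity.

```python
# Toy character-level tokenizer with a Fuse-style decoder.
VOCAB = {c: i for i, c in enumerate("ACGTN|")}
IDS = {i: c for c, i in VOCAB.items()}


def encode(seq: str) -> list[int]:
    return [VOCAB[c] for c in seq]


def decode(ids: list[int]) -> str:
    # Fuse behavior: concatenate with no separator, so decode(encode(s)) == s
    return "".join(IDS[i] for i in ids)
```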

bionemo-recipes/recipes/evo2_megatron/tokenizers/nucleotide_fast_tokenizer_512/tokenizer.json (1)

112-114: Consistent decoder update across tokenizer variants.

This change mirrors the 256-vocab tokenizer update, ensuring both tokenizer variants behave consistently for detokenization.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/data/test_tokenizer.py (1)

59-95: Well-structured regression test for tokenizer roundtrip consistency.

This test properly validates that the Fuse decoder configuration works correctly for character-level DNA tokenization. The test cases cover:

  • Basic DNA sequences
  • Longer sequences with all nucleotides including ambiguous bases (N)
  • Special characters used in pipe-delimited tags

This ensures the tokenizer changes in the JSON files don't regress.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_utils.py (1)

547-562: Well-structured test with proper module state cleanup.

The try/finally pattern correctly ensures the module is restored to a valid state after the test, regardless of whether the pytest.raises assertion succeeds or fails. Key improvements:

  1. Importing the module before entering the try block establishes a valid reference
  2. The finally block executes after the patch.dict context has exited, so sys.modules is restored and the reload succeeds
  3. The comment clearly documents why this cleanup is critical for test isolation

This prevents the corrupted module state from leaking into subsequent tests (like test_infer.py).

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (4)

64-71: LGTM!

The socket-based port discovery implementation is correct and uses the standard pattern for finding available ports for distributed testing.


376-386: LGTM!

The fixture follows the same pattern as the existing sequences fixture and appropriately skips when the test data file is missing.


616-636: LGTM!

Both helper functions are well-implemented:

  • mid_point_split correctly handles optional token limits with keyword-only arguments
  • calculate_sequence_identity handles edge cases (empty sequences) and correctly computes identity over the comparable length
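A plausible implementation matching the described behavior (the exact signature and the percentage scale are assumptions, not the test file's code):

```python
def calculate_sequence_identity(a: str, b: str):
    """Percent identity over the comparable (shorter) length; None when there
    is nothing to compare (either sequence empty)."""
    n = min(len(a), len(b))
    if n == 0:
        return None
    matches = sum(x == y for x, y in zip(a[:n], b[:n]))
    return 100.0 * matches / n
```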

871-886: LGTM!

Good handling of the None case from calculate_sequence_identity - only appending valid percentages. The assertion at line 882 will catch any missing results, though consider adding a diagnostic message if lengths don't match.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (10)

19-38: LGTM!

Imports are well-organized and appropriate for the test module's functionality. The TensorLike type alias provides clear intent for functions accepting both single tensors and iterables.


50-62: LGTM!

Clean implementation of tensor utilities with proper handling of sharded tensors and edge cases.


65-148: LGTM!

Well-documented gradient checking utilities with proper references to the theoretical foundation. The FP8 handling using BF16 epsilon is appropriate per the cited paper.


151-224: LGTM!

Solid implementation of optimizer state comparison with good error collection and diagnostics. The shifted tensor check on line 167 is a clever debugging aid.

Minor note: The print() calls on lines 212-215 and 220 could use pytest warnings or a logging framework for better integration with test runners, but this is acceptable for test code.


227-262: LGTM!

Clean checkpoint loading implementation using PyTorch's distributed checkpoint APIs. The two-pass approach (metadata read, then selective tensor load) is efficient.


265-294: LGTM!

Well-structured multi-checkpoint comparison with proper memory cleanup and comprehensive error reporting.


307-336: LGTM!

Compute capability checks are well-documented and correct for current GPU architectures. The upper bound < (12, 0) for FP4/MXFP8 may need adjustment when future architectures beyond Blackwell are released, but this is acceptable for now.
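The documented gates could look like the following, with capability tuples as returned by torch.cuda.get_device_capability(); the thresholds restate the comment above and should be treated as assumptions about the test's intent:

```python
def fp8_supported(capability: tuple[int, int]) -> bool:
    """FP8 assumed available from Ada (8.9) and Hopper (9.0) onward."""
    return capability >= (8, 9)


def mxfp8_supported(capability: tuple[int, int]) -> bool:
    """FP4/MXFP8 assumed Blackwell-only: >= (10, 0) with the noted < (12, 0)
    upper bound that may need revisiting for future architectures."""
    return (10, 0) <= capability < (12, 0)
```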


339-387: LGTM!

Clean command construction with proper environment isolation via PRETEST_ENV. The micro-batch size adjustment ensures consistent global batch size across different data parallel configurations.


578-602: LGTM!

Well-designed session-scoped fixture that efficiently creates a reusable base checkpoint for gradient equivalence tests.


605-663: LGTM!

Well-structured gradient equivalence test with appropriate skip markers for known issues. The parameterization covers the main parallelism strategies (DP, CP with active tests; TP, PP marked for future debugging).

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/recipes/evo2.py (2)

32-32: LGTM!

The import of get_mixed_precision_config alongside MixedPrecisionConfig is appropriate for the new string-based precision config support.


169-170: LGTM!

Clean handling of string-based precision config conversion. The isinstance check properly guards the conversion, and None values pass through unchanged to be handled downstream.

bionemo-recipes/recipes/evo2_megatron/pyproject.toml (1)

38-40: LGTM!

The CLI entry points for infer_evo2 and predict_evo2 are properly enabled, following the same pattern as existing entry points like train_evo2 and preprocess_evo2.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/common.py (1)

64-71: LGTM!

The simplified predict_cmd correctly removes model architecture arguments (model-size, num-layers, hybrid-override-pattern) since the MBridge-format checkpoint now provides this configuration via run_config.yaml. The updated docstring clearly documents this requirement.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_block.py (1)

202-204: LGTM!

Good defensive fix. This ensures final_norm is always defined as an attribute, preventing AttributeError when post_process=False. The forward method at line 525 already properly handles the None case with if self.final_norm is not None.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py (3)

44-51: LGTM with minor note.

The dynamic port allocation avoids conflicts in parallel test runs. Note that there's an inherent TOCTOU race between releasing the socket and using the port, but this is a standard pattern for test infrastructure and acceptable in practice.

Consider: SO_REUSEADDR is set after bind(), which doesn't affect the port discovery here but is typically set before binding. This doesn't impact correctness in this use case.


66-66: LGTM!

Using dynamic port allocation prevents port conflicts when multiple test processes run concurrently.


401-401: LGTM!

The relaxed tolerances (rtol=0.02, atol=2e-4) are appropriate when comparing PyTorch reference implementation against CUDA kernel outputs, as different accumulation orders and optimizations can introduce small numerical differences.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_utils.py (1)

1008-1026: LGTM! Correct batch-aware decode path implementation.

The reshape from [b, d, l] to [(b*l), d] correctly handles arbitrary batch sizes during the decode path where l=1. The subsequent reshape "b d -> b 1 d" properly restores the sequence dimension before the final transpose back to [b, d, l] format.
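The reshape round trip can be checked in NumPy (mirroring the einops patterns quoted above; helper names are illustrative):

```python
import numpy as np


def to_flat(x: np.ndarray) -> np.ndarray:
    """[b, d, l] -> [(b*l), d], matching rearrange(x, "b d l -> (b l) d")."""
    b, d, l = x.shape
    return x.transpose(0, 2, 1).reshape(b * l, d)


def from_flat(flat: np.ndarray, b: int, l: int) -> np.ndarray:
    """[(b*l), d] -> [b, d, l]: the "b d -> b 1 d" step generalized to any l,
    then transposed back to channel-major layout."""
    d = flat.shape[1]
    return flat.reshape(b, l, d).transpose(0, 2, 1)
```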

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/evo2_provider.py (1)

514-527: LGTM! Correct handling of explicit pre/post_process overrides.

The conditional logic correctly prioritizes explicit False values from the provider configuration for embedding extraction scenarios, while falling back to caller-provided values or pipeline-stage defaults otherwise. This ensures embedding extraction workflows can disable post-processing even when the megatron bridge passes pipeline-aware values.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/conftest.py (4)

68-90: Robust child process cleanup implementation.

The use of SIGTERM is appropriate as it allows graceful termination. Consider adding a brief wait and follow-up SIGKILL for stubborn processes if this becomes an issue, but for most test scenarios, SIGTERM should suffice.


93-116: Thorough GPU cleanup with proper synchronization.

The cleanup sequence (sync → cache clear → gc × 3 → sync → cache clear → sleep) is well-designed to ensure all GPU operations complete and memory is released. The 0.1s sleep is a reasonable delay for GPU memory deallocation.
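A minimal sketch of that sequence, with the torch calls guarded so it degrades to plain garbage collection when CUDA is unavailable (illustrative, not the actual conftest code):

```python
import gc
import time


def cleanup_gpu_memory(sleep_s: float = 0.1) -> int:
    """Best-effort cleanup: sync, clear CUDA cache, gc x 3, sync, clear, sleep.
    Returns the total number of objects collected by gc."""
    try:
        import torch
        have_cuda = torch.cuda.is_available()
    except ImportError:
        have_cuda = False
    if have_cuda:
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
    freed = 0
    for _ in range(3):
        freed += gc.collect()
    if have_cuda:
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
    time.sleep(sleep_s)  # brief pause for asynchronous deallocation
    return freed
```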


119-134: Appropriate random seed reset for test isolation.

Using None to reset seeds to system entropy ensures tests don't inherit state from previous tests. The # noqa: NPY002 comment correctly acknowledges the intentional use of the legacy NumPy API to reset global state.


137-152: Well-structured test cleanup fixture.

The fixture correctly resets random seeds before each test for reproducibility and performs thorough cleanup afterward. The sequence ensures both test isolation and resource cleanup.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/train.py (3)

239-239: LGTM! New CLI flag for FP8 weight gradient control.

The --no-fp8-wgrad flag follows the standard CLI pattern for disabling default-on features. Clear and consistent with other --no-* flags in this file.


602-607: Appropriate cleanup of unimplemented CLI options.

Commenting out the LoRA-related flags with TODO markers prevents users from attempting to use unimplemented features while preserving the intended API for future implementation.


769-770: Correct wiring of fp8_wgrad configuration.

The double-negative pattern (not args.no_fp8_wgrad) correctly translates the CLI flag to the configuration, enabling FP8 weight gradients by default and disabling when --no-fp8-wgrad is specified. The fp8_wgrad attribute is a valid configuration parameter in the megatron mixed precision settings.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py (7)

109-176: LGTM: Checkpoint resolution logic is well-structured.

The resolve_checkpoint_path function handles both direct checkpoint paths and training output directories with iter_* subdirectories cleanly. The error handling is comprehensive with appropriate exceptions for missing files and invalid directories.


184-292: LGTM: Batch collation with structural pattern matching.

The recursive batch_collator function elegantly handles nested structures using Python's structural pattern matching. The default key overrides for backward compatibility (token_logits) are well-documented.


344-466: LGTM: Distributed initialization for inference.

The function properly initializes distributed environment with appropriate guards for re-initialization. The seed offsetting by pipeline and data parallel ranks ensures reproducibility across different parallel configurations.


473-546: LGTM: Context parallel gathering with zigzag unshuffling.

The _gather_along_cp_dim function correctly implements the zigzag pattern restoration. The inverse mapping logic properly reconstructs original sequence order from the gathered chunks.
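The zigzag layout can be sketched as follows (rank r holds chunks r and 2*cp-1-r of the sequence; an illustration of the pattern, not the repo's exact helper):

```python
import torch

def zigzag_shard(x: torch.Tensor, cp: int) -> list[torch.Tensor]:
    # split the sequence (dim 0) into 2*cp chunks; rank r gets chunks (r, 2*cp-1-r)
    chunks = x.chunk(2 * cp, dim=0)
    return [torch.cat([chunks[r], chunks[2 * cp - 1 - r]], dim=0) for r in range(cp)]

def zigzag_unshard(shards: list[torch.Tensor], cp: int) -> torch.Tensor:
    # gathered chunk order is [0, 2cp-1, 1, 2cp-2, ...]; build the inverse mapping
    halves = [h for s in shards for h in s.chunk(2, dim=0)]
    order: list[torch.Tensor] = [None] * (2 * cp)  # type: ignore[list-item]
    for r in range(cp):
        order[r] = halves[2 * r]
        order[2 * cp - 1 - r] = halves[2 * r + 1]
    return torch.cat(order, dim=0)

x = torch.arange(16).float()
roundtrip = zigzag_unshard(zigzag_shard(x, cp=2), cp=2)
assert torch.equal(roundtrip, x)
```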


778-862: LGTM: Prediction step with proper TP/CP handling.

The _predict_step function correctly distinguishes between embedding extraction (hidden states not sharded) and logit computation (vocab dimension sharded across TP). The gathering operations are applied in the correct order: TP first, then CP.


1165-1207: LGTM: Embedding layer extraction with comprehensive validation.

The embedding layer handling correctly:

  • Supports Python-style negative indexing
  • Truncates hybrid_override_pattern to match reduced layer count
  • Disables remove_activation_post_first_layer for single-layer extraction
  • Rejects incompatible --output-log-prob-seqs option

1316-1358: LGTM: Prediction loop with proper batch handling.

The prediction loop correctly:

  • Handles context parallel slicing while preserving seq_idx
  • Moves predictions to CPU for memory efficiency
  • Clears predictions list after batch writes to avoid accumulation
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py (5)

75-110: LGTM: Module-scoped fixture for checkpoint conversion.

The fixture properly handles missing data sources with informative skip messages and uses tmp_path_factory for module-scoped temporary directories.


128-239: LGTM: Comprehensive prediction run test with parallelism coverage.

The test properly validates:

  • Output file naming conventions for both batch and epoch modes
  • Prediction tensor shapes and vocabulary size
  • Sequence index mapping alignment with input sequences

The A6000-specific NCCL fix demonstrates good awareness of hardware-specific issues.


466-476: Tolerance values should be investigated further.

The FIXME comment indicates that changing batch sizes requires dropping tolerance from rel=1e-6 to rel=1e-3, and TP=2 needs even more (rel=2e-3). The FP8 tolerance of rel=1e-2 is also hand-tuned. Consider documenting the root cause or tracking this as a technical debt item.

The tolerances work, but the underlying numerical differences may point to precision-handling behavior that should be better understood.


574-595: LGTM: Log parsing for layer count verification.

The regex-based verification of model layer initialization logs is a pragmatic approach to validate the embedding extraction behavior without exposing internal model state.


643-721: LGTM: Error case tests for embedding extraction.

Both validation tests properly verify that:

  1. Invalid embedding layer indices produce clear error messages
  2. Incompatible --embedding-layer and --output-log-prob-seqs options are rejected
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer_example_simple.py (2)

48-56: LGTM: Clear module docstring and function signature.

The function is well-documented as a low-level alternative for debugging and custom workflows, with appropriate disclaimers about preferring the MCore-based inference for production.


103-166: LGTM: Autoregressive generation with proper context management.

The function correctly:

  • Starts in prefill mode and processes the full prompt
  • Switches to decode mode after prefill
  • Increments sequence_len_offset after each step
  • Handles both greedy (top_k=1) and sampling modes
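The greedy-vs-sampling branch typically reduces to something like this (an illustrative sketch, not the script's exact sampling code):

```python
import torch

def sample_next(logits: torch.Tensor, top_k: int, temperature: float = 1.0) -> torch.Tensor:
    """Greedy decode when top_k == 1, otherwise temperature-scaled top-k sampling."""
    if top_k == 1:
        return logits.argmax(dim=-1)
    vals, idx = torch.topk(logits / temperature, k=top_k, dim=-1)
    probs = torch.softmax(vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return idx.gather(-1, choice).squeeze(-1)

logits = torch.tensor([[0.1, 2.0, -1.0]])
greedy_token = sample_next(logits, top_k=1)
assert greedy_token.item() == 1
```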
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (3)

83-130: LGTM: Inference test with proper environment setup.

The test correctly:

  • Sets MASTER_ADDR and MASTER_PORT for distributed execution
  • Uses a longer prompt (64 chars) for FP8 compatibility
  • Validates both command success and output file creation

272-323: LGTM: Unit tests for HyenaInferenceContext.

The tests provide good coverage of the context initialization, reset behavior, and configuration options. The dynamic attribute addition in test_context_reset effectively tests the filter state cleanup mechanism.


213-270: LGTM: Phylogenetic prompt test validates Evo2-specific functionality.

Testing with phylogenetic lineage prompts is important for Evo2 since the model is trained with these tags. The test demonstrates proper generation with domain-specific input.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (5)

91-197: LGTM: Evo2 model inference wrapper implementation.

The Evo2ModelInferenceWrapper correctly:

  • Prepares inference input with sequential position IDs
  • Handles context window slicing for autoregressive generation
  • Passes HyenaInferenceContext to the model's forward pass
  • Uses runtime_gather_output=True for proper tensor gathering

224-398: LGTM: Comprehensive inference engine setup.

The setup_inference_engine function properly orchestrates:

  1. Checkpoint resolution and config loading
  2. Parallelism and precision configuration
  3. Tokenizer loading with fallback
  4. Distributed environment initialization
  5. Model creation, weight loading, and Float16Module wrapping
  6. MCore inference infrastructure setup (wrapper, controller, engine)

400-449: LGTM: Generate function with proper context lifecycle.

The function correctly resets the inference context both before and after generation to ensure clean state. The SamplingParams construction handles edge cases (e.g., max(0, top_k)).


515-602: LGTM: CLI inference workflow with proper cleanup.

The infer function properly:

  • Sets up inference components
  • Generates text from the prompt
  • Outputs to stdout and optionally to file
  • Cleans up distributed process group

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@jstjohn
Copy link
Collaborator Author

jstjohn commented Jan 15, 2026

/ok to test fce2a7b

@jstjohn
Copy link
Collaborator Author

jstjohn commented Jan 15, 2026

@coderabbitai review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py`:
- Around line 627-633: Update the incorrect comment that says "The 1b model has
28 layers" to reflect the correct layer count (25) so it matches the earlier
comment and the actual model; locate the comment above the command construction
in the test (near the command string/variable names such as command,
mbridge_checkpoint_1b_8k_bf16_path, and the --embedding-layer 100 flag) and
change the text to "The 1b model has 25 layers" (or similar) to correct the
documentation in test_predict.py.
🧹 Nitpick comments (4)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py (2)

162-169: Unparseable iteration directories may cause unexpected behavior.

If all iter_* directories have unparseable names (all return -1), max() will arbitrarily select one. Consider filtering out directories that fail parsing or raising an error when no valid iterations are found.

♻️ Suggested improvement
     def _parse_iter_num(item: tuple[str, Path]) -> int:
         try:
             return int(item[0].replace("iter_", ""))
         except ValueError:
             return -1

-    _, latest_iter_path = max(iter_dirs, key=_parse_iter_num)
+    # Filter out unparseable directories
+    valid_iter_dirs = [(name, path) for name, path in iter_dirs if _parse_iter_num((name, path)) >= 0]
+    if not valid_iter_dirs:
+        raise FileNotFoundError(
+            f"No valid iter_XXXXXXX subdirectories found at '{checkpoint_path}'. "
+            "Expected directories named iter_<number>."
+        )
+    _, latest_iter_path = max(valid_iter_dirs, key=_parse_iter_num)

974-983: Inconsistent use of effective_files in subdirectory calculation.

Line 977 computes effective_files = num_files_written * data_parallel_world_size, but line 980 recomputes this expression instead of using effective_files. This is a minor code duplication that could cause maintenance issues.

♻️ Suggested fix
     if files_per_subdir is not None:
         # Calculate how many subdirs we've created based on total files written
         # (counting all DP ranks)
         effective_files = num_files_written * data_parallel_world_size
         if effective_files >= files_per_subdir:
             # Need a new subdirectory
-            num_subdirs_written = (num_files_written * data_parallel_world_size) // files_per_subdir + 1
+            num_subdirs_written = effective_files // files_per_subdir + 1
             current_output_dir = output_dir / f"subdir_{num_subdirs_written}"
             current_output_dir.mkdir(parents=True, exist_ok=True)
             num_files_written = 0
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (1)

556-565: Consider exposing max_batch_size parameter in infer() function.

The infer() function doesn't expose max_batch_size, defaulting to 1. While this works for single-prompt CLI usage, users calling infer() programmatically with multiple prompts would need to use setup_inference_engine() + generate() directly. Consider adding the parameter for consistency.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (1)

144-146: Consider including second moment in optimizer comparison.

The filter only includes .exp_avg. tensors (first moment). For Adam-style optimizers, you may also want to compare .exp_avg_sq. (second moment) for more comprehensive validation, unless there's a specific reason to exclude it.

Optional: Include second moment in comparison
 def _filter_optimizer_tensors(plain_tensors: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
     """Return only optimizer-related tensors from a flat checkpoint tensor dict."""
-    return {k: v for k, v in plain_tensors.items() if k.startswith("optimizer.") and ".exp_avg." in k}
+    return {
+        k: v for k, v in plain_tensors.items()
+        if k.startswith("optimizer.") and (".exp_avg." in k or ".exp_avg_sq." in k)
+    }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8d55e6 and fce2a7b.

📒 Files selected for processing (11)
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/__init__.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/__init__.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py
✅ Files skipped from review due to trivial changes (2)
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/__init__.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/__init__.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursorrules)

**/*.py: Fix Python linter errors immediately using Ruff for linting and formatting (configured with line-length: 119 in pyproject.toml), and verify all auto-fixes are appropriate
Ensure all Python files follow Google-style docstrings (pydocstyle convention)
Follow import sorting configuration as per isort with 2 lines after imports
Use Pyright for type checking as configured in pyproject.toml

Files:

  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py
{**/*test*.py,**/__init__.py}

📄 CodeRabbit inference engine (.cursorrules)

Ensure test files and __init__.py files respect relaxed linting rules as configured in pyproject.toml

Files:

  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: jstjohn
Repo: NVIDIA/bionemo-framework PR: 1123
File: sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py:321-325
Timestamp: 2025-09-10T22:16:55.450Z
Learning: In pipeline parallelism for Evo2 prediction, tokens are only required on the last pipeline stage for multi-token prediction scenarios, not for single-token prediction. The hyena_predict_data_step function in sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py is designed for single-token prediction where tokens aren't needed on the final stage.
📚 Learning: 2025-09-10T22:16:55.450Z
Learnt from: jstjohn
Repo: NVIDIA/bionemo-framework PR: 1123
File: sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py:321-325
Timestamp: 2025-09-10T22:16:55.450Z
Learning: In pipeline parallelism for Evo2 prediction, tokens are only required on the last pipeline stage for multi-token prediction scenarios, not for single-token prediction. The hyena_predict_data_step function in sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py is designed for single-token prediction where tokens aren't needed on the final stage.

Applied to files:

  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py
📚 Learning: 2025-11-10T23:19:53.726Z
Learnt from: ohadmo
Repo: NVIDIA/bionemo-framework PR: 1139
File: bionemo-recipes/models/geneformer/tests/test_checkpoints_modeling_bert.py:306-353
Timestamp: 2025-11-10T23:19:53.726Z
Learning: In `models/geneformer/tests/test_checkpoints_modeling_bert.py`, the test code intentionally accesses the `.transform` attribute of `_unpack_qkv_weight` and `_unpack_qkv_bias` functions to bypass the `io.state_transform` decorator. This allows the test to use the underlying tensor transformation functions directly without requiring a full NeMo `TransformCTX` with complete source-target state dictionaries.

Applied to files:

  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py
🧬 Code graph analysis (6)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py (1)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py (1)
  • find_free_network_port (23-28)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (3)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py (1)
  • find_free_network_port (23-28)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/utils/checkpoint/nemo2_to_mbridge.py (2)
  • nemo2_to_mbridge (152-214)
  • run_nemo2_to_mbridge (217-247)
bionemo-recipes/recipes/llama3_native_te/tests/conftest.py (1)
  • tokenizer_path (33-35)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (1)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py (4)
  • find_free_network_port (23-28)
  • is_a6000_gpu (84-90)
  • is_fp8_supported (39-45)
  • is_mxfp8_supported (57-63)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (2)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py (2)
  • check_fp8_support (66-81)
  • find_free_network_port (23-28)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/utils/checkpoint/nemo2_to_mbridge.py (1)
  • run_nemo2_to_mbridge (217-247)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (2)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py (5)
  • predict (1036-1385)
  • initialize_inference_distributed (345-466)
  • resolve_checkpoint_path (110-177)
  • parse_args (554-686)
  • main (1393-1418)
sub-packages/bionemo-scdl/src/bionemo/scdl/schema/header.py (1)
  • default (741-746)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py (1)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py (4)
  • find_free_network_port (23-28)
  • is_fp4_supported (48-54)
  • is_fp8_supported (39-45)
  • is_mxfp8_supported (57-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: pre-commit
🔇 Additional comments (42)
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/predict.py (10)

19-58: Well-documented module docstring.

The module docstring provides clear usage examples for CLI invocation with various parallelism configurations and documents the output format conventions. This is helpful for users.


237-293: Clean use of structural pattern matching for batch collation.

The recursive handling of nested structures (dicts, lists, tuples) using match/case is elegant and maintainable. The default key overrides provide backward compatibility.


295-338: Padding logic is correctly implemented.

The pad_spec calculation for arbitrary tensor dimensions is correct. The formula 2 * (tensor.ndim - 1 - seq_dim) + 1 properly targets the right side of the sequence dimension in PyTorch's pad format.
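The index arithmetic follows from `F.pad` taking (left, right) pairs ordered from the last dimension backwards; a small check of the formula:

```python
import torch
import torch.nn.functional as F

def pad_right_of_seq_dim(tensor: torch.Tensor, seq_dim: int, pad_len: int) -> torch.Tensor:
    # F.pad's flat spec is (last-dim left, last-dim right, next dim, ...), so the
    # right-side slot for `seq_dim` sits at index 2 * (ndim - 1 - seq_dim) + 1.
    pad_spec = [0] * (2 * tensor.ndim)
    pad_spec[2 * (tensor.ndim - 1 - seq_dim) + 1] = pad_len
    return F.pad(tensor, pad_spec)

x = torch.ones(2, 5, 4)                       # [batch, seq, hidden]
y = pad_right_of_seq_dim(x, seq_dim=1, pad_len=3)
assert y.shape == (2, 8, 4)                   # only the sequence dim grew
assert y[:, 5:, :].abs().sum() == 0           # padding is zeros, on the right
```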


345-467: Distributed initialization is comprehensive and follows Megatron patterns.

The initialization sequence correctly sets up process groups, model parallelism, and RNG seeds. The use of get_*_safe() functions for pre-initialization rank detection is appropriate.


474-547: Context parallel gathering with zigzag unshuffle is well-implemented.

The zigzag pattern handling correctly reconstructs the original sequence order after gathering from CP ranks. The docstring clearly explains the pattern with an example.


779-863: Prediction step correctly handles embedding vs logits output modes.

The function properly distinguishes between embedding extraction (hidden states not sharded across TP) and logits output (vocab dimension sharded). The transpose for embeddings from [S, B, H] to [B, S, H] maintains consistency with the batch-first convention.


865-924: Log probability computation is correct with proper shift alignment.

The logit/target alignment (logits[:, :-1] with tokens[:, 1:]) correctly computes next-token probabilities. The torch.clamp(min=1.0) prevents division by zero when computing mean log probs.
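The shift-and-clamp pattern in miniature (illustrative shapes; the script also handles distributed gathering):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 6, 11)        # [batch, seq, vocab]
tokens = torch.randint(0, 11, (2, 6))
mask = torch.ones(2, 6)

# logits at position i predict token i+1, so drop the last logit / first token
log_probs = F.log_softmax(logits[:, :-1].float(), dim=-1)
targets = tokens[:, 1:]
token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [batch, seq-1]

# masked mean; clamp the denominator so an all-masked row can't divide by zero
m = mask[:, 1:]
mean_lp = (token_lp * m).sum(-1) / torch.clamp(m.sum(-1), min=1.0)
assert mean_lp.shape == (2,)
```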


1164-1207: Embedding layer extraction with comprehensive validation.

The embedding layer handling correctly:

  1. Supports Python-style negative indexing
  2. Validates bounds against original layer count
  3. Truncates hybrid_override_pattern to match
  4. Disables incompatible settings for single-layer extraction
  5. Raises clear error when combined with output_log_prob_seqs

This is a well-designed feature with proper edge case handling.


1375-1386: Proper distributed cleanup with barrier synchronization.

The cleanup sequence ensures all ranks complete writing before destroying the process group. The write_idx_map call is appropriately placed after prediction completion.


1393-1419: Several CLI arguments are not passed to the predict() function.

The following arguments are defined in parse_args() but not forwarded to predict():

  • --hybrid-override-pattern
  • --num-layers
  • --seq-len-interpolation-factor
  • --eden-tokenizer
  • --mask-phylogenetic-tags
  • --vortex-style-fp8

If these are intended for future use, consider adding a comment. Otherwise, they should be wired through or removed to avoid user confusion.

bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py (5)

19-40: Clear module docstring with usage examples.

The docstring provides both CLI and Python API usage examples, making it easy for users to get started with text generation.


91-197: Well-implemented inference wrapper for Evo2.

The wrapper correctly:

  1. Creates sequential position IDs appropriate for Hyena models
  2. Uses attention_mask=None for flash attention backend
  3. Handles context window slicing for autoregressive generation
  4. Passes inference_context and runtime_gather_output=True to the model

362-389: Inference context and engine setup is correct.

Setting materialize_only_last_token_logits = False ensures full sequence logits are available for sampling. The legacy=True mode for StaticInferenceEngine is appropriate for the current integration.


429-448: Context reset before and after generation ensures clean state.

The double reset pattern (lines 430 and 446) prevents state leakage between generation calls. This is important for maintaining consistent behavior across batches.


582-599: Proper output handling and cleanup.

The function correctly:

  1. Only outputs on data parallel rank 0
  2. Creates parent directories for output file
  3. Synchronizes with barrier before destroying process group
  4. Returns generated text for programmatic use
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (3)

40-72: Fixture properly handles checkpoint creation with skip for missing data.

The mbridge_checkpoint_path fixture correctly handles the case where NGC data is unavailable by using pytest.skip() with a helpful message about re-running with BIONEMO_DATA_SOURCE=pbss.


75-122: Comprehensive end-to-end test with appropriate assertions.

The test correctly:

  1. Uses FP8-compatible prompt length
  2. Sets distributed environment variables
  3. Uses dynamic port allocation to avoid conflicts
  4. Validates exit code, file creation, and non-empty output

264-315: Good unit test coverage for HyenaInferenceContext.

The test class covers:

  1. Basic initialization and attribute access
  2. Reset behavior with simulated filter state
  3. Configuration toggle for materialize_only_last_token_logits
  4. Various batch sizes and sequence lengths

This provides confidence in the context's core functionality.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_predict.py (5)

39-41: Pre-test environment capture is a good pattern.

Capturing PRETEST_ENV = copy.deepcopy(os.environ) at module import time ensures tests have a clean baseline environment, preventing pollution from previous tests.


80-207: Comprehensive prediction test with good parallelism coverage.

The test properly:

  1. Skips when insufficient GPUs are available
  2. Handles A6000-specific NCCL issues
  3. Validates file naming conventions for both batch and epoch modes
  4. Uses batch_collator to combine multi-rank predictions
  5. Validates sequence lengths against pad masks

The PP=2 skip is appropriately documented.


243-297: Baseline fixture provides deterministic reference for comparison tests.

The module-scoped fixture generates predictions without parallelism, providing a stable baseline for verifying that different parallelism configurations produce equivalent results.


434-444: Tolerance levels for log probability comparison are well-justified.

The differentiated tolerance levels (1e-6 for baseline, 2e-3 for MP>1, 1e-2 for FP8) appropriately account for numerical differences from parallelism and reduced precision. The FIXME comment at line 436 about batch size affecting tolerance should be tracked.

Consider creating an issue to investigate why changing batch size requires relaxed tolerance from 1e-6 to 1e-3 (line 436-438).


456-609: Thorough embedding extraction test coverage.

The tests validate:

  1. Correct layer count initialization for various embedding_layer values
  2. Log output parsing for layer configuration
  3. Output key presence (hidden_embeddings vs token_logits)
  4. Shape validation including hidden dimension (1920 for 1b)
  5. Error handling for invalid layers and incompatible options

This provides strong confidence in the embedding extraction feature.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_stop_and_go.py (1)

29-30: Good refactoring to centralize test utilities.

The import of shared utility functions from the centralized utils.py module improves maintainability and reduces code duplication across test files.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/utils.py (3)

23-28: Port allocation implementation looks correct.

The socket is properly used as a context manager. Note that SO_REUSEADDR is set after binding, which is fine here since you're only using the socket to discover an available port. There's a small TOCTOU window where the port could be taken between when this returns and when it's actually used, but this is acceptable for test scenarios.
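The pattern described is essentially (a sketch mirroring the bind-then-setsockopt order noted above):

```python
import socket

def find_free_network_port() -> int:
    """Bind to port 0 so the OS picks a free port, then release it (sketch).

    A small TOCTOU window remains between returning the port and the
    caller binding it, which is acceptable for tests.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))                               # OS assigns a free port
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return s.getsockname()[1]

port = find_free_network_port()
assert 0 < port < 65536
```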


66-90: LGTM - utility functions are well-implemented.

check_fp8_support provides detailed diagnostic information for test skip messages, and is_a6000_gpu correctly iterates all visible devices. The CUDA availability checks are properly handled.


48-63: Clarify the upper bound for FP4/MXFP8 support or align with FP8 handling.

The functions is_fp4_supported() and is_mxfp8_supported() use an upper bound < (12, 0) for compute capability checks, while the similar is_fp8_supported() function uses no upper bound (cc >= (8, 9)). Since Blackwell (10.0+) supports both FP4 and MXFP8, either the upper bound should be removed to match the FP8 pattern, or documented with explicit justification if future architectures (12.0+) are intentionally unsupported for these formats.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/models/megatron/hyena/test_hyena_mixer_kernel.py (3)

35-36: Good adoption of centralized port allocation.

Using find_free_network_port() from the shared utilities module eliminates potential port conflicts when running tests in parallel or when the default port is already in use.


58-58: Dynamic port allocation improves test reliability.

Replacing the hardcoded "29500" with dynamically allocated ports prevents test failures due to port conflicts in CI environments or when running multiple test sessions.


393-394: Explicit tolerances are appropriate for kernel comparison.

The rtol=0.02, atol=2e-4 tolerances are reasonable when comparing the PyTorch reference implementation against CUDA kernel outputs, as numerical differences between implementations are expected. The stricter default tolerance for the scalar loss comparison (line 394) is appropriate.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py (6)

54-55: Good adoption of centralized utilities.

Importing check_fp8_support and find_free_network_port from the shared utilities module reduces code duplication.


353-363: Well-structured fixture with proper error handling.

The fixture follows the same pattern as the existing sequences fixture and appropriately skips tests when the data file is missing.


593-613: Helper functions are correctly implemented.

mid_point_split provides flexible sequence splitting, and calculate_sequence_identity properly handles edge cases by returning None for empty sequences. Note that the identity calculation uses min_length as the denominator, which means sequences of different lengths will be compared only up to the shorter length.


726-730: None handling from calculate_sequence_identity is addressed.

The or 0.0 fallback on line 727 properly handles the case where calculate_sequence_identity returns None, addressing the past review comment about potential TypeError.


756-764: Parametrization aligns with FP8 skip logic.

The test cases for non-bf16 checkpoints (evo2/1b-8k:1.0, evo2/7b-8k:1.0) are parametrized with fp8=True, which means they will be appropriately skipped on hardware that doesn't support FP8 (via lines 797-799). This is a reasonable approach that avoids dead test cases.


845-858: None handling uses conditional append approach.

The test handles None returns from calculate_sequence_identity by conditionally appending (lines 848-849), which differs from the or 0.0 approach in test_batch_generate_coding_sequences. If a sequence identity cannot be computed, it's omitted from the results. The assertion on line 855 will fail if all identities are None, which would indicate a more fundamental issue.

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_train.py (6)

37-37: Good centralization of test utilities.

Importing shared utilities from the parent utils module maintains consistency across the test suite.


43-141: Well-documented gradient comparison utilities.

The implementation follows the theoretical bounds from the referenced paper (arxiv 2506.09280 theorem 5.3). The handling of sharded tensors in _fro_norm and relative_grad_diff is correct, and the machine epsilon mapping for FP8 to BF16 is justified by the accumulation behavior.
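Stripped of the sharded-tensor and epsilon-mapping details, the core comparison is a relative Frobenius-norm difference, roughly:

```python
import torch

def relative_grad_diff(ref: torch.Tensor, test: torch.Tensor) -> float:
    """||ref - test||_F / ||ref||_F with a clamped denominator (sketch)."""
    denom = torch.clamp(torch.linalg.norm(ref.float()), min=1e-12)
    return (torch.linalg.norm((ref - test).float()) / denom).item()

g = torch.ones(4, 4)
zero_diff = relative_grad_diff(g, g)            # identical grads -> 0
small_diff = relative_grad_diff(g, g + 1e-3)    # uniform 1e-3 perturbation -> ~1e-3
assert zero_diff == 0.0
assert abs(small_diff - 1e-3) < 1e-6
```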


149-178: Diagnostic-rich gradient assertion.

The function provides comprehensive diagnostics when assertions fail, including a shifted-tensor comparison to help identify alignment issues. Using l=0 (layer 0) for the most permissive bound is a reasonable choice for general gradient comparison.


294-338: Command construction utilities are well-structured.

The _run_train_command function properly manages environment variables and provides useful debugging output on failure. The _distributed_training_cmd correctly adjusts micro-batch size based on data parallelism to maintain the global batch size.
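The micro-batch adjustment follows from the identity global_batch = micro_batch * data_parallel_size * grad_accum_steps. A minimal sketch of that arithmetic (the function name is hypothetical; _distributed_training_cmd performs this inline):

```python
def adjusted_micro_batch_size(
    global_batch_size: int,
    data_parallel_size: int,
    grad_accum_steps: int = 1,
) -> int:
    """Shrink the per-rank micro-batch as data parallelism grows, so the
    effective global batch size stays constant across configurations."""
    divisor = data_parallel_size * grad_accum_steps
    if global_batch_size % divisor != 0:
        raise ValueError("global batch size must be divisible by dp * grad_accum")
    return global_batch_size // divisor
```

Keeping the global batch size fixed is what makes loss curves comparable across the parallelism configurations the test exercises.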


355-526: Comprehensive fine-tuning test with good validation.

The test validates the complete fine-tuning workflow including checkpoint loading, step resetting, and loss continuity. The A6000-specific NCCL_P2P_DISABLE handling (lines 412-414) addresses known hardware issues. The loss continuity assertions (lines 515-526) verify that fine-tuning starts from the trained model's learned state.
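The hardware-conditional environment handling can be sketched like this. The device-name substring match is an assumption about how the test detects A6000s; the env-var name NCCL_P2P_DISABLE is the real NCCL setting:

```python
import os


def training_env(device_name: str) -> dict[str, str]:
    """Build the subprocess environment, disabling NCCL peer-to-peer on
    A6000s, where P2P transfers are known to cause multi-GPU hangs."""
    env = dict(os.environ)
    if "A6000" in device_name:
        env["NCCL_P2P_DISABLE"] = "1"
    return env
```

Gating the workaround on the detected device keeps other GPUs on the faster P2P path while still letting the test pass on affected workstations.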


529-613: Well-designed distributed training equivalence test.

The session-scoped base_checkpoint fixture efficiently creates a reference checkpoint once for all parameterized tests. The explicit skip markers for tensor and pipeline parallelism (lines 567-579) transparently document known issues. The test validates an important property: that different parallelism strategies produce equivalent optimizer states after the same training steps.


Signed-off-by: John St. John <[email protected]>
@jstjohn
Collaborator Author

jstjohn commented Jan 15, 2026

/ok to test 817d39c

Signed-off-by: John St. John <[email protected]>
@jstjohn
Collaborator Author

jstjohn commented Jan 15, 2026

/ok to test a80393a

Member

@cspades cspades left a comment


One observation looks weird; approving pending confirmation of that.

@jstjohn jstjohn enabled auto-merge January 15, 2026 22:48
@jstjohn
Collaborator Author

jstjohn commented Jan 16, 2026

/ok to test a5eccd5

@jstjohn
Collaborator Author

jstjohn commented Jan 16, 2026

/ok to test ffab37b

Signed-off-by: John St. John <[email protected]>
@jstjohn
Collaborator Author

jstjohn commented Jan 16, 2026

/ok to test 91a5ef5

@jstjohn
Collaborator Author

jstjohn commented Jan 16, 2026

/ok to test 151e75d

@jstjohn
Collaborator Author

jstjohn commented Jan 17, 2026

/ok to test 3ea494c

@jstjohn
Collaborator Author

jstjohn commented Jan 17, 2026

/ok to test fbfa8dc
