
[None][feat] Merge feat/bench_y (deepseek-v32) changes to main#12865

Draft
dongfengy wants to merge 12 commits into NVIDIA:main from
dongfengy:user/dongfengy/bench_y_mass_merge

Conversation

@dongfengy (Collaborator) commented Apr 9, 2026

Hi reviewers! https://github.com/NVIDIA/TensorRT-LLM/commits/feat/bench_y/ is a side branch we have been working on for certain benchmarks. Now we want to merge the changes back to main. Thanks!

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced attention operations with explicit context and token count tracking for improved scheduling.
    • Added diagnostic logging for token overflow and batch wait decisions.
    • Introduced compile-time control for parallel execution behaviors across multiple model implementations.
  • Bug Fixes

    • Improved KV cache reuse calculations and token accounting in scheduling logic.
    • Enhanced resource validation and compute cost estimation for cache allocation.
  • Chores

    • Updated model implementations to support improved compilation and parallel execution handling.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai bot (Contributor) commented Apr 9, 2026

📝 Walkthrough

Walkthrough

This pull request refactors compute token accounting across the batch scheduler and attention operations to incorporate KV-cache reuse dynamics via an explicit reuse_adjusted_compute helper. It extends attention operation signatures to accept num_contexts and num_ctx_tokens parameters, removes internal derivation of these values, and adds a disable_on_compile flag to parallel execution helpers. Additionally, it introduces budget pre-validation and diagnostic logging for token allocation.
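The explicit-counter refactor described above can be pictured with a small sketch: the caller derives `num_contexts` and `num_ctx_tokens` once and passes them down, instead of each attention wrapper re-deriving them from `host_request_types`. The helper name and the "request type 0 means context" convention below are illustrative assumptions, not the actual TensorRT-LLM API:

```python
def context_token_counts(request_types, context_lengths):
    """Count context (prefill) requests and their total tokens up front.

    Assumes request_type 0 marks a context request; this convention is
    hypothetical and only serves to illustrate the refactor.
    """
    num_contexts = sum(1 for t in request_types if t == 0)
    num_ctx_tokens = sum(
        length for t, length in zip(request_types, context_lengths) if t == 0)
    return num_contexts, num_ctx_tokens
```

The resulting pair would then flow through the Python wrappers down to the C++ op, replacing the removed internal derivation.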

Changes

Cohort / File(s) Summary
Batch Scheduling & Compute Cost
cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp
Added reuse_adjusted_compute() helper to compute forward-pass token costs under KV-cache reuse; refactored fitDraftTokens, setCtxRequestsChunkSize methods, and scheduling entry point to use reuse-adjusted costs instead of inline clamped computations.
Attention Operation Signatures
cpp/tensorrt_llm/thop/attentionOp.h, cpp/tensorrt_llm/thop/attentionOp.cpp, cpp/tensorrt_llm/nanobind/thop/bindings.cpp
Extended attention function signature to accept int64_t num_contexts and int64_t num_ctx_tokens parameters; removed internal derivation from host_request_types and now use passed-in values for generation/context token accounting.
K-Cache Scatter Operation
cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
Refactored function signature from k_fp8_bytes/k_scale_bytes to k_fp8/k_scale tensor parameters plus explicit int64_t num_tokens; updated validation and kernel invocation to use element-size checks and byte-pointer casting.
Python Attention Backend
tensorrt_llm/_torch/attention_backend/sparse/dsa.py, tensorrt_llm/_torch/attention_backend/trtllm.py
Removed manual byte-view computations in _update_k_cache(); now pass raw tensors and num_tokens directly to C++ op. Extended TrtllmAttentionWrapper.run() and TrtllmAttention.forward() to accept and forward num_contexts and num_ctx_tokens.
Attention Generation Path
tensorrt_llm/_torch/attention_backend/trtllm_gen.py
Removed _parse_request_types() helper; now accepts num_contexts and num_ctx_tokens as explicit parameters instead of deriving them from host_request_types and host_context_lengths.
Custom Operations & Planning
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
Added num_contexts and num_ctx_tokens planner state fields, initialized from num_prefill and summed prefill sequence lengths; passed to thop.attention invocation.
Parallel Execution Control
tensorrt_llm/_torch/modules/multi_stream_utils.py
Added disable_on_compile: bool = False parameter to maybe_execute_in_parallel(), gating multi-stream execution to be disabled when compiling if flag is set.
Model Compilation Behavior
tensorrt_llm/_torch/models/modeling_*.py (Deepseekv3, ExaoneMoE, GLM, Llama, LlamaMinLatency, Nemotron, Qwen3Next), tensorrt_llm/_torch/modules/mamba/*.py (GDN, Mamba2)
Updated all maybe_execute_in_parallel() calls in MoE/attention forward paths to pass disable_on_compile=True, controlling parallel execution during compilation.
Token Accounting & Diagnostics
tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
Refactored token scheduling to compute reuse-adjusted context tokens via _compute_scheduled_tokens(); added _maybe_log_batch_wait_decision() for rank-0 diagnostics and overflow logging before token limit assertion.
Resource & Budget Management
tensorrt_llm/_torch/pyexecutor/resource_manager.py
Added pre-addSequence KV cache budget re-validation with _estimate_post_reuse_compute() helper; re-probes KV reuse for first-chunk context requests and skips requests exceeding remaining budget; handles add_sequence() failures gracefully.
Test Updates
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
Updated CUDA kernel invocation test to pass raw FP8/scale tensors and num_tokens instead of precomputed byte-level views; adjusted debug output formatting.
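The "reuse-adjusted compute" idea that recurs through these changes can be sketched as follows. This is a hypothetical Python rendition of the C++ helper's likely behavior; the block-alignment detail is an assumption drawn from the review notes about `tokens_per_block` alignment, not copied from the source:

```python
def reuse_adjusted_compute(chunk_size: int, reused_tokens: int,
                           tokens_per_block: int) -> int:
    """Estimate forward-pass compute tokens for a chunk under KV-cache reuse.

    Reused tokens are aligned down to a block boundary before being
    discounted, since reuse is assumed to be granted in whole KV-cache
    blocks; a fully reused chunk costs zero compute tokens.
    """
    aligned_reuse = (reused_tokens // tokens_per_block) * tokens_per_block
    return max(0, chunk_size - aligned_reuse)
```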

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 39.13%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check (⚠️ Warning): The PR description is incomplete and lacks essential information required by the template. Resolution: add a proper Description section explaining the changes and their purpose, specify Test Coverage with relevant test cases, and complete the PR Checklist with specific items addressed by this PR.
  • Title check (❓ Inconclusive): The title '[None][feat] Merge feat/bench_y (deepseek-v32) changes to main' is vague and generic: it uses a placeholder '[None]' prefix, describes the merge action rather than the substantive changes, and lacks specificity about what the PR actually implements. Resolution: revise to clearly summarize the main technical changes, e.g., 'Add reuse-adjusted compute accounting for KV cache and split MLA DSA custom ops for CUDA graph captureability'.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
tensorrt_llm/_torch/models/modeling_exaone_moe.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the required NVIDIA SPDX header to this modified file.

This file is modified but currently lacks the mandated OSS copyright/license header.

Suggested patch
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 import math
As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_exaone_moe.py` at line 1, This file
(modeling_exaone_moe.py) is missing the required NVIDIA SPDX/Open Source
copyright header; add the official NVIDIA header block (including
SPDX-License-Identifier and the year of latest meaningful modification) at the
top of the file before any imports (above the existing import math), ensuring it
matches the project's mandated format for Python files so the file complies with
the OSS header policy.
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update the SPDX copyright year for this modified file.

The header still says 2022-2024, but this file has meaningful 2026 modifications.

Suggested patch
- # SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py` at line 1, The SPDX
copyright header at the top of mamba2_mixer.py still lists "2022-2024" and must
be updated to include the year of the latest meaningful modification (2026);
open the top of tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py and change the
SPDX/ copyright year range to include 2026 (e.g., "2022-2026" or the appropriate
single year), ensuring the SPDX line and any other copyright header occurrences
in the file are updated consistently.
tensorrt_llm/_torch/models/modeling_llama.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Please add the NVIDIA SPDX header in this modified Python file.

The file has changes in this PR but does not include the required OSS header block.

Suggested patch
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 import copy
As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama.py` at line 1, This file
tensorrt_llm/_torch/models/modeling_llama.py is missing the required NVIDIA
SPDX/open-source header; add the NVIDIA copyright header block (including SPDX
identifier and the year of latest meaningful modification) at the very top of
the file before any code (e.g., before the existing "import copy" line) so the
header applies to the entire module.
tensorrt_llm/_torch/models/modeling_glm.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add NVIDIA SPDX header for this modified file.

This file is changed in the PR but is missing the required header.

Suggested patch
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 import inspect
As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_glm.py` at line 1, This file
(tensorrt_llm/_torch/models/modeling_glm.py) is missing the required NVIDIA
SPDX/copyright header; add the standard NVIDIA Open Source Software header
(including the current year of latest meaningful modification and SPDX
identifier) as the very first lines of the file before any imports (e.g., before
the existing "import inspect") to comply with the project's licensing/coding
guidelines.
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Please add the NVIDIA SPDX header to this modified file.

The required OSS header is currently missing.

Suggested patch
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
 from collections.abc import Callable
As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` at line 1, Add the
NVIDIA SPDX/Open Source header to the top of
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (the modified file
containing the import line "from collections.abc import Callable"); insert the
required NVIDIA copyright/SPDX block for Python files (with the correct year of
latest modification) as the very first lines of the file so it complies with the
project's OSS header guidelines.
tensorrt_llm/_torch/models/modeling_qwen3_next.py (1)

230-236: ⚠️ Potential issue | 🟠 Major

Add disable_on_compile=True to the Qwen3NextMTP.forward() callsite too.

The sparse MoE block at line 230 was correctly fixed, but Qwen3NextMTP.forward() at line 745 still lacks the disable_on_compile=True guard. Since Qwen3NextForCausalLM supports one-model MTP execution, this unguarded callsite remains vulnerable to Dynamo tracing of Python stream/event operations.

Fix
         inputs_embeds, hidden_states = maybe_execute_in_parallel(
             norm_embeds,
             norm_hidden,
             self.event_dict[EventType.Main],
             self.event_dict[EventType.MoeShared],
             self.aux_stream,
+            disable_on_compile=True,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py` around lines 230 - 236,
The call to Qwen3NextMTP.forward in Qwen3NextForCausalLM is missing the
disable_on_compile=True guard and can be traced into Python stream/event ops;
update the forward callsite (the invocation of Qwen3NextMTP.forward inside
Qwen3NextForCausalLM.forward) to pass disable_on_compile=True (matching the
earlier maybe_execute_in_parallel change) so MTP execution is skipped during
compile/tracing and preserves event/stream semantics.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

410-411: Make num_contexts and num_ctx_tokens required here.

These counters now drive the context/generation split for both backends. Defaulting them to 0 turns any missed call-site update into silently wrong bookkeeping instead of an immediate failure.

Proposed fail-fast change
         quant_q_buffer: Optional[torch.Tensor] = None,
-        num_contexts: int = 0,
-        num_ctx_tokens: int = 0,
+        num_contexts: int,
+        num_ctx_tokens: int,
     ):
+        if not 0 <= num_contexts <= self.host_request_types.shape[0]:
+            raise ValueError(f"Invalid {num_contexts=}")
+        if not 0 <= num_ctx_tokens <= q.shape[0]:
+            raise ValueError(f"Invalid {num_ctx_tokens=}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 410 - 411,
Remove the default 0 values from the function signature parameters num_contexts
and num_ctx_tokens so they are required (i.e., change "num_contexts: int = 0,
num_ctx_tokens: int = 0" to "num_contexts: int, num_ctx_tokens: int"). Update
any call sites of the function that relied on the defaults to pass explicit
values, and adjust any related type hints or docstrings for the function that
reference these arguments (search for usages of num_contexts and num_ctx_tokens
in the module to find callers); run tests to ensure no remaining implicit uses.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

2650-2656: Reuse the module-level logger instead of re-importing it inside _prepare_tp_inputs.

Line 2650 introduces a redundant local import (_mnt_logger) even though logger is already imported at module scope. Reusing the existing logger keeps the path cleaner and avoids extra import/lookup work.

Proposed cleanup
-            from tensorrt_llm.logger import logger as _mnt_logger
-            _mnt_logger.error(
+            logger.error(
                 f"MNT overflow: total={total_num_tokens} "
                 f"max={self.max_num_tokens} "
                 f"ctx_reqs={len(scheduled_requests.context_requests)} "
                 f"gen_reqs={gen_count} "
                 f"ctx_breakdown=[{'; '.join(ctx_details)}]")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 2650 - 2656, The
local import of tensorrt_llm.logger as _mnt_logger inside _prepare_tp_inputs is
redundant; remove that import and use the module-level logger variable (logger)
instead. Update the error logging call in _prepare_tp_inputs to call
logger.error(...) with the same message (including total_num_tokens,
self.max_num_tokens, len(scheduled_requests.context_requests), gen_count, and
ctx_details) so you reuse the existing logger symbol and avoid the extra
import/lookup.
cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp (1)

27-44: Prefer an anonymous namespace and camelCase for the new helper.

This introduces a file-local helper with static linkage and snake_case naming. Moving it into an anonymous namespace as reuseAdjustedCompute() would match the repo's C++ conventions and keep this scheduler file consistent.

As per coding guidelines, "Rather than using the static keyword to mark a function as having internal linkage, prefer anonymous namespaces" and "Local variables, functions, methods, and namespaces should use camel case with lowercase first letter".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp` around lines 27 - 44,
The helper function reuse_adjusted_compute has file-local static linkage and
snake_case name; replace it with a function inside an anonymous namespace and
rename it to camelCase reuseAdjustedCompute to follow project conventions:
remove the static keyword, wrap the function definition in an anonymous
namespace, change its name from reuse_adjusted_compute to reuseAdjustedCompute,
and update any callers to the new name so linkage and naming are consistent.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

2810-2811: Add type annotations to the new helper signature.

_compute_scheduled_tokens() is newly added but has no parameter or return annotations, which weakens type-checking around the new scheduling-accounting path.

♻️ Suggested signature update
     @staticmethod
-    def _compute_scheduled_tokens(context_requests, generation_requests):
+    def _compute_scheduled_tokens(
+        context_requests: list[LlmRequest],
+        generation_requests: list[LlmRequest],
+    ) -> int:

As per coding guidelines, "Always annotate functions with type hints; make the return type None if the function does not return anything".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2810 - 2811, The
helper _compute_scheduled_tokens lacks type annotations; update its signature to
include parameter types for context_requests and generation_requests (e.g., use
typing.Sequence or typing.Iterable with the appropriate element type from your
codebase, such as Sequence[RequestContext] and Sequence[GenerationRequest] or
Sequence[Any] if concrete types aren't yet defined) and set the return type to
-> None; also add any needed imports from typing (Sequence/Iterable/Any) so
static type checkers can validate callers of _compute_scheduled_tokens.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

740-755: Keep _estimate_post_reuse_compute() in one place.

This is now a second copy of reuse-adjusted compute accounting. Sharing the scheduler helper, or at least adding a parity test against it, would reduce the chance of Python pre-validation drifting away from the scheduler again.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 740 - 755,
The function _estimate_post_reuse_compute duplicates reuse-adjusted compute
logic already implemented in the scheduler; replace this local implementation by
delegating to the scheduler's shared helper (call the existing scheduler helper
that computes post-reuse forward tokens) or, if delegation is not possible, add
a parity unit test that asserts _estimate_post_reuse_compute returns identical
values to the scheduler helper for a wide range of reuse_tokens, chunk_size,
prompt_len inputs; update references to _estimate_post_reuse_compute to use the
shared helper name and ensure tokens_per_block alignment logic remains identical
to the scheduler's implementation so the two cannot drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/thop/attentionOp.cpp`:
- Around line 633-634: The function signature now relies on num_contexts and
num_ctx_tokens but treats them as optional/possibly stale; update the
implementation in attentionOp.cpp to either (A) treat these as required
validated inputs (remove defaults at the public API/binding layer) or (B) add a
robust fallback: if num_contexts or num_ctx_tokens are zero or missing, derive
them from flash_mla_tile_scheduler_metadata / flash_mla_num_splits or from the
batch/token counts, then clamp and validate ranges (ensure >=0, <= total
batch/tokens, and that prefix/suffix splits produce non-negative
num_generations/num_gen_tokens). Apply the same checks to the other partitioning
block referenced (lines ~837-846). On invalid values, fail fast with a clear
error/exception rather than letting negative sizes flow into workspace sizing or
runner->run.

In `@cpp/tensorrt_llm/thop/attentionOp.h`:
- Around line 80-83: The prototype in attentionOp.h (the declaration containing
parameters flash_mla_tile_scheduler_metadata, flash_mla_num_splits,
num_contexts, num_ctx_tokens) must be reformatted with clang-format (LLVM style,
max line length 120) so it matches CI expectations; run clang-format on the
header, update the declaration to the formatted multiline style the tool
produces, and commit the resulting changes so the prototype no longer differs in
Release Checks.

In `@cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp`:
- Around line 39-90: The wrapper currently omits bounds checks so the
raw-pointer launch in tk::invokeIndexerKCacheScatter can OOB; add explicit
validations before the call: ensure static_cast<int32_t>(k_fp8.size(0)) >=
num_tokens, static_cast<int32_t>(k_scale.size(0)) >= num_tokens, and that
slot_mapping_fp8.size(0) and slot_mapping_scale.size(0) are each >= num_tokens;
verify k_cache.element_size() == 1 (byte-addressable) and that cache_dim_3
equals the expected per-token byte width (e.g. head_dim + scale_size) so the 4th
dimension matches the kernel’s per-token layout, and optionally check
k_cache.is_contiguous() or that cache strides are consistent with byte
addressing; perform these TORCH_CHECKs just before calling
tk::invokeIndexerKCacheScatter using the existing variables num_tokens,
head_dim, scale_size, cache_dim_3, slot_mapping_fp8/scale, and
k_fp8/k_scale/k_cache to guarantee a safe buffer contract.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2810-2851: The helper _compute_scheduled_tokens is forcing a
minimum of 1 compute token for fully-reused context chunks via max(1,
ctx_req.context_chunk_size - reusable_in_chunk), which contradicts
reuse_adjusted_compute(...) that may return 0; change the logic in
_compute_scheduled_tokens so a fully reusable chunk can contribute 0 compute
tokens (e.g., use max(0, ctx_req.context_chunk_size - reusable_in_chunk) or
otherwise allow zero), keeping the existing reusable_in_chunk calculation and
the short-circuit branch intact; update references to
ctx_req.context_chunk_size, ctx_req.context_current_position,
ctx_req.estimated_reusable_tokens, and ctx_req.context_remaining_length in that
function.
- Around line 2853-2918: The _maybe_log_batch_wait_decision function currently
only gates logging by self.dist.rank and passes many separate args to
logger.info (producing a malformed message); update the early-return to also
require os.environ.get("TLLM_LOG_BATCH_WAIT") == "1" (i.e., return if rank != 0
or env var != "1"), and replace the multiple-argument logger.info call with a
single format string using %s placeholders and a tuple/list of the values
(include num_scheduled_tokens, num_scheduled_ctx_formula,
num_scheduled_gen_tokens, chunk_ctx_sum, wait_threshold,
self.batch_wait_iters_count, self.batch_wait_timeout_iters, should_waiting,
len(generation_requests), "; ".join(ctx_summaries)) so the message logs
correctly.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 702-712: The branch that handles connector-managed first chunks
debits remaining_budget after accepting the request, allowing a request whose
estimated compute (req_compute after reuse) exceeds remaining_budget to be
appended to accepted_ctx_requests; move the fit check earlier: compute the
post-reuse cost using self._estimate_post_reuse_compute(reusable,
req.context_chunk_size, req.prompt_len) (same call used now) and if that
estimated cost > remaining_budget skip/continue instead of appending, otherwise
subtract it from remaining_budget and then append the request (mirroring the fit
check logic used for other requests so req.is_first_context_chunk is validated
before acceptance).
- Around line 682-692: The try-except around self.impl.add_sequence should not
catch all RuntimeError variants; narrow it so only the duplicate-sequence
condition appends to accepted_ctx_requests and other errors (OOM or allocation
failures) are re-raised or handled separately. Update the handler around
self.impl.add_sequence to inspect the exception type/message from add_sequence
(or catch a more specific DuplicateSequenceError if available) and only perform
logger.warning and accepted_ctx_requests.append(req) when the error clearly
indicates "already exists" (otherwise re-raise or handle as an allocation/OOM
error), leaving downstream send logic to run only for legitimately accepted
sequences.

In `@tests/unittest/_torch/attention/sparse/test_dsa_indexer.py`:
- Line 757: The print statement print(f"\n=== Path 1: CUDA Kernel ===") uses an
unnecessary f-string and triggers F541; change it to a regular string by
removing the f prefix so it reads print("\n=== Path 1: CUDA Kernel ==="),
ensuring no other formatting is needed in the surrounding test_dsa_indexer code.
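The logging fix described for `_maybe_log_batch_wait_decision` combines two changes: gate on an opt-in env var in addition to rank 0, and pass a single %s-style format string so the logger renders one coherent message instead of treating extra positional values as format arguments. A minimal sketch, with illustrative names and fields:

```python
import logging
import os

logger = logging.getLogger("tllm.batch_wait")

def maybe_log_batch_wait_decision(rank: int, stats: dict) -> None:
    # Log only on rank 0 and only when explicitly enabled via env var.
    if rank != 0 or os.environ.get("TLLM_LOG_BATCH_WAIT") != "1":
        return
    # One format string with %s placeholders: formatting is lazy and the
    # message stays intact, unlike logger.info(a, b, c, ...) with bare args.
    logger.info(
        "batch wait: scheduled=%s ctx=%s gen=%s threshold=%s waiting=%s",
        stats["scheduled"], stats["ctx"], stats["gen"],
        stats["threshold"], stats["waiting"],
    )
```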

---

Outside diff comments:
In `@tensorrt_llm/_torch/models/modeling_exaone_moe.py`:
- Line 1: This file (modeling_exaone_moe.py) is missing the required NVIDIA
SPDX/Open Source copyright header; add the official NVIDIA header block
(including SPDX-License-Identifier and the year of latest meaningful
modification) at the top of the file before any imports (above the existing
import math), ensuring it matches the project's mandated format for Python files
so the file complies with the OSS header policy.

In `@tensorrt_llm/_torch/models/modeling_glm.py`:
- Line 1: This file (tensorrt_llm/_torch/models/modeling_glm.py) is missing the
required NVIDIA SPDX/copyright header; add the standard NVIDIA Open Source
Software header (including the current year of latest meaningful modification
and SPDX identifier) as the very first lines of the file before any imports
(e.g., before the existing "import inspect") to comply with the project's
licensing/coding guidelines.

In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py`:
- Line 1: Add the NVIDIA SPDX/Open Source header to the top of
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (the modified file
containing the import line "from collections.abc import Callable"); insert the
required NVIDIA copyright/SPDX block for Python files (with the correct year of
latest modification) as the very first lines of the file so it complies with the
project's OSS header guidelines.

In `@tensorrt_llm/_torch/models/modeling_llama.py`:
- Line 1: This file tensorrt_llm/_torch/models/modeling_llama.py is missing the
required NVIDIA SPDX/open-source header; add the NVIDIA copyright header block
(including SPDX identifier and the year of latest meaningful modification) at
the very top of the file before any code (e.g., before the existing "import
copy" line) so the header applies to the entire module.

In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Around line 230-236: The call to Qwen3NextMTP.forward in Qwen3NextForCausalLM
is missing the disable_on_compile=True guard and can be traced into Python
stream/event ops; update the forward callsite (the invocation of
Qwen3NextMTP.forward inside Qwen3NextForCausalLM.forward) to pass
disable_on_compile=True (matching the earlier maybe_execute_in_parallel change)
so MTP execution is skipped during compile/tracing and preserves event/stream
semantics.

In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py`:
- Line 1: The SPDX copyright header at the top of mamba2_mixer.py still lists
"2022-2024" and must be updated to include the year of the latest meaningful
modification (2026); open the top of
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py and change the SPDX/ copyright
year range to include 2026 (e.g., "2022-2026" or the appropriate single year),
ensuring the SPDX line and any other copyright header occurrences in the file
are updated consistently.

---

Nitpick comments:
In `@cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp`:
- Around line 27-44: The helper function reuse_adjusted_compute has file-local
static linkage and snake_case name; replace it with a function inside an
anonymous namespace and rename it to camelCase reuseAdjustedCompute to follow
project conventions: remove the static keyword, wrap the function definition in
an anonymous namespace, change its name from reuse_adjusted_compute to
reuseAdjustedCompute, and update any callers to the new name so linkage and
naming are consistent.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around lines 410-411: Remove the default 0 values from the signature
parameters num_contexts and num_ctx_tokens so they are required (i.e., change
"num_contexts: int = 0, num_ctx_tokens: int = 0" to "num_contexts: int,
num_ctx_tokens: int"). Update any call sites that relied on the defaults to
pass explicit values, adjust any related type hints or docstrings that
reference these arguments (search the module for usages of num_contexts and
num_ctx_tokens to find callers), and run the tests to ensure no implicit uses
remain.
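The motivation for dropping the defaults can be shown with a minimal before/after sketch (function names here are hypothetical, not the TRT-LLM signatures): with defaults, a caller that forgets the arguments silently gets 0; without them, the mistake surfaces immediately as a TypeError.

```python
def plan_with_defaults(num_contexts: int = 0, num_ctx_tokens: int = 0):
    # Before: a forgotten argument silently falls back to 0, which can
    # corrupt downstream scheduling without any error.
    return num_contexts, num_ctx_tokens

def plan_required(num_contexts: int, num_ctx_tokens: int):
    # After: both counts must be passed explicitly at every call site;
    # omitting them raises TypeError at the caller, not deep in scheduling.
    return num_contexts, num_ctx_tokens
```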

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around lines 2650-2656: The local import of tensorrt_llm.logger as
_mnt_logger inside _prepare_tp_inputs is redundant; remove that import and use
the module-level logger variable (logger) instead. Update the error logging
call in _prepare_tp_inputs to call logger.error(...) with the same message
(including total_num_tokens, self.max_num_tokens,
len(scheduled_requests.context_requests), gen_count, and ctx_details) so the
existing logger symbol is reused and the extra import/lookup avoided.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around lines 2810-2811: The helper _compute_scheduled_tokens lacks type
annotations; update its signature to include parameter types for
context_requests and generation_requests (e.g., use typing.Sequence or
typing.Iterable with the appropriate element types from the codebase, such as
Sequence[RequestContext] and Sequence[GenerationRequest], or Sequence[Any] if
concrete types aren't yet defined) and set the return type to -> None. Add any
needed imports from typing (Sequence/Iterable/Any) so static type checkers can
validate callers of _compute_scheduled_tokens.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around lines 740-755: The function _estimate_post_reuse_compute duplicates
reuse-adjusted compute logic already implemented in the scheduler. Replace this
local implementation by delegating to the scheduler's shared helper (the
existing helper that computes post-reuse forward tokens) or, if delegation is
not possible, add a parity unit test asserting that
_estimate_post_reuse_compute returns values identical to the scheduler helper
across a wide range of reuse_tokens, chunk_size, and prompt_len inputs. Update
references to use the shared helper name and keep the tokens_per_block
alignment logic identical to the scheduler's implementation so the two cannot
drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a93251f-e31d-467d-8e68-525a5983ba40

📥 Commits

Reviewing files that changed from the base of the PR and between 2fe39c1 and cdf7717.

📒 Files selected for processing (23)
  • cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/thop/attentionOp.h
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/attention_backend/trtllm_gen.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_exaone_moe.py
  • tensorrt_llm/_torch/models/modeling_glm.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/models/modeling_llama_min_latency.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/models/modeling_qwen3_next.py
  • tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/multi_stream_utils.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/unittest/_torch/attention/sparse/test_dsa_indexer.py

@dongfengy dongfengy changed the title [None][feat] Merge feat/bench_y (deepseek-v32) changes to main draft: [None][feat] Merge feat/bench_y (deepseek-v32) changes to main Apr 9, 2026
@dongfengy dongfengy changed the title draft: [None][feat] Merge feat/bench_y (deepseek-v32) changes to main [None][feat] Merge feat/bench_y (deepseek-v32) changes to main Apr 9, 2026
@dongfengy dongfengy requested review from hyukn, lancelly, liji-nv and v-shobhit and removed request for Fridah-nv, brb-nv, hlu1, mikeiovine, omera-nv and symphonylyh April 9, 2026 00:56
@dongfengy (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #42418 [ run ] triggered by Bot. Commit: 554034d

@longlee0622 longlee0622 marked this pull request as draft April 9, 2026 04:30
@tensorrt-cicd (Collaborator)

PR_Github #42418 [ run ] completed with state SUCCESS. Commit: 554034d
/LLM/main/L0_MergeRequest_PR pipeline #33188 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@govind-ramnarayan (Collaborator) left a comment

AutoDeploy changes look good

# chunk: if context_current_position has already been advanced past
# the reusable prefix (V2), the credit is 0; if not (V1), the full
# reusable count is subtracted.
reusable_in_chunk = max(0,

Please double check with the reuse_adjusted_compute calculation for calculating the budget.


Will push a fix myself in a new commit.
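The credit rule described in the quoted hunk's comment can be sketched as a small function. This is one plausible reading of the truncated `max(0, ...)` expression (full credit if the position has not yet advanced past the reusable prefix, zero once it has), labeled as an assumption rather than the actual TRT-LLM code:

```python
def reusable_credit(context_current_position, reusable_tokens):
    # Assumed reading of the V1/V2 comment: in V2 the position has already
    # been advanced past the reusable prefix, so no credit remains; in V1
    # the full (remaining) reusable count is subtracted from the budget.
    return max(0, reusable_tokens - context_current_position)
```

Any budget calculation using this credit should agree with the reuse_adjusted_compute side, which is exactly the cross-check the review comment asks for.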


should_waiting = self.batch_wait_iters_count < self.batch_wait_timeout_iters and num_scheduled_tokens < self.batch_wait_max_tokens_ratio * self.max_num_tokens
should_waiting = self.batch_wait_iters_count < self.batch_wait_timeout_iters and num_scheduled_tokens < wait_threshold
self._maybe_log_batch_wait_decision(context_requests,

The log function is always called here with info level. Maybe consider moving it to debug level.


Will push a fix myself in a new commit.

# correct budget. Non-first-chunk costs were
# already pre-subtracted above.
reusable = req.estimated_reusable_tokens
remaining_budget -= self._estimate_post_reuse_compute(

Shall we check if the remaining_budget could go negative after the subtraction? And only do the subtraction if not and add the request? CC @lancelly


Will push a fix myself in a new commit.
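The guard being requested amounts to: only subtract the estimated cost and admit the request when the budget stays non-negative. A minimal sketch (names hypothetical, since the fix has not landed yet):

```python
def try_schedule(remaining_budget, cost):
    # Suggested guard: reject the request instead of driving the budget
    # negative; the caller keeps the untouched budget for later requests.
    if cost > remaining_budget:
        return remaining_budget, False
    return remaining_budget - cost, True
```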

self.impl.add_sequence(req.py_request_id,
req.prompt_len,
req_beam_width, req)
except RuntimeError:

General concern about this try-except approach with add_sequence, as it can change the state of cache blocks by claiming them. Not sure when the RuntimeError would arise and whether it leaves the cache state unchanged.
If the error occurs after state changes, maybe we need to add clean-up logic here to avoid the side-effects.


This part can probably be optimized with #12892.
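The cleanup concern can be sketched as a rollback wrapper. This is an illustration only: `remove_sequence` is assumed here as the undo operation, and whether add_sequence actually leaves partial claims behind (and what the correct undo is) is exactly the open question in the thread.

```python
def add_sequence_safely(impl, req, beam_width):
    # Sketch of the suggested cleanup: if add_sequence can partially claim
    # cache blocks before raising, roll the claim back on failure so the
    # cache state is unchanged. remove_sequence is an assumed undo hook;
    # the real KV-cache manager API may differ.
    try:
        impl.add_sequence(req.py_request_id, req.prompt_len, beam_width, req)
        return True
    except RuntimeError:
        # Undo any partial allocation so no blocks stay claimed.
        impl.remove_sequence(req.py_request_id, req)
        return False
```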

@@ -78,7 +78,8 @@ void attention(torch::Tensor q, std::optional<torch::Tensor> k, std::optional<to
std::optional<torch::Tensor> fmha_scheduler_counter, std::optional<torch::Tensor> mla_bmm1_scale,
std::optional<torch::Tensor> mla_bmm2_scale, std::optional<torch::Tensor> quant_q_buffer,

Dependencies (merge them first):
#12631
#12892

liji-nv added 5 commits April 10, 2026 18:51
Split the monolithic mla_custom_op_inplace into two ops for DSA models:
- mla_dsa_proj (Op 1): Token-wise projections (cublas_mm, rope, FP8
  quantize, weight scaling). CUDA-graph-capturable — no batch metadata
  access, no tensor slicing by num_tokens.
- mla_dsa_attn_inplace (Op 2): Batch-dependent k cache update,
  sparse_attn_indexer, and context/generation attention dispatch.
  Excluded from CUDA graph capture.

This enables the piecewise CUDA graph optimizer to capture the
compute-heavy projection portion of DSA MLA while keeping the
batch-structure-dependent attention dispatch outside the graph.

Key design decisions:
- Indexer split into pre_indexer_proj (graph-safe) and _update_k_cache
  (moved to Op 2) to avoid capturing metadata-dependent scatter ops.
- All num_tokens slicing deferred to Op 2 so graph capture sees
  fixed-shape tensors.
- Indexer intermediates (q_fp8, k_fp8, k_scale, weights) returned from
  Op 1 as List[Tensor] and passed explicitly to Op 2 — no stashing on
  self to avoid CUDA graph address aliasing.
- _should_use_short_mha disabled under torch compile for straight-line
  control flow in Op 1.
- Non-DSA MLA unchanged (still uses mla_custom_op_inplace).

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
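The "no stashing on self" decision in this commit message can be illustrated structurally. Plain lists stand in for tensors; the point is that Op 1 returns its intermediates and Op 2 receives them as explicit arguments, so each captured graph's buffers are pinned by the call signature rather than by a mutable attribute that a later call could rebind to a different address under CUDA graph replay.

```python
def mla_dsa_proj(tokens):
    # Op 1 (graph-capturable): fixed-shape, token-wise work only. The
    # arithmetic here is a placeholder for projection/quantization outputs.
    q_fp8 = [t * 2 for t in tokens]
    k_fp8 = [t + 1 for t in tokens]
    # Returned explicitly, never stored on self: no hidden aliasing between
    # the captured graph and later invocations.
    return q_fp8, k_fp8

def mla_dsa_attn_inplace(q_fp8, k_fp8):
    # Op 2 (outside graph capture): batch-dependent work that consumes the
    # intermediates passed in from Op 1.
    return [q + k for q, k in zip(q_fp8, k_fp8)]
```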
…ocstring

- Remove no-op `lambda: weights` in pre_indexer's maybe_execute_in_parallel;
  _weight_scale already ran in pre_indexer_proj, so just call _update_k_cache
  directly.
- Fix mla_dsa_proj docstring: k cache update happens in Op 2
  (mla_dsa_attn_inplace), not Op 1.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Move _update_k_cache call to the top of sparse_attn_indexer so
the k cache is populated right before prefill chunks gather from it.
Remove pre_indexer (now redundant); forward() and forward_dsa_proj
both call pre_indexer_proj directly.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
- Remove dead is_dsa branch from mla_custom_op_inplace since DSA is now
  exclusively handled by the split mla_dsa_proj/mla_dsa_attn_inplace ops
- Use literal 1 for k_scale shape to match C++ fusedCatFp8 kernel output
- Simplify proj_outputs unpacking

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
The assertion was dropped when the old Indexer.forward was split into
pre_indexer_proj and sparse_attn_indexer. Restore it in
sparse_attn_indexer which has access to metadata.kv_cache_manager.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@dongfengy dongfengy force-pushed the user/dongfengy/bench_y_mass_merge branch from 554034d to 61ca3ce Compare April 10, 2026 18:54
lancelly and others added 7 commits April 10, 2026 18:58
…ext chunking (NVIDIA#12682)

Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
Co-authored-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
…t perf optimizations for DSA part (NVIDIA#12681)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
…ion path (NVIDIA#12691)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: v-shobhit <161510941+v-shobhit@users.noreply.github.com>
NVIDIA#12806)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
@dongfengy dongfengy force-pushed the user/dongfengy/bench_y_mass_merge branch from 61ca3ce to cb206e6 Compare April 10, 2026 18:59

9 participants