[None][feat] Merge feat/bench_y (deepseek-v32) changes to main #12865

dongfengy wants to merge 12 commits into NVIDIA:main from feat/bench_y
Conversation
📝 Walkthrough

This pull request refactors compute token accounting across the batch scheduler and attention operations to incorporate KV-cache reuse dynamics via an explicit reuse-adjusted compute path.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ❌ 3 failed checks (2 warnings, 1 inconclusive)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/_torch/models/modeling_exaone_moe.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Add the required NVIDIA SPDX header to this modified file.
This file is modified but currently lacks the mandated OSS copyright/license header.
As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."

Suggested patch:

```diff
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
  import math
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_exaone_moe.py` at line 1, This file (modeling_exaone_moe.py) is missing the required NVIDIA SPDX/Open Source copyright header; add the official NVIDIA header block (including SPDX-License-Identifier and the year of latest meaningful modification) at the top of the file before any imports (above the existing import math), ensuring it matches the project's mandated format for Python files so the file complies with the OSS header policy.

tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Update the SPDX copyright year for this modified file.

The header still says 2022-2024, but this file has meaningful 2026 modifications.

As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."

Suggested patch:

```diff
- # SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py` at line 1, The SPDX copyright header at the top of mamba2_mixer.py still lists "2022-2024" and must be updated to include the year of the latest meaningful modification (2026); open the top of tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py and change the SPDX/copyright year range to include 2026 (e.g., "2022-2026" or the appropriate single year), ensuring the SPDX line and any other copyright header occurrences in the file are updated consistently.

tensorrt_llm/_torch/models/modeling_llama.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Please add the NVIDIA SPDX header in this modified Python file.

The file has changes in this PR but does not include the required OSS header block.

As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."

Suggested patch:

```diff
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
  import copy
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama.py` at line 1, This file tensorrt_llm/_torch/models/modeling_llama.py is missing the required NVIDIA SPDX/open-source header; add the NVIDIA copyright header block (including SPDX identifier and the year of latest meaningful modification) at the very top of the file before any code (e.g., before the existing "import copy" line) so the header applies to the entire module.

tensorrt_llm/_torch/models/modeling_glm.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Add NVIDIA SPDX header for this modified file.

This file is changed in the PR but is missing the required header.

As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."

Suggested patch:

```diff
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
  import inspect
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_glm.py` at line 1, This file (tensorrt_llm/_torch/models/modeling_glm.py) is missing the required NVIDIA SPDX/copyright header; add the standard NVIDIA Open Source Software header (including the current year of latest meaningful modification and SPDX identifier) as the very first lines of the file before any imports (e.g., before the existing "import inspect") to comply with the project's licensing/coding guidelines.

tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Please add the NVIDIA SPDX header to this modified file.

The required OSS header is currently missing.

As per coding guidelines `**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`: "All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification."

Suggested patch:

```diff
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
  from collections.abc import Callable
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_llama_min_latency.py` at line 1, Add the NVIDIA SPDX/Open Source header to the top of tensorrt_llm/_torch/models/modeling_llama_min_latency.py (the modified file containing the import line "from collections.abc import Callable"); insert the required NVIDIA copyright/SPDX block for Python files (with the correct year of latest modification) as the very first lines of the file so it complies with the project's OSS header guidelines.

tensorrt_llm/_torch/models/modeling_qwen3_next.py (1)
230-236: ⚠️ Potential issue | 🟠 Major

Add `disable_on_compile=True` to the `Qwen3NextMTP.forward()` callsite too.

The sparse MoE block at line 230 was correctly fixed, but `Qwen3NextMTP.forward()` at line 745 still lacks the `disable_on_compile=True` guard. Since `Qwen3NextForCausalLM` supports one-model MTP execution, this unguarded callsite remains vulnerable to Dynamo tracing of Python stream/event operations.

Fix:

```diff
  inputs_embeds, hidden_states = maybe_execute_in_parallel(
      norm_embeds,
      norm_hidden,
      self.event_dict[EventType.Main],
      self.event_dict[EventType.MoeShared],
      self.aux_stream,
+     disable_on_compile=True,
  )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py` around lines 230-236, The call to Qwen3NextMTP.forward in Qwen3NextForCausalLM is missing the disable_on_compile=True guard and can be traced into Python stream/event ops; update the forward callsite (the invocation of Qwen3NextMTP.forward inside Qwen3NextForCausalLM.forward) to pass disable_on_compile=True (matching the earlier maybe_execute_in_parallel change) so MTP execution is skipped during compile/tracing and preserves event/stream semantics.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
410-411: Make `num_contexts` and `num_ctx_tokens` required here.

These counters now drive the context/generation split for both backends. Defaulting them to 0 turns any missed call-site update into silently wrong bookkeeping instead of an immediate failure.

Proposed fail-fast change:

```diff
  quant_q_buffer: Optional[torch.Tensor] = None,
- num_contexts: int = 0,
- num_ctx_tokens: int = 0,
+ num_contexts: int,
+ num_ctx_tokens: int,
  ):
+     if not 0 <= num_contexts <= self.host_request_types.shape[0]:
+         raise ValueError(f"Invalid {num_contexts=}")
+     if not 0 <= num_ctx_tokens <= q.shape[0]:
+         raise ValueError(f"Invalid {num_ctx_tokens=}")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 410-411, Remove the default 0 values from the function signature parameters num_contexts and num_ctx_tokens so they are required (i.e., change "num_contexts: int = 0, num_ctx_tokens: int = 0" to "num_contexts: int, num_ctx_tokens: int"). Update any call sites of the function that relied on the defaults to pass explicit values, and adjust any related type hints or docstrings for the function that reference these arguments (search for usages of num_contexts and num_ctx_tokens in the module to find callers); run tests to ensure no remaining implicit uses.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
2650-2656: Reuse the module-level `logger` instead of re-importing it inside `_prepare_tp_inputs`.

Line 2650 introduces a redundant local import (`_mnt_logger`) even though `logger` is already imported at module scope. Reusing the existing logger keeps the path cleaner and avoids extra import/lookup work.

Proposed cleanup:

```diff
- from tensorrt_llm.logger import logger as _mnt_logger
- _mnt_logger.error(
+ logger.error(
      f"MNT overflow: total={total_num_tokens} "
      f"max={self.max_num_tokens} "
      f"ctx_reqs={len(scheduled_requests.context_requests)} "
      f"gen_reqs={gen_count} "
      f"ctx_breakdown=[{'; '.join(ctx_details)}]")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 2650-2656, The local import of tensorrt_llm.logger as _mnt_logger inside _prepare_tp_inputs is redundant; remove that import and use the module-level logger variable (logger) instead. Update the error logging call in _prepare_tp_inputs to call logger.error(...) with the same message (including total_num_tokens, self.max_num_tokens, len(scheduled_requests.context_requests), gen_count, and ctx_details) so you reuse the existing logger symbol and avoid the extra import/lookup.

cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp (1)
27-44: Prefer an anonymous namespace and camelCase for the new helper.

This introduces a file-local helper with `static` linkage and snake_case naming. Moving it into an anonymous namespace as `reuseAdjustedCompute()` would match the repo's C++ conventions and keep this scheduler file consistent.

As per coding guidelines, "Rather than using the `static` keyword to mark a function as having internal linkage, prefer anonymous namespaces" and "Local variables, functions, methods, and namespaces should use camel case with lowercase first letter".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp` around lines 27-44, The helper function reuse_adjusted_compute has file-local static linkage and snake_case name; replace it with a function inside an anonymous namespace and rename it to camelCase reuseAdjustedCompute to follow project conventions: remove the static keyword, wrap the function definition in an anonymous namespace, change its name from reuse_adjusted_compute to reuseAdjustedCompute, and update any callers to the new name so linkage and naming are consistent.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
2810-2811: Add type annotations to the new helper signature.

`_compute_scheduled_tokens()` is newly added but has no parameter or return annotations, which weakens type-checking around the new scheduling-accounting path.

♻️ Suggested signature update:

```diff
  @staticmethod
- def _compute_scheduled_tokens(context_requests, generation_requests):
+ def _compute_scheduled_tokens(
+     context_requests: list[LlmRequest],
+     generation_requests: list[LlmRequest],
+ ) -> int:
```

As per coding guidelines, "Always annotate functions with type hints; make the return type `None` if the function does not return anything".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2810-2811, The helper _compute_scheduled_tokens lacks type annotations; update its signature to include parameter types for context_requests and generation_requests (e.g., use typing.Sequence or typing.Iterable with the appropriate element type from your codebase, such as Sequence[RequestContext] and Sequence[GenerationRequest] or Sequence[Any] if concrete types aren't yet defined) and set the return type to -> int, since the helper returns a token count; also add any needed imports from typing (Sequence/Iterable/Any) so static type checkers can validate callers of _compute_scheduled_tokens.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
740-755: Keep `_estimate_post_reuse_compute()` in one place.

This is now a second copy of reuse-adjusted compute accounting. Sharing the scheduler helper, or at least adding a parity test against it, would reduce the chance of Python pre-validation drifting away from the scheduler again.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 740 - 755, The function _estimate_post_reuse_compute duplicates reuse-adjusted compute logic already implemented in the scheduler; replace this local implementation by delegating to the scheduler's shared helper (call the existing scheduler helper that computes post-reuse forward tokens) or, if delegation is not possible, add a parity unit test that asserts _estimate_post_reuse_compute returns identical values to the scheduler helper for a wide range of reuse_tokens, chunk_size, prompt_len inputs; update references to _estimate_post_reuse_compute to use the shared helper name and ensure tokens_per_block alignment logic remains identical to the scheduler's implementation so the two cannot drift.
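The parity test the comment asks for can be sketched as below. Both helpers here are stand-ins written for illustration: the clamped "chunk size minus reusable prefix" formula is an assumption, not the repo's actual implementation, and the real test would import the two production helpers instead.

```python
# Hypothetical scheduler-side accounting: only non-reused tokens cost compute.
def reuse_adjusted_compute(chunk_size: int, reuse_tokens: int) -> int:
    return max(0, chunk_size - reuse_tokens)


# Hypothetical Python-side pre-validation: clamp to the prompt, then delegate
# to the same formula so the two paths cannot drift apart.
def estimate_post_reuse_compute(reuse_tokens: int, chunk_size: int,
                                prompt_len: int) -> int:
    effective_chunk = min(chunk_size, prompt_len)
    effective_reuse = min(reuse_tokens, effective_chunk)
    return reuse_adjusted_compute(effective_chunk, effective_reuse)


def test_parity() -> None:
    # Sweep a grid of inputs and assert both paths agree for every combination.
    for reuse in range(0, 129, 16):
        for chunk in (1, 16, 64, 128):
            for prompt in (8, 64, 256):
                expected = reuse_adjusted_compute(
                    min(chunk, prompt), min(reuse, min(chunk, prompt)))
                assert estimate_post_reuse_compute(
                    reuse, chunk, prompt) == expected, (reuse, chunk, prompt)


test_parity()
```

Running the sweep on every CI run is cheap and catches exactly the drift scenario the review comment warns about.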
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/thop/attentionOp.cpp`:
- Around line 633-634: The function signature now relies on num_contexts and
num_ctx_tokens but treats them as optional/possibly stale; update the
implementation in attentionOp.cpp to either (A) treat these as required
validated inputs (remove defaults at the public API/binding layer) or (B) add a
robust fallback: if num_contexts or num_ctx_tokens are zero or missing, derive
them from flash_mla_tile_scheduler_metadata / flash_mla_num_splits or from the
batch/token counts, then clamp and validate ranges (ensure >=0, <= total
batch/tokens, and that prefix/suffix splits produce non-negative
num_generations/num_gen_tokens). Apply the same checks to the other partitioning
block referenced (lines ~837-846). On invalid values, fail fast with a clear
error/exception rather than letting negative sizes flow into workspace sizing or
runner->run.
In `@cpp/tensorrt_llm/thop/attentionOp.h`:
- Around line 80-83: The prototype in attentionOp.h (the declaration containing
parameters flash_mla_tile_scheduler_metadata, flash_mla_num_splits,
num_contexts, num_ctx_tokens) must be reformatted with clang-format (LLVM style,
max line length 120) so it matches CI expectations; run clang-format on the
header, update the declaration to the formatted multiline style the tool
produces, and commit the resulting changes so the prototype no longer differs in
Release Checks.
In `@cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp`:
- Around line 39-90: The wrapper currently omits bounds checks so the
raw-pointer launch in tk::invokeIndexerKCacheScatter can OOB; add explicit
validations before the call: ensure static_cast<int32_t>(k_fp8.size(0)) >=
num_tokens, static_cast<int32_t>(k_scale.size(0)) >= num_tokens, and that
slot_mapping_fp8.size(0) and slot_mapping_scale.size(0) are each >= num_tokens;
verify k_cache.element_size() == 1 (byte-addressable) and that cache_dim_3
equals the expected per-token byte width (e.g. head_dim + scale_size) so the 4th
dimension matches the kernel’s per-token layout, and optionally check
k_cache.is_contiguous() or that cache strides are consistent with byte
addressing; perform these TORCH_CHECKs just before calling
tk::invokeIndexerKCacheScatter using the existing variables num_tokens,
head_dim, scale_size, cache_dim_3, slot_mapping_fp8/scale, and
k_fp8/k_scale/k_cache to guarantee a safe buffer contract.
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2810-2851: The helper _compute_scheduled_tokens is forcing a
minimum of 1 compute token for fully-reused context chunks via max(1,
ctx_req.context_chunk_size - reusable_in_chunk), which contradicts
reuse_adjusted_compute(...) that may return 0; change the logic in
_compute_scheduled_tokens so a fully reusable chunk can contribute 0 compute
tokens (e.g., use max(0, ctx_req.context_chunk_size - reusable_in_chunk) or
otherwise allow zero), keeping the existing reusable_in_chunk calculation and
the short-circuit branch intact; update references to
ctx_req.context_chunk_size, ctx_req.context_current_position,
ctx_req.estimated_reusable_tokens, and ctx_req.context_remaining_length in that
function.
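The fix described above can be sketched as follows. The field names mirror the review comment, but this standalone function is illustrative, not the repo's actual method:

```python
# Per-chunk compute accounting with KV-cache reuse credit (illustrative).
def chunk_compute_tokens(context_chunk_size: int,
                         context_current_position: int,
                         estimated_reusable_tokens: int) -> int:
    # Credit only the part of the reusable prefix that overlaps this chunk.
    reusable_in_chunk = max(
        0, estimated_reusable_tokens - context_current_position)
    reusable_in_chunk = min(reusable_in_chunk, context_chunk_size)
    # max(0, ...) lets a fully reused chunk contribute 0 compute tokens,
    # matching reuse_adjusted_compute; max(1, ...) would overcount by one
    # token per fully reused chunk.
    return max(0, context_chunk_size - reusable_in_chunk)


assert chunk_compute_tokens(64, 0, 64) == 0    # chunk fully covered by reuse
assert chunk_compute_tokens(64, 64, 64) == 64  # already past reusable prefix
assert chunk_compute_tokens(64, 0, 16) == 48   # partial reuse
```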
- Around line 2853-2918: The _maybe_log_batch_wait_decision function currently
only gates logging by self.dist.rank and passes many separate args to
logger.info (producing a malformed message); update the early-return to also
require os.environ.get("TLLM_LOG_BATCH_WAIT") == "1" (i.e., return if rank != 0
or env var != "1"), and replace the multiple-argument logger.info call with a
single format string using %s placeholders and a tuple/list of the values
(include num_scheduled_tokens, num_scheduled_ctx_formula,
num_scheduled_gen_tokens, chunk_ctx_sum, wait_threshold,
self.batch_wait_iters_count, self.batch_wait_timeout_iters, should_waiting,
len(generation_requests), "; ".join(ctx_summaries)) so the message logs
correctly.
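The gating and formatting fix can be sketched as below. The function and variable names follow the review comment, but this is a simplified stand-in (the boolean return is added only so the gate is observable), not the executor's real method:

```python
import logging
import os

logger = logging.getLogger("batch_wait_demo")


def maybe_log_batch_wait_decision(rank: int, num_scheduled_tokens: int,
                                  wait_threshold: float,
                                  should_waiting: bool) -> bool:
    # Gate on rank 0 AND the explicit opt-in env var before doing any work.
    if rank != 0 or os.environ.get("TLLM_LOG_BATCH_WAIT") != "1":
        return False
    # One format string with %s placeholders and the values as arguments:
    # logging interpolates lazily and the message is well-formed.
    logger.info("batch wait: scheduled=%s threshold=%s waiting=%s",
                num_scheduled_tokens, wait_threshold, should_waiting)
    return True


os.environ.pop("TLLM_LOG_BATCH_WAIT", None)
assert maybe_log_batch_wait_decision(0, 512, 768.0, True) is False  # env gate
os.environ["TLLM_LOG_BATCH_WAIT"] = "1"
assert maybe_log_batch_wait_decision(1, 512, 768.0, True) is False  # rank gate
assert maybe_log_batch_wait_decision(0, 512, 768.0, True) is True
```

Passing the values as arguments (rather than into the format string) also means the string work is skipped entirely when the logger level filters the record out.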
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 702-712: The branch that handles connector-managed first chunks
debits remaining_budget after accepting the request, allowing a request whose
estimated compute (req_compute after reuse) exceeds remaining_budget to be
appended to accepted_ctx_requests; move the fit check earlier: compute the
post-reuse cost using self._estimate_post_reuse_compute(reusable,
req.context_chunk_size, req.prompt_len) (same call used now) and if that
estimated cost > remaining_budget skip/continue instead of appending, otherwise
subtract it from remaining_budget and then append the request (mirroring the fit
check logic used for other requests so req.is_first_context_chunk is validated
before acceptance).
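The check-before-accept ordering can be sketched as below; the helper is hypothetical (the cost function is injected for illustration) and only shows the control flow, not the real request/budget types:

```python
# Accept a request only if its post-reuse cost fits the remaining budget;
# return the updated budget either way (illustrative sketch).
def accept_if_fits(req, remaining_budget, estimate_cost, accepted):
    cost = estimate_cost(req)
    if cost > remaining_budget:
        return remaining_budget  # skip: request does not fit
    accepted.append(req)
    return remaining_budget - cost


accepted = []
budget = 100
budget = accept_if_fits("req_a", budget, lambda r: 60, accepted)
budget = accept_if_fits("req_b", budget, lambda r: 60, accepted)  # rejected
budget = accept_if_fits("req_c", budget, lambda r: 40, accepted)
assert accepted == ["req_a", "req_c"]
assert budget == 0
```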
- Around line 682-692: The try-except around self.impl.add_sequence should not
catch all RuntimeError variants; narrow it so only the duplicate-sequence
condition appends to accepted_ctx_requests and other errors (OOM or allocation
failures) are re-raised or handled separately. Update the handler around
self.impl.add_sequence to inspect the exception type/message from add_sequence
(or catch a more specific DuplicateSequenceError if available) and only perform
logger.warning and accepted_ctx_requests.append(req) when the error clearly
indicates "already exists" (otherwise re-raise or handle as an allocation/OOM
error), leaving downstream send logic to run only for legitimately accepted
sequences.
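The narrowed handler can be sketched as below. The "already exists" substring match is an assumed error-message contract for illustration; the real fix would use a dedicated duplicate-sequence exception type if one is available:

```python
import logging

logger = logging.getLogger("kv_demo")


def add_sequence_guarded(impl, req_id, prompt_len, beam_width, req, accepted):
    try:
        impl.add_sequence(req_id, prompt_len, beam_width, req)
    except RuntimeError as e:
        # Only the duplicate-sequence case is benign; OOM and allocation
        # failures must propagate to the caller.
        if "already exists" not in str(e):
            raise
        logger.warning("sequence %s already exists; keeping it", req_id)
    accepted.append(req)


class FakeImpl:
    def add_sequence(self, req_id, prompt_len, beam_width, req):
        if req_id == "dup":
            raise RuntimeError("sequence already exists")
        if req_id == "oom":
            raise RuntimeError("out of memory")


accepted = []
add_sequence_guarded(FakeImpl(), "dup", 8, 1, "dup", accepted)
assert accepted == ["dup"]  # duplicate tolerated, request still accepted
try:
    add_sequence_guarded(FakeImpl(), "oom", 8, 1, "oom", accepted)
    raised = False
except RuntimeError:
    raised = True
assert raised and accepted == ["dup"]  # OOM re-raised, request not accepted
```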
In `@tests/unittest/_torch/attention/sparse/test_dsa_indexer.py`:
- Line 757: The print statement print(f"\n=== Path 1: CUDA Kernel ===") uses an
unnecessary f-string and triggers F541; change it to a regular string by
removing the f prefix so it reads print("\n=== Path 1: CUDA Kernel ==="),
ensuring no other formatting is needed in the surrounding test_dsa_indexer code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7a93251f-e31d-467d-8e68-525a5983ba40
📒 Files selected for processing (23)
- cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp
- cpp/tensorrt_llm/nanobind/thop/bindings.cpp
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
- cpp/tensorrt_llm/thop/attentionOp.cpp
- cpp/tensorrt_llm/thop/attentionOp.h
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/attention_backend/trtllm_gen.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- tensorrt_llm/_torch/models/modeling_exaone_moe.py
- tensorrt_llm/_torch/models/modeling_glm.py
- tensorrt_llm/_torch/models/modeling_llama.py
- tensorrt_llm/_torch/models/modeling_llama_min_latency.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/models/modeling_qwen3_next.py
- tensorrt_llm/_torch/modules/mamba/gdn_mixer.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/modules/multi_stream_utils.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
/bot run

PR_Github #42418 [ run ] triggered by Bot. Commit:

PR_Github #42418 [ run ] completed with state
govind-ramnarayan left a comment

AutoDeploy changes look good
```python
# chunk: if context_current_position has already been advanced past
# the reusable prefix (V2), the credit is 0; if not (V1), the full
# reusable count is subtracted.
reusable_in_chunk = max(0,
```

Please double check with the reuse_adjusted_compute calculation for calculating the budget.

Will push a fix myself in a new commit.
```diff
- should_waiting = self.batch_wait_iters_count < self.batch_wait_timeout_iters and num_scheduled_tokens < self.batch_wait_max_tokens_ratio * self.max_num_tokens
+ should_waiting = self.batch_wait_iters_count < self.batch_wait_timeout_iters and num_scheduled_tokens < wait_threshold
+ self._maybe_log_batch_wait_decision(context_requests,
```

The log function is always called here with info level. Maybe consider moving it to debug level.

Will push a fix myself in a new commit.
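The wait decision in the quoted diff can be sketched as a free function (attribute names follow the diff; the standalone signature is illustrative):

```python
# Decide whether the executor should keep waiting for more requests
# before launching a batch (illustrative sketch of the quoted condition).
def should_wait(batch_wait_iters_count: int,
                batch_wait_timeout_iters: int,
                num_scheduled_tokens: int,
                max_num_tokens: int,
                batch_wait_max_tokens_ratio: float) -> bool:
    # Keep waiting only while under the iteration timeout AND below the
    # token-fill threshold derived from the ratio.
    wait_threshold = batch_wait_max_tokens_ratio * max_num_tokens
    return (batch_wait_iters_count < batch_wait_timeout_iters
            and num_scheduled_tokens < wait_threshold)


assert should_wait(0, 10, 100, 1000, 0.75) is True
assert should_wait(10, 10, 100, 1000, 0.75) is False  # timeout reached
assert should_wait(0, 10, 800, 1000, 0.75) is False   # batch full enough
```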
```python
# correct budget. Non-first-chunk costs were
# already pre-subtracted above.
reusable = req.estimated_reusable_tokens
remaining_budget -= self._estimate_post_reuse_compute(
```

Shall we check if the remaining_budget could go negative after the subtraction? And only do the subtraction (and add the request) if it does not? CC @lancelly

Will push a fix myself in a new commit.
```python
self.impl.add_sequence(req.py_request_id,
                       req.prompt_len,
                       req_beam_width, req)
except RuntimeError:
```

General concern about this try-except approach with add_sequence, as it can change the state of cache blocks by claiming them. Not sure when the RuntimeError would be raised and whether it leaves the cache state unchanged. If the error occurs after state changes, we may need to add clean-up logic here to avoid the side effects.

This part can probably be optimized with #12892.
```diff
@@ -78,7 +78,8 @@ void attention(torch::Tensor q, std::optional<torch::Tensor> k, std::optional<to
      std::optional<torch::Tensor> fmha_scheduler_counter, std::optional<torch::Tensor> mla_bmm1_scale,
      std::optional<torch::Tensor> mla_bmm2_scale, std::optional<torch::Tensor> quant_q_buffer,
```
Split the monolithic mla_custom_op_inplace into two ops for DSA models:

- mla_dsa_proj (Op 1): Token-wise projections (cublas_mm, rope, FP8 quantize, weight scaling). CUDA-graph-capturable — no batch metadata access, no tensor slicing by num_tokens.
- mla_dsa_attn_inplace (Op 2): Batch-dependent k cache update, sparse_attn_indexer, and context/generation attention dispatch. Excluded from CUDA graph capture.

This enables the piecewise CUDA graph optimizer to capture the compute-heavy projection portion of DSA MLA while keeping the batch-structure-dependent attention dispatch outside the graph.

Key design decisions:

- Indexer split into pre_indexer_proj (graph-safe) and _update_k_cache (moved to Op 2) to avoid capturing metadata-dependent scatter ops.
- All num_tokens slicing deferred to Op 2 so graph capture sees fixed-shape tensors.
- Indexer intermediates (q_fp8, k_fp8, k_scale, weights) returned from Op 1 as List[Tensor] and passed explicitly to Op 2 — no stashing on self to avoid CUDA graph address aliasing.
- _should_use_short_mha disabled under torch compile for straight-line control flow in Op 1.
- Non-DSA MLA unchanged (still uses mla_custom_op_inplace).

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
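The split contract described in this commit message can be illustrated structurally, with plain Python lists standing in for tensors. Op names follow the commit message, but the bodies are toy stand-ins: the point is only that Op 1 does shape-stable, token-wise work (graph-capturable) while Op 2 owns all batch-dependent slicing and cache updates:

```python
# Op 1 stand-in: token-wise work only (projections/rope/quantize placeholder).
# No slicing by runtime num_tokens, so shapes stay fixed for graph capture.
def mla_dsa_proj(hidden):
    return [2 * h for h in hidden]


# Op 2 stand-in: batch-dependent. Update the k cache, then split the batch
# into context and generation portions for attention dispatch.
def mla_dsa_attn_inplace(proj_out, num_ctx_tokens, kv_cache):
    kv_cache.extend(proj_out[:num_ctx_tokens])
    ctx, gen = proj_out[:num_ctx_tokens], proj_out[num_ctx_tokens:]
    return ctx, gen


kv_cache = []
proj = mla_dsa_proj([1, 2, 3, 4])                   # capturable portion
ctx, gen = mla_dsa_attn_inplace(proj, 3, kv_cache)  # eager portion
assert proj == [2, 4, 6, 8]
assert (ctx, gen) == ([2, 4, 6], [8])
assert kv_cache == [2, 4, 6]
```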
…ocstring

- Remove no-op `lambda: weights` in pre_indexer's maybe_execute_in_parallel; _weight_scale already ran in pre_indexer_proj, so just call _update_k_cache directly.
- Fix mla_dsa_proj docstring: k cache update happens in Op 2 (mla_dsa_attn_inplace), not Op 1.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Move _update_k_cache call to the top of sparse_attn_indexer so the k cache is populated right before prefill chunks gather from it. Remove pre_indexer (now redundant); forward() and forward_dsa_proj both call pre_indexer_proj directly. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
- Remove dead is_dsa branch from mla_custom_op_inplace since DSA is now exclusively handled by the split mla_dsa_proj/mla_dsa_attn_inplace ops - Use literal 1 for k_scale shape to match C++ fusedCatFp8 kernel output - Simplify proj_outputs unpacking Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
The assertion was dropped when the old Indexer.forward was split into pre_indexer_proj and sparse_attn_indexer. Restore it in sparse_attn_indexer which has access to metadata.kv_cache_manager. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
554034d to 61ca3ce
…ext chunking (NVIDIA#12682) Signed-off-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com> Co-authored-by: Liao Lanyu <108499334+lancelly@users.noreply.github.com>
…t perf optimizations for DSA part (NVIDIA#12681) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
…ion path (NVIDIA#12691) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: v-shobhit <161510941+v-shobhit@users.noreply.github.com>
NVIDIA#12806) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
61ca3ce to cb206e6
Hi reviewers! https://github.com/NVIDIA/TensorRT-LLM/commits/feat/bench_y/ is a side branch we have been working on for certain benchmarks. Now we want to merge the changes back to main. Thanks!
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.