
[NVBug: 6000530] Fix AWQ crash for uncalibrated MoE experts#1142

Merged
cjluo-nv merged 3 commits into main from chenjiel/fix_awq_moe
Mar 31, 2026

Conversation


@cjluo-nv cjluo-nv commented Mar 30, 2026

Summary

  • Fixes NVBugs 6000530: AttributeError: 'float' object has no attribute 'pow' when running AWQ lite with moe_calib_experts_ratio < 1.0 on MoE models (e.g. Qwen3-30B-A3B).
  • Root cause: When moe_calib_experts_ratio=0.5, some MoE experts receive zero tokens during the AWQ cache phase, leaving act_scale as a Python float 0.0 instead of a tensor. This causes two failures:
    1. Search phase crash: Uncalibrated experts crash in get_scale() because float.pow() doesn't exist.
    2. Export crash: Calibrated experts have pre_quant_scale but uncalibrated ones don't, causing torch.stack() to fail on mixed None/tensor values in preprocess_linear_fusion().
  • Fix: Handle uncalibrated experts (num_cache_steps == 0) in two stages:
    1. Before search: Disable AWQ search (is_enabled = False) to prevent get_scale() crash on float act_scale.
    2. During postprocessing: Max calibrate weights and apply a neutral (all-ones) pre_quant_scale so export can stack scaling factors consistently across all experts. The pre_quant_scale buffer must be registered outside enable_weight_access_and_writeback because HF accelerate's post_forward hook drops newly-registered submodule buffers.

Test plan

  • Reproduce with Qwen/Qwen3-30B-A3B, --qformat int4_awq, --moe_calib_experts_ratio 0.5 — verify no crash during calibration and export

🤖 Generated with Claude Code

…rch phase

When moe_calib_experts_ratio < 1.0, some MoE experts may never receive
tokens during the AWQ cache phase, leaving act_scale as a Python float
(0.0) instead of a tensor. During the search phase, these uncalibrated
experts crash in get_scale() on float.pow(). Fix by disabling AWQ for
experts with num_cache_steps == 0 before the search phase begins, so
they gracefully fall back to max calibration.
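The crash mode itself is ordinary Python behavior: a `torch.Tensor` has a `.pow()` method, while a built-in `float` does not. A stdlib-only illustration (no modelopt or torch code involved):

```python
# Why an act_scale left as a Python float crashes code that expects a tensor:
# floats have no .pow() method, so the attribute lookup itself fails.
act_scale = 0.0  # the value an uncalibrated expert is left with

try:
    act_scale.pow(2)  # what tensor-expecting scale math effectively attempts
except AttributeError as err:
    print(err)  # 'float' object has no attribute 'pow'
```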

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from a team as a code owner March 30, 2026 20:08
@cjluo-nv cjluo-nv requested a review from kaix-nv March 30, 2026 20:08

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

The pre-search pass now skips AWQ parameter search for quantized linear modules with awq_lite.num_cache_steps == 0 by disabling AWQ. For those modules, postprocessing instead performs weight max calibration and sets input_quantizer.pre_quant_scale to all-ones with _enable_pre_quant_scale = True; postprocess itself only runs when num_search_steps != 0.

Changes

AWQ Lite calibration adjustments — modelopt/torch/quantization/model_calib.py
Added a pre-search pass that disables AWQ search for modules with awq_lite.num_cache_steps == 0, performs weight max calibration using enable_weight_access_and_writeback, and sets input_quantizer.pre_quant_scale to all-ones with _enable_pre_quant_scale = True. Postprocess was reorganized to run only when num_search_steps != 0, preserving prior behavior for num_search_steps == 0 by disabling awq_lite.is_enabled.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main fix: addressing an AWQ crash when dealing with uncalibrated MoE experts, directly referencing the NVBug ticket.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Security Anti-Patterns ✅ Passed PR introduces no security anti-patterns. Changes to model_calib.py are purely algorithmic modifications to handle uncalibrated MoE experts without deserialization, remote code execution, unsafe eval/exec, or new dependencies.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.



@cjluo-nv cjluo-nv changed the title [BugFix][6000530] Fix AWQ crash for uncalibrated MoE experts [NVBug: 6000530] Fix AWQ crash for uncalibrated MoE experts Mar 30, 2026

github-actions bot commented Mar 30, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-03-31 20:55 UTC


codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 38.46154% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.20%. Comparing base (f04e106) to head (23a901e).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/quantization/model_calib.py 38.46% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1142      +/-   ##
==========================================
+ Coverage   70.19%   70.20%   +0.01%     
==========================================
  Files         230      230              
  Lines       26073    26080       +7     
==========================================
+ Hits        18302    18310       +8     
+ Misses       7771     7770       -1     


@kevalmorabia97 kevalmorabia97 added the cherry-pick After code freeze, cherry-pick into release branch for next rc. Only for bug fixes and doc updates label Mar 30, 2026
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1182-1200: The loop that handles uncalibrated experts leaves input
quantization disabled because setup() may have turned off module.input_quantizer
but postprocess is skipped when module.awq_lite.num_cache_steps == 0; modify the
block handling those modules (the for loop iterating model.named_modules(), the
branch checking is_quantized_linear(module) && hasattr(module, "awq_lite") &&
module.awq_lite.num_cache_steps == 0) to re-enable the input_quantizer state it
originally had: after setting module.input_quantizer.pre_quant_scale and before
disabling module.awq_lite.is_enabled, restore
module.input_quantizer._enable_pre_quant_scale (or call the appropriate
re-enable API on input_quantizer) to the value it had prior to setup() so
uncalibrated experts that started with input quantization enabled end up
re-enabled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b32c7d9-8fae-4245-9dc9-bf46dfda9d9f

📥 Commits

Reviewing files that changed from the base of the PR and between 399df39 and d17d987.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/model_calib.py

Comment on lines +1182 to +1200
```python
# Handle uncalibrated experts (e.g. when moe_calib_experts_ratio < 1.0,
# some experts may never receive tokens during the cache phase, leaving act_scale
# as a Python float instead of a tensor, which would crash in get_scale()).
# We fully handle them here: max calibrate weights, apply a neutral (all-ones)
# pre_quant_scale for export consistency, and disable AWQ search.
for name, module in model.named_modules():
    if (
        is_quantized_linear(module)
        and hasattr(module, "awq_lite")
        and module.awq_lite.num_cache_steps == 0
    ):
        with enable_weight_access_and_writeback(module, model, name_to_module):
            max_calibrate(module, lambda module: module.weight_quantizer(module.weight))
        ones_scale = torch.ones(
            module.weight.shape[1], dtype=module.weight.dtype, device=module.weight.device
        )
        module.input_quantizer._enable_pre_quant_scale = True
        module.input_quantizer.pre_quant_scale = ones_scale
        module.awq_lite.is_enabled = False
```

⚠️ Potential issue | 🟠 Major

Missing input_quantizer re-enable for uncalibrated experts.

When setup() runs, it disables input_quantizer if it was originally enabled. For modules with num_cache_steps == 0, postprocess is skipped (lines 1234-1236), so the input_quantizer is never re-enabled. This will leave input quantization disabled for uncalibrated experts that originally had it enabled.

🐛 Proposed fix to re-enable input_quantizer
```diff
         module.input_quantizer._enable_pre_quant_scale = True
         module.input_quantizer.pre_quant_scale = ones_scale
+        if module.awq_lite.is_input_quantized:
+            module.input_quantizer.enable()
         module.awq_lite.is_enabled = False
```

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from meenchen March 31, 2026 19:11

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/quantization/model_calib.py (1)

1226-1242: ⚠️ Potential issue | 🟠 Major

Restore input_quantizer in the uncalibrated-expert fallback.

AWQLiteHelper.setup() disables the input quantizer at Lines 1009-1015, and the normal re-enable path lives in postprocess() at Lines 1204-1215. Because this branch skips postprocess(), experts that started with input quantization enabled silently stay disabled after AWQ completes.

🐛 Proposed fix
```diff
                 module.input_quantizer.pre_quant_scale = torch.ones(
                     w_shape,
                     dtype=w_dtype,
                     device=w_device,
                 )
+                if module.awq_lite.is_input_quantized:
+                    module.input_quantizer.enable()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1226 - 1242, This
branch skips postprocess(), so restore the input quantizer exactly as
postprocess() does: after the weight-calibration block (inside the
uncalibrated-expert fallback), re-enable the module's input quantizer and set
its pre-quant scale state by applying the same changes postprocess() applies —
e.g. flip the input_quantizer enabled flag back on and set
module.input_quantizer._enable_pre_quant_scale = True and
module.input_quantizer.pre_quant_scale = torch.ones(...) (use
w_shape/w_dtype/w_device), mirroring AWQLiteHelper.setup and postprocess()
behavior so experts that started with input quantization enabled are re-enabled
here as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1b2b5045-67b9-42d6-a698-d0441e711bdb

📥 Commits

Reviewing files that changed from the base of the PR and between d17d987 and 23a901e.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/model_calib.py


@realAsma realAsma left a comment


Looks great!

@cjluo-nv cjluo-nv merged commit ada1e26 into main Mar 31, 2026
45 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/fix_awq_moe branch March 31, 2026 20:55
