Optimize fp8 block scaling Allgather for FSDP2 #2789

Open

vthumbe1503 wants to merge 12 commits into NVIDIA:main from vthumbe1503:optimize_fp8_blockwise_scaling

Collaborator

vthumbe1503 commented Mar 23, 2026

Description

Eliminate the columnwise all-gather for fp8_model_init with FSDP2. When FP8 block scaling is used for weights, the scaling is typically 2D. In that case, the columnwise data and scale inverse are just the transposes of the rowwise data and scale inverse, so all-gathering the rowwise data and scales is enough.
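The transpose property can be checked on a toy example. This is only a sketch under stated assumptions: it uses pure Python, integer rounding stands in for the FP8 cast, `BLOCK = 2` stands in for the real block size, and `quantize_2d_blocks`, `QMAX`, and `transpose` are hypothetical names that do not come from Transformer Engine.

```python
# Toy illustration only: integer rounding stands in for FP8 casting and
# BLOCK = 2 stands in for the real block size. None of these names are TE APIs.
BLOCK = 2
QMAX = 7.0  # hypothetical largest representable magnitude of the toy format


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def quantize_2d_blocks(w):
    """Quantize w with one scale per BLOCK x BLOCK tile; return (data, scale_inv)."""
    rows, cols = len(w), len(w[0])
    data = [[0] * cols for _ in range(rows)]
    scale_inv = [[0.0] * (cols // BLOCK) for _ in range(rows // BLOCK)]
    for bi in range(rows // BLOCK):
        for bj in range(cols // BLOCK):
            tile = [(i, j)
                    for i in range(bi * BLOCK, (bi + 1) * BLOCK)
                    for j in range(bj * BLOCK, (bj + 1) * BLOCK)]
            # The tile maximum does not depend on element order, so a tile and
            # its transpose always get the same scale.
            amax = max(abs(w[i][j]) for i, j in tile) or 1.0
            scale = QMAX / amax
            scale_inv[bi][bj] = 1.0 / scale
            for i, j in tile:
                data[i][j] = round(w[i][j] * scale)
    return data, scale_inv


w = [[0.5, -1.0, 3.0, 0.25],
     [2.0, 0.75, -0.5, 1.5],
     [-4.0, 1.0, 0.125, 0.0],
     [0.5, 2.5, -2.0, 6.0]]

row_data, row_scale_inv = quantize_2d_blocks(w)
col_data, col_scale_inv = quantize_2d_blocks(transpose(w))

# Quantizing the transposed weight gives the transposed rowwise results.
same_data = col_data == transpose(row_data)
same_scales = col_scale_inv == transpose(row_scale_inv)
```

The argument relies on square tiles: with 1D (row-vector) blocks, the tiles of the transposed matrix cover different elements, so the equality would not be expected to hold.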

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

vthumbe1503 added 3 commits

March 23, 2026 00:36


          done

ef51ab4

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>


          one review comment form greptile

d504f05

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>


          instead part of the comment not needed

e5594bc

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

vthumbe1503 changed the title ~~Optimize fp8 blockwise scaling~~ Optimize fp8 block scaling Allgather for FSDP2


          [pre-commit.ci] auto fixes from pre-commit.com hooks

bb70a33

for more information, see https://pre-commit.ci

Collaborator Author

vthumbe1503 commented Mar 23, 2026

/te-ci L1 pytorch

Contributor

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR optimises the FSDP2 all-gather path for Float8BlockwiseQTensor (2D block scaling) by eliminating the separate columnwise all-gather. Because 2D block-scaled (128×128) columnwise data is mathematically identical to the transpose of the rowwise data, only the two rowwise tensors are now communicated; every rank then derives the columnwise form locally via tex.fp8_transpose in _create_columnwise(), halving the all-gather communication volume.

fsdp_pre_all_gather now always returns only (rowwise_data, rowwise_scale_inv) regardless of forward/backward pass.
fsdp_post_all_gather calls _create_columnwise() locally when columnwise_usage=True, then delegates cleanup to update_usage(). Buffer reuse is handled correctly.
The same defensive param_group is None guard is applied consistently across all three tensor files.
All previously raised P1 concerns appear addressed.
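The flow summarized above can be sketched with plain lists. This is a simulation under assumptions, not the PR's code: `pre_all_gather`, `simulated_all_gather`, and `post_all_gather` are hypothetical stand-ins for the FSDP2 hooks and the collective, shards are split along dimension 0, and a list transpose stands in for `tex.fp8_transpose`.

```python
def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def pre_all_gather(shard):
    """Hypothetical pre-all-gather step: hand over the rowwise tensors only."""
    return (shard["rowwise_data"], shard["rowwise_scale_inv"])


def simulated_all_gather(per_rank_inputs):
    """Simulate a dim-0 all-gather by concatenating each rank's rows in rank order."""
    return tuple(
        [row for rank_tensor in same_tensor_across_ranks for row in rank_tensor]
        for same_tensor_across_ranks in zip(*per_rank_inputs)
    )


def post_all_gather(outputs, columnwise_usage):
    """Hypothetical post-all-gather step: derive the columnwise form locally."""
    data, scale_inv = outputs
    full = {"rowwise_data": data, "rowwise_scale_inv": scale_inv}
    if columnwise_usage:
        # Stand-in for the local transpose; nothing extra is communicated.
        full["columnwise_data"] = transpose(data)
        full["columnwise_scale_inv"] = transpose(scale_inv)
    return full


# Two simulated ranks, each holding two rows of a 4x4 weight and one block row of scales.
shards = [
    {"rowwise_data": [[1, 2, 3, 4], [5, 6, 7, 8]], "rowwise_scale_inv": [[0.5, 0.25]]},
    {"rowwise_data": [[9, 10, 11, 12], [13, 14, 15, 16]], "rowwise_scale_inv": [[2.0, 4.0]]},
]

gathered = simulated_all_gather([pre_all_gather(shard) for shard in shards])
full = post_all_gather(gathered, columnwise_usage=True)
```

In this sketch only two tensors per rank cross the simulated collective, and the columnwise tensors exist solely on the receiving side.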

Confidence Score: 5/5

PR is safe to merge; the optimization is mathematically sound and all previously raised P1 concerns have been addressed.

Only P2 finding remains — a trivial truncated comment. Core logic is correct for symmetric 128×128 block quantization; buffer reuse, null guards, and stale-state windows are all properly handled.

No files require special attention.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/tensor/float8_blockwise_tensor.py	Core optimization: eliminates columnwise all-gather, derives it locally via _create_columnwise(). Previously raised concerns addressed.
transformer_engine/pytorch/tensor/float8_tensor.py	Defensive refactor only: param_group None guard added.
transformer_engine/pytorch/tensor/mxfp8_tensor.py	Defensive refactor only: param_group None guard added; no functional change.

Reviews (6): Last reviewed commit: "Merge branch 'main' into optimize_fp8_bl..."

greptile-apps bot reviewed

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py Outdated

Comment on lines +641 to +642

		fsdp_state = _get_module_fsdp_state(module)
		reshard_after_forward = fsdp_state._fsdp_param_group._reshard_after_forward

Contributor

greptile-apps bot Mar 23, 2026

Unguarded access to _fsdp_param_group

fsdp_state._fsdp_param_group is typed as Optional[FSDPParamGroup] in PyTorch's FSDP2 internals — it is None for any FSDP module that does not directly manage parameters (e.g. a container module whose children are individually sharded). Accessing ._reshard_after_forward on it unconditionally will raise AttributeError: 'NoneType' object has no attribute '_reshard_after_forward' in that case.

While in practice fsdp_pre_all_gather is only called for tensors managed by a param group, this assumption is implicit. A guard makes the failure mode explicit and easier to diagnose:

fsdp_state = _get_module_fsdp_state(module)
param_group = fsdp_state._fsdp_param_group
if param_group is None:
    raise RuntimeError(
        "FSDP state for this module has no parameter group; "
        "cannot determine reshard_after_forward."
    )
reshard_after_forward = param_group._reshard_after_forward
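The difference in failure mode can be reproduced without PyTorch. In this sketch `FakeParamGroup` and `FakeFSDPState` are hypothetical stand-ins for the FSDP2 internals, and `get_reshard_after_forward` is an illustrative helper, not a function from the PR.

```python
class FakeParamGroup:
    """Hypothetical stand-in for FSDP2's parameter group object."""

    def __init__(self, reshard_after_forward):
        self._reshard_after_forward = reshard_after_forward


class FakeFSDPState:
    """Hypothetical stand-in; _fsdp_param_group stays None for container modules."""

    def __init__(self, param_group=None):
        self._fsdp_param_group = param_group


def get_reshard_after_forward(fsdp_state):
    """Read the flag, failing with a descriptive error when no group exists."""
    param_group = fsdp_state._fsdp_param_group
    if param_group is None:
        raise RuntimeError(
            "FSDP state for this module has no parameter group; "
            "cannot determine reshard_after_forward."
        )
    return param_group._reshard_after_forward


managed = FakeFSDPState(FakeParamGroup(reshard_after_forward=True))
container = FakeFSDPState()

managed_result = get_reshard_after_forward(managed)

try:
    get_reshard_after_forward(container)
    container_error = None
except RuntimeError as exc:
    container_error = str(exc)
```

Without the guard, the second call would instead surface as an `AttributeError` on `NoneType`, which says nothing about FSDP.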

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py (resolved)

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py (resolved)

vthumbe1503 added 2 commits

March 23, 2026 05:52


          address review comments

06c0952

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>


          Merge branch 'optimize_fp8_blockwise_scaling' of github.com:vthumbe15…

d46c82b

…03/TransformerEngine into optimize_fp8_blockwise_scaling

Collaborator Author

vthumbe1503 commented Mar 23, 2026

/te-ci L1 pytorch

greptile-apps bot reviewed

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py (outdated, resolved)

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py (outdated, resolved)

vthumbe1503 and others added 3 commits

March 23, 2026 08:54


          Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

347c276

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>


          No need to set it to None

Remove unnecessary columnwise data and scale inv assignments.

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>


          Merge branch 'main' into optimize_fp8_blockwise_scaling

a4e655e

greptile-apps bot reviewed

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

Comment on lines +627 to +628

		# PyTorch FSDP2 private API – tested with PyTorch 2.5+;
		from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState

Contributor

greptile-apps bot Mar 23, 2026

Inconsistent import style for TrainingState

TrainingState is imported at the module level (line 10) in float8_tensor.py and at line 13 in mxfp8_tensor.py, but here it's imported lazily inside fsdp_pre_all_gather. While the inline comment about the private API and PyTorch version is valuable, the inconsistency across the three sibling files may confuse readers.

Consider either:

Moving the TrainingState import to the module level and placing the version comment there (matching the other two files), or
Adding the same lazy-import pattern and version comment to float8_tensor.py and mxfp8_tensor.py for symmetry.

Suggested change

-		# PyTorch FSDP2 private API – tested with PyTorch 2.5+;
-		from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
+		# PyTorch FSDP2 private API – tested with PyTorch 2.5+;
+		from torch.distributed.fsdp._fully_shard._fsdp_common import TrainingState
+		from transformer_engine.pytorch.distributed import _get_module_fsdp_state

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

Comment on lines +656 to +663

+                      if reshard_after_forward:
+                          training_state = param_group._training_state
+                          is_backward_pass = training_state == TrainingState.PRE_BACKWARD
+                          rowwise_usage = not is_backward_pass
+                          columnwise_usage = is_backward_pass
+                      else:
+                          rowwise_usage = True
+                          columnwise_usage = self._quantizer.columnwise_usage

Contributor

greptile-apps bot Mar 23, 2026

columnwise_usage not derived from training state in non-resharded path

When reshard_after_forward=False, the same all-gathered weight is reused through both forward and backward passes. The code sets:

rowwise_usage = True
columnwise_usage = self._quantizer.columnwise_usage

This means whether columnwise data gets derived locally (and kept) is entirely controlled by the sharded quantizer's setting, not the actual pass. The comment in the previous code explicitly noted that both forms were needed when not resharding. If self._quantizer.columnwise_usage is False (e.g. on an architecture that doesn't need the transpose), columnwise data won't be created and won't be available for the backward pass GEMM.

This matches the pre-existing float8_tensor.py behavior (same pattern there), so it's presumably already validated by the existing usage assumptions — but it would be worth a brief comment here documenting that self._quantizer.columnwise_usage must be True whenever the backward GEMM needs columnwise access for the non-resharding path.
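The selection logic in the diff can be isolated as a pure function to make the concern concrete. This is a sketch under assumptions: `select_usages` is a hypothetical helper, and the `TrainingState` enum below is a local stand-in whose member names (other than `PRE_BACKWARD`, which appears in the diff) are assumed rather than taken from the PR.

```python
from enum import Enum, auto


class TrainingState(Enum):
    """Local stand-in for the FSDP2 private enum; member names are assumptions."""
    FORWARD = auto()
    PRE_BACKWARD = auto()
    POST_BACKWARD = auto()
    IDLE = auto()


def select_usages(reshard_after_forward, training_state, quantizer_columnwise_usage):
    """Return (rowwise_usage, columnwise_usage), mirroring the branch in the diff."""
    if reshard_after_forward:
        # Weights are gathered once per pass, so each pass keeps only what it needs.
        is_backward_pass = training_state == TrainingState.PRE_BACKWARD
        return (not is_backward_pass, is_backward_pass)
    # One gather serves both passes; the columnwise form depends on the quantizer flag.
    return (True, quantizer_columnwise_usage)


resharded_forward = select_usages(True, TrainingState.FORWARD, True)
resharded_backward = select_usages(True, TrainingState.PRE_BACKWARD, True)
kept_with_columnwise = select_usages(False, TrainingState.FORWARD, True)
kept_without_columnwise = select_usages(False, TrainingState.PRE_BACKWARD, False)
```

The last case is the one the comment describes: with resharding disabled and the quantizer flag off, no columnwise data is produced regardless of the training state.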

vthumbe1503 requested a review from ptrendx

March 23, 2026 17:48


          Merge branch 'main' into optimize_fp8_blockwise_scaling

83f0fe8

Contributor

jomitchellnv commented Mar 24, 2026

Btw I don't see any test files that were updated. I'd expect a test under tests/pytorch/fsdp/ or similar validating that the locally-derived columnwise output matches the old all-gathered columnwise output.

jomitchellnv reviewed

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py (resolved)

jomitchellnv approved these changes

Contributor

jomitchellnv left a comment

LGTM, just hoping there's some test coverage around this new implementation. I think I wrote some last time, but I'm not sure.

vthumbe1503 mentioned this pull request

adds NVFP4 Fused Adam support #2797

Open

13 tasks


          Merge branch 'main' into optimize_fp8_blockwise_scaling

2295cae

Collaborator Author

vthumbe1503 commented Mar 24, 2026

/te-ci L1 pytorch

ksivaman approved these changes

Member

ksivaman left a comment

LGTM, CI pending, consistent with other recipes

jomitchellnv mentioned this pull request

[FSDP2/Megatron-FSDP/DCP] If model parameters are DTensors, optimizer states should also be DTensors. #2795

Open

13 tasks


          Merge branch 'main' into optimize_fp8_blockwise_scaling

43c0274

Collaborator Author

vthumbe1503 commented Mar 31, 2026

/te-ci L1 pytorch
