[PyT][Test] Add xfailing FSDP2 memory leak detection tests #2803
Open
pstjohn wants to merge 1 commit into NVIDIA:main
Conversation
Greptile Summary: This PR adds xfailing regression tests that document two known FSDP2 + FP8 memory leak bugs (#2681 and #2717).
Confidence Score: 5/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[pytest test_torch_fsdp2.py] --> B[test_fsdp2_mem_leak_tests]
    B --> C["torchrun -m pytest run_fsdp2_mem_leak.py\n(2 GPUs, FSDP2)"]
    C --> D[test_bf16_no_excess_forward_memory\ncontrol — PASS]
    C --> E[test_bf16_no_excess_backward_memory\ncontrol — PASS]
    C --> F["test_fp8_temp_accumulation_across_layers\nxfail — 5 recipes × 2 init modes"]
    C --> G["test_transpose_cache_retained_after_backward\nxfail — 5 recipes × 2 init modes"]
    D --> D1[_LayerMemoryTracker hooks\ncheck per-layer BF16 uniformity]
    E --> E1[_measure_backward_memory_delta\nbf16 vs bf16 excess ≈ 0]
    F --> F1[BF16 baseline avg increment\nvs FP8 avg increment\nexcess > 50 KiB/layer → xfail]
    G --> G1[BF16 backward delta\nvs FP8 backward delta\nexcess > 256 KiB → xfail]
```
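The per-layer leak check in the flowchart (node F1) reduces to a simple comparison: average the per-layer forward memory increments of the FP8 run and the bf16 baseline, and flag an excess above ~50 KiB/layer. A minimal sketch of that comparison logic, with illustrative function and variable names (not the PR's actual API):

```python
KIB = 1024

def excess_per_layer(bf16_increments, fp8_increments, threshold_bytes=50 * KIB):
    """Return (avg_excess_bytes, leak_detected) given per-layer memory deltas.

    bf16_increments / fp8_increments: bytes of memory growth observed per
    layer during the forward pass (e.g. collected via forward hooks).
    """
    assert len(bf16_increments) == len(fp8_increments)
    bf16_avg = sum(bf16_increments) / len(bf16_increments)
    fp8_avg = sum(fp8_increments) / len(fp8_increments)
    excess = fp8_avg - bf16_avg
    return excess, excess > threshold_bytes

# Example: an FP8 run that retains ~0.68 MiB (696 KiB) extra per layer,
# the symptom described for issue #2681.
bf16 = [2 * KIB] * 8
fp8 = [2 * KIB + 696 * KIB] * 8
excess, leaked = excess_per_layer(bf16, fp8)  # leaked -> True
```

In the real tests the increments come from CUDA allocator statistics sampled around each layer's forward; the sketch only shows the thresholding step.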
Reviews (3): Last reviewed commit: "[PyT][Test] Add xfailing FSDP2 memory le..."
Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during the te.autocast() forward pass accumulate across layers instead of being freed between layers, defeating FSDP2's memory efficiency. Detected by comparing per-layer forward memory increments against a bf16 baseline using layer hooks.
- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during backward persist until the next forward pass instead of being freed after backward completes. Detected by comparing the backward memory delta (post_bwd - post_fwd) against a bf16 baseline.

New tests:

- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes × {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
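The detection strategy for the second issue compares the backward memory delta (post_bwd - post_fwd) of an FP8 run against the same delta from a bf16 reference run. A hedged sketch of that comparison, with illustrative names not taken from the PR:

```python
MIB = 1024 * 1024

def backward_excess(post_fwd, post_bwd, ref_post_fwd, ref_post_bwd,
                    threshold_bytes=256 * 1024):
    """Return (excess_bytes, retained) comparing an FP8 run's backward
    memory delta against a bf16 reference run's delta.

    post_fwd / post_bwd: allocated bytes after forward / after backward
    for the FP8 run; ref_* are the same measurements for bf16.
    """
    excess = (post_bwd - post_fwd) - (ref_post_bwd - ref_post_fwd)
    return excess, excess > threshold_bytes

# Example: transpose-cache tensors retained after backward leave ~3 MiB
# extra allocated, the symptom described for issue #2717.
excess, retained = backward_excess(post_fwd=10 * MIB, post_bwd=14 * MIB,
                                   ref_post_fwd=10 * MIB, ref_post_bwd=11 * MIB)
# retained -> True
```

Measuring a delta relative to a bf16 control, rather than an absolute threshold, keeps the test robust to allocator noise and model-size changes.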
29cd628 to 27a505f
Summary
Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by `te.autocast()` accumulate across layers (~0.68 MiB/layer excess over the bf16 baseline). Detected for all 5 recipes with `no_quant_init`.

Issue #2717: Transpose cache retained after backward

`_create_transpose` tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for `DelayedScaling` and `Float8CurrentScaling` with `quant_init`.

New tests (in `run_fsdp2_mem_leak.py`):

- `test_bf16_no_excess_forward_memory`
- `test_bf16_no_excess_backward_memory`
- `test_fp8_temp_accumulation_across_layers`
- `test_transpose_cache_retained_after_backward`

All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.
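The 5 recipes × 2 init modes matrix yields 10 cases per FP8 test. A minimal sketch of how such a case matrix can be enumerated; the PR only names `DelayedScaling` and `Float8CurrentScaling`, so the other three recipe names below are placeholders, not the actual recipe classes used:

```python
from itertools import product

# Two recipe names come from the PR summary; RecipeC/D/E stand in for the
# remaining three FP8 recipes, whose names the summary does not list.
RECIPES = ["DelayedScaling", "Float8CurrentScaling",
           "RecipeC", "RecipeD", "RecipeE"]
INIT_MODES = ["no_quant_init", "quant_init"]

def make_case_ids():
    """Build one test id per (recipe, init mode) combination: 5 x 2 = 10."""
    return [f"{recipe}-{mode}" for recipe, mode in product(RECIPES, INIT_MODES)]
```

In pytest these combinations would typically be fed to `@pytest.mark.parametrize`, with `pytest.mark.xfail` applied to the combinations known to leak.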
Test plan

`pytest tests/pytorch/distributed/test_torch_fsdp2.py` — all 4 outer tests pass (including the existing model and fused_adam tests)