[PyT][Test] Add xfailing FSDP2 memory leak detection tests #2803
Open
pstjohn wants to merge 1 commit into NVIDIA:main
Conversation
Greptile Summary: This PR adds xfailing regression tests that document two known FSDP2 + FP8 memory leak bugs (#2681 and #2717).
Confidence Score: 5/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[pytest test_torch_fsdp2.py] --> B[test_fsdp2_mem_leak_tests]
    B --> C["torchrun -m pytest run_fsdp2_mem_leak.py\n(2 GPUs, FSDP2)"]
    C --> D[test_bf16_no_excess_forward_memory\ncontrol — PASS]
    C --> E[test_bf16_no_excess_backward_memory\ncontrol — PASS]
    C --> F["test_fp8_temp_accumulation_across_layers\nxfail — 5 recipes × 2 init modes"]
    C --> G["test_transpose_cache_retained_after_backward\nxfail — 5 recipes × 2 init modes"]
    D --> D1[_LayerMemoryTracker hooks\ncheck per-layer BF16 uniformity]
    E --> E1[_measure_backward_memory_delta\nbf16 vs bf16 excess ≈ 0]
    F --> F1[BF16 baseline avg increment\nvs FP8 avg increment\nexcess > 50 KiB/layer → xfail]
    G --> G1[BF16 backward delta\nvs FP8 backward delta\nexcess > 256 KiB → xfail]
```
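The per-layer leak check in the flowchart (node F1) reduces to a simple comparison: average the per-layer forward memory increments of the FP8 run and the bf16 baseline, and flag an excess above ~50 KiB/layer. A minimal sketch of that comparison logic, with illustrative function and variable names (not the PR's actual API):

```python
KIB = 1024

def excess_per_layer(bf16_increments, fp8_increments, threshold_bytes=50 * KIB):
    """Return (avg_excess_bytes, leak_detected) given per-layer memory deltas.

    bf16_increments / fp8_increments: bytes of memory growth observed per
    layer during the forward pass (e.g. collected via forward hooks).
    """
    assert len(bf16_increments) == len(fp8_increments)
    bf16_avg = sum(bf16_increments) / len(bf16_increments)
    fp8_avg = sum(fp8_increments) / len(fp8_increments)
    excess = fp8_avg - bf16_avg
    return excess, excess > threshold_bytes

# Example: an FP8 run that retains ~0.68 MiB (696 KiB) extra per layer,
# the symptom described for issue #2681.
bf16 = [2 * KIB] * 8
fp8 = [2 * KIB + 696 * KIB] * 8
excess, leaked = excess_per_layer(bf16, fp8)  # leaked -> True
```

In the real tests the increments come from CUDA allocator statistics sampled around each layer's forward; the sketch only shows the thresholding step.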
Reviews (3): Last reviewed commit: "[PyT][Test] Add xfailing FSDP2 memory le..."
Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during the te.autocast() forward pass accumulate across layers instead of being freed between layers, defeating FSDP2's memory efficiency. Detected by comparing per-layer forward memory increments against a bf16 baseline using layer hooks.
- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during backward persist until the next forward pass instead of being freed after backward completes. Detected by comparing the backward memory delta (post_bwd - post_fwd) against a bf16 baseline.

New tests:

- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes × {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
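The detection strategy for the second issue compares the backward memory delta (post_bwd - post_fwd) of an FP8 run against the same delta from a bf16 reference run. A hedged sketch of that comparison, with illustrative names not taken from the PR:

```python
MIB = 1024 * 1024

def backward_excess(post_fwd, post_bwd, ref_post_fwd, ref_post_bwd,
                    threshold_bytes=256 * 1024):
    """Return (excess_bytes, retained) comparing an FP8 run's backward
    memory delta against a bf16 reference run's delta.

    post_fwd / post_bwd: allocated bytes after forward / after backward
    for the FP8 run; ref_* are the same measurements for bf16.
    """
    excess = (post_bwd - post_fwd) - (ref_post_bwd - ref_post_fwd)
    return excess, excess > threshold_bytes

# Example: transpose-cache tensors retained after backward leave ~3 MiB
# extra allocated, the symptom described for issue #2717.
excess, retained = backward_excess(post_fwd=10 * MIB, post_bwd=14 * MIB,
                                   ref_post_fwd=10 * MIB, ref_post_bwd=11 * MIB)
# retained -> True
```

Measuring a delta relative to a bf16 control, rather than an absolute threshold, keeps the test robust to allocator noise and model-size changes.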
29cd628 to 27a505f
Summary
Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by `te.autocast()` accumulate across layers (~0.68 MiB/layer excess over the bf16 baseline). Detected for all 5 recipes with `no_quant_init`.

Issue #2717: Transpose cache retained after backward

`_create_transpose` tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for `DelayedScaling` and `Float8CurrentScaling` with `quant_init`.

New tests (in `run_fsdp2_mem_leak.py`):

- `test_bf16_no_excess_forward_memory`
- `test_bf16_no_excess_backward_memory`
- `test_fp8_temp_accumulation_across_layers`
- `test_transpose_cache_retained_after_backward`

All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.
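The 5 recipes × 2 init modes matrix yields 10 cases per FP8 test. A minimal sketch of how such a case matrix can be enumerated; the PR only names `DelayedScaling` and `Float8CurrentScaling`, so the other three recipe names below are placeholders, not the actual recipe classes used:

```python
from itertools import product

# Two recipe names come from the PR summary; RecipeC/D/E stand in for the
# remaining three FP8 recipes, whose names the summary does not list.
RECIPES = ["DelayedScaling", "Float8CurrentScaling",
           "RecipeC", "RecipeD", "RecipeE"]
INIT_MODES = ["no_quant_init", "quant_init"]

def make_case_ids():
    """Build one test id per (recipe, init mode) combination: 5 x 2 = 10."""
    return [f"{recipe}-{mode}" for recipe, mode in product(RECIPES, INIT_MODES)]
```

In pytest these combinations would typically be fed to `@pytest.mark.parametrize`, with `pytest.mark.xfail` applied to the combinations known to leak.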
Test plan

`pytest tests/pytorch/distributed/test_torch_fsdp2.py` — all 4 outer tests pass (including the existing model and fused_adam tests)