[PyT][Test] Add xfailing FSDP2 memory leak detection tests #2803

Open

pstjohn wants to merge 1 commit into NVIDIA:main from pstjohn:pstjohn/fsdp2-mem-leak-tests

Conversation

pstjohn (Contributor) commented Mar 25, 2026

Summary

Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by te.autocast() accumulate across layers (~0.68 MiB/layer excess over bf16 baseline). Detected for all 5 recipes with no_quant_init.

Issue #2717: Transpose cache retained after backward

_create_transpose tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for DelayedScaling and Float8CurrentScaling with quant_init.

New tests (in run_fsdp2_mem_leak.py)

| Test | Type | What it checks |
| --- | --- | --- |
| `test_bf16_no_excess_forward_memory` | control (PASS) | bf16 per-layer increments are uniform |
| `test_bf16_no_excess_backward_memory` | control (PASS) | bf16 vs bf16 backward delta shows zero excess |
| `test_fp8_temp_accumulation_across_layers` | xfail | FP8 per-layer forward increment exceeds bf16 |
| `test_transpose_cache_retained_after_backward` | xfail | FP8 backward delta exceeds bf16 baseline |

All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.
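The per-layer forward check described above can be sketched roughly as follows. The function names and the 50 KiB/layer threshold mirror this PR's description, but are illustrative: the real tests in `run_fsdp2_mem_leak.py` read `torch.cuda.memory_allocated()` inside forward hooks, whereas here the allocator samples are passed in directly.

```python
# Hypothetical sketch of the per-layer excess check behind
# test_fp8_temp_accumulation_across_layers (names are illustrative).
KIB = 1024
MIB = 1024 * 1024

def avg_layer_increment(samples):
    """Average memory growth per layer, given one allocator reading (bytes)
    taken after each layer's forward pass."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    return sum(deltas) / len(deltas)

def excess_per_layer(bf16_samples, fp8_samples, threshold=50 * KIB):
    """Compare FP8 against the bf16 baseline.
    Returns (excess_bytes, leaking); a leak is flagged when FP8 grows
    more than `threshold` bytes per layer beyond bf16."""
    excess = avg_layer_increment(fp8_samples) - avg_layer_increment(bf16_samples)
    return excess, excess > threshold

# Synthetic readings: bf16 grows 1 MiB/layer, FP8 grows an extra
# ~0.68 MiB/layer (the excess reported for issue #2681).
bf16 = [i * MIB for i in range(5)]
fp8 = [i * int(1.68 * MIB) for i in range(5)]
excess, leaking = excess_per_layer(bf16, fp8)
```

A bf16-vs-bf16 comparison through the same function yields zero excess, which is what the control tests validate before the FP8 comparisons are trusted.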

Test plan

  • pytest tests/pytorch/distributed/test_torch_fsdp2.py — all 4 outer tests pass (including existing model and fused_adam tests)
  • bf16 control tests PASS
  • FP8 accumulation tests XFAIL for affected configurations
  • Pre-commit hooks pass

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds xfailing regression tests that document two known FSDP2 + FP8 memory leak bugs (#2681 and #2717). A new run_fsdp2_mem_leak.py script provides four tests: two bf16 control tests that validate the measurement methodology (expected to PASS), and two FP8 tests marked xfail(strict=False) that expose the known bugs. The outer test_torch_fsdp2.py gains a test_fsdp2_mem_leak_tests() wrapper that follows the same torchrun-as-subprocess pattern as the existing FSDP2 test suites.

Confidence Score: 5/5

  • Safe to merge — tests are additive, xfail-guarded, and follow established patterns; no production code is changed.
  • The only open finding is a P2 docstring omission. All previously reported P1 issues have been resolved, logic and parametrization are correct, and the changes are confined to test infrastructure.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `tests/pytorch/distributed/fsdp2_tests/run_fsdp2_mem_leak.py` | New file: 4 FSDP2 memory-leak detection tests (2 bf16 control + 2 xfail FP8), with helpers for per-layer forward hook measurement and backward delta comparison. One minor docstring omission (missing `bf16_no_excess_backward_memory` in the standalone test list); all three previously flagged issues have been resolved. |
| `tests/pytorch/distributed/test_torch_fsdp2.py` | Adds a `test_fsdp2_mem_leak_tests()` outer pytest test that spawns the inner torchrun/pytest subprocess for the new memory-leak file. Consistent with the existing `test_fsdp2_fused_adam_tests` pattern (2-GPU minimum, 600 s timeout, returncode 0 or 5 check). |
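The torchrun-as-subprocess pattern noted above can be sketched as below. The helper name is hypothetical; the accepted returncode 5 corresponds to pytest's `ExitCode.NO_TESTS_COLLECTED`, which the outer test tolerates (e.g. when the inner file collects nothing on an unsupported configuration). A trivial stand-in command is used here in place of the real `torchrun` invocation.

```python
# Illustrative sketch of the outer-test wrapper pattern: run the inner
# test file in a subprocess and accept pytest exit codes 0 (all passed)
# and 5 (no tests collected). Command and helper name are hypothetical.
import subprocess
import sys

def run_inner_tests(cmd, timeout=600):
    result = subprocess.run(cmd, timeout=timeout)
    # returncode 5 is pytest's ExitCode.NO_TESTS_COLLECTED
    assert result.returncode in (0, 5), f"inner tests failed: {result.returncode}"
    return result.returncode

# Stand-in for something like:
#   torchrun --nproc_per_node=2 -m pytest run_fsdp2_mem_leak.py
rc = run_inner_tests([sys.executable, "-c", "import sys; sys.exit(5)"])
```

Treating 5 as success keeps CI green on machines without the required 2 GPUs while still failing loudly on any real test error.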

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[pytest test_torch_fsdp2.py] --> B[test_fsdp2_mem_leak_tests]
    B --> C["torchrun -m pytest run_fsdp2_mem_leak.py\n(2 GPUs, FSDP2)"]

    C --> D[test_bf16_no_excess_forward_memory\ncontrol — PASS]
    C --> E[test_bf16_no_excess_backward_memory\ncontrol — PASS]
    C --> F["test_fp8_temp_accumulation_across_layers\nxfail — 5 recipes × 2 init modes"]
    C --> G["test_transpose_cache_retained_after_backward\nxfail — 5 recipes × 2 init modes"]

    D --> D1[_LayerMemoryTracker hooks\ncheck per-layer BF16 uniformity]
    E --> E1[_measure_backward_memory_delta\nbf16 vs bf16 excess ≈ 0]
    F --> F1[BF16 baseline avg increment\nvs FP8 avg increment\nexcess > 50 KiB/layer → xfail]
    G --> G1[BF16 backward delta\nvs FP8 backward delta\nexcess > 256 KiB → xfail]
```


Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.
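The backward-delta comparison described in the second bullet can be sketched as follows. Names and thresholds are illustrative (the 256 KiB cutoff comes from this PR's summary); the real tests read `torch.cuda.memory_allocated()` after forward and after backward rather than taking the readings as arguments.

```python
# Hypothetical sketch of the backward-delta check behind
# test_transpose_cache_retained_after_backward (issue #2717).
KIB = 1024
MIB = 1024 * 1024

def backward_delta(post_fwd, post_bwd):
    """Bytes retained across backward: allocator reading after backward
    minus the reading after forward. Near zero when activations and
    temporaries are freed once backward completes."""
    return post_bwd - post_fwd

def transpose_cache_excess(bf16, fp8, threshold=256 * KIB):
    """Compare FP8's backward delta with the bf16 baseline. Tensors that
    survive backward (like a retained transpose cache) show up as excess."""
    excess = backward_delta(*fp8) - backward_delta(*bf16)
    return excess, excess > threshold

# bf16 frees everything (delta 0); FP8 retains ~3 MiB, as reported above.
excess, leaking = transpose_cache_excess(bf16=(10 * MIB, 10 * MIB),
                                         fp8=(10 * MIB, 13 * MIB))
```

Differencing against the bf16 baseline cancels out framework-level allocations common to both runs, so only FP8-specific retention crosses the threshold.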

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.
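The 5 × 2 parametrization grid can be sketched as below. Only two recipe names appear in this PR's text, so the remaining entries are labeled placeholders, not the actual transformer_engine recipe classes used in `run_fsdp2_mem_leak.py`.

```python
# Sketch of the test parametrization: 5 FP8 recipes crossed with two
# initialization modes gives 10 cases per xfail test. In pytest this
# would be @pytest.mark.parametrize over both axes with
# xfail(strict=False), so a fixed leak does not start breaking CI.
from itertools import product

RECIPES = [
    "DelayedScaling",
    "Float8CurrentScaling",
    "recipe_3",  # placeholder: the PR text names only two of the five
    "recipe_4",  # placeholder
    "recipe_5",  # placeholder
]
INIT_MODES = ["no_quant_init", "quant_init"]

CASES = [f"{recipe}-{mode}" for recipe, mode in product(RECIPES, INIT_MODES)]
```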

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
pstjohn force-pushed the pstjohn/fsdp2-mem-leak-tests branch from 29cd628 to 27a505f on March 30, 2026 at 19:36.