
[BUG] selective activation recompute only slightly decreases GPU memory usage during training #1225

@bugm

Description


Describe the bug
According to the paper https://arxiv.org/abs/2205.05198, the activation memory per layer for a transformer-based model can be estimated as

sbh(34 + 5as/h) bytes,

and with selective activation recompute it drops to

34 · sbh bytes.

With my training configuration (b = 12, s = 1024, h = 1024, L = 20, a = 16, t = 1; see Additional context below) and tp = 1, pp = 1, I expected that with --recompute-activations the GPU memory used for storing activations would be only about 34 / (34 + 80) ≈ 30% of the memory used without activation recompute.
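For concreteness, here is a quick back-of-the-envelope check of that ratio, just plugging my numbers into the paper's formulas (my own script, not Megatron code):

```python
# Per-layer activation memory estimates from https://arxiv.org/abs/2205.05198
#   full:      s*b*h * (34 + 5*a*s/h) bytes
#   selective: s*b*h * 34             bytes
s, b, h, a = 1024, 12, 1024, 16

full_per_layer = s * b * h * (34 + 5 * a * s / h)  # 34 + 80 = 114 bytes per s*b*h
selective_per_layer = s * b * h * 34

print(f"full per layer:      {full_per_layer / 1024**3:.2f} GiB")       # ~1.34 GiB
print(f"selective per layer: {selective_per_layer / 1024**3:.2f} GiB")  # ~0.40 GiB
print(f"expected ratio:      {selective_per_layer / full_per_layer:.2f}")  # ~0.30
```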

Here is some info about the GPU memory usage:

with --recompute-activations: (memory report screenshot)

without --recompute-activations: (memory report screenshot)

I noticed that the max_memory allocated during training only decreased from 25.52 GB to 24.94 GB.
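(As far as I can tell, the "max_memory allocated" figure in the report corresponds to PyTorch's allocator peak; the snippet below is just how I would cross-check it, not Megatron code.)

```python
import torch

# Peak memory handed out by the caching allocator (what "max_memory allocated"
# should correspond to) vs. peak memory reserved from the device
# (closer to what nvidia-smi shows).
print(f"max allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
print(f"max reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")
```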

Expected behavior
The max_memory allocated during training should decrease substantially more, roughly in line with the ~30% activation-memory estimate above.

Environment (please complete the following information):

  • Megatron-LM commit ID
  • PyTorch 2.4.1
  • CUDA version 12.5
  • NCCL version 2.20.5

Additional context
According to the formula above, with b = 12, s = 1024, h = 1024, L = 20, a = 16, t = 1, the original activation memory should be around 32 GB. Adding the memory for model states, which is about 7.3 GB for a 0.43B-parameter model, gives roughly 40 GB, even before accounting for temporary buffers and unusable memory fragments. That is much bigger than the max_memory allocated without activation recompute, so I wonder whether Megatron-LM does some optimization here?
Also, why does the max_memory allocated change so little with versus without --recompute-activations (which defaults to selective activation recompute according to the docs)?
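For reference, this is roughly how I estimated the ~7.3 GB of model states (assuming mixed-precision Adam; the exact bytes per parameter depends on whether fp32 main gradients and a distributed optimizer are used, so this is only a sketch):

```python
# Rough model-state memory for a 0.43B-parameter model with mixed-precision Adam:
# fp16 params (2) + fp16 grads (2) + fp32 master params (4)
# + fp32 Adam exp_avg (4) + fp32 Adam exp_avg_sq (4) = 16 bytes/param,
# or 18 bytes/param if fp32 main gradients are also kept.
n_params = 0.43e9

for bytes_per_param in (16, 18):
    total_gb = n_params * bytes_per_param / 1e9
    print(f"{bytes_per_param} B/param -> {total_gb:.1f} GB of model states")
# -> about 6.9-7.7 GB, consistent with the ~7.3 GB figure above.
```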
