Describe the bug
According to the paper https://arxiv.org/abs/2205.05198, the activation memory per layer for a transformer-based model can be calculated as

sbh(34 + 5as/h) bytes per layer,
and with selective activation recompute it can be decreased to

34sbh bytes per layer.
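For reference, a minimal sketch of those two formulas (symbols as in the paper: s = sequence length, b = micro-batch size, h = hidden size, a = number of attention heads; tensor parallelism is ignored since t = 1 below):

```python
def activation_bytes_per_layer(s, b, h, a, selective=False):
    """Per-layer activation memory in bytes for one transformer layer,
    following https://arxiv.org/abs/2205.05198 (t = 1, no sequence
    parallelism). Selective recompute drops the attention
    score/softmax/dropout activations, i.e. the 5*a*s/h term."""
    if selective:
        return 34 * s * b * h
    return s * b * h * (34 + 5 * a * s / h)
```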
With my training configuration (b = 12, s = 1024, h = 1024, L = 20, a = 16) and tp = 1, pp = 1, I expected that when I use --recompute-activations, the GPU memory used for storing activations would only be about 34 / (34 + 80) ≈ 30% of that with no activation recompute applied.
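As a quick sanity check of that expectation with the numbers above:

```python
a, s, h = 16, 1024, 1024          # attention heads, sequence length, hidden size
attn_term = 5 * a * s / h         # attention score/softmax/dropout term = 80
ratio = 34 / (34 + attn_term)     # fraction of activation memory kept with selective recompute
print(f"5as/h = {attn_term:.0f}, kept fraction = {ratio:.1%}")   # ~29.8%
```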
Here is some information about the GPU memory usage. I noticed that the max_memory allocated during training only decreased from 25.52 GB (without --recompute-activations) to 24.94 GB (with --recompute-activations).
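In case it helps with reproduction, here is a small sketch of how such peak numbers can be collected around one iteration with plain torch.cuda APIs (step_fn is a placeholder for a single Megatron-LM training step, not an actual Megatron function):

```python
import torch

def report_peak_memory(step_fn, *args, **kwargs):
    """Run one training step and report peak allocated/reserved GPU memory.
    `step_fn` is a placeholder for a single training iteration."""
    torch.cuda.reset_peak_memory_stats()
    out = step_fn(*args, **kwargs)
    gib = 1024 ** 3
    print(f"max_memory_allocated: {torch.cuda.max_memory_allocated() / gib:.2f} GiB")
    print(f"max_memory_reserved:  {torch.cuda.max_memory_reserved() / gib:.2f} GiB")
    # memory_summary() breaks the total down and shows fragmentation / inactive blocks.
    print(torch.cuda.memory_summary(abbreviated=True))
    return out
```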
Expected behavior
The max_memory allocated during training should decrease more.
Environment (please complete the following information):
- Megatron-LM commit ID
- PyTorch 2.4.1
- CUDA version 12.5
- NCCL version 2.20.5
Additional context
According to the formula above, with b = 12, s = 1024, h = 1024, L = 20, a = 16, and t = 1, the original activation memory should be around 32 GB. Adding the memory for the model states, which is about 7.3 GB for a 0.43B-parameter model, gives around 40 GB, even without taking the temporary buffers and unusable fragmented memory into account. That is much bigger than the max_memory allocated without activation recompute, so I wonder whether Megatron-LM has done some optimization here.
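As a rough cross-check of the model-state figure (this assumes a standard mixed-precision Adam layout; the exact bytes per parameter depend on whether gradients are kept in fp16 or fp32):

```python
n_params = 0.43e9                 # ~0.43B parameters
for bytes_per_param in (16, 18):  # fp16 weights (2) + grads (2 or 4) + fp32 master weights/momentum/variance (12)
    print(f"{bytes_per_param} B/param -> model states ~ {n_params * bytes_per_param / 1e9:.1f} GB")
# prints ~6.9 GB and ~7.7 GB, bracketing the ~7.3 GB quoted above
```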
And why does the max_memory allocated change so little with and without --recompute-activations (which uses selective activation recompute by default, according to the docs)?