Describe the bug
According to the paper https://arxiv.org/abs/2205.05198, the activation memory per layer for a transformer-based model can be calculated as

sbh(34 + 5as/h) bytes per layer,
and with selective activation recompute it can be decreased to

34sbh bytes per layer.
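For reference, a minimal sketch of those two formulas (symbols as in the paper: s = sequence length, b = micro-batch size, h = hidden size, a = number of attention heads; tensor parallelism is ignored since t = 1 below):

```python
def activation_bytes_per_layer(s, b, h, a, selective=False):
    """Per-layer activation memory in bytes for one transformer layer,
    following https://arxiv.org/abs/2205.05198 (t = 1, no sequence
    parallelism). Selective recompute drops the attention
    score/softmax/dropout activations, i.e. the 5*a*s/h term."""
    if selective:
        return 34 * s * b * h
    return s * b * h * (34 + 5 * a * s / h)
```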
With my training configuration (b = 12, s = 1024, h = 1024, L = 20, a = 16) and tp = 1, pp = 1, I expected that when I use --recompute-activations, the GPU memory used for storing activations would only be about 34 / (34 + 80) ≈ 30% of that with no activation recompute applied.
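As a quick sanity check of that expectation with the numbers above:

```python
a, s, h = 16, 1024, 1024          # attention heads, sequence length, hidden size
attn_term = 5 * a * s / h         # attention score/softmax/dropout term = 80
ratio = 34 / (34 + attn_term)     # fraction of activation memory kept with selective recompute
print(f"5as/h = {attn_term:.0f}, kept fraction = {ratio:.1%}")   # ~29.8%
```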
Here is some information about the GPU memory usage. I noticed that the max_memory allocated during training only decreased from 25.52 GB (without --recompute-activations) to 24.94 GB (with --recompute-activations).
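In case it helps with reproduction, here is a small sketch of how such peak numbers can be collected around one iteration with plain torch.cuda APIs (step_fn is a placeholder for a single Megatron-LM training step, not an actual Megatron function):

```python
import torch

def report_peak_memory(step_fn, *args, **kwargs):
    """Run one training step and report peak allocated/reserved GPU memory.
    `step_fn` is a placeholder for a single training iteration."""
    torch.cuda.reset_peak_memory_stats()
    out = step_fn(*args, **kwargs)
    gib = 1024 ** 3
    print(f"max_memory_allocated: {torch.cuda.max_memory_allocated() / gib:.2f} GiB")
    print(f"max_memory_reserved:  {torch.cuda.max_memory_reserved() / gib:.2f} GiB")
    # memory_summary() breaks the total down and shows fragmentation / inactive blocks.
    print(torch.cuda.memory_summary(abbreviated=True))
    return out
```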
Expected behavior
The max_memory allocated during training should decrease more.
Environment (please complete the following information):
- Megatron-LM commit ID
- PyTorch 2.4.1
- CUDA version 12.5
- NCCL version 2.20.5
Additional context
According to the formula above, with b = 12, s = 1024, h = 1024, L = 20, a = 16, and t = 1, the original activation memory should be around 32 GB. Adding the memory for the model states, which is about 7.3 GB for a 0.43B-parameter model, gives around 40 GB, even without taking the temporary buffers and unusable fragmented memory into account. That is much bigger than the max_memory allocated without activation recompute, so I wonder whether Megatron-LM has done some optimization here.
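As a rough cross-check of the model-state figure (this assumes a standard mixed-precision Adam layout; the exact bytes per parameter depend on whether gradients are kept in fp16 or fp32):

```python
n_params = 0.43e9                 # ~0.43B parameters
for bytes_per_param in (16, 18):  # fp16 weights (2) + grads (2 or 4) + fp32 master weights/momentum/variance (12)
    print(f"{bytes_per_param} B/param -> model states ~ {n_params * bytes_per_param / 1e9:.1f} GB")
# prints ~6.9 GB and ~7.7 GB, bracketing the ~7.3 GB quoted above
```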
And why does the max_memory allocated change so little with and without --recompute-activations (which uses selective activation recompute by default, according to the docs)?