Open
Labels: bug, good first issue
Description
micro_acc_steps: the documentation says this flag implements microbatching, but no such functionality appears to exist.
Expected behaviour (from README --distribute_modules example)
“It accumulates gradients over 8 minibatches, and splits each minibatch into 2 microbatches before feeding them into the SAE encoder, thus saving a lot of memory.”
torchrun … --grad_acc_steps 8 … --micro_acc_steps 2
Actual behaviour in the code
sparsify/config.py:

    micro_acc_steps: int = 1  # "Chunk the activations into this number of microbatches for training"

sparsify/trainer.py (the only place the value is used):

    acc_steps = self.cfg.grad_acc_steps * self.cfg.micro_acc_steps

I don't see an actual split into micro_acc_steps microbatches; the activations are fed to the SAE whole, regardless of the value of micro_acc_steps.
From what I can see, setting micro_acc_steps > 1 only multiplies the gradient-accumulation denominator (acc_steps). That means the effective learning rate goes down, but the memory footprint stays the same.
If that’s correct, it might be worth updating the README (and the flag’s doc-string in config.py) to avoid confusion for new users.
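For reference, here is a minimal sketch (mine, not sparsify's actual code) of what a real micro_acc_steps split could look like in PyTorch. The sae module, loss_fn, and tensor layout are placeholders I'm assuming for illustration:

```python
import torch
from torch import nn

def microbatched_backward(sae: nn.Module, acts: torch.Tensor, loss_fn,
                          grad_acc_steps: int = 8, micro_acc_steps: int = 2) -> None:
    # One minibatch of activations is split into micro_acc_steps chunks, so only
    # a chunk-sized autograd graph is alive at any time -- that is where the
    # memory saving described in the README would come from.
    acc_steps = grad_acc_steps * micro_acc_steps
    for chunk in acts.chunk(micro_acc_steps, dim=0):
        recon = sae(chunk)                        # forward pass on the smaller chunk
        loss = loss_fn(recon, chunk) / acc_steps  # same scaling trainer.py already applies
        loss.backward()                           # frees this chunk's graph before the next one
    # optimizer.step() / zero_grad() would still run once every grad_acc_steps minibatches
```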