
micro_acc_steps flag functionality #111

@avgalichin

Description

micro_acc_steps: the documentation says this flag implements micro-batching, but there does not appear to be any such functionality in the code.

Expected behaviour (from the README `--distribute_modules` example)

“It accumulates gradients over 8 minibatches, and splits each minibatch into 2 microbatches before feeding them into the SAE encoder, thus saving a lot of memory.”

torchrun … --grad_acc_steps 8 … **--micro_acc_steps 2**
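
In other words, I expected each minibatch to be chunked before the SAE forward pass, roughly like the sketch below (names such as `sae` and `hiddens` are just placeholders, not the trainer's actual variables):

```python
import torch
import torch.nn as nn

grad_acc_steps, micro_acc_steps = 8, 2
acc_steps = grad_acc_steps * micro_acc_steps

sae = nn.Linear(16, 64)        # dummy stand-in for the real SAE
hiddens = torch.randn(32, 16)  # one minibatch of activations

for chunk in hiddens.chunk(micro_acc_steps, dim=0):
    recon = sae(chunk)                      # forward on a smaller slice -> lower peak memory
    loss = recon.pow(2).mean() / acc_steps  # dummy loss, scaled by the full accumulation factor
    loss.backward()                         # gradients accumulate across microbatches
```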

Actual behaviour in the code

- `sparsify/config.py`:
  `micro_acc_steps: int = 1  # "Chunk the activations into this number of microbatches for training"`
- `sparsify/trainer.py` (the only place the value is used):
  `acc_steps = self.cfg.grad_acc_steps * self.cfg.micro_acc_steps`

I don't see an actual split into micro_acc_steps microbatches anywhere; the activations are fed to the SAE whole, regardless of the micro_acc_steps value.


From what I can see, setting micro_acc_steps > 1 only multiplies the gradient-accumulation denominator (acc_steps). That means the effective learning rate goes down, but the memory footprint stays the same.
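
Assuming the trainer simply divides the loss by acc_steps (which is what calling it a denominator suggests), a quick standalone check with a dummy linear layer standing in for the SAE illustrates the effect: doubling the divisor halves the gradient while the full minibatch still goes through a single forward pass.

```python
import torch
import torch.nn as nn

sae = nn.Linear(16, 64)        # dummy stand-in for the real SAE
hiddens = torch.randn(32, 16)  # the whole minibatch, never chunked

def grad_norm(denominator: int) -> float:
    sae.zero_grad()
    loss = sae(hiddens).pow(2).mean() / denominator  # same full-batch forward either way
    loss.backward()
    return sae.weight.grad.norm().item()

# micro_acc_steps = 1 vs. 2: the gradient just shrinks by 2x, memory is identical
print(grad_norm(8), grad_norm(16))
```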

If that’s correct, it might be worth updating the README (and the flag’s docstring in config.py) to avoid confusing new users.

Labels

bug (Something isn't working), good first issue (Good for newcomers)
