@thomasfortin1

I made two changes which should help future users implement MuP correctly:

Previously, a user could mistakenly use mup.Adam, mup.AdamW, or mup.SGD, which are just the regular PyTorch optimizers, instead of the correct mup.MuAdam, mup.MuAdamW, or mup.MuSGD. The vanilla PyTorch optimizers can no longer be accessed accidentally through the mup package.
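For illustration, here is a minimal sketch of one common way names can leak through a package: a wildcard import re-exports every public name of a submodule unless that submodule defines `__all__`. This is an assumption about the general mechanism, not a description of how mup is actually laid out or how this PR fixes it. The module names and the helper `star_import_names` are hypothetical, and the optimizer classes are replaced by placeholder objects so the snippet runs without PyTorch.

```python
import sys
import types


def star_import_names(module_name, attrs):
    """Build a throwaway module and return the names `from <module> import *` exposes."""
    mod = types.ModuleType(module_name)
    for name, value in attrs.items():
        setattr(mod, name, value)
    sys.modules[module_name] = mod  # register so the import statement can find it
    namespace = {}
    try:
        exec(f"from {module_name} import *", namespace)
    finally:
        del sys.modules[module_name]  # clean up the hypothetical module
    # Drop dunder entries such as __builtins__ that exec adds to the namespace.
    return {name for name in namespace if not name.startswith("__")}


# Hypothetical submodule that imported a vanilla optimizer next to its own wrapper.
leaky = star_import_names(
    "hypothetical_optim_leaky",
    {"Adam": object(), "MuAdam": object()},
)

# Same contents, but __all__ restricts what a wildcard import re-exports.
guarded = star_import_names(
    "hypothetical_optim_guarded",
    {"Adam": object(), "MuAdam": object(), "__all__": ["MuAdam"]},
)

print(sorted(leaky))
print(sorted(guarded))
```

In the first case both names come through the wildcard import. In the second, only the name listed in `__all__` does.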

If mup.MuAdam is used with weight decay, a warning now prompts the user to switch to mup.MuAdamW for correct weight decay scaling, as described in Appendix B.3 of the arXiv version of the paper. Note that a coord check will not flag MuAdam with weight decay as an incorrect implementation, but in my experience, increasing model size will still eventually lead to diminishing performance unless MuAdamW is used instead.
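The warning described above could look roughly like the following sketch. It only illustrates the idea using the standard `warnings` module. The function name `warn_if_weight_decay`, its arguments, and the message text are hypothetical and are not the code in this PR.

```python
import warnings


def warn_if_weight_decay(optimizer_name, weight_decay=0.0):
    """Warn when a MuAdam-style optimizer is given nonzero weight decay.

    Hypothetical helper: returns True if a warning was issued, False otherwise.
    """
    if optimizer_name == "MuAdam" and weight_decay != 0:
        warnings.warn(
            "MuAdam was given weight_decay != 0; consider MuAdamW "
            "for correctly scaled weight decay.",
            UserWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return True
    return False


# Record warnings so the behavior can be inspected without printing to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned_with_decay = warn_if_weight_decay("MuAdam", weight_decay=0.1)
    warned_without_decay = warn_if_weight_decay("MuAdam", weight_decay=0.0)
    warned_for_adamw = warn_if_weight_decay("MuAdamW", weight_decay=0.1)
```

Only the first call issues a warning. A zero weight decay or the MuAdamW name passes silently.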

@thomasfortin1
Author

@microsoft-github-policy-service agree
