I'd like to train the llama3.1 model with a variable sequence length, as described in Meta's paper: start training with one sequence length and, after a certain number of consumed tokens, switch to a larger sequence length and continue training.
I see several possible approaches:
- Train on dataset A with sequence length A for some iterations and save a checkpoint. Continue training from that checkpoint on dataset B, which has sequence length B. At runtime seq_length stays fixed and can be taken from args. This requires minimal changes.
- Use BlendedMegatronDataset, which can build a dataset from several underlying datasets. Currently, according to BlendedMegatronDatasetConfig, all datasets share the same seq_length; this could be changed by making seq_length a list instead of an int. The high-level dataset would then have variable-sized samples: some from dataset A with seq_length A and some from dataset B with seq_length B. In this approach seq_length is not fixed; it could be taken from a global function that returns the sequence length based on the sample index or the sample length (see the first sketch after this list).
- A low-level dataset could itself be composed of variable sequence lengths. It would include a vector marking the points where the sequence length changes, and at runtime the sequence length would be taken from the sample (the sequence length is needed in several functions at runtime). As in the previous approach, seq_length would not be fixed (see the second sketch after this list).
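To make the second approach concrete, here is a minimal sketch, assuming a config where the sequence length becomes a per-dataset list and the blended dataset keeps a per-sample mapping back to its source dataset. The names `VariableSeqLenBlendConfig` and `seq_length_for_sample` are hypothetical, not part of the current Megatron-LM API:

```python
# Hypothetical sketch, not the current Megatron-LM API: a blended-dataset
# config with one sequence length per underlying dataset, and a resolver
# that maps a blended sample index back to the sequence length of the
# dataset it was drawn from.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VariableSeqLenBlendConfig:
    sequence_lengths: List[int]   # one entry per dataset, e.g. [4096, 8192]
    weights: List[float]          # blending weights, e.g. [0.7, 0.3]


def seq_length_for_sample(
    dataset_index: Sequence[int],  # per-sample source-dataset id kept by the blend
    sample_idx: int,
    config: VariableSeqLenBlendConfig,
) -> int:
    """Return the sequence length of the dataset that produced sample_idx."""
    return config.sequence_lengths[dataset_index[sample_idx]]
```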
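For the third approach, the "vector of sequence-length changes" could be a sorted list of breakpoints looked up at runtime. Again only a rough sketch; the schedule values and names are illustrative:

```python
# Hypothetical sketch of the "vector of sequence-length changes" idea:
# a sorted list of (first_sample_index, seq_len) breakpoints, looked up
# at runtime. The schedule values are only illustrative.
import bisect

SEQ_LEN_SCHEDULE = [(0, 4096), (1_000_000, 8192), (1_200_000, 131_072)]
_STARTS = [start for start, _ in SEQ_LEN_SCHEDULE]


def seq_length_at(sample_idx: int) -> int:
    """Return the sequence length in effect for a given global sample index."""
    i = bisect.bisect_right(_STARTS, sample_idx) - 1
    return SEQ_LEN_SCHEDULE[i][1]


assert seq_length_at(0) == 4096          # still in the short-sequence phase
assert seq_length_at(1_000_000) == 8192  # switched after 1M samples
```

The training loop would then query such a function (or the sample itself) instead of args.seq_length wherever the sequence length is needed.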
Any thoughts?