Pretraining an LLM from scratch
An example of pre-training a 1B-parameter model on 3.66B tokens of wiki data sub-sampled from olmo-mix-1124.
Requirements:
- Singularity ≥ 3.6
- Slurm client (optional; not needed for local execution)
git clone https://github.com/alexxchen/LLM-pretraining.git
cd LLM-pretraining
./start_run.sh

The script will automatically:
- Pull the pre-built Docker image and convert it into a Singularity image
- Launch a Slurm job with the default parameters
You can inspect the data during training by running:
singularity exec olmo.sif python inspect_train_data.py --checkpoint_num 100 ./workspace/OLMo-1B-dolma2-tokenizer-wiki_{date}_{your slurm job id} 0 10
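If you want to decode token IDs by hand (for example, values surfaced by the inspection script), the dolma2 tokenizer file in `./tokenizers/` can be loaded directly with the Hugging Face `tokenizers` library. A minimal sketch; the ID list below is an illustrative placeholder, not real output:

```python
# Minimal sketch: decode token IDs with the dolma2 tokenizer file shipped in
# this repo. The example IDs are illustrative placeholders, not real output.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("./tokenizers/allenai_dolma2.json")
example_ids = [791, 4221, 1646, 374, 2294, 13]  # hypothetical token IDs
print(tok.decode(example_ids))
```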
You can convert the PyTorch checkpoints into safetensors with:
./convert_to_hf.sh \
--checkpoint-dir ./workspace/OLMo-1B-dolma2-tokenizer-wiki_{date}_{your slurm job id}/step2800-unsharded/ \
--destination-dir ./{your output dir}/ \
--tokenizer ./tokenizers/allenai_dolma2.json
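After conversion, you can sanity-check the exported model by loading it with `transformers`. A minimal sketch, assuming a `transformers` version with OLMo support and that the destination directory contains the safetensors weights along with the config and tokenizer files; the path and prompt are placeholders:

```python
# Minimal sketch: load the converted checkpoint and generate a short
# continuation. Path and prompt are placeholders; adjust to your setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./hf_model"  # your --destination-dir
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("The history of artificial intelligence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```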
You can then continue training the base model into a reasoning model using our GRPO training code (https://github.com/alexxchen/open-r1-vision).
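For reference, the core idea of GRPO is to score each sampled completion relative to the other completions for the same prompt, using group-normalized rewards as advantages. The sketch below illustrates only that normalization step; it is not code from the linked repo:

```python
# Illustrative sketch of GRPO's group-relative advantage (not code from the
# open-r1-vision repo): rewards are normalized within each prompt's group
# of sampled completions.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar reward per completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

print(group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0]])))
```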
Apache 2.0 - See LICENSE for details
Special thanks to Prof. Zuoren Wang from the Center for Excellence in Brain Science and Intelligence Technology for the support. Sincere thanks also to the OLMo team.