This repository provides a minimal yet extensible implementation of the GPT architecture, enabling users to train, evaluate, and analyze transformer-based language models on custom datasets. It is designed for research, education, and small-scale experiments, with a focus on clarity and modularity.
- Custom GPT Models: Supports multiple model sizes (GPT2-14M, GPT2-29M, GPT2-49M) and is easily extensible to new configurations.
- Flexible Training: Train on your own datasets with configurable hyperparameters in shell scripts.
- Robust Inference: Generate text with various sampling strategies and checkpoint selection.
- Comprehensive Visualization: Analyze metrics, activations, and attention maps to understand model behavior.
- Modular Utilities: Includes reusable utilities for data processing, logging, and parameter calculation.
- Dataset Analysis: Evaluate dataset metrics, including basic statistics, sentence complexity, and vocabulary & domain diversity (see the sketch after this list).
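The dataset-analysis metrics above can be approximated with simple corpus statistics. The snippet below is an illustrative sketch of that idea (average sentence length as a complexity proxy, type-token ratio for vocabulary diversity); it is not the repository's actual analysis code:

```python
import re
from collections import Counter

def basic_dataset_stats(texts):
    """Toy sketch of dataset metrics: basic statistics, a sentence-complexity
    proxy, and vocabulary diversity. Not the repository's implementation."""
    sentences = [s for t in texts for s in re.split(r"[.!?]+", t) if s.strip()]
    tokens = [w.lower() for t in texts for w in re.findall(r"\w+", t)]
    vocab = Counter(tokens)
    return {
        "num_documents": len(texts),
        "num_tokens": len(tokens),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),  # complexity proxy
        "type_token_ratio": len(vocab) / max(len(tokens), 1),         # vocabulary diversity
    }

print(basic_dataset_stats(["Once upon a time, there was a tiny robot. It liked stars."]))
```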
The repository is organized as follows:

- train.py: Main script for training GPT models.
- inference.py: Script for running inference with trained models.
- scripts/: Shell scripts for streamlined training and inference workflows.
- utils/: Utility modules for argument parsing, data loading, logging, learning rate scheduling, parameter calculation (see the sketch below), and tokenization.
- visualize/: Python scripts for visualizing activations, attention, and metrics.
- data/: Contains datasets and tokenized data for training and validation.
- logs/: Stores training logs, metrics, and model checkpoints.
- models/: Implementation of the GPT model.
- report/: LaTeX report and documentation.
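As a rough idea of what the parameter-calculation utility computes, here is a sketch that counts parameters for a standard GPT-2-style architecture. The layer sizes shown are hypothetical, and the real GPT2-14M/29M/49M configurations (embedding tying, bias handling, context length) live in models/ and may differ:

```python
def gpt2_param_count(n_layer, d_model, vocab_size=50257, n_ctx=1024):
    """Approximate parameter count for a GPT-2-style model.
    The exact numbers for GPT2-14M/29M/49M depend on the configs in models/."""
    embed = vocab_size * d_model + n_ctx * d_model   # token + positional embeddings
    attn = 4 * d_model * d_model + 4 * d_model       # QKV projection + output projection (with biases)
    mlp = 8 * d_model * d_model + 5 * d_model        # two linear layers with 4x hidden width
    ln = 2 * 2 * d_model                             # two LayerNorms per block (scale + shift)
    return embed + n_layer * (attn + mlp + ln) + 2 * d_model  # + final LayerNorm

# Hypothetical configurations for illustration only.
for name, (n_layer, d_model) in {"tiny": (4, 256), "small": (6, 384)}.items():
    print(f"{name}: {gpt2_param_count(n_layer, d_model):,} parameters")
```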
To run the code, you will need:

- Python 3.8 or higher
- CUDA-enabled GPU (recommended for training, optional for inference)
- PyTorch and other dependencies listed in requirements.txt
Clone the repository and install dependencies:
```bash
git clone https://github.com/yourusername/GPTfromScratch.git
cd GPTfromScratch
pip install -r requirements.txt
```

To train a GPT model, use the provided shell scripts or run train.py directly. Example for training the 14M model:
```bash
bash scripts/train_14M.sh
```

Or customize training via Python:
```bash
python train.py --model GPT2-14M --epochs 10 --batch_size 256 --lr 3e-4 --data_path data/tinystories/tokenized_train_bs256
```

Checkpoints and logs will be saved in logs/GPT2-14M/ckpts and logs/GPT2-14M/train.log.
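If you want to inspect a saved checkpoint programmatically, something like the following works with plain PyTorch. Whether best_model.pth stores a bare state_dict or a wrapper dict with optimizer state depends on how train.py saves it, so the model_state_dict key below is an assumption:

```python
import torch

# Load a checkpoint on CPU and summarize its tensors.
ckpt = torch.load("logs/GPT2-14M/ckpts/best_model.pth", map_location="cpu")
# Assumption: the file may hold a bare state_dict or wrap one under "model_state_dict".
state_dict = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

total = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {total:,} parameters")
for name, tensor in list(state_dict.items())[:5]:
    print(f"  {name}: {tuple(tensor.shape)}")
```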
Generate text using a trained model checkpoint:
```bash
bash scripts/inference.sh -m GPT2-14M -p "Once upon a time" -l 100
```

Or use Python directly:
```bash
python inference.py --model_path logs/GPT2-14M/ckpts/best_model.pth --prompt "In a distant galaxy" --max_length 50
```

You can also list the available checkpoints:
```bash
bash scripts/inference.sh --list-ckpts -m GPT2-29M
```
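The sampling strategies mentioned in the feature list usually come down to temperature scaling and top-k filtering of the model's logits. The snippet below is a generic sketch of that idea, not the code in inference.py:

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=40):
    """Generic temperature + top-k sampling over a vector of logits."""
    logits = logits / max(temperature, 1e-8)
    if top_k is not None:
        # Mask out everything below the k-th largest logit.
        kth_value = torch.topk(logits, k=min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example with random logits standing in for a real model's output
# over a hypothetical 50257-token vocabulary.
next_token = sample_next_token(torch.randn(1, 50257))
print(next_token.item())
```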
Visualize training metrics, activations, or attention maps to better understand model performance:

```bash
python utils/visualize_metrics.py
python utils/visualize_activations.py
python utils/visualize_attention.py
```

Outputs are saved in the visualize/ directory and can be used for further analysis or reporting.
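For reference, an attention map like those produced by the visualization scripts can be rendered with matplotlib along these lines; the random matrix below stands in for a real head's attention weights:

```python
import matplotlib.pyplot as plt
import torch

# Random stand-in for a (seq_len, seq_len) attention matrix from one head.
attn = torch.softmax(torch.randn(16, 16), dim=-1)

plt.imshow(attn.numpy(), cmap="viridis")
plt.colorbar(label="attention weight")
plt.xlabel("key position")
plt.ylabel("query position")
plt.title("Attention map (single head)")
plt.savefig("visualize/attention_example.png", dpi=150)
```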