A collection of experiments exploring the lifecycle of Small Language Models (SLMs). This repository covers Supervised Fine-Tuning (SFT), Preference Optimization (DPO/CPO), and architectural evolution via upcycling a dense model into a Mixture-of-Experts (MoE) with targeted expert specialization.
The experiments utilize SmolLM-135M as the base model.
I explore methods to align a base model to follow grammatical correction instructions using the CoEdIT dataset.
- SFT (Supervised Fine-Tuning): Establishing a baseline for instruction following.
- DPO (Direct Preference Optimization): Optimizing using edit-distance-based preference pairs.
- CPO (Contrastive Preference Optimization): Exploring memory-efficient alternatives to DPO.
- Result: Achieved ~0.49 BLEU score with SFT+DPO.
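The edit-distance-based pairing mentioned above can be sketched as follows. This is an illustrative, self-contained example (the `levenshtein` and `build_preference_pair` helpers are hypothetical names, not this repo's actual API): among several candidate corrections, the one closest to the reference by edit distance becomes `chosen` and the farthest becomes `rejected`.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def build_preference_pair(reference: str, candidates: list[str]) -> dict:
    """Rank candidates by edit distance to the reference: the closest
    becomes 'chosen', the farthest 'rejected'."""
    ranked = sorted(candidates, key=lambda c: levenshtein(c, reference))
    return {"chosen": ranked[0], "rejected": ranked[-1]}

pair = build_preference_pair(
    reference="She goes to school every day.",
    candidates=["She go to school every day.",
                "She goes to school every day"],
)
```

Pairs in this shape can then be fed to a DPO-style preference trainer.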
I transform the dense SmolLM-135M into a SmolMoE (Mixture of Experts) model.
- Upcycling: Initializing each expert's weights by duplicating the dense checkpoint's SwiGLU FFN.
- Router Training: Validated using Load Balancing Loss.
- Continued Pre-training: Training on the Cosmopedia dataset to stabilize the upcycled model.
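A minimal NumPy sketch of the upcycling step, under the assumption that the dense MLP is a SwiGLU block with gate/up/down projections (the `dense_ffn` and `upcycle` names are illustrative, not the repo's code). The key property is that every expert starts as an exact copy of the dense FFN, so with uniform routing the upcycled layer initially reproduces the dense layer's output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4

# Dense SwiGLU FFN weights (biasless, as in most modern SLMs).
dense_ffn = {
    "gate_proj": rng.standard_normal((d_ff, d_model)),
    "up_proj":   rng.standard_normal((d_ff, d_model)),
    "down_proj": rng.standard_normal((d_model, d_ff)),
}

def upcycle(ffn: dict, n_experts: int) -> list[dict]:
    """Initialize each MoE expert as a copy of the dense FFN."""
    return [{k: v.copy() for k, v in ffn.items()} for _ in range(n_experts)]

def swiglu(ffn: dict, x: np.ndarray) -> np.ndarray:
    """SwiGLU FFN: down( silu(gate @ x) * (up @ x) )."""
    g = ffn["gate_proj"] @ x
    silu = g / (1.0 + np.exp(-g))
    return ffn["down_proj"] @ (silu * (ffn["up_proj"] @ x))

experts = upcycle(dense_ffn, n_experts)
x = rng.standard_normal(d_model)
# With uniform routing, averaging identical experts equals the dense output.
moe_out = np.mean([swiglu(e, x) for e in experts], axis=0)
```

Starting from identical experts is what makes a load-balancing loss necessary afterwards: the router has no initial reason to prefer any expert, and without the auxiliary loss it can collapse onto one copy.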
I force specific experts to specialize in distinct domains (Code, Math, Chat) using architectural modifications and loss guidance.
- Dataset: Interleaved subsets from Llama-Nemotron-Post-Training-Dataset.
- Methodology:
  - Router Guidance Loss: KL Divergence loss forcing tokens from specific domains to specific experts.
  - Two-Stage Training:
    - Stage A: Freeze the model, warm-start the router with high guidance.
    - Stage B: Joint tuning of router and experts.
- Result: Achieved >90% routing accuracy for Math tokens to the Math Expert.
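The router guidance loss can be sketched as a KL divergence between the router's softmax distribution and a smoothed one-hot target that maps each domain to its designated expert. Everything here is illustrative: the domain-to-expert mapping, the smoothing value, and the function names are assumptions, not the repo's actual implementation.

```python
import numpy as np

DOMAIN_TO_EXPERT = {"code": 0, "math": 1, "chat": 2}  # assumed mapping
N_EXPERTS = 4

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def guidance_loss(router_logits: np.ndarray, domains: list[str],
                  smoothing: float = 0.05) -> float:
    """Mean KL(target || router) over tokens. The target puts 1 - smoothing
    on each token's designated expert, with the rest spread uniformly so the
    KL stays finite."""
    probs = softmax(router_logits)                        # (tokens, experts)
    targets = np.full_like(probs, smoothing / (N_EXPERTS - 1))
    for i, d in enumerate(domains):
        targets[i, DOMAIN_TO_EXPERT[d]] = 1.0 - smoothing
    kl = (targets * (np.log(targets) - np.log(probs))).sum(axis=-1)
    return float(kl.mean())

# A router already sending math -> expert 1 and code -> expert 0 gets a
# lower loss than one sending those tokens anywhere else.
logits = np.array([[0.0, 5.0, 0.0, 0.0],   # math token
                   [5.0, 0.0, 0.0, 0.0]])  # code token
low = guidance_loss(logits, ["math", "code"])
high = guidance_loss(-logits, ["math", "code"])
```

In Stage A this loss would dominate (model frozen, high guidance weight); in Stage B its weight can be annealed down while router and experts train jointly.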
This plot shows the stabilization of training metrics during continued pre-training on the Cosmopedia dataset.

This plot tracks the percentage of active experts over time, confirming the router successfully avoided collapse.

Routing Before Specialization
This visualization shows the initial, unspecialized distribution of tokens across experts.

Routing After Specialization
This visualization confirms the successful specialization of experts after targeted training.

Token-Level Routing Detail
A granular analysis demonstrating the model correctly routing individual tokens based on domain context.

- src/: Core model architectures (Dense & MoE), data processing, and custom trainers.
- scripts/: Executable Python scripts for running full training pipelines.
- notebooks/: Visualizations, routing heatmaps, and evaluation walkthroughs.
- viz/: Contains all visualization assets.
git clone https://github.com/yourusername/smol-experiments.git
cd smol-experiments
pip install -r requirements.txt