An educational platform for understanding transformer architecture through interactive, step-by-step visualizations.
This project helps you understand transformers - the AI technology behind ChatGPT, BERT, and modern language models. Instead of reading complex papers, you can:
- ✅ Visualize how transformers process text step-by-step
- ✅ Interact with real models trained from scratch
- ✅ Learn the math and intuition behind attention mechanisms
- ✅ Experiment with different inputs and see real-time results
Perfect for:
- 🎓 Students learning about transformers
- 👨‍💻 ML Engineers wanting to understand internals
- 🧑‍🏫 Teachers explaining transformers visually
- 🔬 Researchers prototyping transformer variants
Train and visualize a GPT-style model that predicts the next word in a sentence.
Example:
- Input: "I eat"
- Output: "vegetables" (52.9% confidence)
6-Step Visualization Pipeline:
- Tokenization - See how text becomes tokens
- Embeddings - Understand semantic meaning + position encoding
- Attention - Watch how words "attend" to each other
- Feedforward - See neural network transformations
- Softmax - Probability distribution over vocabulary
- Prediction - Final result with confidence scores
Each step shows:
- 📊 Visual representations (heatmaps, graphs, grids)
- 🧮 Mathematical formulas
- 💡 Educational annotations
- 🔍 Interactive exploration
Before you begin, make sure you have:
- ✅ Python 3.9+ (download from python.org)
- ✅ Node.js 16+ (download from nodejs.org)
Open a terminal and run:
```bash
# Navigate to backend directory
cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start backend server
python -m app.main
```

✅ Success! Backend is running at http://localhost:8000
🔍 Verify: Open http://localhost:8000/docs in your browser - you should see the API documentation.
Open a new terminal (keep backend running) and run:
```bash
# Navigate to backend (make sure venv is activated)
cd backend
venv\Scripts\activate          # Windows
# source venv/bin/activate     # macOS/Linux

# Train Mode 1 model (takes ~10-15 minutes on CPU)
python -m app.features.mode1_next_word.train --epochs 50

# Wait for training to complete...
# You'll see: "TRAINING COMPLETE!" when done
# Model saved to: backend/app/features/mode1_next_word/checkpoints/best_model.pt
```

What happens:
- Trains GPT-style model on sample corpus (1,449 lines)
- 50 epochs with 80/20 train/validation split
- Saves best model automatically
- Shows training progress and loss curves
✅ Success! Model trained and saved to checkpoints/best_model.pt
🔍 Verify: Check that backend/app/features/mode1_next_word/checkpoints/best_model.pt exists
💡 Tip: You only need to do this once. The trained model will be reused on backend restart.
Open a new terminal (keep backend running) and run:
```bash
# Navigate to frontend directory
cd frontend

# Install dependencies
npm install

# Start frontend server
npm run dev
```

✅ Success! Frontend is running at http://localhost:3000
🔍 Verify: Open http://localhost:3000 in your browser - you should see the landing page.
- Open http://localhost:3000 in your browser
- Click "Applications" → "Mode 1: Next Word Prediction"
- Enter text like "I eat" or "She likes"
- Click "Predict Next Word"
- Explore the 6 visualization steps!
"I eat" → vegetables, breakfast, rice
"She likes" → dancing, music, chocolate
"We go" → to, home, school
"The weather is" → wonderful, nice, cold
"I work as" → teacher, engineer, doctor
Step 1: Tokenization
- Shows how your text is split into tokens (words)
- Each token gets a unique ID number
- Color-coded for easy tracking
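As a rough sketch, word-level tokenization is just a lookup from each word to its vocabulary ID. The function and toy vocabulary below are illustrative only, not the project's actual API:

```python
def tokenize(text, vocab):
    """Split text on whitespace and map each word to its vocabulary ID."""
    tokens = text.lower().split()
    # Unknown words fall back to a special <unk> token
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# Toy vocabulary for illustration only
vocab = {"<unk>": 0, "i": 1, "eat": 2, "she": 3, "likes": 4}
print(tokenize("I eat", vocab))  # [1, 2]
```

Words the model never saw during training map to `<unk>`, which is why unusual inputs produce weaker predictions.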
Step 2: Embeddings + Positional Encoding
- Word Embeddings: Shows semantic meaning (similar words have similar patterns)
- Positional Encoding: Shows position information (sinusoidal waves)
- Final: Combined embedding used by the model
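The sinusoidal waves come from the positional encoding formula in "Attention Is All You Need": even dimensions get sin(pos / 10000^(2i/d)), odd dimensions get the matching cosine. A minimal pure-Python sketch (not the project's implementation):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Position 0 encodes as sin(0)=0 in even dims and cos(0)=1 in odd dims.
```

This matrix is simply added to the word embeddings, which is why the visualization shows the final embedding as a sum of the two.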
Step 3: Attention
- Q/K/V Projections: Shows how input is transformed
- Attention Weights: Which words are "looking at" which words
- Multi-Head: Model uses 4 different attention heads
- Interactive head selector to explore different attention patterns
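The attention weights shown in this step come from the standard scaled dot-product formula, softmax(QKᵀ/√d_k)·V. A minimal NumPy sketch of a single head (illustrative, not the project's code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))                     # 3 tokens, d_k = 4
out, weights = scaled_dot_product_attention(x, x, x)
# Each row of `weights` sums to 1: it says how much that token attends to every other token.
```

Multi-head attention repeats this with separate Q/K/V projections per head (4 of them here) and concatenates the results, which is what the head selector lets you compare.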
Step 4: Feedforward Network
- Shows dimension expansion (256 → 1024 → 256)
- ReLU activation function visualization
- Input/output comparison
Step 5: Softmax Output
- Probability distribution over all possible next words
- Top-10 predictions with confidence scores
- Bar chart visualization
Step 6: Prediction Result
- Final predicted word
- Confidence percentage
- Alternative predictions
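Steps 5 and 6 together amount to a softmax over the vocabulary followed by a top-k lookup. A sketch of the idea (the toy vocabulary and logits below are made up for illustration):

```python
import numpy as np

def predict_next_word(logits, id_to_word, k=3):
    """Turn raw logits into a probability distribution and take the top-k words."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    top = np.argsort(probs)[::-1][:k]
    return [(id_to_word[i], float(probs[i])) for i in top]

id_to_word = {0: "vegetables", 1: "breakfast", 2: "rice", 3: "to"}
logits = np.array([2.1, 1.3, 0.9, -0.5])
top3 = predict_next_word(logits, id_to_word)
# The highest-logit word ("vegetables") comes out on top; its probability is the confidence score.
```

The confidence percentage in Step 6 is just the softmax probability of the argmax word, and the "alternative predictions" are the remaining top-k entries.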
Want to train a custom model on your own text?
Create a text file with sentences (one per line):
backend/app/features/mode1_next_word/data/my_corpus.txt
Example content:

```
I love programming.
Python is a great language.
Transformers are powerful models.
...
```
Tips:
- Minimum: ~500 lines for decent results
- Recommended: 1,000-5,000 lines
- Use simple, clear sentences
- Mix different sentence structures
```bash
# Navigate to backend (with venv activated)
cd backend

# Train for 50 epochs
python -m app.features.mode1_next_word.train \
    --corpus app/features/mode1_next_word/data/my_corpus.txt \
    --epochs 50

# Training will take ~10-15 minutes on CPU
```

What happens:
- ✅ Tokenizes your corpus
- ✅ Builds vocabulary
- ✅ Trains for 50 epochs with validation
- ✅ Saves `best_model.pt` (lowest validation loss)
- ✅ Shows training progress and loss curves
The API automatically loads `best_model.pt` on startup. Just restart the backend:
```bash
# Stop backend (Ctrl+C)
# Start again
python -m app.main
```

Your custom model is now being used!
Training Options:
```bash
# More epochs (better quality, takes longer)
--epochs 100

# Larger model (more parameters, slower)
--d-model 512 --n-heads 8 --n-layers 6

# Custom learning rate
--lr 1e-3

# Use GPU (if available)
--device cuda
```

- Training Loss: how well the model fits the training data
- Validation Loss: how well the model generalizes to new data ⭐

- ✅ Best model: saved at the epoch with the lowest validation loss
- ❌ Don't use: the final-epoch model (may be overfit)
Example:
```
Epoch  6: train_loss=5.50, val_loss=4.69  ← BEST (saved as best_model.pt)
Epoch 50: train_loss=0.99, val_loss=5.38  ← Overfit (don't use)
```
Why is validation loss higher?
- It's measured on unseen data (realistic performance)
- Lower training loss doesn't mean better model!
- Validation loss is the true measure of quality
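In other words, checkpoint selection keys on validation loss, not training loss. A sketch of the idea (the real training script's internals may differ):

```python
def pick_best_epoch(history):
    """Return the epoch with the lowest *validation* loss.

    That checkpoint is the one worth keeping as best_model.pt, even if
    later epochs reach a lower training loss (they may be overfit).
    """
    return min(history, key=lambda h: h[2])[0]

# (epoch, train_loss, val_loss) tuples, numbers from the example above
history = [(6, 5.50, 4.69), (50, 0.99, 5.38)]
print(pick_best_epoch(history))  # 6
```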
See VALIDATION_COMPARISON.md for detailed explanation.
- Attention is All You Need - Original Transformer paper (Vaswani et al., 2017)
- The Illustrated Transformer - Visual guide by Jay Alammar
- Annotated Transformer - Harvard NLP code walkthrough
Unlike reading papers, this project lets you:
- ✅ See attention weights in real-time
- ✅ Experiment with different inputs
- ✅ Understand the math step-by-step
- ✅ Train your own models from scratch
Problem: ModuleNotFoundError or import errors
Solution:
```bash
# Make sure you're in backend/ directory
cd backend

# Make sure virtual environment is activated
# Windows: venv\Scripts\activate
# macOS/Linux: source venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt
```

Problem: "Network Error" or "Failed to fetch"
Solution:
- Check backend is running (http://localhost:8000/docs should work)
- Check no firewall blocking port 8000
- Try restarting both backend and frontend
Problem: Model predicts nonsense
Solution:
- ✅ Check that `best_model.pt` exists in `backend/app/features/mode1_next_word/checkpoints/`
- ✅ Train the model for more epochs (50-100)
- ✅ Expand the training corpus (more diverse sentences)
- ✅ Try different input phrases (the model learns from its training data)
Problem: Training takes too long
Solution:
- ✅ Use GPU: `--device cuda` (if available)
- ✅ Reduce model size: `--d-model 128 --n-layers 2`
- ✅ Reduce epochs: `--epochs 20` (quick test)
- ℹ️ Normal: ~10-15 minutes for 50 epochs on CPU
- Backend Guide - Backend setup, API, development
- Frontend Guide - Frontend setup, components, UI
- Mode 1 Complete Guide - Training, inference, API
- Developer Guide (CLAUDE.md) - For contributors and developers
- Project Plan - Original vision and technical decisions
✅ Mode 1: Next Word Prediction - Production ready
- GPT-style decoder-only transformer
- 6-step visualization pipeline
- Training from scratch
- Interactive exploration
🔜 Mode 2: Translation (Seq2Seq)
- Full encoder-decoder architecture
- Translate between languages
- Visualize encoder-decoder attention
🔜 Mode 3: Masked Language Modeling (BERT-style)
- Bidirectional attention
- Fill in the blanks
- Sentence understanding
🔜 Mode 4: Load Pre-trained Models
- Load GPT-2, BERT, etc.
- Visualize production models
- Compare architectures
- Dark mode
- Export visualizations (PNG/SVG)
- Animation playback controls
- Comparison mode (side-by-side inputs)
Contributions are welcome! This is an educational project focused on clarity and learning.
How to contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-visualization`)
- Make your changes
- Test thoroughly
- Submit a pull request
Guidelines:
- Maintain educational focus (clear explanations)
- Add comments explaining what, why, how
- Include visual examples if adding visualizations
- Update relevant documentation
See CLAUDE.md for developer guide.
MIT License - See LICENSE file for details.
- Vaswani et al. for "Attention is All You Need" (2017)
- Jay Alammar for "The Illustrated Transformer"
- Harvard NLP for "The Annotated Transformer"
- PyTorch and FastAPI communities
- All contributors and users of this project
Have questions or found a bug?
- 📖 Check the troubleshooting section above
- 📚 Read the documentation
- 🐛 Open an issue on GitHub
- 💬 Start a discussion for feature requests
Want to learn more about transformers?
- Start with Mode 1 and explore all 6 steps
- Try different input texts and observe patterns
- Train your own model on custom data
- Read the papers listed in Learning Resources
Built with educational clarity in mind - Helping people understand transformers through interactive visualization.
⭐ If this helped you understand transformers, please star the repo!
Last Updated: 2025-11-29 Status: ✅ Production Ready (Mode 1) Version: 1.0