This repository contains the code for a decoder-only transformer, similar to Llama or GPT. It was trained on an English corpus built from the seven Harry Potter books and has roughly 75M trainable parameters.
- Tokenization: Byte Pair Encoding via SentencePiece (see the tokenizer sketch below)
- FlashAttention and Grouped Query Attention (see the attention sketch below)
- Rotary Position Embeddings (see the RoPE sketch below)
- Key/Value Cache (see the caching sketch below)
- Sampling: top-p and top-k (see the sampling sketch below)
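
SentencePiece learns the 32,000-token BPE vocabulary directly from raw text. A minimal sketch of training and using such a tokenizer; the corpus path and model prefix are placeholders, not files from this repository:

```python
import sentencepiece as spm

# Train a 32k BPE vocabulary on the raw corpus (paths are placeholders).
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder corpus path
    model_prefix="tokenizer",  # writes tokenizer.model / tokenizer.vocab
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("Harry looked at the map.", out_type=int)  # token ids
text = sp.decode(ids)                                       # back to text
```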
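
With Grouped Query Attention, the 8 query heads share the 4 key/value heads, so each K/V head serves a group of 2 query heads. A minimal sketch under that assumption, using PyTorch's `scaled_dot_product_attention`, which can dispatch to a FlashAttention kernel; this is illustrative, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_rep = q.size(1) // k.size(1)         # 8 query heads / 4 kv heads = 2
    k = k.repeat_interleave(n_rep, dim=1)  # share each kv head across its group
    v = v.repeat_interleave(n_rep, dim=1)
    # is_causal=True applies the autoregressive mask; SDPA may use FlashAttention.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```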
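
Rotary Position Embeddings encode position by rotating each pair of query/key channels by an angle that grows with position and is scaled by RoPE theta (10,000 here). A Llama-style sketch, with function names chosen for illustration rather than taken from this repository:

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # One complex rotation per channel pair, per position.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)             # (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)   # complex rotations

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); rotate adjacent channel pairs.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs[None, :, None, :]
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)
```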
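
During generation, the key/value cache stores the K and V tensors of all previous tokens so each decoding step only has to project the newest token. A bare-bones sketch; the class name and tensor layout are assumptions, not this repository's API:

```python
import torch

class KVCache:
    def __init__(self, batch, n_kv_heads, max_seq_len, head_dim, dtype=torch.float32):
        self.k = torch.zeros(batch, n_kv_heads, max_seq_len, head_dim, dtype=dtype)
        self.v = torch.zeros(batch, n_kv_heads, max_seq_len, head_dim, dtype=dtype)
        self.len = 0

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, n_kv_heads, new_tokens, head_dim)
        n = k_new.size(2)
        self.k[:, :, self.len:self.len + n] = k_new
        self.v[:, :, self.len:self.len + n] = v_new
        self.len += n
        # Return the full cached prefix for attention.
        return self.k[:, :, :self.len], self.v[:, :, :self.len]
```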
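
Top-k sampling keeps only the k most likely tokens, while top-p (nucleus) sampling keeps the smallest prefix of the sorted distribution whose cumulative probability exceeds p. A sketch of how the two filters might be combined on the last position's logits; the defaults are illustrative, not the repository's:

```python
import torch

def sample(logits, top_k=50, top_p=0.9, temperature=1.0):
    # logits: (batch, vocab_size) for the last generated position.
    logits = logits / temperature
    # Top-k: mask everything below the k-th largest logit.
    if top_k is not None:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: drop the low-probability tail beyond cumulative mass p.
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        sorted_probs = torch.softmax(sorted_logits, dim=-1)
        cum = sorted_probs.cumsum(dim=-1)
        # Remove a token once the mass *before* it already exceeds top_p,
        # so the first token crossing the threshold is still kept.
        remove = (cum - sorted_probs) > top_p
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1) next-token ids
```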
Model configuration:

| Parameter | Value |
|---|---|
| Layers | 4 |
| Model Dimension | 768 |
| Context Length (tokens) | 1024 |
| Attention Heads | 8 |
| Key/Value Heads | 4 |
| Vocabulary Size | 32000 |
| RoPE Theta | 10000 |
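
Expressed as a configuration object, these hyperparameters might look roughly like the following; the field names are illustrative, not necessarily those used in the repository:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int = 4
    d_model: int = 768
    context_length: int = 1024
    n_heads: int = 8
    n_kv_heads: int = 4
    vocab_size: int = 32000
    rope_theta: float = 10000.0
```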
TODO
- [x] Grouped Query Attention
- [x] Rotary Position Embeddings
- [x] Key/Value Cache
- [ ] Distributed training
- [ ] Finetuning with (Q)LoRA
- [ ] Add Mixture of Experts model