- Mayasah Lami (22401352)
- Alireza Dastmalchi Saei (22404076)
- Utku Boran Torun (21901898)
- Mahyar Fardinfar (22501404)
Code completion model using a transformer architecture, trained on Python code from the py150 dataset.
This project implements a transformer-based code completion model for Python. The model uses a causal self-attention mechanism to predict the next token in a sequence of code tokens.
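For intuition, here is a minimal sketch (not taken from model.py) of how a causal mask restricts each position to earlier tokens, which is what makes next-token prediction possible:
import math
import torch

# Toy causal self-attention for a single head: position i may only attend to
# positions <= i, so the model can be trained to predict the next token.
seq_len, d_head = 6, 64
q = torch.randn(seq_len, d_head)
k = torch.randn(seq_len, d_head)
v = torch.randn(seq_len, d_head)

scores = q @ k.T / math.sqrt(d_head)                      # [seq_len, seq_len]
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future positions
weights = scores.softmax(dim=-1)
output = weights @ v                                      # [seq_len, d_head]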
cs559_code_completion/
├── model.py # Transformer model architecture
├── preprocess.py # Data preprocessing script (tokenization)
├── create_completion_datasets.py # Create token/line-level completion datasets
├── train.py # Original training script (token + line level)
├── train_v2.py # Recommended training script (cleaner, with gradient accumulation)
├── evaluate.py # Evaluation script
├── inference.py # Inference script
├── PARAMETER_GUIDE.md # Guide for choosing training parameters
├── EXPERIMENT_GUIDE.md # Guide for systematic hyperparameter experiments
├── requirements.txt # Python dependencies
├── download_and_extract.sh # Script to download py150 dataset
├── literals.json # Common string/number literals for tokenization
├── py150_files/ # Dataset directory (Python source files)
│ ├── data/ # Python source files
│ ├── python100k_train.txt # Training file paths
│ ├── python50k_eval.txt # Evaluation file paths
│ └── ...
├── token_completion/ # Preprocessed tokenized datasets
│ ├── train.txt # 95,000 training examples
│ ├── dev.txt # 5,000 development examples
│ └── test.txt # 50,000 test examples
├── completion_datasets/ # Code completion datasets (generated)
│ ├── token_level/ # Next token prediction datasets
│ └── line_level/ # Line completion datasets
└── runs/ # Training run directories (generated)
  └── run_*/ # Individual run directories named by parameters
    ├── best_model_*.pt # Model checkpoint
    ├── vocab.json # Vocabulary file
    ├── training_params.json # Training parameters
    └── test_results.json # Evaluation results (after running evaluate.py)
- Install dependencies:
pip install -r requirements.txt
Or install individually:
pip install torch tqdm numpy
Note: For GPU support, install PyTorch with CUDA:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Then install other dependencies:
pip install tqdm numpy
- Download and extract the dataset:
bash download_and_extract.sh
- Preprocess the data:
python preprocess.py --base_dir py150_files --output_dir token_completion
- Create completion datasets:
python create_completion_datasets.py \
--input_dir token_completion \
--output_dir completion_datasets
The model (model.py) implements:
- CodeCompletionTransformer: A GPT-style decoder-only transformer
- Configuration:
- Vocabulary size: 32,000
- Model dimension: 512
- Number of layers: 6
- Number of attention heads: 8
- Feed-forward dimension: 2,048
- Maximum sequence length: 256
- Dropout: 0.1
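As a quick sanity check, the following sketch instantiates the model with this configuration. The ModelConfig attribute names are assumed to mirror the CLI flags; verify them against model.py before relying on this.
import torch
from model import CodeCompletionTransformer, ModelConfig

# Assumed attribute names (d_model, n_layer, n_head, d_ff, max_length, dropout);
# check model.py for the real field names.
config = ModelConfig()
config.vocab_size = 32000
config.d_model = 512
config.n_layer = 6
config.n_head = 8
config.d_ff = 2048
config.max_length = 256
config.dropout = 0.1

model = CodeCompletionTransformer(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

# Forward pass: a batch of token indices in, next-token logits out.
dummy_ids = torch.randint(0, config.vocab_size, (1, 16))
logits = model(dummy_ids)  # expected shape: [1, 16, vocab_size]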
The preprocess.py script:
- Tokenizes Python source files using Python's tokenize module
- Replaces string and number literals with special tokens (<STR_LIT>, <NUM_LIT>)
- Adds <EOL> markers for line breaks
- Splits data into train (95k), dev (5k), and test (50k) sets
- Outputs tokenized sequences with <s> and </s> markers
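A simplified sketch of this tokenization step (the real preprocess.py also keeps frequent literals listed in literals.json, which is omitted here):
import io
import tokenize

def tokenize_source(source: str) -> str:
    """Tokenize Python source, replacing literals and line breaks with special tokens."""
    out = ["<s>"]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            out.append("<STR_LIT>")
        elif tok.type == tokenize.NUMBER:
            out.append("<NUM_LIT>")
        elif tok.type in (tokenize.NEWLINE, tokenize.NL):
            out.append("<EOL>")
        elif tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT):
            continue  # structural tokens are dropped in this simplified version
        else:
            out.append(tok.string)
    out.append("</s>")
    return " ".join(out)

print(tokenize_source('greeting = "hello"\nx = 42\n'))
# <s> greeting = <STR_LIT> <EOL> x = <NUM_LIT> <EOL> </s>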
The create_completion_datasets.py script creates task-specific datasets:
- Token-level: Next token prediction (context → target token)
- Line-level: Line completion (previous lines + prefix → suffix)
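For illustration, the records look roughly like the following. The JSON field names here are an assumption; inspect the generated files under completion_datasets/ for the actual schema.
# Hypothetical token-level record: a tokenized context prefix and the single next token.
token_level_example = {
    "context": "def add ( a , b ) : <EOL> return a",
    "target": "+",
}

# Hypothetical line-level record: previous lines plus a line prefix, mapped to the rest of the line.
line_level_example = {
    "context": "def add ( a , b ) : <EOL> return",
    "target": "a + b <EOL>",
}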
train_v2.py is the recommended training script with improved features:
- Gradient accumulation support (reduces GPU memory usage)
- Configurable learning rate and weight decay
- Validation accuracy reporting
- Better code organization
- Reproducibility (random seed support)
- Create completion datasets (if not already done):
python create_completion_datasets.py \
--input_dir token_completion \
--output_dir completion_datasets
- Train token-level model (recommended):
python train_v2.py \
--task token \
--batch_size 32 \
--vocab_min_freq 50 \
--num_epochs 15 \
--max_length 256 \
--max_train_examples 1000000 \
--learning_rate 1e-4 \
--early_stopping_patience 3 \
--weight_decay 0.01 \
--max_val_examples 500000 \
--d_model 384 \
--n_layer 4 \
--n_head 6 \
--d_ff 1536 \
--dropout 0.2 \
--device cuda
- Train line-level model:
python train_v2.py \
--task line \
--vocab_min_freq 50 \
--batch_size 32 \
--num_epochs 15 \
--max_length 256 \
--max_train_examples 2000000 \
--max_val_examples 100000 \
--device cuda \
--d_model 384 \
--n_layer 4 \
--n_head 6 \
--d_ff 1536 \
--dropout 0.2
- With gradient accumulation (if GPU memory is limited):
python train_v2.py \
--task token \
--batch_size 16 \
--accumulation_steps 4 \
--num_epochs 15 \
--max_length 256 \
--device cuda
This simulates an effective batch size of 64 (16 × 4) while using less GPU memory.
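A minimal sketch of the gradient accumulation idea (not the exact train_v2.py loop): gradients from several small micro-batches are summed before a single optimizer step, so peak memory scales with the micro-batch while the update behaves like a larger batch.
import torch
import torch.nn as nn

# Toy model standing in for the transformer; 4 micro-batches of 16 are
# accumulated before one optimizer step, i.e. an effective batch of 64.
vocab_size, d_model, accumulation_steps = 100, 32, 4
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

optimizer.zero_grad()
for step in range(8):  # stand-in for iterating over a DataLoader
    input_ids = torch.randint(0, vocab_size, (16, 64))   # micro-batch of 16 sequences
    targets = torch.randint(0, vocab_size, (16, 64))     # next-token labels
    logits = model(input_ids)                            # [batch, seq, vocab]
    loss = criterion(logits.view(-1, vocab_size), targets.view(-1))
    (loss / accumulation_steps).backward()               # scale so gradients average over 64 examples
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                 # one update per 4 micro-batches
        optimizer.zero_grad()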
Data arguments:
- --task: token or line (default: token)
- --dataset_dir: Directory with completion datasets (default: completion_datasets)
- --tokenized_dir: Directory with tokenized files for vocabulary (default: token_completion)
Training hyperparameters:
- --batch_size: Batch size (default: 32)
- --num_epochs: Number of training epochs (default: 10)
- --max_length: Maximum sequence length (default: 256)
- --learning_rate: Learning rate (default: 1e-4)
- --weight_decay: Weight decay / L2 regularization (default: 0.01)
- --accumulation_steps: Gradient accumulation steps (default: 1; use >1 to reduce GPU memory)
Model architecture (optional overrides):
- --d_model: Transformer width / embedding dimension
- --n_layer: Number of transformer blocks (depth)
- --n_head: Number of attention heads (must divide d_model)
- --d_ff: Feed-forward (MLP) hidden dimension
- --dropout: Dropout probability
Vocabulary arguments:
- --vocab_min_freq: Minimum token frequency for vocabulary (default: 10; higher = smaller vocab)
- --vocab_sample_lines: Sample N lines for vocabulary building (default: 50000)
Data loading:
- --max_train_examples: Limit the number of training examples (None = all)
- --max_val_examples: Limit the number of validation examples (default: 10000)
- --lazy_load: Use lazy loading for datasets (saves memory)
- --num_workers: Number of data loading workers (default: 4)
System:
- --device: cuda or cpu (auto-detected)
- --seed: Random seed for reproducibility (default: 42)
The original train.py script is also available and supports both token and line-level training. See train.py --help for options.
For guidance on choosing parameters (vocab size, batch size, etc.), see PARAMETER_GUIDE.md.
For systematic hyperparameter experiments (learning rate, dropout, etc.), see EXPERIMENT_GUIDE.md.
Training creates a run directory in runs/ named after the training parameters:
- train.py format: run_{task}_bs{batch_size}_ep{epochs}_len{max_length}_vocab{min_freq}_{timestamp}
- train_v2.py format: run_{task}_v2_bs{batch_size}_ep{epochs}_len{max_length}_vocab{min_freq}_{timestamp}
- Example: runs/run_token_v2_bs32_ep15_len256_vocab25_20240101_120000/
Each run directory contains:
- best_model.pt (train_v2.py) or best_model_token_level.pt / best_model_line_level.pt (train.py) - Best model checkpoint
- vocab.json - Vocabulary mapping (tokens ↔ indices)
- training_params.json - All training parameters used for this run
- test_results.json - Evaluation results (created after running evaluate.py)
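A short sketch of inspecting a run's vocab.json; the token_to_idx / idx_to_token keys match what the programmatic usage example further below loads.
import json

# Load the vocabulary saved alongside a trained model (example run directory).
run_dir = "runs/run_token_v2_bs32_ep15_len256_vocab25_20240101_120000"
with open(f"{run_dir}/vocab.json") as f:
    vocab_data = json.load(f)

print(len(vocab_data["token_to_idx"]), "tokens in the vocabulary")
print("index of <EOL>:", vocab_data["token_to_idx"].get("<EOL>"))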
Predict the next token:
python inference.py \
--model_path runs/run_token_v2_bs16_ep15_len256_vocab35_train200000_acc4_20251216_123055/best_model_token_level.pt \
--vocab_path runs/run_token_v2_bs16_ep15_len256_vocab35_train200000_acc4_20251216_123055/vocab.json \
--task token \
--context "from bootstrap import" \
--top_k 5
Complete a line:
python inference.py \
--model_path runs/run_line_bs32_ep10_len256_vocab10_20240101_120000/best_model_line_level.pt \
--vocab_path runs/run_line_bs32_ep10_len256_vocab10_20240101_120000/vocab.json \
--task line \
--context "def hello(name):" \
--device cuda
Evaluate a trained model on the test set. The script automatically detects the vocabulary and training parameters from the run directory:
# Simple evaluation (auto-detects vocab and max_length from run directory)
python evaluate.py \
--model_path runs/runs/run_4000000 \
--max_test_examples 200000 \
--num_workers 0 \
--task token \
--device cuda
# Limit test examples for faster evaluation
python evaluate.py \
--model_path runs/run_line_v2_dm348_ly4_hd6_ff1392_do30_lr0.0003_wd0.01_bs16_ep15_len256_vocab50_train1000000_acc4/checkpoint_epoch_2.pt \
--vocab_path runs/run_line_v2_dm348_ly4_hd6_ff1392_do30_lr0.0003_wd0.01_bs16_ep15_len256_vocab50_train1000000_acc4/vocab.json \
--task line \
--max_test_examples 50000 \
--num_workers 0 \
--device cuda
Evaluation arguments:
- --model_path: Path to trained model checkpoint (required)
- --vocab_path: Path to vocabulary file (auto-detected from the model directory if not specified)
- --task: token or line (default: token)
- --dataset_dir: Directory containing test datasets (default: completion_datasets)
- --max_length: Maximum sequence length (auto-detected from training_params.json if available)
- --batch_size: Batch size for evaluation (default: 32)
- --max_test_examples: Limit the number of test examples (None = all; recommended for large test sets)
- --num_workers: Number of data loading workers (default: 4; use 0 if experiencing hangs)
- --lazy_load: Use lazy loading (default: True; saves memory)
- --device: cuda or cpu (auto-detected; falls back to CPU if CUDA is unavailable)
The evaluation script automatically:
- Detects the vocabulary from the same directory as the model (if vocab.json exists there)
- Loads training parameters from training_params.json to set max_length
- Saves results to the same run directory as the model
This means you typically only need to specify --model_path and --task!
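Roughly, the auto-detection behaves like the following sketch (an approximation, not the exact evaluate.py logic):
import json
from pathlib import Path

# Given --model_path, look for vocab.json and training_params.json next to it.
model_path = Path("runs/run_token_v2_bs32_ep15_len256_vocab25_20240101_120000/best_model.pt")
run_dir = model_path.parent

vocab_path = run_dir / "vocab.json"            # used unless --vocab_path is given
params_path = run_dir / "training_params.json"

max_length = 256  # fallback if training_params.json is missing
if params_path.exists():
    max_length = json.loads(params_path.read_text()).get("max_length", max_length)

print("vocab:", vocab_path if vocab_path.exists() else "pass --vocab_path explicitly")
print("max_length:", max_length)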
- Tokenized files (token_completion/*.txt)
  - Already tokenized with special tokens (<EOL>, <STR_LIT>, etc.)
- Completion datasets (completion_datasets/*/)
  - JSONL format with context/target pairs
  - Created by create_completion_datasets.py
- Training (train.py or train_v2.py)
  - Builds vocabulary from tokenized files
  - Converts tokens to indices
  - Trains the model with a PyTorch DataLoader
  - Creates a run directory with the model, vocabulary, and training parameters
  - Supports gradient accumulation (train_v2.py) for memory efficiency
- Evaluation (evaluate.py)
  - Auto-detects vocabulary and training parameters from the run directory
  - Loads the trained model
  - Evaluates on the test set
  - Saves results to the run directory
- Inference (inference.py)
  - Auto-detects vocabulary and training parameters from the run directory
  - Loads the trained model
  - Converts input tokens to indices
  - Generates predictions
from model import CodeCompletionTransformer, ModelConfig
from train import Vocabulary
import torch
import json
# Load vocabulary from run directory
run_dir = 'runs/run_token_bs32_ep10_len256_vocab10_20240101_120000'
vocab = Vocabulary()
with open(f'{run_dir}/vocab.json', 'r') as f:
vocab_data = json.load(f)
vocab.token_to_idx = vocab_data['token_to_idx']
vocab.idx_to_token = {int(k): v for k, v in vocab_data['idx_to_token'].items()}
# Create model
config = ModelConfig()
config.vocab_size = len(vocab.token_to_idx)
model = CodeCompletionTransformer(config)
model.load_state_dict(torch.load(f'{run_dir}/best_model_token_level.pt'))
model.eval()
# Predict
context = "from bootstrap import".split()
context_ids = vocab.encode(context, max_length=256, pad=True)
input_ids = torch.tensor([context_ids])
with torch.no_grad():
logits = model(input_ids)
next_token_logits = logits[0, -1, :]
predicted_token_idx = torch.argmax(next_token_logits).item()
predicted_token = vocab.idx_to_token[predicted_token_idx]
print(f"Next token: {predicted_token}")If training starts but gets killed due to memory issues:
- Reduce batch size:
  python train.py --batch_size 8  # or even 4
- Use lazy loading (enabled by default):
  python train.py --lazy_load  # Already default
- Sample vocabulary building:
  python train.py --vocab_sample_lines 100000  # Only use 100k lines for vocab
- Reduce sequence length:
  python train.py --max_length 128  # Instead of 256
- Process a smaller dataset first:
  - Test with a subset of the data to verify it works
  - Use --limit in create_completion_datasets.py to create smaller datasets
- Out of memory:
  - Reduce --batch_size or --max_length
  - Use gradient accumulation: --accumulation_steps 4 (train_v2.py)
  - Use --vocab_min_freq 50 or higher to reduce vocabulary size
- Slow training:
  - Use GPU (--device cuda)
  - Reduce batch size or use gradient accumulation
  - Use --lazy_load to save memory
- Poor predictions:
  - Train longer (more epochs)
  - Check data quality
  - Adjust learning rate (see EXPERIMENT_GUIDE.md)
  - Try different vocabulary sizes (see PARAMETER_GUIDE.md)
- Process killed during vocabulary building:
  - Use --vocab_sample_lines to limit vocabulary building
- Evaluation hangs or is slow:
  - Use --num_workers 0 to disable multiprocessing
  - Use --max_test_examples to limit the test set size
  - Use --lazy_load (default: True) to avoid loading all examples into memory
- Vocabulary mismatch errors:
  - Always use the vocab.json from the same run directory as the model
  - The evaluation script auto-detects this, but you can specify --vocab_path explicitly
- PARAMETER_GUIDE.md: Comprehensive guide for choosing training parameters (vocab size, batch size, etc.)
- EXPERIMENT_GUIDE.md: Guide for systematic hyperparameter experiments (learning rate, dropout, regularization, etc.)
- diagnose_accuracy_issue.py: Diagnostic tool to identify vocabulary mismatches and configuration issues
- Mahyar
  - Creating the model architecture
  - Testing the model architecture
- Mayasah
  - Data loading
  - Data preprocessing