Speculative Sampling Playground

This project provides:

  • Exact speculative sampling (SpS).
  • AutoJudge (a lossy, judge-decoding-style method trained on synthetic labels; no manual annotation needed).
  • SpecExec (exact target sampling with draft-branch cache prefill and pruning).
  • A Hugging Face adapter with KV cache and optional quantization.
  • A benchmark harness on MT-Bench with JSONL metrics.
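
For reference, the token-level accept/resample rule that makes speculative sampling exact (distribution-preserving) is sketched below. This is the standard rule from the speculative sampling literature, shown for illustration only; it is not the repository's implementation.

# Minimal sketch of the standard speculative-sampling accept/resample step;
# illustrative only, not this repository's code.
import torch

def accept_or_resample(p: torch.Tensor, q: torch.Tensor, drafted: int) -> tuple[int, bool]:
    """p / q: target / draft probability vectors over the vocabulary; drafted: token id sampled from q."""
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if torch.rand(()) < torch.clamp(p[drafted] / q[drafted], max=1.0):
        return drafted, True
    # On rejection, resample from the residual distribution norm(max(p - q, 0)).
    # This keeps the overall output distribution exactly equal to p.
    # (The residual can only be all-zero when p == q, in which case the drafted
    # token is always accepted and this branch is never reached.)
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False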

Features

  1. Baseline, speculative, AutoJudge, and SpecExec decoding in one benchmark entrypoint.
  2. MT-Bench loader (JSON/JSONL).
  3. Benchmark runner with median timing, resume support, and method-specific metrics.
  4. Preset configs for models, methods, and paired experiments.
  5. Makefile shortcuts for local and Docker workflows.
  6. Docker support for CPU and GPU.
  7. CI pipeline (GitHub Actions) for checks/tests + benchmark JSONL schema validation.

Getting Started (From Zero)

  1. Bootstrap dependencies on a clean Ubuntu host (safe mode, does not touch NVIDIA driver):
bash scripts/install_dependencies.sh

The recommended Python version is 3.11 (see .python-version in the repo). Dependencies are pinned in requirements*.txt for reproducible runs. For the GPU Python extras (bitsandbytes, accelerate):

bash scripts/install_dependencies.sh --gpu

On an end-of-life (EOL) Ubuntu release (for example Ubuntu 17), the script stops by default. Continue only if you explicitly accept the risks:

bash scripts/install_dependencies.sh --allow-eol-ubuntu
  2. Install Docker. For GPU runs, keep your existing NVIDIA driver and install the NVIDIA Container Toolkit only.
  3. Put the MT-Bench dataset file (JSON/JSONL) into the project's datasets/ folder, for example datasets/mt_bench.jsonl.
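If you only need a smoke test without the real dataset, the sketch below writes a toy file in the standard MT-Bench question format (question_id, category, turns); the exact fields the repository's loader expects are an assumption here and may differ.
# make_toy_dataset.py -- minimal sketch; field names follow the public MT-Bench
# question format and are an assumption about what the repository's loader expects.
import json
from pathlib import Path

records = [
    {"question_id": 1, "category": "writing",
     "turns": ["Write a short paragraph about speculative decoding."]},
    {"question_id": 2, "category": "math",
     "turns": ["What is 17 * 24? Show your reasoning."]},
]
out = Path("datasets/mt_bench.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
print(f"wrote {len(records)} records to {out}")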
  4. Build a CPU image:
docker build -t sp-samp .
  5. Run tests (CPU):
docker run --rm sp-samp
  6. Run a CPU benchmark (toy models):
docker run --rm sp-samp \
  python -m benchmarks.bench_speculative \
  --method both \
  --runs 1 \
  --max-samples 5 \
  --max-new-tokens 32 \
  --vocab-size 2048
  7. Build a GPU image (CUDA example):
docker build -f Dockerfile.gpu \
  --build-arg BASE_IMAGE=nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 \
  --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu124 \
  --build-arg TORCH_VERSION=2.5.1 \
  -t sp-samp-gpu .
  8. Run a GPU benchmark (HF model, results saved to JSONL):
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model RedHatAI/gpt-oss-20b \
  --quant 4bit \
  --bnb-compute-dtype bfloat16 \
  --device cuda \
  --use-chat-template \
  --max-samples 50 \
  --max-new-tokens 128 \
  --k 4 \
  --runs 5 \
  --out /data/results.jsonl
  9. Run all methods in one launch (baseline + speculative + autojudge + specexec):
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
  --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
  --device cuda \
  --use-chat-template \
  --method all \
  --k 4 \
  --runs 5 \
  --out /data/results_all.jsonl
  10. Run SpecExec only (branch execution parameters included):
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m sp_samp.cli specexec \
  --config-dir configs \
  --experiment llama3_target_llama3_specexec_k4 \
  --dataset /data/mt_bench.jsonl \
  --parallel-branches 8 \
  --branch-prune-threshold 0.0 \
  --out /data/results_specexec.jsonl
  11. Run AutoJudge only with checkpoint reuse:
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
  --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
  --device cuda \
  --use-chat-template \
  --method autojudge \
  --autojudge-train-samples 4000 \
  --autojudge-train-steps 400 \
  --autojudge-threshold 0.5 \
  --autojudge-checkpoint /data/autojudge_llama3.pt \
  --out /data/results_autojudge.jsonl

Make Targets

Defaults for benchmark paths:

  • DATASET=datasets/mt_bench.jsonl
  • OUT=datasets/results.jsonl
  1. Show all commands:
make help
  2. Install/upgrade dependencies in safe mode:
make setup
  3. Install/upgrade including GPU Python extras:
make setup-gpu
  4. Syntax check:
make check
  5. Validate benchmark JSONL schema:
make validate-results RESULTS=datasets/results.jsonl
  6. List presets:
make list-presets
  7. Validate config logic:
make validate-configs
  8. Quick toy benchmark (no HF models):
make bench-toy OUT=/tmp/bench_toy.jsonl
  9. Quick HF smoke run (needs torch + transformers, downloads a tiny model):
make smoke-hf OUT=/tmp/smoke_hf.jsonl
  10. Run experiment on MT-Bench:
make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl
  11. Run AutoJudge preset:
make autojudge DATASET=datasets/mt_bench.jsonl OUT=datasets/results_autojudge.jsonl
  12. Run SpecExec preset:
make specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl
  13. Build and run GPU Docker flow:
make docker-build-gpu
make docker-bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl
make docker-specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl
  14. Enforce headless GPU mode for long runs:
make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl HEADLESS=1

Presets

  • Models: configs/models.json
  • Methods: configs/methods.json
  • Experiments (target/draft pairings): configs/experiments.json
  • Method templates (AutoJudge/SpecExec): configs/method_templates.json
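
To peek at what each file contains without running the CLI, a minimal inspection sketch is below; it only assumes the files are valid JSON (how presets are keyed inside each file is an assumption, so prefer make list-presets for the authoritative view):

# inspect_configs.py -- minimal sketch; assumes each config file is valid JSON.
# Prefer `make list-presets` / `python -m sp_samp.cli list-presets` for the
# authoritative preset listing.
import json
from pathlib import Path

for name in ("models.json", "methods.json", "experiments.json", "method_templates.json"):
    data = json.loads((Path("configs") / name).read_text(encoding="utf-8"))
    top_level = list(data) if isinstance(data, dict) else [f"<list of {len(data)} items>"]
    print(f"{name}: {', '.join(map(str, top_level))}")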

CLI Runner

  1. List presets:
python -m sp_samp.cli list-presets --config-dir configs
  2. Direct method selection:
python -m benchmarks.bench_speculative \
  --method specexec \
  --dataset datasets/mt_bench.jsonl \
  --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
  --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
  --parallel-branches 8 \
  --branch-prune-threshold 0.0
  3. Run benchmark using presets:
python -m sp_samp.cli bench \
  --config-dir configs \
  --model-preset gpt_oss_20b_4bit \
  --method-preset speculative_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results.jsonl
  4. Run benchmark using an experiment preset:
python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment llama3_all_methods \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results.jsonl
  5. Run AutoJudge shortcut command:
python -m sp_samp.cli autojudge \
  --config-dir configs \
  --experiment llama3_target_llama3_autojudge_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results_autojudge.jsonl
  6. Run SpecExec shortcut command:
python -m sp_samp.cli specexec \
  --config-dir configs \
  --experiment llama3_target_llama3_specexec_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results_specexec.jsonl
  7. Require headless GPU mode (fail-fast when a display is active):
python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment llama3_all_methods \
  --dataset datasets/mt_bench.jsonl \
  --require-headless \
  --out datasets/results.jsonl

Metrics Output

The benchmark writes JSONL records with per-run metrics and a summary record per method. Fields include:

  • status (ok/error/skipped)
  • resume_key (used to skip completed runs on re-launch)
  • tokens_per_sec
  • acceptance_rate
  • avg_tokens_per_step
  • proposed, accepted, rejections
  • judge_accept_rate (AutoJudge only)
  • target_fallback_rate (AutoJudge only)
  • autojudge_train_samples, autojudge_train_loss (AutoJudge only)
  • branch_prune_rate (SpecExec only)
  • effective_parallelism (SpecExec only)
  • target_calls_per_token (AutoJudge and SpecExec)
  • draft_calls_per_token (SpecExec only)
  • cache_hit_rate (SpecExec only)
  • max_active_branches (SpecExec only)
  • error_type, error_message, traceback (for failed runs)
  • System metadata: git_sha, hostname, gpu_name, gpu_driver, cuda_runtime, torch_version, transformers_version, display_active
Validate the output schema with:
python scripts/validate_results_jsonl.py --path datasets/results.jsonl --strict
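
For quick analysis, the sketch below aggregates per-run records using the status and tokens_per_sec fields listed above; it assumes each record also carries a method identifier field (the name "method" is an assumption here, adjust it to the actual schema):

# summarize_results.py -- minimal sketch; the "method" field name is an assumption,
# adjust it to whatever identifier your results.jsonl actually carries.
import json
import statistics
from collections import defaultdict

per_method = defaultdict(list)
with open("datasets/results.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("status") == "ok" and "tokens_per_sec" in rec:
            per_method[rec.get("method", "unknown")].append(rec["tokens_per_sec"])

for method, values in sorted(per_method.items()):
    print(f"{method}: median {statistics.median(values):.1f} tok/s over {len(values)} runs")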

Project Layout

  • sp_samp/: core library, AutoJudge, SpecExec, HF adapter.
  • benchmarks/: benchmark runner.
  • configs/: preset configs.
  • tests/: tests.
  • Dockerfile, Dockerfile.gpu: containers.

Notes

  • SpecExec is implemented as an exact (distribution-preserving) decoder with speculative cache prefill and configurable parallel_branches / branch_prune_threshold.
  • HF SpecExec uses KV-cache reuse and depth-wise tree passes for faster cache construction.
  • Draft and target models must share an identical tokenizer vocabulary mapping for speculative, AutoJudge, and SpecExec correctness (a quick check is sketched after this list).
  • scripts/install_dependencies.sh is idempotent for apt/pip dependencies and never modifies NVIDIA driver packages.
  • scripts/install_dependencies.sh prefers python3.11 when available and warns when running with lower versions.
  • Ubuntu 17 is EOL. The script blocks by default on EOL Ubuntu releases unless --allow-eol-ubuntu is set explicitly.
  • Re-running a benchmark with the same OUT file automatically skips completed runs (resume mode).
  • Failed runs are written to JSONL and do not stop the whole benchmark method loop.
  • make validate-configs checks config references and tokenizer compatibility in configs/*.json.
  • make validate-results enforces JSONL schema compatibility for downstream analysis.
  • The current default experiments are correctness-safe (same tokenizer family) and may show limited speedup when the target and draft models are identical.
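
A quick way to sanity-check tokenizer compatibility before a long run, sketched with the Hugging Face transformers API (model names below are placeholders):

# check_tokenizers.py -- minimal sketch; model names are placeholders.
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
draft_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Speculative acceptance compares token-level probabilities, so both models must
# agree on the full token -> id mapping, not just on vocabulary size.
if target_tok.get_vocab() == draft_tok.get_vocab():
    print("tokenizer vocabularies match")
else:
    print("WARNING: vocabularies differ; speculative/AutoJudge/SpecExec results will be invalid")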
