This project provides:
- Exact speculative sampling (SpS).
- AutoJudge (lossy judge-decoding style method with synthetic labels, no manual annotation).
- SpecExec (exact target sampling with draft-branch cache prefill and pruning).
- A Hugging Face adapter with KV cache and optional quantization.
- A benchmark harness on MT-Bench with JSONL metrics.
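For orientation, this is the standard accept/reject rule that keeps speculative sampling exact. The snippet below is a minimal sketch for a single draft token; tensor shapes, batching, and naming in the actual library will differ:

```python
import torch

def accept_or_resample(p_target: torch.Tensor,
                       p_draft: torch.Tensor,
                       draft_token: int) -> tuple[bool, int]:
    """Standard speculative-sampling step for one proposed token (sketch).

    p_target and p_draft are probability vectors over the vocabulary at the
    same position. The returned sample is distributed exactly as p_target.
    """
    # Accept the draft token with probability min(1, p_target / p_draft).
    accept_prob = torch.clamp(p_target[draft_token] / p_draft[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return True, draft_token
    # On rejection, resample from the normalized positive residual.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, 1))
```

AutoJudge replaces this exact rule with a learned accept/reject judgement (hence "lossy"), while SpecExec keeps exact target sampling and instead changes how draft continuations are batched, cached, and pruned.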
Features
- Baseline, speculative, AutoJudge, and SpecExec decoding in one benchmark entrypoint.
- MT-Bench loader (JSON/JSONL).
- Benchmark runner with median timing, resume support, and method-specific metrics.
- Preset configs for models, methods, and paired experiments.
- Makefile shortcuts for local and Docker workflows.
- Docker support for CPU and GPU.
- CI pipeline (GitHub Actions) for checks/tests + benchmark JSONL schema validation.
Getting Started (From Zero)
- Bootstrap dependencies on a clean Ubuntu host (safe mode, does not touch NVIDIA driver):
```bash
bash scripts/install_dependencies.sh
```

The recommended Python version is 3.11 (`.python-version` in the repo). Dependencies are pinned in `requirements*.txt` for reproducible runs.

For GPU Python extras (bitsandbytes, accelerate):

```bash
bash scripts/install_dependencies.sh --gpu
```

For EOL Ubuntu releases (for example Ubuntu 17), the script stops by default. Continue only if you explicitly accept the risks:

```bash
bash scripts/install_dependencies.sh --allow-eol-ubuntu
```

- Install Docker. For GPU runs, keep your existing NVIDIA driver and install the NVIDIA Container Toolkit only.
- Put the MT-Bench dataset file (JSON/JSONL) into the project folder `datasets/`, for example `datasets/mt_bench.jsonl`.
- Build a CPU image:
```bash
docker build -t sp-samp .
```

- Run tests (CPU):
```bash
docker run --rm sp-samp
```

- Run a CPU benchmark (toy models):
```bash
docker run --rm sp-samp \
  python -m benchmarks.bench_speculative \
  --method both \
  --runs 1 \
  --max-samples 5 \
  --max-new-tokens 32 \
  --vocab-size 2048
```

- Build a GPU image (CUDA example):
```bash
docker build -f Dockerfile.gpu \
  --build-arg BASE_IMAGE=nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04 \
  --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu124 \
  --build-arg TORCH_VERSION=2.5.1 \
  -t sp-samp-gpu .
```

- Run a GPU benchmark (HF model, results saved to JSONL):
```bash
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model RedHatAI/gpt-oss-20b \
  --quant 4bit \
  --bnb-compute-dtype bfloat16 \
  --device cuda \
  --use-chat-template \
  --max-samples 50 \
  --max-new-tokens 128 \
  --k 4 \
  --runs 5 \
  --out /data/results.jsonl
```
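The `--quant 4bit` and `--bnb-compute-dtype bfloat16` flags correspond to the usual transformers/bitsandbytes loading path. A minimal sketch of what the HF adapter is assumed to do when these flags are set (the helper name is illustrative, not the repo's API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_quantized(model_name: str):
    """Load a causal LM in 4-bit NF4 with bfloat16 compute (sketch only)."""
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # --quant 4bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # --bnb-compute-dtype bfloat16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="cuda",                      # --device cuda
    )
    return model, tokenizer
```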
- Run all methods in one launch (baseline + speculative + autojudge + specexec):

```bash
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
  --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
  --device cuda \
  --use-chat-template \
  --method all \
  --k 4 \
  --runs 5 \
  --out /data/results_all.jsonl
```

- Run SpecExec only (branch execution parameters included):
```bash
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m sp_samp.cli specexec \
  --config-dir configs \
  --experiment llama3_target_llama3_specexec_k4 \
  --dataset /data/mt_bench.jsonl \
  --parallel-branches 8 \
  --branch-prune-threshold 0.0 \
  --out /data/results_specexec.jsonl
```
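`--parallel-branches` caps how many draft branches are speculated per step, and `--branch-prune-threshold` drops low-probability branches before the target pass. A conceptual sketch of that selection step, assuming each branch carries a cumulative draft probability (the data shape and function are illustrative, not the repo's internals):

```python
from dataclasses import dataclass

@dataclass
class Branch:
    tokens: list[int]      # draft continuation for this branch
    cum_prob: float        # cumulative draft probability of the branch

def select_branches(branches: list[Branch],
                    parallel_branches: int = 8,
                    prune_threshold: float = 0.0) -> list[Branch]:
    """Keep the most probable branches within the parallel budget (sketch)."""
    # Drop branches whose cumulative draft probability falls below the threshold.
    kept = [b for b in branches if b.cum_prob >= prune_threshold]
    # Keep at most `parallel_branches` branches, most probable first; these are
    # the branches whose KV cache is prefilled before target verification.
    kept.sort(key=lambda b: b.cum_prob, reverse=True)
    return kept[:parallel_branches]
```

With `--branch-prune-threshold 0.0`, as in the command above, no branch is pruned and only the parallel budget limits the speculation tree.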
- Run AutoJudge only with checkpoint reuse:

```bash
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
  --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
  --device cuda \
  --use-chat-template \
  --method autojudge \
  --autojudge-train-samples 4000 \
  --autojudge-train-steps 400 \
  --autojudge-threshold 0.5 \
  --autojudge-checkpoint /data/autojudge_llama3.pt \
  --out /data/results_autojudge.jsonl
```
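`--autojudge-threshold` sets the cut-off on the judge's acceptance score: draft tokens the judge scores at or above the threshold are kept, the rest fall back to the target model, which is what `judge_accept_rate` and `target_fallback_rate` in the metrics measure. A conceptual sketch of one plausible prefix-acceptance rule, with illustrative judge scores rather than the repo's actual judge model:

```python
def judge_decide(draft_tokens: list[int],
                 judge_scores: list[float],
                 threshold: float = 0.5) -> tuple[list[int], int]:
    """Accept the longest scored prefix above the threshold (sketch).

    Returns the accepted prefix and the number of tokens deferred to the
    target model. AutoJudge is lossy: acceptance is a learned judgement,
    not an exact match against the target distribution.
    """
    accepted: list[int] = []
    for token, score in zip(draft_tokens, judge_scores):
        if score < threshold:
            break              # first rejection ends the accepted prefix
        accepted.append(token)
    fallback = len(draft_tokens) - len(accepted)
    return accepted, fallback
```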
Make Targets

Defaults for benchmark paths:

```
DATASET=datasets/mt_bench.jsonl
OUT=datasets/results.jsonl
```
- Show all commands: `make help`
- Install/upgrade dependencies in safe mode: `make setup`
- Install/upgrade including GPU Python extras: `make setup-gpu`
- Syntax check: `make check`
- Validate benchmark JSONL schema: `make validate-results RESULTS=datasets/results.jsonl`
- List presets: `make list-presets`
- Validate config logic: `make validate-configs`
- Quick toy benchmark (no HF models): `make bench-toy OUT=/tmp/bench_toy.jsonl`
- Quick HF smoke run (needs torch + transformers, downloads a tiny model): `make smoke-hf OUT=/tmp/smoke_hf.jsonl`
- Run an experiment on MT-Bench: `make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl`
- Run the AutoJudge preset: `make autojudge DATASET=datasets/mt_bench.jsonl OUT=datasets/results_autojudge.jsonl`
- Run the SpecExec preset: `make specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl`
- Build and run the GPU Docker flow:

```bash
make docker-build-gpu
make docker-bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl
make docker-specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl
```

- Enforce headless GPU mode for long runs: `make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl HEADLESS=1`

Presets
- Models: `configs/models.json`
- Methods: `configs/methods.json`
- Experiments (target/draft pairings): `configs/experiments.json`
- Method templates (AutoJudge/SpecExec): `configs/method_templates.json`
CLI Runner
- List presets:
```bash
python -m sp_samp.cli list-presets --config-dir configs
```

- Direct method selection:
```bash
python -m benchmarks.bench_speculative \
  --method specexec \
  --dataset datasets/mt_bench.jsonl \
  --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
  --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
  --parallel-branches 8 \
  --branch-prune-threshold 0.0
```

- Run a benchmark using presets:
```bash
python -m sp_samp.cli bench \
  --config-dir configs \
  --model-preset gpt_oss_20b_4bit \
  --method-preset speculative_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results.jsonl
```

- Run a benchmark using an experiment preset:
```bash
python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment llama3_all_methods \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results.jsonl
```

- Run the AutoJudge shortcut command:
```bash
python -m sp_samp.cli autojudge \
  --config-dir configs \
  --experiment llama3_target_llama3_autojudge_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results_autojudge.jsonl
```

- Run the SpecExec shortcut command:
```bash
python -m sp_samp.cli specexec \
  --config-dir configs \
  --experiment llama3_target_llama3_specexec_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results_specexec.jsonl
```

- Require headless GPU mode (fail fast when a display is active):
```bash
python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment llama3_all_methods \
  --dataset datasets/mt_bench.jsonl \
  --require-headless \
  --out datasets/results.jsonl
```

Metrics Output

The benchmark writes JSONL records with per-run metrics and a summary record per method. Fields include:
- `status` (ok/error/skipped)
- `resume_key` (used to skip completed runs on re-launch)
- `tokens_per_sec`
- `acceptance_rate`
- `avg_tokens_per_step`
- `proposed`, `accepted`, `rejections`
- `judge_accept_rate` (AutoJudge only)
- `target_fallback_rate` (AutoJudge only)
- `autojudge_train_samples`, `autojudge_train_loss` (AutoJudge only)
- `branch_prune_rate` (SpecExec only)
- `effective_parallelism` (SpecExec only)
- `target_calls_per_token` (AutoJudge and SpecExec)
- `draft_calls_per_token` (SpecExec only)
- `cache_hit_rate` (SpecExec only)
- `max_active_branches` (SpecExec only)
- `error_type`, `error_message`, `traceback` (for failed runs)
- System metadata: `git_sha`, `hostname`, `gpu_name`, `gpu_driver`, `cuda_runtime`, `torch_version`, `transformers_version`, `display_active`
- Validate the output schema with:
```bash
python scripts/validate_results_jsonl.py --path datasets/results.jsonl --strict
```
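Because the records are plain JSONL, a quick analysis needs nothing beyond the standard library. A minimal sketch that averages throughput and acceptance per method, assuming each per-run record carries a method identifier alongside the fields listed above (the `method` key name is an assumption):

```python
import json
from collections import defaultdict
from statistics import mean

def summarize(path: str = "datasets/results.jsonl") -> None:
    """Average tokens_per_sec and acceptance_rate per method (sketch)."""
    by_method = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            rec = json.loads(line)
            # Only successful per-run records carry timing metrics.
            if rec.get("status") != "ok" or "tokens_per_sec" not in rec:
                continue
            by_method[rec.get("method", "unknown")].append(rec)

    for method, runs in by_method.items():
        tps = mean(r["tokens_per_sec"] for r in runs)
        acc = [r["acceptance_rate"] for r in runs if "acceptance_rate" in r]
        acc_str = f", acceptance_rate={mean(acc):.3f}" if acc else ""
        print(f"{method}: runs={len(runs)}, tokens_per_sec={tps:.2f}{acc_str}")

if __name__ == "__main__":
    summarize()
```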
Project Layout

- `sp_samp/`: core library, AutoJudge, SpecExec, HF adapter.
- `benchmarks/`: benchmark runner.
- `configs/`: preset configs.
- `tests/`: tests.
- `Dockerfile`, `Dockerfile.gpu`: containers.
Notes
- SpecExec is implemented as an exact (distribution-preserving) decoder with speculative cache prefill and configurable `parallel_branches`/`branch_prune_threshold`.
- HF SpecExec uses KV-cache reuse and depth-wise tree passes for faster cache construction.
- Draft and target models must share an identical tokenizer vocabulary mapping for speculative, AutoJudge, and SpecExec correctness.
- `scripts/install_dependencies.sh` is idempotent for apt/pip dependencies and never modifies NVIDIA driver packages.
- `scripts/install_dependencies.sh` prefers `python3.11` when available and warns when running with lower versions.
- Ubuntu 17 is EOL. The script blocks by default on EOL Ubuntu unless `--allow-eol-ubuntu` is set explicitly.
- Re-running a benchmark with the same `OUT` file automatically skips completed runs (resume mode).
- Failed runs are written to JSONL and do not stop the whole benchmark method loop.
- `make validate-configs` checks config references and tokenizer compatibility in `configs/*.json` (a sketch of the tokenizer check follows this list); `make validate-results` enforces JSONL schema compatibility for downstream analysis.
- The current default experiments are correctness-safe (same tokenizer family) and may show limited speedup when target = draft.
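The tokenizer-compatibility requirement above can also be checked directly. A minimal standalone sketch using the transformers API, not the repo's `make validate-configs` implementation:

```python
from transformers import AutoTokenizer

def same_vocab(target_name: str, draft_name: str) -> bool:
    """Return True when both tokenizers map every token string to the same id."""
    target_vocab = AutoTokenizer.from_pretrained(target_name).get_vocab()
    draft_vocab = AutoTokenizer.from_pretrained(draft_name).get_vocab()
    return target_vocab == draft_vocab

if __name__ == "__main__":
    # Identical checkpoints (as in the default experiments) trivially pass.
    print(same_vocab("meta-llama/Meta-Llama-3-8B-Instruct",
                     "meta-llama/Meta-Llama-3-8B-Instruct"))
```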