This repository contains tooling and evaluation code for benchmarking Qwen3-family models on the ITALIC multiple-choice dataset. It supports two evaluation modes: standard (non-reasoning) and explicit reasoning (chain-of-thought).
- Python 3.10 to 3.12
- GPU recommended (CUDA-enabled) for reasonable performance when running vLLM
Note: adjust your PyTorch/CUDA installation to match your local GPU runtime if you plan to run parts of the code that require PyTorch. The project primarily uses vLLM for inference.
- vLLM: efficient inference engine used to run LLMs.
- PEFT (LoRA): Parameter-Efficient Fine-Tuning utilities (LoRA) for merging and evaluating adapters.
- pandas: data inspection and tabulation.
- tqdm: progress bars for loops and batch processing.
- Jupyter: interactive notebooks for experimentation and visualization.
- python-dotenv: load environment variables from a .env file.
- Transformers: model and tokenizer utilities (optional; useful when working outside vLLM).
- PyTorch: core deep learning library (install a wheel that matches your CUDA runtime if you need GPU-enabled PyTorch).
- datasets: dataset utilities and I/O.
- scikit-learn: evaluation utilities and metrics.
- TRL (trl): training utilities for SFT / policy learning.
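To ground what these packages are used for: a multiple-choice benchmark like ITALIC ultimately scores a model by comparing predicted answer letters against gold labels. The sketch below illustrates that accuracy computation in plain Python; the letter-label format and function name are illustrative assumptions, not the repository's actual schema.

```python
# Illustrative accuracy computation for a multiple-choice benchmark.
# The (prediction, gold) pairs are examples; the real pipeline derives
# predictions from model output.

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of questions where the predicted letter matches the gold letter."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must be the same length")
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, golds))
    return correct / len(golds)

if __name__ == "__main__":
    preds = ["A", "c", "B", "D"]
    golds = ["A", "C", "D", "D"]
    print(f"accuracy = {accuracy(preds, golds):.2f}")  # 3 of 4 correct -> 0.75
```

scikit-learn's metrics (e.g. `accuracy_score`) compute the same quantity, along with richer breakdowns when you need them.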
Install the primary runtime dependencies (example):

```bash
pip install vllm python-dotenv pandas tqdm jupyter peft
```

Recommended additional packages, depending on your workflow and whether you run PyTorch-based tooling:
- torch (install a wheel that matches your CUDA runtime)
- transformers
- datasets
- scikit-learn
- trl
There are two simple ways to set up the project: pip + virtualenv or Poetry. The minimal instructions below use the packages from dependences.txt.
Create and activate a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

Upgrade pip and install the core dependencies (adjust the PyTorch wheel to match your CUDA version if you need GPU-enabled PyTorch):

```bash
pip install --upgrade pip
pip install vllm python-dotenv pandas tqdm jupyter peft
```

If you prefer Poetry, create a virtual environment and add the same dependencies. Example:
```bash
poetry init --no-interaction
poetry add vllm python-dotenv pandas tqdm jupyter peft
# add torch/transformers as needed depending on your CUDA and workflow
```

Place environment values (for example HF tokens or custom settings) in a .env file at the repo root. The benchmark loader uses python-dotenv to load these values.
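For reference, the .env step can be sketched with a minimal stdlib-only parser. This illustrates the KEY=VALUE format that python-dotenv consumes; it is not the library's actual implementation, and `HF_TOKEN` is just an example variable name.

```python
import os

# Minimal sketch of .env loading: each non-comment KEY=VALUE line is
# exported into the process environment. No quoting or interpolation
# support -- python-dotenv handles those cases properly.

def load_env_file(path: str) -> dict[str, str]:
    """Parse a simple KEY=VALUE .env file and export it to os.environ."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    os.environ.update(values)
    return values
```

In the repository itself, `from dotenv import load_dotenv; load_dotenv()` performs this step.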
Benchmark output and summaries are written to the results/ folder. The run_benchmark() flow saves a detailed CSV (*_results.csv) and a JSON summary (*_summary.json).
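As a rough illustration of that output layout, the sketch below writes a per-question CSV plus a JSON summary using the standard library. The column and field names (`question_id`, `gold`, `prediction`, `accuracy`) and the demo filenames are assumptions for illustration, not the exact schema run_benchmark() uses.

```python
import csv
import json

# Illustrative results layout: a detailed per-question CSV (*_results.csv)
# and an aggregate JSON summary (*_summary.json). Field names are assumed.
rows = [
    {"question_id": 1, "gold": "A", "prediction": "A"},
    {"question_id": 2, "gold": "B", "prediction": "C"},
]

with open("demo_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question_id", "gold", "prediction"])
    writer.writeheader()
    writer.writerows(rows)

accuracy = sum(r["gold"] == r["prediction"] for r in rows) / len(rows)
with open("demo_summary.json", "w") as f:
    json.dump({"n_questions": len(rows), "accuracy": accuracy}, f, indent=2)
```

The real files land under results/ with names derived from the benchmarked model and run.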