AI Eval CLI is a command-line tool that leverages AI evals to A/B test multiple AI models against goldens and generated edge cases. By scoring responses on accuracy and robustness, it helps product managers and developers compare models under real-world conditions. The tool outputs clear reports that highlight strengths, weaknesses, and trade-offs between models, making it easier to guide model selection and deployment decisions.
- Takes a JSON dataset with prompts + "golden" (perfect) responses
- Example:
"Who is the CEO of Tesla?"β"Elon Musk."
For each prompt, automatically generates 4 types of edge cases:
- Paraphrase: Slight rewording of the original
- Constraint: Adds requirements (e.g., "answer in one sentence")
- Noisy: Adds typos, emojis, shorthand
- Ambiguity: Makes the prompt less specific
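For illustration only (these variants are invented, not actual tool output), the Tesla prompt above might expand into edge cases like:
- Paraphrase: "Who currently serves as Tesla's chief executive?"
- Constraint: "Who is the CEO of Tesla? Answer in one sentence."
- Noisy: "who's teh ceo of tesla rn??"
- Ambiguity: "Who's in charge at Tesla?"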
- Calls Model A and Model B (e.g., gpt-4o vs gpt-4o-mini) on each prompt
- Tests both the original prompt + all 4 edge cases
- Scores each response 1-5 based on:
- Semantic similarity to golden response
- Clarity/readability
- Intent alignment
Generates 3 files:
- CSV: Raw data with all scores
- Markdown: Human-readable report
- PDF: Printable report with recommendations
Normal Case: Model A wins 70%
Edge Cases: Model B wins 55%
Recommendation: Model A for accuracy; Model B for robustness
- Model Selection: "Which model should we deploy?"
- Robustness Testing: "How well does our model handle real user variations?"
- Quality Assurance: "Does our model break under edge cases?"
Most tools test models on perfect inputs. This tool tests how models perform when users ask questions in messy, real-world ways - which is what actually happens in production.
Bottom line: It's a PM-friendly way to evaluate AI models beyond just accuracy, focusing on real-world robustness.
- Clone the repository:

```bash
git clone <your-repo-url>
cd ai-eval-cli
```

- Create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install in editable mode:

```bash
pip install -e .
```

- Set up your OpenAI API key:

```bash
export OPENAI_API_KEY=your_api_key_here
```

Run your first evaluation:

```bash
# Compare two OpenAI models
aieval evaluate --dataset datasets/starter.json --model-a gpt-4o-mini --model-b gpt-4o-mini --outdir .
```

Supported models:

- gpt-4o-mini (recommended for testing)
- gpt-4o
- gpt-4
- gpt-3.5-turbo
The tool generates three files:
- `results_edge.csv` - Detailed results in CSV format
- `results_edge.md` - Human-readable markdown report
- `results_edge.pdf` - Professional PDF report
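If you want to slice the raw scores yourself, a few lines of pandas are enough. The column names below (`model`, `case_type`, `overall_score`) are placeholders for illustration; check the actual header of `results_edge.csv` first.

```python
# Hypothetical post-processing of results_edge.csv (column names are assumed, not guaranteed).
import pandas as pd

df = pd.read_csv("results_edge.csv")

# Average 1-5 score per model, split by original prompts vs. generated edge cases
summary = df.groupby(["model", "case_type"])["overall_score"].mean().unstack()
print(summary.round(2))
```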
- Semantic Similarity: Token overlap with golden response
- Clarity: Flesch reading ease score
- Tone Alignment: VADER sentiment analysis
- Overall Score: Weighted average mapped to 1-5 scale
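Below is a minimal sketch of that scoring recipe, assuming the `textstat` and `vaderSentiment` packages are installed; the weights and normalization choices are illustrative guesses, not the CLI's actual implementation.

```python
# Illustrative scoring sketch (weights and normalization are assumptions, not the CLI's code).
import textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_analyzer = SentimentIntensityAnalyzer()

def score_response(response: str, golden: str) -> float:
    # Semantic similarity: token overlap (Jaccard) with the golden response
    resp, gold = set(response.lower().split()), set(golden.lower().split())
    similarity = len(resp & gold) / max(len(resp | gold), 1)

    # Clarity: Flesch reading ease, clamped and normalized to 0-1
    clarity = max(0.0, min(textstat.flesch_reading_ease(response), 100.0)) / 100.0

    # Tone alignment: distance between VADER compound scores (each in [-1, 1])
    tone_gap = abs(_analyzer.polarity_scores(response)["compound"]
                   - _analyzer.polarity_scores(golden)["compound"])
    tone = 1.0 - tone_gap / 2.0

    # Overall: weighted average mapped onto the 1-5 scale
    overall = 0.5 * similarity + 0.25 * clarity + 0.25 * tone
    return round(1 + 4 * overall, 2)

print(score_response("Elon Musk is the CEO of Tesla.", "Elon Musk."))
```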
Create your own dataset in JSON format:
```json
[
  {
    "task": "summarization",
    "prompt": "Summarize: The cat sat on the mat.",
    "golden": "A cat is sitting on a mat.",
    "tone": "neutral"
  }
]
```

- `task`: Type of task (summarization, qa, instruction, reasoning, creative)
- `prompt`: The input prompt for the model
- `golden`: The expected/perfect response
- `tone`: Expected tone (neutral, polite, concise, creative, friendly, punchy)
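Before kicking off a long run, a quick sanity check of the file can save an API bill. The helper below is a standalone sketch, not part of the CLI:

```python
# Quick dataset sanity check (standalone helper, not shipped with aieval).
import json

REQUIRED_FIELDS = {"task", "prompt", "golden", "tone"}

def validate_dataset(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        rows = json.load(f)
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - set(row)
        if missing:
            raise ValueError(f"Entry {i} is missing fields: {sorted(missing)}")
    return rows

rows = validate_dataset("datasets/starter.json")
print(f"Loaded {len(rows)} prompts")
```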
Example runs:

```bash
aieval evaluate --dataset datasets/starter.json --model-a gpt-4o-mini --model-b gpt-4o-mini --outdir ./results

# Compare different model families
aieval evaluate --dataset datasets/starter.json --model-a gpt-4o-mini --model-b gpt-4o --outdir .
```

The report summarizes:

- Normal-case wins: Performance on original prompts
- Edge-case wins: Performance on generated edge cases
- Overall recommendation: Based on combined performance
- 5: Excellent match with golden response
- 4: Good match with minor differences
- 3: Acceptable match with some differences
- 2: Poor match with significant differences
- 1: Very poor match or error
```
ai-eval-cli/
├── aieval/
│   └── cli.py           # Main CLI implementation
├── datasets/
│   └── starter.json     # Example dataset
├── pyproject.toml       # Project configuration
└── README.md            # This file
```
- Modify `aieval/cli.py`
- Update `pyproject.toml` if adding dependencies
- Test with `pip install -e .`
- Update documentation
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is open source and available under the MIT License.
Error 429 - Insufficient Quota:
- Check your OpenAI account billing
- Add a payment method, even for free-tier usage
- Try a different model or wait for rate limits to reset
Import Errors:
- Ensure virtual environment is activated
- Reinstall with `pip install -e .`
PDF Generation Fails:
- Check if reportlab is installed
- Ensure write permissions in output directory
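One quick way to confirm the dependency from inside the activated environment (any equivalent check works):

```python
# Check whether reportlab is importable in the current environment.
import importlib.util

print("reportlab installed:", importlib.util.find_spec("reportlab") is not None)
```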
For issues and questions:
- Check the troubleshooting section
- Review the documentation
- Open an issue on GitHub
Built for Product Managers and AI practitioners who need simple, effective AI evals.