Your AI generates content, but you can't trust it blindly? Put Gran Sabio LLM in the middle. Multiple AI models evaluate, score, and approve every piece of content before it reaches you.
Every developer using AI for content generation faces the same challenges:
| Problem | What Happens | The Cost |
|---|---|---|
| Hallucinations | AI invents facts, dates, or events | Credibility destroyed, corrections needed |
| Quality inconsistency | Sometimes great, sometimes terrible | Manual review of every output |
| No validation | You get content, hope it's good | Time wasted on unusable content |
| Single point of failure | One model, one opinion | Bias and blind spots undetected |
| Format violations | JSON that doesn't match your schema | Parsing errors, retry loops |
| Repetitive vocabulary | Same phrases appearing everywhere | Unprofessional, robotic text |
Traditional solution: Review everything manually or accept the risk.
Gran Sabio LLM solution: Let multiple AI models evaluate every output with configurable quality criteria, automatic retry on failure, and a "Great Sage" arbiter for final decisions.
Your Request
|
v
[Preflight Validation] --> Detects contradictions before wasting tokens
|
v
[Content Generation] --> Your chosen AI model generates content
|
v
[Multi-Layer QA] --> Multiple AI models evaluate different aspects
| - Historical accuracy
| - Literary quality
| - Format compliance
| - Custom criteria you define
v
[Consensus Engine] --> Calculates scores across all evaluators
|
v
[Pass?] --No--> [Iterate with Feedback] --> Back to generation
|
Yes
|
v
[Deal Breaker?] --Yes--> [Gran Sabio Escalation] --> Premium model decides
|
No
|
v
[Approved Content] --> Delivered with confidence scores
Note: Gran Sabio LLM is fundamentally an API-first tool designed to integrate into your content generation pipelines. The web interface below is a development/demo UI to help visualize and test the API capabilities - not a production-ready application. Think of it as a reference implementation showing what's possible when you build on top of this API.
Access the interactive demo at http://localhost:8000/ - configure your generation, select models, define QA layers, and watch results in real time.
Configure prompts, models, QA layers, and quality thresholds from an intuitive web UI
Click "Live Matrix" to watch the entire process unfold:
- Content chunks streaming as they're generated
- QA evaluations appearing for each layer and model
- Scores updating as consensus is calculated
- Deal-breaker escalations and Gran Sabio decisions
Watch content generation, QA evaluation, and scoring happen in real time
Beyond direct API connections (OpenAI, Anthropic, Google, xAI), you can access all models available on OpenRouter - including Mistral, DeepSeek, LLaMA, Qwen, and many more.
Access hundreds of models through OpenRouter integration
Every generation is logged in detail. Access /debugger to inspect:
- Complete request payloads and parameters
- Every iteration with content and scores
- QA evaluations per layer and model
- Consensus calculations
- Gran Sabio escalations and decisions
- Token usage and costs per phase
Inspect every detail of your generation sessions
Define what "quality" means for YOUR use case:
{
"qa_layers": [
{
"name": "Factual Accuracy",
"criteria": "Verify all dates, names, and events are historically correct",
"min_score": 8.5,
"deal_breaker_criteria": "invents facts or presents false information"
},
{
"name": "Narrative Flow",
"criteria": "Evaluate prose quality, transitions, and reader engagement",
"min_score": 7.5
}
],
"qa_models": ["gpt-4o", "claude-sonnet-4", "gemini-2.0-flash"]
}

Each layer is evaluated by ALL configured QA models. If GPT-4o passes but Claude finds an issue, you'll know. Consensus is calculated automatically.
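To make the consensus step concrete, here is a minimal sketch of how per-layer scores from several evaluators could be combined and checked against `min_score`. The score structure and the simple averaging rule are illustrative assumptions, not the engine's actual implementation.

```python
# Illustrative only: combine per-model scores for one QA layer and check the threshold.
# The real consensus engine may weight models or layers differently.
from statistics import mean

def layer_consensus(scores_by_model: dict[str, float], min_score: float) -> tuple[float, bool]:
    """Average one layer's scores across QA models and test it against min_score."""
    consensus = mean(scores_by_model.values())
    return consensus, consensus >= min_score

scores = {"gpt-4o": 8.7, "claude-sonnet-4": 7.9, "gemini-2.0-flash": 8.4}
score, passed = layer_consensus(scores, min_score=8.5)
print(f"Factual Accuracy consensus: {score:.2f} -> {'pass' if passed else 'iterate'}")
```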
Some issues are too serious to just lower the score:
- Majority deal-breaker (>50% of models): Forces immediate regeneration
- Minority deal-breaker (<50%): Escalates to Gran Sabio for arbitration
- Tie (50%): Gran Sabio decides if it's a real issue or false positive
Why this matters: You define what's unacceptable. "Invented facts" can be a deal-breaker while "slightly awkward phrasing" just lowers the score.
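As a rough sketch (not the engine's actual code), the majority/minority/tie routing described above boils down to a simple rule over the evaluators' flags. The config snippet that follows shows how you declare such a criterion in a QA layer.

```python
# Illustrative routing of deal-breaker flags; thresholds mirror the rules above.
def route_deal_breaker(flags: list[bool]) -> str:
    """flags[i] is True if QA model i raised a deal-breaker on this output."""
    ratio = sum(flags) / len(flags)
    if ratio > 0.5:
        return "regenerate"   # majority: forces immediate regeneration
    if ratio == 0.5:
        return "gran_sabio"   # tie: the Great Sage decides if it's a real issue
    if ratio > 0:
        return "gran_sabio"   # minority: escalate for arbitration
    return "continue"         # nobody flagged it

print(route_deal_breaker([True, False, False]))  # -> "gran_sabio"
```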
{
"deal_breaker_criteria": "uses offensive language or invents historical events"
}

When evaluators disagree or max iterations are reached, the "Great Sage" steps in:
- Uses premium reasoning models (Claude Opus 4.5 with 30K thinking tokens by default)
- Analyzes the conflict: Was it a real issue or false positive?
- Can modify content: Fixes minor issues without full regeneration
- Tracks model reliability: Learns which models produce more false positives
- Flexible model choice: Use GPT-5.2-Pro for maximum accuracy or Claude Opus 4.5 for deep reasoning
{
"gran_sabio_model": "claude-opus-4-5-20251101",
"gran_sabio_call_limit_per_session": 15
}

Or use OpenAI's most powerful model:
{
"gran_sabio_model": "gpt-5.2-pro"
}

Before spending money on generation, the system checks if your request makes sense:
Request: "Write a fiction story about dragons"
QA Layer: "Verify historical accuracy of all events"
Preflight Response:
{
"decision": "reject",
"issues": [{
"code": "contradiction_detected",
"severity": "critical",
"message": "Fiction content cannot be validated for historical accuracy"
}]
}
No tokens wasted on impossible requests.
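If you call the API programmatically, you can surface these rejections before retrying. A minimal sketch, assuming a rejection payload shaped like the one above is returned in the response body:

```python
# Illustrative handling of a preflight rejection payload like the one shown above.
def check_preflight(response: dict) -> None:
    if response.get("decision") == "reject":
        for issue in response.get("issues", []):
            print(f"[{issue['severity']}] {issue['code']}: {issue['message']}")
        raise ValueError("Rejected at preflight - fix the contradiction and retry")

try:
    check_preflight({
        "decision": "reject",
        "issues": [{"code": "contradiction_detected", "severity": "critical",
                    "message": "Fiction content cannot be validated for historical accuracy"}],
    })
except ValueError as err:
    print(err)
```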
AI models are notoriously bad at hitting word targets. Gran Sabio LLM solves this:
{
"min_words": 800,
"max_words": 1200,
"word_count_enforcement": {
"enabled": true,
"flexibility_percent": 15,
"direction": "both",
"severity": "deal_breaker"
}
}

The system automatically injects a QA layer that counts words and triggers regeneration if the target isn't met.
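For intuition, here is a small sketch of the kind of check such an injected layer might perform, applying `flexibility_percent` in both directions as in the config above. The exact rule the engine applies is an assumption here.

```python
# Illustrative word-count check mirroring min_words/max_words with flexibility_percent.
def within_target(text: str, min_words: int, max_words: int, flexibility_percent: float) -> bool:
    count = len(text.split())
    lower = min_words * (1 - flexibility_percent / 100)
    upper = max_words * (1 + flexibility_percent / 100)
    return lower <= count <= upper

draft = "word " * 700
print(within_target(draft, min_words=800, max_words=1200, flexibility_percent=15))  # True: 700 >= 680
```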
Detect and prevent repetitive vocabulary:
- MTLD, HD-D, Yule's K, Herdan's C metrics calculated automatically
- GREEN/AMBER/RED grading based on configurable thresholds
- Window analysis finds exactly where repetition clusters appear
- Top words report shows which words are overused
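To give a feel for these metrics, here is a tiny sketch of two of the simpler ones, Yule's K and Herdan's C, computed from raw token counts. The engine's own implementation and grading thresholds may differ; the configuration below enables the checks inside the pipeline itself.

```python
# Illustrative Yule's K and Herdan's C over a whitespace-tokenized text.
import math
from collections import Counter

def yules_k(tokens: list[str]) -> float:
    n = len(tokens)
    freq_of_freq = Counter(Counter(tokens).values())   # m -> number of types occurring m times
    s2 = sum(m * m * v for m, v in freq_of_freq.items())
    return 10_000 * (s2 - n) / (n * n)                  # higher K = more repetitive vocabulary

def herdans_c(tokens: list[str]) -> float:
    types, n = len(set(tokens)), len(tokens)
    return math.log(types) / math.log(n)                # closer to 1 = richer vocabulary

text = "the cat sat on the mat and the dog sat on the rug".split()
print(round(yules_k(text), 1), round(herdans_c(text), 3))
```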
{
"lexical_diversity": {
"enabled": true,
"metrics": "auto",
"decision": {
"deal_breaker_on_red": true,
"deal_breaker_on_amber": false
}
}
}

Block specific phrases or patterns:
{
"phrase_frequency": {
"enabled": true,
"rules": [
{
"name": "no_then_went_to",
"phrase": "then went to",
"max_repetitions": 1,
"severity": "deal_breaker"
},
{
"name": "short_phrases",
"min_length": 3,
"max_length": 6,
"max_repetitions": 3,
"severity": "warn"
}
]
}
}

Detect when AI models claim to use evidence but actually ignore it:
{
"evidence_grounding": {
"enabled": true,
"model": "gpt-4o-mini",
"budget_gap_threshold": 0.5,
"on_flag": "deal_breaker",
"max_flagged_claims": 2
}
}

How it works:
- Extracts verifiable claims from generated content
- Measures P(claim | evidence) vs P(claim | no evidence) using logprobs
- Flags claims where confidence doesn't drop when evidence is removed
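Conceptually, the check compares how likely the claim's tokens are with and without the evidence in context. A minimal sketch of that comparison, operating on token logprobs you would obtain from the provider - the real scoring and the exact semantics of `budget_gap_threshold` are simplified here:

```python
# Illustrative grounding-gap check: compare a claim's average token logprob
# when conditioned on the evidence vs. without it. A small gap means the model
# was just as confident without the evidence - i.e. the claim is not grounded.
from statistics import mean

def grounding_gap(logprobs_with_evidence: list[float], logprobs_without: list[float]) -> float:
    return mean(logprobs_with_evidence) - mean(logprobs_without)

def is_flagged(gap: float, budget_gap_threshold: float = 0.5) -> bool:
    return gap < budget_gap_threshold   # confidence barely dropped when evidence was removed

gap = grounding_gap([-0.4, -0.3, -0.5], [-0.5, -0.4, -0.6])
print(round(gap, 2), is_flagged(gap))   # tiny gap -> flagged as ungrounded
```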
This catches:
- "According to the sources, Marie Curie was born in Paris" (context says Warsaw)
- Claims that sound referenced but ignore the actual context
Configuration modes:
| Mode | on_flag | When to use |
|---|---|---|
| Verification-only | "warn" | General content, informational logging |
| Fail-fast | "deal_breaker" | Critical factual content, medical/legal |
| Regenerate | "regenerate" | Auto-fix on detection |
Cost: ~$0.003 per request for 10 claims (2-6% overhead)
Process images alongside text for multimodal content generation:
{
"prompt": "Describe these product images in detail",
"generator_model": "gpt-4o",
"username": "your_username",
"images": [
{"upload_id": "img_001", "username": "your_username", "detail": "high"},
{"upload_id": "img_002", "username": "your_username", "detail": "auto"}
],
"image_detail": "auto",
"qa_layers": []
}

Supported vision models:
- OpenAI: GPT-4o, GPT-5, GPT-5 Pro, O1, O3, O3-Pro
- Anthropic: Claude Sonnet 4, Claude Opus 4.5, Haiku 4
- Google: Gemini 2.0 Flash, Gemini 2.5 Pro/Flash
- xAI: Grok 2+ models
Detail levels (OpenAI-style):
| Level | Tokens | Use Case |
|---|---|---|
| low | ~85 fixed | Quick classification, thumbnails |
| high | Variable (tiles) | Detailed analysis, text extraction |
| auto | Model decides | General use (default) |
QA with vision - Let QA models see input images for accuracy validation:
{
"prompt": "Describe this architectural diagram",
"generator_model": "claude-sonnet-4-20250514",
"images": [{"upload_id": "diagram", "username": "user1"}],
"qa_with_vision": true,
"qa_layers": [
{
"name": "Visual Accuracy",
"criteria": "Verify description matches the actual diagram elements",
"include_input_images": true,
"min_score": 8.5
},
{
"name": "Writing Quality",
"criteria": "Evaluate clarity and technical accuracy",
"include_input_images": false,
"min_score": 7.5
}
]
}

Limits and auto-processing:
- Default: 20 images per request (configurable up to 100)
- Auto-resize to optimal dimensions per provider
- Automatic format conversion (HEIC/HEIF to JPEG)
- Preflight validation rejects requests when model lacks vision capability
100% format guarantee across all major providers:
{
"generator_model": "gpt-5",
"json_output": true,
"json_schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"summary": {"type": "string"},
"key_points": {"type": "array", "items": {"type": "string"}}
},
"required": ["title", "summary"]
}
}

Supported providers:
- OpenAI: GPT-4o, GPT-5, GPT-5.2-Pro, O1/O3 series
- Anthropic: Claude 4 Sonnet, Claude Opus 4.5
- Google: Gemini 2.0+, Gemini 2.5
- xAI: Grok 4
- OpenRouter: All compatible models (Mistral, DeepSeek, LLaMA, Qwen, and 200+ more)
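Even with the format guarantee, it can be useful to double-check the returned object on your side, for example with the `jsonschema` package, reusing the same schema you sent in the request:

```python
# Client-side validation of the structured output against the schema above.
# Requires: pip install jsonschema
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
        "key_points": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "summary"],
}

result = json.loads('{"title": "Q3 Report", "summary": "Revenue grew.", "key_points": ["+12% YoY"]}')
try:
    validate(instance=result, schema=schema)
    print("Output matches the schema")
except ValidationError as err:
    print(f"Schema violation: {err.message}")
```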
For complex tasks, enable deep thinking:
OpenAI Reasoning:
{
"generator_model": "gpt-5",
"reasoning_effort": "high"
}

Claude Thinking Mode:
{
"generator_model": "claude-sonnet-4-20250514",
"thinking_budget_tokens": 8000
}

Both work for QA evaluation too - your evaluators can "think" before scoring.
Need fast generation without QA? Just send empty layers:
{
"prompt": "Write a quick draft",
"qa_layers": []
}

Content is approved immediately. Perfect for testing, bulk generation, or content that will be manually edited.
git clone https://github.com/jordicor/Gran_Sabio_LLM.git
cd Gran_Sabio_LLM
python quick_start.py

Create a .env file with your own API keys from each provider:
# Get your keys from each provider's dashboard:
# - OpenAI: https://platform.openai.com/api-keys
# - Anthropic: https://console.anthropic.com/
# - Google: https://aistudio.google.com/apikey
# - xAI: https://console.x.ai/
# - OpenRouter: https://openrouter.ai/keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
XAI_API_KEY=xai-...
OPENROUTER_API_KEY=sk-or-...
PEPPER=any-random-string-here

Note: You only need keys for the providers you want to use. At minimum, configure one provider.
python main.py

Server starts at http://localhost:8000
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Write a 500-word biography of Marie Curie",
"content_type": "biography",
"generator_model": "gpt-4o",
"qa_models": ["gpt-4o", "claude-sonnet-4"],
"qa_layers": [
{
"name": "Accuracy",
"criteria": "Verify historical facts",
"min_score": 8.0,
"deal_breaker_criteria": "invents facts"
}
],
"min_global_score": 8.0,
"max_iterations": 3
}'

The demos/ folder contains 11 ready-to-run scripts showcasing different capabilities. Here are the highlights:
| Demo | Description | Complexity |
|---|---|---|
| YouTube Script Generator | Multi-phase pipeline: topic analysis, script, scenes, thumbnails. Uses JSON Schema, lexical diversity, and project grouping. | Advanced |
| Code Analyzer | Dynamic JSON output for code review. Detects security issues, performance problems. Shows when to use flexible JSON vs strict schemas. | Advanced |
| Reasoning Models | GPT-5 reasoning effort, Claude thinking mode. Complex analysis with deep thinking. | Advanced |
| JSON Structured Output | 100% format guarantee with json_schema. Multi-provider support. | Intermediate |
| Text Quality Analyzer | Analyze existing text without generating. Lexical diversity, AI pattern detection. | Intermediate |
| Parallel Generation | Bulk content creation with async parallel execution. | Advanced |
Quick start:
# Start the API server
python main.py
# Run any demo
python demos/03_youtube_script_generator.py --topic "How AI is Changing Music"

See the complete list and documentation: demos/README.md
Full interactive documentation available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Custom Docs: http://localhost:8000/api-docs (recommended)
| Endpoint | Method | Description |
|---|---|---|
| /generate | POST | Start content generation with QA |
| /status/{session_id} | GET | Check session status |
| /stream/project/{project_id} | GET | Real-time SSE progress stream (project_id = session_id when not explicit) |
| /result/{session_id} | GET | Get final approved content |
| /stop/{session_id} | POST | Cancel active generation |
| /models | GET | List available AI models |
| /debugger | GET | Session history and inspection UI |
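A typical workflow against these endpoints: start a generation, poll its status, then fetch the result. A minimal sketch using the requests library; the exact response field names (session_id, status) are assumptions based on the endpoints above:

```python
# Illustrative generate -> poll -> fetch loop against the core endpoints.
# Field names like "session_id" and "status" are assumptions for this sketch.
import time
import requests

BASE = "http://localhost:8000"

payload = {
    "prompt": "Write a 500-word biography of Marie Curie",
    "generator_model": "gpt-4o",
    "qa_models": ["gpt-4o", "claude-sonnet-4"],
    "qa_layers": [{"name": "Accuracy", "criteria": "Verify historical facts", "min_score": 8.0}],
}

session_id = requests.post(f"{BASE}/generate", json=payload).json()["session_id"]

while True:                                   # poll until the session finishes
    status = requests.get(f"{BASE}/status/{session_id}").json()
    if status.get("status") in ("completed", "failed", "stopped"):
        break
    time.sleep(2)

print(requests.get(f"{BASE}/result/{session_id}").json())
```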
Standalone text analysis tools (no generation required):
| Endpoint | Method | Description |
|---|---|---|
| /analysis/lexical-diversity | POST | Vocabulary richness metrics (MTLD, HD-D, etc.) |
| /analysis/repetition | POST | N-gram repetition analysis with clustering |
Group multiple sessions under a single project ID:
| Endpoint | Method | Description |
|---|---|---|
| /project/new | POST | Reserve a new project ID |
| /project/start/{id} | POST | Activate a project |
| /project/stop/{id} | POST | Cancel all project sessions |
| /stream/project/{id} | GET | Stream all project events |
A ready-to-use Python client is available for easy integration:
from gransabio_client import GranSabioClient
client = GranSabioClient("http://localhost:8000")
# Simple generation
result = client.generate(
    prompt="Write a product description",
    generator_model="gpt-4o",
    qa_layers=[{"name": "Quality", "criteria": "...", "min_score": 8.0}]
)
print(result.content)
print(f"Score: {result.final_score}")

Stream progress:
for event in client.stream_generate(prompt="...", qa_layers=[...]):
    print(f"[{event.phase}] {event.message}")

Gran Sabio LLM includes a Model Context Protocol (MCP) server that integrates directly with AI coding assistants. Get multi-model code review and analysis without leaving your terminal.
| Tool | Description |
|---|---|
| gransabio_analyze_code | Analyze code for bugs, security issues, and best practices |
| gransabio_review_fix | Validate a proposed fix before applying it |
| gransabio_generate_with_qa | Generate content with multi-model QA |
| gransabio_check_health | Verify Gran Sabio LLM API connectivity |
| gransabio_list_models | List available AI models |
1. Install MCP dependencies:
pip install -r mcp/requirements.txt

2. Run the installer script:
Windows:
install_mcp.bat

Linux/macOS:
./install_mcp.sh

The scripts automatically detect paths and register the MCP server with Claude Code.
Manual installation (if you prefer):
# Use absolute paths - relative paths won't work!
claude mcp add gransabio-llm -- python /path/to/Gran_Sabio_LLM/mcp_server/gransabio_mcp_server.py

Gemini CLI (~/.gemini/settings.json):
{
"mcpServers": {
"gransabio-llm": {
"command": "python",
"args": ["/path/to/Gran_Sabio_LLM/mcp_server/gransabio_mcp_server.py"]
}
}
}

Codex CLI (~/.codex/config.toml):
[mcp_servers.gransabio-llm]
command = "python"
args = ["/path/to/Gran_Sabio_LLM/mcp_server/gransabio_mcp_server.py"]

You: Analyze this code for security issues using Gran Sabio
Claude: [Calls gransabio_analyze_code]
Gran Sabio Analysis (Score: 8.2/10):
- [CRITICAL] SQL injection at line 45
- [HIGH] Hardcoded credentials at line 12
- [MEDIUM] Missing input validation at line 30
Reviewed by: GPT-5-Codex, Claude Opus 4.5, GLM-4.7
Consensus: 3/3 models agree
For hosted Gran Sabio LLM instances:
claude mcp add gransabio-llm \
--env GRANSABIO_API_URL=https://api.gransabio.example.com \
--env GRANSABIO_API_KEY=your-api-key \
-- python /path/to/gransabio_mcp_server.py

See full documentation: mcp/README.md
Gran Sabio LLM is currently a self-hosted solution. You deploy it on your infrastructure and use your own API keys from each AI provider.
| Aspect | Self-Hosted |
|---|---|
| API Keys | You obtain and configure keys from OpenAI, Anthropic, Google, xAI, and/or OpenRouter |
| Billing | Each provider bills you directly based on your usage |
| Infrastructure | You host and maintain the server |
| Data Privacy | Your prompts and content stay on your infrastructure |
| Models Available | All models your API keys have access to, plus 200+ via OpenRouter |
- Full control over your data and costs
- No intermediaries - direct connection to AI providers
- Use your existing accounts - no new subscriptions needed
- Enterprise compliance - deploy in your own cloud/datacenter
- Unlimited usage - no rate limits beyond provider limits
- Python 3.10+
- API keys for at least one provider (OpenAI, Anthropic, Google, xAI, or OpenRouter)
- ~500MB disk space for dependencies
- Recommended: 4GB RAM minimum
- Pillow library (auto-installed, required for vision/image processing)
Quality assurance: The codebase includes 950+ automated tests covering API endpoints, engines, client SDK, and integrations.
# With uvicorn
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
# With gunicorn + uvicorn workers
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000

Don't want to manage API keys and infrastructure? A hosted version is in development.
| Feature | Cloud Version |
|---|---|
| API Keys | None required - we handle all provider connections |
| Setup | Zero - just sign up and start making API calls |
| Billing | Single subscription covers all AI providers |
| Models | All supported models, always up to date |
| Features | Everything in self-hosted, fully managed |
| Web Interface | Polished, production-ready UI for non-developers |
Want to be notified when the Cloud version launches? Star this repo and follow me on GitHub or my social media channels - I'll announce early access there first.
Self-hosting will always remain available for those who prefer full control.
Every request can include cost breakdown:
{
"show_query_costs": 2,
"prompt": "..."
}

Returns detailed token usage and costs:
{
"content": "Generated content...",
"costs": {
"grand_totals": {
"input_tokens": 4370,
"output_tokens": 2156,
"cost": 0.018765
},
"phases": {
"generation": {"cost": 0.008234},
"qa": {"cost": 0.003456},
"gran_sabio": {"cost": 0.005678}
}
}
}
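Reading the breakdown programmatically is straightforward; a small sketch using the response shape shown above:

```python
# Pull the grand total and per-phase costs out of a response like the one above.
def summarize_costs(response: dict) -> None:
    totals = response["costs"]["grand_totals"]
    print(f"Total: ${totals['cost']:.4f} "
          f"({totals['input_tokens']} in / {totals['output_tokens']} out tokens)")
    for phase, data in response["costs"]["phases"].items():
        print(f"  {phase}: ${data['cost']:.4f}")

summarize_costs({
    "content": "Generated content...",
    "costs": {
        "grand_totals": {"input_tokens": 4370, "output_tokens": 2156, "cost": 0.018765},
        "phases": {"generation": {"cost": 0.008234}, "qa": {"cost": 0.003456},
                   "gran_sabio": {"cost": 0.005678}},
    },
})
```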
}This project was originally called BioAI Unified - "Bio" for biography (its first use case was validating AI-generated biographies) and "Unified" because it brought together multiple AI providers into a single, coherent QA system.
However, "BioAI" consistently caused confusion. People assumed this was a biomedical or bioinformatics tool, expecting features for DNA analysis or drug discovery. The name created friction before the tool could even be evaluated.
The new name directly reflects what makes this engine unique:
"Gran Sabio" (Spanish for "Great Sage") is not just a brand - it's a core architectural component. When multiple AI models disagree during quality evaluation, a premium reasoning model called the Gran Sabio (the wise arbiter) steps in to make the final decision. This concept of a "council of sages" deliberating on content quality is central to how the system works.
"LLM" (Large Language Model) clarifies that this is AI infrastructure for text generation - not a fantasy game, not biomedicine, but a practical tool for orchestrating language models.
The result: a name that immediately tells you what you're getting - an AI content pipeline with a wise, multi-model arbitration system at its heart.
Previous name: BioAI Unified (2024). Rebranded to Gran Sabio LLM in January 2025.
This project is actively developed. If you find it useful:
- Star this repo to follow updates and new features
- Follow me on social media for development insights, AI tips, and early announcements about the upcoming Cloud version
Find my social links on my GitHub profile.
MIT License - see LICENSE for details.
Trust your AI output.
Let multiple models validate before you ship.