A Flow-Engineered Agent that lifts LLM code generation from ~30% to ~80% pass rate on hard problems through iterative, sandbox-validated repair loops.
Large Language Models (LLMs) often generate code that looks correct but fails on edge cases or runtime constraints. "Zero-shot" prompting hits a ceiling (~30-40% on hard problems).
Instead of asking once, AlphaKhulnasoft treats code generation as a Search & Repair problem.
- Analyze: Semantic parsing of constraints (System 2 thinking).
- Generate: Drafts initial solution.
- Adversarial Test: Runs code in a secure `subprocess` sandbox.
- Root Cause Analysis: Feeds specific error logs (stderr) back to the agent.
- Iterative Repair: Loops until success or max retries.
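To make the loop concrete, here is a minimal, self-contained sketch of the Analyze → Generate → Test → Repair cycle. It illustrates the idea rather than AlphaKhulnasoft's actual API: `ask_llm` is a placeholder for whatever model call you use, and the sandbox is reduced to a bare `subprocess` run with a timeout.

```python
import subprocess
import sys
import tempfile

MAX_RETRIES = 5

def run_in_sandbox(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute candidate code in a separate Python process; return (passed, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, f"TimeoutExpired: exceeded {timeout}s"
    return proc.returncode == 0, proc.stderr

def solve(problem: str, ask_llm) -> str | None:
    """Search & Repair: generate, test in the sandbox, feed stderr back, retry."""
    plan = ask_llm(f"Analyze the constraints of this problem:\n{problem}")          # Analyze
    code = ask_llm(f"Write a Python solution.\nProblem: {problem}\nPlan: {plan}")   # Generate
    for _ in range(MAX_RETRIES):
        passed, stderr = run_in_sandbox(code)                                       # Adversarial Test
        if passed:
            return code                                                             # Solved
        hypothesis = ask_llm(f"The code failed with:\n{stderr}\nDiagnose the root cause.")  # Root Cause
        code = ask_llm(f"Apply a fix.\nDiagnosis: {hypothesis}\nCode:\n{code}")     # Iterative Repair
    return None  # give up after MAX_RETRIES
```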
| Metric | Zero-Shot (Baseline) | AlphaKhulnasoft (Iter 5) | Improvement |
|---|---|---|---|
| Pass Rate | 30% | 80% | +166% |
| Logic | Implicit | Chain-of-Thought | N/A |
| Safety | None | Sandboxed | ✅ |
(See repair_curve.png for the full trajectory)
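For clarity, the Improvement column is relative to the baseline pass rate, not a difference in percentage points:

```python
baseline, flow_engineered = 0.30, 0.80   # pass rates from the table above
relative_gain = (flow_engineered - baseline) / baseline
# 0.50 / 0.30 ≈ 1.66, i.e. the "+166%" reported in the table
```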
```mermaid
graph TD
A[Problem] --> B(Semantic Analyzer)
B --> C{Code Generator}
C --> D[Sandbox Execution]
D -->|Pass| E[✅ Solved]
D -->|Fail| F[Root Cause Agent]
F -->|Hypothesis| C
```
```bash
git clone https://github.com/KhulnaSoft/alphakhulnasoft.git
cd alphakhulnasoft
uv sync
cp .env.example .env  # Add your OPENAI_API_KEY
```

Bootstrap a hard dataset using the LLM itself:

```bash
uv run python -m alphakhulnasoft.dataset_gen
```

Execute the flow-engineering benchmark:

```bash
uv run python -m alphakhulnasoft.benchmark data/hard_mode.jsonl
```

Generate the efficiency report and visualization:

```bash
uv run python -m alphakhulnasoft.visualizer results_latest.json
```

If you prefer containerized execution:
```bash
# 1. Build and Run the benchmark
docker compose run alphakhulnasoft

# 2. Run a specific command
docker compose run alphakhulnasoft python -m alphakhulnasoft.dataset_gen
```

Images are automatically published to the GitHub Container Registry:

```bash
docker pull ghcr.io/khulnasoft/alphakhulnasoft:main
```

We use modern tooling to ensure high code quality:
- Linting & Formatting: `ruff`
- Type Checking: `mypy`
- Testing: `pytest`
Run quality checks locally:
```bash
# Lint & Format check
uv run ruff check .
uv run ruff format --check .

# Type check
uv run mypy alphakhulnasoft

# Run tests
uv run pytest tests/
```

CI is automatically handled by GitHub Actions on every push to main.
- alphakhulnasoft/alpha_repair.py: Flow state and logic.
- alphakhulnasoft/prompts.py: The specialized personas (Architect, Debugger).
- alphakhulnasoft/sandbox.py: Secure execution engine.
- alphakhulnasoft/evaluator.py: Scoring and metrics logic.
- alphakhulnasoft/visualizer.py: Research-grade plotting.
- alphakhulnasoft/data_loader.py: Ingestion from local and Hugging Face.
- alphakhulnasoft/publisher.py: Results sharing to HF Hub.
AlphaKhulnasoft now integrates directly with the Hugging Face ecosystem:
- Load Datasets: Fetch popular coding benchmarks (HumanEval, MBPP) directly from the HF Hub using `openai_humaneval` or `mbpp`.
- Publish Results: Automatically push your benchmark reports to a HF Dataset repository.
- Serve Models: Use `huggingface/` model prefixes via `litellm` to run local or Inference API models.
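For orientation, the snippet below shows the two upstream pieces this list relies on: loading a benchmark from the HF Hub with the `datasets` library and routing a completion through `litellm`'s `huggingface/` prefix. It is a minimal sketch of those libraries' standard usage, not AlphaKhulnasoft's own loader or publisher API, and the model name is only an example.

```python
from datasets import load_dataset
import litellm

# Fetch a coding benchmark directly from the Hugging Face Hub.
humaneval = load_dataset("openai_humaneval", split="test")
print(humaneval[0]["prompt"][:80])

# Route a completion through litellm's huggingface/ prefix
# (example model name; needs an HF token or a local inference endpoint).
response = litellm.completion(
    model="huggingface/bigcode/starcoder2-15b",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)
```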
AlphaKhulnasoft is enterprise-ready with support for:
- Google Cloud Vertex AI: Run Gemini models with enterprise-grade security and reliability.
- Subprocess Isolation: Standard isolation for safe code execution (ready for optional Docker/nsjail hardening).
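As a rough illustration of what the subprocess boundary can be hardened into before reaching for Docker or nsjail, the sketch below caps CPU time and memory for the child process on POSIX systems. It is an assumption-laden example, not the project's `sandbox.py`; `untrusted_solution.py` is a hypothetical file name.

```python
import resource
import subprocess
import sys

def limit_child() -> None:
    # Applied inside the child before exec (POSIX only):
    # cap CPU time at 5 s and address space at 256 MiB.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024**2, 256 * 1024**2))

proc = subprocess.run(
    [sys.executable, "untrusted_solution.py"],   # hypothetical candidate file
    capture_output=True,
    text=True,
    timeout=10,              # wall-clock cap on top of the CPU limit
    preexec_fn=limit_child,
)
print(proc.returncode, proc.stderr[:200])
```

Swapping the raw `subprocess.run` call for a Docker or nsjail invocation is the optional hardening path the bullet above refers to.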
Built as a demonstration of System 2 AI Architecture.