I2P Meta-Reasoning System

Python 3.11+ DSPy Modal License: MIT

Strategic Technical Advisory for AI Agents - A comprehensive meta-reasoning system that transforms issue descriptions into structured strategic analysis for complex software development projects.

🎯 What is I2P?

I2P (Issue to Prompt) is an AI-powered meta-reasoning system designed to provide strategic technical advisory for complex software development projects. It analyzes issues across multiple dimensions and generates comprehensive strategic guidance with specific code navigation paths.

Key Capabilities

  • 🏗️ System Boundary Analysis - Maps issue scope and dependencies across system components
  • 🎯 Strategic Gap Analysis - Identifies what you have, what you need, and what's missing
  • 🗺️ Code Navigation Index - Provides specific file references and implementation pathways
  • 🔍 Multi-Language Vector Search - Searches across Rust, TypeScript, Solidity, and Documentation
  • 📚 Knowledge Base Integration - Semantic search across organizational knowledge repositories
  • 📋 PRD-Style Requirements - Structures complex issues into actionable requirements
  • 🤖 Trained Query Generation - GEPA module learns to generate optimized search queries from training data
  • 🔄 Continuous Model Training - Automated training pipeline with GitHub Actions integration

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • uv (installed automatically if missing)
  • OpenRouter API Key (for LLM access)
  • Qdrant Vector Database (local or cloud)
  • Modal Account (for embedding services)

Installation

# Clone the repository
git clone git@github.com:ardaglobal/i2p.git
cd i2p

# Complete setup with virtual environment (uv-based)
make setup

# Or step-by-step:
make venv      # Create virtual environment
make install   # Install dependencies

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys:
# OPENROUTER_API_KEY=your_openrouter_key
# QDRANT_URL=your_qdrant_url
# QDRANT_API_KEY=your_qdrant_key

# Activate virtual environment
source .venv/bin/activate

# Verify setup
make check-env

Basic Usage

# Ingest your codebase for vector search
make ingest

# (Optional) Train GEPA search query model
make train

# Analyze an issue
make i2p ISSUE='Implement privacy-preserving credit score verification using zero-knowledge proofs'

# Run demo with sample issues
make demo

# Check system health
make health

📁 Project Structure

i2p/
├── 📖 Makefile                  # Comprehensive command interface
├── 📝 CLAUDE.md                 # Development guidelines & best practices
│
├── modules/                     # Core I2P processing modules
│   ├── cli/                     # Command-line interface
│   │   └── 🧠 i2p_cli.py        # Main CLI entry point
│   ├── core/                    # Core processing pipeline
│   │   ├── 🔄 pipeline.py       # Main processing pipeline orchestrator
│   │   ├── 🎯 classifier.py     # System boundary analysis
│   │   ├── 📊 analyzer.py       # Strategic gap analysis
│   │   ├── 📝 generator.py      # Code navigation & output generation
│   │   ├── ✅ validation.py     # Input validation & error handling
│   │   ├── 🔍 vector_search.py  # Vector similarity search
│   │   └── 🤖 gepa.py           # GEPA search query generation module
│   ├── training/                # Training & model optimization
│   │   ├── 🎓 train_gepa.py         # GEPA model training script
│   │   ├── 🤖 gepa_trainer.py       # DSPy BootstrapFewShot trainer
│   │   ├── 📊 dataset_generator.py  # Automated dataset generation
│   │   ├── 📁 dataset_manager.py    # Dataset loading & splitting
│   │   ├── 📈 dataset_statistics.py # Dataset analysis & metrics
│   │   ├── 🔧 dataset_loader.py     # Dataset I/O operations
│   │   └── 🖥️ generate_dataset_cli.py # CLI for dataset generation
│   └── ingest/                  # Ingestion & embedding services
│       ├── core/
│       │   ├── 🔄 pipeline.py        # Multi-language ingestion pipeline
│       │   ├── ⚙️ config.py          # Ingestion configuration
│       │   ├── 🔗 embedding_service.py # Embedding generation
│       │   └── 📦 batch_processor.py  # Batch processing
│       ├── parsers/
│       │   ├── 🦀 rust_parser.py         # Rust code parsing & analysis
│       │   ├── 📘 typescript_parser.py   # TypeScript code parsing
│       │   ├── ⚡ solidity_parser.py     # Solidity contract parsing
│       │   └── 📄 documentation_parser.py # Documentation extraction
│       ├── services/
│       │   ├── 🔗 vector_client.py       # Qdrant vector database client
│       │   ├── 🚀 tei_service.py         # TEI embedding service (Modal L4 GPU)
│       │   ├── 🤖 modal_client.py        # Modal service client
│       │   ├── 🔍 enhanced_ranking.py    # Advanced search ranking
│       │   └── ✅ quality_validator.py   # Code quality validation
│       └── deploy/
│           └── 🚀 modal_deploy.py    # Modal deployment orchestrator
│
└── repos/                       # Target codebases for analysis
    ├── arda-credit/             # Rust blockchain infrastructure
    ├── arda-platform/           # TypeScript monorepo (Platform, Credit App, IDR)
    ├── arda-knowledge-hub/      # Documentation and knowledge base (Obsidian vault)
    ├── aig/                     # Arda Investment Group markdown documentation
    ├── arda-chat-agent/         # JavaScript/TypeScript chat agent implementation
    └── ari-ui/                  # JavaScript/TypeScript chat bot implementation

🛠️ Available Commands

🏗️ Setup & Installation

make venv             # Create virtual environment using uv
make install          # Install Python dependencies (creates venv if needed)
make sync             # Sync dependencies with uv (faster than install)
make setup            # Complete system setup (venv + install + check environment)
make check-env        # Verify environment variables and credentials

🗄️ Vector Database & Ingestion

make ingest           # Full ingestion pipeline (Rust + TypeScript + Solidity + Documentation)
make ingest-warmup    # Warm up Modal embedding service before ingestion
make ingest-search QUERY='text'  # Test vector search functionality
make vector-status    # Check Qdrant collections and vector counts

🧠 I2P Meta-Reasoning

make i2p ISSUE='your issue description'    # Run I2P analysis
make demo             # Run demo with sample Arda Credit issues
make examples         # Generate example analyses for common issues

🚀 Modal & Embedding Services

make modal-deploy     # Deploy Qwen3-Embedding-8B service to Modal (L4 GPU)
make modal-health     # Check Modal embedding service health
make modal-monitor    # Monitor GPU distribution across containers

🤖 Training & Model Management

make train                    # Train GEPA search query model with DSPy optimization
make train-eval               # Evaluate trained model on test set
make generate-dataset         # Generate training dataset from codebase
make train-clean              # Clean training cache and artifacts

⚙️ System Management

make health           # System health check (pipeline + vector search)
make test             # Run all tests
make clean            # Clean up generated files and caches

💡 Example Analyses

Zero-Knowledge Proof Implementation

make i2p ISSUE='Implement privacy-preserving credit score verification using zero-knowledge proofs in the Arda Credit loan approval process'

Output includes:

  • System boundary analysis identifying affected components
  • Strategic gap analysis of current ZK infrastructure vs requirements
  • Specific file references for implementation (contracts/src/, program/src/main.rs)
  • Step-by-step implementation roadmap

API Error Handling

make i2p ISSUE='Implement comprehensive error handling for API timeouts in the Arda Credit authentication service'

Output includes:

  • Code navigation to auth handlers (api/src/authentication_handlers.rs:45)
  • Strategic analysis of current error handling vs robust patterns
  • Implementation suggestions with middleware integration
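The file references in these outputs (for example api/src/authentication_handlers.rs:45) appear to follow a path:line shape. A minimal sketch of how a consuming agent might split such a reference, assuming an optional numeric :line suffix is the only structure present. FileRef and parse_file_ref are hypothetical names for illustration, not part of I2P:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """A code reference as it appears in I2P output (hypothetical helper type)."""
    path: str
    line: Optional[int] = None


def parse_file_ref(ref: str) -> FileRef:
    """Split 'path[:line]' into parts; a non-numeric suffix stays in the path."""
    path, sep, suffix = ref.rpartition(":")
    if sep and path and suffix.isdigit():
        return FileRef(path=path, line=int(suffix))
    return FileRef(path=ref)
```

Directory-style references such as contracts/src/ carry no line number, so the sketch returns them with line set to None.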

🏗️ Architecture

Processing Pipeline

graph LR
    A[Issue Input] --> B[Boundary Analysis]
    B --> C[Strategic Gap Analysis]
    C --> D[Vector Search]
    D --> E[Code Navigation Index]
    E --> F[Structured Output]

Components

  1. 🎯 Issue Classifier - Categorizes issues by type, complexity, and domain
  2. 📊 Strategic Analyzer - Performs gap analysis using a "have/need/missing" framework
  3. 🤖 GEPA Module - Trained DSPy module for generating optimized search queries from issue analysis
  4. 🔍 Vector Search - Semantic search across ingested codebases
  5. 📝 Output Generator - Creates structured markdown with code references
  6. 🚀 Modal Embedding Service - High-performance embedding generation (L4 GPU)
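The way these components hand work to one another can be sketched as a chain of stages that share one context. This is an illustrative outline only: the Pipeline class and the stub stage functions below are hypothetical stand-ins, and the real stages in modules/core/ are LLM- and vector-backed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
# Each stage reads the shared context and returns the keys it contributes.
Stage = Callable[[Context], Context]


@dataclass
class Pipeline:
    """Runs named stages in order, threading one context dict through them."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, issue: str) -> Context:
        context: Context = {"issue": issue}
        trace: List[str] = []
        for name, stage in self.stages:
            context.update(stage(context))
            trace.append(name)
        context["trace"] = trace
        return context


# Stub stages (assumptions): each returns placeholder data instead of calling a model.
def classify(ctx: Context) -> Context:
    return {"boundary": {"domain": "unknown"}}


def analyze_gaps(ctx: Context) -> Context:
    return {"gaps": {"have": [], "need": [], "missing": []}}


def generate_queries(ctx: Context) -> Context:
    return {"queries": [ctx["issue"]]}


def vector_search(ctx: Context) -> Context:
    return {"hits": []}


def render_output(ctx: Context) -> Context:
    return {"markdown": f"# Analysis\n\nIssue: {ctx['issue']}\n"}


def build_pipeline() -> Pipeline:
    return (
        Pipeline()
        .add("classifier", classify)
        .add("analyzer", analyze_gaps)
        .add("gepa", generate_queries)
        .add("vector_search", vector_search)
        .add("generator", render_output)
    )
```

Calling build_pipeline().run("some issue") returns the accumulated context, including a trace of the stage names in execution order.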

Supported Languages & Content Types

  • 🦀 Rust - Complete parsing including macros, traits, and async code
  • 📘 TypeScript - React components, hooks, utilities, and type definitions
  • ⚡ Solidity - Smart contracts, interfaces, and deployment scripts
  • 📚 Documentation - Markdown files, knowledge bases (Obsidian), technical documentation

🔧 Configuration

Environment Variables

# Required
OPENROUTER_API_KEY=your_openrouter_api_key
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_qdrant_api_key

# Optional
MODAL_TOKEN_ID=your_modal_token_id
MODAL_TOKEN_SECRET=your_modal_token_secret
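A quick way to confirm that the required variables are present before running anything. This is a sketch only; how make check-env performs its verification is not shown here, and missing_vars is a hypothetical helper name:

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the lists above.
REQUIRED_VARS = ("OPENROUTER_API_KEY", "QDRANT_URL", "QDRANT_API_KEY")
OPTIONAL_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_vars(env: Mapping[str, str], names: Sequence[str] = REQUIRED_VARS) -> List[str]:
    """Return the names that are unset or blank in the given environment mapping."""
    return [name for name in names if not env.get(name, "").strip()]
```

In a script, missing_vars(os.environ) returns an empty list when everything required is set; passing OPTIONAL_VARS as the second argument reports which Modal credentials are absent.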

Model Configuration

The system uses OpenRouter for LLM access with optimized model selection:

  • Grok-4-Fast: Primary model (8192 tokens)
  • Claude-3.5: Alternative model (4096 tokens)
  • GPT-4: Fallback option (4096 tokens)

Vector Database

  • Code Collections: arda_code_rust, arda_code_typescript, arda_code_solidity
  • Documentation Collection: arda_documentation (for knowledge bases and technical docs)
  • Embedding Model: Qwen3-Embedding-8B (4096 dimensions)
  • Chunk Size: 500 tokens with 50 token overlap (code), 6k-12k chars (documentation)
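The 500-token window with a 50-token overlap can be pictured as a sliding window that advances 450 tokens at a time. A minimal sketch, assuming tokens are already available as a list; the tokenizer the ingestion pipeline actually uses is not specified here, and chunk_tokens is a hypothetical name:

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

With these defaults a 1,000-token file yields three chunks starting at tokens 0, 450, and 900, so any construct that straddles a boundary appears whole in at least one chunk if it is shorter than the overlap.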

🤖 GEPA Training System

What is GEPA?

GEPA (Query Generation and Exploration for Prompt Augmentation) is a trained DSPy module that generates optimized search queries from issue descriptions. It uses DSPy BootstrapFewShot optimization to learn patterns from training data and produces domain-specific, codebase-aware queries.

Training Process

# Generate training dataset from codebase (optional)
make generate-dataset NUM_EXAMPLES=50 OUTPUT=custom_dataset.json

# Train the GEPA model
make train

# Evaluate model performance
make train-eval

Training Features

  • 📊 Semantic Similarity Metrics - Uses embedding-based cosine similarity for evaluation
  • 🔄 Vector Search Integration - Enriches training with real codebase context during optimization
  • 📈 BootstrapFewShot Optimization - Automatically generates few-shot examples from training data
  • 🎯 Domain-Specific Queries - Learns to generate queries using actual type/struct/function names
  • 💾 Model Persistence - Saves trained models to trained_model.json for reuse
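The embedding-based similarity metric comes down to comparing the embedding of a generated query with the embedding of a reference query. A minimal sketch of that comparison in plain Python; the embedding call itself is left out, and both query_matches and the 0.8 threshold are hypothetical choices for illustration, not values taken from the training code:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_matches(predicted: Sequence[float], reference: Sequence[float],
                  threshold: float = 0.8) -> bool:
    """Treat a generated query as acceptable when its embedding is close to the reference."""
    return cosine_similarity(predicted, reference) >= threshold
```

A boolean or float of this kind is the shape of value a few-shot optimizer typically consumes when deciding which bootstrapped examples to keep.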

Training Configuration

  • Model: OpenRouter API (Claude Sonnet 4.5, GPT-4o-mini, or Grok-4-fast)
  • Dataset: 93 examples (74 train / 9 val / 10 test) across backend and frontend domains
  • Metric: Semantic similarity using Qwen3-Embedding-8B (4096-dim cosine similarity)
  • Optimizer: DSPy BootstrapFewShot with 8-40 bootstrapped demos
  • Cache: Training results cached in .cache/training/ for faster iteration
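The 74 / 9 / 10 split of the 93 examples can be reproduced with a seeded shuffle. A sketch under the assumption that examples are split at random; dataset_manager.py may use a different strategy (for instance, stratifying by domain), and split_dataset is a hypothetical name:

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(examples: Sequence[T], n_train: int = 74, n_val: int = 9,
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically, then cut into train / validation / test."""
    if n_train + n_val > len(examples):
        raise ValueError("not enough examples for the requested split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps global random state untouched
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test
```

Fixing the seed keeps the test set stable between runs, which matters when comparing models trained at different times.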

Automated Training Pipeline

The system includes a GitHub Actions workflow (.github/workflows/gepa-training.yml) that:

  • Triggers weekly or after vector ingestion completes
  • Trains GEPA model with latest codebase context
  • Evaluates accuracy and commits improved models
  • Provides comprehensive training reports with metrics

🔍 Vector Search Features

Enhanced Ranking

  • Semantic similarity scored by cosine similarity between query and chunk embeddings
  • File type relevance boosting
  • Recency scoring for recently modified files
  • Dependency graph awareness for related components
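One way such signals could be combined is an additive score on top of the raw similarity. The sketch below is illustrative only: the weights, the half-life, and the Hit type are assumptions rather than values from enhanced_ranking.py, and dependency-graph awareness is left out for brevity.

```python
import os
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    path: str
    similarity: float  # cosine similarity reported by the vector store
    age_days: float    # days since the file was last modified


# Hypothetical tuning values.
FILE_TYPE_BOOST = {".rs": 0.10, ".ts": 0.05}
RECENCY_WEIGHT = 0.10
RECENCY_HALF_LIFE_DAYS = 30.0


def rerank_score(hit: Hit) -> float:
    """Similarity plus a file-type boost plus an exponentially decaying recency bonus."""
    extension = os.path.splitext(hit.path)[1]
    boost = FILE_TYPE_BOOST.get(extension, 0.0)
    recency = 0.5 ** (hit.age_days / RECENCY_HALF_LIFE_DAYS)
    return hit.similarity + boost + RECENCY_WEIGHT * recency


def rerank(hits: Iterable[Hit]) -> List[Hit]:
    """Order hits from highest to lowest combined score."""
    return sorted(hits, key=rerank_score, reverse=True)
```

Keeping the boosts small relative to the similarity range means they mostly break ties rather than override a clearly better semantic match.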

Quality Validation

  • Syntax verification for all ingested code
  • Content filtering removing comments and empty files
  • Deduplication preventing redundant vector storage
  • Error handling with graceful fallbacks
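Deduplication and empty-content filtering are often done with a content hash. A minimal sketch of that idea, assuming whitespace-insensitive matching is acceptable; quality_validator.py may apply different rules, and the function names are hypothetical:

```python
import hashlib
from typing import Iterable, List, Set


def normalize(chunk: str) -> str:
    """Collapse runs of whitespace so trivially different copies hash the same."""
    return " ".join(chunk.split())


def deduplicate(chunks: Iterable[str]) -> List[str]:
    """Drop empty chunks and later copies of content already seen, keeping first occurrences."""
    seen: Set[str] = set()
    unique: List[str] = []
    for chunk in chunks:
        normalized = normalize(chunk)
        if not normalized:
            continue  # nothing to embed
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Hashing before embedding avoids paying for embeddings of content that would be stored twice.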

📊 Performance & Monitoring

System Health Checks

make health         # Pipeline status & vector search connectivity
make vector-status  # Detailed vector database metrics

Health Check Coverage:

  • Pipeline component initialization status
  • Vector search connectivity and basic query test
  • Component readiness verification

Note: Health checks verify system readiness but do not include end-to-end DSPy module testing or embedding service response time measurement.

Key Metrics

  • Vector Collections: 50K+ code chunks plus documentation across all collections
  • Search Latency: <200ms for semantic queries
  • Embedding Generation: ~45 embeddings/sec via Modal TEI (L4 GPU)
  • Pipeline Processing: Varies by model (2-10s for boundary analysis, 5-30s for gap analysis, 3-15s for navigation index)
  • GEPA Training: ~5-15 minutes on 93 examples with vector context enrichment
  • Documentation Chunks: Intelligent section grouping (6k-12k chars per chunk)

🎯 Use Cases

For Development Teams

  • Feature Planning - Strategic analysis of complex feature requirements
  • Technical Debt - Identification of gaps and missing components
  • Code Navigation - Quick discovery of relevant implementation files
  • Architecture Decisions - Boundary analysis for system design choices

For AI Agents

  • Context Enhancement - Rich markdown output optimized for agent consumption
  • Code Discovery - Specific file paths and line references for implementation
  • Strategic Guidance - Structured requirements and implementation pathways
  • Multi-Language Support - Comprehensive codebase understanding

🛡️ Security & Best Practices

Code Quality Guidelines

  • Files must be under 500 lines (strict enforcement)
  • Single responsibility principle for all classes
  • Comprehensive error handling and validation
  • Security-first approach with no exposed secrets

Development Standards

  • OOP-First Design - Every functionality in dedicated classes
  • Modular Architecture - Lego-like component composition
  • DSPy Integration - Optimized LLM interactions with structured signatures
  • Vector Search - Semantic code discovery across languages

Pipeline Behavior Notes

Validation Strategy

  • Output validation is skipped - The system relies on vector search quality assurance rather than strict output validation
  • Confidence scores are provided as metadata but don't block pipeline execution
  • This enables faster processing while maintaining quality through context-aware LLM reasoning
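The "metadata, not a gate" behaviour described above can be sketched as follows. All names here (StageResult, AnalysisReport, collect, and the 0.5 warning level) are hypothetical and only illustrate the pattern of recording confidence without stopping the run:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StageResult:
    output: str
    confidence: float


@dataclass
class AnalysisReport:
    sections: Dict[str, str] = field(default_factory=dict)
    metadata: Dict[str, float] = field(default_factory=dict)
    warnings: List[str] = field(default_factory=list)


def collect(results: Dict[str, StageResult], warn_below: float = 0.5) -> AnalysisReport:
    """Keep every stage's output; record confidence and flag low values without raising."""
    report = AnalysisReport()
    for name, result in results.items():
        report.sections[name] = result.output
        report.metadata[f"{name}_confidence"] = result.confidence
        if result.confidence < warn_below:
            report.warnings.append(f"low confidence in {name}: {result.confidence:.2f}")
    return report
```

A caller that wants stricter behaviour can inspect report.warnings and decide for itself whether to rerun a stage.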

Cross-System Search Protection

  • Vector search includes cross-system contamination prevention
  • When analyzing I2P system issues, search is automatically scoped to I2P codebase only
  • Documentation search is excluded by default from code searches to prevent pattern contamination
  • Use search_with_documentation_priority() for architectural context when needed
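The default exclusion of documentation can be thought of as a choice of which collections a query is sent to. A sketch of that selection logic, using the collection names listed under Vector Database; select_collections is a hypothetical helper and does not reproduce the signature or behaviour of search_with_documentation_priority():

```python
from typing import List

CODE_COLLECTIONS = ("arda_code_rust", "arda_code_typescript", "arda_code_solidity")
DOCUMENTATION_COLLECTION = "arda_documentation"


def select_collections(documentation_priority: bool = False) -> List[str]:
    """Code-only by default; with documentation priority, search docs first, then code."""
    if documentation_priority:
        return [DOCUMENTATION_COLLECTION, *CODE_COLLECTIONS]
    return list(CODE_COLLECTIONS)
```

Scoping searches to the I2P codebase itself would need an additional filter that is not modelled in this sketch.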

Collection Strategy

  • Code Collections: arda_code_rust, arda_code_typescript, arda_code_solidity
  • Documentation Collection: arda_documentation (knowledge bases, technical docs, architectural overviews)
  • Default searches exclude documentation to focus on implementation patterns
  • Gap analysis uses documentation-priority search for existing system understanding
  • Knowledge base integration provides organizational context and research findings

🤝 Contributing

  1. Follow the guidelines in CLAUDE.md
  2. Ensure all files remain under 500 lines
  3. Use single responsibility principle
  4. Add comprehensive tests for new features
  5. Maintain security best practices

📚 Documentation

  • docs/architecture/ARCHITECTURE.md - 🆕 Comprehensive architecture guide and navigation hub
  • docs/ - Complete technical documentation (architecture, modules, guides)
  • CLAUDE.md - Development guidelines and coding standards
  • Makefile - Complete command reference with examples
  • Module docstrings - Detailed API documentation for each component
  • Example outputs - Run make examples to generate analyses in examples/ directory (not checked into repo)

🔗 Related Projects

  • Arda Credit - Privacy-preserving credit infrastructure (Rust)
  • Arda Platform - Monorepo with Platform, Credit App, and IDR (TypeScript)
  • Arda Knowledge Hub - Organizational knowledge base and documentation (Markdown/Obsidian)
  • Modal Platform - Serverless GPU infrastructure for embeddings
  • Qdrant - Vector database for semantic search

📄 License

MIT License - see LICENSE file for details.


I2P Meta-Reasoning System - Transforming complex issues into strategic technical guidance with AI-powered analysis and code navigation.
