| Feature | Description | Status |
|---|---|---|
| Smart OCR | Extract text from scanned PDFs with 98%+ accuracy | ✅ Ready |
| Auto Classification | Categorize documents using AI-powered analysis | ✅ Ready |
| RAG Pipeline | Answer questions using document context | ✅ Ready |
| Multi-Model Support | Compare TinyLlama, Phi-2, and Mistral performance | ✅ Ready |
| Optimized Processing | GPU-accelerated with batch processing | ✅ Ready |
| Secure Handling | PII detection and redaction capabilities | 🚧 Coming |
### Financial Services
- Mortgage Processing: Extract key terms from loan documents
- Contract Analysis: Identify important clauses and conditions
- Compliance Checking: Ensure documents meet regulatory requirements
### Legal Operations
- Document Discovery: Search through large document sets
- Contract Review: Extract and analyze key terms
- Due Diligence: Automated document verification
### Enterprise Solutions
- Invoice Processing: Extract line items and totals
- Report Generation: Summarize lengthy documents
- Knowledge Management: Build searchable document repositories
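All of these use cases reduce to pulling structure out of raw text. As a toy illustration of the invoice case, here is a minimal, self-contained sketch (plain regex, not the suite's OCR/LLM extractor) that pulls line items and a total from invoice-like text:

```python
import re

def extract_line_items(text):
    """Toy extractor: find 'description  $amount' lines and sum a total.

    Regex-only illustration; the suite itself uses OCR + LLM extraction.
    """
    items = []
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s+\$([\d,]+\.\d{2})\s*$", line)
        if m and not m.group(1).lower().startswith("total"):
            items.append((m.group(1), float(m.group(2).replace(",", ""))))
    total = round(sum(amount for _, amount in items), 2)
    return items, total

invoice = """
Consulting services    $1,200.00
Software license       $350.50
Total                  $1,550.50
"""
items, total = extract_line_items(invoice)
print(items)
print(total)
```

The computed total can then be checked against the printed "Total" line as a simple consistency test.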
```mermaid
graph LR
    A[PDF Input] --> B[OCR Engine]
    B --> C[Classifier]
    C --> D[Chunking]
    D --> E[Embeddings]
    E --> F[Vector Store]
    F --> G[LLM + RAG]
    G --> H[Answer]
    style A fill:#e1f5fe
    style H fill:#c8e6c9
```
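The middle stages of the pipeline can be sketched end-to-end in plain Python. The snippet below is a toy model of the chunk → embed → retrieve steps, using bag-of-words counts in place of real embeddings and a list in place of a vector store; the function names are illustrative, not the suite's API:

```python
import math
from collections import Counter

def chunk_text(words, chunk_size=8, overlap=2):
    """Split a token list into overlapping chunks (mirrors chunk_size/overlap)."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

def embed(words):
    """Toy 'embedding': a bag-of-words Counter standing in for a dense vector."""
    return Counter(w.lower() for w in words)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Return the top-k chunks most similar to the query."""
    q = embed(query.split())
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = ("the interest rate on this loan is fixed at five percent "
       "the borrower agrees to monthly payments over thirty years").split()
chunks = chunk_text(doc)
best = retrieve("what is the interest rate", chunks, k=1)[0]
print(" ".join(best))
```

In the real pipeline, the retrieved chunks are passed to the LLM as context rather than returned directly.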
| Requirement | Version |
|---|---|
| Python | 3.8+ |
| CUDA | 11.8+ (for GPU) |
| RAM | 8GB minimum |
| Storage | 10GB free |
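A quick way to verify the Python requirement before installing is a check against `sys.version_info`; the CUDA and RAM requirements are hardware-specific and not covered by this sketch:

```python
import sys

# The suite requires Python 3.8+ (see the requirements table above).
MIN_VERSION = (3, 8)

if sys.version_info[:2] >= MIN_VERSION:
    print("Python version OK:", sys.version.split()[0])
else:
    raise SystemExit(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ required")
```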
```bash
# Clone the repository
git clone https://github.com/ShamsRupak/ai-doc-processing-suite.git
cd ai-doc-processing-suite

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download models
python scripts/download_models.py
```

```python
from doc_processor import DocumentPipeline

# Initialize pipeline
pipeline = DocumentPipeline(model="tinyllama")

# Process document
result = pipeline.process("data/sample_loan.pdf")

# Query the document
answer = pipeline.query("What is the interest rate?")
print(f"Answer: {answer}")
```

| Document Type | Pages | Processing Time | Accuracy |
|---|---|---|---|
| Loan Agreement | 10 | 2.3s | 98.5% |
| Bank Statement | 5 | 1.1s | 99.2% |
| Contract | 15 | 3.5s | 97.8% |
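The per-document figures above translate into throughput as pages divided by processing time; a quick calculation over the table:

```python
# Pages-per-second derived from the benchmark table above.
benchmarks = {
    "Loan Agreement": (10, 2.3),   # (pages, seconds)
    "Bank Statement": (5, 1.1),
    "Contract": (15, 3.5),
}

for name, (pages, seconds) in benchmarks.items():
    print(f"{name}: {pages / seconds:.1f} pages/sec")
```

All three document types land in the same 4–5 pages/sec range, so throughput is roughly linear in page count.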
```
ai-doc-processing-suite/
├── src/
│   ├── ocr/
│   │   ├── extract_text.py
│   │   └── preprocess.py
│   ├── classification/
│   │   ├── classifier.py
│   │   └── models.py
│   ├── retrieval/
│   │   ├── rag_pipeline.py
│   │   ├── embeddings.py
│   │   └── vector_store.py
│   └── llm/
│       ├── model_loader.py
│       └── prompts.py
├── data/
│   ├── sample_loan.pdf
│   ├── bank_statement.pdf
│   └── contract.pdf
├── tests/
├── notebooks/
│   └── demo.ipynb
├── requirements.txt
├── config.yaml
└── README.md
```
```python
from src.ocr import extract_text
from src.classification import DocumentClassifier
from src.retrieval import RAGPipeline

# Extract text from PDF
text = extract_text("path/to/document.pdf")

# Classify document
classifier = DocumentClassifier()
doc_type = classifier.classify(text)
print(f"Document type: {doc_type}")

# Set up RAG pipeline
rag = RAGPipeline(model="tinyllama")
rag.add_document(text, metadata={"type": doc_type})

# Query document
response = rag.query("What are the key terms?")
print(response)
```
```python
# Complex multi-document analysis
pipeline = DocumentPipeline(
    model="mistral-7b",
    chunk_size=512,
    overlap=128,
)

# Process multiple documents
documents = [
    "loan_agreement.pdf",
    "property_appraisal.pdf",
    "income_verification.pdf",
]
for doc in documents:
    pipeline.add_document(doc)

# Cross-document queries
questions = [
    "What is the total loan amount?",
    "Compare the appraised value with the loan amount",
    "Verify the borrower's income",
]
for q in questions:
    answer = pipeline.query(q)
    print(f"Q: {q}\nA: {answer}\n")
```

Create a `config.yaml` file:
```yaml
# Model Configuration
model:
  name: "tinyllama"
  quantization: "8bit"
  max_tokens: 512

# OCR Settings
ocr:
  engine: "tesseract"
  language: "eng"
  dpi: 300

# RAG Configuration
rag:
  chunk_size: 512
  chunk_overlap: 128
  retriever_k: 5

# Performance
performance:
  batch_size: 32
  use_gpu: true
  cache_embeddings: true
```

| Model | Accuracy | Speed (docs/min) | Memory (GB) | Cost |
|---|---|---|---|---|
| TinyLlama 1.1B | 85% | 45 | 2.5 | Free |
| Phi-2 2.7B | 92% | 30 | 4.0 | Free |
| Mistral 7B | 96% | 15 | 8.0 | Free |
| GPT-3.5 | 98% | 60 | N/A (hosted API) | $$$ |
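For the local models, the trade-off in the table can be expressed as a simple selection rule: pick the most accurate model whose memory footprint fits your GPU or RAM budget. The numbers are copied from the table; the helper itself is illustrative, not part of the suite:

```python
# (accuracy %, docs/min, memory GB) for the local models in the table above.
models = {
    "TinyLlama 1.1B": (85, 45, 2.5),
    "Phi-2 2.7B": (92, 30, 4.0),
    "Mistral 7B": (96, 15, 8.0),
}

def best_local_model(memory_budget_gb):
    """Most accurate local model whose memory footprint fits the budget."""
    fitting = {name: spec for name, spec in models.items()
               if spec[2] <= memory_budget_gb}
    if not fitting:
        return None
    return max(fitting, key=lambda name: fitting[name][0])

print(best_local_model(4.0))   # Phi-2 fits a 4 GB budget
print(best_local_model(16.0))  # Mistral when memory allows
```

Speed cuts the other way (TinyLlama is 3x faster than Mistral), so a throughput-bound deployment might invert the rule.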
We love contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.


