Commit 5e77e22

Rename to docai_toolkit, add packaging + Docker hardening, and update UI/RAG pipeline
- rename library to docai_toolkit and persist config to ~/.docai/config.json
- add pyproject.toml, Dockerfile (non-root, GUI-ready), docker/pypi workflows
- clean out legacy training scripts, update README with install/Docker/XQuartz notes
- harden HF/OCR/RAG clients (timeouts, batching, safe index load), move UI to threads
- adjust tests for new package and optional deps
1 parent 22492f3 commit 5e77e22

23 files changed, +1117 -225 lines changed

.dockerignore

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+.git
+.gitignore
+__pycache__
+.pytest_cache
+.venv
+env
+venv
+*.pyc
+*.pyo
+*.pyd
+*.db
+*.sqlite
+*.log
+*.DS_Store

.github/workflows/docker.yml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+name: Build Docker image
+
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+    branches: [main, master]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Login to DockerHub
+        if: secrets.DOCKERHUB_USERNAME && secrets.DOCKERHUB_TOKEN
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build image
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          push: ${{ secrets.DOCKERHUB_USERNAME && secrets.DOCKERHUB_TOKEN }}
+          tags: ${{ secrets.DOCKERHUB_USERNAME }}/docai-toolkit:latest

.github/workflows/pypi.yml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags:
+      - "v*"
+
+permissions:
+  contents: read
+  id-token: write # required for trusted publishing
+
+jobs:
+  build-and-publish:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install build tooling
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install build
+
+      - name: Build wheel and sdist
+        run: |
+          python -m build
+
+      - name: Publish to PyPI via Trusted Publisher
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          packages-dir: dist

.gitignore

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# Byte-compiled / cache
+__pycache__/
+*.py[cod]
+
+# Virtual env
+.venv/
+env/
+venv/
+
+# OS junk
+.DS_Store
+
+# IDE
+.idea/
+.vscode/
+.pytest_cache/

Dockerfile

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Base Python image
+FROM python:3.12-slim
+
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1
+
+WORKDIR /app
+
+# System deps for OCR / pdf rendering
+RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
+    tesseract-ocr \
+    poppler-utils \
+    build-essential \
+    libgl1 \
+    python3-tk \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Python deps
+COPY requirements.txt .
+RUN python -m pip install --upgrade pip && python -m pip install -r requirements.txt
+
+# Copy source
+COPY . .
+
+# Create non-root user and switch
+RUN useradd -m appuser
+USER appuser
+
+# Default command: launch the viewer UI
+CMD ["python", "pdf_viewer_app.py"]

README.md

Lines changed: 111 additions & 44 deletions
@@ -1,56 +1,123 @@
-# Contextual PDF Search
+# DocAI Toolkit
 
-This scripta enable you to ask natural questions about PDF document(s) and get answers generated by a (S)LLM of your choice. It leverages the model's natural language processing capabilities to understand your queries and provide relevant information from the PDF, building a RAG and responds to natural questions.
+Local OCR + Markdown + RAG with optional Hugging Face/custom endpoints. Renamed to avoid PyPI name collisions (`docai-toolkit` package import is `docai_toolkit`).
 
-## Features
-
-* **Question-Answering:** Ask questions in natural language about the content of your PDF.
-* **Hugging Face Integration:** Leverages the Hugging Face Transformers library to access a wide range of state-of-the-art LLM models.
-* **Sentence Embeddings:** Uses sentence embeddings to efficiently find the most relevant parts of the PDF to answer your questions.
-* **Automatic Dependency Management:** Checks and installs required libraries to ensure a smooth setup.
+- `pdf_viewer_app.py`: Tkinter UI to open PDFs, run OCR → Markdown, and “chat” via retrieval + generation.
+- `docai_toolkit/`: library for OCR (local Tesseract or remote endpoint), embedding/indexing (local or remote), and simple chat over FAISS.
+- Status: under active development; APIs and defaults may change as the AI ecosystem moves quickly.
 
 ## Requirements
 
-* **Python 3.9 or higher:** Please ensure you have a compatible version of Python installed.
-* **Hugging Face Account:** You'll need a Hugging Face account to access their models. You can create one for free at [https://huggingface.co/](https://huggingface.co/).
-* **Libraries:** The following Python libraries are required and will be installed automatically if not present:
-    * `langchain`
-    * `transformers`
-    * `accelerate`
-    * `bitsandbytes`
-    * `sentence_transformers`
-
-## Usage
+- Python 3.9+
+- Runtime deps vary by script:
+  - Viewer: `PyPDF2`, `reportlab` (for saving)
+  - RAG scripts: `langchain`, `langchain-community`, `transformers`, `accelerate`, `bitsandbytes`, `sentence_transformers`
 
-1. **Save the Script:** Download this script and save it as `pdf_qa.py`.
+Install everything:
 
-2. **Install Dependencies:** Although the script installs and updates all needed libraries, it sometimes fails to do so. In that case open your terminal or command prompt and run:
-    ```bash
-    pip install -r requirements.txt
-    ```
+```bash
+pip install -r requirements.txt
+# or editable install
+pip install -e .
+```
 
-3. **Run the Script:**
-    ```
-    python3 pdf_qa.py [model_id] [pdf_file_path]
-    ```
-    Replace `[model_id]` with the Hugging Face model ID you want to use (e.g., `mistralai/Mistral-7B-Instruct-v0.1`). You can find a list of available models at [https://huggingface.co/models](https://huggingface.co/models).
-    Replace `[pdf_file_path]` with the path to your PDF file(s).
-
-4. **Ask Questions:**
-    You'll be prompted to enter questions. Type your questions in natural language and press Enter. The script will provide answers based on the content of the PDF.
+## Usage
 
-5. **Exit:**
-    Type `exit` and press Enter to quit the script.
+### GUI Viewer
+
+```bash
+python pdf_viewer_app.py
+```
+
+- Open: loads all pages of a PDF into the text area.
+- Save As: renders the text area content into a new PDF (requires `reportlab`).
+- OCR → Markdown: run OCR on a PDF and save Markdown to the configured output directory (local Tesseract or remote OCR endpoint via HF/custom).
+- Chat: build a quick FAISS index over a chosen Markdown file and query it with a selected HF model (remote endpoint or local HF pipeline).
+- Settings: set HF token, optional custom endpoints (OCR/embeddings/LLM), model choices, and output directory. Settings persist to `~/.docai/config.json`. Env vars (`HF_TOKEN`, `HUGGINGFACEHUB_API_TOKEN`, `DOC_AI_OUTPUT_DIR`) are auto-read.
+
+### Hugging Face onboarding (fast path)
+
+1. Create a Hugging Face access token: https://huggingface.co/settings/tokens (choose “Read” or “Write” as needed).
+2. Export it so the app can auto-load it:
+   ```bash
+   export HF_TOKEN=your_token_here
+   # or HUGGINGFACEHUB_API_TOKEN=your_token_here
+   ```
+3. Pick models (examples):
+   - OCR: point the OCR endpoint at a hosted OCR model (HF Inference API URL).
+   - Embeddings: e.g., `sentence-transformers/all-mpnet-base-v2` via Inference Endpoints (text-embeddings task) or local.
+   - LLM: e.g., `mistralai/Mistral-7B-Instruct-v0.1` via Inference Endpoints or local HF pipeline.
+4. Start the app, open Settings, and paste endpoints/models if you didn’t set env vars. Output dir can be set there as well.
+
+Environment variables:
+- `HF_TOKEN` / `HUGGINGFACEHUB_API_TOKEN` / `DOC_AI_HF_TOKEN`: auth token (auto-loads into LLM + embeddings).
+- `DOC_AI_OUTPUT_DIR`: default output directory for OCR/Markdown.
+
+### Docker
+
+Build:
+```bash
+docker build -t docai-toolkit .
+```
+
+Run (GUI requires X/Wayland forwarding; for headless tasks, override CMD):
+```bash
+docker run --rm -v $PWD:/data docai-toolkit python -m pytest -q
+# or override to run OCR in batch using the library CLI you add
+```
+
+macOS GUI via XQuartz:
+1) Install/start XQuartz (`brew install --cask xquartz`; enable “Allow connections from network clients” in prefs and restart).
+2) Allow local clients: `xhost +localhost`
+3) Run:
+   ```bash
+   docker run --rm -it \
+     -e DISPLAY=host.docker.internal:0 \
+     -v /tmp/.X11-unix:/tmp/.X11-unix \
+     docai-toolkit
+   ```
+For day-to-day use, running natively is simpler; use the container when you need an isolated, reproducible environment.
+
+## Tests
+
+Basic round-trip test for the viewer’s PDF writer:
+
+```bash
+pytest
+```
+
+`reportlab` must be installed for the test to run.
+
+## OCR + RAG (docai_toolkit/)
+
+- OCR: pluggable clients (`RemoteOcrClient` for HF/custom endpoints, `TesseractOcrClient` local fallback) that turn PDFs into Markdown (`ocr/pipeline.py`).
+- RAG: build a FAISS index from Markdown (`rag/index.py`), then chat using a chosen HF model (`rag/chat.py`).
+- Config: lightweight dataclasses in `docai_toolkit/config.py` for selecting providers/models; saved at `~/.docai/config.json`.
+- Remote-friendly: use HF token + model ids by default; configs allow custom OCR/embedding/generation endpoints. FAISS runs locally for fast retrieval.
+
+To experiment locally:
+
+```bash
+# OCR to Markdown (Tesseract fallback requires pytesseract + pdf2image installed)
+python - <<'PY'
+from pathlib import Path
+from docai_toolkit.ocr import TesseractOcrClient, run_ocr_to_markdown
+client = TesseractOcrClient()
+md_path = run_ocr_to_markdown(Path("your.pdf"), Path("outputs"), client)
+print("Saved:", md_path)
+PY
+
+# Build index + chat (requires sentence_transformers + transformers)
+python - <<'PY'
+from pathlib import Path
+from docai_toolkit.rag import build_index_from_markdown, chat_over_corpus, load_index
+index_path = Path("outputs/faiss_index")
+db = build_index_from_markdown([Path("outputs/your.md")], persist_path=index_path)
+print(chat_over_corpus(db, "What is this document about?", model_id="mistralai/Mistral-7B-Instruct-v0.1"))
+# Later: db = load_index(index_path)
+PY
+```
 
 ## License
 
-This code is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. See `LICENSE.md` for details.
-
-## Contributing
-
-Contributions are welcome! Please feel free to fork this repository and submit pull requests.
-
-## Disclaimer
-
-This script is provided as-is for educational and personal use. It is not intended for production or commercial applications. The author assumes no liability for any consequences arising from the use of this script.
-
+CC BY-NC-SA 4.0 (see `LICENSE`).
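
Reviewer sketch (not part of the commit): the README lists three token variables, and `AppConfig.from_env` in `docai_toolkit/config.py` checks them as `HF_TOKEN`, then `HUGGINGFACEHUB_API_TOKEN`, then `DOC_AI_HF_TOKEN`, taking the first non-empty value. The standalone snippet below mirrors that order so it can be tried without installing the package. `resolve_hf_token` and `TOKEN_ENV_VARS` are hypothetical names introduced here for illustration only.

```python
import os
from typing import Mapping, Optional

# Same lookup order as the `or` chain in AppConfig.from_env; first non-empty value wins.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN", "DOC_AI_HF_TOKEN")


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token, or None (hypothetical helper)."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name)
        if value:  # an empty string counts as unset, as it does with `or`
            return value
    return None


if __name__ == "__main__":
    sample = {"DOC_AI_HF_TOKEN": "c", "HUGGINGFACEHUB_API_TOKEN": "b"}
    print(resolve_hf_token(sample))
```

Passing a plain mapping instead of reading `os.environ` directly keeps the precedence easy to check in a test.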

docai_toolkit/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+"""DocAI Toolkit package (renamed to avoid PyPI conflicts)."""
+
+__all__ = ["config"]

docai_toolkit/config.py

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+import json
+import os
+from dataclasses import dataclass, field, asdict
+from pathlib import Path
+from typing import Optional
+
+CONFIG_PATH = Path.home() / ".docai" / "config.json"
+
+
+@dataclass
+class OcrConfig:
+    provider: str = "deepseek" # deepseek | tesseract
+    api_key: Optional[str] = None
+    model: Optional[str] = None # provider-specific
+    endpoint: Optional[str] = None # user-defined OCR API endpoint
+
+
+@dataclass
+class EmbeddingConfig:
+    backend: str = "sentence-transformers" # sentence-transformers | huggingface-hub
+    model: str = "all-mpnet-base-v2"
+    device: str = "auto"
+    endpoint: Optional[str] = None # user-defined embedding API endpoint
+    api_key: Optional[str] = None # for hosted endpoints
+
+
+@dataclass
+class LlmConfig:
+    backend: str = "huggingface-hub" # huggingface-hub | local-gguf | openai-compatible
+    model: str = "mistralai/Mistral-7B-Instruct-v0.1"
+    api_key: Optional[str] = None
+    max_new_tokens: int = 256
+    endpoint: Optional[str] = None # user-defined generation endpoint
+
+
+@dataclass
+class AppConfig:
+    output_dir: Path = field(default_factory=lambda: Path("./outputs"))
+    ocr: OcrConfig = field(default_factory=OcrConfig)
+    embeddings: EmbeddingConfig = field(default_factory=EmbeddingConfig)
+    llm: LlmConfig = field(default_factory=LlmConfig)
+
+    @classmethod
+    def from_env(cls) -> "AppConfig":
+        cfg = cls.load_from_file(CONFIG_PATH) or cls()
+        hf_token = (
+            os.getenv("HF_TOKEN")
+            or os.getenv("HUGGINGFACEHUB_API_TOKEN")
+            or os.getenv("DOC_AI_HF_TOKEN")
+        )
+        if hf_token:
+            cfg.llm.api_key = hf_token
+            cfg.embeddings.api_key = hf_token
+
+        output_dir_env = os.getenv("DOC_AI_OUTPUT_DIR")
+        if output_dir_env:
+            cfg.output_dir = Path(output_dir_env)
+        return cfg
+
+    @classmethod
+    def load_from_file(cls, path: Path | None) -> "AppConfig | None":
+        if not path:
+            return None
+        if not path.exists():
+            return None
+        data = json.loads(path.read_text(encoding="utf-8"))
+        return cls(
+            output_dir=Path(data.get("output_dir", "./outputs")),
+            ocr=OcrConfig(**data.get("ocr", {})),
+            embeddings=EmbeddingConfig(**data.get("embeddings", {})),
+            llm=LlmConfig(**data.get("llm", {})),
+        )
+
+    def save(self, path: Path | None = None) -> None:
+        path = path or CONFIG_PATH
+        path.parent.mkdir(parents=True, exist_ok=True)
+        data = asdict(self)
+        data["output_dir"] = str(self.output_dir)
+        path.write_text(json.dumps(data, indent=2), encoding="utf-8")
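
Reviewer sketch (not part of the commit), covering two things in `load_from_file` and `save`:

- Each JSON section is passed straight into its dataclass with `**`, so a key the dataclass does not declare (for example, one left in `~/.docai/config.json` by another version) raises `TypeError`.
- The unquoted `Path | None` parameter annotations are evaluated when the function is defined, which fails on Python 3.9 unless `from __future__ import annotations` is in effect. The README states Python 3.9+.

The snippet below shows one possible hardening, assuming unknown keys should be ignored rather than rejected. `build_known` and `load_llm_config` are hypothetical helpers, and `LlmConfig` is a stand-in copy so the example runs on its own.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, fields
from pathlib import Path
from typing import Any, Dict, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class LlmConfig:
    # Stand-in copy of the dataclass from this commit, so the sketch is self-contained.
    backend: str = "huggingface-hub"
    model: str = "mistralai/Mistral-7B-Instruct-v0.1"
    api_key: Optional[str] = None
    max_new_tokens: int = 256
    endpoint: Optional[str] = None


def build_known(cls: Type[T], raw: Dict[str, Any]) -> T:
    """Construct cls from raw, dropping keys cls does not declare (hypothetical helper)."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})


def load_llm_config(path: Optional[Path]) -> Optional[LlmConfig]:
    """Return None when the file is absent; Optional[...] also works on Python 3.9."""
    if path is None or not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    return build_known(LlmConfig, data.get("llm", {}))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg_path = Path(tmp) / "config.json"
        # "temperature" is not a field of LlmConfig; the loader ignores it.
        payload = {"llm": {**asdict(LlmConfig(max_new_tokens=64)), "temperature": 0.2}}
        cfg_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        print(load_llm_config(cfg_path))
```

Silently dropping unknown keys trades strictness for forward compatibility; logging the dropped keys would be a reasonable middle ground.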
