|
1 | | -# Contextual PDF Search |
| 1 | +# DocAI Toolkit |
2 | 2 |
|
3 | | -This scripta enable you to ask natural questions about PDF document(s) and get answers generated by a (S)LLM of your choice. It leverages the model's natural language processing capabilities to understand your queries and provide relevant information from the PDF, building a RAG and responds to natural questions. |
| 3 | +Local OCR + Markdown + RAG with optional Hugging Face/custom endpoints. The project was renamed to avoid a PyPI name collision: the package is published as `docai-toolkit` and imported as `docai_toolkit`.
4 | 4 |
|
5 | | -## Features |
6 | | - |
7 | | -* **Question-Answering:** Ask questions in natural language about the content of your PDF. |
8 | | -* **Hugging Face Integration:** Leverages the Hugging Face Transformers library to access a wide range of state-of-the-art LLM models. |
9 | | -* **Sentence Embeddings:** Uses sentence embeddings to efficiently find the most relevant parts of the PDF to answer your questions. |
10 | | -* **Automatic Dependency Management:** Checks and installs required libraries to ensure a smooth setup. |
| 5 | +- `pdf_viewer_app.py`: Tkinter UI to open PDFs, run OCR → Markdown, and “chat” via retrieval + generation. |
| 6 | +- `docai_toolkit/`: library for OCR (local Tesseract or remote endpoint), embedding/indexing (local or remote), and simple chat over FAISS. |
| 7 | +- Status: under active development; APIs and defaults may change as the AI ecosystem moves quickly. |
11 | 8 |
|
12 | 9 | ## Requirements |
13 | 10 |
|
14 | | -* **Python 3.9 or higher:** Please ensure you have a compatible version of Python installed. |
15 | | -* **Hugging Face Account:** You'll need a Hugging Face account to access their models. You can create one for free at [https://huggingface.co/](https://huggingface.co/). |
16 | | -* **Libraries:** The following Python libraries are required and will be installed automatically if not present: |
17 | | - * `langchain` |
18 | | - * `transformers` |
19 | | - * `accelerate` |
20 | | - * `bitsandbytes` |
21 | | - * `sentence_transformers` |
22 | | - |
23 | | -## Usage |
| 11 | +- Python 3.9+ |
| 12 | +- Runtime deps vary by script: |
| 13 | + - Viewer: `PyPDF2`, `reportlab` (for saving) |
| 14 | + - RAG scripts: `langchain`, `langchain-community`, `transformers`, `accelerate`, `bitsandbytes`, `sentence_transformers` |
24 | 15 |
|
25 | | -1. **Save the Script:** Download this script and save it as `pdf_qa.py`. |
| 16 | +Install everything: |
26 | 17 |
|
27 | | -2. **Install Dependencies:** Although the script installs and updates all needed libraries, it sometimes fails to do so. In that case open your terminal or command prompt and run: |
28 | | - ```bash |
29 | | - pip install -r requirements.txt |
30 | | - ``` |
| 18 | +```bash |
| 19 | +pip install -r requirements.txt |
| 20 | +# or editable install |
| 21 | +pip install -e . |
| 22 | +``` |
31 | 23 |
|
32 | | -3. **Run the Script:** |
33 | | - ``` |
34 | | - python3 pdf_qa.py [model_id] [pdf_file_path] |
35 | | - ``` |
36 | | - Replace `[model_id]` with the Hugging Face model ID you want to use (e.g., `mistralai/Mistral-7B-Instruct-v0.1`). You can find a list of available models at [https://huggingface.co/models](https://huggingface.co/models). |
37 | | - Replace `[pdf_file_path]` with the path to your PDF file(s). |
38 | | - |
39 | | -4. **Ask Questions:** |
40 | | - You'll be prompted to enter questions. Type your questions in natural language and press Enter. The script will provide answers based on the content of the PDF. |
| 24 | +## Usage |
41 | 25 |
|
42 | | -5. **Exit:** |
43 | | - Type `exit` and press Enter to quit the script. |
| 26 | +### GUI Viewer |
| 27 | + |
| 28 | +```bash |
| 29 | +python pdf_viewer_app.py |
| 30 | +``` |
| 31 | + |
| 32 | +- Open: loads all pages of a PDF into the text area. |
| 33 | +- Save As: renders the text area content into a new PDF (requires `reportlab`). |
| 34 | +- OCR → Markdown: run OCR on a PDF and save Markdown to the configured output directory (local Tesseract or remote OCR endpoint via HF/custom). |
| 35 | +- Chat: build a quick FAISS index over a chosen Markdown file and query it with a selected HF model (remote endpoint or local HF pipeline). |
| 36 | +- Settings: set the HF token, optional custom endpoints (OCR/embeddings/LLM), model choices, and the output directory. Settings persist to `~/.docai/config.json`. Env vars (`HF_TOKEN`, `HUGGINGFACEHUB_API_TOKEN`, `DOC_AI_HF_TOKEN`, `DOC_AI_OUTPUT_DIR`) are auto-read.
| 37 | + |
| 38 | +### Hugging Face onboarding (fast path) |
| 39 | + |
| 40 | +1. Create a Hugging Face access token: https://huggingface.co/settings/tokens (choose “Read” or “Write” as needed). |
| 41 | +2. Export it so the app can auto-load it: |
| 42 | + ```bash |
| 43 | + export HF_TOKEN=your_token_here |
| 44 | + # or HUGGINGFACEHUB_API_TOKEN=your_token_here |
| 45 | + ``` |
| 46 | +3. Pick models (examples): |
| 47 | + - OCR: point the OCR endpoint at a hosted OCR model (HF Inference API URL). |
| 48 | + - Embeddings: e.g., `sentence-transformers/all-mpnet-base-v2` via Inference Endpoints (text-embeddings task) or local. |
| 49 | + - LLM: e.g., `mistralai/Mistral-7B-Instruct-v0.1` via Inference Endpoints or local HF pipeline. |
| 50 | +4. Start the app, open Settings, and paste endpoints/models if you didn’t set env vars. Output dir can be set there as well. |
| 51 | + |
| 52 | +Environment variables: |
| 53 | +- `HF_TOKEN` / `HUGGINGFACEHUB_API_TOKEN` / `DOC_AI_HF_TOKEN`: auth token (auto-loads into LLM + embeddings). |
| 54 | +- `DOC_AI_OUTPUT_DIR`: default output directory for OCR/Markdown. |
| 55 | + |
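A minimal sketch of how this env-var precedence could be resolved in code. The lookup order shown here is an assumption; only the variable names themselves come from this README:

```python
import os

# Candidate token variables named in this README; first non-empty value wins.
# (The actual precedence inside docai_toolkit may differ.)
TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN", "DOC_AI_HF_TOKEN")


def resolve_hf_token(env=os.environ):
    """Return the first non-empty token variable, or None if unset."""
    for name in TOKEN_VARS:
        value = env.get(name)
        if value:
            return value
    return None


def resolve_output_dir(env=os.environ, default="outputs"):
    """Return DOC_AI_OUTPUT_DIR if set, else a default directory."""
    return env.get("DOC_AI_OUTPUT_DIR", default)
```

For example, `resolve_hf_token({"HUGGINGFACEHUB_API_TOKEN": "tok"})` picks up the token even when `HF_TOKEN` is unset.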
| 56 | +### Docker |
| 57 | + |
| 58 | +Build: |
| 59 | +```bash |
| 60 | +docker build -t docai-toolkit . |
| 61 | +``` |
| 62 | + |
| 63 | +Run (GUI requires X/Wayland forwarding; for headless tasks, override CMD): |
| 64 | +```bash |
| 65 | +docker run --rm -v "$PWD":/data docai-toolkit python -m pytest -q
| 66 | +# or override CMD to run batch OCR once you add a CLI entry point
| 67 | +``` |
| 68 | + |
| 69 | +macOS GUI via XQuartz: |
| 70 | +1) Install/start XQuartz (`brew install --cask xquartz`; enable “Allow connections from network clients” in prefs and restart). |
| 71 | +2) Allow local clients: `xhost +localhost` |
| 72 | +3) Run: |
| 73 | +```bash |
| 74 | +docker run --rm -it \ |
| 75 | + -e DISPLAY=host.docker.internal:0 \ |
| 76 | + -v /tmp/.X11-unix:/tmp/.X11-unix \ |
| 77 | + docai-toolkit |
| 78 | +``` |
| 79 | +For day-to-day use, running natively is simpler; use the container when you need an isolated, reproducible environment. |
| 80 | + |
| 81 | +## Tests |
| 82 | + |
| 83 | +Basic round-trip test for the viewer’s PDF writer: |
| 84 | + |
| 85 | +```bash |
| 86 | +pytest |
| 87 | +``` |
| 88 | + |
| 89 | +`reportlab` must be installed for the test to run. |
| 90 | + |
| 91 | +## OCR + RAG (docai_toolkit/) |
| 92 | + |
| 93 | +- OCR: pluggable clients (`RemoteOcrClient` for HF/custom endpoints, `TesseractOcrClient` local fallback) that turn PDFs into Markdown (`ocr/pipeline.py`). |
| 94 | +- RAG: build a FAISS index from Markdown (`rag/index.py`), then chat using a chosen HF model (`rag/chat.py`). |
| 95 | +- Config: lightweight dataclasses in `docai_toolkit/config.py` for selecting providers/models; saved at `~/.docai/config.json`. |
| 96 | +- Remote-friendly: use HF token + model ids by default; configs allow custom OCR/embedding/generation endpoints. FAISS runs locally for fast retrieval. |
| 97 | + |
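As an illustration of the config round trip, a dataclass persisted as JSON could look like this. The field names below are hypothetical, not the actual `docai_toolkit/config.py` schema, and the real file lives at `~/.docai/config.json`:

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class AppConfig:
    # Illustrative fields only; see docai_toolkit/config.py for the real ones.
    hf_token: str = ""
    ocr_endpoint: str = ""  # empty means fall back to local Tesseract
    embedding_model: str = "sentence-transformers/all-mpnet-base-v2"
    llm_model: str = "mistralai/Mistral-7B-Instruct-v0.1"
    output_dir: str = "outputs"


def save_config(cfg: AppConfig, path: Path) -> None:
    """Serialize the config to JSON, creating parent dirs as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(cfg), indent=2))


def load_config(path: Path) -> AppConfig:
    """Rebuild the dataclass from a previously saved JSON file."""
    return AppConfig(**json.loads(path.read_text()))
```

Keeping the on-disk format as plain JSON makes the settings easy to inspect and edit by hand.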
| 98 | +To experiment locally: |
| 99 | + |
| 100 | +```bash |
| 101 | +# OCR to Markdown (Tesseract fallback requires pytesseract + pdf2image installed) |
| 102 | +python - <<'PY' |
| 103 | +from pathlib import Path |
| 104 | +from docai_toolkit.ocr import TesseractOcrClient, run_ocr_to_markdown |
| 105 | +client = TesseractOcrClient() |
| 106 | +md_path = run_ocr_to_markdown(Path("your.pdf"), Path("outputs"), client) |
| 107 | +print("Saved:", md_path) |
| 108 | +PY |
| 109 | + |
| 110 | +# Build index + chat (requires sentence_transformers + transformers) |
| 111 | +python - <<'PY' |
| 112 | +from pathlib import Path |
| 113 | +from docai_toolkit.rag import build_index_from_markdown, chat_over_corpus, load_index |
| 114 | +index_path = Path("outputs/faiss_index") |
| 115 | +db = build_index_from_markdown([Path("outputs/your.md")], persist_path=index_path) |
| 116 | +print(chat_over_corpus(db, "What is this document about?", model_id="mistralai/Mistral-7B-Instruct-v0.1")) |
| 117 | +# Later: db = load_index(index_path) |
| 118 | +PY |
| 119 | +``` |
44 | 120 |
|
45 | 121 | ## License |
46 | 122 |
|
47 | | -This code is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. See `LICENSE.md` for details. |
48 | | -
|
49 | | -## Contributing |
50 | | -
|
51 | | -Contributions are welcome! Please feel free to fork this repository and submit pull requests. |
52 | | -
|
53 | | -## Disclaimer |
54 | | -
|
55 | | -This script is provided as-is for educational and personal use. It is not intended for production or commercial applications. The author assumes no liability for any consequences arising from the use of this script. |
56 | | -
|
| 123 | +CC BY-NC-SA 4.0 (see `LICENSE`). |