A small, practical toolkit for PDF text & table extraction, OCR and a simple RAG pipeline. It’s based on my master’s thesis and focuses on using both text and tables to answer competency questions (CQs) in the biomedical domain.
- Extract text from PDFs with PyPDF2.
- Extract tables with tabula-py (Java required).
- Optional OCR for scanned PDFs with pdf2image + pytesseract (Poppler required).
- Merge text + tables into Markdown for LLM-ready context.
- RAG demo: BM25 retrieval (Haystack) + an LLM (tested with Mixtral‑8x7B via Hugging Face).
- Small evaluation helpers and a simple metrics plot.
- CLI for all tasks; config via YAML.
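
To give a feel for the BM25 retrieval step mentioned in the list above, here is a minimal pure-Python sketch of BM25 scoring. It is not Haystack's implementation and not necessarily what this project's rag.py does; the function names (`tokenize`, `bm25_rank`), the parameter defaults (`k1=1.5`, `b=0.75`) and the toy chunks are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75, top_k=5):
    """Return (doc_index, score) pairs for the top_k documents, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / max(n_docs, 1)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))

    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy chunks, standing in for passages of a merged Markdown file.
chunks = [
    "Table 2 reports systolic blood pressure after treatment.",
    "The authors thank the reviewers for their comments.",
    "Blood pressure dropped in the treatment group.",
]
ranked = bm25_rank("blood pressure treatment", chunks, top_k=2)
```

In a RAG setup, the top-ranked chunks would be pasted into the LLM prompt as context; `top_k` here plays the same role as the `--top-k` flag of the CLI.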
Quick demo: see examples/demo.md (real PDF or toy inputs).
Read the full abstract in docs/ABSTRACT.md.
git clone https://github.com/muradali4442/thesis_extractor.git
cd thesis_extractor
pip install -e .
# optional (dev tools)
pip install -r dev-requirements.txt
pre-commit install

- tabula-py → Java runtime
- pdf2image → Poppler (apt install poppler-utils or brew install poppler)
# 1) Extract from a PDF
thesis-extractor pdf extract --pdf data/paper.pdf --out out/text.txt
thesis-extractor pdf tables --pdf data/paper.pdf --out out/tables.csv
# 2) Merge text + tables to one file (for RAG)
thesis-extractor data merge --text out/text.txt --tables out/tables.csv --out out/merged.md
# 3) Ask a question with BM25 + LLM (Mixtral via HF Inference)
export HF_TOKEN=... # or pass --api-key
thesis-extractor rag ask --data out/merged.md --question "Which clinical outcomes improved and under what conditions?" --model mistralai/Mixtral-8x7B-Instruct-v0.1 --top-k 5

Edit configs/base.yaml or pass flags:
pdf:
  dpi: 300
ocr:
  lang: eng
rag:
  model: mistralai/Mixtral-8x7B-Instruct-v0.1

You can also override these values on the CLI for one-off runs.
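
The idea of YAML defaults plus per-run overrides can be sketched as a recursive dictionary merge. This is only an illustration under assumptions: `deep_merge` is a hypothetical helper, the `rag.top_k` key is an assumed mapping of the `--top-k` flag, and the actual CLI may resolve its configuration differently. The base values are written as a literal dict so the example runs without a YAML parser.

```python
import copy


def deep_merge(base, overrides):
    """Return a new dict in which values from overrides win over base.

    Nested dicts are merged key by key; neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Defaults mirroring the YAML fragment above.
base_config = {
    "pdf": {"dpi": 300},
    "ocr": {"lang": "eng"},
    "rag": {"model": "mistralai/Mixtral-8x7B-Instruct-v0.1"},
}

# Hypothetical one-off override, e.g. derived from a --top-k 3 flag.
cli_overrides = {"rag": {"top_k": 3}}

config = deep_merge(base_config, cli_overrides)
```

Merging into a copy keeps the file-based defaults untouched, so several runs with different flags can start from the same base.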
src/thesis_extractor/
  pdf.py          # text & table extraction, OCR helpers
  preprocess.py   # merge text + tables -> Markdown
  rag.py          # Haystack BM25 + LLM (Mixtral via HF)
  eval.py         # baseline metrics
  visualize.py    # simple plotting
  cli.py          # CLI commands
configs/
tests/
.github/workflows/ci.yml
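
As a rough picture of the "merge text + tables -> Markdown" step listed for preprocess.py, the sketch below turns extracted text and a CSV table into one Markdown document using only the standard library. The function names, the "## Text" / "## Tables" section titles and the toy inputs are assumptions for illustration; the real module may structure its output differently.

```python
import csv
import io


def rows_to_markdown_table(rows):
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [list(r) + [""] * (width - len(r)) for r in rows]

    def fmt(row):
        # Escape literal pipes so they do not break the table layout.
        cells = [str(c).replace("|", "\\|").strip() for c in row]
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(padded[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(r) for r in padded[1:]]
    return "\n".join(lines)


def merge_to_markdown(text, tables_csv):
    """Combine plain text and CSV table content into one Markdown string."""
    reader = csv.reader(io.StringIO(tables_csv))
    rows = [r for r in reader if any(cell.strip() for cell in r)]
    parts = ["## Text", text.strip()]
    table_md = rows_to_markdown_table(rows)
    if table_md:
        parts += ["## Tables", table_md]
    return "\n\n".join(parts) + "\n"


# Toy inputs standing in for out/text.txt and out/tables.csv.
merged = merge_to_markdown("Toy paragraph.", "group,n\nA,10\nB,12\n")
```

Keeping tables as pipe tables rather than raw CSV lets a retriever and an LLM see column headers next to the values they describe.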
- The pipeline is domain‑agnostic; I used biomedical papers (IEEE, CEUR‑WS Vol‑3880 & Vol‑3578), but you can point it at any PDFs.
- To use Mixtral or another LLM on Hugging Face, set HF_TOKEN or pass --api-key. A custom endpoint can be passed via --api-base-url.
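
One plausible way to resolve the API key from either source is shown below. This is a sketch, not the project's code: `resolve_api_key` is a hypothetical helper, and the precedence (an explicit --api-key value wins over the HF_TOKEN environment variable) is an assumption.

```python
import os


def resolve_api_key(cli_api_key=None, env=None):
    """Pick the API key from the CLI value first, then from HF_TOKEN.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    key = cli_api_key or env.get("HF_TOKEN")
    if not key:
        raise RuntimeError("No API key found: set HF_TOKEN or pass --api-key")
    return key
```

Failing early with a clear message avoids sending unauthenticated requests to the inference endpoint.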
@software{thesis_extractor_2025,
author = {Murad Ali},
title = {Thesis Extractor — Biomedical PDF/Text–Table RAG},
year = {2025},
url = {https://github.com/muradali4442/thesis_extractor},
version = {0.1.0}
}