Important note: this public repository is a clean-room, open-source proof-of-concept (PoC) that demonstrates an end-to-end search pipeline, including query rewriting, dense retrieval, and reranking, using only publicly available datasets. This project is independently developed for research use as a reference architecture; it is not for clinical use and includes no employer-specific code, configurations, or data.
For convenience, the project supports two execution modes:
- Sample (demo): processes the first 1,000 documents by default (controlled via `MAX_DOCS`).
- Full (experiments): processes the entire dataset; evaluation metrics (nDCG@k, P@k, R@100, MAP) are calculated with the `ir-measures` package and summarized in a table.
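For reference, a minimal sketch of how these metrics can be computed with the ir-measures Python API (the file names are placeholders; see evaluate.py for the repo's actual wiring):

```python
import ir_measures
from ir_measures import nDCG, P, R, AP

# Placeholder TREC-format files; evaluate.py computes these from in-memory runs.
qrels = ir_measures.read_trec_qrels("qrels.trec")
run = ir_measures.read_trec_run("run.trec")

metrics = ir_measures.calc_aggregate([nDCG @ 10, P @ 10, R @ 100, AP], qrels, run)
print(metrics)  # e.g. {nDCG@10: ..., P@10: ..., R@100: ..., AP: ...}
```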
On first run, the dataset (PubMed Central TREC CDS 2016 — see ir-datasets — pmc/v2/trec-cds-2016) will be downloaded automatically.
- PubMed Central TREC CDS 2016
- Documents: ~1.3M (PMC v2)
- Queries: 30
- Qrels: ~38K
- See the dataset page for Python API / CLI / PyTerrier examples.
- Link: ir-datasets — pmc/v2/trec-cds-2016
The download time will depend on your network and disk speed. The total dataset size is approximately 33GB, though the exact size may vary depending on the mirror and compression format used. To ensure license compliance and experimental reproducibility, this project exclusively uses the official PMC open-access snapshot and download links specified by TREC CDS, and does not repackage or redistribute the original documents.
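A minimal sketch of loading the collection through the ir-datasets Python API (data.py wraps this for the pipeline):

```python
import ir_datasets

# First access downloads the collection and caches it (by default under ~/.ir_datasets).
dataset = ir_datasets.load("pmc/v2/trec-cds-2016")

print(dataset.docs_count())              # ~1.3M documents
first_doc = next(dataset.docs_iter())    # namedtuple with doc_id and article text fields
print(first_doc.doc_id)

print(sum(1 for _ in dataset.queries_iter()))  # 30 topics
print(sum(1 for _ in dataset.qrels_iter()))    # ~38K judgments
```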
End-to-end flow:
- Query Rewriting (optional, controlled by `ENHANCE_QUERY`): rewrite the user query via an Ollama `/api/chat` endpoint to improve semantic retrieval (a minimal sketch follows this list). For instance, a simple query like "Search for patients with active HIV encephalopathy." is transformed into a comprehensive, expert-level query like "Brain MRI report findings consistent with HIV encephalopathy or opportunistic infections in HIV-positive patients, including findings such as white matter hyperintensities, basal ganglia lesions, cortical atrophy, or evidence of toxoplasmosis, cryptococcoma, or progressive multifocal leukoencephalopathy (PML)."
- Dense Retrieval: build a FAISS index over document embeddings generated by a Hugging Face model and perform a cosine-similarity search with the query embedding to obtain the top-k candidates (see the retrieval sketch after the diagram below).
- Rerank: rerank the top-k candidate documents with a cross-encoder to produce the final ranking.
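A minimal sketch of the optional rewriting call against the Ollama /api/chat endpoint (the system prompt here is illustrative; enhance.py holds the actual implementation):

```python
import requests

def rewrite_query(query: str,
                  endpoint: str = "http://localhost:11434/api/chat",
                  model: str = "medgemma-27b-text-it:latest") -> str:
    # Non-streaming chat request; Ollama returns the reply under "message".
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Rewrite the clinical query into a detailed, expert-level search query."},
            {"role": "user", "content": query},
        ],
        "stream": False,
    }
    response = requests.post(endpoint, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["message"]["content"]
```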
graph TD
A[User Query] --> B{"1. Query Rewriting ?"};
B -- Yes --> C[LLM Rewriting];
B -- No --> D[Original Query];
C --> E["2. Dense Retrieval"];
D --> E;
E -- Top-k Candidates --> F["3. Reranking"];
F --> G[Final Ranked List];
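The dense-retrieval stage can be sketched as follows, assuming a sentence-transformers style interface for the embedding model; index.py and search.py implement the repo's actual build/search logic:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Cosine similarity implemented as inner product over L2-normalized embeddings.
model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

doc_texts = ["...document 1...", "...document 2..."]  # placeholder corpus
doc_emb = model.encode(doc_texts, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(doc_emb.shape[1])  # inner product == cosine on unit vectors
index.add(doc_emb)

query_emb = model.encode(["active HIV encephalopathy"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_emb, 10)    # top-k candidate documents
```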
.
├── docker/
│ └── Dockerfile
├── scripts/
│ └── main.py # entrypoint (calls src.cds2016.cli)
├── src/
│ └── cds2016/
│ ├── cli.py # CLI pipeline
│ ├── config.py # configuration (env overrides supported)
│ ├── data.py # ir-datasets loading
│ ├── embeddings.py # embeddings for queries/documents
│ ├── enhance.py # query rewriting (LLM endpoint)
│ ├── index.py # FAISS index build/load/search
│ ├── search.py # dense search and BM25 hook
│ ├── rerank.py # cross-encoder
│ └── evaluate.py # metrics
├── docker-compose.yml
├── requirements.txt
└── LICENSE
This section guides you through setting up and running the project. Choose either Pip or Docker.
Prerequisites
- Python 3.11.13
- CUDA 12.4 (for GPU support). Ensure a compatible CUDA driver is installed on the host machine.
- Set up environment variables

  First, copy the example environment file. You can customize variables like model names or file paths in `.env`.

  cp .env.example .env

- Install dependencies

  pip install -r requirements.txt

- Run the pipeline

  # Standard run
  python scripts/main.py

  # Verbose run
  python scripts/main.py -v
This method uses Docker Compose to manage the environment and dependencies. See the "Containerized Setup" section for more options.
- Set up environment variables

  First, copy the example environment file. Docker Compose will automatically load it.

  cp .env.example .env

- Build and run the container

  This command builds the image and runs the main script. The `--rm` flag cleans up the container after execution.

  docker compose build
  docker compose run --rm app python scripts/main.py

  To run with verbosity:

  docker compose run --rm app python scripts/main.py -v
- Default (no flags): level is INFO; printed to console only.
- With `-q` (quiet): level is WARNING (show WARNING/ERROR only); printed to console only.
- With `-v` (verbose): level is DEBUG; printed to console and also written to `./logs/cds2016_YYYYMMDD_HHMMSS.log`.
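One way these flags could map onto Python's logging configuration (illustrative only; cli.py is the source of truth):

```python
import argparse
import logging
from datetime import datetime
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-q", "--quiet", action="store_true")
args = parser.parse_args()

# -v -> DEBUG, -q -> WARNING, default -> INFO
level = logging.DEBUG if args.verbose else logging.WARNING if args.quiet else logging.INFO

handlers = [logging.StreamHandler()]
if args.verbose:
    Path("logs").mkdir(exist_ok=True)
    log_path = Path("logs") / f"cds2016_{datetime.now():%Y%m%d_%H%M%S}.log"
    handlers.append(logging.FileHandler(log_path))

logging.basicConfig(level=level, handlers=handlers)
```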
Default is GPU (faiss-gpu-cu12) via requirements.txt.
- CPU-only environments:

  pip uninstall -y faiss-gpu-cu12 faiss-gpu
  pip install -U faiss-cpu

- Switch back to GPU:

  pip uninstall -y faiss-cpu
  pip install -U faiss-gpu-cu12==1.8.0.2

Note: faiss-gpu does not automatically fall back to CPU if CUDA/GPU is unavailable. Ensure a compatible NVIDIA driver and CUDA for the pinned build.
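A sketch of an explicit GPU check with a CPU fallback (the helper name is hypothetical; index.py handles placement in the repo):

```python
import faiss

def maybe_to_gpu(cpu_index: faiss.Index) -> faiss.Index:
    # faiss-gpu is installed but no device is visible -> stay on CPU
    if faiss.get_num_gpus() == 0:
        return cpu_index
    try:
        res = faiss.StandardGpuResources()
        return faiss.index_cpu_to_gpu(res, 0, cpu_index)  # single device 0
    except RuntimeError:
        # e.g. out of GPU memory -> keep the CPU index
        return cpu_index

index = maybe_to_gpu(faiss.IndexFlatIP(768))  # 768 is a placeholder dimension
```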
Environment variables via .env (see .env.example for the full list and defaults):
# Data and outputs
DATASET_ID=pmc/v2/trec-cds-2016
FAISS_DIR=./artifacts/cds2016_faiss
MAX_DOCS=1000
# Models
EMBED_MODEL=Qwen/Qwen3-Embedding-8B
EMBEDDING_BATCH=2
RERANK_MODEL=BAAI/bge-reranker-v2-gemma
RERANK_METHOD=gemma
RERANK_BATCH=16
# Retrieval
RETRIEVAL_TOPK=1000
RERANK_K=1000
FINAL_K=100
EVAL_AT=10
# Device
DEVICE=cuda
# (optional) Query rewriting (Ollama /api/chat endpoint)
ENHANCE_QUERY=false
ENHANCE_ENDPOINT=http://localhost:11434/api/chat
ENHANCE_MODEL=medgemma-27b-text-it:latest

This project can leverage GPUs via Hugging Face Accelerate sharded loading (device_map=auto) for embedding and reranking models. FAISS GPU search currently uses a single device by default in this repo.
Requirements:
- Install GPU toolkits and drivers (CUDA) compatible with your PyTorch/FAISS builds
- At least one visible GPU (check with `nvidia-smi`)
How it works:
- When `DEVICE` is set to `cuda` or `gpu`, CUDA is available, and Accelerate is installed, the code sets `device_map=auto` so embedding and reranking models can shard across GPUs; otherwise `device_map` is not set and the model is moved to the specified device (see the sketch below).
- FAISS: the index builder tries to place the index on a single GPU. If OOM occurs, it falls back to CPU. Multi-GPU FAISS is possible but not enabled here to keep the demo simple.
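A sketch of the loading decision above, assuming a plain transformers AutoModel; embeddings.py and rerank.py implement the real logic:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen3-Embedding-8B"
device = "cuda"

try:
    import accelerate  # noqa: F401  (its presence enables device_map="auto")
    has_accelerate = True
except ImportError:
    has_accelerate = False

tokenizer = AutoTokenizer.from_pretrained(model_name)
if device in ("cuda", "gpu") and torch.cuda.is_available() and has_accelerate:
    # Shard the model across all visible GPUs
    model = AutoModel.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
else:
    # Single-device fallback
    model = AutoModel.from_pretrained(model_name).to("cuda" if device == "gpu" else device)
```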
Notes:
- If Accelerate is not installed, sharding is disabled automatically; the model uses a single device.
- If you see Flash Attention errors, the loader will retry without FA2; performance may change.
- Using a BF16-capable GPU is recommended for large models; otherwise, the process will fall back to FP16.
- Reranking is the heaviest stage. Start from a small `RERANK_BATCH` and scale up (a reranking sketch follows this list).
- When a FAISS operation causes an out-of-memory error, the process will automatically fall back to the CPU.
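A sketch of batched cross-encoder scoring. It assumes a sequence-classification reranker (e.g. BAAI/bge-reranker-v2-m3 via sentence-transformers); the default gemma-based reranker (RERANK_METHOD=gemma) is scored differently in rerank.py:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

query = "active HIV encephalopathy"
candidates = ["...candidate doc 1...", "...candidate doc 2..."]  # top-k from dense retrieval

# RERANK_BATCH-style knob: smaller batches use less GPU memory, larger ones run faster.
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs, batch_size=16)

ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```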
Start a container and run the pipeline:
# start container (detached)
docker compose up -d
# enter the container shell
docker compose exec app bash
# inside the container
python scripts/main.py -v

One-shot (non-interactive):

docker compose run --rm app python scripts/main.py -v

GPU selection via compose (excerpt from docker-compose.yml):
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all  # or device_ids: ["0", "1"]
          capabilities: [gpu]

Environment & caches (defined in docker-compose.yml):

- `.env` is loaded via `env_file: .env` (see `.env.example`)
- `HF_HOME`, `TORCH_HOME`, `KERAS_HOME` are redirected to `/data/.cache` for shared caching
- Volumes:
  - `.:/workspace` live-mounts the code
  - `./data/.ir_datasets:/root/.ir_datasets` for the ir-datasets cache
  - `/data/.cache:/data/.cache` for model and dataset caches
Notes:
- If the container logs show GPU-related errors, verify NVIDIA toolkit installation and that the listed GPU IDs are present (`nvidia-smi`).
- The compose GPU block may require a recent Docker/Compose version. If it is unsupported in your environment, use a single-GPU setup or run with `docker run` and `--gpus` flags.
The Command-Line Interface (CLI) prints evaluation metrics for each pipeline stage. During development, we compared several reranker models, including BAAI/bge-reranker-v2-gemma, ncbi/MedCPT-Cross-Encoder, and Qwen/Qwen3-Reranker-8B. This project defaults to using BAAI/bge-reranker-v2-gemma as it yielded the best overall performance in our internal tests.
| Method | nDCG@10 | P@10 | R@100 | MAP |
|---|---|---|---|---|
| Dense Search (baseline) | 24.69 | 30.33 | 12.87 | 3.32 |
| + Query Rewriting | 22.02 (-2.67) | 27.00 (-3.33) | 13.34 (+0.47) | 3.42 (+0.10) |
| + Rerank | 20.33 (-4.36) | 23.33 (-7.00) | 13.37 (+0.50) | 3.64 (+0.32) |
| + Rewriting & Rerank | 20.31 (-4.38) | 25.00 (-5.33) | 13.52 (+0.65) | 3.60 (+0.28) |
Note:
- The above results are from the full dataset; all values are percentages (%), shown with two decimal places.
- nDCG@10: Evaluates the "ranking quality" of the top 10 results; the more relevant items are ranked higher, the higher the score.
- P@10: Evaluates the "precision" of the top 10 results; calculates the proportion of these 10 results that are relevant.
- R@100: Evaluates the "recall" (or coverage) of the top 100 results; measures the proportion of all relevant items that the system successfully found.
- MAP: A comprehensive average score for the system, evaluating both "precision" and "ranking quality".
- Metric Evaluation:
- Dense Search demonstrates superior top-10 ranking quality (nDCG@10, P@10) compared to the reranked version. However, reranking yields a slight improvement in overall recall (R@100) and MAP.
- Impact of Query Rewriting:
- Query rewriting effectively broadens the search scope, boosting R@100 and MAP. However, it also introduces noise, which degrades the precision of top-ranked results (nDCG@10, P@10) during the dense retrieval stage.
- Limitations and Potential Factors:
- A discrepancy exists between the public test set queries (long, expert-phrased sentences) and real-world clinical queries (often short and ambiguous). Consequently, the quantitative metrics may not fully capture the practical benefits of query rewriting.
- A domain gap between the cross-encoder's pre-training data and the medical corpus compromises the precision of top-ranked results, leading to a decline in metrics like nDCG@10.
- Integrate BM25 Retrieval:
  - Beyond the existing vector-based dense retrieval, we will add a keyword-based BM25 retrieval path. The current codebase includes a basic interface for `pyserini`, but an index creation process has not yet been implemented.
- Implement RRF (Reciprocal Rank Fusion) for Hybrid Ranking:
  - Develop an RRF fusion mechanism to combine ranking results from multiple sources (e.g., Dense Search, BM25, and their respective query-enhanced versions) into a single, more robust ranking (a minimal sketch follows).
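A minimal sketch of RRF over ranked doc-ID lists (k=60 is the conventional smoothing constant; function and variable names are illustrative):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each input ranking lists doc IDs best-first; a document's fused score is
    # the sum of 1 / (k + rank) over every ranking it appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense-retrieval ranking with a (hypothetical) BM25 ranking
dense_run = ["doc3", "doc1", "doc7"]
bm25_run = ["doc1", "doc9", "doc3"]
print(rrf_fuse([dense_run, bm25_run]))
```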
- This project is an independently developed demonstration. Its pipeline and architecture serve as a reference architecture for clinical search systems; it is not production-ready or intended for clinical use.
- All demonstrations and reports use public data (`pmc/v2/trec-cds-2016`).
- PubMed Central (TREC CDS) — `pmc/v2/trec-cds-2016`: