Rule-based clinical NLP dashboard for breast cancer phenotyping using spaCy, medspaCy, and Dash, with fully auditable evidence extraction.

jcaperella29/NLP_Phenotyper

🧬 phenotyper_dash

A local, fully offline Dash application for extracting breast cancer phenotypes from free-text clinical notes using rule-based NLP (spaCy + medspaCy) with transparent, auditable evidence tracking.

This project intentionally prioritizes determinism, explainability, and traceability over black-box prediction.


🚀 What this app does

  • Ingests multiple clinical notes per patient
  • Extracts structured breast cancer phenotypes:
    • ER / PR status and percentages
    • HER2 (IHC, FISH, final status)
    • Ki-67
    • Histology
    • Grade
    • Clinical and pathologic stage
  • Aggregates note-level findings into one patient-level row
  • Preserves all evidence mentions, including:
    • Source note
    • Text snippet
    • Confidence score
    • Negation / uncertainty flags
  • Runs entirely locally (no APIs, no cloud, no LLM calls)
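One way to picture a single evidence mention is as a small immutable record. The sketch below is illustrative only: the class name, field names, and the assumed 0 to 1 confidence range are hypothetical and are not taken from the app's actual schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvidenceMention:
    """One extracted mention (hypothetical schema, not the app's real one)."""
    patient_id: str
    note_id: str               # source note the mention came from
    phenotype: str             # e.g. "er_status"
    value: str                 # normalized value, e.g. "Positive"
    snippet: str               # text around the mention, kept for audit
    confidence: float          # assumed to lie in [0, 1]
    is_negated: bool = False
    is_uncertain: bool = False

    @property
    def is_clean(self) -> bool:
        """True when the mention is neither negated nor uncertain."""
        return not (self.is_negated or self.is_uncertain)


mention = EvidenceMention(
    patient_id="P001",
    note_id="path_01",
    phenotype="er_status",
    value="Positive",
    snippet="ER positive (95%)",
    confidence=0.9,
)
row = asdict(mention)  # plain dict, ready to become one row of an evidence table
```

Keeping the record frozen means a mention cannot be altered after extraction, which fits the auditability goal stated above.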

🧠 Architecture overview

Clinical notes (.txt)

  ↓

[ spaCy + medspaCy ]
  • Rule-based NER
  • ConText (negation / uncertainty)

  ↓

[ Normalization layer ]
  • Percent parsing
  • HER2 reconciliation
  • Histology / grade / stage normalization

  ↓

[ Aggregation layer ]
  • Note-type precedence
  • Evidence-aware selection
  • Deterministic conflict resolution

  ↓

Dash UI
  • Patient phenotype table
  • Evidence table with snippets
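As an illustration of the "Percent parsing" step in the normalization layer, here is a minimal regex-based sketch. The pattern, the `parse_percent` name, and the returned dictionary shape are assumptions made for this example; the app's actual rules may differ.

```python
import re
from typing import Optional

# Hypothetical pattern: optional "<" or ">", a number, an optional range, then "%".
_PERCENT = re.compile(
    r"(?P<op>[<>])?\s*(?P<lo>\d{1,3}(?:\.\d+)?)\s*"
    r"(?:(?:-|to)\s*(?P<hi>\d{1,3}(?:\.\d+)?)\s*)?%"
)


def parse_percent(text: str) -> Optional[dict]:
    """Return the first percentage in `text` as a low/high range, or None."""
    match = _PERCENT.search(text)
    if match is None:
        return None
    low = float(match.group("lo"))
    high = float(match.group("hi")) if match.group("hi") else low
    if not 0 <= low <= high <= 100:
        return None  # reject implausible values instead of guessing
    return {"op": match.group("op") or "=", "low": low, "high": high}


print(parse_percent("ER positive (95%)"))  # {'op': '=', 'low': 95.0, 'high': 95.0}
print(parse_percent("Ki-67 10-20%"))       # {'op': '=', 'low': 10.0, 'high': 20.0}
print(parse_percent("PR <1%"))             # {'op': '<', 'low': 1.0, 'high': 1.0}
```

Returning a range rather than a single number keeps values such as "10-20%" intact, so nothing is lost before aggregation.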


🛠 Installation

Local (virtualenv)

    python -m venv .venv
    source .venv/bin/activate  # Windows: .venv\Scripts\activate

    pip install -r requirements.txt
    python -m spacy download en_core_web_sm

    python app.py

Then open http://127.0.0.1:8050 in a browser.

### 🐳 Docker

    docker build -t phenotyper-dash .
    docker run -p 8050:8050 phenotyper-dash

Then open http://127.0.0.1:8050, as with the local install.

### 📦 Apptainer / Singularity

    apptainer build phenotyper.sif phenotyper.def
    apptainer run phenotyper.sif

Then open the app in a browser at the same address as the local install.

🧪 How to use the app
1️⃣ Upload notes

Upload one or more .txt files containing clinical notes, such as:

  • Pathology
  • Oncology consults
  • Radiology reports
  • Progress notes

2️⃣ (Optional) Upload a mapping CSV

A mapping CSV lets you control patient identity and note metadata.

Supported columns:
| Column                  | Required | Description                   |
| ----------------------- | -------- | ----------------------------- |
| `note_id` OR `filename` | ✅        | Links row to uploaded file    |
| `patient_id`            | ✅        | Patient identifier            |
| `note_date`             | ❌        | Used for precedence           |
| `note_type`             | ❌        | Used for confidence & ranking |



If no mapping is provided, the defaults are:

  • `note_id` = filename stem
  • `patient_id` = filename stem
  • `note_type` = Unknown
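The mapping columns and defaults above can be sketched with the standard library alone. The `resolve_metadata` function is hypothetical, and the matching rule it uses (a row matches a file when its `filename` equals the file name or its `note_id` equals the filename stem) is an assumption, not confirmed behavior of the app.

```python
import csv
import io
from pathlib import Path
from typing import Optional


def resolve_metadata(filenames: list, mapping_csv_text: Optional[str] = None) -> dict:
    """Return {filename: metadata}; unmapped files get the documented defaults."""
    by_key = {}
    if mapping_csv_text:
        for row in csv.DictReader(io.StringIO(mapping_csv_text)):
            # Index each row under whichever linking column it provides.
            for column in ("filename", "note_id"):
                key = (row.get(column) or "").strip()
                if key:
                    by_key[key] = row

    resolved = {}
    for name in filenames:
        stem = Path(name).stem
        row = by_key.get(name) or by_key.get(stem) or {}
        resolved[name] = {
            "note_id": row.get("note_id") or stem,        # default: filename stem
            "patient_id": row.get("patient_id") or stem,  # default: filename stem
            "note_date": row.get("note_date") or None,    # optional column
            "note_type": row.get("note_type") or "Unknown",
        }
    return resolved


csv_text = (
    "filename,patient_id,note_date,note_type\n"
    "path_01.txt,P001,2024-01-05,Pathology\n"
)
meta = resolve_metadata(["path_01.txt", "consult_02.txt"], csv_text)
```

Here `path_01.txt` picks up its patient ID and note type from the CSV, while `consult_02.txt` has no row and falls back to the defaults.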

3️⃣ Run extraction

Click “Run extraction”. The app will:

  • Process each note with spaCy / medspaCy
  • Extract structured fields
  • Record all evidence mentions
  • Aggregate results to patient level


📊 Outputs
Patient phenotype table

  • One row per patient
  • Deterministic values
  • Source note metadata
  • Confidence buckets

Evidence table

  • Every extracted mention
  • Original text snippet
  • Negation / uncertainty flags
  • Confidence score

Both tables can be exported as CSV.

🧮 Aggregation logic (important)

For each phenotype field, the following rules apply in order:

  1. Prefer values with non-negated, non-uncertain evidence
  2. Prefer Pathology / Addendum notes over Consults
  3. Prefer newer notes if still tied
  4. Fall back to the first non-empty value if no clean evidence exists
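The precedence rules above can be expressed as a filter followed by a sort key. This is a sketch under stated assumptions: mentions are plain dictionaries with hypothetical keys, note types other than Pathology, Addendum, and Consult rank last, and a missing date counts as oldest. None of these details are confirmed by the app's code.

```python
from datetime import date
from typing import Optional

# Assumed ranking: Pathology and Addendum tie for first, Consult next, others last.
NOTE_TYPE_RANK = {"Pathology": 0, "Addendum": 0, "Consult": 1}


def pick_value(mentions: list) -> Optional[dict]:
    """Choose one mention for a phenotype field using the documented precedence."""
    candidates = [m for m in mentions if m.get("value")]
    if not candidates:
        return None
    clean = [m for m in candidates
             if not (m.get("is_negated") or m.get("is_uncertain"))]
    if not clean:
        return candidates[0]  # fallback: first non-empty value

    def sort_key(m: dict):
        rank = NOTE_TYPE_RANK.get(m.get("note_type"), 2)
        newest_first = -(m.get("note_date") or date.min).toordinal()
        return (rank, newest_first)

    # min() keeps the first of any fully tied mentions, so the result is deterministic.
    return min(clean, key=sort_key)


mentions = [
    {"value": "Negative", "note_type": "Consult", "note_date": date(2024, 3, 1)},
    {"value": "Positive", "note_type": "Pathology", "note_date": date(2024, 1, 5)},
    {"value": "Positive", "note_type": "Addendum", "note_date": date(2024, 1, 9)},
    {"value": "Negative", "note_type": "Pathology", "note_date": date(2024, 4, 1),
     "is_uncertain": True},
]
chosen = pick_value(mentions)
print(chosen["note_type"], chosen["value"])  # Addendum Positive
```

The uncertain Pathology note is excluded first, the Consult loses on note type, and the Addendum wins over the earlier Pathology note on date.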

HER2 final status rules

  • FISH overrides IHC
  • IHC 3+ → Positive
  • IHC 2+ → Equivocal
  • IHC 0 / 1+ → Negative
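These HER2 rules reduce to a lookup table plus one override. In the sketch below, the function name and the input encoding are assumptions: IHC scores arrive as strings such as "2+", and the FISH result is taken to be already normalized to a status string.

```python
from typing import Optional

# IHC score to status, as listed in the rules above.
IHC_TO_STATUS = {
    "0": "Negative",
    "1+": "Negative",
    "2+": "Equivocal",
    "3+": "Positive",
}


def her2_final_status(ihc: Optional[str], fish: Optional[str]) -> Optional[str]:
    """Reconcile HER2 status; any non-empty FISH result overrides IHC."""
    if fish:
        return fish  # assumed to be a normalized status such as "Positive"
    if ihc is None:
        return None
    return IHC_TO_STATUS.get(ihc.strip())  # None for unrecognized scores


print(her2_final_status("2+", "Positive"))  # Positive (FISH overrides IHC)
print(her2_final_status("2+", None))        # Equivocal
print(her2_final_status("1+", None))        # Negative
```

Unrecognized IHC scores return `None` rather than a guessed status, which leaves the gap visible for manual review.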

⚠️ Known limitations (v1)

This is a rule-based MVP by design.

NLP limitations

  • No deep ML / transformer models
  • Relies on curated rules and patterns
  • May miss highly non-standard phrasing

Clinical scope

  • Breast cancer only
  • Limited staging nuance (no full TNM parsing)
  • No treatment response or outcome inference

Data assumptions

  • Text input only (.txt)
  • No OCR / scanned PDFs
  • Assumes reasonably clean clinical notes

Not intended to:

  • Replace manual chart review
  • Make clinical decisions
  • Serve as a production CDS system

🎯 Why this design is intentional

  • Deterministic: same input → same output
  • Auditable: every value traceable to text
  • Privacy-safe: runs fully offline
  • Extensible: easy to add new rules or targeted ML later

Well-suited for:

  • Research preprocessing
  • Cohort discovery
  • QA / abstraction support
  • Phenotyping pipeline prototyping

🔮 Future directions (optional)

  • Targeted ML only where rules fail (e.g. free-text histology)
  • Genotype join keys (ERBB2, ESR1, PGR)
  • TNM parsing
  • Multi-cancer schemas
  • Read-only deployments behind auth

📌 Status

MVP complete.
Schema locked.
Containerized.
Ready for iteration and extension.