A production-ready Medical Named Entity Recognition (NER) pipeline for extracting biomedical entities from clinical text using spaCy, BioBERT, and advanced template-based pattern matching.
- Comprehensive Entity Detection: Diseases, genes, proteins, chemicals, and anatomical terms
- Advanced Context Analysis: Identifies negated, historical, family history, uncertain, and confirmed conditions
- Template-Based Matching: 57,476+ curated medical terms across 6 specialized templates
- BioBERT Integration: State-of-the-art biomedical language models for high accuracy
- Dual Interface: Command-line tool and interactive Streamlit web application
- Rich Output: 15-column Excel reports with visualizations and JSON export
- Scope Reversal Detection: Handles complex negation patterns ("no fever but has cough")
- Production Ready: Comprehensive test suite and robust error handling
# Clone the repository
git clone https://github.com/yourusername/medical-nlp-lean.git
cd medical-nlp-lean
# Create conda environment
conda env create -f py311_bionlp_environment.yml
# Activate environment
conda activate py311_bionlp
# Install package
pip install -e .
# Download required spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_ner_bc5cdr_mdCommand Line:
# Run NER pipeline on default input
./run_ner_pipeline.sh --run
# Process custom file
./run_ner_pipeline.sh --input data/my_clinical_notes.xlsx --runWeb Interface:
# Launch Streamlit app
./run_app.sh
# Opens at http://localhost:8501Python API:
from src.enhanced_medical_ner_predictor import MedicalNERPredictor
# Initialize predictor
predictor = MedicalNERPredictor()
# Process text
text = "Patient denies chest pain but reports shortness of breath."
results = predictor.process_text(text)
# Access detected entities
print(results['detected_diseases'])
print(results['negated_entities'])
print(results['confirmed_entities'])The pipeline generates comprehensive Excel reports with 15 columns:
| Column | Description |
|---|---|
| Visualization | HTML entity highlighting with color-coded labels |
| Detected Diseases | Identified disease/condition entities |
| Disease Count | Total number of diseases detected |
| Detected Genes | Identified gene/protein entities |
| Gene Count | Total number of genes detected |
| Negated Entities | Conditions explicitly denied or absent |
| Historical Entities | Past medical history mentions |
| Family Entities | Family medical history |
| Uncertain Entities | Possible or speculative conditions |
| Confirmed Entities | Explicitly confirmed conditions |
| Section Categories | Clinical note sections (Chief Complaint, Assessment, Plan, etc.) |
| JSON Export | Complete structured data for all entities |
- Diseases & Conditions: Diabetes, hypertension, pneumonia, cancer types, etc.
- Genes & Proteins: BRCA1, TP53, kinesin, hemoglobin, etc.
- Chemicals & Drugs: Aspirin, metformin, chemotherapy agents, etc.
- Anatomical Terms: Heart, lungs, liver, blood vessels, etc.
- Negation Detection: "No evidence of diabetes", "denies chest pain"
- Historical Context: "History of hypertension", "previous stroke"
- Family History: "Mother has breast cancer", "family history of diabetes"
- Uncertainty: "Possible pneumonia", "rule out myocardial infarction"
- Scope Reversal: "No fever but has cough" (correctly identifies cough as confirmed)
- target_rules_template.xlsx: 57,476 curated medical terms
- negated_rules_template.xlsx: 99 negation patterns
- historical_rules_template.xlsx: 82 historical context patterns
- family_rules_template.xlsx: 79 family history patterns
- uncertainty_rules_template.xlsx: 48 uncertainty patterns
- confirmed_rules_template.xlsx: 138 confirmation patterns
medical-nlp-lean/
βββ src/ # Core Python modules
β βββ enhanced_medical_ner_predictor.py
β βββ performance_analyzer.py
βββ app/ # Streamlit web application
β βββ medical_nlp_app.py
βββ data/
β βββ external/ # Template files
β βββ raw/ # Input data
βββ models/
β βββ pretrained/ # BioBERT models (~1.6GB)
βββ output/ # Generated results
β βββ results/ # Excel outputs
β βββ visualizations/ # PNG visualizations
β βββ logs/ # Execution logs
βββ tests/ # Comprehensive test suite
βββ configs/ # Configuration files
Run the comprehensive test suite to validate installation:
# Run all tests
./run_tests.sh
# Quick validation
./run_tests.sh --quick
# Specific test category
python tests/master_test_script.py --category scope_reversalTest Categories:
- Scope reversal detection (103 patterns)
- Template pattern validation
- Context classification
- Negation detection
- Output formatting
- UI consistency
- Pipeline integration
Customize pipeline behavior in configs/pipeline_config.yaml:
pipeline:
confidence_thresholds:
curated_templates: 0.3 # Lower threshold for template matches
general_patterns: 0.5 # Higher threshold for general patterns
proximity_weighting:
max_boost: 0.3 # Confidence boost for nearby matches
models:
disease_model: "models/pretrained/Disease"
chemical_model: "models/pretrained/Chemical"
gene_model: "models/pretrained/Gene"
spacy_model: "en_core_web_sm"
biomedical_model: "en_ner_bc5cdr_md"- Processing Speed: ~100 clinical notes in <1 minute
- Memory Usage: ~2GB for typical workloads
- Accuracy: 95%+ for medical entity detection
- Models: 3 BioBERT models (~1.6GB total)
- Installation Guide: Complete setup instructions
- API Reference: Python API documentation
- Template Guide: How to customize medical term templates
- Configuration: Pipeline configuration options
- Examples: Sample clinical text processing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- spaCy: Industrial-strength NLP library
- BioBERT: Pre-trained biomedical language models
- Hugging Face: Model hosting and transformers
- Streamlit: Interactive web application framework
For questions, issues, or collaboration:
- GitHub Issues: Report bugs or request features
- Documentation: See project wiki for detailed guides
Medical NLP Pipeline - Extract insights from clinical text with confidence! π§¬