datagodzilla/medical-nlp-lean

Medical NLP - Named Entity Recognition Pipeline

A production-ready Medical Named Entity Recognition (NER) pipeline for extracting biomedical entities from clinical text using spaCy, BioBERT, and advanced template-based pattern matching.


🌟 Features

  • Comprehensive Entity Detection: Diseases, genes, proteins, chemicals, and anatomical terms
  • Advanced Context Analysis: Identifies negated, historical, family history, uncertain, and confirmed conditions
  • Template-Based Matching: 57,476 curated medical terms plus 446 context patterns across 6 specialized templates
  • BioBERT Integration: State-of-the-art biomedical language models for high accuracy
  • Dual Interface: Command-line tool and interactive Streamlit web application
  • Rich Output: 15-column Excel reports with visualizations and JSON export
  • Scope Reversal Detection: Handles complex negation patterns ("no fever but has cough")
  • Production Ready: Comprehensive test suite and robust error handling

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/medical-nlp-lean.git
cd medical-nlp-lean

# Create conda environment
conda env create -f py311_bionlp_environment.yml

# Activate environment
conda activate py311_bionlp

# Install package
pip install -e .

# Download required spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_ner_bc5cdr_md

Basic Usage

Command Line:

# Run NER pipeline on default input
./run_ner_pipeline.sh --run

# Process custom file
./run_ner_pipeline.sh --input data/my_clinical_notes.xlsx --run

Web Interface:

# Launch Streamlit app
./run_app.sh

# Opens at http://localhost:8501

Python API:

from src.enhanced_medical_ner_predictor import MedicalNERPredictor

# Initialize predictor
predictor = MedicalNERPredictor()

# Process text
text = "Patient denies chest pain but reports shortness of breath."
results = predictor.process_text(text)

# Access detected entities
print(results['detected_diseases'])
print(results['negated_entities'])
print(results['confirmed_entities'])

📊 Output

The pipeline generates Excel reports with 15 columns; the main ones are:

| Column | Description |
| --- | --- |
| Visualization | HTML entity highlighting with color-coded labels |
| Detected Diseases | Identified disease/condition entities |
| Disease Count | Total number of diseases detected |
| Detected Genes | Identified gene/protein entities |
| Gene Count | Total number of genes detected |
| Negated Entities | Conditions explicitly denied or absent |
| Historical Entities | Past medical history mentions |
| Family Entities | Family medical history |
| Uncertain Entities | Possible or speculative conditions |
| Confirmed Entities | Explicitly confirmed conditions |
| Section Categories | Clinical note sections (Chief Complaint, Assessment, Plan, etc.) |
| JSON Export | Complete structured data for all entities |
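
As a rough illustration of how list-valued results could be flattened into a report row like the one described above, the sketch below assumes the dictionary returned by `process_text` holds lists of strings. Only `detected_diseases`, `negated_entities`, and `confirmed_entities` appear in this README; the `detected_genes` key and the `to_report_row` helper are hypothetical.

```python
import json


def to_report_row(results):
    """Flatten a results dict into one report row (hypothetical helper)."""
    diseases = results.get("detected_diseases", [])
    genes = results.get("detected_genes", [])  # assumed key name
    return {
        "Detected Diseases": "; ".join(diseases),
        "Disease Count": len(diseases),
        "Detected Genes": "; ".join(genes),
        "Gene Count": len(genes),
        "Negated Entities": "; ".join(results.get("negated_entities", [])),
        "Confirmed Entities": "; ".join(results.get("confirmed_entities", [])),
        # Keep the full structure alongside the flattened columns.
        "JSON Export": json.dumps(results, sort_keys=True),
    }


example = {
    "detected_diseases": ["chest pain", "shortness of breath"],
    "negated_entities": ["chest pain"],
    "confirmed_entities": ["shortness of breath"],
}
row = to_report_row(example)
print(row["Disease Count"], row["Negated Entities"])  # 2 chest pain
```

Keeping the JSON export next to the flattened columns means the row stays readable in Excel while the original structure remains recoverable.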

🎯 Key Capabilities

Medical Entity Recognition

  • Diseases & Conditions: Diabetes, hypertension, pneumonia, cancer types, etc.
  • Genes & Proteins: BRCA1, TP53, kinesin, hemoglobin, etc.
  • Chemicals & Drugs: Aspirin, metformin, chemotherapy agents, etc.
  • Anatomical Terms: Heart, lungs, liver, blood vessels, etc.

Context Classification

  • Negation Detection: "No evidence of diabetes", "denies chest pain"
  • Historical Context: "History of hypertension", "previous stroke"
  • Family History: "Mother has breast cancer", "family history of diabetes"
  • Uncertainty: "Possible pneumonia", "rule out myocardial infarction"
  • Scope Reversal: "No fever but has cough" (correctly identifies cough as confirmed)
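
The context rules above can be sketched with a small clause-based classifier. This is a simplified stand-in rather than the pipeline's actual logic: the trigger phrases are a handful taken from the examples above, and the `classify` function, the `TRIGGERS` table, and the list of scope-breaking conjunctions are all assumptions made for illustration.

```python
import re

# Illustrative trigger phrases only; the real templates are far larger.
# Order matters: "family history of" must be checked before "history of".
TRIGGERS = {
    "negated": ["no evidence of", "denies", "no"],
    "family": ["family history of", "mother has"],
    "historical": ["history of", "previous"],
    "uncertain": ["possible", "rule out"],
}
# Conjunctions that end the scope of a preceding trigger (assumed list).
SCOPE_BREAKERS = r"\b(?:but|however|although)\b"


def classify(text, entities):
    """Label each lowercase entity by the first trigger preceding it in its clause."""
    labels = {}
    for clause in re.split(SCOPE_BREAKERS, text.lower()):
        for entity in entities:
            if entity not in clause:
                continue
            before = clause[: clause.index(entity)]
            labels[entity] = "confirmed"  # default when no trigger applies
            for label, phrases in TRIGGERS.items():
                if any(re.search(rf"\b{re.escape(p)}\b", before) for p in phrases):
                    labels[entity] = label
                    break
    return labels


print(classify("No fever but has cough", ["fever", "cough"]))
# {'fever': 'negated', 'cough': 'confirmed'}
```

Splitting on "but" before looking for triggers is what keeps the negation in the first clause from leaking onto "cough" in the second.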

Template System

  • target_rules_template.xlsx: 57,476 curated medical terms
  • negated_rules_template.xlsx: 99 negation patterns
  • historical_rules_template.xlsx: 82 historical context patterns
  • family_rules_template.xlsx: 79 family history patterns
  • uncertainty_rules_template.xlsx: 48 uncertainty patterns
  • confirmed_rules_template.xlsx: 138 confirmation patterns
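
To show how a list of curated terms might be matched against text, here is a minimal sketch that uses an in-memory list in place of `target_rules_template.xlsx`, since the spreadsheet's column layout is not documented here. The `build_term_matcher` and `find_terms` helpers are hypothetical and are not part of the package.

```python
import re


def build_term_matcher(terms):
    """Compile one case-insensitive regex that prefers longer terms."""
    # Longest first, so "breast cancer" wins over the shorter "cancer".
    ordered = sorted(set(terms), key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def find_terms(matcher, text):
    """Return (start, end, matched_text) for each non-overlapping hit."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


terms = ["cancer", "breast cancer", "diabetes", "hypertension"]
matcher = build_term_matcher(terms)
hits = find_terms(matcher, "Mother has breast cancer; patient has Diabetes.")
print([h[2] for h in hits])  # ['breast cancer', 'Diabetes']
```

A single regex alternation is fine for a demonstration, but with tens of thousands of terms a trie or spaCy's `PhraseMatcher` would likely scale better.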

🏗️ Architecture

medical-nlp-lean/
├── src/                        # Core Python modules
│   ├── enhanced_medical_ner_predictor.py
│   └── performance_analyzer.py
├── app/                        # Streamlit web application
│   └── medical_nlp_app.py
├── data/
│   ├── external/              # Template files
│   └── raw/                   # Input data
├── models/
│   └── pretrained/            # BioBERT models (~1.6GB)
├── output/                    # Generated results
│   ├── results/              # Excel outputs
│   ├── visualizations/       # PNG visualizations
│   └── logs/                 # Execution logs
├── tests/                     # Comprehensive test suite
└── configs/                   # Configuration files

🧪 Testing

Run the comprehensive test suite to validate installation:

# Run all tests
./run_tests.sh

# Quick validation
./run_tests.sh --quick

# Specific test category
python tests/master_test_script.py --category scope_reversal

Test Categories:

  • Scope reversal detection (103 patterns)
  • Template pattern validation
  • Context classification
  • Negation detection
  • Output formatting
  • UI consistency
  • Pipeline integration

⚙️ Configuration

Customize pipeline behavior in configs/pipeline_config.yaml:

pipeline:
  confidence_thresholds:
    curated_templates: 0.3    # Lower threshold for template matches
    general_patterns: 0.5     # Higher threshold for general patterns
  proximity_weighting:
    max_boost: 0.3           # Confidence boost for nearby matches

models:
  disease_model: "models/pretrained/Disease"
  chemical_model: "models/pretrained/Chemical"
  gene_model: "models/pretrained/Gene"
  spacy_model: "en_core_web_sm"
  biomedical_model: "en_ner_bc5cdr_md"
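
This README does not spell out how the thresholds and the proximity boost interact. One plausible reading, sketched below purely as an assumption, is that a match receives a boost of up to `max_boost` that decays with its distance from a nearby match and is kept if the boosted score reaches the threshold for its source. The `proximity_boost` and `accept` functions, the linear decay, and the 10-token window are all hypothetical.

```python
# Values copied from the config above; how they are combined is assumed.
THRESHOLDS = {"curated_templates": 0.3, "general_patterns": 0.5}
MAX_BOOST = 0.3


def proximity_boost(distance, window=10, max_boost=MAX_BOOST):
    """Decay linearly from max_boost at distance 0 to zero at `window` tokens."""
    distance = abs(distance)
    if distance >= window:
        return 0.0
    return max_boost * (1 - distance / window)


def accept(source, confidence, distance):
    """Keep a match if its boosted confidence reaches the source's threshold."""
    score = min(1.0, confidence + proximity_boost(distance))
    return score >= THRESHOLDS[source]


print(accept("general_patterns", 0.4, 0))    # True: 0.4 + 0.3 clears 0.5
print(accept("general_patterns", 0.4, 10))   # False: no boost, 0.4 < 0.5
print(accept("curated_templates", 0.35, 10)) # True: 0.35 clears 0.3
```

Under this reading, the lower threshold for curated templates lets trusted terms through without any nearby support, while general patterns need either higher confidence or a close neighbor.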

📈 Performance

  • Processing Speed: ~100 clinical notes in <1 minute
  • Memory Usage: ~2GB for typical workloads
  • Accuracy: 95%+ for medical entity detection
  • Models: 3 BioBERT models (~1.6GB total)

📚 Documentation

  • Installation Guide: Complete setup instructions
  • API Reference: Python API documentation
  • Template Guide: How to customize medical term templates
  • Configuration: Pipeline configuration options
  • Examples: Sample clinical text processing

🀝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • spaCy: Industrial-strength NLP library
  • BioBERT: Pre-trained biomedical language models
  • Hugging Face: Model hosting and transformers
  • Streamlit: Interactive web application framework

📧 Contact

For questions, issues, or collaboration:


Medical NLP Pipeline - Extract insights from clinical text with confidence! 🧬