Medical NLP - Named Entity Recognition Pipeline

A production-ready Medical Named Entity Recognition (NER) pipeline for extracting biomedical entities from clinical text using spaCy, BioBERT, and advanced template-based pattern matching.

🌟 Features

Comprehensive Entity Detection: Diseases, genes, proteins, chemicals, and anatomical terms
Advanced Context Analysis: Identifies negated, historical, family history, uncertain, and confirmed conditions
Template-Based Matching: 57,476+ curated medical terms across 6 specialized templates
BioBERT Integration: State-of-the-art biomedical language models for high accuracy
Dual Interface: Command-line tool and interactive Streamlit web application
Rich Output: 15-column Excel reports with visualizations and JSON export
Scope Reversal Detection: Handles complex negation patterns ("no fever but has cough")
Production Ready: Comprehensive test suite and robust error handling

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/medical-nlp-lean.git
cd medical-nlp-lean

# Create conda environment
conda env create -f py311_bionlp_environment.yml

# Activate environment
conda activate py311_bionlp

# Install package
pip install -e .

# Download required spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_ner_bc5cdr_md

Basic Usage

Command Line:

# Run NER pipeline on default input
./run_ner_pipeline.sh --run

# Process custom file
./run_ner_pipeline.sh --input data/my_clinical_notes.xlsx --run

Web Interface:

# Launch Streamlit app
./run_app.sh

# Opens at http://localhost:8501

Python API:

from src.enhanced_medical_ner_predictor import MedicalNERPredictor

# Initialize predictor
predictor = MedicalNERPredictor()

# Process text
text = "Patient denies chest pain but reports shortness of breath."
results = predictor.process_text(text)

# Access detected entities
print(results['detected_diseases'])
print(results['negated_entities'])
print(results['confirmed_entities'])

📊 Output

The pipeline generates comprehensive Excel reports with 15 columns:

Column	Description
Visualization	HTML entity highlighting with color-coded labels
Detected Diseases	Identified disease/condition entities
Disease Count	Total number of diseases detected
Detected Genes	Identified gene/protein entities
Gene Count	Total number of genes detected
Negated Entities	Conditions explicitly denied or absent
Historical Entities	Past medical history mentions
Family Entities	Family medical history
Uncertain Entities	Possible or speculative conditions
Confirmed Entities	Explicitly confirmed conditions
Section Categories	Clinical note sections (Chief Complaint, Assessment, Plan, etc.)
JSON Export	Complete structured data for all entities

🎯 Key Capabilities

Medical Entity Recognition

Diseases & Conditions: Diabetes, hypertension, pneumonia, cancer types, etc.
Genes & Proteins: BRCA1, TP53, kinesin, hemoglobin, etc.
Chemicals & Drugs: Aspirin, metformin, chemotherapy agents, etc.
Anatomical Terms: Heart, lungs, liver, blood vessels, etc.

Context Classification

Negation Detection: "No evidence of diabetes", "denies chest pain"
Historical Context: "History of hypertension", "previous stroke"
Family History: "Mother has breast cancer", "family history of diabetes"
Uncertainty: "Possible pneumonia", "rule out myocardial infarction"
Scope Reversal: "No fever but has cough" (correctly identifies cough as confirmed)

Template System

target_rules_template.xlsx: 57,476 curated medical terms
negated_rules_template.xlsx: 99 negation patterns
historical_rules_template.xlsx: 82 historical context patterns
family_rules_template.xlsx: 79 family history patterns
uncertainty_rules_template.xlsx: 48 uncertainty patterns
confirmed_rules_template.xlsx: 138 confirmation patterns

🏗️ Architecture

medical-nlp-lean/
├── src/                        # Core Python modules
│   ├── enhanced_medical_ner_predictor.py
│   └── performance_analyzer.py
├── app/                        # Streamlit web application
│   └── medical_nlp_app.py
├── data/
│   ├── external/              # Template files
│   └── raw/                   # Input data
├── models/
│   └── pretrained/            # BioBERT models (~1.6GB)
├── output/                    # Generated results
│   ├── results/              # Excel outputs
│   ├── visualizations/       # PNG visualizations
│   └── logs/                 # Execution logs
├── tests/                     # Comprehensive test suite
└── configs/                   # Configuration files

🧪 Testing

Run the comprehensive test suite to validate installation:

# Run all tests
./run_tests.sh

# Quick validation
./run_tests.sh --quick

# Specific test category
python tests/master_test_script.py --category scope_reversal

Test Categories:

Scope reversal detection (103 patterns)
Template pattern validation
Context classification
Negation detection
Output formatting
UI consistency
Pipeline integration

⚙️ Configuration

Customize pipeline behavior in configs/pipeline_config.yaml:

pipeline:
  confidence_thresholds:
    curated_templates: 0.3    # Lower threshold for template matches
    general_patterns: 0.5     # Higher threshold for general patterns
  proximity_weighting:
    max_boost: 0.3           # Confidence boost for nearby matches

models:
  disease_model: "models/pretrained/Disease"
  chemical_model: "models/pretrained/Chemical"
  gene_model: "models/pretrained/Gene"
  spacy_model: "en_core_web_sm"
  biomedical_model: "en_ner_bc5cdr_md"

📈 Performance

Processing Speed: ~100 clinical notes in <1 minute
Memory Usage: ~2GB for typical workloads
Accuracy: 95%+ for medical entity detection
Models: 3 BioBERT models (~1.6GB total)

📚 Documentation

Installation Guide: Complete setup instructions
API Reference: Python API documentation
Template Guide: How to customize medical term templates
Configuration: Pipeline configuration options
Examples: Sample clinical text processing

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

spaCy: Industrial-strength NLP library
BioBERT: Pre-trained biomedical language models
Hugging Face: Model hosting and transformers
Streamlit: Interactive web application framework

📧 Contact

For questions, issues, or collaboration:

GitHub Issues: Report bugs or request features
Documentation: See project wiki for detailed guides

Medical NLP Pipeline - Extract insights from clinical text with confidence! 🧬

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Medical NLP - Named Entity Recognition Pipeline

🌟 Features

🚀 Quick Start

Installation

Basic Usage

📊 Output

🎯 Key Capabilities

Medical Entity Recognition

Context Classification

Template System

🏗️ Architecture

🧪 Testing

⚙️ Configuration

📈 Performance

📚 Documentation

🤝 Contributing

📄 License

🙏 Acknowledgments

📧 Contact

About

Uh oh!

Releases 1

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
app		app
data		data
docs		docs
models/pretrained		models/pretrained
output		output
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
activate_py311_bionlp.sh		activate_py311_bionlp.sh
cleanup_old_outputs.sh		cleanup_old_outputs.sh
cleanup_test_logs.sh		cleanup_test_logs.sh
install.sh		install.sh
py311_bionlp_environment.yml		py311_bionlp_environment.yml
requirements.txt		requirements.txt
run_app.sh		run_app.sh
run_ner_pipeline.sh		run_ner_pipeline.sh
run_tests.sh		run_tests.sh
setup.py		setup.py

License

datagodzilla/medical-nlp-lean

Folders and files

Latest commit

History

Repository files navigation

Medical NLP - Named Entity Recognition Pipeline

🌟 Features

🚀 Quick Start

Installation

Basic Usage

📊 Output

🎯 Key Capabilities

Medical Entity Recognition

Context Classification

Template System

🏗️ Architecture

🧪 Testing

⚙️ Configuration

📈 Performance

📚 Documentation

🤝 Contributing

📄 License

🙏 Acknowledgments

📧 Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Uh oh!

Languages

Packages