HealthScribe3000 is a multi-phase framework that performs perspective classification followed by perspective-aware summarization of healthcare QA answers. It aims to produce summaries that reflect the context, tone, and intent of each answer, motivated by real-world medical communication needs.
In healthcare, different answers to the same question can reflect various perspectives such as:
- 💡 Information — factual statements
- 🎯 Suggestion — advice or instructions
- ⚠️ Cause — reasons or explanations
- ✅ Query — affirming a question
- 👤 Experience — sharing own experience
This project provides a pipeline that:
- Identifies such perspectives in QA pairs using BERT-based classification.
- Generates summaries for each detected perspective using a fine-tuned Pegasus model with structured prompts.
**Perspective classifier (BERT-based)**
- Encodes QA pairs using a BERT-based encoder
- Predicts perspective labels using a linear classification head
- Trained with `BCEWithLogitsLoss` (sketched below)
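A minimal sketch of this phase, assuming a `bert-base-uncased` encoder and the five perspectives listed above; the checkpoint name and tokenization details are assumptions, and the code in `models/` and `training/` is authoritative:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

PERSPECTIVES = ["INFORMATION", "SUGGESTION", "CAUSE", "QUERY", "EXPERIENCE"]

class PerspectiveClassifier(nn.Module):
    """BERT encoder + linear head for multi-label perspective prediction."""

    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, len(PERSPECTIVES))

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token representation of the "question [SEP] answer" pair.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # raw logits, one per perspective

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PerspectiveClassifier()
criterion = nn.BCEWithLogitsLoss()  # multi-label training objective

batch = tokenizer(
    ["What is the treatment for gestational diabetes?"],
    ["Treatment involves healthy eating and regular physical activity..."],
    padding=True, truncation=True, return_tensors="pt",
)
logits = model(batch["input_ids"], batch["attention_mask"])

# Multi-hot target: this answer carries the SUGGESTION perspective.
targets = torch.zeros_like(logits)
targets[0, PERSPECTIVES.index("SUGGESTION")] = 1.0
loss = criterion(logits, targets)
```

At inference time, the predicted perspectives would typically be obtained by thresholding a sigmoid over these logits.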
**Perspective-aware summarizer (Pegasus)**
- Constructs prompts from the question, the answer, and the predicted perspectives
- Uses Pegasus to extract perspective-specific spans and summarize them
- Trained with `CrossEntropyLoss` (sketched below)
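A minimal sketch of this phase, assuming an off-the-shelf `google/pegasus-xsum` checkpoint and a placeholder reference summary; the project fine-tunes its own Pegasus weights on structured prompts like the example shown further below:

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Checkpoint name is an assumption; the project uses its own fine-tuned weights.
checkpoint = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(checkpoint)
model = PegasusForConditionalGeneration.from_pretrained(checkpoint)

prompt = (
    "Summarize the responses to the health question below. "
    "Focus on highlighting insights from the SUGGESTION perspective.\n"
    "Question: What is the treatment for gestational diabetes?\n"
    "Answer: Treatment involves healthy eating and regular physical activity..."
)
reference = "Adopt a healthy diet and stay physically active."  # placeholder target summary

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
labels = tokenizer(reference, return_tensors="pt", truncation=True).input_ids

# Passing labels makes the model compute the token-level CrossEntropyLoss used for training.
loss = model(**inputs, labels=labels).loss

# At inference time, the perspective-aware summary is generated from the same prompt format.
summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```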
HEALTHSCRIBE3000/
└── multiperspect_health/
├── A. Project Proposal & Baseline/
├── config/
├── data/
├── inference/
├── models/
├── modules/
├── saved_models/
├── training/
├── utils/
├── main.py
├── requirements.txt
├── .gitignore
├── architecture.jpeg
└── README.md
We use a custom dataset of 3167 healthcare questions and 9987 answers, each annotated with:
- Answer spans for each perspective
- Perspective-wise summaries
Files: `train.json`, `valid.json`, and `test.json` in `/data/`
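For orientation, a record might look roughly like the sketch below; the field names are illustrative assumptions rather than the dataset's actual schema, so consult the files in `/data/` for the real format:

```python
import json

# Hypothetical record layout, for illustration only.
example_record = {
    "question": "What is the treatment for gestational diabetes?",
    "answers": ["Treatment involves healthy eating and regular physical activity..."],
    "perspectives": ["SUGGESTION"],
    "spans": {"SUGGESTION": ["healthy eating and regular physical activity"]},
    "summaries": {"SUGGESTION": "Adopt a healthy diet and exercise regularly."},
}

# The real splits are plain JSON and can be loaded directly:
with open("data/train.json") as f:
    train_data = json.load(f)
```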
- `git clone https://github.com/theshamiksinha/HealthScribe3000.git`
- `cd HealthScribe3000`
- `pip install -r requirements.txt`

The entire pipeline can be executed with a single command:
`python main.py`

This will:
- Train or load the perspective classifier
- Predict perspectives on the test dataset
- Train or load the summarization model
- Generate perspective-aware summaries
Alternatively, you can run individual components:
- `python training/train_classifier.py --config config/config.yaml`
- `python training/train_llm.py --config config/config.yaml`
- `python inference/evaluate_summariser.py --config config/config.yaml`

| Module | Model / Technique |
|---|---|
| Perspective Classification | BERT + Linear Layers |
| Summarization | Pegasus |
| Prompting Style | Template-based |
| Span Extraction | Implicit via generated prompt |
| Loss Functions | BCEWithLogits, CrossEntropy |
| Evaluation | ROUGE, BLEU, METEOR, BERTScore |
You can find post-training metrics under:
`/eval_after_training/metrics.txt`
Includes:
- ROUGE-1, ROUGE-2, ROUGE-L
- BLEU
- METEOR
- BERTScore
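These metrics can be recomputed with the Hugging Face `evaluate` library; the prediction/reference strings below are placeholders, not outputs of this project:

```python
import evaluate

predictions = ["Adopt a healthy diet and stay physically active."]
references = ["Eat healthily and exercise regularly to manage gestational diabetes."]

rouge = evaluate.load("rouge")          # ROUGE-1 / ROUGE-2 / ROUGE-L
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```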
Summarize the responses to the health question below.
Focus on highlighting insights from the SUGGESTION perspective. Use an Advisory, Recommending tone.
Be clear and concise. Perspective Definition: Advice or recommendations to assist users in making informed medical decisions, solving problems, or improving health issues.
Question: What is the treatment for gestational diabetes?
Answer: Treatment involves healthy eating and regular physical activity...
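A sketch of how such a prompt can be assembled from a template; the tone and definition strings come from the example above, while the dictionary and function names are illustrative rather than the project's actual code:

```python
# Illustrative template assembly; per-perspective tones and definitions would
# normally come from the project's config rather than a hard-coded dictionary.
PERSPECTIVE_INFO = {
    "SUGGESTION": {
        "tone": "Advisory, Recommending",
        "definition": (
            "Advice or recommendations to assist users in making informed "
            "medical decisions, solving problems, or improving health issues."
        ),
    },
}

def build_prompt(question: str, answer: str, perspective: str) -> str:
    info = PERSPECTIVE_INFO[perspective]
    return (
        "Summarize the responses to the health question below.\n"
        f"Focus on highlighting insights from the {perspective} perspective. "
        f"Use an {info['tone']} tone.\n"
        f"Be clear and concise. Perspective Definition: {info['definition']}\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )

print(build_prompt(
    "What is the treatment for gestational diabetes?",
    "Treatment involves healthy eating and regular physical activity...",
    "SUGGESTION",
))
```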
- Extending the model to handle multilingual healthcare content
- Incorporating domain-specific medical knowledge bases
- Building an interactive demo for clinical usage
- Exploring few-shot capabilities for rare medical conditions
- Shamik Sinha – @theshamiksinha
- Vansh Yadav – @vansh22559
- Shrutya Chawla – @shrutya22487
This project is licensed under the MIT License - see the LICENSE file for details.
If you find this useful, leave a ⭐ on GitHub!
