This project implements a medical question summarization system on the MeQSum dataset. It uses Round-Trip Translation (RTT) for data augmentation to diversify the training data, and fine-tunes pre-trained models such as T5 and BART for summarization. Performance is evaluated with metrics such as ROUGE and BLEU.
The goal is to enhance the diversity of the training dataset through RTT, leading to improved summarization performance on medical questions by introducing paraphrased variations.
- Dataset Preparation: Load and preprocess the MeQSum dataset, which includes medical questions and their summaries.
- Round-Trip Translation (RTT) for Data Augmentation: Translate questions into several pivot languages (Spanish, German, Italian, Simplified Chinese, French) and back to English to generate diverse paraphrases.
- Question Selection Using Similarity Metrics:
  - Fréchet Question Distance (FQD)
  - Precision-Recall Question Distance (PRQD)
  - Question Semantic Volume (QSV) [Bonus]
- Summarization Model: Fine-tune and apply pre-trained models (T5, BART) on original and augmented datasets.
- Evaluation and Analysis: Assess summaries using ROUGE-1, ROUGE-2, BLEU, and compare performance.
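The RTT augmentation step above can be sketched as a small loop over pivot languages. The `translate` helper below is a hypothetical stand-in (an identity stub) for whatever MT model or service the notebook actually uses (e.g. MarianMT); only the round-trip structure is shown:

```python
# Sketch of Round-Trip Translation (RTT) augmentation.
# `translate` is a hypothetical stub: a real system would call an MT model
# (e.g. MarianMT) and return an actual translation, so the round trip
# yields a paraphrase rather than the identical string.

PIVOT_LANGS = ["es", "de", "it", "zh", "fr"]  # pivots listed in this project

def translate(text: str, src: str, tgt: str) -> str:
    """Identity stub standing in for a real machine-translation call."""
    return text

def round_trip(question: str, pivot: str) -> str:
    """English -> pivot -> English, producing one paraphrase candidate."""
    return translate(translate(question, "en", pivot), pivot, "en")

def augment(questions):
    """Yield (original, pivot_lang, paraphrase) triples for every pivot."""
    for q in questions:
        for lang in PIVOT_LANGS:
            yield q, lang, round_trip(q, lang)
```

With a real MT backend, each question yields up to five paraphrase candidates, which the selection metrics below then filter.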
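The selection metrics are not specified in detail here. Assuming FQD follows the Fréchet Inception Distance formula applied to sentence embeddings of the two question sets (an assumption, not confirmed by this README), a NumPy-only sketch could look like:

```python
# Sketch of a Fréchet-style distance between two sets of question
# embeddings, assuming FQD mirrors the FID formula:
#   ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)).
# Only the trace of the matrix square root is needed, and
# Tr(sqrtm(C_a C_b)) equals the sum of square roots of the
# eigenvalues of C_a @ C_b, so no scipy is required.
import numpy as np

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Eigenvalues may pick up tiny imaginary/negative parts numerically;
    # keep the real part and clip at zero before taking square roots.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_covmean = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_covmean)
```

A lower distance between the embeddings of augmented and original questions would indicate the paraphrases stay close to the source distribution.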
Clone the repository:

```bash
git clone https://github.com/omidnaeej/RTT-Medical-Question-Summarization.git
cd RTT-Medical-Question-Summarization
```
Install the required packages:

```bash
pip install -r requirements.txt
```
Note: This project requires Python 3.11+. Ensure CUDA support is available if you plan to run the models on a GPU.
Open the Jupyter notebook:

```bash
jupyter notebook RTT-Medical-Question-Summarization.ipynb
```
Run the cells sequentially to preprocess data, perform RTT augmentation, select questions, train/evaluate models, and view results.
The project compares summarization performance on original vs. augmented datasets, showing improvements in evaluation metrics due to RTT augmentation.
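For intuition about what the reported metrics measure, here is a stripped-down ROUGE-1 F1 (whitespace tokenization, no stemming or stopword handling; the notebook presumably uses a full ROUGE package, so this is illustration only):

```python
# Simplified ROUGE-1 F1: unigram overlap between a candidate summary and
# a reference summary. Real evaluations should use a proper ROUGE
# implementation; this sketch only conveys the idea behind the metric.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 follows the same pattern over bigrams, which is why it is the stricter of the two reported scores.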