RTT-Medical-Question-Summarization: Round-Trip Translation Augmented Medical Question Summarization

Description

This project implements a medical question summarization system built on the MeQSum dataset. It uses Round-Trip Translation (RTT) to augment the training data with paraphrased questions, fine-tunes pre-trained models (T5 and BART) for summarization, and evaluates the generated summaries with ROUGE-1, ROUGE-2, and BLEU.
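The round-trip step can be sketched as follows. This is a minimal illustration, not the notebook's code: the `translate` callable, the `Translator` type, and the language codes in `PIVOTS` are assumptions made for the example, and any translation backend with the same signature could be plugged in.

```python
from typing import Callable, Iterable

# Hypothetical translator signature: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]

# Pivot languages from the project outline, written as assumed language codes.
PIVOTS = ("es", "de", "it", "zh", "fr")


def round_trip(question: str, translate: Translator,
               pivots: Iterable[str] = PIVOTS) -> list[str]:
    """Return distinct paraphrases of `question` produced via en -> pivot -> en."""
    # Track lowercased text so we skip copies of the original and repeated outputs.
    seen = {question.strip().lower()}
    paraphrases = []
    for lang in pivots:
        forward = translate(question, "en", lang)
        back = translate(forward, lang, "en").strip()
        key = back.lower()
        if back and key not in seen:
            seen.add(key)
            paraphrases.append(back)
    return paraphrases
```

Each returned paraphrase would be paired with the original question's reference summary to form an additional training example.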

Motivation

RTT introduces paraphrased variations of each question. The goal is to make the training dataset more diverse and thereby improve summarization performance on medical questions.

Project Outline

  1. Dataset Preparation: Load and preprocess the MeQSum dataset, which includes medical questions and their summaries.
  2. Round-Trip Translation (RTT) for Data Augmentation: Translate questions to multiple languages (Spanish, German, Italian, Chinese Simplified, French) and back to English to generate diverse paraphrases.
  3. Question Selection Using Similarity Metrics:
    • Fréchet Question Distance (FQD)
    • Precision Recall Question Distance (PRQD)
    • Question Semantic Volume (QSV) [Bonus]
  4. Summarization Model: Fine-tune and apply pre-trained models (T5, BART) on original and augmented datasets.
  5. Evaluation and Analysis: Assess summaries using ROUGE-1, ROUGE-2, and BLEU, and compare the performance of models trained on the original and augmented datasets.
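To give a feel for the selection step, here is a rough sketch of a Fréchet-style distance between two sets of question embeddings. This README does not define FQD, so the sketch assumes it follows the usual Fréchet distance between Gaussians fitted to embeddings. It further simplifies by treating the covariances as diagonal so that it runs on the standard library alone. The function name and the toy vectors are illustrative only.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(a: list[list[float]], b: list[list[float]]) -> float:
    """Squared Fréchet distance between diagonal Gaussians fitted to two embedding sets.

    Assumption: covariances are diagonal, so the trace term reduces to the
    sum of squared differences between per-dimension standard deviations.
    """
    if not a or not b or len(a[0]) != len(b[0]):
        raise ValueError("both sets must be non-empty and share one dimensionality")
    total = 0.0
    # zip(*rows) transposes the rows into per-dimension columns.
    for col_a, col_b in zip(zip(*a), zip(*b)):
        mean_gap = fmean(col_a) - fmean(col_b)
        std_gap = math.sqrt(pvariance(col_a)) - math.sqrt(pvariance(col_b))
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Toy 2-D "embeddings": the second set is the first shifted by (1, 1).
original = [[0.0, 0.0], [2.0, 0.0]]
augmented = [[1.0, 1.0], [3.0, 1.0]]
distance = diagonal_frechet_distance(original, augmented)
```

A lower value means the augmented questions lie closer, in embedding space, to the original distribution. A full implementation would use the complete covariance matrices and a matrix square root.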

Installation

  1. Clone the repository:

    git clone https://github.com/omidnaeej/RTT-Medical-Question-Summarization.git
    cd RTT-Medical-Question-Summarization
    
  2. Install the required packages:

    pip install -r requirements.txt
    

Note: This project requires Python 3.11 or later. To run the models on a GPU, make sure CUDA support is available.

Usage

  1. Open the Jupyter notebook:

    jupyter notebook RTT-Medical-Question-Summarization.ipynb
    
  2. Run the cells sequentially to preprocess data, perform RTT augmentation, select questions, train/evaluate models, and view results.

Results

The project compares summarization performance on original vs. augmented datasets, showing improvements in evaluation metrics due to RTT augmentation.
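For intuition about what the ROUGE scores measure, the sketch below computes a simplified ROUGE-N F1 from n-gram overlap. It is an illustration, not the notebook's evaluation code: it assumes lowercased whitespace tokenization with no stemming, so its values can differ from those of an established ROUGE library. The example sentences are made up.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: clipped n-gram overlap between two strings."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "what causes migraine headaches"
candidate = "what causes headaches"
r1 = rouge_n_f1(reference, candidate, n=1)
r2 = rouge_n_f1(reference, candidate, n=2)
```

Averaging such scores over the test set, once for the model trained on the original data and once for the model trained on the augmented data, gives the kind of comparison described above.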
