MedSlice

Repository for MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning (preprint)

Overview

This repository extracts the RCH (History of Present Illness and Interval History) and AP (Assessment and Plan) sections from progress notes. The code provides tools for preprocessing data, running model inference, and generating output reports in CSV format, with the option to produce a PDF that visualizes the sections. Use the sectioning.py script to run these tasks directly from the command line, and the finetuning.py script to fine-tune a model for sectioning. Follow the steps below to set up the required environment and learn how to use the scripts.

⚠️ Please note that because our models were trained on PHI data, we are unable to share them. However, you can reproduce the training steps with the finetuning.py script.

Getting started

  1. Clone the repository and enter it

    git clone https://github.com/lindvalllab/sectioning.git
    cd sectioning
    
  2. Install Conda (if not already installed)
    If you don't have Conda installed, you can download it from Miniconda or Anaconda.

  3. Create and activate the Conda environment
    Use the provided environment.yml file to create the environment. Run the following commands:

    conda env create -f environment.yml
    conda activate sectioning
    
  4. Install vLLM and Unsloth
    You will need to install vLLM and Unsloth with pip:

    pip install vllm
    pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    pip install --no-deps trl peft accelerate bitsandbytes
    

Usage

The entry points for running the workflow are the sectioning.py and finetuning.py scripts. Each script takes specific arguments and can be run directly from the command line.

1. finetuning.py

This script fine-tunes a model on custom data using rsLoRA.

Arguments:

  • model_name: Name of the pre-trained model. Required.
  • data_path: Path to the dataset CSV file. Required.
  • --n_epochs: Number of training epochs. Optional, defaults to 5.
  • --r_lora: LoRA rank. Optional, defaults to 16.
  • --use_rslora: Whether to use rsLoRA. Optional, defaults to True.
  • --output_folder: Folder to save the fine-tuned model. Optional, defaults to "models".
  • --max_seq_length: Maximum sequence length. Optional, defaults to 8192.
  • --load_in_4bit: Whether to load the model in 4-bit precision. Optional, defaults to False.

Example:

python finetuning.py "unsloth/Meta-Llama-3.1-8B-Instruct" data/path/to/finetuning/dataset.csv --n_epochs 5 --r_lora 16
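The argument interface documented above can be sketched with argparse. This is an illustrative sketch of the documented arguments and defaults, not the script's actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the finetuning.py arguments listed above (a sketch; the
    # real script may define its CLI differently).
    parser = argparse.ArgumentParser(description="Fine-tune a model for note sectioning")
    parser.add_argument("model_name", help="Name of the pre-trained model")
    parser.add_argument("data_path", help="Path to the dataset CSV file")
    parser.add_argument("--n_epochs", type=int, default=5, help="Number of training epochs")
    parser.add_argument("--r_lora", type=int, default=16, help="LoRA rank")
    parser.add_argument("--use_rslora", action=argparse.BooleanOptionalAction,
                        default=True, help="Whether to use rsLoRA")
    parser.add_argument("--output_folder", default="models",
                        help="Folder to save the fine-tuned model")
    parser.add_argument("--max_seq_length", type=int, default=8192,
                        help="Maximum sequence length")
    parser.add_argument("--load_in_4bit", action=argparse.BooleanOptionalAction,
                        default=False, help="Load the model in 4-bit precision")
    return parser

# Parse the example invocation from above.
args = build_parser().parse_args(
    ["unsloth/Meta-Llama-3.1-8B-Instruct", "data/dataset.csv", "--n_epochs", "5"]
)
print(args.model_name, args.n_epochs, args.r_lora)
```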

2. sectioning.py

This script runs the sectioning workflow for extracting RCH and AP sections from progress notes.

Arguments:

  • model_path: Path to the trained model directory. Required.
  • data_path: Path to the input data file. Required.
  • sectioned_output_path: Path to save the postprocessed CSV data. Required.
  • --pdf_output_path, -p: Path to save the generated PDF report. If not provided, no PDF will be generated.
  • --note_text_column: Name of the column containing the note text. Optional, defaults to None.

Example:

python sectioning.py models/Meta-Llama-3.1-8B-Instruct /path/to/evaluation/dataset.csv /path/to/output.csv --pdf_output_path /path/to/report.pdf
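Once the start and end indexes of a section are known (the postprocessing step recovers them via fuzzy matching between the LLM output and the input note), the section text is simply a slice of the original note. A minimal sketch, with a toy note standing in for real data:

```python
def extract_section(note_text: str, start: int, end: int) -> str:
    # Slice the section out of the full note using character indexes,
    # clamping the bounds so malformed indexes don't raise.
    start = max(0, start)
    end = min(len(note_text), end)
    return note_text[start:end] if start < end else ""

# Toy note for illustration only.
note = "HPI: Patient reports improvement.\nA/P: Continue current regimen."
ap = extract_section(note, note.find("A/P:"), len(note))
print(ap)  # -> "A/P: Continue current regimen."
```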

Dataset: CORAL

The CORAL dataset can be used as an example for running the sectioning tool in a notebook environment.

  1. Download the CORAL dataset
    The dataset is available on PhysioNet and requires credentialed access. You can download it from PhysioNet - Curated Oncology Reports (CORAL) and place it in the data folder.

  2. Notebook Example
    An example notebook demonstrating how to generate sections on both the annotated and unannotated CORAL datasets can be found in:

    examples/coral.ipynb

    This notebook provides a step-by-step guide to using the sectioning tool interactively instead of running a script.

  3. Output Files
    In the outputs folder, we provide the sectioned CORAL notes as indexes only (no note text), since the CORAL dataset requires credentialed access.

    • For the annotated data, you can merge on the file_number column to retrieve the full dataset.
    • For the unannotated data, you can merge on the coral_idx column to obtain the complete dataframe.
  4. Additional Annotations
    We also provide 50 notes from the unannotated breast dataset, manually annotated by our annotator KS. These annotations can be found in the {section}_start_gt and {section}_end_gt columns.
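The merge described above can be sketched with the standard library; the toy CSV strings below stand in for the index-only outputs we ship and the credentialed CORAL file you download (all columns other than file_number are hypothetical):

```python
import csv
import io

# Toy stand-ins: index-only outputs, and the credentialed CORAL notes.
outputs_csv = "file_number,ap_start,ap_end\n1,10,25\n2,0,15\n"
coral_csv = "file_number,note_text\n1,first note text\n2,second note text\n"

outputs = list(csv.DictReader(io.StringIO(outputs_csv)))
notes = {row["file_number"]: row for row in csv.DictReader(io.StringIO(coral_csv))}

# Merge on file_number to recover the full rows.
merged = [{**row, **notes[row["file_number"]]} for row in outputs]
print(merged[0]["note_text"])  # -> "first note text"
```

With the real files, the same join can be done with a pandas merge on file_number (or coral_idx for the unannotated data).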

Project Organization

  ├── LICENSE                <- GPL-3.0 License
  ├── README.md              <- The top-level README for developers using this project.
  ├── data                   <- A placeholder for your data, one or several csv files.
  ├── environment.yml        <- The requirements file for reproducing the sectioning environment.
  ├── examples               <- Folder containing coral example for using the sectioning tool
  │   └── coral.ipynb              <- Example of the sectioning tool for annotating the CORAL dataset
  ├── models                 <- A placeholder for your models, has to be readable by vLLM.
  ├── sectioning.py          <- Main script to run the sectioning.
  ├── outputs                <- Output placeholder, where our CORAL outputs are stored as indexes.
  │   ├── annotated_breastca_outputs.csv
  │   ├── annotated_pdac_outputs.csv
  │   ├── unannotated_breastca_outputs.csv
  │   ├── unannotated_breastca_outputs_KS_labels.csv
  │   └── unannotated_pdac_outputs.csv
  └── src                    <- Additional source code for use in this project.
     ├── __init__.py               <- Makes src a Python module.
     ├── benchmarking              <- Scripts to benchmark the sectioning tool, when the ground truth is provided.
     │   ├── __init__.py                 <- Makes benchmarking a Python module.
     │   └── scorer.py                   <- Code for the sectioning scorer.
     ├── inference                 <- Scripts to perform inference using vLLM and fuzzy matching.
     │   ├── __init__.py                 <- Makes inference a Python module.
     │   ├── inference.py                <- Code for vLLM inference.
     │   └── output_matching.py          <- Code for fuzzy matching between LLM outputs and input.
     ├── preprocessing             <- Scripts to preprocess the inputs before downstream processing. You can adapt this to your input format and structure.
     │   ├── __init__.py                 <- Makes preprocessing a Python module.
     │   └── preprocessing.py            <- Code for preprocessing the data.
     ├── prompt.txt                <- Prompt passed to the model, as a txt file.
     ├── report                    <- Scripts to report the sections as pdf and csv file.
     │  ├── __init__.py                  <- Makes report a Python module.
     │  ├── pdfgenerator.py              <- Code to generate the PDF file with overlayed LLM sections.
     │  └── postprocessing.py            <- Code to extract the sections as text using indexes found with fuzzy matching.
     └── schema.json               <- Output schema passed to the model, as a json file.

Citation

If you use this repository, please cite our preprint:

@misc{davis2025medslicefinetunedlargelanguage,
      title={MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning}, 
      author={Joshua Davis and Thomas Sounack and Kate Sciacca and Jessie M Brain and Brigitte N Durieux and Nicole D Agaronnik and Charlotta Lindvall},
      year={2025},
      eprint={2501.14105},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.14105}, 
}
