Automatically assigning 3-digit ICF codes to functional activity descriptions
This open-source software package implements a variety of methods for automatically coding descriptions of mobility activities in text documents, as described in the following paper:
- D Newman-Griffis and E Fosler-Lussier, "Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health". arXiv 2011.13978.
The included makefile provides pre-written commands for preprocessing, experimentation, and analysis of activity report data.
The requirements.txt file lists all required Python packages installable with pip. Just run
pip install -r requirements.txt
to install all packages.
Source code of two packages is required for generating BERT features; these packages (BERT and bert_to_hdf5) are automatically downloaded by the setup target in the makefile.
The processing pipeline in this package includes several primary elements, described here with reference to key code files. (For technical reference on script usage, see the makefile.)
- Dataset preprocessing: tokenization and formatting for analysis.
  - See preprocess_dataset_[spacy|bert] in makefile
- Classification experiments using scikit-learn: experiments under the classification paradigm, using k-Nearest Neighbors, Support Vector Machine, and Deep Neural Network models.
  - See run_classifier in makefile
- Classification experiments using BERT fine-tuning: adaptation of BERT fine-tuning to the ICF coding case.
  - See utils/modified_BERT_run_classifier.py
- Candidate selection experiments: experiments under the candidate selection paradigm.
  - See experiments/candidate_selection
- Detailed analysis of experimental outputs: performance and confusion analysis by ICF code.
  - See analysis/per_code_performance.py
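
The per-code analysis named in the last item can be illustrated with a minimal sketch. It is not the code in analysis/per_code_performance.py. The function name per_code_performance, the toy labels, and the choice of precision, recall, and F1 alongside a gold-by-predicted confusion count are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def per_code_performance(gold, predicted):
    """Return per-code precision/recall/F1 and gold-by-predicted confusion counts.

    Hypothetical helper for illustration; not part of this package.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    confusion = defaultdict(Counter)  # confusion[gold_code][predicted_code] -> count

    for g, p in zip(gold, predicted):
        confusion[g][p] += 1
        if g == p:
            true_pos[g] += 1
        else:
            false_neg[g] += 1  # the gold code was missed
            false_pos[p] += 1  # the predicted code was wrongly assigned

    scores = {}
    for code in sorted(set(gold) | set(predicted)):
        tp, fp, fn = true_pos[code], false_pos[code], false_neg[code]
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[code] = {"precision": precision, "recall": recall, "f1": f1}
    return scores, confusion


# Toy 3-digit ICF code labels, invented for this example.
gold = ["d450", "d450", "d455", "d410", "d410"]
predicted = ["d450", "d455", "d455", "d410", "d450"]

scores, confusion = per_code_performance(gold, predicted)
for code, metrics in scores.items():
    print(code, metrics)
```

Reading confusion[gold_code] shows which codes a given gold code tends to be mistaken for, which is the kind of question a per-code confusion analysis is meant to answer.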
The data used in the accompanying paper are not readily available due to patient confidentiality protections. Requests for information about the data may be directed to [email protected].
However, this package includes two tiny datasets for code demonstration purposes:
- data/demo_datasets/demo_labeled_dataset: 5 short, synthetic clinical documents with mobility-related information. Text files are located in the txt subdirectory, and csv contains corresponding CSV files with standoff annotations.
- data/demo_datasets/demo_unlabeled_dataset: 5 more short, synthetic clinical documents, only one of which contains mobility-related information. Text files are provided without corresponding annotations.
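
As a rough sketch of how standoff annotations relate to a text file, the snippet below pairs character offsets from a CSV with the spans they point to. The column names start, end, and code are assumptions and may not match the demo CSV files, so check the actual headers before adapting it. The function name load_annotated_spans and the sample sentence are likewise hypothetical.

```python
import csv
import io


def load_annotated_spans(document_text, csv_file):
    """Pair each standoff annotation row with the text span it points to.

    Assumes (hypothetically) that each row has integer `start` and `end`
    character offsets into the document and a `code` column.
    """
    spans = []
    for row in csv.DictReader(csv_file):
        start, end = int(row["start"]), int(row["end"])
        if not 0 <= start < end <= len(document_text):
            raise ValueError(f"offsets out of range: {start}-{end}")
        spans.append((document_text[start:end], row["code"]))
    return spans


# In-memory stand-ins for one text file and its annotation file.
document_text = "Patient walks 50 feet with a cane. Denies pain."
annotations = io.StringIO("start,end,code\n0,33,d450\n")

spans = load_annotated_spans(document_text, annotations)
print(spans)
```

To run this against the demo data, replace the in-memory stand-ins with an opened text file and its corresponding CSV file.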
If you use this software in your own work, please cite the following paper:
@article{newman-griffis2020automated,
title={Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health},
author={Newman-Griffis, Denis and Fosler-Lussier, Eric},
journal={arXiv preprint arXiv:2011.13978},
year={2020}
}
All source code, documentation, and data contained in this package are distributed under the terms in the LICENSE file (modified BSD).
