Medical Named Entity Recognition (MedNER) is a deep learning-based project designed to extract medical entities from text using a fine-tuned BERT model. This project utilizes the Hugging Face transformers library to identify named entities such as diseases, medications, genes, and other biomedical terms.

Dataset
- The dataset is sourced from parsa-mhmdi/Medical_NER on Hugging Face.
- It consists of tokenized medical text with annotated named entities in the IOB format.
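
To make the IOB format concrete, here is a minimal sketch that groups per-token tags into entity spans. The sentence and the label names (`SYMPTOM`, `MEDICATION`) are made up for illustration; the dataset's actual tag set may differ.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # A span continues only on an I- tag carrying the open span's label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # A stray I- tag with no open span is treated as a span start.
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example; not taken from the dataset.
tokens = ["Patient", "denies", "chest", "pain", "after", "taking", "aspirin", "."]
tags = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "O", "B-MEDICATION", "O"]
for label, start, end in iob_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

`B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity, so "chest pain" comes back as a single two-token span.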

Model
- A fine-tuned bert-base-cased model is used for Named Entity Recognition (NER).
- The model is trained using the Hugging Face Trainer API.
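
As a rough sketch of how a token-classification head might be attached to bert-base-cased, the snippet below builds the label mappings and passes them to `AutoModelForTokenClassification.from_pretrained`. The tag list and the helper names `build_label_maps` and `load_model` are assumptions for illustration; the real labels come from the dataset.

```python
from typing import Dict, List, Tuple


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Return (id2label, label2id) for a list of tag names."""
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_model(labels: List[str], checkpoint: str = "bert-base-cased"):
    """Load the pretrained encoder with a fresh token-classification head."""
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import AutoModelForTokenClassification

    id2label, label2id = build_label_maps(labels)
    return AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )


# Hypothetical tag set; the real one comes from the dataset's label feature.
LABELS = ["O", "B-DISEASE", "I-DISEASE", "B-MEDICATION", "I-MEDICATION"]
id2label, label2id = build_label_maps(LABELS)
print(label2id["B-DISEASE"], id2label[4])
```

Storing the mappings in the model configuration lets the saved config.json translate predicted class ids back into readable tags at inference time.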

Training Pipeline
- Tokenization using AutoTokenizer from Hugging Face.
- Data alignment to match tokenized input with entity labels.
- Training with evaluation and model selection based on best validation performance.
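
The alignment step exists because BERT's tokenizer splits words into sub-word pieces, so one label per word has to become one label per token. The sketch below shows one common convention, which may not be exactly what this project does: label the first piece of each word and mask the rest with -100, the default index ignored by PyTorch's cross-entropy loss. The `word_ids` list is hand-written here; with a fast tokenizer it would come from `tokenizer(words, is_split_into_words=True).word_ids()`.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand per-word label ids to per-token label ids."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] map to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Remaining sub-tokens are masked out of the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical input: three words, the second split into three sub-tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 1, 0]  # label ids, one per word
print(align_labels(word_ids, word_labels))  # [-100, 0, 1, -100, -100, 0, -100]
```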

Deployment
- The trained model is deployed as a Hugging Face Space using Gradio.
- A web-based interactive demo is provided for real-time text analysis.

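
For orientation, a Gradio front end for an NER model could look roughly like the sketch below. This is not the contents of app.py: the model directory `./ner_model`, the function names, and the widget labels are assumptions. It wires a `token-classification` pipeline to a `gr.HighlightedText` output.

```python
from typing import Dict, List, Optional, Tuple


def to_highlighted(text: str, entities: List[Dict]) -> List[Tuple[str, Optional[str]]]:
    """Convert character-offset entities into (segment, label) pairs."""
    segments = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] > cursor:
            # Unlabelled text between two entities.
            segments.append((text[cursor:ent["start"]], None))
        segments.append((text[ent["start"]:ent["end"]], ent["entity_group"]))
        cursor = ent["end"]
    if cursor < len(text):
        segments.append((text[cursor:], None))
    return segments


def build_demo(model_dir: str = "./ner_model"):
    """Build the interface; model_dir is an assumed local checkpoint path."""
    # Imported lazily so to_highlighted can be used without these packages.
    import gradio as gr
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word pieces into whole entities.
    ner = pipeline("token-classification", model=model_dir, aggregation_strategy="simple")

    def analyze(text: str):
        return to_highlighted(text, ner(text))

    return gr.Interface(
        fn=analyze,
        inputs=gr.Textbox(lines=4, label="Medical text"),
        outputs=gr.HighlightedText(label="Entities"),
    )


# Hand-written entities in the shape the pipeline returns, for illustration.
sample_text = "Aspirin for chest pain"
sample_entities = [
    {"entity_group": "SYMPTOM", "start": 12, "end": 22},
    {"entity_group": "MEDICATION", "start": 0, "end": 7},
]
print(to_highlighted(sample_text, sample_entities))
```

Calling `build_demo().launch()` would start the local web server.
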
This repository contains the following essential files:
- .git - Version control folder (not necessary for direct use).
- .gradio - Configuration files for Gradio interface settings.
- .gitattributes - Defines Git LFS tracking for large files.
- app.py - Main script for running the Gradio interface.
- config.json - Configuration file for the model, specifying hyperparameters.
- README.md - Documentation containing project details and usage instructions.
- requirements.txt - Lists all dependencies required to run the project.
- tokenizer.json - Tokenizer configuration containing vocabulary and model-specific settings.
- tokenizer_config.json - Configuration settings for the tokenizer.
- trainer_code.ipynb - Jupyter Notebook containing training scripts and model fine-tuning process.
- vocab.txt - Vocabulary file used by the tokenizer.

To run the project locally, clone the repository and install dependencies:

    git clone https://huggingface.co/spaces/parsa-mhmdi/MedNER
    cd MedNER
    pip install -r requirements.txt

Run the application using:

    python app.py

This will launch a Gradio interface where you can enter medical text to identify named entities.

To fine-tune the model yourself, run the training script:

    python train.py

This will:
- Load the dataset
- Tokenize and preprocess text
- Fine-tune the bert-base-cased model
- Save the best-performing model checkpoint

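
Picking a best-performing checkpoint requires a single validation score per checkpoint. The source does not say which metric is used, so the sketch below assumes entity-level F1, a common choice for NER, and uses made-up scores to show the selection. With the Trainer API this kind of selection is usually delegated to the `load_best_model_at_end` and `metric_for_best_model` training arguments.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end)


def span_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1: a prediction counts only if label and boundaries match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: the second prediction has a wrong start boundary.
gold = {("DISEASE", 2, 4), ("MEDICATION", 6, 7)}
pred = {("DISEASE", 2, 4), ("MEDICATION", 5, 7)}
print(span_f1(gold, pred))  # 0.5

# Hypothetical validation scores per saved checkpoint.
scores_by_checkpoint = {"checkpoint-100": 0.71, "checkpoint-200": 0.78, "checkpoint-300": 0.76}
best_checkpoint = max(scores_by_checkpoint, key=scores_by_checkpoint.get)
print(best_checkpoint)  # checkpoint-200
```
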
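To save storage space, the best model is compressed and uploaded to Hugging Face:

    import shutil
    # Creates ./ner_model_compressed.zip from the contents of ./ner_model
    shutil.make_archive("./ner_model_compressed", "zip", "./ner_model")

The compressed model is then uploaded to the repository. The archive is a single file, so upload_file is used here; upload_folder expects a directory path. If the target is the Space rather than a model repository, also pass repo_type="space".

    from huggingface_hub import upload_file
    upload_file(path_or_fileobj="./ner_model_compressed.zip",
                path_in_repo="ner_model_compressed.zip", repo_id="parsa-mhmdi/MedNER")

Try the live demo of MedNER on Hugging Face Spaces: https://huggingface.co/spaces/parsa-mhmdi/MedNER
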
We welcome contributions! Feel free to fork the repository and submit a pull request with improvements.
This project is open-source and available under the MIT License.