This repository contains data and a BioBERT-based NER model (monologg/biobert_v1.1_pubmed, a community-uploaded Hugging Face model) for detecting entities such as chemicals and diseases.
Create a Conda environment called "Ktrain_NER" with Python 3.7.0:
conda create -n Ktrain_NER python=3.7.0
Activate the Conda environment:
conda activate Ktrain_NER
Install the required packages:

$ pip install tensorflow==2.1.0
$ pip install torch==1.4.0
$ pip install ktrain==0.12.0

If you want to convert your IOB-schemed data to the BILOU scheme using iobToBilou.py in the utilities folder, install spaCy with the command below:

$ conda install -c conda-forge spacy
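For reference, a minimal sketch of what an IOB-to-BILOU conversion does; this is illustrative only, and the function name is hypothetical rather than the actual contents of iobToBilou.py:

```python
# Illustrative IOB -> BILOU converter; not the actual utilities/iobToBilou.py.
def iob_to_bilou(tags):
    """Convert one sentence's IOB tags (B-/I-/O) to BILOU (B-/I-/L-/U-/O)."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag.startswith("I-") and next_tag[2:] == label
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            bilou.append(("I-" if continues else "L-") + label)
    return bilou

print(iob_to_bilou(["B-Chemical", "I-Chemical", "O", "B-Disease"]))
# -> ['B-Chemical', 'L-Chemical', 'O', 'U-Disease']
```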
Download the dataset provided in the data folder (BC5CDR-IOB), place it in any directory you want, and set TRAIN_DATA and VALIDATION_DATA in parameters.py accordingly. Use train-dev.tsv for training and test.tsv for validation. Ktrain can use both validation and train data, or just the train data.
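A rough sketch of loading these files with ktrain's CoNLL-style loader is shown below. It assumes the TSVs follow the token/tag CoNLL-2003 layout; the paths are examples, and the exact loader, arguments, and model construction used in run_ner.py may differ.

```python
import ktrain
from ktrain import text as txt

# Example paths only -- set the real locations via parameters.py as described above.
TRAIN_DATA = 'data/BC5CDR-IOB/train-dev.tsv'
VALIDATION_DATA = 'data/BC5CDR-IOB/test.tsv'

# Parse CoNLL-style token/tag files into train and validation sets plus a preprocessor.
trn, val, preproc = txt.entities_from_conll2003(TRAIN_DATA, val_filepath=VALIDATION_DATA)

# Build a sequence-tagging model. run_ner.py reportedly uses BioBERT
# (monologg/biobert_v1.1_pubmed), whose wiring depends on the ktrain version,
# so this sketch keeps to ktrain's basic BiLSTM-CRF tagger.
model = txt.sequence_tagger('bilstm-crf', preproc)
```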
lr_find() records the loss over a range of learning rates:
def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
stop_factor=4, show_plot=False, verbose=1):
"""
Args:
start_lr (float): smallest lr to start simulation
lr_mult (float): multiplication factor to increase LR.
Ignored if max_epochs is supplied.
max_epochs (int): maximum number of epochs to simulate.
lr_mult is ignored if max_epoch is supplied.
Default is None. Set max_epochs to an integer
(e.g., 5) if lr_find is taking too long
and running for more epochs than desired.
stop_factor(int): factor used to determine the threshold that loss
must exceed to stop training simulation.
Increase this if loss is erratic and lr_find
exits too early.
show_plot (bool): If True, automatically invoke lr_plot
verbose (bool): specifies how much output to print
Returns:
float: Numerical estimate of best lr.
The lr_plot method should be invoked to
identify the maximal loss associated with falling loss.
"""For using lr_find() we need to a learner object; that we can construct it using ktrain.get_learner() function by passing model and data .
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128, eval_batch_size=64)

After trying several learning rates (1e-5, 1e-4, 5e-3, 8e-4), we found that in our case the optimal learning rate is approximately 1e-3.
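For example, the range test can be run on this learner and the result inspected with lr_plot (the argument values here are illustrative):

```python
# Simulate training while increasing the learning rate, then plot loss vs. LR.
learner.lr_find(show_plot=True, max_epochs=5)
learner.lr_plot()  # pick the largest learning rate at which loss is still falling
```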
Use python run_ner.py to train and validate the model.
We got the best result using the SGDR learning rate scheduler on BC5CDR-IOB with lr=1e-3, n_cycles=3, cycle_len=1 and cycle_mult=2. The weights are available in the weights folder.
learner.fit(1e-3, 3, cycle_len=1, cycle_mult=2, checkpoint_folder='/checkpoints/SGDR', early_stopping=3)

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Chemical | 0.91 | 0.91 | 0.91 | 5385 |
| Disease | 0.75 | 0.81 | 0.78 | 4424 |
| micro avg | 0.83 | 0.87 | 0.85 | 9809 |
| macro avg | 0.84 | 0.87 | 0.85 | 9809 |
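For reference, a per-entity report like the table above can typically be printed with the learner's built-in validation; the exact call used in this repository is not shown here, so treat this as an assumption about ktrain's API:

```python
# Prints precision/recall/F1 for each entity type on the validation data.
learner.validate()
```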
We used crawl-300d-2M-subword from the fastText pre-trained word vectors instead of randomly initialized word embeddings, with the same parameters and data as before.
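One way to plug in the pretrained vectors is shown below; the wv_path_or_url argument comes from ktrain's documented sequence_tagger API, but whether it is available under this name in ktrain 0.12.0 is an assumption.

```python
# Swap randomly initialized word embeddings for the fastText crawl-300d-2M-subword vectors.
model = txt.sequence_tagger(
    'bilstm-crf',
    preproc,
    wv_path_or_url='https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.vec.zip',
)
```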
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Disease | 0.76 | 0.79 | 0.77 | 4424 |
| Chemical | 0.91 | 0.89 | 0.90 | 5385 |
| micro avg | 0.84 | 0.85 | 0.84 | 9809 |
| macro avg | 0.84 | 0.85 | 0.85 | 9809 |
In this experiment we used the BC5CDR-BILOU (BILOU-schemed) dataset instead of IOB, with crawl-300d-2M-subword (fastText word vectors) and the same parameters as before.
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Chemical | 0.91 | 0.74 | 0.82 | 5374 |
| Disease | 0.74 | 0.72 | 0.73 | 4397 |
| micro avg | 0.83 | 0.73 | 0.78 | 9771 |
| macro avg | 0.83 | 0.73 | 0.78 | 9771 |

