This repository contains data and a BioBERT-based NER model (monologg/biobert_v1.1_pubmed, a community-uploaded Hugging Face model) for detecting entities such as chemicals and diseases.
Create a Conda environment called "Ktrain_NER" with Python 3.7.0:
conda create -n Ktrain_NER python=3.7.0
Activate the Conda environment:
conda activate Ktrain_NER
Install the required packages:

$ pip install tensorflow==2.1.0
$ pip install torch==1.4.0
$ pip install ktrain==0.12.0

If you want to convert your IOB-schemed data to the BILOU scheme using iobToBilou.py in the utilities folder, install spaCy with the command below:

$ conda install -c conda-forge spacy
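For reference, a minimal sketch of what an IOB-to-BILOU conversion does; this is illustrative only, and the function name is hypothetical rather than the actual contents of iobToBilou.py:

```python
# Illustrative IOB -> BILOU converter; not the actual utilities/iobToBilou.py.
def iob_to_bilou(tags):
    """Convert one sentence's IOB tags (B-/I-/O) to BILOU (B-/I-/L-/U-/O)."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag.startswith("I-") and next_tag[2:] == label
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            bilou.append(("I-" if continues else "L-") + label)
    return bilou

print(iob_to_bilou(["B-Chemical", "I-Chemical", "O", "B-Disease"]))
# -> ['B-Chemical', 'L-Chemical', 'O', 'U-Disease']
```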
Download the dataset provided in the data folder (BC5CDR-IOB), place it in any directory you want, and set TRAIN_DATA and VALIDATION_DATA in parameters.py accordingly. Use train-dev.tsv for training and test.tsv for validation. Ktrain can use both validation and train data, or just the train data.
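A rough sketch of loading these files with ktrain's CoNLL-style loader is shown below. It assumes the TSVs follow the token/tag CoNLL-2003 layout; the paths are examples, and the exact loader, arguments, and model construction used in run_ner.py may differ.

```python
import ktrain
from ktrain import text as txt

# Example paths only -- set the real locations via parameters.py as described above.
TRAIN_DATA = 'data/BC5CDR-IOB/train-dev.tsv'
VALIDATION_DATA = 'data/BC5CDR-IOB/test.tsv'

# Parse CoNLL-style token/tag files into train and validation sets plus a preprocessor.
trn, val, preproc = txt.entities_from_conll2003(TRAIN_DATA, val_filepath=VALIDATION_DATA)

# Build a sequence-tagging model. run_ner.py reportedly uses BioBERT
# (monologg/biobert_v1.1_pubmed), whose wiring depends on the ktrain version,
# so this sketch keeps to ktrain's basic BiLSTM-CRF tagger.
model = txt.sequence_tagger('bilstm-crf', preproc)
```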
lr_find() records the loss over a range of learning rates:
def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
stop_factor=4, show_plot=False, verbose=1):
"""
Args:
start_lr (float): smallest lr to start simulation
lr_mult (float): multiplication factor to increase LR.
Ignored if max_epochs is supplied.
max_epochs (int): maximum number of epochs to simulate.
lr_mult is ignored if max_epoch is supplied.
Default is None. Set max_epochs to an integer
(e.g., 5) if lr_find is taking too long
and running for more epochs than desired.
stop_factor(int): factor used to determine the threshold that loss
must exceed to stop training simulation.
Increase this if loss is erratic and lr_find
exits too early.
show_plot (bool): If True, automatically invoke lr_plot
verbose (bool): specifies how much output to print
Returns:
float: Numerical estimate of best lr.
The lr_plot method should be invoked to
identify the maximal loss associated with falling loss.
"""For using lr_find() we need to a learner object; that we can construct it using ktrain.get_learner() function by passing model and data .
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128, eval_batch_size=64)

After trying several learning rates (1e-5, 1e-4, 5e-3, 8e-4), we found that in our case the optimal learning rate is approximately 1e-3.
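For example, the range test can be run on this learner and the result inspected with lr_plot (the argument values here are illustrative):

```python
# Simulate training while increasing the learning rate, then plot loss vs. LR.
learner.lr_find(show_plot=True, max_epochs=5)
learner.lr_plot()  # pick the largest learning rate at which loss is still falling
```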
Use python run_ner.py to train and validate the model.
We got the best result using the SGDR learning rate scheduler on BC5CDR-IOB with lr=1e-3, n_cycles=3, cycle_len=1 and cycle_mult=2. The weights are available in the weights folder.
learner.fit(1e-3, 3, cycle_len=1, cycle_mult=2, checkpoint_folder='/checkpoints/SGDR', early_stopping=3)

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Chemical | 0.91 | 0.91 | 0.91 | 5385 |
| Disease | 0.75 | 0.81 | 0.78 | 4424 |
| micro avg | 0.83 | 0.87 | 0.85 | 9809 |
| macro avg | 0.84 | 0.87 | 0.85 | 9809 |
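For reference, a per-entity report like the table above can typically be printed with the learner's built-in validation; the exact call used in this repository is not shown here, so treat this as an assumption about ktrain's API:

```python
# Prints precision/recall/F1 for each entity type on the validation data.
learner.validate()
```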
We used crawl-300d-2M-subword from the fastText pre-trained word vectors instead of randomly initialized word embeddings, with the same parameters and data as before.
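One way to plug in the pretrained vectors is shown below; the wv_path_or_url argument comes from ktrain's documented sequence_tagger API, but whether it is available under this name in ktrain 0.12.0 is an assumption.

```python
# Swap randomly initialized word embeddings for the fastText crawl-300d-2M-subword vectors.
model = txt.sequence_tagger(
    'bilstm-crf',
    preproc,
    wv_path_or_url='https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.vec.zip',
)
```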
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Disease | 0.76 | 0.79 | 0.77 | 4424 |
| Chemical | 0.91 | 0.89 | 0.90 | 5385 |
| micro avg | 0.84 | 0.85 | 0.84 | 9809 |
| macro avg | 0.84 | 0.85 | 0.85 | 9809 |
In this experiment we used the BC5CDR-BILOU (BILOU-schemed) dataset instead of IOB, with crawl-300d-2M-subword (fastText word vectors) and the same parameters as before.
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| Chemical | 0.91 | 0.74 | 0.82 | 5374 |
| Disease | 0.74 | 0.72 | 0.73 | 4397 |
| micro avg | 0.83 | 0.73 | 0.78 | 9771 |
| macro avg | 0.83 | 0.73 | 0.78 | 9771 |

