Lexicon Induction for isiXhosa Medical Translation

This honours project adapts the work of Hu et al. in the paper "Domain Adaptation of Neural Machine Translation by Lexicon Induction" (DALI) to isiXhosa medical translation.

Installation

To create an Anaconda environment with all the necessary dependencies, run the following command:

conda env create -f environment.yaml

Then activate the environment using:

conda activate medtranslate-dali

Generating the Synthetic Data

Run fast_align on the large general-domain parallel corpus, following the instructions in the fast_align repository.
Using the fast_align output, generate the seed lexicon with build_lexicon.py, for example:

python build_lexicon.py \
    --train_data corpus.txt \
    --aligned_file corpus.fwd_align \
    --output_file lexicon.txt
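
As a rough illustration of what lexicon extraction from alignments involves, the sketch below counts aligned word pairs and keeps the most frequent target word for each source word. It assumes the standard fast_align formats (one `source ||| target` sentence pair per line, and one line of `i-j` index pairs per sentence). The function name, the `min_count` parameter, and the toy sentences are illustrative assumptions; build_lexicon.py may work differently.

```python
from collections import Counter, defaultdict


def build_seed_lexicon(parallel_lines, alignment_lines, min_count=1):
    """Map each source word to its most frequently aligned target word.

    Hypothetical helper, not the repository's build_lexicon.py.
    """
    pair_counts = defaultdict(Counter)
    for sent_pair, links in zip(parallel_lines, alignment_lines):
        # fast_align input format: "source tokens ||| target tokens"
        src_text, tgt_text = sent_pair.strip().split(" ||| ")
        src_tokens, tgt_tokens = src_text.split(), tgt_text.split()
        # fast_align output format: "i-j" pairs (source index - target index)
        for link in links.split():
            i, j = map(int, link.split("-"))
            pair_counts[src_tokens[i]][tgt_tokens[j]] += 1

    lexicon = {}
    for src_word, counts in pair_counts.items():
        tgt_word, count = counts.most_common(1)[0]
        if count >= min_count:  # drop rarely observed pairs
            lexicon[src_word] = tgt_word
    return lexicon


# Toy data for illustration only.
corpus = [
    "the doctor ||| ugqirha",
    "the nurse ||| umongikazi",
    "the doctor helps ||| ugqirha uyanceda",
]
alignments = ["1-0", "1-0", "1-0 2-1"]
seed_lexicon = build_seed_lexicon(corpus, alignments)
```

Raising `min_count` trades coverage for precision, since word pairs seen only once are more likely to be alignment noise.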

Add the small in-domain lexicon to the seed lexicon to obtain the final in-domain bilingual lexicon.
Using the word-for-word back-translation script in the DALI repository, back-translate your English in-domain monolingual corpus into isiXhosa:

python wfw_backtranslation.py \
    --lexicon_infile lexicon.txt \
    --tgt_infile monolingual_corpus.en \
    --src_outfile synthetic_corpus.xh
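
The two steps above (merging the lexicons, then translating word for word) can be sketched as follows. This is a minimal illustration, not the DALI script itself: the function name is hypothetical, the lexicons are toy dictionaries, and it assumes that in-domain entries override seed entries and that words missing from the lexicon are copied unchanged.

```python
def word_for_word_translate(sentence, lexicon):
    """Replace each token with its lexicon entry; copy unknown tokens as-is.

    Hypothetical helper, not wfw_backtranslation.py.
    """
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())


# Toy lexicons for illustration only.
seed_lexicon = {"doctor": "ugqirha", "nurse": "umongikazi"}
in_domain_lexicon = {"hospital": "isibhedlele"}

# Later dictionaries win on key clashes, so in-domain entries take priority.
final_lexicon = {**seed_lexicon, **in_domain_lexicon}

synthetic = word_for_word_translate("doctor hospital aspirin", final_lexicon)
```

The output is not fluent isiXhosa, since word order and morphology are ignored. It serves only as synthetic source-side text paired with the real English sentences.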

Training

Choose a translation direction and run the bash script for that direction to start a hyperparameter sweep.
Alternatively, use the Python training script to train a single model at a time, for example:

python train_en_to_xh.py \
        --base_model_path nllb-200/ \
        --data_dir data-bin/ \
        --output_dir finetuned_models/ \
        --learning_rate 5e-6 \
        --batch_size 4 \
        --num_epochs 5 \
        --gradient_accumulation_steps 8 \
        --warmup_ratio 0.1 \
        --weight_decay 0.01 \
        --label_smoothing 0.1
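
To see how the batch size, gradient accumulation, and warmup ratio interact, the sketch below computes the effective batch size and the approximate number of optimizer and warmup steps. It assumes single-device training and a warmup length of `ceil(total_steps * warmup_ratio)`; the training script may compute these differently, and the corpus size of 10,000 sentence pairs is a made-up figure.

```python
import math


def schedule_summary(num_examples, batch_size, grad_accum_steps,
                     num_epochs, warmup_ratio):
    """Estimate optimizer-step counts for one training run (single device)."""
    # Gradients from several small batches are summed before each update.
    effective_batch = batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values from the example command, with a hypothetical corpus size.
summary = schedule_summary(
    num_examples=10_000,
    batch_size=4,
    grad_accum_steps=8,
    num_epochs=5,
    warmup_ratio=0.1,
)
```

With the flags shown in the command, each optimizer update sees 4 × 8 = 32 sentence pairs.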

Evaluation

BLEU, chrF and chrF++ Scores

If the hyperparameter sweep was run, run eng_to_xho_eval.sh or xho_to_eng_eval.sh (whichever matches the translation direction) to evaluate all of the models. To evaluate an individual model:

Generate translations of the evaluation set using the model:

python generate_translations.py \
        --model_path finetuned_models/model1 \
        --source_file dev.en \
        --output_file model1_predictions.xh \
        --source_lang "eng_Latn" \
        --target_lang "xho_Latn"

Generate the BLEU, chrF and chrF++ scores:

python verify_scores.py \
        --predictions model1_predictions.xh \
        --references dev.xh
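
For intuition about what the character-level scores measure, here is a simplified, sentence-level character n-gram F-score in the spirit of chrF. It is a sketch for illustration and will not reproduce the numbers from verify_scores.py or any standard implementation: it ignores whitespace, averages precision and recall over n-gram orders 1 to 6, and combines them with beta = 2. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


exact = simple_chrf("ugqirha uyanceda", "ugqirha uyanceda")
partial = simple_chrf("ugqirha", "ugqirha uyanceda")
```

Character-level matching gives partial credit when a word shares a stem with the reference but differs in its affixes, which is one reason such metrics are often reported for morphologically rich languages.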

Health Term Error Rate

To calculate the health term error rate, run evaluate_terminology.py with the following command:

python evaluate_terminology.py \
    --predictions model1_predictions.xh \
    --references dev.xh \
    --terms_csv medical_terms.csv \
    --direction en_to_xh
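
The exact definition used by evaluate_terminology.py is not stated here, so the sketch below shows one plausible formulation under explicit assumptions: a term counts as expected whenever its target-language form appears in the reference sentence, and as an error whenever it is then missing from the prediction. The function name, the substring matching, and the toy data are all assumptions made for illustration.

```python
def term_error_rate(predictions, references, target_terms):
    """Fraction of expected target-language terms missing from predictions.

    Hypothetical formulation, not necessarily that of evaluate_terminology.py.
    """
    expected = missed = 0
    for pred, ref in zip(predictions, references):
        pred_l, ref_l = pred.lower(), ref.lower()
        for term in target_terms:
            term_l = term.lower()
            if term_l in ref_l:          # the reference uses this term
                expected += 1
                if term_l not in pred_l:  # the model failed to produce it
                    missed += 1
    return missed / expected if expected else 0.0


# Toy data for illustration only.
references = ["ugqirha uyanceda", "umongikazi uyanceda"]
predictions = ["ugqirha uyanceda", "umntu uyanceda"]
terms = ["ugqirha", "umongikazi"]

rate = term_error_rate(predictions, references, terms)
```

Plain substring matching can miss inflected forms, because isiXhosa prefixes and suffixes change the surface form of a term. A real implementation would likely need some morphological normalisation.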

Repository: ElijahSherman/MedTranslate_DALI