This project performs forced alignment using the Montreal Forced Aligner (MFA) on an audio dataset. The goal is to generate time-aligned phone- and word-level transcriptions from audio files and their corresponding text transcripts.
- Conda environment (version used: 24.9.2)
- Instructions to install MFA.
- Create conda environment:
cd forced_alignment
mkdir envs
conda create --prefix envs/mfa_env python=3.11
conda activate envs/mfa_env
conda install -c conda-forge montreal-forced-aligner
- Used the latest montreal-forced-aligner (version: 3.3.8).
- Place your audio and text files in data/, such that the basenames of corresponding files are identical except for the extension:
data
|-- /example_corpus
| |-- F2BJRLP1.wav
| |-- F2BJRLP1.txt
| |-- F2BJRLP2.wav
| |-- F2BJRLP2.txt
| |-- F2BJRLP3.wav
| |-- F2BJRLP3.txt
...
- Note:
*.wav: Audio files.
*.txt: Corresponding text transcriptions.
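Since MFA pairs audio and transcripts by basename, mismatched pairs can be caught before alignment with a small check like the one below (a sketch; `check_corpus_pairs` and the demo file names are illustrative, not part of this repository):

```python
import os
import tempfile

def check_corpus_pairs(corpus_dir):
    """Return stems of .wav files missing a .txt, and .txt files missing a .wav."""
    names = os.listdir(corpus_dir)
    wavs = {os.path.splitext(n)[0] for n in names if n.endswith(".wav")}
    txts = {os.path.splitext(n)[0] for n in names if n.endswith(".txt")}
    return sorted(wavs - txts), sorted(txts - wavs)

# Demo on a throwaway corpus; F2BJRLP2.wav deliberately lacks a transcript.
demo = tempfile.mkdtemp()
for name in ("F2BJRLP1.wav", "F2BJRLP1.txt", "F2BJRLP2.wav"):
    open(os.path.join(demo, name), "w").close()

missing_txt, missing_wav = check_corpus_pairs(demo)
print(missing_txt, missing_wav)  # ['F2BJRLP2'] []
```

In practice the function would be pointed at data/example_corpus instead of the demo directory.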
- Decide the top dictionaries based on the number of OOV (out-of-vocabulary) words.
- Usage:
python src/validate_dicts.py --input_dir <input_dir> --out_dir <out_dir> --lang <lang>
- Example:
python src/validate_dicts.py --input_dir "data/example_corpus" --out_dir "egs/validation_output" --lang "english"
- Perform the forced alignment on the audio and transcription files using the top dictionaries obtained from validation.
- Usage:
python src/align_data.py --input_dir <input_dir> --oov_summary_file <oov_summary_file> --lang <lang> --top_k <top_k>
- Example:
python src/align_data.py --input_dir "data/example_corpus" --oov_summary_file "egs/validation_output/validation_oov_summary.txt" --lang "english" --top_k 3
- Perform validation and alignment using a single bash script.
- Example:
bash egs/align_data.sh
- As the corpus grows, the number of out-of-vocabulary (OOV) words also increases.
- The phoneme token assigned to OOV words in the alignments is spn.
- Using a pre-trained G2P (grapheme-to-phoneme) model, we can estimate pronunciations for those OOVs.
- The G2P-generated dictionary entries for the OOVs can be combined with the pre-trained dictionary to perform the alignments.
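The combination step can be sketched as a simple deduplicating merge, assuming the plain MFA dictionary layout of one `WORD PHONE1 PHONE2 ...` entry per line (the `merge_dictionaries` helper and the sample pronunciations below are illustrative, not part of this repository):

```python
import os
import tempfile

def merge_dictionaries(base_path, g2p_path, out_path):
    """Concatenate two pronunciation dictionaries, dropping duplicate entries.

    Assumes the plain MFA layout: one 'WORD PHONE1 PHONE2 ...' entry per line.
    """
    seen, merged = set(), []
    for path in (base_path, g2p_path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                entry = line.strip()
                if entry and entry not in seen:
                    seen.add(entry)
                    merged.append(entry)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
    return merged

# Demo with tiny throwaway dictionaries (ARPABET pronunciations are made up).
tmp = tempfile.mkdtemp()
base = os.path.join(tmp, "base.dict")
g2p = os.path.join(tmp, "g2p_oov.dict")
with open(base, "w") as f:
    f.write("the\tDH AH0\ncat\tK AE1 T\n")
with open(g2p, "w") as f:
    f.write("cat\tK AE1 T\nzyzzyva\tZ IH1 Z IH0 V AH0\n")

combined = merge_dictionaries(base, g2p, os.path.join(tmp, "combined.dict"))
print(len(combined))  # 3 unique entries
```

src/train_dict.py below automates this workflow end to end; the sketch only shows the merge idea.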
- Usage:
python src/train_dict.py --txt_dir <txt_dir> --corpus_dir <corpus_dir> --base_dictionary <base_dictionary> --g2p_model <g2p_model> --acoustic_model <acoustic_model> --output_dir <output_dir>
- Example:
python src/train_dict.py --txt_dir "data/txt_corpus" --corpus_dir "data/example_corpus" --base_dictionary "dictionaries/english_us_arpa.dict" --g2p_model "english_us_arpa" --acoustic_model "english_us_arpa" --output_dir "egs/custom_dict/aligned_data"
- Note:
dictionaries/english_us_arpa.dict is the pre-trained dictionary, which can be downloaded from english_us_arpa.dict
- A short PDF is attached in docs/ with relevant screenshots showing that the G2P-based custom dictionary helps handle the OOVs, and that the ARPABET phoneme dictionary provides better alignments than the other dictionaries for the example corpus used.
- Output TextGrid files are visualized using the Praat software, available at PRAAT FOR LINUX
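Besides opening the TextGrids in Praat, the aligned intervals can also be read programmatically. A minimal sketch for the long TextGrid format is shown below; the regex approach and the embedded sample (with made-up timings) are illustrative only, and a real pipeline would more likely use a dedicated library such as praatio or textgrid:

```python
import re

# Minimal long-format TextGrid excerpt (timings and labels are made up).
sample = '''File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 1.0
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.0
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.42
            text = "hello"
        intervals [2]:
            xmin = 0.42
            xmax = 1.0
            text = "world"
'''

# Each interval block is xmin / xmax / text in sequence; the file and tier
# headers also carry xmin/xmax, but they are never followed by a text field,
# so the contiguous pattern below matches interval blocks only.
pattern = r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"'
intervals = [(float(a), float(b), t) for a, b, t in re.findall(pattern, sample)]
print(intervals)  # [(0.0, 0.42, 'hello'), (0.42, 1.0, 'world')]
```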