The goal is to automatically align speech audio with text transcripts at the word and phoneme level using the Montreal Forced Aligner (MFA).
# 1️⃣ Create and activate environment
conda create -n mfa_env -c conda-forge montreal-forced-aligner -y
conda activate mfa_env
# 2️⃣ Download models
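# Note: the dictionary and acoustic model should share a phone set. english_us_arpa uses ARPABET,
# while english_mfa uses the IPA-based MFA phone set, so confirm this pairing with mfa validate.
mfa model download dictionary english_us_arpa
mfa model download acoustic english_mfa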
# 3️⃣ Prepare dataset
# Ensure data/ready_corpus contains matching .wav and .txt pairs (each transcript shares its audio file's base name)
# 4️⃣ Validate
mfa validate data/ready_corpus english_us_arpa english_mfa
# 5️⃣ Align
mfa align data/ready_corpus english_us_arpa english_mfa outputs/aligned
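The pairing required in step 3 can be checked before running `mfa validate`. The following is a minimal sketch, not part of MFA itself: the function name `find_unpaired_wavs` is hypothetical, and it assumes a flat corpus directory where each `.wav` has a `.txt` transcript with the same base name. Corpora organized into speaker subfolders would need a recursive variant.

```shell
# Print every .wav in a directory that has no matching .txt transcript.
# Assumption: flat directory layout, transcripts named <basename>.txt.
find_unpaired_wavs() {
  corpus_dir="$1"
  for wav in "$corpus_dir"/*.wav; do
    # Skip the literal pattern when the directory holds no .wav files.
    [ -e "$wav" ] || continue
    txt="${wav%.wav}.txt"
    [ -f "$txt" ] || printf '%s\n' "$wav"
  done
}
```

Running `find_unpaired_wavs data/ready_corpus` prints nothing when every audio file has a transcript.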
📊 Outputs
Alignment Files → outputs/aligned/*.TextGrid
Alignment Report → outputs/aligned/alignment_analysis.csv
Each .TextGrid contains:
Word tier (words) → start and end timestamps for each word
Phone tier (phones) → start and end timestamps for each phoneme
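The tiers described above can also be read without Praat. The sketch below lists the non-empty intervals of one tier as tab-separated start, end, and label. It assumes the TextGrid is in Praat's long text format and that the tiers are named `words` and `phones` with no spaces in the name. The function name `textgrid_intervals` is hypothetical.

```shell
# Print "start<TAB>end<TAB>label" for each non-empty interval of one tier.
# Assumptions: long-format TextGrid, tier name without spaces.
textgrid_intervals() {
  # $1 = path to the .TextGrid file, $2 = tier name (e.g. words or phones)
  awk -v tier="$2" '
    # A "name = ..." line starts a new tier; remember which tier we are in.
    $1 == "name" && $2 == "=" { gsub(/"/, "", $3); current = $3; in_interval = 0; next }
    # "intervals [n]:" starts one interval ("intervals: size = n" does not match).
    $1 == "intervals" && $2 ~ /^\[/ { in_interval = 1; next }
    in_interval && current == tier && $1 == "xmin" { start = $3; next }
    in_interval && current == tier && $1 == "xmax" { end = $3; next }
    in_interval && current == tier && $1 == "text" {
      label = $0
      sub(/^[^"]*"/, "", label)          # drop everything up to the opening quote
      sub(/"[[:space:]]*$/, "", label)   # drop the closing quote
      if (label != "") printf "%s\t%s\t%s\n", start, end, label
    }
  ' "$1"
}
```

For example, `textgrid_intervals outputs/aligned/F2BJ_RLP1.TextGrid words` would list the word intervals for that file, skipping silences, which appear as empty labels.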
🔍 Visualization
Open in Praat:
Open → Read from file → F2BJ_RLP1.wav
Open → Read from file → F2BJ_RLP1.TextGrid
Select both → View & Edit
🧠 Observations
Word and phone boundaries aligned accurately.
Minor timing deviations in fast speech segments.
english_us_arpa dictionary and english_mfa acoustic model performed well.
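These observations could be cross-checked against outputs/aligned/alignment_analysis.csv. The sketch below makes no assumption about the report's column names: one helper lists the header columns with their positions, and the other sorts the rows by a numeric column chosen by index. Both function names are hypothetical, and the simple comma splitting assumes no quoted fields containing commas.

```shell
# Print each column name of a CSV header with its 1-based index.
csv_columns() {
  head -n 1 "$1" | tr ',' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# Print the header, then the data rows sorted ascending by a numeric column.
sort_csv_by_column() {
  # $1 = CSV path, $2 = 1-based column index
  head -n 1 "$1"
  tail -n +2 "$1" | sort -t ',' -k "${2},${2}n"
}
```

A possible workflow is to run `csv_columns outputs/aligned/alignment_analysis.csv` to find a metric of interest, then pass its index to `sort_csv_by_column` so the utterances with the most extreme values appear at the top or bottom for inspection in Praat.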