This repository contains a recipe for training FastSpeech2, a state-of-the-art text-to-speech (TTS) model, with Hybrid Segmentation (HS). The training uses the ESPnet toolkit and is tailored to Indian languages, covering 13 major languages of India.
For finer details and comprehensive information, please refer to the Wiki section of the repository.
We provide the training text and duration info for each language. Users can also use their own data and generate their own duration info; see below:
Download the language data from the IIT Madras TTS Database, which includes a special corpus of 13 Indian languages. The dataset comprises 10,000+ spoken sentences/utterances, in both mono (native-language) and English, recorded by male and female native speakers. The speech waveforms are available as .wav files, accompanied by the corresponding text.
In FastSpeech2, a neural TTS model, the duration file represents the durations of the phonemes in the input text. During training, FastSpeech2 learns to predict these durations as part of the overall sequence-to-sequence training.
We use Hybrid Segmentation (a lab-grown, in-house aligner) to generate the duration files. Another popular forced aligner is the Montreal Forced Aligner (MFA); refer to its GitHub repository for installation instructions.
We have already provided the training text and the respective duration info for each language model.
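For illustration only, duration info of this kind typically takes the form of one line per utterance: an utterance ID followed by the number of frames assigned to each input token. The file name, utterance IDs, and values below are hypothetical; check the provided duration_info files for the exact layout used in this recipe.

```bash
# Hypothetical peek at a duration file (names and numbers are illustrative):
# each line is an utterance ID followed by per-token frame counts.
head -n 2 duration_info/durations
# train_hindi_female_00001 11 7 3 14 6 9 4
# train_hindi_female_00002 8 6 12 5 7 10 3
```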
- Install the ESPnet toolkit.
- After installation, update the ESPnet path and the Kaldi path in `path.sh`. Please follow the Wiki section for more details on each file.
- In `local/data.sh`, adjust the dev and eval set divisions (line numbers 79-82) according to the data used for training the model.
- Modify the `run.sh` file:
  - Adjust for 48 kHz waveforms if needed (double the values of `fs`, `n_fft`, and `n_shift`).
  - Make any other changes to the script according to your requirements.
- Check for mismatches between the number of frames in each wave file and the corresponding duration file, and remove the files flagged in the output (make the relevant changes to the script before running it): `perl check_mismatch_across_durationFile_espnet.pl`
- Make the corresponding changes to the duration_info folder (see the Wiki section).
- Update configurations in `tts.sh` where necessary. (Important: add the duration file and point the `teacher_dumpdir` variable to the duration_info path.)
- To check GPU availability: `nvidia-smi`
- Run the training script: `bash run.sh`
  - Note 1: Try to execute the script stage by stage, as mentioned in `tts.sh` (usually line numbers 29-30), as this helps in locating errors (see the sketch after this list).
  - Note 2: Run the training inside the `screen` utility of Linux.
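As a rough sketch of the two notes above, the commands below start a `screen` session, check the GPU, and step through the recipe stage by stage. The stage numbers and session name are placeholders, and passing `--stage`/`--stop_stage` through `run.sh` assumes it forwards its arguments to `tts.sh`, as standard ESPnet2 recipes do; verify against your copy of the scripts.

```bash
# Illustrative, interactive workflow (not a script to run top to bottom):
screen -S fs2_train                     # open a screen session; reattach later with: screen -r fs2_train
nvidia-smi                              # confirm a GPU is free
bash run.sh --stage 1 --stop_stage 1    # run one stage at a time to localize errors
bash run.sh --stage 2 --stop_stage 7    # then continue through training (up to stage 7)
```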
After training is complete (through stage 7), follow the steps below to set up the environment and perform text synthesis.
- Create a "model" folder: `mkdir model`
- Copy the following files to the "model" folder (see the sketch after this list):
  - `dump/raw/eval1/feats_type` (or use the one from train/validation)
  - `exp/tts_stats_raw_char_None/train/feats_stats.npz`
  - `exp/tts_train_raw_char_None/train.loss.ave.pth` (you can try other models as well; modify the synthesis scripts accordingly)
  - `exp/tts_train_raw_char_None/config.yaml` (modify `stats_file` to `model/feats_stats.npz`, or provide the full path to the "model" folder)
- Create a "test_folder" containing the text for synthesis in Kaldi format, following the pattern used during training.
- Prepare `run_synthesis.sh` and `tts_synthesis.sh`. In `run_synthesis.sh`, set the paths to `test_folder` and `model_path`, and modify `$inference_config`.
- Ensure that the feature extraction part in `tts_synthesis.sh` matches the configuration used during training.
- Run the following command to synthesize text: `bash run_synthesis.sh`. Alternatively, you can use the inferencing file available at Fastspeech2 Inferencing. (The output files will be located in `exp/tts_train_raw_char_None/decode_train.loss.ave/test_folder`.)
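Putting the steps above together, one possible layout is sketched below. The utterance IDs and placeholder sentences are hypothetical, and the `test_folder/text` file assumes the usual Kaldi convention of one `utt_id <sentence>` pair per line; follow whatever pattern your training data used.

```bash
# Sketch only: gather the trained-model artifacts and prepare a test_folder.
mkdir -p model
cp dump/raw/eval1/feats_type model/
cp exp/tts_stats_raw_char_None/train/feats_stats.npz model/
cp exp/tts_train_raw_char_None/train.loss.ave.pth model/
cp exp/tts_train_raw_char_None/config.yaml model/
# Remember to edit stats_file in the copied config.yaml so that it points to
# model/feats_stats.npz (or its full path).

mkdir -p test_folder
# Kaldi-style text file: one "utt_id <sentence>" per line, matching the
# pattern used during training (IDs and sentences below are placeholders).
cat > test_folder/text << 'EOF'
test_0001 <first sentence to synthesize, in the training language/script>
test_0002 <second sentence>
EOF

bash run_synthesis.sh   # outputs appear under exp/tts_train_raw_char_None/decode_train.loss.ave/test_folder
```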
For more detailed information, troubleshooting, and tips, please consult the Wiki section of the repository.
Happy training!
If you use this FastSpeech2 model in your research or work, please consider citing:
“COPYRIGHT 2023, Speech Technology Consortium,
Bhashini, MeiTY and by Hema A Murthy & S Umesh,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING and ELECTRICAL ENGINEERING,
IIT MADRAS. ALL RIGHTS RESERVED”
This work is licensed under a Creative Commons Attribution 4.0 International License.
