This repository contains a recipe for training FastSpeech2, a state-of-the-art text-to-speech (TTS) model, with Hybrid Segmentation (HS). The training uses the ESPnet toolkit and is tailored to Indian languages, covering 13 major languages of India.
For finer details and comprehensive information, please refer to the Wiki section of the repository.
We provide the training text and duration info for each language. Users can also use their own data and generate their own duration info; see below:
Download the language data from the IIT Madras TTS Database, which includes a special corpus of 13 Indian languages. The dataset comprises 10,000+ spoken sentences/utterances, in both mono (native-language) and English, recorded by male and female native speakers. The speech waveforms are available as .wav files, accompanied by the corresponding text.
In FastSpeech2, a neural TTS model, the duration file represents the durations of the phonemes in the input text. During training, FastSpeech2 learns to predict these durations as part of the overall sequence-to-sequence training.
We use Hybrid Segmentation (a lab-grown, in-house aligner) to generate the duration files. Another popular forced aligner is the Montreal Forced Aligner (MFA); refer to its GitHub repository for installation instructions.
We have already provided the training text and the respective duration info for each language model.
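For illustration only, duration info of this kind typically takes the form of one line per utterance: an utterance ID followed by the number of frames assigned to each input token. The file name, utterance IDs, and values below are hypothetical; check the provided duration_info files for the exact layout used in this recipe.

```bash
# Hypothetical peek at a duration file (names and numbers are illustrative):
# each line is an utterance ID followed by per-token frame counts.
head -n 2 duration_info/durations
# train_hindi_female_00001 11 7 3 14 6 9 4
# train_hindi_female_00002 8 6 12 5 7 10 3
```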
- Install the ESPnet toolkit.
- After installation, update the ESPnet path and the Kaldi path in `path.sh`. Please follow the Wiki section for more details on each file.
- In `local/data.sh`, adjust the dev and eval set divisions (line numbers 79-82) according to the data used for training the model.
- Modify the `run.sh` file:
  - Adjust for 48 kHz waveforms if needed (double the values of `fs`, `n_fft`, and `n_shift`).
  - Make any other changes to the script according to your requirements.
- Check for mismatches between the number of frames in each wave file and the corresponding duration file, and remove the files flagged in the output (make the relevant changes to the script before running it): `perl check_mismatch_across_durationFile_espnet.pl`
- Make the corresponding changes to the duration_info folder (see the Wiki section).
- Update configurations in `tts.sh` where necessary. (Important: add the duration file and point the `teacher_dumpdir` variable to the duration_info path.)
- To check GPU availability: `nvidia-smi`
- Run the training script: `bash run.sh`
  - Note 1: Try to execute the script stage by stage, as mentioned in `tts.sh` (usually line numbers 29-30), as this helps in locating errors (see the sketch after this list).
  - Note 2: Run the training inside the `screen` utility of Linux.
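As a rough sketch of the two notes above, the commands below start a `screen` session, check the GPU, and step through the recipe stage by stage. The stage numbers and session name are placeholders, and passing `--stage`/`--stop_stage` through `run.sh` assumes it forwards its arguments to `tts.sh`, as standard ESPnet2 recipes do; verify against your copy of the scripts.

```bash
# Illustrative, interactive workflow (not a script to run top to bottom):
screen -S fs2_train                     # open a screen session; reattach later with: screen -r fs2_train
nvidia-smi                              # confirm a GPU is free
bash run.sh --stage 1 --stop_stage 1    # run one stage at a time to localize errors
bash run.sh --stage 2 --stop_stage 7    # then continue through training (up to stage 7)
```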
After training is complete (through stage 7), follow the steps below to set up the environment and perform text synthesis.
- Create a "model" folder: `mkdir model`
- Copy the following files to the "model" folder (see the sketch after this list):
  - `dump/raw/eval1/feats_type` (or use the one from train/validation)
  - `exp/tts_stats_raw_char_None/train/feats_stats.npz`
  - `exp/tts_train_raw_char_None/train.loss.ave.pth` (you can try other models as well; modify the synthesis scripts accordingly)
  - `exp/tts_train_raw_char_None/config.yaml` (modify `stats_file` to `model/feats_stats.npz`, or provide the full path to the "model" folder)
- Create a "test_folder" containing the text for synthesis in Kaldi format, following the pattern used during training.
- Prepare `run_synthesis.sh` and `tts_synthesis.sh`. In `run_synthesis.sh`, set the paths to `test_folder` and `model_path`, and modify `$inference_config`.
- Ensure that the feature extraction part in `tts_synthesis.sh` matches the configuration used during training.
- Run the following command to synthesize text: `bash run_synthesis.sh`. Alternatively, you can use the inferencing file available at Fastspeech2 Inferencing. (The output files will be located in `exp/tts_train_raw_char_None/decode_train.loss.ave/test_folder`.)
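Putting the steps above together, one possible layout is sketched below. The utterance IDs and placeholder sentences are hypothetical, and the `test_folder/text` file assumes the usual Kaldi convention of one `utt_id <sentence>` pair per line; follow whatever pattern your training data used.

```bash
# Sketch only: gather the trained-model artifacts and prepare a test_folder.
mkdir -p model
cp dump/raw/eval1/feats_type model/
cp exp/tts_stats_raw_char_None/train/feats_stats.npz model/
cp exp/tts_train_raw_char_None/train.loss.ave.pth model/
cp exp/tts_train_raw_char_None/config.yaml model/
# Remember to edit stats_file in the copied config.yaml so that it points to
# model/feats_stats.npz (or its full path).

mkdir -p test_folder
# Kaldi-style text file: one "utt_id <sentence>" per line, matching the
# pattern used during training (IDs and sentences below are placeholders).
cat > test_folder/text << 'EOF'
test_0001 <first sentence to synthesize, in the training language/script>
test_0002 <second sentence>
EOF

bash run_synthesis.sh   # outputs appear under exp/tts_train_raw_char_None/decode_train.loss.ave/test_folder
```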
For more detailed information, troubleshooting, and tips, please consult the Wiki section of the repository.
Happy training!
If you use this FastSpeech2 model in your research or work, please consider citing:
“COPYRIGHT 2023, Speech Technology Consortium,
Bhashini, MeiTY and by Hema A Murthy & S Umesh,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING and ELECTRICAL ENGINEERING,
IIT MADRAS. ALL RIGHTS RESERVED”
This work is licensed under a Creative Commons Attribution 4.0 International License.
