This repository now fully supports the fine-tuning of the Chatterbox Turbo model!
- What is it? A faster, GPT-2 based architecture with a strong English foundation.
- Smart Multi-Language Support: The setup script automatically merges Turbo's large English vocabulary with our custom 23-language grapheme set.
- The Result: Get the speed and quality of Turbo while seamlessly fine-tuning on new languages like Turkish, French, Spanish, and more.
Read the Standard vs. Turbo Modes section below for details!
A modular infrastructure for fine-tuning both Chatterbox TTS (Standard) and Chatterbox Turbo models with your own dataset and generating high-quality speech synthesis.
This kit is specially designed to support new languages by intelligently extending the model's vocabulary for maximum performance and faster adaptation.
This repository operates in two distinct modes, controlled by the is_turbo setting in src/config.py. Please decide which mode you need before you begin.
- Architecture: Llama-based.
- Tokenizer: Grapheme (character) based. The tokenizer.json downloaded by setup.py contains a small, efficient vocabulary (~2,454 tokens) covering 23 languages.
- Best for: Training a model with full control over a specific language from a more fundamental level.
- Architecture: GPT-2 based.
- Tokenizer: BPE-based. It starts with a large, powerful English vocabulary (~50,000+ tokens).
- Smart Merging: When you run setup.py, this large vocabulary is automatically extended with our multi-language grapheme set.
- Best for: Leveraging a strong English base for faster, high-quality fine-tuning on other languages.
If you plan to switch between Standard Mode (is_turbo = False) and Turbo Mode (is_turbo = True), you MUST completely delete both the pretrained_models directory and the preprocessed_dir directory (created when preprocess = True) before running setup.py again.
The setup script replaces the token files in place. If you run the setup for Standard mode after setting up for Turbo (or vice versa), the token files will become corrupted and cause errors that are difficult to debug during training.
Correct Workflow for Changing Modes:
- DELETE the entire pretrained_models folder.
# On Linux or macOS
rm -rf pretrained_models
# On Windows (in Command Prompt)
rmdir /s /q pretrained_models
- Update the src/config.py file, setting the is_turbo flag to your desired new mode and setting preprocess = True if it is False.
- RUN setup.py again to download and prepare the correct files for the new mode.
python setup.py
- Update the new_vocab_size value in the src/config.py file with the new value provided by the setup script. Also ensure preprocess = True.
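The cleanup step can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `reset_mode_artifacts` is hypothetical, and you would pass it `pretrained_models` plus whatever preprocessed-data directory your own `src/config.py` defines.

```python
import shutil
from pathlib import Path


def reset_mode_artifacts(directories):
    """Delete each directory in `directories` if it exists.

    Returns the list of paths that were actually removed, so the caller
    can confirm what was cleaned up before re-running setup.py.
    """
    removed = []
    for name in directories:
        path = Path(name)
        if path.is_dir():
            shutil.rmtree(path)  # recursive delete, comparable to `rm -rf`
            removed.append(str(path))
    return removed


# Example call (the second name is a placeholder for your preprocessed directory):
# reset_mode_artifacts(["pretrained_models", "path/to/preprocessed_dir"])
```

Directories that do not exist are skipped silently, so the call is safe to repeat.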
This repository uses an offline preprocessing strategy to maximize training speed. The preprocessing script processes all audio files once, extracts speaker embeddings and acoustic tokens, and saves them as .pt files.
Chatterbox uses a grapheme-based (character-level) tokenizer. The tokenizer.json file downloaded by setup.py includes support for 23 languages from the original Chatterbox repository, covering most common characters across multiple languages.
- Default Support: The provided tokenizer already includes characters for English, Turkish, French, German, Spanish, and 18+ other languages
- When to customize: If your target language has special characters not covered in the default tokenizer, you can create a custom tokenizer.json
- Examples of special characters by language:
  - Turkish: ç, ğ, ş, ö, ü, ı
  - French: é, è, ê, à, ù, ç
  - German: ä, ö, ü, ß
  - Spanish: ñ, á, é, í, ó, ú
- Critical: The NEW_VOCAB_SIZE variable in both src/config.py AND inference.py must exactly match the total number of tokens in your tokenizer.json file
- Default vocab size: Check the downloaded tokenizer.json to see the exact token count, then set NEW_VOCAB_SIZE accordingly
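Counting the tokens by hand is error-prone, so a small check can help. The sketch below makes an assumption about the file layout: it accepts either a flat `{token: id}` mapping (like the custom example later in this README) or a mapping stored under `model.vocab`. The helper names `count_tokens` and `check_vocab_size` are hypothetical, and tokens stored anywhere else in the file are not counted.

```python
import json


def count_tokens(tokenizer_path):
    """Return the number of vocabulary entries in a tokenizer JSON file.

    Assumption: the file is either a flat {token: id} mapping or keeps the
    mapping under data["model"]["vocab"]. Adjust if your file differs.
    """
    with open(tokenizer_path, encoding="utf-8") as f:
        data = json.load(f)
    model = data.get("model")
    if isinstance(model, dict) and isinstance(model.get("vocab"), dict):
        vocab = model["vocab"]
    else:
        vocab = data
    return len(vocab)


def check_vocab_size(tokenizer_path, configured_size):
    """Raise ValueError if the configured size differs from the file's count."""
    actual = count_tokens(tokenizer_path)
    if actual != configured_size:
        raise ValueError(
            f"configured size is {configured_size}, "
            f"but {tokenizer_path} has {actual} tokens"
        )
    return actual
```

Running `check_vocab_size` with the value from `src/config.py` and again with the value from `inference.py` would catch the case where only one of the two files was updated.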
- Training (Input): Chatterbox's encoder and T3 module work with 16,000 Hz (16kHz) audio. Even if your dataset uses different rates, dataset.py automatically resamples to 16kHz.
- Output (Inference): The model's vocoder generates audio at 24,000 Hz (24kHz).
chatterbox-finetune/
├── pretrained_models/                              # setup.py downloads required models here
│   ├── ve.safetensors
│   ├── s3gen.safetensors
│   ├── t3.safetensors
│   └── tokenizer.json
├── MyTTSDataset/                                   # Your custom dataset in LJSpeech format
│   ├── metadata.csv                                # Dataset metadata (file|text|normalized_text)
│   └── wavs/                                       # Directory containing WAV files
├── FileBasedDataset/                               # Your custom dataset in file-based format
│   ├── 0a0bc5d3-f195-464a-8716-d6e01fd4784f.txt    # Dataset metadata (text)
│   └── 0a0bc5d3-f195-464a-8716-d6e01fd4784f.wav    # WAV files
├── speaker_reference/                              # Speaker reference audio files
│   └── reference.wav                               # Reference audio for voice cloning
├── src/
│   ├── config.py                                   # All settings and hyperparameters
│   ├── dataset.py                                  # Data loading and processing
│   ├── model.py                                    # Model weight transfer and training wrapper
│   ├── preprocess_ljspeech.py                      # Preprocessing script
│   ├── preprocess_file_based.py                    # Preprocessing script
│   └── utils.py                                    # Logger and VAD utilities
├── train.py                                        # Main training script
├── inference.py                                    # Speech synthesis script (with VAD support)
├── setup.py                                        # Setup script for downloading models
├── requirements.txt                                # Required dependencies
└── README.md                                       # This file
Requires Python 3.8+ and GPU (recommended):
Install FFmpeg (Required):
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
Install Python Dependencies:
git clone https://github.com/gokhaneraslan/chatterbox-finetuning.git
cd chatterbox-finetuning
pip install -r requirements.txt
This multi-step process prepares all necessary files based on your chosen mode. The setup script downloads the necessary base models (ve, s3gen, t3) and the default tokenizer, and it must be run before training.
Step 2.1: Choose Your Mode
Open src/config.py and set the is_turbo variable to True or False.
# In src/config.py
is_turbo: bool = True # Set to True for Turbo, False for Standard
Step 2.2: Run the Setup Script
This command will download the correct model files. If Turbo mode is enabled, it will also automatically merge the tokenizers for you.
python setup.py
Step 2.3: Update Config (Turbo Mode ONLY)
If you ran the setup in Turbo mode, the script will output a final message like this:
Please update the 'new_vocab_size' in 'src/config.py' to the following value: 52260
Copy this exact number and paste it into the new_vocab_size variable in src/config.py. Do not skip this step!
Create a .env file or edit src/config.py to specify your dataset location and training parameters.
During training, the script loads the original model weights, intelligently resizes them for the new vocabulary size, and initializes new tokens using mean initialization from existing tokens for faster adaptation.
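To make the mean-initialization idea concrete, here is a minimal sketch using plain Python lists. It is an illustration only: the repository's training code presumably operates on PyTorch tensors, and the function name `extend_embeddings` is hypothetical. Each row stands for one token's embedding vector.

```python
def extend_embeddings(old_rows, new_vocab_size):
    """Return a copy of `old_rows` padded to `new_vocab_size` rows.

    Every added row is the column-wise mean of the original rows, so new
    tokens start from an "average" embedding rather than random noise.
    """
    if new_vocab_size < len(old_rows):
        raise ValueError("new_vocab_size must not be smaller than the old vocabulary")
    dim = len(old_rows[0])
    mean_row = [sum(row[i] for row in old_rows) / len(old_rows) for i in range(dim)]
    extended = [list(row) for row in old_rows]  # keep the pretrained rows unchanged
    extended.extend(list(mean_row) for _ in range(new_vocab_size - len(old_rows)))
    return extended


old = [[1.0, 2.0], [3.0, 6.0]]
new = extend_embeddings(old, 4)
print(new)  # [[1.0, 2.0], [3.0, 6.0], [2.0, 4.0], [2.0, 4.0]]
```

The pretrained rows are copied as they are, and only the appended rows receive the mean vector.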
We recommend using the TTS Dataset Generator tool to automatically create high-quality datasets from audio or video files.
Quick Start:
# Install the dataset generator
git clone https://github.com/gokhaneraslan/tts-dataset-generator.git
cd tts-dataset-generator
pip install -r requirements.txt
# Generate dataset from your audio/video file
python main.py --file your_audio.mp4 --model large --language en --ljspeech True
This will automatically:
- Segment audio into optimal chunks (3-10 seconds)
- Transcribe using Whisper AI
- Generate properly formatted metadata.csv and audio files
- Output directly to the MyTTSDataset/ folder in LJSpeech format
Benefits:
- Saves hours of manual segmentation and transcription
- Optimizes chunk duration for TTS training
- Handles multiple languages (en, tr, fr, de, es, etc.)
- Works with both audio and video files
Your dataset should follow the LJSpeech format with a CSV file:
filename|raw_text|normalized_text
Example metadata.csv:
recording_001|Hello world.|hello world
recording_002|This is a test recording.|this is a test recording
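Before training, it can be worth checking that every row really has the three pipe-separated fields. The sketch below is one way to do that with the standard library; the function name `read_metadata` is hypothetical and is not part of this repository.

```python
import csv


def read_metadata(path):
    """Parse a pipe-delimited LJSpeech-style metadata file.

    Returns a list of (filename, raw_text, normalized_text) tuples and
    raises ValueError for any row that does not have exactly three fields.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: quotation marks inside transcripts are ordinary characters.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 3:
                raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
            rows.append(tuple(field.strip() for field in row))
    return rows
```

A transcript that itself contains a `|` character would be reported as a malformed row, which is usually what you want to find out before preprocessing starts.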
Place your dataset in the MyTTSDataset/ folder:
MyTTSDataset/
├── metadata.csv
└── wavs/
    ├── recording_001.wav
    ├── recording_002.wav
    └── ...
Dataset Quality Requirements:
- Sample rate: 16kHz, 22.05kHz, or 44.1kHz (will be resampled to 16kHz automatically)
- Format: WAV (mono or stereo - will be converted to mono automatically)
- Duration: 3-10 seconds per segment (optimal for TTS)
- Minimum total duration: 30+ minutes for basic training
- Recommended: 1 hour of clean audio for optimal results
- Audio quality: Clean, minimal background noise
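The sample rate, channel count, and duration of each clip can be inspected with Python's built-in `wave` module. This is a rough sketch under the assumption that your files are uncompressed PCM WAV; the helper names `wav_info` and `duration_ok` are hypothetical.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def duration_ok(duration_seconds, min_seconds=3.0, max_seconds=10.0):
    """True if a clip falls inside the 3-10 second range suggested above."""
    return min_seconds <= duration_seconds <= max_seconds
```

Summing the durations over the whole `wavs/` folder gives the total length of the dataset, which can then be compared against the 30-minute minimum.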
Important: Ensure the NEW_VOCAB_SIZE in both src/config.py AND inference.py matches the number of tokens in your custom tokenizer.json.
For non-English languages:
- Create your custom tokenizer.json with all characters in your target language
- Count the total tokens in your JSON file
- Update NEW_VOCAB_SIZE in both files to match this count
Most Important Settings:
# In src/config.py
is_turbo: bool = True # Set True to train Turbo, False to train Standard.
# --- Vocabulary ---
# The size of the NEW vocabulary (from tokenizer.json)
# Ensure this matches the JSON file generated by your tokenizer script.
# For Turbo mode: Use the exact number provided by setup.py (e.g., 52260)
new_vocab_size: int = 52260 if is_turbo else 2454
# In inference.py
NEW_VOCAB_SIZE = 2454 # Must be identical to config.py
Other key parameters to adjust:
# Dataset
DATASET_PATH = "MyTTSDataset"
METADATA_FILE = "metadata.csv"
# Training
BATCH_SIZE = 4 # Adjust based on your GPU VRAM
LEARNING_RATE = 5e-5
NUM_EPOCHS = 50
If your dataset is file-based, set ljspeech = False in the configuration file.
# In src/config.py
ljspeech = False
If you have already run preprocessing once, set preprocess = False in the config file to avoid repeating it.
# In src/config.py
preprocess = False
python train.py
The trained model will be saved as chatterbox_output/t3_finetuned.safetensors. In Turbo mode, the filename is t3_turbo_finetuned.safetensors.
Training Tips:
- VRAM: T3 is a Transformer model with high VRAM usage. For 12GB VRAM, use batch_size=4. For lower VRAM, use batch_size=2 with grad_accum=32.
- Mixed Precision: Code uses fp16=True by default for faster training and memory efficiency.
- Checkpointing: Models are saved every epoch in chatterbox_output/.
- Recommended Training Duration: For optimal results with 1 hour of target speaker audio, train for 150 epochs or 1000 steps. This configuration typically produces high-quality voice cloning results.
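The relationship between batch size, gradient accumulation, and the number of optimizer steps is simple arithmetic, sketched below. The function names are hypothetical, and the step count is only an approximation, since the exact figure depends on how the training loop handles the last partial batch.

```python
import math


def effective_batch_size(batch_size, grad_accum):
    """Number of samples contributing to each optimizer step."""
    return batch_size * grad_accum


def optimizer_steps(num_samples, batch_size, grad_accum, num_epochs):
    """Approximate number of optimizer steps for a full training run."""
    per_step = effective_batch_size(batch_size, grad_accum)
    steps_per_epoch = math.ceil(num_samples / per_step)
    return steps_per_epoch * num_epochs


print(effective_batch_size(2, 32))  # 64
```

With `batch_size=2` and `grad_accum=32`, each optimizer step sees 64 samples while only 2 are held in GPU memory at a time.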
The inference script loads your fine-tuned .safetensors file and uses Silero VAD to automatically trim unwanted silence/noise at the end of generated audio.
Chatterbox is a voice cloning/style transfer model. You must provide a reference .wav file (audio prompt) for inference.
Place your reference audio in speaker_reference/:
speaker_reference/
└── reference.wav
Reference Audio Requirements:
- Format: WAV, mono or stereo
- Sample rate: Any (will be resampled automatically)
- Duration: 3-10 seconds recommended
- Quality: Clean audio with minimal background noise
Edit inference.py to set your text and audio prompt paths:
TEXT_TO_SAY = "This is a test of the fine-tuned model."
AUDIO_PROMPT = "speaker_reference/reference.wav"
Run inference:
python inference.py
The output will be saved as output_stitched.wav (24kHz).
Multiple Sentences: The script automatically splits long text into sentences for better quality:
TEXT_TO_SAY = "Hello! How are you today? This is amazing."
Audio Processing: All audio is automatically processed to mono and resampled to the correct sample rate using FFmpeg. The output format is:
- Channels: Mono (1 channel)
- Sample Rate: 24kHz
- Codec: 16-bit PCM WAV
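The sentence splitting mentioned above can be approximated with a short regular expression. This is a hedged sketch, not the rule `inference.py` actually uses: the function name `split_sentences` is hypothetical, and a simple pattern like this one will also split after abbreviations such as "Dr.".

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(split_sentences("Hello! How are you today? This is amazing."))
# ['Hello!', 'How are you today?', 'This is amazing.']
```

Each resulting sentence would be synthesized on its own and the audio pieces joined afterwards, which is consistent with the output file being named `output_stitched.wav`.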
Original Chatterbox training pipelines often process audio "on-the-fly" (resampling, feature extraction) during training. This causes the GPU to wait for the CPU, slowing down training significantly.
By running preprocess.py, we:
- Extract Speaker Embeddings (Voice Encoder)
- Extract Acoustic Tokens (S3Gen)
- Tokenize Text
- Save everything as optimized PyTorch tensors (.pt)
This allows dataset.py to simply load tensors, maximizing GPU utilization.
Standard Model Tokenizer:
The pretrained_models/tokenizer.json file downloaded by setup.py includes support for 23 languages with extensive grapheme coverage. This file is used by src/chatterbox/tokenizer.py during both training and inference.
Turbo Model Tokenizer (Smart Vocab Extension):
Turbo mode uses GPT-2's powerful BPE tokenizer as a base. The setup.py script performs a "Vocab Extension": it intelligently adds all unique characters from our 23-language grapheme set to the GPT-2 vocabulary. This process ensures that:
- The model retains its powerful knowledge of English words and structures.
- Special characters from other languages (e.g., ğ, ş, ı for Turkish; é, à, ç for French) are recognized as single, whole tokens, dramatically improving learning efficiency.
- You do not need to create a custom tokenizer manually. The setup is fully automated.
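Conceptually, the vocab extension appends each unseen grapheme to the end of the existing vocabulary. The sketch below illustrates that idea on a plain `{token: id}` dictionary. It is a simplification and an assumption, not the code in `setup.py`: a real GPT-2 tokenizer is byte-level BPE with merge rules, which this example ignores, and the name `extend_vocab` is hypothetical.

```python
def extend_vocab(base_vocab, graphemes):
    """Return a copy of `base_vocab` with unseen graphemes appended.

    New tokens receive consecutive ids after the current maximum id, so
    every existing token keeps the id the pretrained model expects.
    """
    extended = dict(base_vocab)
    next_id = max(extended.values(), default=-1) + 1
    for grapheme in graphemes:
        if grapheme not in extended:
            extended[grapheme] = next_id
            next_id += 1
    return extended


base = {"hello": 0, "world": 1, "e": 2}
merged = extend_vocab(base, ["e", "ğ", "ş", "ı"])
print(merged)       # {'hello': 0, 'world': 1, 'e': 2, 'ğ': 3, 'ş': 4, 'ı': 5}
print(len(merged))  # 6
```

The length of the merged mapping plays the role of the `new_vocab_size` value that the setup script reports.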
Default Multi-Language Support: The provided tokenizer already covers common characters from 23 languages, including but not limited to:
- Latin-based languages (English, French, Spanish, German, Italian, Portuguese)
- Turkish with special characters (ç, ğ, ı, ö, ş, ü)
- Eastern European languages
- And more
When to Create a Custom Tokenizer: You only need to create a custom tokenizer if:
- Your target language has special characters not in the default set
- You want to optimize the vocab size for a specific language
- You need to add domain-specific symbols or characters
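To find out whether your transcripts contain characters missing from the default set, you can compare them against the vocabulary. The sketch below assumes the vocabulary is available as a mapping keyed by single characters and that no normalization (such as lowercasing) happens before lookup; the name `missing_characters` is hypothetical.

```python
def missing_characters(texts, vocab):
    """Return the sorted characters that occur in `texts` but not in `vocab`."""
    seen = set()
    for text in texts:
        seen.update(text)  # adds every character of the string
    return sorted(ch for ch in seen if ch not in vocab)


vocab = {"a": 0, "b": 1, "c": 2, " ": 3}
print(missing_characters(["abc", "a ça"], vocab))  # ['ç']
```

An empty result suggests the default tokenizer already covers your data and no custom tokenizer is needed.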
Creating a Custom Tokenizer (Optional):
- Identify all characters in your target language:
  - All letters (including accented/special characters)
  - Numbers (0-9)
  - Punctuation marks
  - Special symbols used in your language
- Create the JSON mapping - Example structure:
{
  "a": 0,
  "b": 1,
  "c": 2,
  "ç": 3,
  "d": 4,
  ...
  " ": 100,
  ".": 101,
  ",": 102,
  ...
}
- Count total tokens in your JSON file
- Update NEW_VOCAB_SIZE in both src/config.py AND inference.py to match the token count
- Replace pretrained_models/tokenizer.json with your custom file before training
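The steps above can be sketched in a few lines of Python. This is an assumption-laden illustration: it produces the flat character-to-id mapping shown in the example structure, and it does not add any special tokens the repository's tokenizer loader might expect. The names `build_char_vocab` and `save_vocab` are hypothetical.

```python
import json


def build_char_vocab(texts, extra_symbols=()):
    """Map every distinct character in `texts` (plus extras) to a consecutive id."""
    chars = set(extra_symbols)
    for text in texts:
        chars.update(text)
    # Sorting keeps the ids stable across runs on the same data.
    return {ch: idx for idx, ch in enumerate(sorted(chars))}


def save_vocab(vocab, path):
    """Write the mapping as UTF-8 JSON and return its token count."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    return len(vocab)
```

The value returned by `save_vocab` is the token count to copy into `NEW_VOCAB_SIZE`. Compare the generated file against the downloaded `tokenizer.json` before replacing it, since the real file may use a different layout.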
Vocab Size Examples:
- Default (23 languages): Check your downloaded tokenizer.json for exact count
- Custom French: ~200 tokens (if you want French-only optimization)
- Custom German: ~180 tokens (if you want German-only optimization)
Important: The default tokenizer should work for most languages. Only customize if you have specific requirements or encounter missing characters.
During inference, inference.py uses Silero VAD to prevent hallucinations and sentence-ending elongations. This automatically trims unwanted silence and noise from generated audio.
All audio processing uses FFmpeg for professional-quality results:
- Input: Automatic conversion to mono (1 channel)
- Resampling: Automatic resampling to required sample rates
- Training: 16kHz processing
- Output: 24kHz, 16-bit PCM WAV format
- Codec: pcm_s16le (16-bit signed little-endian PCM)
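For reference, the conversion described above corresponds to a standard FFmpeg invocation. The sketch below builds such a command from Python. The FFmpeg options (`-ac`, `-ar`, `-c:a`, `-y`) are real, but the helper names are hypothetical and the repository's own code may call FFmpeg differently.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=24000):
    """Build an FFmpeg argument list for mono 16-bit PCM WAV output."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # target sample rate in Hz
        "-c:a", "pcm_s16le",      # 16-bit signed little-endian PCM
        str(dst),
    ]


def convert(src, dst, sample_rate=24000):
    """Run FFmpeg and raise CalledProcessError if it fails."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

Passing `sample_rate=16000` would give the training-side format, while the default of 24000 matches the inference output.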
- VE (Voice Encoder): Extracts speaker embeddings from reference audio
- T3 (Text-to-Speech): Main transformer-based TTS model (this is what you fine-tune)
- S3Gen (Vocoder): Converts mel-spectrograms to waveforms
Error: RuntimeError: Error(s) in loading state_dict for T3... size mismatch
- Solution: NEW_VOCAB_SIZE doesn't match the token count in tokenizer.json.
- Check:
  - Count tokens in your tokenizer.json file
  - Verify NEW_VOCAB_SIZE in src/config.py matches this count
  - Verify NEW_VOCAB_SIZE in inference.py also matches (must be identical)
- Common mistake: Updating only one file but not the other
Error: FileNotFoundError: ... ve.safetensors
- Solution: You haven't downloaded base models. Run python setup.py.
Error: CUDA out of memory
- Solution: Reduce BATCH_SIZE in src/config.py or enable gradient accumulation.
Poor Quality Output:
- Check reference audio quality (should be clean, at least 5 seconds)
- Ensure adequate training data (minimum 30 minutes recommended)
Based on the Chatterbox TTS model architecture. Special thanks to the original authors and contributors.
For issues and questions:
- Check the troubleshooting section above
- Review src/config.py for configuration options
- Open an issue on GitHub with detailed error messages and your setup information