Demo video: `AI.voice.cloning.Demo.mp4`
This repository contains a dynamic text-to-speech voice-cloning application built with Streamlit and the Bark TTS model. Users can upload an audio file, clone its voice, and generate speech from text input through a user-friendly interface, then download the generated audio file.
## Table of Contents

- Overview
- Features
- Installation
- Usage
- Explanation of Code
- Diagram
- Screenshot
- Voice Samples
- Contributing
- License
## Overview

The application leverages the Bark TTS model to synthesize speech from text input. Users upload an audio file to clone the voice, generate speech in that voice, and download the resulting audio file.
## Features

- Upload audio files (WAV, MP3) to clone the voice.
- Convert text to speech using the cloned voice.
- Download the generated speech as a WAV file.
- Interactive and user-friendly Streamlit interface.
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Mercity-AI/Voice-Cloning-Demo.git
   cd Voice-Cloning-Demo
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

1. Run the Streamlit application:

   ```bash
   streamlit run app.py
   ```

2. Open your web browser and go to http://localhost:8501.

3. Upload an audio file, enter the text you want to convert to speech, and click "Generate Speech".

4. Download the generated speech by clicking the "Download Audio" button.
## Explanation of Code

Building a voice cloning pipeline means assembling a system that takes an audio sample of a speaker's voice and generates new speech in that same voice from text supplied by the user.
```python
from TTS.tts.configs.bark_config import BarkConfig
from TTS.tts.models.bark import Bark
from scipy.io.wavfile import write as write_wav
import os
```

The TTS library is a powerful tool for text-to-speech conversion and supports multiple TTS models, including Bark. These imports bring in the configuration and model components required to set up and use the Bark TTS model.
SciPy is a scientific computing library for Python. Here it is used to save the generated speech waveform to an audio file: the `write_wav` function writes a NumPy array to a WAV file, a common format for storing audio data.
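As a quick standalone illustration of how `write_wav` behaves (separate from the Bark pipeline; the filename and tone here are made up for the example), the snippet below writes a one-second 440 Hz sine wave to disk and reads it back:

```python
import numpy as np
from scipy.io.wavfile import read as read_wav, write as write_wav

sample_rate = 24_000                     # samples per second
t = np.linspace(0, 1.0, sample_rate, endpoint=False)
tone = (0.3 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)  # 1 s, 440 Hz

write_wav("tone.wav", sample_rate, tone) # float32 arrays are written as 32-bit float WAV
rate, data = read_wav("tone.wav")        # round-trip to verify the file
```

Passing an `int16` array instead would produce a 16-bit PCM file; `write_wav` infers the WAV sample format from the array's dtype.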
The `os` module provides a way to interact with the operating system; here it is used for handling directory paths and file management.
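For example, a small helper (hypothetical, not part of this repository) could use `os` to validate the speaker directory and collect its audio clips:

```python
import os

def list_speaker_clips(voice_dirs):
    """Return the WAV/MP3 files in a speaker directory, sorted by name.

    Illustrative only; the actual app receives uploads through Streamlit
    rather than scanning a directory like this.
    """
    if not os.path.isdir(voice_dirs):
        raise FileNotFoundError(f"Speaker directory not found: {voice_dirs}")
    return sorted(
        name for name in os.listdir(voice_dirs)
        if name.lower().endswith((".wav", ".mp3"))
    )
```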
```python
config = BarkConfig()
model = Bark.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="bark/", eval=True)
```

- `BarkConfig()` initializes the configuration for the Bark model, including the parameters that control the model's behavior during speech synthesis.
- `Bark.init_from_config(config)` initializes the Bark TTS model from that configuration, setting up the model architecture and preparing it for loading pre-trained weights.
- `load_checkpoint(...)` loads the pre-trained weights from the specified checkpoint directory. This is crucial: the weights encode what the model learned during training and are what enable it to generate high-quality speech.
```python
text = "Mercity ai is a leading AI innovator in India, with OpenAI planning collaboration."
voice_dirs = "/Users/username/Desktop/projects/AI voice Cloning/Speaker voice/"
```

- `text` defines the text that will be converted into speech; it is the input the TTS model processes to generate the corresponding audio output.
- `voice_dirs` specifies the directory containing the speaker's audio files. These files are used to extract speaker-specific characteristics (embeddings) for voice cloning.
```python
output_dict = model.synthesize(text, config, speaker_id='speaker', voice_dirs="bark_voices", temperature=0.95)
```

This uses the Bark model to synthesize speech from the input text, combining the text with the speaker-specific embeddings extracted from the audio files in the voice directory.

Parameters:

- `text`: The text to be converted to speech.
- `config`: The model configuration.
- `speaker_id`: An identifier for the speaker (not deeply detailed here, but typically used to select the appropriate speaker embedding).
- `voice_dirs`: Directory containing the speaker's audio files.
- `temperature`: Controls the randomness of the output. Lower values make the output more deterministic, while higher values introduce more variation.
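Bark's sampling internals are not exposed here, but the effect of `temperature` can be sketched in plain NumPy (a standalone illustration, not the model's actual code): dividing the logits by the temperature before the softmax sharpens the distribution when the temperature is low and flattens it when it is high.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample one index from logits after temperature scaling (illustrative)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()            # softmax over the scaled logits
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
# At a very low temperature the distribution collapses onto the argmax.
choice = sample_with_temperature([1.0, 5.0, 2.0], temperature=0.01, rng=rng)
```

At `temperature=0.95`, as used above, sampling stays close to the model's learned distribution while still allowing some natural variation between runs.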
```python
write_wav("SamAltman.wav", 24000, output_dict["wav"])
```

Saves the synthesized speech to a WAV file at a sample rate of 24,000 Hz.
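One practical caveat (an assumption about the output format worth verifying in your setup): Bark returns a floating-point waveform, and while `write_wav` accepts float arrays directly, converting to 16-bit PCM first maximizes compatibility with audio players. A minimal sketch:

```python
import numpy as np
from scipy.io.wavfile import write as write_wav

SAMPLE_RATE = 24_000  # Bark's output sample rate, as used above

def save_pcm16(wav, path, sample_rate=SAMPLE_RATE):
    """Clip a float waveform to [-1, 1] and write it as 16-bit PCM WAV."""
    wav = np.clip(np.asarray(wav, dtype=np.float32), -1.0, 1.0)
    pcm = (wav * 32767.0).astype(np.int16)
    write_wav(path, sample_rate, pcm)
    return pcm

pcm = save_pcm16([0.0, 0.5, -0.5, 1.5], "clipped.wav")  # 1.5 is clipped to 1.0
```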
## Voice Samples

To listen to the generated speech sample, click the play button below:

Sample-2: `WhatIF.mp4`
To learn more, visit our blog post: https://www.mercity.ai/blog-post/how-to-build-real-time-voice-cloning-pipelines
## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or suggestions.
