This repository is part of my master's thesis project. It is based on the official OpenVPI DiffSinger implementation.
In addition to reproducing and adapting the core DiffSinger model, this repo includes all scripts and resources used throughout my research pipeline. These include external tools, dependent repositories, and custom scripts located in the user_script/ directory. While the main singing voice synthesis model comes from OpenVPI's DiffSinger, the end-to-end workflow and data processing were extended to suit the needs of my thesis experiments.
This work explores phoneme-mapped cross-lingual transfer learning for singing voice synthesis (SVS), focusing on adapting an English-trained DiffSinger model to German using minimal target-language data. We focus on the acoustic model (not the variance model), and investigate how data quality—particularly accent, vocal range, and recording conditions—impacts low-resource SVS performance. The full thesis can be found here.
Please follow the installation and dependency setup as described in the original DiffSinger repository. This fork maintains compatibility with the upstream environment and training pipeline.
The experimental pipeline includes the following key stages, with associated scripts and tools:
- Extract audio: mp4_to_wav
- Clean audio: fishaudio preprocess tools
- Auto-slice: AudioSlicer
- Manual adjustment (optional): slice_audio, trim_audio
- Automatic transcription: Whisper via fishaudio
- Manual annotation (optional): lyrics_to_lab, check_lab
- Convert GTSinger format to DiffSinger format: convert, cleanup
- Select wavs by target duration: filter_by_duration
- Calculate total corpus length: calculate_duration
- Clean up folder: clean up folder
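The duration-based selection steps above can be sketched as follows. This is a hypothetical reimplementation for illustration, not the actual filter_by_duration / calculate_duration code from user_script/: it assumes durations in seconds and uses a simple greedy selection (longest files first) until the target corpus length is reached.

```python
# Sketch of the duration utilities (hypothetical; the real scripts may differ).
import wave


def wav_duration(path: str) -> float:
    """Duration of a PCM wav file in seconds, using only the stdlib."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()


def filter_by_duration(durations: dict, target_s: float) -> list:
    """Greedily pick files (longest first) until the target total duration is met."""
    picked, total = [], 0.0
    for name, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        if total >= target_s:
            break
        picked.append(name)
        total += dur
    return picked


def total_duration(durations: dict) -> float:
    """Total corpus length in seconds."""
    return sum(durations.values())
```

For example, `filter_by_duration({p: wav_duration(p) for p in paths}, 600.0)` would select a roughly 10-minute training subset.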
- Check and fill missing words in lexicon: check_lexicon
- Automatic alignment using Montreal Forced Aligner
- Manual alignment using vLabeler
- Phoneme-to-phoneme mapping via IPA & PHOIBLE: phoneme_mapping
- Phoneme numbers (ph_num), English: colstone/ENG_dur_num
- Phoneme numbers (ph_num), German: switch to dur_num_dict.txt
- Note sequence: OpenVPI/SOME
- f0 and time-step: OpenVPI MakeDiffSinger
- Combine multiple ds files (optional): combine_ds.py
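The phoneme-to-phoneme mapping step above can be sketched as a simple substitution table. This is an illustrative sketch only: the actual phoneme_mapping script derives its table from IPA transcriptions and PHOIBLE feature distances, and the example entries below are hypothetical common L2 substitutions, not the mapping used in the thesis.

```python
# Hypothetical English-to-German phoneme substitution table (illustrative only).
ENG_TO_GER = {
    "dh": "d",  # English /dh/ has no German counterpart; map to nearest stop
    "th": "s",  # /th/ -> /s/, a common substitution
    "w":  "v",  # /w/ -> /v/
}


def map_phonemes(seq, table, keep_unmapped=True):
    """Map a phoneme sequence through the table; phonemes shared by
    both inventories pass through unchanged."""
    out = []
    for ph in seq:
        if ph in table:
            out.append(table[ph])
        elif keep_unmapped:
            out.append(ph)
    return out
```

Keeping unmapped phonemes as-is is a deliberate fallback: phonemes present in both inventories need no substitution, and silently dropping unknown symbols would corrupt the alignment between phonemes and durations.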
- FFE & MCD: user_script/06_objective_evaluation/FFE&MCD/
- Intelligibility transcription (Whisper): fishaudio transcribe
- Word Error Rate (WER): run_wer_eval.py
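The WER metric underlying the evaluation step above can be sketched as follows. This is a minimal reference implementation of the standard metric (word-level edit distance normalized by reference length), not the actual run_wer_eval.py code:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: minimum edit distance (substitutions, insertions,
    deletions) between word sequences, divided by the reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, and the reference must be non-empty.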
- Paper: DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
- Implementation: OpenVPI/DiffSinger
- Denoising Diffusion Probabilistic Models (DDPM): paper, implementation
- DDIM for diffusion sampling acceleration
- PNDM for diffusion sampling acceleration
- DPM-Solver++ for diffusion sampling acceleration
- UniPC for diffusion sampling acceleration
- Rectified Flow (RF): paper, implementation
- RoPE for transformer encoder
- HiFi-GAN and NSF for waveform reconstruction
- pc-ddsp for waveform reconstruction
- RMVPE and yxlllc's fork for pitch extraction
- Vocal Remover and yxlllc's fork for harmonic-noise separation
The following repositories are used as part of the data preparation and evaluation pipeline described in the Workflow Overview:
- OpenVPI/AudioSlicer – Automatic audio slicing
- OpenVPI/MakeDiffSinger – Data preprocessing utilities
- OpenVPI/SOME – Note duration extraction
- fishaudio/audio-preprocess – Audio cleaning and Whisper-based lyric transcription
- PHOIBLE – Phonological feature database
- Montreal Forced Aligner (MFA) – Phoneme-level alignment
- vLabeler – Manual phoneme-level alignment
- colstone/ENG_dur_num – Duration-number mapping utilities
- GTSinger – Dataset
Any organization or individual is prohibited from using any functionality in this repository to generate anyone's voice without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may put you in violation of copyright law.
This forked DiffSinger repository is licensed under the Apache 2.0 License.