LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

This repository provides the inference code and pretrained checkpoints of LSCodec.

paper demo

Environment

Our code is tested on Python 3.10. Please install the dependencies from requirements.txt:

conda create -n lscodec python=3.10
conda activate lscodec
pip install -r requirements.txt

Or, if you prefer Docker, you can directly use the image from vec2wav 2.0:

docker pull cantabilekwok511/vec2wav2.0:v0.2
docker run -it -v /path/to/vec2wav2.0:/workspace cantabilekwok511/vec2wav2.0:v0.2

Checkpoints

Checkpoints can be downloaded from HuggingFace or Modelscope.

We have two versions of LSCodec: 50Hz and 25Hz. You can use this script to automatically download them:

bash download_ckpt.sh 50hz
# or bash download_ckpt.sh 25hz

This will create pretrained/ (or pretrained_25hz/, respectively) and download the following files:

  • codebook.npy: the codebook, (1, 300, 64) for LSCodec-50Hz; (1, 1024, 64) for LSCodec-25Hz.
  • encoder_config.yml, vocoder_config.yml: configs for the encoder and vocoder, respectively.
  • lscodec_encoder.pt, lscodec_vocoder.pt: checkpoints for the encoder and vocoder, respectively.
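As a quick sanity check on these codebook sizes, the nominal bitrates follow from frame rate times bits per token. This is a minimal arithmetic sketch (the formula is the standard one for single-codebook codecs, not code from this repo):

```python
import math

# Nominal bitrate = frame rate (tokens/sec) x bits per token,
# where bits per token = log2(codebook size).
def nominal_bitrate_bps(frame_rate_hz, codebook_size):
    return frame_rate_hz * math.log2(codebook_size)

rate_50hz = nominal_bitrate_bps(50, 300)    # LSCodec-50Hz, 300-entry codebook
rate_25hz = nominal_bitrate_bps(25, 1024)   # LSCodec-25Hz, 1024-entry codebook
```

This puts LSCodec-50Hz at roughly 411 bps and LSCodec-25Hz at exactly 250 bps.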

The download script will also prompt you to download the WavLM checkpoint manually. Place that file under the pretrained model directory as well; if you already have a WavLM checkpoint, you can simply ln -s it there.

Encoding Waveform to Tokens

This codebase uses kaldiio to load and store data. First, prepare a wav.scp file listing the wav files:

utt-1 /path/to/utt_1.wav
utt-2 /path/to/utt_2.wav
...

You can also refer to example/wav.scp for an example.
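If your wav files sit in one directory, a wav.scp can be generated programmatically. This is a hypothetical helper (not part of the repo) that uses each file's stem as the utterance ID, which is one common convention:

```python
from pathlib import Path

# Hypothetical helper: build a Kaldi-style wav.scp from a directory of
# .wav files, one "utt-id /abs/path.wav" line per file, sorted by name.
def build_wav_scp(wav_dir, out_path):
    wavs = sorted(Path(wav_dir).glob("*.wav"))
    lines = [f"{w.stem} {w.resolve()}" for w in wavs]
    Path(out_path).write_text("\n".join(lines) + "\n")
    return len(lines)
```

Any scheme works as long as the utterance IDs are unique and match across the scp files you feed to the scripts below.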

Then, encoding can be done by

source path.sh
encode.py --wav-scp example/wav.scp \
          --outdir example/tokens/ \
          --pretrained-dir pretrained/
# specify pretrained_25hz for 25Hz version.

The tokens are stored in example/tokens/feats.ark and feats.scp. The feats.scp file should look like:

3570_5694_000009_000002 /path/to/example/tokens/feats.ark:24
8455_210777_000079_000002 /path/to/example/tokens/feats.ark:677

You can also look into lscodec/bin/encode.py if you want to save the tokens in a different format.
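The feats.scp format shown above can also be parsed without any Kaldi tooling. A minimal sketch, assuming the standard "utt-id ark_path:byte_offset" layout (kaldiio handles this for you in practice):

```python
# Each feats.scp line maps an utterance ID to a byte offset inside the
# binary feats.ark file, written as "utt-id /path/to/feats.ark:OFFSET".
def parse_feats_scp(text):
    entries = {}
    for line in text.strip().splitlines():
        utt, pointer = line.split(maxsplit=1)
        ark, offset = pointer.rsplit(":", 1)
        entries[utt] = (ark, int(offset))
    return entries
```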

Vocoding with Reference Prompts

Once encoded, LSCodec tokens can be vocoded into 24 kHz waveforms using

source path.sh
decode_wav_prompt.py --feats-scp example/tokens/feats.scp \
    --prompt-wav-scp example/prompt.scp \
    --outdir example/wav \
    --pretrained-dir pretrained/
# specify pretrained_25hz for 25Hz version.

Here, --prompt-wav-scp specifies a reference prompt wav for each utterance's token sequence. The prompt.scp file looks like:

utt-1 /path/to/reference_utt_1.wav
utt-2 /path/to/reference_utt_2.wav

Finally, the decoded waveforms can be found in example/wav.
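Since the vocoder pairs tokens and prompts by utterance ID, a mismatch between feats.scp and prompt.scp is an easy mistake. This hypothetical pre-flight check (not repo code) flags utterances that have tokens but no prompt:

```python
# Hypothetical sanity check: return the set of utterance IDs that appear
# in feats.scp but have no matching entry in prompt.scp.
def missing_prompts(feats_scp_text, prompt_scp_text):
    feats_ids = {line.split()[0] for line in feats_scp_text.strip().splitlines()}
    prompt_ids = {line.split()[0] for line in prompt_scp_text.strip().splitlines()}
    return feats_ids - prompt_ids
```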

Combining Encoding and Vocoding into One Step

To run encoding and vocoding in a single step, use:

source path.sh
recon_with_prompt.py --wav-scp example/wav.scp \
    --prompt-wav-scp example/prompt.scp \
    --outdir example/wav \
    --pretrained-dir pretrained/
# specify pretrained_25hz for 25Hz version.

Citation

@inproceedings{guo25_interspeech,
  title     = {{LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec}},
  author    = {Yiwei Guo and Zhihan Li and Chenpeng Du and Hankun Wang and Xie Chen and Kai Yu},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {5018--5022},
  doi       = {10.21437/Interspeech.2025-1106},
  issn      = {2958-1796},
}
