This repository contains the inference code and checkpoints of LSCodec.
Our code is tested on Python 3.10. Please use the requirements.txt:

```bash
conda create -n lscodec python=3.10
conda activate lscodec
pip install -r requirements.txt
```

Or, if you prefer Docker, you can directly use the image from vec2wav 2.0:
```bash
docker pull cantabilekwok511/vec2wav2.0:v0.2
docker run -it -v /path/to/vec2wav2.0:/workspace cantabilekwok511/vec2wav2.0:v0.2
```

Checkpoints can be downloaded from HuggingFace or ModelScope.
We have two versions of LSCodec: 50Hz and 25Hz. You can use this script to automatically download them:
```bash
bash download_ckpt.sh 50hz
# or: bash download_ckpt.sh 25hz
```

This will create pretrained/ (or pretrained_25hz/, respectively) and download the following files:
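As a rough back-of-the-envelope check (this calculation is ours, not from the repo), the bitrate of each version follows directly from the frame rate and the codebook size:

```python
import math

def bitrate_bps(frame_rate_hz: int, codebook_size: int) -> float:
    """Bits per second = tokens per second x bits per token."""
    return frame_rate_hz * math.log2(codebook_size)

# LSCodec-50Hz: 50 tokens/s from a 300-entry codebook.
print(round(bitrate_bps(50, 300)))   # ~411 bps
# LSCodec-25Hz: 25 tokens/s from a 1024-entry codebook.
print(round(bitrate_bps(25, 1024)))  # 250 bps
```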
- `codebook.npy`: the codebook, shaped (1, 300, 64) for LSCodec-50Hz and (1, 1024, 64) for LSCodec-25Hz.
- `encoder_config.yml`, `vocoder_config.yml`: configs for the encoder and vocoder, respectively.
- `lscodec_encoder.pt`, `lscodec_vocoder.pt`: checkpoints for the encoder and vocoder, respectively.
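Token IDs produced by the encoder index into this codebook. A minimal sketch of the lookup, using a random stand-in array with the LSCodec-50Hz shape (the real codebook is `np.load("pretrained/codebook.npy")`):

```python
import numpy as np

# Stand-in codebook with the LSCodec-50Hz shape (1, 300, 64);
# replace with np.load("pretrained/codebook.npy") after downloading.
codebook = np.random.randn(1, 300, 64).astype(np.float32)

# A token sequence as produced by the encoder: integer IDs in [0, 300).
tokens = np.array([17, 42, 42, 299])

# Fancy indexing retrieves one 64-dim embedding per token.
embeddings = codebook[0][tokens]
print(embeddings.shape)  # (4, 64)
```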
This downloading script will also prompt you to download the WavLM checkpoint manually. Please put this file under the pretrained model directory as well.
If you already have a WavLM checkpoint downloaded, you can also symlink it with `ln -s`.

- `WavLM-Large.pt`: the WavLM-Large checkpoint from the official repo.
This codebase uses kaldiio to load and store data.
Firstly, please prepare a wav.scp file containing the wav files:

```
utt-1 /path/to/utt_1.wav
utt-2 /path/to/utt_2.wav
...
```

You can also refer to example/wav.scp for an example.
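If your audio sits in a single directory, a wav.scp in this format can be generated with a few lines of Python. The helper below is a hypothetical convenience, not part of the repo:

```python
from pathlib import Path

def write_wav_scp(wav_dir: str, scp_path: str) -> None:
    """Write one 'utt-id /abs/path.wav' line per wav file, sorted by utterance ID."""
    wavs = sorted(Path(wav_dir).glob("*.wav"))
    with open(scp_path, "w") as f:
        for wav in wavs:
            # Use the filename stem as the utterance ID.
            f.write(f"{wav.stem} {wav.resolve()}\n")
```

Utterance IDs here are just the filename stems; any unique ID scheme works, as long as it is consistent between wav.scp and prompt.scp.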
Then, encoding can be done by

```bash
source path.sh
encode.py --wav-scp example/wav.scp \
          --outdir example/tokens/ \
          --pretrained-dir pretrained/
# specify pretrained_25hz for the 25Hz version.
```

The tokens are stored in example/tokens/feats.ark and feats.scp. The feats.scp should look like:
```
3570_5694_000009_000002 /path/to/example/tokens/feats.ark:24
8455_210777_000079_000002 /path/to/example/tokens/feats.ark:677
```
You can also look into lscodec/bin/encode.py if you want to save the tokens in a different format.
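Each feats.scp entry points at a byte offset inside the ark file; kaldiio resolves these for you (e.g. via `kaldiio.load_scp`), but the format itself is easy to parse. A minimal illustrative parser (not part of the repo):

```python
def parse_scp_line(line: str) -> tuple[str, str, int]:
    """Split 'utt-id /path/feats.ark:24' into (utt_id, ark_path, byte_offset)."""
    utt_id, ark_spec = line.strip().split(maxsplit=1)
    # rsplit on the last ':' so paths containing colons still parse.
    ark_path, offset = ark_spec.rsplit(":", 1)
    return utt_id, ark_path, int(offset)

utt, path, offset = parse_scp_line(
    "3570_5694_000009_000002 /path/to/example/tokens/feats.ark:24"
)
print(utt, offset)  # 3570_5694_000009_000002 24
```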
Once encoded, LSCodec tokens can be vocoded into 24kHz waveforms using

```bash
source path.sh
decode_wav_prompt.py --feats-scp example/tokens/feats.scp \
                     --prompt-wav-scp example/prompt.scp \
                     --outdir example/wav \
                     --pretrained-dir pretrained/
# specify pretrained_25hz for the 25Hz version.
```

Here `--prompt-wav-scp` specifies the prompt wav for each utterance's token sequence. This prompt.scp looks like:
```
utt-1 /path/to/reference_utt_1.wav
utt-2 /path/to/reference_utt_2.wav
```
Finally, the decoded waveforms can be found in example/wav.
If you want to use one script for the encoding and vocoding process together, consider:

```bash
source path.sh
recon_with_prompt.py --wav-scp example/wav.scp \
                     --prompt-wav-scp example/prompt.scp \
                     --outdir example/wav \
                     --pretrained-dir pretrained/
# specify pretrained_25hz for the 25Hz version.
```

If you find this work useful, please cite:

```bibtex
@inproceedings{guo25_interspeech,
  title     = {{LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec}},
  author    = {Yiwei Guo and Zhihan Li and Chenpeng Du and Hankun Wang and Xie Chen and Kai Yu},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {5018--5022},
  doi       = {10.21437/Interspeech.2025-1106},
  issn      = {2958-1796},
}
```