This repository contains code for the paper 'PAS-SE: Personalized Auxiliary-Sensor Speech Enhancement for Voice Pickup in Hearables'. The architectures are heavily based on FT-JNF and the time-domain Speakerbeam (see links below). Preprocessing and training code for the Vibravox and Oldenburg datasets is included as well.
- Landing page with audio examples: https://bose.github.io/passe/
- arXiv preprint: https://doi.org/10.48550/arXiv.2509.20875
- Original FT-JNF repository: https://github.com/sp-uhh/deep-non-linear-filter/
- Original time-domain Speakerbeam repository: https://github.com/BUTSpeechFIT/speakerbeam/
- The training code in this repository is based on this template: https://github.com/victoresque/pytorch-template

Many thanks to the original authors!
Python packages and versions used for this project are listed in `requirements.txt`.
The code and training configuration files were used on an AWS instance with `--instance-type g6.24xlarge` (see https://aws.amazon.com/ec2/instance-types/), so you may need to adapt the number of workers, batch sizes, and number of GPUs to your hardware in order to retrain the models (a sketch of patching these settings is shown in the training section below).
This is how to initialize the DNN models used in the paper:

```python
import torch
import torchinfo

# FTJNF_MAG and FTJNF_MAG_1mult_learned_enc are defined in this repository's
# model code; adjust the import below to match your checkout's layout.
# from model import FTJNF_MAG, FTJNF_MAG_1mult_learned_enc

# Assumed channel order:
#   0: 'outer microphone' (OM)
#   1: 'in-ear microphone' (IM)
e = torch.randn(1, 2, 3 * 16000)  # enrollment signal with 2 channels, each 3 seconds long
x = torch.randn(1, 2, 3 * 16000)  # input signal with 2 channels, each 3 seconds long

# SE
model = FTJNF_MAG(n_channels=1, ref_ch=[0], mask_ch=[0])

# PSE with OM conditioning
model = FTJNF_MAG_1mult_learned_enc(
    n_channels=1, ref_ch=[0], cond_vector_size=192, mask_ch=[0], cond_vector_ch=0)

# PSE with IM conditioning
model = FTJNF_MAG_1mult_learned_enc(
    n_channels=1, ref_ch=[0], cond_vector_size=192, mask_ch=[0], cond_vector_ch=1)

# AS-SE
model = FTJNF_MAG(n_channels=2, ref_ch=[0, 1], mask_ch=[0])

# PAS-SE with OM conditioning
model = FTJNF_MAG_1mult_learned_enc(
    n_channels=2, ref_ch=[0, 1], cond_vector_size=192, mask_ch=[0], cond_vector_ch=0)

# PAS-SE with IM conditioning
model = FTJNF_MAG_1mult_learned_enc(
    n_channels=2, ref_ch=[0, 1], cond_vector_size=192, mask_ch=[0], cond_vector_ch=1)

y = model(x, e)
print(y.shape)

torchinfo.summary(model, input_size=(x.shape, e.shape))
```

Preprocessing files are located in the `preprocessing_vibravox` and `preprocessing_oldenburg` folders.
Before you run any of the files, please review them and adjust the paths to match your system's directory structure.
Files located in the folder `preprocessing_vibravox/`:

- `download_vibravox.py` downloads the dataset from Hugging Face. Please download the `speech_clean` and `speechless_noisy` subsets (see the sketch after this list).
- `convert_vibravox_to_wav.py` reads from the Hugging Face dataset, resamples to 16 kHz, performs channel selection, and saves the dataset again as WAV files with a corresponding file list that also contains speaker information. This file also creates a channel indexing text file.
- `preprocess_vibravox_TSE.py` can be used to obtain the training and validation sets, which contain all scenarios (speech mixed with noise, speech mixed with an interferer, speech mixed with an interferer and noise). Please note that the in-ear microphone signals for Vibravox with interfering talkers are not valid for training and evaluation due to the lack of isolated interfering-talker in-ear signals (see paper).
- `preprocess_vibravox_TSE_scaled_outer_interferer.py` can be used to obtain the training and validation sets where the corresponding interfering-talker in-ear signals are approximated by a scaling factor (Configuration (D) in the paper).
- `preprocess_vibravox_TSE_separate_eval_sets.py` can be used to obtain the test sets, which are separated by scenario.
- `prepare_vibravox_enrollments.py` creates a table of enrollment file names. The actual selection takes place in the training dataset code. You might need to uncomment the correct splits and change paths here, too.
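As a rough illustration of the download step, the sketch below pulls both subsets with the `datasets` library. The hub name `Cnam-LMSSC/vibravox` is an assumption here; please verify it, and the subset spellings, against `download_vibravox.py` itself.

```python
# Minimal sketch of the download step, assuming the Vibravox dataset is hosted
# on the Hugging Face hub as "Cnam-LMSSC/vibravox"; verify the hub name and
# subset names against download_vibravox.py before relying on this.
from datasets import load_dataset

for subset in ("speech_clean", "speechless_noisy"):
    ds = load_dataset("Cnam-LMSSC/vibravox", subset)
    print(subset, {split: len(ds[split]) for split in ds})
```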
Files located in the folder `preprocessing_oldenburg/`:

- Please download https://doi.org/10.5281/zenodo.10844598 (speech) and https://doi.org/10.5281/zenodo.11196866 (impulse responses).
- You will also need to download the `noise_fullband` part of the DNS5 challenge dataset (using the script provided in https://github.com/microsoft/DNS-Challenge/tree/v5dnschallenge_ICASSP2023).
- `preclean_oldenburg.py` performs resampling and removal of ambient noise in the recorded own-voice speech.
- `simulate_oldenburg_noise.py` spatializes the DNS challenge noise with the impulse responses (see the sketch after this list).
- `simulate_oldenburg_interferers.py` spatializes the Oldenburg interferers with the impulse responses.
- `preprocess_oldenburg_TSE.py` can be used to obtain the training and validation sets, which contain all scenarios (speech mixed with noise, speech mixed with an interferer, speech mixed with an interferer and noise).
- `preprocess_oldenburg_TSE_separate_eval_sets.py` can be used to obtain the test sets, which are separated by scenario.
- `prepare_oldenburg_enrollments.py` creates a table of enrollment file names. The actual selection takes place in the training dataset code. You might need to uncomment the correct splits and change paths here, too.
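At their core, the spatialization scripts convolve a source signal with measured multichannel impulse responses. The following is a minimal sketch of that idea only; the file names and channel layout are placeholders, and the actual scripts define their own paths, conventions, and processing chain.

```python
# Minimal sketch of spatialization: convolve a single-channel noise recording
# with a multichannel impulse response. File names below are placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

noise, fs = sf.read("dns5_noise_fullband/example_noise.wav")  # (n_samples,)
rir, fs_rir = sf.read("oldenburg_rirs/example_rir.wav")       # (n_rir, n_channels)
assert fs == fs_rir, "noise and impulse response must share a sample rate"

# Convolve the noise with each channel's impulse response and truncate
# back to the original length.
spatialized = np.stack(
    [fftconvolve(noise, rir[:, ch])[: len(noise)] for ch in range(rir.shape[1])],
    axis=1,
)
sf.write("spatialized_noise.wav", spatialized, fs)
```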
After generating the data required for training, you can train a model using one of the configuration files provided in the folder `training_code/configs/`. You will probably also want to change the file paths for the pre-processed data in these configs.
Example call: `python train.py -c configs/Vibravox_SE.json -d 0,1` trains the basic single-channel unconditioned system on the Vibravox dataset, using GPUs 0 and 1.
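Hardware-dependent settings (see the note on the AWS instance type above) live in these configuration files. Below is a minimal sketch of patching them programmatically; the key names (`n_gpu`, `data_loader.args.batch_size`, `data_loader.args.num_workers`) are an assumption based on the pytorch-template conventions, so please verify them against the actual files in `training_code/configs/`.

```python
# Hedged sketch: adapt hardware-dependent settings in a training config.
# The key names below follow the pytorch-template conventions and are an
# assumption; check them against the files in training_code/configs/.
import json

with open("training_code/configs/Vibravox_SE.json") as f:
    config = json.load(f)

config["n_gpu"] = 1                               # e.g. a single-GPU machine
config["data_loader"]["args"]["batch_size"] = 8   # reduce to fit GPU memory
config["data_loader"]["args"]["num_workers"] = 4  # match available CPU cores

with open("training_code/configs/Vibravox_SE_local.json", "w") as f:
    json.dump(config, f, indent=4)
```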
If you found this code helpful or want to reference it, please cite as:
```bibtex
@article{passe2025,
  title   = {{PAS-SE}: Personalized Auxiliary-Sensor Speech Enhancement for Voice Pickup in Hearables},
  journal = {arXiv:2509.20875},
  author  = {Ohlenbusch, Mattes and Kegler, Mikolaj and Stamenovic, Marko},
  year    = {2025},
  month   = sep,
  doi     = {10.48550/arXiv.2509.20875}
}
```