Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition
This is the official implementation of our Interspeech 2025 paper: Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition.
First, create a conda virtual environment and activate it:
conda create -n dpavhubert python=3.8 -y
conda activate dpavhubert
Then, clone this repository and its submodules:
git clone https://github.com/Cyion/dpav_hubert.git
cd dpav_hubert
git submodule init
git submodule update
Lastly, install fairseq and the other packages:
pip install -r requirements.txt
cd fairseq
pip install --editable ./
To run a joint distillation and pruning experiment, execute the following script:
sbatch scripts/run_pruning.sh
For configuration, see scripts/run_pruning.sh. Configuration files can be found in avhubert/conf/.
To run a joint distillation and merging experiment, execute the following script:
sbatch scripts/run_merging.sh
For configuration, see scripts/run_merging.sh. Configuration files can be found in avhubert/conf/.
To run a joint distillation, pruning, and merging experiment, execute the following script:
sbatch scripts/run_pruning_merging.sh
For configuration, see scripts/run_pruning_merging.sh. Configuration files can be found in avhubert/conf/.
To evaluate the FLOPs of a model, run:
python avhubert/eval_flops.py --ckpt <path to model checkpoint>
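The counting method of eval_flops.py is not reproduced here, but as a back-of-the-envelope illustration of why pruning the FFN intermediate dimension reduces compute, the multiply-accumulate count of a dense projection scales linearly with both of its dimensions. The dimensions below are typical transformer values chosen for illustration, not values taken from the repository.

```python
def linear_macs(in_dim, out_dim, seq_len=1):
    """Multiply-accumulates of a dense projection applied to seq_len frames."""
    return in_dim * out_dim * seq_len

# An FFN block consists of an up-projection and a down-projection.
# Pruning half the intermediate dimension halves both projections:
full = linear_macs(768, 3072) + linear_macs(3072, 768)
pruned = linear_macs(768, 1536) + linear_macs(1536, 768)
assert pruned == full // 2
```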
The teacher-student architecture (AVHubertDistill) is implemented in hubert_distill.py; its configuration is AVHubertDistillConfig in the same file. The joint distillation-and-pruning and joint distillation-and-merging objectives (AVHubertDistillCriterion) are implemented in hubert_distill_criterion.py; their configuration is AVHubertDistillCriterionConfig in the same file.

Removing the hard concrete masks from the model and pruning the components according to the mask values is implemented in prune.py. The avhubert model (see hubert.py) provides a prune function that prunes the components and returns their new configuration. The hard concrete mask layer (HardConcrete) is implemented in hardconcrete.py and used in encoder.py and resnet.py to prune attention heads, FFN intermediate dimensions, and CNN channels. The expected number of model parameters is calculated by the get_num_params function. Pruning of the individual avhubert components, e.g., linear layers, conv2d layers, layer norms, and PReLU, is implemented in pruning_utils.py.

Merging the query, key, and value projection weights is implemented in merge.py. The MHA layer provides a merge function that merges the weights; its parameter "t" corresponds to beta, and "type" selects the projections to merge, e.g., QK, QV, KV, or QKV.
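To make the HardConcrete mechanism concrete, the sketch below shows the standard hard concrete gate (Louizos et al., "Learning Sparse Neural Networks through L0 Regularization"): each prunable unit (attention head, FFN dimension, or CNN channel) gets a learnable log_alpha, a stochastic gate z in [0, 1] during training, and a closed-form expected-L0 term that drives get_num_params-style parameter estimates. This is an illustrative pure-Python sketch with assumed hyperparameters (BETA, GAMMA, ZETA), not the repository's exact implementation.

```python
import math
import random

# Assumed stretch/temperature hyperparameters of the hard concrete
# distribution; the repository's HardConcrete layer may use other values.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_gate(log_alpha, rng=random):
    """Sample a stochastic gate z in [0, 1] for one prunable unit."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    s_bar = s * (ZETA - GAMMA) + GAMMA   # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, s_bar))     # hard clamp to [0, 1]

def expected_l0(log_alpha):
    """Probability that the gate is non-zero; summed over all units this
    gives the expected number of kept components."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

# A very negative log_alpha drives the gate toward 0 (unit pruned);
# a very positive one keeps the unit.
assert expected_l0(-10.0) < 0.01 < 0.99 < expected_l0(10.0)
```

After training, components whose mask values indicate a closed gate are removed from the model (the role of prune.py), yielding a genuinely smaller network rather than a masked one.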
If you find this work helpful, please cite the paper:
@inproceedings{li25q_interspeech,
  title     = {{Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition}},
  author    = {Zhengyang Li and Pascal Reichert and Thomas Graave and Patrick Blumenberg and Tim Fingscheidt},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {1833--1837},
  doi       = {10.21437/Interspeech.2025-1464},
  issn      = {2958-1796},
}