
Works Collection of VIPL-AVSU-Group

This is a collection of works from the Audio-Visual Speech Understanding Group at VIPL.

Recent News:

[2025-07]: 1 paper is accepted by BMVC 2025! Congratulations to Tian-Yue!

[2025-06]: 2 papers are accepted by IEEE ICCV 2025! Congratulations to Fei-Xiang and Zhao-Xin!

[2025-05]: 1 paper is accepted by IEEE FG 2025! Congratulations to Song-Tao!

[2024-12]: The MAVSR-2025 Challenge @ IEEE FG 2025 has started! Welcome to the competition!

[2024-06]: Championship in the open track of the AVSE Challenge @ InterSpeech 2024! Congratulations to Fei-Xiang!

[2024-02]: 1 paper is accepted by CVPR 2024! Congratulations to Yuan-Hang!

[2023-08]: 3 papers are accepted by BMVC 2023! Congratulations to Bing-Quan, Song-Tao and Fei-Xiang!

[2022-06]: Championship again in the AVA Active Speaker Challenge @ CVPR 2022! More details can be found here. Congratulations to Yuan-Hang and Su-San!

[2022-03]: 1 paper is accepted by ICPR 2022! Congratulations to Da-Lu!

[2021-07]: 1 paper is accepted by ICME Workshop 2021! Congratulations to Da-Lu!

[2021-07]: 1 paper is accepted by ACM MM 2021! Congratulations to Yuan-Hang and Su-San!

[2021-06]: Champion of the AVA Active Speaker Challenge @ CVPR 2021! More details can be found here. Congratulations to Yuan-Hang and Su-San!

Datasets

CAS-VSR-MOV20: A dataset for VSR in HARD practical conditions, MAVSR-2025@FG

This is a Mandarin audio-visual speech analysis dataset for exploring the practical performance of existing VSR models in hard cases, including diverse lighting, blur, and pose conditions.

CAS-VSR-S101: A dataset for sentence-level audio visual speech analysis, CVPR 2024

This is a Mandarin audio-visual speech analysis dataset covering almost all common Chinese characters, with a number of speakers recorded in diverse visual settings.

CAS-VSR-S68: A dataset for lip reading with unseen speakers, BMVC 2023

This lip reading dataset is designed for evaluation of speaker-adaptive/speaker-aware VSR in an extreme setting where the speech content is highly diverse (involving almost all common Chinese characters) while the number of speakers is limited.

CAS-VSR-W1k (LRW-1000): A naturally-distributed large-scale lip reading benchmark, FG 2019

The largest Mandarin word-level audio-visual speech recognition dataset, covering all the pronunciations of Chinese characters and most of the common Chinese characters.

Challenges

2025 - The 2nd Mandarin Audio-Visual Speech Recognition Challenge (MAVSR) @ IEEE FG

Welcome to the competition!

2022 World Robot Contest - Co-Robot Challenge - Speech Recognition Technology Competition

  • Homepage: here
  • Date: 2022/06-2022/12
  • Registration is welcome!

2019 - The 1st Mandarin Audio-Visual Speech Recognition Challenge (MAVSR) @ ACM ICMI

This challenge aims at exploring the complementarity between visual and acoustic information in real-world speech recognition systems.

Publications

  • Tianyue Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition", BMVC 2025, (Oral).

  • Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "CogCM: Cognition-Inspired Contextual Modeling for Audio Visual Speech Enhancement", ICCV 2025. [Project Page]

  • Zhaoxin Yuan, Shuang Yang, Shiguang Shan, Xilin Chen, "Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information", ICCV 2025.

  • Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Dynamic Visual Speaking Patterns: You Are the Way You Speak", FG 2025.

  • Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen, "ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations", CVPR 2024.

  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Audio-guided self-supervised learning for disentangled visual speech representations", Frontiers of Computer Science, 2024.

  • Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues", BMVC 2023.

  • Bingquan Xia, Shuang Yang, Shiguang Shan, Xilin Chen. "UniLip: Learning Visual-Textual Mapping with Uni-Modal Data for Lip Reading". BMVC 2023.

  • Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen. "Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading", BMVC 2023. [PDF] | [Dataset] | [code]

  • Yuanhang Zhang, Susan Liang, Shuang Yang, Shiguang Shan, "UniCon+: ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2022", The ActivityNet Large-Scale Activity Recognition Challenge at CVPR 2022 (1st Place).

  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Audio-Driven Deformation Flow for Effective Lip Reading", ICPR 2022.

  • Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, "ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2021", The ActivityNet Large-Scale Activity Recognition Challenge at CVPR 2021 (1st Place). [PDF]

  • Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, Xilin Chen, "UniCon: Unified Context Network for Robust Active Speaker Detection", ACM MM 2021 (Oral). [Website] | [PDF]

  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Learn an Effective Lip Reading Model without Pains", ICME Workshop 2021.
    [PDF] | [code]

  • Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Synchronous Bidirectional Learning for Multilingual Lip Reading", BMVC 2020.
    [PDF] | [code]

  • Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen, "Deformation Flow Based Two-Stream Network for Lip Reading", FG 2020.
    [PDF] | [code]

  • Xing Zhao, Shuang Yang, Shiguang Shan, Xilin Chen, "Mutual Information Maximization for Effective Lipreading", FG 2020.
    [PDF] | [code]

  • Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen, "Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition", FG 2020 (Oral).
    [PDF] | [code]

  • Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading", FG 2020.
    [PDF]

  • Yuanhang Zhang, Jingyun Xiao, Shuang Yang, Shiguang Shan, "Multi-Task Learning for Audio-Visual Active Speaker Detection", CVPR ActivityNet Challenge 2019.
    [PDF]

  • Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, and Xilin Chen, "LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild", FG 2019.
    [PDF] | [Dataset] | [Code@fengdalu] | [Code@NirHeaven]