This is a collection of works from the Audio-Visual Speech Understanding Group at VIPL.
[2025-07]: 1 paper is accepted by BMVC 2025! Congratulations to Tian-Yue!
[2025-06]: 2 papers are accepted by IEEE ICCV 2025! Congratulations to Fei-Xiang and Zhao-Xin!
[2025-05]: 1 paper is accepted by IEEE FG 2025! Congratulations to Song-Tao!
[2024-12]: The MAVSR-2025 Challenge @ IEEE FG 2025 has started! Welcome to the competition!
[2024-06]: Championship in the open track of the AVSE Challenge @ InterSpeech 2024! Congratulations to Fei-Xiang!
[2024-02]: 1 paper is accepted by CVPR 2024! Congratulations to Yuan-Hang!
[2023-08]: 3 papers are accepted by BMVC 2023! Congratulations to Bing-Quan, Song-Tao and Fei-Xiang!
[2022-06]: Won the AVA Active Speaker Challenge @ CVPR 2022 again! More details can be found here. Congratulations to Yuan-Hang and Su-San!
[2022-03]: 1 paper is accepted by ICPR 2022! Congratulations to Da-Lu!
[2021-07]: 1 paper is accepted by ICME Workshop 2021! Congratulations to Da-Lu!
[2021-07]: 1 paper is accepted by ACM MM 2021! Congratulations to Yuan-Hang and Su-San!
[2021-06]: Champion of the AVA Active Speaker Challenge @ CVPR 2021! More details can be found here. Congratulations to Yuan-Hang and Su-San!
This is a Mandarin audio-visual speech analysis dataset for evaluating the practical performance of existing VSR models in hard cases, including diverse lighting, blur, and pose conditions.
This is a Mandarin audio-visual speech analysis dataset covering almost all common Chinese characters, with a number of speakers recorded in diverse visual settings.
- Dataset Link: https://github.com/VIPL-Audio-Visual-Speech-Understanding/CAS-VSR-S101
- Paper Link: CAS-VSR-S101 paper
This lip reading dataset is designed for evaluation of speaker-adaptive/speaker-aware VSR in an extreme setting where the speech content is highly diverse (involving almost all common Chinese characters) while the number of speakers is limited.
- Dataset Link: https://github.com/jinchiniao/CAS-VSR-S68
- Paper Link: https://arxiv.org/abs/2310.05058
The largest Mandarin word-level audio-visual speech recognition dataset, covering all pronunciations of Chinese characters and the most common Chinese characters.
- Dataset Link: https://vipl.ict.ac.cn/resources/databases/201810/t20181017_32714.html
- Agreement: link1 or link2
- Code: DenseNet3D @fengdalu @NirHeaven
- SOTA Accuracies: https://paperswithcode.com/sota/lipreading-on-lrw-1000
Welcome to the competition!
- Introduction @IEEE FG Website: https://fg2025.ieee-biometrics.org/participate/competitions/
- Homepage: here
- Date: 2024/12 - 2025/05
- Homepage: here
- Date: 2022/06 - 2022/12
- Welcome to sign up!
This challenge aims at exploring the complementarity between visual and acoustic information in real-world speech recognition systems.
- Introduction @ICMI Website: https://icmi.acm.org/2019/index.php?id=challenges#speech
- Homepage: here
- Date: 2019/04 - 2019/08
- Tianyue Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition", BMVC 2025 (Oral).
- Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "CogCM: Cognition-Inspired Contextual Modeling for Audio Visual Speech Enhancement", ICCV 2025. [Project Page]
- Zhaoxin Yuan, Shuang Yang, Shiguang Shan, Xilin Chen, "Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information", ICCV 2025.
- Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Dynamic Visual Speaking Patterns: You Are the Way You Speak", FG 2025.
- Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen, "ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations", CVPR 2024.
- Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Audio-guided self-supervised learning for disentangled visual speech representations", Frontiers of Computer Science, 2024.
- Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues", BMVC 2023.
- Bingquan Xia, Shuang Yang, Shiguang Shan, Xilin Chen, "UniLip: Learning Visual-Textual Mapping with Uni-Modal Data for Lip Reading", BMVC 2023.
- Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading", BMVC 2023. [PDF] | [Dataset] | [code]
- Yuanhang Zhang, Susan Liang, Shuang Yang, Shiguang Shan, "UniCon+: ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2022", The ActivityNet Large-Scale Activity Recognition Challenge at CVPR 2022 (1st Place).
- Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Audio-Driven Deformation Flow for Effective Lip Reading", ICPR 2022.
- Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, "ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2021", The ActivityNet Large-Scale Activity Recognition Challenge at CVPR 2021 (1st Place). [PDF]
- Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, Xilin Chen, "UniCon: Unified Context Network for Robust Active Speaker Detection", ACM MM 2021 (Oral). [Website] | [PDF]
- Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Learn an Effective Lip Reading Model without Pains", ICME Workshop 2021. [PDF] | [code]
- Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Synchronous Bidirectional Learning for Multilingual Lip Reading", BMVC 2020. [PDF] | [code]
- Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen, "Deformation Flow Based Two-Stream Network for Lip Reading", FG 2020. [PDF] | [code]
- Xing Zhao, Shuang Yang, Shiguang Shan, Xilin Chen, "Mutual Information Maximization for Effective Lipreading", FG 2020. [PDF] | [code]
- Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen, "Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition", FG 2020 (Oral). [PDF] | [code]
- Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading", FG 2020. [PDF]
- Yuanhang Zhang, Jingyun Xiao, Shuang Yang, Shiguang Shan, "Multi-Task Learning for Audio-Visual Active Speaker Detection", CVPR ActivityNet Challenge 2019. [PDF]
- Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, Xilin Chen, "LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild", FG 2019. [PDF] | [Dataset] | Code@fengdalu | Code@NirHeaven