
Works Collection of VIPL-AVSU-Group

This is a collection of works from the Audio-Visual Speech Understanding Group at VIPL.

Recent News:

[2025-07]: 1 paper is accepted by BMVC 2025! Congratulations to Tian-Yue!

[2025-06]: 2 papers are accepted by IEEE ICCV 2025! Congratulations to Fei-Xiang and Zhao-Xin!

[2025-05]: 1 paper is accepted by IEEE FG 2025! Congratulations to Song-Tao!

[2024-12]: The MAVSR-2025 Challenge @ IEEE FG 2025 has started! Welcome to the competition!

[2024-06]: Championship in the open track of the AVSE Challenge @ InterSpeech 2024! Congratulations to Fei-Xiang!

[2024-02]: 1 paper is accepted by CVPR 2024! Congratulations to Yuan-Hang!

[2023-08]: 3 papers are accepted by BMVC 2023! Congratulations to Bing-Quan, Song-Tao and Fei-Xiang!

[2022-06]: Championship again in the AVA Active Speaker Challenge @ CVPR 2022! More details can be found here. Congratulations to Yuan-Hang and Su-San!

[2022-03]: 1 paper is accepted by ICPR 2022! Congratulations to Da-Lu!

[2021-07]: 1 paper is accepted by ICME Workshop 2021! Congratulations to Da-Lu!

[2021-07]: 1 paper is accepted by ACM MM 2021! Congratulations to Yuan-Hang and Su-San!

[2021-06]: Champion of the AVA Active Speaker Challenge @ CVPR 2021! More details can be found here. Congratulations to Yuan-Hang and Su-San!

Datasets

CAS-VSR-MOV20: A dataset for VSR in HARD practical conditions, MAVSR-2025@FG

This is a Mandarin audio-visual speech analysis dataset for exploring the practical performance of existing VSR models in hard cases, including diverse lighting, blur, and pose conditions.

CAS-VSR-S101: A dataset for sentence-level audio visual speech analysis, CVPR 2024

This is a Mandarin audio-visual speech analysis dataset covering almost all common Chinese characters, with a number of speakers recorded in diverse visual settings.

CAS-VSR-S68: A dataset for lip reading with unseen speakers, BMVC 2023

This lip reading dataset is designed for evaluation of speaker-adaptive/speaker-aware VSR in an extreme setting where the speech content is highly diverse (involving almost all common Chinese characters) while the number of speakers is limited.

CAS-VSR-W1k (LRW-1000): A naturally-distributed large-scale lip reading benchmark, FG 2019

The largest Mandarin word-level audio-visual speech recognition dataset, covering all the pronunciations of Chinese characters and most of the common Chinese characters.

Challenges

2025 - The 2nd Mandarin Audio-Visual Speech Recognition Challenge (MAVSR) @ IEEE FG

Welcome to the competition!

2022 World Robot Contest - Co-Robot Challenge - Speech Recognition Technology Competition

  • Homepage: here
  • Date: 2022/06-2022/12
  • Registration is welcome!

2019 - The 1st Mandarin Audio-Visual Speech Recognition Challenge (MAVSR) @ ACM ICMI

This challenge aims at exploring the complementarity between visual and acoustic information in real-world speech recognition systems.

Publications

  • Tianyue Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition", BMVC 2025, (Oral).

  • Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "CogCM: Cognition-Inspired Contextual Modeling for Audio Visual Speech Enhancement", ICCV 2025. [Project Page]

  • Zhaoxin Yuan, Shuang Yang, Shiguang Shan, Xilin Chen, "Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information", ICCV 2025.

  • Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Dynamic Visual Speaking Patterns: You Are the Way You Speak", FG 2025.

  • Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen, "ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations", CVPR 2024.

  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Audio-guided self-supervised learning for disentangled visual speech representations", Frontiers of Computer Science, 2024.

  • Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen, "Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues", BMVC 2023.

  • Bingquan Xia, Shuang Yang, Shiguang Shan, Xilin Chen. "UniLip: Learning Visual-Textual Mapping with Uni-Modal Data for Lip Reading". BMVC 2023.

  • Songtao Luo, Shuang Yang, Shiguang Shan, Xilin Chen. "Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading", BMVC 2023. [PDF] | [Dataset] | [code]

  • Yuanhang Zhang, Susan Liang, Shuang Yang, Shiguang Shan, "UniCon+: ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2022", The ActivityNet Large-Scale Activity Recognition Challenge at CVPR 2022 (1st Place).

  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Audio-Driven Deformation Flow for Effective Lip Reading", ICPR 2022.

  • Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, "ICTCAS-UCAS-TAL Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2021", The ActivityNet Large-Scale Activity Recognition Challenge at CVPR 2021 (1st Place). [PDF]

  • Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, Xilin Chen, "UniCon: Unified Context Network for Robust Active Speaker Detection", ACM MM 2021 (Oral). [Website] | [PDF]

  • Dalu Feng, Shuang Yang, Shiguang Shan, Xilin Chen, "Learn an Effective Lip Reading Model without Pains", ICME Workshop 2021.
    [PDF] | [code]

  • Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Synchronous Bidirectional Learning for Multilingual Lip Reading", BMVC 2020.
    [PDF] | [code]

  • Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen, "Deformation Flow Based Two-Stream Network for Lip Reading", FG 2020.
    [PDF] | [code]

  • Xing Zhao, Shuang Yang, Shiguang Shan, Xilin Chen, "Mutual Information Maximization for Effective Lipreading", FG 2020.
    [PDF] | [code]

  • Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen, "Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition", FG 2020 (Oral).
    [PDF] | [code]

  • Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen, "Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading", FG 2020.
    [PDF]

  • Yuanhang Zhang, Jingyun Xiao, Shuang Yang, Shiguang Shan, "Multi-Task Learning for Audio-Visual Active Speaker Detection", CVPR ActivityNet Challenge 2019.
    [PDF]

  • Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, and Xilin Chen, "LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild", FG 2019.
    [PDF] | [Dataset] | [Code@fengdalu] | [Code@NirHeaven]