📝 Publications

- AVSET-10M: An Open Large-Scale Audio-Visual Dataset with High Correspondence. Xize Cheng, Ziang Zhang, Zehan Wang, Minghui Fang, Rongjie Huang, Siqi Zheng, Ruofan Hu, Jionghao Bai, Tao Jin, Zhou Zhao. Under Review

- OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup. Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao. ICLR2025

- OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment. Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao. ACL2023 Oral

- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition. Xize Cheng, Tao Jin, Rongjie Huang, Linjun Li, Wang Lin, Zehan Wang, Huadai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao. ICCV2023

- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation. Rongjie Huang*, Xize Cheng*, Huadai Liu*, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao. ACL2023
Full Publication List
[*] denotes co-first authors, [#] denotes co-supervised, [✉] denotes corresponding author.
Spoken Dialogue System & Audio-Visual Speech Understanding
- VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words? Xize Cheng, Ruofan Hu, Xiaoda Yang, Jingyu Lu, Dongjie Fu, Zehan Wang, Shengpeng Ji, Rongjie Huang, Boyang Zhang, Tao Jin, Zhou Zhao. ICLR2025
- WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling. Shengpeng Ji, Ziyue Jiang, Xize Cheng, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Zhou Zhao, et al. ICLR2025
- WavChat: A Survey of Spoken Dialogue Models. Shengpeng Ji, Shujie Liu, Xize Cheng, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao, et al. Survey
- AudioVSR: Enhancing Video Speech Recognition with Audio Data. Xiaoda Yang*#, Xize Cheng*, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin. EMNLP2024
- SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning. Xiaoda Yang*#, Xize Cheng*, Dongjie Fu, Minghui Fang, Jialong Zuo, Shengpeng Ji, Zhou Zhao, Tao Jin. ACMMM2024
- SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing. Lingyu Xiong#, Xize Cheng✉, Jintao Tan#, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihui Hu. ACMMM2024
- Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts. Dongjie Fu#, Xize Cheng, Xiaoda Yang#, Hanting Wang, Zhou Zhao, Tao Jin. ACMMM2024 Oral
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation. Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, Zhou Zhao. ACL2024
- Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation. Songju Lei#, Xize Cheng✉, Mengjiao Lyu, Jianqiao Hu, Jintao Tan#, Runlin Liu, Lingyu Xiong#, Tao Jin, Xiandong Li, Zhou Zhao. ACL2024
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition. Xize Cheng, Tao Jin, Rongjie Huang, Linjun Li, Wang Lin, Zehan Wang, Huadai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao. ICCV2023
- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation. Rongjie Huang*, Xize Cheng*, Huadai Liu*, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao. ACL2023
- Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation. Linjun Li*, Tao Jin*, Xize Cheng*, Ye Wang, Wang Lin, Rongjie Huang, Zhou Zhao. ACL2023
Multi-Modal Alignment
- AVSET-10M: An Open Large-Scale Audio-Visual Dataset with High Correspondence. Xize Cheng, Ziang Zhang, Zehan Wang, Minghui Fang, Rongjie Huang, Siqi Zheng, Ruofan Hu, Jionghao Bai, Tao Jin, Zhou Zhao. Under Review
- OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup. Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao. ICLR2025
- OmniBind: Large-Scale Omni Multimodal Representation via Binding Spaces. Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao. ICML2024
- Connecting Multi-modal Contrastive Representations. Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao. NeurIPS2023
- Rethinking Missing Modality Learning from a Decoding Perspective. Tao Jin*, Xize Cheng*, Linjun Li, Wang Lin, Ye Wang, Zhou Zhao. ACMMM2023
- Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning. Ye Wang, Wang Lin, Shengyu Zhang, Tao Jin, Linjun Li, Xize Cheng, Zhou Zhao. ACL2023 Oral
- OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment. Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao. ACL2023 Oral