Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

Zekang Li, Zongjia Li, Jinchao Zhang, Yang Feng, Jie Zhou. Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog. IEEE Transactions on Audio, Speech & Language Processing, 29:2476-2483, 2021.