Zekang Li, Zongjia Li, Jinchao Zhang, Yang Feng, Jie Zhou. Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog. IEEE Transactions on Audio, Speech & Language Processing, 29:2476-2483, 2021.