VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 24206-24221, 2021.

Authors

Hassan Akbari

Liangzhe Yuan

Rui Qian

Wei-Hong Chuang

Shih-Fu Chang

Yin Cui

Boqing Gong