Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations

Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li. Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations. In IEEE Spoken Language Technology Workshop, SLT 2021, Shenzhen, China, January 19-22, 2021. pages 621-628, IEEE, 2021.
