Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations

Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li. Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations. In IEEE Spoken Language Technology Workshop, SLT 2021, Shenzhen, China, January 19-22, 2021. Pages 621-628, IEEE, 2021. doi:10.1109/SLT48900.2021.9383466

@inproceedings{GhorbaniGSL21,
  title = {Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations},
  author = {Shahram Ghorbani and Yashesh Gaur and Yu Shi and Jinyu Li},
  year = {2021},
  doi = {10.1109/SLT48900.2021.9383466},
  url = {https://doi.org/10.1109/SLT48900.2021.9383466},
  researchr = {https://researchr.org/publication/GhorbaniGSL21},
  cites = {0},
  citedby = {0},
  pages = {621--628},
  booktitle = {IEEE Spoken Language Technology Workshop, SLT 2021, Shenzhen, China, January 19-22, 2021},
  publisher = {IEEE},
  isbn = {978-1-7281-7066-4},
}