Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani. Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 7005-7015. Computer Vision Foundation / IEEE, 2021.