MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Difei Gao, Luowei Zhou, Lei Ji 0001, Linchao Zhu, Yi Yang, Mike Zheng Shou. MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 14773-14783. IEEE, 2023.

Authors

Difei Gao

Luowei Zhou

Lei Ji 0001

Linchao Zhu

Yi Yang

Mike Zheng Shou
