Spatiotemporal Representation Enhanced ViT for Video Recognition

Min Li, Fengfa Li, Bo Meng, Ruwen Bai, Junxing Ren, Zihao Huang, Chenghua Gao. Spatiotemporal Representation Enhanced ViT for Video Recognition. In Stevan Rudinac, Alan Hanjalic, Cynthia C. S. Liem, Marcel Worring, Björn Þór Jónsson, Bei Liu, Yoko Yamakata, editors, MultiMedia Modeling - 30th International Conference, MMM 2024, Amsterdam, The Netherlands, January 29 - February 2, 2024, Proceedings, Part I. Volume 14554 of Lecture Notes in Computer Science, pages 28-40, Springer, 2024.

Abstract

Abstract is missing.