Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan. Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
