What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogério Feris, James R. Glass, Hilde Kuehne. What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 18419-18429. IEEE, 2024.
