- Efficient One-Stage Video Object Detection by Exploiting Temporal Consistency. Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson. 1-16
- Leveraging Action Affinity and Continuity for Semi-supervised Temporal Action Segmentation. Guodong Ding, Angela Yao. 17-32
- Spotting Temporally Precise, Fine-Grained Events in Video. James Hong, Haotian Zhang, Michaël Gharbi, Matthew Fisher, Kayvon Fatahalian. 33-51
- Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation. Nadine Behrmann, S. Alireza Golestaneh, Zico Kolter, Jürgen Gall, Mehdi Noroozi. 52-68
- Efficient Video Transformers with Spatial-Temporal Token Selection. Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang. 69-86
- Long Movie Clip Classification with State-Space Video Models. Md Mohaiminul Islam, Gedas Bertasius. 87-104
- Prompting Visual-Language Models for Efficient Video Understanding. Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie. 105-124
- Asymmetric Relation Consistency Reasoning for Video Relation Grounding. Huan Li, Ping Wei, Jiapeng Li, Zeyu Ma, Jiahui Shang, Nanning Zheng. 125-141
- Self-supervised Social Relation Representation for Human Group Detection. Jiacheng Li, Ruize Han, Haomin Yan, Zekun Qian, Wei Feng, Song Wang. 142-159
- K-centered Patch Sampling for Efficient Video Recognition. Seong Hyeon Park, Jihoon Tack, Byeongho Heo, Jung-Woo Ha, Jinwoo Shin. 160-176
- A Deep Moving-Camera Background Model. Guy Erez, Ron Shapira Weber, Oren Freifeld. 177-194
- GraphVid: It only Takes a Few Nodes to Understand a Video. Eitan Kosman, Dotan Di Castro. 195-212
- Delta Distillation for Efficient Video Processing. AmirHossein Habibian, Haitam Ben Yahia, Davide Abati, Efstratios Gavves, Fatih Porikli. 213-229
- MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning. David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou. 230-248
- COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality. Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, Hans Peter Graf. 249-266
- E-NeRV: Expedite Neural Video Representation with Disentangled Spatial-Temporal Context. Zizhang Li, Mengmeng Wang, Huaijin Pi, Kechun Xu, Jianbiao Mei, Yong Liu. 267-284
- TDViT: Temporal Dilated Video Transformer for Dense Video Tasks. Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson. 285-301
- Semi-supervised Learning of Optical Flow by Flow Supervisor. Woobin Im, Sebin Lee, Sung-Eui Yoon. 302-318
- Flow Graph to Video Grounding for Weakly-Supervised Multi-step Localization. Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, Allan D. Jepson. 319-335
- Deep 360° Optical Flow Estimation Based on Multi-projection Fusion. Yiheng Li, Connelly Barnes, Kun Huang, Fang-Lue Zhang. 336-352
- MaCLR: Motion-Aware Contrastive Learning of Representations for Videos. Fanyi Xiao, Joseph Tighe, Davide Modolo. 353-370
- Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection. Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar. 371-387
- Frozen CLIP Models are Efficient Video Learners. Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li. 388-404
- PIP: Physical Interaction Prediction via Mental Simulation with Span Selection. Jiafei Duan, Samson Yu, Soujanya Poria, Bihan Wen, Cheston Tan. 405-421
- Panoramic Vision Transformer for Saliency Detection in 360° Videos. Heeseung Yun, Sehun Lee, Gunhee Kim. 422-439
- Bayesian Tracking of Video Graphs Using Joint Kalman Smoothing and Registration. Aditi Basu Bal, Ramy Mounir, Sathyanarayanan N. Aakur, Sudeep Sarkar, Anuj Srivastava. 440-456
- Motion Sensitive Contrastive Learning for Self-supervised Video Representation. Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di Huang. 457-474
- Dynamic Temporal Filtering in Video Models. Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei. 475-492
- Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification. Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li. 493-510
- Temporal Lift Pooling for Continuous Sign Language Recognition. Lianyu Hu, Liqing Gao, Zekang Liu, Wei Feng. 511-527
- MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang. 528-545
- SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding. Mengxue Qu, Yu Wu, Wu Liu, Qiqi Gong, Xiaodan Liang, Olga Russakovsky, Yao Zhao, Yunchao Wei. 546-562
- Cross-Modal Prototype Driven Network for Radiology Report Generation. Jun Wang, Abhir Bhalerao, Yulan He. 563-579
- TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts. Chuan Guo, Xinxin Zuo, Sen Wang, Li Cheng. 580-597
- SeqTR: A Simple Yet Universal Network for Visual Grounding. Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, Rongrong Ji. 598-615
- VTC: Improving Video-Text Retrieval with User Comments. Laura Hanu, James Thewlis, Yuki M. Asano, Christian Rupprecht. 616-633
- FashionViL: Fashion-Focused Vision-and-Language Representation Learning. Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang. 634-651
- Weakly Supervised Grounding for VQA in Vision-Language Transformers. Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels da Vitoria Lobo, Mubarak Shah. 652-670
- Automatic Dense Annotation of Large-Vocabulary Sign Language Videos. Liliane Momeni, Hannah Bull, K. R. Prajwal, Samuel Albanie, Gül Varol, Andrew Zisserman. 671-690
- MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval. Yuying Ge, Yixiao Ge, Xihui Liu, Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo. 691-708
- GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval. Yuxuan Wang, Difei Gao, Licheng Yu, Weixian Lei, Matt Feiszli, Mike Zheng Shou. 709-725
- A Simple and Robust Correlation Filtering Method for Text-Based Person Search. Wei Suo, Mengyang Sun, Kai Niu, Yiqi Gao, Peng Wang, Yanning Zhang, Qi Wu. 726-742