- Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing. Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie L. Hyland, Maria Wetscherek, Tristan Naumann, Aditya V. Nori, Javier Alvarez-Valle, Hoifung Poon, Ozan Oktay. 1-21
- Generative Negative Text Replay for Continual Vision-Language Pretraining. Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, Xuming He. 22-38
- Video Graph Transformer for Video Question Answering. Junbin Xiao, Pan Zhou, Tat-Seng Chua, Shuicheng Yan. 39-58
- Trace Controlled Text to Image Generation. Kun Yan, Lei Ji, Chenfei Wu, Jianmin Bao, Ming Zhou, Nan Duan, Shuai Ma. 59-75
- Video Question Answering with Iterative Video-Text Co-tokenization. A. J. Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova. 76-94
- Rethinking Data Augmentation for Robust Visual Question Answering. Long Chen, Yuhang Zheng, Jun Xiao. 95-112
- Explicit Image Caption Editing. Zhen Wang, Long Chen, Wenbo Ma, Guangxing Han, Yulei Niu, Jian Shao, Jun Xiao. 113-129
- Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding. Jiachang Hao, Haifeng Sun, Pengfei Ren, Jingyu Wang, Qi Qi, Jianxin Liao. 130-147
- Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly. Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach. 148-166
- GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features. Van Quang Nguyen, Masanori Suganuma, Takayuki Okatani. 167-184
- Selective Query-Guided Debiasing for Video Corpus Moment Retrieval. Sunjae Yoon, Ji Woo Hong, Eunseop Yoon, DahYun Kim, Junyeong Kim, Hee Suk Yoon, Chang D. Yoo. 185-200
- Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embodied Reference Understanding. Cheng Shi, Sibei Yang. 201-218
- Object-Centric Unsupervised Image Captioning. Zihang Meng, David Yang, Xuefei Cao, Ashish Shah, Ser-Nam Lim. 219-235
- Contrastive Vision-Language Pre-training with Limited Resources. Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie, Yubo Chen. 236-253
- Learning Linguistic Association Towards Efficient Text-Video Retrieval. Sheng Fang, Shuhui Wang, Junbao Zhuo, Xinzhe Han, Qingming Huang. 254-270
- ASSISTER: Assistive Navigation via Conditional Instruction Generation. Zanming Huang, Zhongkai Shangguan, Jimuyang Zhang, Gilad Bar, Matthew Boyd, Eshed Ohn-Bar. 271-289
- X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks. Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, Stefano Soatto. 290-308
- Learning Disentanglement with Decoupled Labels for Vision-Language Navigation. Wenhao Cheng, Xingping Dong, Salman H. Khan, Jianbing Shen. 309-329
- Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input. Qingpei Guo, Kaisheng Yao, Wei Chu. 330-346
- Word-Level Fine-Grained Story Visualization. Bowen Li. 347-362
- Unifying Event Detection and Captioning as Sequence Generation via Pre-training. Qi Zhang, Yuqing Song, Qin Jin. 363-379
- Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation. Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan. 380-397
- Fine-Grained Visual Entailment. Christopher Thomas, Yipeng Zhang, Shih-Fu Chang. 398-416
- Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds. Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki. 417-433
- New Datasets and Models for Contextual Reasoning in Visual Dialog. Yifeng Zhang, Ming Jiang, Qi Zhao. 434-451
- VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection. Joanna Hong, Minsu Kim, Yong Man Ro. 452-468
- Classification-Regression for Chart Comprehension. Matan Levy, Rami Ben-Ari, Dani Lischinski. 469-484
- AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant. Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou. 485-501
- FindIt: Generalized Localization with Natural Language Queries. Weicheng Kuo, Fred Bertsch, Wei Li, A. J. Piergiovanni, Mohammad Saffar, Anelia Angelova. 502-520
- UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling. Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang. 521-539
- Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin. 540-557
- The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning. Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi. 558-575
- Speaker-Adaptive Lip Reading with User-Dependent Padding. Minsu Kim, Hyunjun Kim, Yong Man Ro. 576-593
- TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation. Tan M. Dinh, Rang Nguyen, Binh-Son Hua. 594-609
- SemAug: Semantically Meaningful Image Augmentations for Object Detection Through Language Grounding. Morgan Heisler, Amin Banitalebi-Dehkordi, Yong Zhang. 610-626
- Referring Object Manipulation of Natural Images with Conditional Classifier-Free Guidance. Myungsub Choi. 627-643
- NewsStories: Illustrating Articles with Visual Summaries. Reuben Tan, Bryan A. Plummer, Kate Saenko, J. P. Lewis, Avneesh Sud, Thomas Leung. 644-661
- Webly Supervised Concept Expansion for General Purpose Vision Models. Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi. 662-681
- FedVLN: Privacy-Preserving Federated Vision-and-Language Navigation. Kaiwen Zhou, Xin Eric Wang. 682-699
- CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval. Haoran Wang, Dongliang He, Wenhao Wu, Boyang Xia, Min Yang, Fu Li, Yunlong Yu, Zhong Ji, Errui Ding, Jingdong Wang. 700-716
- Language-Driven Artistic Style Transfer. Tsu-Jui Fu, Xin Eric Wang, William Yang Wang. 717-734
- Single-Stream Multi-level Alignment for Vision-Language Pretraining. Zaid Khan, B. G. Vijay Kumar, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu. 735-751