Yolo Yunlong Tang, Jing Bi, Chao Huang, Susan Liang, Daiki Shimada, Hang Hua, Yunzhong Xiao, Yizhi Song, Pinxin Liu, Mingqian Feng, Junjia Guo, Zhuo Liu, Luchuan Song, Ali Vosoughi, Jinxi He, Liu He, Zeliang Zhang, Jiebo Luo, Chenliang Xu. Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting. In Sven Koenig, Chad Jenkins, Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2026, Singapore, January 20-27, 2026, pages 41697-41699. AAAI Press, 2026.