Visionary: vision-aware enhancement with reminding scenes generated by captions via multimodal transformer for embodied referring expression

Zhengwu Yuan, Peixian Tang, Xinguang Sang, Fan Zhang, Zheqi Zhang. Visionary: vision-aware enhancement with reminding scenes generated by captions via multimodal transformer for embodied referring expression. The Visual Computer, 41(3):1673-1688, February 2025.
