Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 5971-5984. Association for Computational Linguistics, 2024.

Authors

Bin Lin

Yang Ye

Bin Zhu

Jiaxi Cui

Munan Ning

Peng Jin

Li Yuan

This author has not been identified. Look up 'Li Yuan 0001' in Google