Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, Lidong Bing. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Yansong Feng and Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, pages 543-553. Association for Computational Linguistics, 2023.

Authors

Hang Zhang

Xin Li

Lidong Bing
