Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, Lidong Bing. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Yansong Feng, Els Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, pages 543-553. Association for Computational Linguistics, 2023.

@inproceedings{ZhangLB23-2,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Hang Zhang and Xin Li and Lidong Bing},
  year = {2023},
  url = {https://aclanthology.org/2023.emnlp-demo.49},
  researchr = {https://researchr.org/publication/ZhangLB23-2},
  pages = {543-553},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023},
  editor = {Yansong Feng and Els Lefever},
  publisher = {Association for Computational Linguistics},
}