Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams

Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren. Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams. In Hynek Hermansky, Honza Cernocký, Lukás Burget, Lori Lamel, Odette Scharenborg, Petr Motlícek, editors, Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021. pages 321-325, ISCA, 2021. doi:10.21437/Interspeech.2021-37

@inproceedings{HouYLDZMB21,
  title = {Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams},
  author = {Yuanbo Hou and Zhesong Yu and Xia Liang and Xingjian Du and Bilei Zhu and Zejun Ma and Dick Botteldooren},
  year = {2021},
  doi = {10.21437/Interspeech.2021-37},
  url = {https://doi.org/10.21437/Interspeech.2021-37},
  researchr = {https://researchr.org/publication/HouYLDZMB21},
  cites = {0},
  citedby = {0},
  pages = {321--325},
  booktitle = {Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021},
  editor = {Hynek Hermansky and Honza Cernocký and Lukás Burget and Lori Lamel and Odette Scharenborg and Petr Motlícek},
  publisher = {ISCA},
}