Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams

Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren. Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams. In Hynek Hermansky, Honza Cernocký, Lukás Burget, Lori Lamel, Odette Scharenborg, Petr Motlícek, editors, Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021. Pages 321-325, ISCA, 2021.

Authors

Yuanbo Hou

Zhesong Yu

Xia Liang

Xingjian Du

Bilei Zhu

Zejun Ma

Dick Botteldooren
