A Multi-Modal Fusion Approach for Audio-Visual Scene Classification Enhanced by CLIP Variants

Soichiro Okazaki, Quan Kong, Tomoaki Yoshinaga. A Multi-Modal Fusion Approach for Audio-Visual Scene Classification Enhanced by CLIP Variants. In Frederic Font, Annamaria Mesaros, Daniel P. W. Ellis, Eduardo Fonseca, Magdalena Fuentes, Benjamin Elizalde, editors, Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events 2021 (DCASE 2021), Online, November 15-19, 2021. pages 95-99, 2021.
