VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg. VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Luis Chiruzzo, Alan Ritter, Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, pages 5787-5802. Association for Computational Linguistics, 2025.
