VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg. VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Luis Chiruzzo, Alan Ritter, Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 - May 4, 2025. Pages 5787-5802. Association for Computational Linguistics, 2025.

Authors

Yifan Peng

Krishna C. Puvvada

Zhehuai Chen

Piotr Zelasko

He Huang

Kunal Dhawan

Ke Hu

Shinji Watanabe

Jagadeesh Balam

Boris Ginsburg
