Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg. Show and Speak: Directly Synthesize Spoken Description of Images. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. Pages 4190-4194, IEEE, 2021.