CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan SerrĂ , Taylor Berg-Kirkpatrick, Julian J. McAuley. CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023, New Paltz, NY, USA, October 22-25, 2023. pages 1-5, IEEE, 2023. [doi]

Abstract

Abstract is missing.