CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian J. McAuley. CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023, New Paltz, NY, USA, October 22-25, 2023. pages 1-5, IEEE, 2023.

Authors

Hao-Wen Dong

Xiaoyu Liu

Jordi Pons

Gautam Bhattacharya

Santiago Pascual

Joan Serrà

Taylor Berg-Kirkpatrick

Julian J. McAuley
