- PromptSpeaker: Speaker Generation Based on Text Descriptions. Yongmao Zhang, Guanghou Liu, Yi Lei, Yunlin Chen, Hao Yin, Lei Xie 0001, Zhifei Li. 1-7
- Enhancing Task-Oriented Dialogues With Chitchat: A Comparative Study Based on Lexical Diversity And Divergence. Armand Stricker, Patrick Paroubek. 1-8
- AWMC: Online Test-Time Adaptation Without Mode Collapse for Continual Adaptation. Jae Hong Lee, Do-Hee Kim, Joon-Hyuk Chang. 1-8
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments. Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Jinyu Li 0001, Yashesh Gaur. 1-8
- CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-Based Speech Recognition. Junfeng Hou, Peiyao Wang, JinCheng Zhang, Meng Yang, Minwei Feng, Jingcheng Yin. 1-5
- VITS-Based Singing Voice Conversion System with DSPGAN Post-Processing for SVCC2023. Yiquan Zhou, Meng Chen, Yi Lei, Jihua Zhu, Weifeng Zhao. 1-8
- MiniSUPERB: Lightweight Benchmark for Self-Supervised Speech Models. Yu-Hsiang Wang, Huang-Yu Chen, Kai-Wei Chang, Winston H. Hsu, Hung-yi Lee. 1-8
- Few-Shot Spoken Language Understanding Via Joint Speech-Text Models. Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu. 1-8
- Detection of Vowel Errors in Children's Speech using Synthetic Phonetic Transcripts. Ilja Baumann, Dominik Wagner, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet. 1-8
- Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding. Pavel Denisov, Ngoc Thang Vu. 1-8
- Whisper-SLU: Extending a Pretrained Speech-to-Text Transformer for Low Resource Spoken Language Understanding. Quentin Meeus, Marie-Francine Moens, Hugo Van Hamme. 1-6
- Evaluating Self-Supervised Speech Models on a Taiwanese Hokkien Corpus. Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi, Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang, Iu-Tshian Phoann, Winnie Chang, Chenxuan Cui, Noel Chen, Jiatong Shi. 1-7
- QUICKVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model using ISTFT for Faster Conversion. Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro. 1-7
- Preserving Phonemic Distinctions For Ordinal Regression: A Novel Loss Function For Automatic Pronunciation Assessment. Bi-Cheng Yan, Hsin-Wei Wang, Yi-Cheng Wang, Jiun-Ting Li, Chi-Han Lin, Berlin Chen. 1-7
- LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models. Chi-Chang Lee, Hong-Wei Chen, Chu-Song Chen, Hsin-Min Wang, Tsung-Te Liu, Yu Tsao 0001. 1-8
- WaveNeXt: ConvNeXt-Based Fast Neural Vocoder Without ISTFT Layer. Takuma Okamoto, Haruki Yamashita, Yamato Ohtani, Tomoki Toda, Hisashi Kawai. 1-8
- The Role of Feature Correlation on Quantized Neural Networks. David Qiu, Shaojin Ding, Yanzhang He. 1-7
- Spike-Triggered Contextual Biasing for End-to-End Mandarin Speech Recognition. Kaixun Huang, Ao Zhang, Binbin Zhang, Tianyi Xu, Xingchen Song, Lei Xie 0001. 1-8
- BiSinger: Bilingual Singing Voice Synthesis. Huali Zhou, Yueqian Lin, Yao Shi, Peng Sun, Ming Li. 1-8
- Enhancing Expressivity Transfer in Textless Speech-to-Speech Translation. Jarod Duret, Benjamin O'Brien, Yannick Estève, Titouan Parcollet. 1-8
- Joint Federated Learning and Personalization for on-Device ASR. Junteng Jia, Ke Li, Mani Malek 0001, Kshitiz Malik, Jay Mahadeokar, Ozlem Kalinli, Frank Seide. 1-8
- Mask-Conformer: Augmenting Conformer with Mask-Predict Decoder. Yosuke Higuchi, Andrew Rosenberg, Yuan Wang, Murali Karthick Baskar, Bhuvana Ramabhadran. 1-8
- Transcribing and Aligning Conversational Speech: A Hybrid Pipeline Applied to French Conversations. Hiroyoshi Yamasaki, Jérôme Louradour, Julie Hunter, Laurent Prévot 0001. 1-6
- E3 TTS: Easy End-to-End Diffusion-Based Text To Speech. Yuan Gao, Nobuyuki Morioka, Yu Zhang 0033, Nanxin Chen. 1-8
- Ending the Blind Flight: Analyzing the Impact of Acoustic and Lexical Factors on WAV2VEC 2.0 in Air-Traffic Control. Alexander Blatt, Badr M. Abdullah, Dietrich Klakow. 1-8
- Improving Audiovisual Active Speaker Detection in Egocentric Recordings with the Data-Efficient Image Transformer. Jason Clarke, Yoshihiko Gotoh, Stefan Goetze. 1-8
- FedCPC: An Effective Federated Contrastive Learning Method for Privacy Preserving Early-Stage Alzheimer's Speech Detection. Wenqing Wei, Zhengdong Yang, Yuan Gao, Jiyi Li, Chenhui Chu, Shogo Okada, Sheng Li 0010. 1-6
- Towards General-Purpose Text-Instruction-Guided Voice Conversion. Chun-Yi Kuan, Chen-An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-Yiin Chang, Hung-yi Lee. 1-8
- HEVAL: A New Hybrid Evaluation Metric for Automatic Speech Recognition Tasks. Zitha Sasindran, Harsha Yelchuri, T. V. Prabhakar, Supreeth Rao. 1-7
- Robust End-to-End Diarization with Domain Adaptive Training and Multi-Task Learning. Ivan Fung, Lahiru Samarakoon, Samuel J. Broughton. 1-7
- Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data. Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan S. Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel 0001, Jee-weon Jung, Soumi Maiti, Shinji Watanabe 0001. 1-8
- HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS. Dake Guo, Xinfa Zhu, Liumeng Xue, Tao Li, Yuanjun Lv, Yuepeng Jiang, Lei Xie 0001. 1-7
- SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation. Yuanjun Lv, Jixun Yao, Peikun Chen, Hongbin Zhou, Heng Lu 0004, Lei Xie 0001. 1-8
- NeuralKalman: A Learnable Kalman Filter for Acoustic Echo Cancellation. Yixuan Zhang 0005, Meng Yu, Hao Zhang, Dong Yu, DeLiang Wang. 1-7
- Multi Transcription-Style Speech Transcription Using Attention-Based Encoder-Decoder Model. Yan Huang 0028, Piyush Behre, Guoli Ye, Shawn Chang, Yifan Gong 0001. 1-6
- MASR: Multi-Label Aware Speech Representation. Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy, Min Ma, Shikhar Vashishth. 1-8
- Deep Learning for Joint Acoustic Echo and Acoustic Howling Suppression in Hybrid Meetings. Hao Zhang, Meng Yu, Dong Yu. 1-7
- On Decoder-Only Architecture For Speech-to-Text and Large Language Model Integration. Jian Wu 0027, Yashesh Gaur, Zhuo Chen 0006, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li 0001, Shujie Liu 0001, Bo Ren, Linquan Liu, Yu Wu. 1-8
- Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge. Tanel Alumäe, Jiaming Kong, Daniil Robnikov. 1-7
- BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition. Peikun Chen, Fan Yu, Yuhao Liang, Hongfei Xue, Xucheng Wan, Naijun Zheng, Huan Zhou 0004, Lei Xie 0001. 1-7
- MUST: A Multilingual Student-Teacher Learning Approach for Low-Resource Speech Recognition. Muhammad Umar Farooq, Rehan Ahmad, Thomas Hain. 1-6
- Wiki-En-ASR-Adapt: Large-Scale Synthetic Dataset for English ASR Customization. Alexandra Antonova. 1-8
- Improved Long-Form Speech Recognition By Jointly Modeling The Primary And Non-Primary Speakers. Guru Prakash Arumugam, Shuo-Yiin Chang, Tara N. Sainath, Rohit Prabhavalkar, Quan Wang, Shaan Bijwadia. 1-8
- GPU-Accelerated WFST Beam Search Decoder for CTC-Based Speech Recognition. Daniel Galvez, Tim Kaldewey. 1-7
- Boosting Modality Representation With Pre-Trained Models and Multi-Task Training for Multimodal Sentiment Analysis. Jiarui Hai, Yu-Jeh Liu, Mounya Elhilali. 1-8
- Extending Self-Distilled Self-Supervised Learning For Semi-Supervised Speaker Verification. Jeong Hwan Choi, Jehyun Kyung, Ju-Seok Seong, Ye-Rin Jeoung, Joon-Hyuk Chang. 1-8
- MelHuBERT: A Simplified HuBERT on Mel Spectrograms. Tzu-Quan Lin, Hung-yi Lee, Hao Tang 0002. 1-8
- ESPnet-SUMM: Introducing a Novel Large Dataset, Toolkit, and a Cross-Corpora Evaluation of Speech Summarization Systems. Roshan S. Sharma, William Chen, Takatomo Kano, Ruchira Sharma, Siddhant Arora, Shinji Watanabe 0001, Atsunori Ogawa, Marc Delcroix, Rita Singh, Bhiksha Raj. 1-8
- Towards Robust Packet Loss Concealment System With ASR-Guided Representations. Da-Hee Yang, Joon-Hyuk Chang. 1-8
- Findings of the 2023 ML-SUPERB Challenge: Pre-Training And Evaluation Over More Languages And Beyond. Jiatong Shi, William Chen, Dan Berrebbi, Hsiu-Hsuan Wang, Wei-Ping Huang, En-Pei Hu, Ho-Lam Chuang, Xuankai Chang, Yuxun Tang, Shang-wen Li 0001, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe 0001. 1-8
- The VoiceMOS Challenge 2023: Zero-Shot Subjective Speech Quality Prediction for Multiple Domains. Erica Cooper, Wen-Chin Huang, Yu Tsao 0001, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi. 1-7
- Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction. Zexu Pan, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux. 1-8
- Robust Logarithmic Champernowne Algorithm for Feedback Cancellation in Hearing Aids. Vanitha Devi R, Vasundhara. 1-5
- Summarize While Translating: Universal Model With Parallel Decoding for Summarization and Translation. Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Kohei Matsuura, Takanori Ashihara, William Chen, Shinji Watanabe 0001. 1-8
- A Comparative Study of Voice Conversion Models With Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023. Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda. 1-6
- Knowledge Distillation From Offline to Streaming Transducer: Towards Accurate and Fast Streaming Model by Matching Alignments. Ji-Hwan Mo, Jae-Jin Jeon, Mun-Hak Lee, Joon-Hyuk Chang. 1-7
- Magnitude-and-Phase-Aware Speech Enhancement With Parallel Sequence Modeling. Yuewei Zhang, Huanbin Zou, Jie Zhu. 1-8
- Not All Errors Are Created Equal: Evaluating The Impact of Model and Speaker Factors on ASR Outcomes in Clinical Populations. Daniela A. Wiepert, Rene L. Utianski, Joseph R. Duffy, John L. Stricker, Leland Barnard, Keith A. Josephs, Jennifer L. Whitwell, David T. Jones, Hugo Botha. 1-6
- Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition. Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg. 1-8
- Adversarial Augmentation For Adapter Learning. Jen-Tzung Chien, Wei-Yu Sun. 1-7
- Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing Aided Alignments. Anusha Prakash 0001, Srinivasan Umesh, Hema A. Murthy. 1-8
- On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition. Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter. 1-8
- The Singing Voice Conversion Challenge 2023. Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi, Tomoki Toda. 1-8
- Prompting and Adapter Tuning For Self-Supervised Encoder-Decoder Speech Model. Kai-Wei Chang, Ming-Hsin Chen, Yun-Ping Lin, Jing Neng Hsu, Paul Kuo-Ming Huang, Chien-Yu Huang, Shang-wen Li 0001, Hung-yi Lee. 1-8
- End-to-End Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis. Can Cui, Imran A. Sheikh, Mostafa Sadeghi, Emmanuel Vincent 0001. 1-8
- Using Joint Training Speaker Encoder With Consistency Loss to Achieve Cross-Lingual Voice Conversion and Expressive Voice Conversion. Houjian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro. 1-8
- ED-CEC: Improving Rare Word Recognition Using ASR Postprocessing Based on Error Detection and Context-Aware Error Correction. Jiajun He, Zekun Yang, Tomoki Toda. 1-6
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization. Wei-Ping Huang, Sung-Feng Huang, Hung-yi Lee. 1-8
- Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition. Geoffroy Vanderreydt, Amrutha Prasad, Driss Khalil, Srikanth R. Madikeri, Kris Demuynck, Petr Motlícek. 1-7
- Model-Based Fairness Metric for Speaker Verification. Maliha Jahan, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak, Jesús Villalba 0001. 1-7
- Leveraging the Multilingual Indonesian Ethnic Languages Dataset In Self-Supervised Models for Low-Resource ASR Task. Sakriani Sakti, Benita Angela Titalim. 1-8
- VSANet: Real-Time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention. Yuewei Zhang, Huanbin Zou, Jie Zhu. 1-8
- A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability. Jian Xue, Peidong Wang, Jinyu Li 0001, Eric Sun. 1-7
- Contextual Spelling Correction with Large Language Models. Gan Song, Zelin Wu, Golan Pundak, Angad Chandorkar, Kandarp Joshi, Xavier Velez, Diamantino Caseiro, Ben Haynor, Weiran Wang, Nikhil Siddhartha, Pat Rondon, Khe Chai Sim. 1-8
- Thai-Dialect: Low Resource Thai Dialectal Speech to Text Corpora. Artit Suwanbandit, Jaturong Chitiyaphol, Sutthinan Chuenchom, Kanyarat Kwiecien, Husen Sawal, Ruslan Uthai, Orathai Sangpetch, Ekapol Chuangsuwanich. 1-8
- The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0): A Benchmark for Speaker-Attributed ASR. Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong-Aik Lee, Zhijie Yan, Hui Bu. 1-8
- ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings. Jenthe Thienpondt, Kris Demuynck. 1-8
- AFTER: Active Learning Based Fine-Tuning Framework for Speech Emotion Recognition. Dongyuan Li, Yusong Wang, Kotaro Funakoshi, Manabu Okumura. 1-8
- Improving Severity Preservation of Healthy-to-Pathological Voice Conversion With Global Style Tokens. Bence Mark Halpern, Wen-Chin Huang, Lester Phillip Violeta, R. J. J. H. van Son, Tomoki Toda. 1-7
- Two-Pass Endpoint Detection for Speech Recognition. Anirudh Raju, Aparna Khare, Di He 0004, Ilya Sklyar, Long Chen, Sam Alptekin, Viet Anh Trinh, Zhe Zhang, Colin Vaz, Venkatesh Ravichandran, Roland Maas, Ariya Rastrow. 1-8
- Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition. Yuang Li, Yu Wu 0012, Jinyu Li 0001, Shujie Liu 0001. 1-8
- Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation. Zhaofeng Lin, Tanvina Patel, Odette Scharenborg. 1-8
- SQAT-LD: Speech Quality Assessment Transformer Utilizing Listener Dependent Modeling for Zero-Shot Out-of-Domain MOS Prediction. Kailai Shen, Diqun Yan, Li Dong 0006, Ying Ren, Xiaoxun Wu, Jing Hu. 1-6
- End-To-End Training of a Neural HMM with Label and Transition Probabilities. Daniel Mann, Tina Raissi, Wilfried Michel, Ralf Schlüter, Hermann Ney. 1-8
- Enabling Noisy Label Usage for Out-of-Airspace Data in Read-Back Error Detection. Lakshmi Rajendram Bashyam, Alexander Blatt, Dietrich Klakow. 1-8
- KAQ: A Non-Intrusive Stacking Framework for Mean Opinion Score Prediction with Multi-Task Learning. Chenglin Xu, Xiguang Zheng, Chen Zhang, Chao Zhou, Qi Huang, Bing Yu. 1-8
- Speech Emotion Diarization: Which Emotion Appears When? Yingzhi Wang, Mirco Ravanelli, Alya Yacoubi. 1-7
- FAT-HuBERT: Front-End Adaptive Training of Hidden-Unit BERT For Distortion-Invariant Robust Speech Recognition. Dongning Yang, Wei Wang, Yanmin Qian. 1-8
- LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of End-to-End ASR Models. Aleksandr Meister, Matvei Novikov, Nikolay Karpov, Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg. 1-7
- SA-Paraformer: Non-Autoregressive End-To-End Speaker-Attributed ASR. Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie 0001. 1-7
- Audio-Visual Neural Syntax Acquisition. Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David D. Cox, David Harwath, Yang Zhang 0001, Karen Livescu, James R. Glass. 1-8
- TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for PyTorch. Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma 0010, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar 0003, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe 0001, Yangyang Shi, Yumeng Tao. 1-9
- Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection. Jiachen Lian, Carly Feng, Naasir Farooqi, Steve Li, Anshul Kashyap, Cheol Jun Cho, Peter Wu, Robbie Netzorg, Tingle Li, Gopala Krishna Anumanchipalli. 1-8
- Improving Large-Scale Deep Biasing With Phoneme Features and Text-Only Data in Streaming Transducer. Jin Qiu, Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma. 1-8
- AV-data2vec: Self-Supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations. Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli. 1-8
- Generalized Zero-Shot Audio-to-Intent Classification. Veera Raghavendra Elluru, Devang Kulshreshtha, Rohit Paturi, Sravan Bodapati, Srikanth Ronanki. 1-8
- Acoustics-Text Dual-Modal Joint Representation Learning for Cover Song Identification. Yanmei Gu, Jing Li, Jiayi Zhou, Zhiming Wang, Huijia Zhu. 1-8
- Domain Adaptation by Data Distribution Matching Via Submodularity For Speech Recognition. Yusuke Shinohara, Shinji Watanabe 0001. 1-7
- Toward Universal Speech Enhancement For Diverse Input Conditions. Wangyou Zhang, Kohei Saijo, Zhong-qiu Wang, Shinji Watanabe 0001, Yanmin Qian. 1-6
- Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data. Yusheng Tian, Wei Liu, Tan Lee. 1-7
- Towards a Unified End-to-End Language Understanding System for Speech and Text Inputs. Mohan Li, Catalin Zorila, Cong-Thanh Do, Rama Doddipatla. 1-8
- Improving Multilingual and Code-Switching ASR Using Large Language Model Generated Text. Ke Hu, Tara N. Sainath, Bo Li 0028, Yu Zhang, Yong Cheng, Tao Wang, Yujing Zhang, Frederick Liu. 1-7
- A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, And Extraction. Kohei Saijo, Wangyou Zhang, Zhong-qiu Wang, Shinji Watanabe 0001, Tetsunori Kobayashi, Tetsuji Ogawa. 1-6
- Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach. Junkun Chen, Jian Xue, Peidong Wang, Jing Pan, Jinyu Li 0001. 1-7
- Invert-Classify: Recovering Discrete Prosody Inputs for Text-To-Speech. Nicholas Sanders, Korin Richmond. 1-7
- On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments. William Ravenscroft, Stefan Goetze, Thomas Hain. 1-7
- Speaker Adaptation for End-to-End Speech Recognition Systems in Noisy Environments. Dominik Wagner, Ilja Baumann, Sebastian P. Bayerl, Korbinian Riedhammer, Tobias Bocklet. 1-6
- Adapting Pretrained Speech Model for Mandarin Lyrics Transcription and Alignment. Jun-You Wang, Chon-In Leong, Yu-Chen Lin, Li Su, Jyh-Shing Roger Jang. 1-8
- Consistency Based Unsupervised Self-Training for ASR Personalisation. Jisi Zhang, Vandana Rajan, Haaris Mehmood, David Tuckey, Pablo Peso Parada, Md Asif Jalal, Karthikeyan Saravanan, Gil Ho Lee, Jungin Lee, Seokyeong Jung. 1-8
- Improved Multi-Modal Emotion Recognition Using Squeeze-and-Excitation Block in Cross-Modal Attention. Junchen Liu, Jesin James, Karan Nathwani. 1-8
- Multitask Learning Model with Text and Speech Representation for Fine-Grained Speech Scoring. Seongjin Park, Rutuja Ubale. 1-7
- Transduce and Speak: Neural Transducer for Text-To-Speech with Semantic Token Prediction. Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Dongjune Lee, Nam Soo Kim. 1-7
- U2-KWS: Unified Two-Pass Open-Vocabulary Keyword Spotting with Keyword Bias. Ao Zhang, Pan Zhou, Kaixun Huang, Yong Zou, Ming Liu, Lei Xie 0001. 1-8
- Clustering Unsupervised Representations as Defense Against Poisoning Attacks on Speech Commands Classification System. Thomas Thebaud, Sonal Joshi, Henry Li, Martin Sustek, Jesús Villalba 0001, Sanjeev Khudanpur, Najim Dehak. 1-8
- Robust Recognition of Speaker Emotion With Difference Feature Extraction Using a Few Enrollment Utterances. Daichi Hayakawa, Takehiko Kagoshima, Kenji Iwata, Norbert Braunschweiler, Rama Doddipatla. 1-7
- Efficient Cascaded Streaming ASR System Via Frame Rate Reduction. Xingyu Cai, David Qiu, Shaojin Ding, Dongseong Hwang, Weiran Wang, Antoine Bruguier, Rohit Prabhavalkar, Tara N. Sainath, Yanzhang He. 1-8
- Joint Energy-Based Model for Robust Speech Classification System Against Dirty-Label Backdoor Poisoning Attacks. Martin Sustek, Sonal Joshi, Henry Li, Thomas Thebaud, Jesús Villalba 0001, Sanjeev Khudanpur, Najim Dehak. 1-8
- Gated Multi Encoders and Multitask Objectives for Dialectal Speech Recognition in Indian Languages. Sathvik Udupa, Jesuraja Bandekar, Deekshitha G, Saurabh Kumar, Prasanta Kumar Ghosh, Sandhya Badiger, Abhayjeet Singh, Savitha Murthy, Priyanka Pai, Srinivasa Raghavan K. M., Raoul Nanavati. 1-8
- Detecting Speech Abnormalities With a Perceiver-Based Sequence Classifier that Leverages a Universal Speech Model. Hagen Soltau, Izhak Shafran, Alex Ottenwess, Joseph R. Duffy, Rene L. Utianski, Leland R. Barnard, John L. Stricker, Daniela A. Wiepert, David T. Jones, Hugo Botha. 1-7
- Discriminative Speech Recognition Rescoring With Pre-Trained Language Models. Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko. 1-7
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning. Guanrou Yang, Ziyang Ma, Zhisheng Zheng, Yakun Song, Zhikang Niu, Xie Chen 0001. 1-7
- PP-MET: A Real-World Personalized Prompt Based Meeting Transcription System. Xiang Lyu, Yuhang Cao, Qing Wang, Jingjing Yin, Yuguang Yang 0005, Pengpeng Zou, Yanni Hu, Heng Lu 0004. 1-8
- COCO-NUT: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-Based Control. Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari. 1-8
- LAE-ST-MOE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-Switching ASR. Guodong Ma, Wenxuan Wang, Yuke Li, Yuting Yang, Binbin Du, Haoran Fu. 1-8
- Reducing the Cost of Spoof Detection Labeling using Mixed-Strategy Active Learning and Pretrained Models. Mark Lindsey, Nathaniel R. Robinson, Francis Kubala, Richard M. Stern. 1-7
- Investigating The Effect of Language Models in Sequence Discriminative Training For Neural Transducers. Zijian Yang, Wei Zhou 0043, Ralf Schlüter, Hermann Ney. 1-8
- Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference. Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe 0001. 1-8
- Audio-AdapterFusion: A Task-ID-Free Approach for Efficient and Non-Destructive Multi-Task Speech Recognition. Hillary Ngai, Rohan Agrawal, Neeraj Gaur, W. Ronny Huang, Parisa Haghani, Pedro Moreno Mengibar. 1-8
- RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain. Sangeet Sagar, Mirco Ravanelli, Bernd Kiefer, Ivana Kruijff-Korbayová, Josef van Genabith. 1-7
- Hierarchical Attention-Based Contextual Biasing For Personalized Speech Recognition Using Neural Transducers. Sibo Tong, Philip Harding, Simon Wiesler. 1-8
- Transformer Attractors for Robust and Efficient End-To-End Neural Diarization. Lahiru Samarakoon, Samuel J. Broughton, Marc Härkönen, Ivan Fung. 1-8
- Pareto Efficiency of Learning-Forgetting Trade-Off in Neural Language Model Adaptation. Jerome R. Bellegarda. 1-8
- Study on the Correlation Between Objective Evaluations and Subjective Speech Quality and Intelligibility. Hsin-Tien Chiang, Kuo-Hsuan Hung, Szu-Wei Fu, Heng-Cheng Kuo, Ming-Hsueh Tsai, Yu Tsao 0001. 1-7
- Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning. William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe 0001. 1-8
- Learning From Flawed Data: Weakly Supervised Automatic Speech Recognition. Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola García-Perera, Daniel Povey, Sanjeev Khudanpur. 1-8
- NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network For Acoustic Echo Cancellation and Speech Enhancement. Meng Yu 0003, Yong Xu 0004, Chunlei Zhang, Shi-Xiong Zhang, Dong Yu 0001. 1-8
- Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-To-End ASR. Sheikh Shams Azam, Tatiana Likhomanenko, Martin Pelikan, Jan Honza Silovsky. 1-8
- An Exploration of Task-Decoupling on Two-Stage Neural Post Filter for Real-Time Personalized Acoustic Echo Cancellation. Zihan Zhang, Jiayao Sun, Xianjun Xia, Ziqian Wang, Xiaopeng Yan, Yijian Xiao, Lei Xie 0001. 1-7
- Spectral Tilt May Have a Smaller Impact on the Intelligibility of Speech in Noise. Yoshiki Sato, Julián Villegas. 1-5
- YODAS: YouTube-Oriented Dataset for Audio and Speech. Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe 0001. 1-8
- Generative Speech Recognition Error Correction With Large Language Models and Task-Activating Prompting. Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke. 1-8
- Improving Speech Enhancement Using Audio Tagging Knowledge From Pre-Trained Representations and Multi-Task Learning. Shaoxiong Lin, Chao Zhang, Yanmin Qian. 1-7
- Transferring Speech-Generic and Depression-Specific Knowledge for Alzheimer's Disease Detection. Ziyun Cui, Wen Wu, Wei-Qiang Zhang, Ji Wu, Chao Zhang 0031. 1-8
- Towards Matching Phones and Speech Representations. Gene-Ping Yang, Hao Tang 0002. 1-8
- Building High-Accuracy Multilingual ASR With Gated Language Experts and Curriculum Training. Eric Sun, Jinyu Li 0001, Yuxuan Hu, Yimeng Zhu, Long Zhou, Jian Xue, Peidong Wang, Linquan Liu, Shujie Liu 0001, Edward Lin, Yifan Gong 0001. 1-7
- Exploring Data Augmentation in Bias Mitigation Against Non-Native-Accented Speech. Yuanyuan Zhang, Aaricia Herygers, Tanvina Patel, Zhengjun Yue, Odette Scharenborg. 1-8
- Acoustic Model Fusion For End-to-End Speech Recognition. Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang 0001, Tim Ng, Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng, Man-Hung Siu. 1-7
- Joint Audio and Speech Understanding. Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James R. Glass. 1-8
- MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement. Weiming Xu, Zhouxuan Chen, Zhili Tan, Shubo Lv, Runduo Han, Wenjiang Zhou, Weifeng Zhao, Lei Xie. 1-8
- Meta-Learning Framework for End-to-End Imposter Identification in Unseen Speaker Recognition. Ashutosh Chaubey, Sparsh Sinha, Susmita Ghose. 1-8
- Identifying People with Mild Cognitive Impairment at Risk of Developing Dementia using Speech Analysis. Bahman Mirheidari, Ronan O'Malley, Daniel Blackburn, Heidi Christensen. 1-6
- VITS-Based Singing Voice Conversion Leveraging Whisper and Multi-Scale F0 Modeling. Ziqian Ning, Yuepeng Jiang, Zhichao Wang, Bin Zhang, Lei Xie 0001. 1-8
- SLM: Bridge the Thin Gap Between Speech and Text Foundation Models. Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao 0007, Nanxin Chen, Yu Zhang 0033, Hagen Soltau, Paul K. Rubenstein, Lukas Zilka, Dian Yu, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu. 1-8
- FLAP: Fast Language-Audio Pre-Training. Ching-feng Yeh, Po-Yao Huang 0001, Vasu Sharma, Shang-wen Li 0001, Gargi Ghosh. 1-8
- CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers. Xintong Wang, Chang Zeng, Jun Chen, Chunhui Wang. 1-6
- Cross-Modal Learning for CTC-Based ASR: Leveraging CTC-BERTScore and Sequence-Level Training. Mun-Hak Lee, Sang-Eon Lee, Ji-Eun Choi, Joon-Hyuk Chang. 1-8
- Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition And Phoneme To Grapheme Translation. Wonjun Lee, Gary Geunbae Lee, Yunsu Kim 0001. 1-8
- Zero-Shot Domain-Sensitive Speech Recognition with Prompt-Conditioning Fine-Tuning. Feng-Ting Liao, Yung-Chieh Chan, Yi-Chang Chen, Chan-Jan Hsu, Da-shan Shiu. 1-8
- LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement. Zili Qi, Xinhui Hu, Wangjin Zhou, Sheng Li 0010, Hao Wu, Jian Lu, Xinkang Xu. 1-6
- No Pitch Left Behind: Addressing Gender Unbalance In Automatic Speech Recognition Through Pitch Manipulation. Dennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli. 1-8
- Parameter-Efficient Cross-Language Transfer Learning for a Language-Modular Audiovisual Speech Recognition. Zhengyang Li, Thomas Graave, Jing Liu, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt. 1-8
- Haha-POD: An Attempt for Laughter-Based Non-Verbal Speaker Verification. Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li. 1-7
- Low-Rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition. Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth Gurunath Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastrow, Ivan Bulyko. 1-8
- CAMSAT: Augmentation Mix and Self-Augmented Training Clustering for Self-Supervised Speaker Recognition. Abderrahim Fathan, Jahangir Alam. 1-8
- A Token-Wise Beam Search Algorithm for RNN-T. Gil Keren. 1-8
- Locality Enhanced Dynamic Biasing and Sampling Strategies For Contextual ASR. Md Asif Jalal, Pablo Peso Parada, George Pavlidis, Vasileios Moschopoulos, Karthikeyan Saravanan, Chrysovalantis-Giorgos Kontoulis, Jisi Zhang, Anastasios Drosou, Gil Ho Lee, Jungin Lee, Seokyeong Jung. 1-8
- Can Unpaired Textual Data Replace Synthetic Speech in ASR Model Adaptation? Pasquale D'Alterio, Christian Hensel, Bashar Awwad Shiekh Hasan. 1-8
- Semi-Supervised Multi-Channel Speaker Diarization With Cross-Channel Attention. Shilong Wu, Jun Du, Mao-Kui He, Shutong Niu, Hang Chen, Haitao Tang, Chin-Hui Lee 0001. 1-8
- Can We Use Speaker Embeddings On Spontaneous Speech Obtained From Medical Conversations To Predict Intelligibility? Sebastião Quintas, Mathieu Balaguer, Julie Mauclair, Virginie Woisard, Julien Pinquier. 1-7
- Simulation of Teacher-Learner Interaction in English Language Pronunciation Learning. Elaf Islam, Thomas Hain, Protima Nomo Sudro. 1-6
- Deriving Translational Acoustic Sub-Word Embeddings. Amit Meghanani, Thomas Hain. 1-8
- Pseudo-Label Based Supervised Contrastive Loss for Robust Speech Representations. Varun Krishna, Sriram Ganapathy. 1-8
- LV-CTC: Non-Autoregressive ASR With CTC and Latent Variable Models. Yuya Fujita, Shinji Watanabe 0001, Xuankai Chang, Takashi Maekaku. 1-6
- VoiceExtender: Short-Utterance Text-Independent Speaker Verification With Guided Diffusion Model. Yayun He, Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao 0006. 1-8
- Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing. Wangyou Zhang, Lei Yang, Yanmin Qian. 1-6
- Variational Gaussian Process Data Uncertainty. Jeremy Heng Meng Wong, Huayun Zhang, Nancy F. Chen. 1-8
- Prompt Pool Based Class-Incremental Continual Learning for Dialog State Tracking. Hong Liu, Yucheng Cai, Yuan Zhou, Zhijian Ou, Yi Huang, Junlan Feng. 1-8
- Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking. Jihyun Lee, Yejin Jeon, Wonjun Lee, Yunsu Kim 0001, Gary Geunbae Lee. 1-8
- PerMod: Perceptually Grounded Voice Modification With Latent Diffusion Models. Robin Netzorg, Ajil Jalal, Luna McNulty, Gopala Krishna Anumanchipalli. 1-8
- Combining Relative and Absolute Learning Formulations to Predict Emotional Attributes From Speech. Abinay Reddy Naini, Shruthi Subramanium, Seong-Gyun Leem, Carlos Busso. 1-8
- Zero-Shot Singing Voice Synthesis from Musical Score. Jun-You Wang, Hung-yi Lee, Jyh-Shing Roger Jang, Li Su. 1-8
- Brouhaha: Multi-Task Training for Voice Activity Detection, Speech-to-Noise Ratio, and C50 Room Acoustics Estimation. Marvin Lavechin, Marianne Métais, Hadrien Titeux, Alodie Boissonnet, Jade Copet, Morgane Rivière, Elika Bergelson, Alejandrina Cristià, Emmanuel Dupoux, Hervé Bredin. 1-7
- Efficient Text-Only Domain Adaptation For CTC-Based ASR. Chang Chen, Xun Gong, Yanmin Qian. 1-7
- Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-Supervised Setting. Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah. 1-7
- Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis. Yuke Li, Xinfa Zhu, Yi Lei, Hai Li, Junhui Liu, Danming Xie, Lei Xie 0001. 1-8
- The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections Through Federated Learning. Lillian Zhou, Yuxin Ding, Mingqing Chen, Harry Zhang, Rohit Prabhavalkar, Dhruv Guliani, Giovanni Motta, Rajiv Mathews. 1-7
- Cross-Modal Alignment With Optimal Transport For CTC-Based ASR. Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai. 1-7
- Paraconsistent Feature Analysis for the Competency Evaluation of Voice Impersonation. Rajeev Rajan, Noumida Abdul Kareem, Sreelakshmi S. 1-7
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition. Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen, Wei-Qiang Zhang. 1-6
- Generative Linguistic Representation for Spoken Language Identification. Peng Shen, Xuguang Lu, Hisashi Kawai. 1-8