- Study of vocal fold vibration using M-mode ultrasound: a proof of concept. Juliette Dindart, Agnès Rouxel, Crystal Lin, Trung Kien Bui, Muriel Lefort, Claire Pillot-Loiseau, Christophe Trésallet, Frédérique Frouin.
- Universal Speech Enhancement with Regression and Generative Mamba. Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukic, Szu-Wei Fu, Yu Tsao 0001.
- ProBiEM: Acoustic and Lexical Correlates of Prosodic Prominence in English-Malayalam Bilingual Speech. Anindita Mondal, Rahul Biju, Anil Kumar Vuppala, Reni K. Cherian, Chiranjeevi Yarra.
- PPGs-BERT: Leveraging Phoneme Sequence and BERT for Alzheimer's Disease Detection from Spontaneous Speech. Qi Sun, Ziyue Qiu, Yu Pu, Jinpeng Li, Xuchu Chen, Wei-Qiang Zhang.
- Multimodal Assessment of Speech Impairment in Amyotrophic Lateral Sclerosis Using Audio-Visual and Machine Learning Approaches. Francesco Pierotti, Andrea Bandini.
- The Speech Accessibility Project: Best Practices for Collection and Curation of Disordered Speech. Chris Zwilling, Mark Hasegawa-Johnson, Heather Hodges, Lorraine O. Ramig, Adina Bradshaw, Clarion Mendes, Heejin Kim, Alexandria Barkhimer, Laura Mattie, Meg Dickinson, Shawnise Carter, Marie Moore Channell.
- Challenges and practical guidelines for atypical speech data collection, annotation, usage and sharing: A multi-project perspective. Zhengjun Yue, Mara Barberis, Tanvina Patel, Judith Dineley, Willemijn Doedens, Lottie Stipdonk, Yuanyuan Zhang, Elke De Witte, Erfan Loweimi, Hugo Van Hamme, Djaina Satoer, Marina B. Ruiter, Laureano Moro-Velázquez, Nicholas Cummins, Odette Scharenborg.
- D-GAT: Dual Graph Attention Network for Global HRTF Interpolation. Junsheng Hu, Shaojie Li, Qintuya Si, De Hu.
- Instantaneous changes in acoustic signals reflect syllable progression and cross-linguistic syllable variation. Haley Hsu, Dani Byrd, Khalil Iskarous, Louis Goldstein.
- Performance of Montreal Forced Aligner on Cantonese Spontaneous Speech. Ka Ki SO, Chenzi Xu, Grace Wenling Cao, Peggy Mok.
- Perception of Long and Short Vowel Contrast in Te Reo Māori in Clean and Everyday Listening Environments. C. T. Justine Hui, Jenice Kuzhikombil, Isabella Shields, Hiraia Haami-Wells, Catherine I. Watson, Peter J. Keegan.
- Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR. Xugang Lu, Peng Shen, Yu Tsao 0001, Hisashi Kawai.
- Leveraging Geographic Metadata for Dialect-Aware Speech Recognition. Pouya Mehralian, Hugo Van Hamme.
- The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties. William Chen, Chutong Meng, Jiatong Shi, Martijn Bartelds, Shih-Heng Wang, Hsiu-Hsuan Wang, Rafael Mosquera, Sara Hincapie, Dan Jurafsky, Antonis Anastasopoulos, Hung-yi Lee, Karen Livescu, Shinji Watanabe 0001.
- Stack Less, Repeat More: A Block Reusing Approach for Progressive Speech Enhancement. Jangyeon Kim, Ui-Hyeop Shin, Jaehyun Ko, Hyung-Min Park.
- NeuroSpex+: Dual-Task Training of Neuro-Guided Speaker Extraction with Speech Envelope and Waveform. Dashanka De Silva, Siqi Cai 0002, Saurav Pahuja, Tanja Schultz, Haizhou Li 0001.
- Employing self-supervised learning models for cross-linguistic child speech maturity classification. Theo Zhang, Madurya Suresh, Anne Warlaumont, Kasia Hitczenko, Alejandrina Cristià, Margaret Cychosz.
- Speaker-specific Patterns of Phonetic Covariation in Korean Word-medial Stops and the Role of Phonological and Morphological Contexts. Chloe D. Kwon.
- Fine-Tuning ASR for Stuttered Speech: Personalized vs. Generalized Approaches. Dena F. Mujtaba, Nihar R. Mahapatra.
- Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation. Mu Yang, Bowen Shi 0002, Matthew Le 0001, Wei-Ning Hsu, Andros Tjandra.
- SCD-Conformer: Semantic Content Disentanglement for Text-Independent Speaker Verification. Shanshan Yao, Dianlong Liu, Tian Li.
- LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention. Aditya Srinivas Menon, Raj Prakash Gohil, Kumud Tripathi, Pankaj Wasnik.
- Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition. Tao Zhong, Mengzhe Geng, Shujie Hu, Guinan Li, Xunying Liu.
- Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs. Simon Sedlácek, Bolaji Yusuf, Jan Svec, Pradyoth Hegde, Santosh Kesiraju, Oldrich Plchot, Jan Cernocký.
- PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association. Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, Muhammad Irzam Liaqat, Markus Schedl, Mubashir Noman.
- Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models. Seung Jae Lee, Paul Hongsuck Seo.
- Objective and Subjective Evaluation of Diffusion-Based Speech Enhancement for Dysarthric Speech. Dimme de Groot, Tanvina Patel, Devendra Kayande, Odette Scharenborg, Zhengjun Yue.
- A Two-Stage Hierarchical Deep Filtering Framework for Real-Time Speech Enhancement. Shenghui Lu, Hukai Huang, Jinanglong Yao, Kaidi Wang, Qingyang Hong, Lin Li.
- Leveraging Cascaded Binary Classification and Multimodal Fusion for Dementia Detection through Spontaneous Speech. Yin-Long Liu, Yuanchao Li, Rui Feng, Liu He, Jia-xin Chen, Yi-Ming Wang, Yu-Ang Chen, Yan-Han Peng, Jia-Hong Yuan, Zhen-Hua Ling.
- Augment Mandarin to Cantonese Speech Databases via Retrieval-Augmented Generation and Speech Synthesis. Fan Liu, Cheng Gong, Boyu Zhu, Ruihao Jing, Chunyu Qiang, Tianrui Wang, Xiao-lei Zhang, Xuelong Li.
- X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance. Junbo Zhang, Heinrich Dinkel, Yadong Niu, Chenyu Liu, Si Cheng, Anbei Zhao, Jian Luan 0001.
- Chain-of-Thought Training for Open E2E Spoken Dialogue Systems. Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe 0001.
- DYNAC: Dynamic Vocabulary-based Non-Autoregressive Contextualization for Speech Recognition. Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel 0001, Yifan Peng 0003, Chyi-Jiunn Lin, Shinji Watanabe 0001.
- Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders. Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan 0001.
- Fully End-to-end Streaming Open-vocabulary Keyword Spotting with W-CTC Forced Alignment. Dohyun Kim, Jiwook Hwang.
- DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation. Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu 0001.
- Flexible VAD-PVAD Transition: A Detachable PVAD Module for Dynamic Encoder RNN VAD. En-Lun Yu, Chien-Chun Wang, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen.
- NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference. Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukic, Jason Li, Boris Ginsburg.
- Hearing deficits of transformer-based ASR for anechoic and spatial signals. Dirk Eike Hoffner, Simon Weihe, Thomas Brand, Bernd T. Meyer.
- On the Relevance of Clinical Assessment Tasks for the Automatic Detection of Parkinson's Disease Medication State from Speech. David Gimeno-Gómez, Rubén Solera-Ureña, Anna Pompili, Carlos D. Martínez-Hinarejos, Rita Cardoso, Isabel Guimarães, Joaquim J. Ferreira, Alberto Abad.
- Dog2vec: Self-Supervised Pre-Training for Canine Vocal Representation. Xingyuan Li, Kenny Q. Zhu, Mengyue Wu.
- Room Impulse Response as a Prompt for Acoustic Echo Cancellation. Fei Zhao, Shulin He, Xueliang Zhang.
- A real-time MRI study on asymmetry in velum dynamics during VCV production with nasal sounds. Chetan Sharma, Vaishnavi Chandwanshi, Shreya Shrikant Karkun, Aditya Anand Gupta, Prasanta Kumar Ghosh.
- EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast. Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu, Carlos Busso, Berrak Sisman.
- Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges. Hashim Ali 0003, Surya Subramani, Raksha Varahamurthy, Nithin Sai Adupa, Lekha Bollinani, Hafiz Malik.
- LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking. Jaume Santamaria-Jorda, Pablo Segovia-Martínez, Gonçal V. Garcés Díaz-Munío, Joan Albert Silvestre-Cerdà, Adrià Giménez, Rubén Gaspar Aparicio, René Fernández Sánchez, Jorge Civera, Albert Sanchís, Alfons Juan.
- Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinson's Disease Classifiers. Terry Yi Zhong, Esther Janse, Cristian Tejedor-Garcia, Louis ten Bosch, Martha A. Larson.
- Exploiting Bispectral Features for Single-Channel Speech Enhancement. Venkatesh Parvathala, Ramesh Gundluru, Sreekanth Sankala, K. Sri Rama Murty.
- Pull It Together: Reducing the Modality Gap in Contrastive Learning. Amit Sofer, Yoav Goldman, Shlomo E. Chazan.
- Automated evaluation of children's speech fluency for low-resource languages. Bowen Zhang, Nur Afiqah Abdul Latiff, Justin Kan, Rong Tong, Donny Soh, Xiaoxiao Miao, Ian McLoughlin.
- Children's Voice Privacy: First Steps and Emerging Challenges. Ajinkya Kulkarni, Francisco Teixeira, Enno Hermann, Thomas Rolland, Isabel Trancoso, Mathew Magimai-Doss.
- Non-Intrusive Binaural Speech Intelligibility Prediction Using Mamba for Hearing-Impaired Listeners. Katsuhiko Yamamoto, Koichi Miyazaki.
- Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora. Hitoshi Suda, Shinnosuke Takamichi, Satoru Fukayama.
- Clustering-based Hard Negative Sampling for Supervised Contrastive Speaker Verification. Piotr Masztalski, Michal Romaniuk, Jakub Zak, Mateusz Matuszewski, Konrad Kowalczyk.
- Synthetic Speech Source Tracing using Metric Learning. Dimitrios Koutsianos, Stavros Zacharopoulos, Yannis Panagakis, Themos Stafylakis.
- TTMBA: Towards Text To Multiple Sources Binaural Audio Generation. Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang.
- Development and Validation of a Wav2Vec 2.0-Based Cross-Language Methodology for Measurement of Articulatory Precision. Tanya Talkar, Kan Kawabata, Connor Higgins, Sean Tobyne.
- ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism. Hsing-Hang Chou, Yun-Shao Lin, Ching-Chin Sung, Yu Tsao 0001, Chi-Chun Lee.
- WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing. Yu Nakagome, Michael Hentschel.
- SDBench: A Comprehensive Benchmark Suite for Speaker Diarization. Berkin Durmus, Blaise Munyampirwa, Eduardo Pacheco, Atila Orhon, Andrey Leonov.
- Scalable Offline ASR for Command-Style Dictation in Courtrooms. Kumarmanas Nethil, Vaibhav Mishra, Kriti Anandan, Kavya Manohar.
- Speech-Based Automatic Chronic Kidney Disease Diagnosis via Transformer Fusion of Glottal and Spectrogram Features. Jihyun Mun, Minhwa Chung, SunHee Kim.
- SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant. Yixuan Hou, Heyang Liu, Yuhao Wang, Ziyang Cheng 0002, Ronghua Wu, Qunshan Gu, Yanfeng Wang 0001, Yu Wang 0027.
- Evaluating Speech Enhancement Performance Across Demographics and Language. José Giraldo, Alex Peiró Lilja, Carme Armentano-Oller, Rodolfo Zevallos, Cristina España-Bonet.
- Co-registration of real-time MRI and respiration for speech research. Yubin Zhang, Prakash Kumar, Ye Tian, Ziwei Zhao, Xuan Shi, Kevin Huang, Kevin Lee, Haley Hsu, Shrikanth Narayanan, Krishna S. Nayak, Louis Goldstein.
- Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation. Jingping Nie, Tien Dung Tran, Karan Thakkar, Vasudha Kowtha, Jon Huang, Carlos Avendaño, Erdrin Azemi, Vikramjit Mitra.
- A Gradient Effect of Hand Beat Timing on Spoken Word Recognition. Chengjia Ye, James M. McQueen, Hans Rutger Bosker.
- Lessons Learnt: Revisit Key Training Strategies for Effective Speech Emotion Recognition in the Wild. Jing-Tong Tzeng, Bo-Hao Su, Ya-Tse Wu, Hsing-Hang Chou, Chi-Chun Lee.
- U-SAM: An Audio Language Model for Unified Speech, Audio, and Music Understanding. Ziqian Wang, Xianjun Xia, Xinfa Zhu, Lei Xie.
- A Multimodal Chinese Dataset for Cross-lingual Sarcasm Detection. Xiyuan Gao, Bruce Xiao Wang, Meiling Zhang, Shuming Huang, Zhu Li, Shekhar Nayak, Matt Coler.
- A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation. Verena Blaschke, Miriam Winkler, Constantin Förster, Gabriele Wenger-Glemser, Barbara Plank.
- TA-RIR: Topology-Aware Neural Modeling of Acoustic Propagation for Room Impulse Response Synthesis. Junhui Zhao, Hang Chen, Qing Wang, Jun Du, Yanhui Tu, Feng Ma.
- On the Production and Perception of a Single Speaker's Gender. Robin Netzorg, Naomi Carvalho, Andrea Guzman, Lydia Wang, Juliana Francis, Klo Vivienne Garoute, Keith Johnson, Gopala Anumanchipalli.
- Mel-McNet: A Mel-Scale Framework for Online Multichannel Speech Enhancement. Yujie Yang, Bing Yang, Xiaofei Li.
- No Audiogram: Leveraging Existing Scores for Personalized Speech Intelligibility Prediction. Haoshuai Zhou, Changgeng Mo, Boxuan Cao, Linkai Li, Shan Xiang Wang.
- Unified Semi-Supervised Pipeline for Automatic Speech Recognition. Nune Tadevosyan, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Ante Jukic.
- Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data. Yun Tang, Eesung Kim, Vijendra Raj Apsingekar.
- Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations. Seongsil Heo, Christi Miller, Calvin Murdock, Michael J. Proulx.
- Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience. Andrew Chang 0003, Chenkai Hu, Ji Qi, Zhuojian Wei, Kexin Zhang, Viswadruth Akkaraju, David Poeppel, Dustin Freeman.
- xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement. Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen 0001, Zheng-Hua Tan.
- Temp4Cap: Temporally-aligned Automated Audio Captioning. Ho-Young Choi, Jae-Heung Cho, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang.
- Can We Reconstruct a Dysarthric Voice with the Large Speech Model Parler TTS? Ariadna Sanchez, Simon King.
- On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs. Kemal Altwlkany, Amar Kuric, Emanuel Lacic.
- Semantic Processing During Spoken Word Production by Children with Cochlear Implants. Man Wang, Yixin Ding, Niels O. Schiller.
- Speaker Conditioning of Voice Activity Detection via Implicit Separation. Matthew Maciejewski.
- Federated Learning with Feature Space Separation for Speaker Recognition. Ying Meng, Zhihua Fang, Liang He.
- Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm. Zhaoyang Li, Jie Wang, Xiaoxiao Li, Wangjie Li, Longjie Luo, Lin Li, Qingyang Hong.
- SpeechSEC: A Unified Multi-Task Framework for Speech Synthesis, Editing, and Continuation. Liming Liang, Dongchao Yang, Xianwei Zhuang, Yuxin Xie 0004, Luo Chen, Yuehan Jin, Yuexian Zou.
- Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier. Yongjie Si, Yanxiong Li, Jiaxin Tan, Qianhua He, Il-Youp Kwak.
- Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates. Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu.
- Quadruple Path Modeling with Latent Feature Transfer for Permutation-free Continuous Speech Separation. Jihyun Kim, Doyeon Kim, Hyewon Han, Jinyoung Lee, Jonguk Yoo, Chang Woo Han, Jeongook Song, Hoon-Young Cho, Hong-Goo Kang.
- ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning. Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang 0001.
- Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody. David Sasu, Benedict Quartey, Kweku Andoh Yamoah, Natalie Schluter.
- Simultaneous Masked and Unmasked Decoding with Speculative Decoding Masking for Fast ASR without Accuracy Loss. Koji Okabe, Hitoshi Yamamoto.
- Sounding Like a Winner? Prosodic Differences in Post-Match Interviews. Sofoklis Kakouros, Haoyu Chen.
- Developing High-Quality TTS for Punjabi and Urdu: Benchmarking against MMS Models. Fatima Naseem, Maham Sajid, Farah Adeeba, Sahar Rauf, Asad Mustafa, Sarmad Hussain.
- Contrastive Learning-based Syllable-Level Mispronunciation Detection and Diagnosis for Speech Audiometry. Longbin Jin, Donghun Min, Jung Eun Shin, Eun Yi Kim.
- Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction. Christophe Van Gysel, Maggie Wu, Lyan Verwimp, Caglar Tirkaz, Marco Bertola, Zhihong Lei, Youssef Oualil.
- DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion. Xiaosu Su, Bowen Yang, Xiaowei Yi, Yun Cao.
- LinearVC: Linear Transformations of Self-Supervised Features Through the Lens of Voice Conversion. Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau.
- SLASH: Self-Supervised Speech Pitch Estimation Leveraging DSP-derived Absolute Pitch. Ryo Terashima, Yuma Shirahata, Masaya Kawamura.
- Vision-Integrated High-Quality Neural Speech Coding. Yao Guo, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling.
- Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages. Utkarsh Pathak, Chandra Sai Krishna Gunda, Anusha Prakash 0001, Keshav Agarwal, Hema A. Murthy.
- Queer Waves: A German Speech Dataset Capturing Gender and Sexual Diversity from Podcasts and YouTube. Ingo Siegert, Jan Marquenie, Sven Grawunder.
- Discovering Directions of Uncertainty in Speech Inpainting. Kfir Cohen, Lior Wolf, Bracha Laufer-Goldshtein.
- Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling. Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasios Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong.
- Fifteen Years of Child-Centered Long-Form Recordings: Promises, Resources, and Remaining Challenges to Validity. Loann Peurey, Marvin Lavechin, Tarek Kunze, Manel Khentout, Lucas Gautheron, Emmanuel Dupoux, Alejandrina Cristià.
- Structured Codebook Based Hierarchical Framework for DNN for Computationally Efficient Speech Enhancement. Chidambar B, Hanumanth Rao Naidu.
- Scaling Laws for Synthetic Speech for Model Training. Christoph Minixhofer, Ondrej Klejch, Peter Bell 0001.
- An Exploration of Interpretable Deep Learning Models for the Assessment of Mild Cognitive Impairment. Emma Cathrine Liisborg Leschly, Oliver Roesler, Michael Neumann, Jackson Liscombe, Abhishek Hosamath, Lakshmi Arbatti, Line H. Clemmensen, Melanie Ganz, Vikram Ramanarayanan.
- Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems. Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara.
- LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting. Pai Zhu, Quan Wang, Dhruuv Agarwal, Kurt Partridge.
- Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems. Natalia A. Tomashenko, Emmanuel Vincent 0001, Marc Tommasi.
- PAST: Phonetic-Acoustic Speech Tokenizer. Nadav Har-Tuv, Or Tal, Yossi Adi.
- Coping with segmental-prosodic incongruity in spoken word recognition in Japanese. Terumichi Ariga.
- VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining. Jianheng Zhuo, Yifan Yang 0005, Yiwen Shao, Yong Xu 0004, Dong Yu 0001, Kai Yu 0004, Xie Chen 0001.
- Effect of physical exercise on voice in people living with COPD. Lauren G. Reinders, Loes van Bemmel, Alexander Mackay, David Nobbs, Frits M. E. Franssen, Hester Gietema, Simona Schäfer, Sami O. Simons.
- Cross-Attention-Based Target Sound Extraction by Fully Leveraging Enrollment in a Shared Latent Space. Xue Yang, Guiru Shen, Yu Yang.
- Neutral Tone Variation in Beijing Mandarin: Is Neutral Tone Toneless? Xiao Dong, Fengming Liu, Chien-Jer Charles Lin, Monica Nesbitt, Shuju Shi.
- Evaluating the Effectiveness of Pre-Trained Audio Embeddings for Classification of Parkinson's Disease Speech Data. Emmy Postma, Cristian Tejedor García.
- Parameter-efficient Fine-tuning of Conformer-based Streaming Speech Recognition into Non-streaming Models. Yunjae Nam, Jeong U. Han, Kiyeon Kim, Jaemin Lim.
- TF-Mamba: A Time-Frequency Network for Sound Source Localization. Yang Xiao, Rohan Kumar Das.
- Exploratory Analysis of Brainstem fMRI Data During Sustained Phonation. Carey-Smith, Hu Cheng, Pertti Palo, Daniel Aalto, Steven M. Lulich.
- Significance of Time-Frequency preprocessing for automatic Ultrasonic Vocalization classification in Autism Spectrum Disorder model detection. Szymon Szmajdzinski, Juliusz Wójtowicz-Kruk, Ivan Ryzhankow, Lukasz Lazarski, Jakub Zak, Wladyslaw Sredniawa.
- In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion. Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu.
- Is your model big enough? Training and interpreting large-scale monolingual speech foundation models. Yaroslav Getman, Tamás Grósz, Tommi Lehtonen, Mikko Kurimo.
- Using and comprehending language in face-to-face conversation. Judith Holler.
- Open Universal Arabic ASR Leaderboard. Yingzhi Wang, Anas Alhmoud, Muhammad Alqurishi.
- Sub-band based Adaptive IIR Algorithm with Biquad Filter Stability Constraints for Feedforward Hear-Through Equalization. Rishabh Gupta, MLNS Karthik, Omsrinath Chelamkuri.
- Non-Standard Accent TTS Support via Large Multi-Accent Frontend Pronunciation Knowledge Transfer. Noe Berger, Siqi Sun, Korin Richmond.
- Joint Reference Microphone Selection and Filter Order Determination in Multi-channel Active Noise Control. De Hu, Shuyao Liu, Yanrong He.
- Layer-Wise Decision Fusion for Fake Audio Detection Using XLS-R. Yixuan Xiao, Ngoc Thang Vu.
- On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition. Shujie Hu, Xurong Xie, Mengzhe Geng, Jiajun Deng, Huimeng Wang, Guinan Li, Chengxi Deng, Tianzi Wang, Mingyu Cui, Helen Meng, Xunying Liu.
- Joint Target-Speaker ASR and Activity Detection. Chikara Maeda, Muhammad Shakeel 0001, Yui Sudo.
- RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval. Haoqin Sun, Jingguang Tian, Jiaming Zhou, Hui Wang 0075, Jiabei He 0001, Shiwan Zhao, Xiangyu Kong 0001, Desheng Hu, Xinkang Xu, Xinhui Hu, Yong Qin.
- Improving Respiratory Sound Classification with Architecture-Agnostic Knowledge Distillation from Ensembles. Miika Toikkanen, June-Woo Kim.
- Tonality-Based Accompaniment-Guided Automatic Singing Evaluation. Pei-Chin Hsieh, Yih-Liang Shen, Ngoc Son Tran, Tai-Shih Chi.
- DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec. Peijie Chen, Wenhao Guan, Kaidi Wang, Weijie Wu, Hukai Huang, Qingyang Hong, Lin Li.
- MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers. Kyeongman Park, Seongho Joo, Kyomin Jung.
- Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding. Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu.
- SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information. Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee.
- Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval. Anup Singh, Kris Demuynck, Vipul Arora 0001.
- Thai Speech Spoofing Detection Dataset with Variations in Speaking Styles. Ticho Urai, Pachara Boonsarngsuk, Ekapol Chuangsuwanich.
- Lateral Channel Formation in Australian English /l/: Insights from Magnetic Resonance Imaging. Tünde Szalay, Michael Proctor, Amelia Gully, Tharinda Piyadasa, Craig T. Jin, David Waddington, Naeim Sanaei, Sheryl Foster, Kirrie J. Ballard.
- From KAN to GR-KAN: Advancing Speech Enhancement with KAN-Based Methodology. Haoyang Li, Yuchen Hu, Chen Chen 0075, Sabato Marco Siniscalchi, Songting Liu, Eng Siong Chng.
- Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection. YiFan Gao, Jiao Fu, Long Guo, Hong Liu.
- Voice Activity-based Text Segmentation for ASR Text Denormalization. Sashi Novitasari, Takashi Fukuda, Gakuto Kurata.
- Examining Test-Time Adaptation for Personalized Child Speech Recognition. Zhonghao Shi, Xuan Shi, Anfeng Xu, TianTian Feng, Harshvardhan Srivastava, Shrikanth Narayanan, Maja J. Mataric.
- Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models. Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul.
- A Copula-Based Generative Score-Level Fusion Model for Speaker Verification. Sandro Cumani.
- LiRI Corpus Platform: Demonstration of a Web-Based Infrastructure for Multimodal Corpus Analysis. Teodora Vukovic, Jérémy Zehr, Jonathan Schaber, Igor Mustac, Nikolina Rajovic, Daniel McDonald, Johannes Graën, Noah Bubenhofer.
- Efficient Streaming TTS Acoustic Model with Depthwise RVQ Decoding Strategies in a Mamba Framework. Joun Yeop Lee, Sangjun Park, Byoung Jin Choi, Ji-Hyun Lee, Min Kyung Kim, Hoon-Young Cho.
- Concurrent Speech and Auditory Tag Clouds for Non-Visual Web Interaction. Dhia Eddine Merzougui, Nilesh Tete, Fabrice Maurel, Gaël Dias, Mohammed Hasanuzzaman, Aurélien Bournonville, Edgar Madelaine, Thomas Berthelin Le Tellier, François Ledoyen, Laure Poutrain-Lejeune, François Rioult, Jérémie Pantin.
- Effect of Noise Floor in Room Impulse Response on Speech Perception Under Spherical Harmonics-based Spatial Sound Reproduction. Yunqi C. Zhang, Dhruv Jagmohan, Hong Kit Li, C. T. Justine Hui, Yusuke Hioka.
- Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR. Carlos Carvalho, Jinchuan Tian, William Chen, Yifan Peng 0003, Alberto Abad, Shinji Watanabe 0001.
- Low Complex IIR Adaptive Hear-Through Ambient Filtering for Overcoming Practical Constraints in Earbuds. Rishabh Gupta, MLNS Karthik, Yughendaran Palanivel.
- Joint Rate Allocation and Sensor Selection for Speech Enhancement in Wireless Acoustic Sensor Networks. De Hu, Qilong Li.
- Lightweight Speech Enhancement for Mandarin Esophageal Speech. Jia-Jyu Su, Yen-Ting Lin, Wu-Hao Li, Chao-Kai Chang, Yan-Zhi Chen, Chen-Yu Chiang.
- Multimodal Dynamics of Hand Gestures and Pauses in Multiparty Interactions. Delphine Charuau, Naomi Harte.
- Legally validated evaluation framework for voice anonymization. Nathalie Vauquier, Brij Mohan Lal Srivastava, Seyed Ahmad Hosseini, Emmanuel Vincent 0001.
- Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence. Edem Ahadzi, Vishwanath Pratap Singh, Tomi Kinnunen, Ville Hautamäki.
- SA-RAS: Speaker-Aware Style Retrieval Augmented Generation for Expressive Zero-Shot Text-to-Speech Synthesis. Xueru Li, Jingyuan Xing, Xiaofen Xing, Zhipeng Li, Xiangmin Xu.
- Multimodal Speech, Language and Orofacial Analysis for Remote Assessment of Positive, Negative and Cognitive Symptoms in Schizophrenia. Michael Neumann, Hardik Kothare, Beverly Insel, Anzalee Khan, Danyah Nadim, Jean-Pierre Lindenmayer, Vikram Ramanarayanan.
- Accessible Delivery of Visual-Acoustic Biofeedback for Speech Sound Disorder. Tara McAllister, Peter Traver, Amanda Eads, William Haack, Helen Carey, Yi Shan, Wendy Liang, Tae Hong Park.
- Towards Domain-Specific Spoken Language Understanding for a Catalan Voice-Controlled Video Game. Alex Peiró Lilja, Rodolfo Zevallos, Carme Armentano-Oller, José Giraldo, Cristina España-Bonet, Mireia Farrús.
- MOPSA: Mixture of Prompt-Experts Based Speaker Adaptation for Elderly Speech Recognition. Chengxi Deng, Xurong Xie, Shujie Hu, Mengzhe Geng, Yicong Jiang, Jiankun Zhao, Jiajun Deng, Guinan Li, Youjun Chen, Huimeng Wang, Haoning Xu, Mingyu Cui, Xunying Liu.
- CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset. Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Sai Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali 0002, Sanjeev Khudanpur, Shinji Watanabe 0001.
- Evaluating Automatic Speech Recognition Pipelines for Mandarin-English Bilingual Child Language Assessment in Telehealth. Hongchen Wu, Yao Du, Zirong Li, Yixin Gu, Disha Thotappala Jayaprakash, Li Sheng.
- Leveraging Unlabeled Audio-Visual Data in Speech Emotion Recognition using Knowledge Distillation. Varsha Pendyala, Pedro Morgado 0001, William A. Sethares.
- Nosey: Open-Source Hardware for Acoustic Nasalance. Maya Dewhurst, Jack Collins, Justin J. H. Lo, Roy Alderton, Sam Kirkham.
- Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification. William Ravenscroft, George Close, Kit Bower-Morris, Jamie Stacey, Dmitry Sityaev, Kris Y. Hong.
- EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification. Deok-Hyeon Cho, Hyung-Seok Oh, Seung-bin Kim, Seong-Whan Lee.
- Temporal Modeling of Room Impulse Response Generation via Multi-Scale Autoregressive Learning. Sheng Lyu, Yuemin Yu, Chenshu Wu.
- Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning. Changin Choi, Sungjun Lim, Wonjong Rhee.
- Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition. Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun 0013, Florian Metze.
- Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers. Yuzhu Wang, Archontis Politis, Konstantinos Drossos, Tuomas Virtanen.
- Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies. Carlos Mena, Pol Serra, Jacobo-Romero, Abir Messaoudi, José Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Iván Meza, Javier Hernando.
- End-to-End DOA-Guided Speech Extraction in Noisy Multi-Talker Scenarios. Kangqi Jing, Wenbin Zhang, Yu Gao.
- A Study of Real-world Audio-Visual Corpus Design and Production: A Perspective from MISP Challenges. Hang Chen, Jun Du, Qing Wang, Juan Xie, Shi-Fu Xiong.
- Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies. Maxwell Hope, Éva Székely.
- Influence of Room Acoustics on Objective Voice Assessment Methods in the Context of Speech and Language Therapy. Sven Franz, Tanja Grewe, Bernd T. Meyer, Jörg Bitzer.
- Context-Driven Dynamic Pruning for Large Speech Foundation Models. Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe 0001.
- Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling. Tahiya Chowdhury, Verónica Romero 0002.
- Comparison of Acoustic and Textual Features for Dysarthria Severity Classification in Amyotrophic Lateral Sclerosis. Y. S. Upendra Vishwanath, Tanuka Bhattacharjee, Deekshitha G, Sathvik Udupa, Chowdam Venkata Thirumala Kumar, Madassu Keerthipriya, Darshan Chikktimmegowda, Dipti Baskar, Yamini Belur, Seena Vengalil, Atchayaram Nalini, Prasanta Kumar Ghosh.
- Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning. Ki-Joong Kwon, Jun Ho So, Sang-Hoon Lee.
- Multimodal Zero-Shot Framework for Deepfake Hate Speech Detection in Low-Resource Languages. Rishabh Ranjan, Ayinala Likhith, Mayank Vatsa, Richa Singh 0001.
- Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving. Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu.
- TargetVoice: Single Channel Low-Latency Target Speaker Extraction. Arun Kumar Pallala, Nivedita Chennupati, Balaji Padmanaban, Rakesh Pogula, Uma Subhashini Ravuri, Naveen Ellanki, Harish Rajamani, Naveen Ambati.
- QUADS: Quantized Distillation Framework for Efficient Speech Language Understanding. Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam.
- A-SMiLE: Affective Sparse Mixture-of-Experts Adapter with Multi-Task Learning for Spoken Dialogue ModelsYi-Wen Chao, Yizhou Peng, Dianwen Ng, Yukun Ma, Chongjia Ni, Eng Siong Chng. [doi]
- TADA: Training-free Attribution and Out-of-Domain Detection of Audio DeepfakesAdriana Stan, David Combei, Dan Oneata, Horia Cucu. [doi]
- Creaky Voice Facilitates More Efficient Phonological Processing of Mandarin Tone 3Zixia Fan, Ronny Ibrahim, Joshua Penney, Felicity Cox. [doi]
- In-context Language Learning for Endangered Languages in Speech RecognitionZhaolin Li, Jan Niehues. [doi]
- Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning ModelsTuan Dat Phuong, Long Vu-Hoang, Huy Dat Tran. [doi]
- Backchannel prediction for natural spoken dialog systems using general speaker and listener informationYoshinori Fukunaga, Ryota Nishimura, Kengo Ohta, Norihide Kitaoka. [doi]
- On the cross-modal makeup of charisma: Insights from a field-data analysisOliver Niebuhr. [doi]
- Universal Preference-Score-based Pairwise Speech Quality AssessmentYu-Fei Shi, Yang Ai, Zhen-Hua Ling. [doi]
- Word-Level Error Analysis in Decoding Systems: From Speech Recognition to Brain-Computer InterfacesJingya Huang, Aashish N. Patel, Sowmya Manojna Narasimha, Gal Mishne, Vikash Gilja. [doi]
- Word Level Timestamp Generation for Automatic Speech Recognition and TranslationKe Hu, Krishna C. Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang 0012, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg. [doi]
- Zero-Shot Learning for Acoustic Event Classification Using an Attribute Vector and Conditional GANKohei Uehara, Ryoichi Takashima, Tetsuya Takiguchi. [doi]
- Introducing EMOPARKNZ: the Emotional Speech Database from New Zealand English Speakers with Parkinson's DiseaseItay Ben-Dom, Catherine I. Watson, Clare M. McCann. [doi]
- A Semantic Information-based Hierarchical Speech Enhancement Method Using Factorized Codec and Diffusion ModelYang Xiang, Canan Huang, Desheng Hu, Jingguang Tian, Xinhui Hu, Chao Zhang. [doi]
- Towards Efficiently Whisper Fine-tuning with Monotonic AlignmentsZiyang Zhuang, Tao Wei 0003, Ming Fang, Ning Cheng 0001, Shaojun Wang, Jing Xiao 0006. [doi]
- Efficient Streaming Speech Quality Prediction with Spiking Neural NetworksMattias Nilsson 0001, Riccardo Miccini, Julian Rossbroich, Clément Laroche, Tobias Piechowiak, Friedemann Zenke. [doi]
- The Role of Contextual Variation in Learning Cantonese Tones from Naturalistic SpeechFengyue Lisa Zhao, Jennifer Kuo. [doi]
- Representation of Perceived Prosodic Similarity of Conversational FeedbackLivia Qian, Carol Figueroa, Gabriel Skantze. [doi]
- Training Articulatory Inversion Models for Interspeaker ConsistencyCharles McGhee, Mark J. F. Gales, Kate M. Knill. [doi]
- Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count ControlYunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee. [doi]
- Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language ModelsKe-Han Lu, Chun-Yi Kuan, Hung-yi Lee. [doi]
- Mitigating Non-Target Speaker Bias in Guided Speaker EmbeddingShota Horiguchi, Takanori Ashihara, Marc Delcroix, Atsushi Ando, Naohiro Tawara. [doi]
- DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal SoundsTakuya Hasumi, Yusuke Fujita. [doi]
- Acoustic Representation and Realization of Weak Elements Subcategories: In the Case of Tianjin MandarinZhijie Li, Hui Feng. [doi]
- TalTech Systems for the Interspeech 2025 ML-SUPERB 2.0 ChallengeTanel Alumäe, Artem Fedorchenko. [doi]
- Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder ModelsNikola Ljubesic, Ivan Porupski, Peter Rupnik. [doi]
- A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish GaelicOndrej Klejch, William Lamb, Peter Bell 0001. [doi]
- Identifying Vocal and Facial Biomarkers of Depression in Large-Scale Remote Recordings: A Multimodal Study Using Mixed-Effects ModelingNelson Hidalgo Julia, Robert Lewis, Craig Ferguson, Simon Goldberg, Wendy Lau, Caroline Swords, Gabriela Valdivia, Christine D. Wilson-Mendenhall, Raquel Tartar, Rosalind Picard, Richard Davidson. [doi]
- CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation TokenizerDaiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada. [doi]
- Speechless: Speech Instruction Training Without Speech for Low Resource LanguagesAlan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip. [doi]
- Equivalence and differences: Formant patterns of labialization and pharyngealization in TashlhiytPhilipp Buech, Anne Hermes, Rachid Ridouane. [doi]
- MASV: Speaker Verification with Global and Local Context MambaYang Liu, Li Wan, Yiteng Huang, Ming Sun 0013, Xinhao Mei, Xubo Liu, Yangyang Shi, Florian Metze. [doi]
- Direct-path Relative Harmonic Coefficients Detection for Multi-source Direction-of-Arrival Estimation in Reverberant EnvironmentsLiang Tao, Maoshen Jia, Yonggang Hu. [doi]
- On the influence of language similarity in non-target speaker verification trialsPaul M. Reuter, Michael Jessen. [doi]
- On the reliability of feature attribution methods for speech classificationGaofei Shen, Hosein Mohebbi, Arianna Bisazza, Afra Alishahi, Grzegorz Chrupala. [doi]
- A Study on Speech Assessment with Visual CuesShafique Ahmed, Ryandhimas E. Zezario, Nasir Saleem, Amir Hussain 0001, Hsin-Min Wang, Yu Tsao 0001. [doi]
- Efficient Trie-based Biasing using K-step Prediction for Rare Word RecognitionKwok Chin Yuen, Jia Qi Yip. [doi]
- Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time OrderingIvan Medennikov, Taejin Park, Weiqing Wang, He Huang 0012, Kunal Dhawan, Jinhan Wang, Jagadeesh Balam, Boris Ginsburg. [doi]
- Recognizing Every Voice: Towards Inclusive ASR for Rural Bhojpuri WomenSakshi Joshi, Eldho Ittan George, Tahir Javed, Kaushal Bhogale, Nikhil Narasimhan, Mitesh M. Khapra. [doi]
- Probing Prosodic Differences Between Two Regional Varieties of Brazilian PortugueseGustavo Silveira, Aviad Albert, Martine Grice. [doi]
- Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languagesTuan Nguyen, Huy Dat Tran. [doi]
- FLASepformer: Efficient Speech Separation with Gated Focused Linear Attention TransformerHaoxu Wang, Yiheng Jiang, Gang Qiao, Pengteng Shi, Biao Tian. [doi]
- Directional Speech Recognition with Full-Duplex CapabilityJu Lin, Yiteng Huang, Ming Sun 0013, Frank Seide, Florian Metze. [doi]
- Deep-Simplex Multichannel Speech SeparationTzlil Avidan, Bracha Laufer-Goldshtein. [doi]
- Multimodal Emotion Diarization: Frame-Wise Integration of Text and Audio RepresentationsZiv Tamir, Thomas Thebaud, Jesús Villalba 0001, Najim Dehak, Oren Kurland. [doi]
- Constrained LDDMM for Dynamic Vocal Tract Morphing: Integrating Volumetric and Real-Time MRITharinda Piyadasa, Joan Glaunès, Amelia Gully, Michael Proctor, Kirrie J. Ballard, Tünde Szalay, Naeim Sanaei, Sheryl Foster, David Waddington, Craig T. Jin. [doi]
- Assessing the Performance and Efficiency of Mamba ASR in Low-Resource ScenariosRodolfo Zevallos, Martí Cortada Garcia, Sarah Solito, Carlos Mena, Alex Peiró Lilja, Javier Hernando. [doi]
- Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech RecognitionShunsuke Mitsumori, Sara Kashiwagi, Keitaro Tanaka, Shigeo Morishima. [doi]
- Self-Improvement for Audio Large Language Model using Unlabeled SpeechShaowen Wang, Xinyuan Chen, Yao Xu. [doi]
- Alzheimer's Disease Detection Using Co-Attention Mechanism for Acoustic and ASR-Transcribed Text FeaturesYongqi Shao 0001, Tao Fang. [doi]
- Test-Time Training for Speech EnhancementAvishkar Behera, Riya Ann Easow, Venkatesh Parvathala, K. Sri Rama Murty. [doi]
- HWB-Net: A Novel High-Performance and Efficient Hybrid Waveform Bandwidth Extension MethodXin Liu, Shulin He, Xueliang Zhang. [doi]
- Towards Pre-training an Effective Respiratory Audio Foundation ModelDaisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada. [doi]
- Rasmalai: Resources for Adaptive Speech Modeling in IndiAn Languages with Accents and IntonationsAshwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M. Khapra. [doi]
- DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language ModelsHeng-Jui Chang, Hongyu Gong, Changhan Wang, James R. Glass, Yu-An Chung. [doi]
- Source Verification for Speech DeepfakesViola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro. [doi]
- Cross-attention and Self-attention for Audio-visual Speaker Diarization in MISP-Meeting ChallengeZhaoyang Li, Haodong Zhou, Longjie Luo, Xiaoxiao Li, Yongxin Chen, Lin Li, Qingyang Hong. [doi]
- Multitask Learning with Fused Attention for Improved ASR and Mispronunciation Detection in Children's Speech Sound DisordersSelina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So. [doi]
- Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical ApplicationsCarol Y. Espy-Wilson. [doi]
- LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMsPooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli. [doi]
- Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language ModelsShunsuke Kando, Yusuke Miyao, Shinnosuke Takamichi. [doi]
- Stuttering Detection Based on Self-Attention Weights of Temporal Acoustic Vector SequenceGenzo Miyahara, Tsuneo Kato, Akihiro Tamura. [doi]
- Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and ActivationZhennan Lin, Kaixun Huang, Wei Ren, Linju Yang, Lei Xie. [doi]
- Evaluating Logit-Based GOP Scores for Mispronunciation DetectionAditya Kamlesh Parikh, Cristian Tejedor García, Catia Cucchiarini, Helmer Strik. [doi]
- MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie DubbingJunjie Zheng, Zihao Chen, Chaofan Ding, Yunming Liang, Yihan Fan, Huan Yang, Lei Xie, Xinhan Di. [doi]
- Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASRHongli Yang, Sheng Li 0010, Hao Huang 0009, Ayiduosi Tuohan, Yizhou Peng. [doi]
- DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning ObjectiveHyung-Gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe 0001, Ahmed Hussen Abdelaziz. [doi]
- Discrete Tokens Exhibit Interlanguage Speech Intelligibility Benefit: an Analytical Study Towards Accent-robust ASR Only with Native Speech DataKentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu. [doi]
- Dynamic Layer Gating for Speech EnhancementVenkatesh Parvathala, K. Sri Rama Murty. [doi]
- Towards Fusion of Neural Audio Codec-based Representations with Spectral for Heart Murmur Classification via Bandit-based Cross-Attention MechanismOrchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Santanu Roy, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTCQingzheng Wang, Jiancheng Sun, Yifan Peng 0003, Shinji Watanabe 0001. [doi]
- When focus shapes the flow: prosodic restructuring in Mandarin complex nominalsAnqi Xu 0002, Yu-Yin Hsu. [doi]
- Investigating Glottal Stop Coda Loss During Sound Change of Checked Syllables Based on Speech-EGG Voice Offset AlignmentBingliang Zhao, Xiyu Wu. [doi]
- Improving Audio Classification by Transitioning from Zero- to Few-ShotJames Taylor, Wolfgang Mack. [doi]
- Scaling beyond Denoising: Submitted System and Findings in URGENT Challenge 2025Zhihang Sun, Andong Li, Tong Lei, Rilin Chen, Meng Yu 0003, Chengshi Zheng, Yi Zhou 0014, Dong Yu 0001. [doi]
- Clinical Annotations for Automatic Stuttering Severity AssessmentAna Rita Valente, Rufael Marew, Hawau Olamide Toyin, Hamdan Al-Ali, Anelise Bohnen, Inma Becerra, Elsa Marta Soares, Gonçalo Leal, Hanan Aldarmaki. [doi]
- Voice-Based Dysphagia Detection: Leveraging Self-Supervised Speech RepresentationInjune Hwang, Jung-Min Kim, Ju Seok Ryu, Kyogu Lee. [doi]
- APTTS: Adversarial Post-training in Latent Flow Matching for Fast and High-fidelity Text-to-SpeechHyungchan Yoon, Chanwoo Lee, Hoodong Lee, Stanley Jungkyu Choi. [doi]
- GTA: Towards Generative Text-To-Audio Retrieval via Multi-Scale TokenizerMinghui Fang 0002, Shengpeng Ji, Jialong Zuo, Xize Cheng, Wenrui Liu 0003, Xiaoda Yang, Ruofan Hu, Jieming Zhu, Zhou Zhao 0001. [doi]
- Discrete Audio Representations for Automated Audio CaptioningJingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu. [doi]
- FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion DistillationTakuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo. [doi]
- A Dataset for Automatic Assessment of TTS Quality in SpanishAlejandro Sosa Welford, Leonardo Pepino. [doi]
- SawtArabi: A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-SwitchingVasista Sai Lodagala, Lamya Alkanhal, Daniel Izham, Shivam Mehta, Shammur Absar Chowdhury, Aqeelah Makki, Hamdy S. Hussein, Gustav Eje Henter, Ahmed Ali 0002. [doi]
- Voice Impression Control in Zero-Shot TTSKenichi Fujita, Shota Horiguchi, Yusuke Ijima. [doi]
- Age-related changes in multisensory integration of emotions in an audiovisual face-prosody-semantics Stroop taskYi Lin, Shumeng Ni, Yangfan Lu. [doi]
- OpusLM: A Family of Open Unified Speech Language ModelsJinchuan Tian, William Chen, Yifan Peng 0003, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Huck Yang, Shinji Watanabe 0001. [doi]
- L3C-DeepMFC: Low-Latency Low-Complexity Deep Marginal Feedback Cancellation with Closed-Loop Fine Tuning for Hearing AidsFengyuan Hao, Brian C. J. Moore, Huiyong Zhang, Xiaodong Li 0002, Chengshi Zheng. [doi]
- Codec-Based Deepfake Source Tracing via Neural Audio Codec TaxonomyXuanjun Chen, I-Ming Lin, Lin Zhang, Jiawei Du, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang. [doi]
- Phonetic Posteriorgram-Based Phoneme Selection for Vocal Cord Disorder Classification in Continuous Mandarin SpeechChih-Ning Chen, Yu-Lan Chuang, Ming-Jhang Yang, Wei-Cheng Hsu, Yung-An Tsou, Yi-Wen Liu. [doi]
- Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline ScenariosAswin Shanmugam Subramanian, Amit Das 0007, Naoyuki Kanda, Jinyu Li 0001, Xiaofei Wang 0007, Yifan Gong 0001. [doi]
- Spoken Language Understanding on Unseen Tasks With In-Context LearningNeeraj Agrawal, Sriram Ganapathy. [doi]
- DuRep: Dual-Mode Speech Representation Learning via ASR-Aware DistillationPrabash Reddy Male, Swayambhu Nath Ray, Harish Arsikere, Akshat Jaiswal, Prakhar Swarup, Prantik Sen, Debmalya Chakrabarty, K. V. Vijay Girish, Nikhil Bhave, Frederick Weber, Sambuddha Bhattacharya, Sri Garimella. [doi]
- Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learningShi-Xin Fang, Liang-Yeh Shen, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee. [doi]
- Eigenvoice Synthesis based on Model Editing for Speaker GenerationMasato Murata, Koichi Miyazaki, Tomoki Koriyama, Tomoki Toda. [doi]
- EnCodecMAE: leveraging neural codecs for universal audio representation learningLeonardo Pepino, Pablo Riera, Luciana Ferrer. [doi]
- Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source TracingYang Xiao, Rohan Kumar Das. [doi]
- Anomalous Sound Detection Based Feature Fusion and Dual-path Non-linear Independent Components EstimationYawei Wang, Qiaoling Zhang, Yi Zhang, Junyao Hu. [doi]
- ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and SpeechYu Pan 0008, Yanni Hu, Yuguang Yang 0005, Jixun Yao, Jianhao Ye, Hongbin Zhou, Lei Ma 0003, Jianjun Zhao 0001. [doi]
- Cantonese Punctuation Restoration using LLM Annotated DataKing Yiu Suen, Rudolf Chow, Albert Y. S. Lam. [doi]
- SIDC-KWS: Efficient Spiking Inception-Dilated Conformer with Self-Attention for Keyword SpottingJin-Gyo Lim, Seong-Eun Kim. [doi]
- DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask OptimizationGeonyoung Lee, Geonhee Han, Paul Hongsuck Seo. [doi]
- RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed ModelingLong-Khanh Pham, Thanh V. T. Tran, Minh-Tan Pham, Van Nguyen. [doi]
- Acoustic scattering AI for non-invasive object classifications: A case study on hair assessmentLong Vu-Hoang, Tuan Nguyen, Huy Dat Tran. [doi]
- Analysis of Avian Biphonic Vocalization Using Computational ModellingNoumida A, Rajeev Rajan. [doi]
- Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive CommunicationÉva Székely, Péter Mihajlik, Máté Soma Kádár, László Tóth 0001. [doi]
- CabinSep: IR-Augmented Mask-Based MVDR for Real-Time In-car Speech Separation with Distributed Heterogeneous ArraysRunduo Han, Yanxin Hu, Yihui Fu, Zihan Zhang, Yukai Jv, Li Chen, Lei Xie 0001. [doi]
- Mimic Blocker: Self-Supervised Adversarial Training for Voice Conversion Defense with Pretrained Feature ExtractorsGwangyeol Yu, Junhyeok Lee, Seoryeong Kim, Jimin Lee, Jehyuk Lee. [doi]
- Naturalness-Aware Curriculum Learning with Dynamic Temperature for Speech Deepfake DetectionTaewoo Kim, Guisik Kim, Choongsang Cho, Young Han Lee. [doi]
- Towards Secure User Authentication for Headphones via In-Ear or In-Earcup MicrophonesN. Shashaank, Xiao Quan, Andrew Kaluzny, Leonard Varghese, Marko Stamenovic, Chuan-Che Huang. [doi]
- A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech RecognitionShiyao Wang, Jiaming Zhou, Shiwan Zhao, Yong Qin. [doi]
- STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and AttributionAnton Firc, Manasi Chhibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka. [doi]
- Intrasentential English in Swedish TTS: perceived English-accentednessChristina Tånnander, David House, Jonas Beskow, Jens Edlund. [doi]
- Improving Low-Resource Dialect Classification Using Retrieval-based Voice ConversionLea Fischbach, Akbar Karimi 0001, Caroline Kleen, Alfred Lameli, Lucie Flek. [doi]
- DiffStereo: End-to-End Mono-to-Stereo Audio Generation with Diffusion TransformerSuqi Zhang, Zheqi Dai, Yongyi Zang, Yin Cao, Qiuqiang Kong. [doi]
- Variability in performance across four generations of automatic speaker recognition systemsLauren Harrington, Vincent Hughes, Philip Harrison, Paul Foulkes, Jessica Wormald, Finnian Kelly, David van der Vloed. [doi]
- A simple method for predicting Clinical Scores in Huntington's Disease by leveraging ASR's uncertainty on spontaneous speechHadrien Titeux, Quang Tuan Rémy Nguyen, Andres Gil-Salcedo, Anne-Catherine Bachoud-Lévi, Emmanuel Dupoux. [doi]
- On the Relationship between Accent Strength and Articulatory FeaturesKevin Huang, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan. [doi]
- AfriHuBERT: A self-supervised speech representation model for African languagesJesujoba O. Alabi, Xuechen Liu, Dietrich Klakow, Junichi Yamagishi. [doi]
- Explainable Speech Emotion Recognition Through Attentive Pooling: Insights from Attention-Based Temporal LocalizationTahitoa Leygue, Astrid Sabourin, Christian Bolzmacher, Sylvain Bouchigny, Margarita Anastassova, Quoc-Cuong Pham. [doi]
- Teacher-Free Knowledge Distillation for Improving Short-Utterance Spoken Language IdentificationSpandan Dey, Hirak Mondal, Sanjay Kumar Kurmi. [doi]
- Explainable Depression Detection using Masked Hard Instance MiningPatawee Prakrankamanant, Shinji Watanabe 0001, Ekapol Chuangsuwanich. [doi]
- Vela: Scalable Embeddings with Voice Large Language Models for Multimodal RetrievalRuofan Hu, Yan Xia 0006, Minjie Hong, Jieming Zhu, Bo Chen 0023, Xiaoda Yang, Minghui Fang 0002, Tao Jin 0004. [doi]
- Whisper-Based Multilingual Alzheimer's Disease Detection and Improvements for Low-Resource LanguageKaichen Jia, Jinpeng Li, Ke Li, Wei-Qiang Zhang. [doi]
- Unified Variational and Physics-aware Model for Room Impulse Response EstimationLouis Lalay, Mathieu Fontaine 0002, Roland Badeau. [doi]
- Frequency-Domain Enhanced Extreme Bandwidth Extension Network with ICCRN for Superior Speech QualityHongtao Bao, Xueliang Zhang. [doi]
- From Pretraining to Performance: Benchmarking Self-Supervised Speech Models for Interspeech-25 SER ChallengeDrishya Uniyal, Vinayak Abrol. [doi]
- DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow MatchingWei Chen 0071, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu 0001. [doi]
- BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight IndexingMasaya Kawamura, Takuya Hasumi, Yuma Shirahata, Ryuichi Yamamoto. [doi]
- Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context LearningJunchuan Zhao, Xintong Wang, Ye Wang. [doi]
- Adaptive Differential Denoising for Respiratory Sounds ClassificationGaoyang Dong, Zhicheng Zhang, Ping Sun, Minghui Zhang. [doi]
- How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does NotFrancesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane. [doi]
- Investigating the Impact of Word Informativeness on Speech Emotion RecognitionSofoklis Kakouros. [doi]
- Towards High-Quality LLM-Based Data for French Spontaneous Speech Simplification: an Exo-Refinement ApproachLucia Ormaechea Grijalba, Nikos Tsourakis, Pierrette Bouillon, Benjamin Lecouteux, Didier Schwab. [doi]
- Web-Based Application for Real-Time Biofeedback of Vocal Resonance in Gender-Affirming Voice Training: Design and Usability EvaluationTara McAllister, Collin Eagen, Yi Shan, Peter Traver, Daphna Harel, Tae Hong Park, Vesna D. Novak. [doi]
- Dhvani: A Weakly-supervised Phonemic Error Detection and Personalized Feedback System for HindiArnav Rustagi, Satvik Bajpai, Nimrat Kaur, Siddharth 0001. [doi]
- Finetune Large Pre-Trained Model Based on Frequency-Wise Multi-Query Attention Pooling for Anomalous Sound DetectionNan Jiang 0022, Yan Song 0001, Qing Gu 0002, Haoyu Song, Lirong Dai 0001, Ian McLoughlin 0001. [doi]
- Multivariate Probabilistic Assessment of Speech QualityFredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee. [doi]
- Leveraging Information Retrieval to Enhance Spoken Language Understanding Prompts in Few-Shot LearningPierre Lepagnol, Sahar Ghannay, Thomas Gerald, Christophe Servan, Sophie Rosset. [doi]
- Speech Reduction in French: The Relationship Between Vowel Space and Articulation DynamicsKübra Bodur, Corinne Fredouille, Christine Meunier. [doi]
- FaVC: A Validated, Transcribed, Parallel Farsi Speech Dataset for Voice ConversionMina Serajian, Saeed Najafzadeh Rahaghi, Hadi Veisi, Saman Haratizadeh. [doi]
- Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech TranslationChenyang Le, Yinfeng Xia, Huiyan Li, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian. [doi]
- Towards a dynamical model of transitions between fluent and stuttered speechYijing Lu, Khalil Iskarous, Louis Goldstein. [doi]
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from VideosYuchi Ishikawa, Shota Nakada, Hokuto Munakata, Kazuhiro Saito, Tatsuya Komatsu, Yoshimitsu Aoki. [doi]
- In-context learning capabilities of Large Language Models to detect suicide risk among adolescents from speech transcriptsFilomene Roquefort, Alexandre Ducorroy, Rachid Riad. [doi]
- Disentangling Dual-Encoder Masked Autoencoder for Respiratory Sound ClassificationPeidong Wei, Shiyu Miao, Lin Li. [doi]
- Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancerBence Mark Halpern, Thomas Tienkamp, Teja Rebernik, Rob J. J. H. van Son, Martijn Wieling 0001, Defne Abur, Tomoki Toda. [doi]
- Online AV-CrossNet: a Causal and Efficient Audiovisual System for Speech Enhancement and Target Speaker ExtractionCheng Yu, Vahid Ahmadi Kalkhorani, Buye Xu, DeLiang Wang. [doi]
- Talker Normalization in Chinese Bilinguals: A Comparative StudyMingxi Lu, Ran Tao, Yujia Tian. [doi]
- Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 ChallengeShangkun Huang, Yuxuan Du, Jingwen Yang, Dejun Zhang, Xupeng Jia, Jing Deng, Jintao Kang, Rong Zheng. [doi]
- Bilingual Speakers Exhibit Cognitive Fatigue: A Speech Disfluencies Case Study on Research TalksAshwin Ram, Marisol Muñoz, Zoi Gkalitsiou, Alexandros G. Dimakis. [doi]
- Articulatory Feature Prediction from Surface EMG during Speech ProductionJihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica González Machorro, Yoonjeong Lee, Björn W. Schuller, Louis Goldstein, Shrikanth Narayanan. [doi]
- Vocal-tract model with two directions: Static design for a dummy head and dynamic design for a speaking machineTakayuki Arai. [doi]
- Exploratory Study of Filled Pauses in Ukrainian Language: Phonetic Properties of Filled PausesAnna Havras, Carlos Mendes, Helena Moniz, Gueorgui Hristovsky, João Miranda. [doi]
- Can Speech Accurately Detect Depression in Patients With Comorbid Dementia? An Approach for Mitigating Confounding Effects of Depression and DementiaSophie Young, Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Markus Reuber, Heidi Christensen. [doi]
- Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic ScenariosJakob Kienegger, Timo Gerkmann. [doi]
- Dysarthric Speech Recognition Using Curriculum Learning and Multi-stream ArchitectureI-Ting Hsieh, Chung-Hsien Wu 0001. [doi]
- Exploring Efficient Directional and Distance Cues for Regional Speech SeparationYiheng Jiang, Haoxu Wang, Yafeng Chen, Gang Qiao, Biao Tian. [doi]
- Leveraging Unlabeled Audio for Audio-Text Contrastive Learning via Audio-Composed Text FeaturesTatsuya Komatsu, Hokuto Munakata, Yuchi Ishikawa. [doi]
- Enhancing Acoustic-to-Articulatory Inversion with Multi-Target Pretraining for Low-Resource SettingsJesuraj Bandekar, Prasanta Kumar Ghosh. [doi]
- Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse ResponsesChristopher Ick, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Jonathan Le Roux. [doi]
- LLM-based phoneme-to-grapheme for phoneme-based speech recognitionTe Ma, Min Bi, Saierdaer Yusuyin, Hao Huang, Zhijian Ou. [doi]
- Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual ForgeriesMinyoung Kim, Sehwan Park, Sungmin Cha, Paul Hongsuck Seo. [doi]
- From Sharpness to Better Generalization for Speech Deepfake DetectionWen Huang 0004, Xuechen Liu, Xin Wang 0037, Junichi Yamagishi, Yanmin Qian. [doi]
- LIST: Language-Independent Speech Token for Multilingual Speech Synthesis with Language ModelsChang Liu, Zhen-Hua Ling, Yu Gu. [doi]
- ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature OptimizationPengyu Ren, Wenhao Guan, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li. [doi]
- VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware EvaluationYubin Kim 0002, Taehan Kim, Wonjune Kang, Eugene Park, Joonsik Yoon, Dongjae Lee, Xin Liu 0034, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park 0001. [doi]
- Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational ModelingTianTian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan. [doi]
- ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual InputsEray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido 0001, Abeer Alwan. [doi]
- Ultra-Low Bit Post-Training Quantization of Large Speech Models via K-Means Clustering and Mixed Precision AllocationTianteng Gu, Bei Liu 0003, Haoyu Wang 0007, Yanmin Qian. [doi]
- Interactive Fusion of Multi-View Speech Embeddings via Pretrained Large-Scale Speech Models for Speech Emotional Attribute Prediction in Naturalistic ConditionsYuyun Liu, Yujia Gu, Jiahao Luo, Wenming Zheng, Cheng Lu 0005, Yuan Zong. [doi]
- Oral Reading Errors by Grade 3 Children in Indian Schools: A Hindi-English PerspectiveSneha Raman, Preeti Rao. [doi]
- Optimizing Pause Context in Fine-Tuning Pre-trained Large Language Models for Dementia DetectionXiaoquan Ke, Man-Wai Mak, Helen Meng. [doi]
- "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language UnderstandingAlkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis. [doi]
- EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion TransformerJiarui Hai, Yong Xu 0004, Hao Zhang 0112, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu 0001. [doi]
- From Speech Science to Language TransparenceAlexander Waibel. [doi]
- Influence of Proficiency and L2 Experience on Dynamic Spectral Cue Utilization in L2 Vowel Perception and ProductionLinda Bakkouche, Brechtje Post. [doi]
- An Effective Anomalous Sound Detection Method Based on Global and Local Attribute MiningNan Jiang 0022, Yan Song 0001, Qing Gu 0002, Haoyu Song, Lirong Dai 0001, Ian McLoughlin 0001. [doi]
- Multimodal and Multitask Learning for Predicting Multiple Scores in L2 English SpeechSeHyun Oh, SunHee Kim, Minhwa Chung. [doi]
- Acoustic and Linguistic Biomarkers for Cognitive Impairment Detection from SpeechCatarina Botelho, David Gimeno-Gómez, Francisco Teixeira, John Mendonça, Patrícia Pereira, Diogo A. P. Nunes, Thomas Rolland, Anna Pompili, Rubén Solera-Ureña, Maria Ponte, David Martins de Matos, Carlos D. Martínez-Hinarejos, Isabel Trancoso, Alberto Abad. [doi]
- Location-Aware Target Speaker Extraction for Hearing AidsDaniel-José Alcala Padilla, Nils L. Westhausen, Swati Vivekananthan, Bernd T. Meyer. [doi]
- Towards Robust Speaker Recognition against Intrinsic Variation with Foundation Model Few-shot Tuning and Effective Speech SynthesisZhiyong Chen, Shuhang Wu, Xinnuo Li, Zhiqi Ai, Shugong Xu. [doi]
- First Steps Towards Voice Anonymization for Code-Switching SpeechSarina Meyer, Ekaterina Kolos, Ngoc Thang Vu. [doi]
- Lexical stress affects lenition: The case of Italian palato-alveolar affricatesBowei Shao, Philipp Buech, Anne Hermes, Maria Giavazzi. [doi]
- Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark ExtractionXiangyu Zhang 0005, Daijiao Liu, Tianyi Xiao, Cihan Xiao, Tünde Szalay, Mostafa Shahin, Beena Ahmed, Julien Epps. [doi]
- Improving Generalization of End-to-End ASR through Diversity and Independence RegularizationYe-Eun Ko, Mun-Hak Lee, Dong-hyun Kim, Joon-Hyuk Chang. [doi]
- Assessment of the synthetic quality and controllability of laughing onset in speech-laugh synthesisRyo Setoguchi, Yoshiko Arimoto. [doi]
- Is Synthetic Data Truly Effective for Training Speech Language Models?Tomoya Mizumoto, Atsushi Kojima, Yusuke Fujita, Lianbo Liu, Yui Sudo. [doi]
- Harnessing Text-to-Speech Voice Cloning Models for Improved Audiological Speech AssessmentLidea Shahidi, Erdem Baha Topbas, Thu Ngan Dang, Tobias Goehring. [doi]
- Acoustic Detection of UAV Abnormality Using One Ground-Based Acoustic Vector SensorDengjian Zhou, Jianghan Hai, Sijia Liao, Yue Ivan Wu, Kainam Thomas Wong, Xiujuan Zheng. [doi]
- Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step SamplingQixi Zheng, Yushen Chen, Zhikang Niu, Ziyang Ma 0001, Xiaofei Wang, Kai Yu 0004, Xie Chen 0001. [doi]
- GenECA: A General-Purpose Framework for Real-Time Adaptive Multimodal Embodied Conversational AgentsSantosh V. Patapati, Aashrith Tatineni, Trisanth Srinivasan. [doi]
- Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker DiarizationJiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Díez, Jan Cernocký, Lukás Burget. [doi]
- Comparison-Based Automatic Evaluation for Meeting SummarizationZiwei Gong, Lin Ai, Harsh Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg. [doi]
- AdaKWS: Towards Robust Keyword Spotting with Test-Time AdaptationYang Xiao, Tianyi Peng, Yanghao Zhou, Rohan Kumar Das. [doi]
- A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR ConditionsZheng Wang, Xiaobin Rong, Yu Sun, Tianchi Sun, Zhibin Lin, Jing Lu. [doi]
- Overestimated performance of auditory attention decoding caused by experimental design in EEG recordingsYujie Yan, Xiran Xu, Haolin Zhu, Songyi Li, Bo Wang 0110, Xihong Wu, Jing Chen 0019. [doi]
- Automatic Labeling and Correction of Noisy Labels for Robust Self-Supervised Speaker VerificationAbderrahim Fathan, Jahangir Alam 0001. [doi]
- Pushing the Frontiers of Self-Distillation Prototypes Network with Dimension Regularization and Score NormalizationYafeng Chen, Chong Deng, Hui Wang, Yiheng Jiang, Han Yin, Qian Chen, Wen Wang. [doi]
- ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance TonalityYu-Xiang Luo, Yi-Cheng Lin, Ming-To Chuang, Jia-Hung Chen, I-Ning Tsai, Pei Xing Kiew, Yueh-Hsuan Huang, Chien-Feng Liu, Yu-Chen Chen, Bo-Han Feng, Wenze Ren, Hung-yi Lee. [doi]
- VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and MandarinZhiqi Ai, Meixuan Bao, Zhiyong Chen, Zhi Yang, Xinnuo Li, Shugong Xu. [doi]
- Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech modelsParismita Gogoi, Sishir Kalita, Wendy Lalhminghlui, Viyazonuo Terhiija, Moakala Tzudir, Priyankoo Sarmah, S. R. M. Prasanna. [doi]
- Rhotic Articulation in Australian English: Insights from MRIMichael Proctor, Tünde Szalay, Tharinda Piyadasa, Craig T. Jin, Naeim Sanaei, Amelia Gully, David Waddington, Sheryl Foster, Kirrie J. Ballard. [doi]
- Long-Context Speech Synthesis with Context-Aware MemoryZhipeng Li, Xiaofen Xing, Jingyuan Xing, Hangrui Hu, Heng Lu, Xiangmin Xu. [doi]
- Data-driven approaches to pitch modelling in two Mexican Spanish ethnolects: K-means Clustering & GAMMsGilly Marchini, Jeremy Steffman. [doi]
- NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR DecodingVladimir Bataev, Andrei Andrusenko, Lilit Grigoryan, Aleksandr Laptev, Vitaly Lavrukhin, Boris Ginsburg. [doi]
- Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel DataQibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li. [doi]
- Processing of grammatical information in cochlear implant simulated speech by German adult listenersAtty Schouwenaars, Esther Ruigendijk. [doi]
- Continual Speech Learning with Fused Speech FeaturesGuitao Wang, Jinming Zhao, Hao Yang, Guilin Qi, Tongtong Wu, Gholamreza Haffari. [doi]
- HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance RegularizationHyebin Ahn, Kangwook Jang, Hoirin Kim. [doi]
- On Enhancing the Performance of Children's ASR Task in Limited Data ScenarioAnkita, Shambhavi, Syed Shahnawazuddin. [doi]
- Co-Speech Motion for Virtual Agents in Dialogue Using LLM-Driven Primitive Action SelectionMuhammad Yeza Baihaqi, Angel García Contreras, Seiya Kawano, Koichiro Yoshino. [doi]
- LightL2S: Ultra-Low Complexity Lip-to-Speech Synthesis for Multi-Speaker ScenariosYifan Liang, Kang Yang, Fangkun Liu, Andong Li, Xiaodong Li 0002, Chengshi Zheng. [doi]
- CBA-Whisper: Curriculum Learning-Based AdaLoRA Fine-Tuning on Whisper for Low-Resource Dysarthric Speech RecognitionTianyi Tan, Xin'an Chen, Xiaohuai Le, Wenzhi Fan, Xianjun Xia, Chuanzeng Huang, Jing Lu. [doi]
- Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating NasalitySaba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Y. Espy-Wilson. [doi]
- Scaling and Prompting for Improved End-to-End Spoken Grammatical Error CorrectionMengjie Qian 0001, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J. F. Gales. [doi]
- Articulatory Strategy in Vowel Production as a Basis for Speaker DiscriminationJustin J. H. Lo, Patrycja Strycharczuk, Sam Kirkham. [doi]
- WAKE: Watermarking Audio with Key EnrichmentYaoxun Xu, Jianwei Yu, Hangting Chen, Zhiyong Wu, Xixin Wu, Dong Yu, Rongzhi Gu, Yi Luo. [doi]
- Towards a Japanese Full-duplex Spoken Dialogue SystemAtsumoto Ohashi, Shinya Iizuka, Jingjing Jiang, Ryuichiro Higashinaka. [doi]
- Pathology-Aware Speech Encoding and Data Augmentation for Dysarthric Speech RecognitionIlja Baumann, Dominik Wagner 0002, Korbinian Riedhammer, Tobias Bocklet. [doi]
- Beyond Similarity Scoring: Detecting Entailment and Contradiction in Multilingual and Multimodal ContextsOthman Istaiteh, Salima Mdhaffar, Yannick Estève. [doi]
- End-to-End Diarization utilizing Attractor Deep ClusteringDavid Palzer, Matthew Maciejewski, Eric Fosler-Lussier. [doi]
- Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQYunkee Chae, Kyogu Lee. [doi]
- The Prosodic Characteristics of Standard Chinese Rhetorical Questions in Naturalistic SettingsShuwen Chen, Qingke Sun, Yue Huang, Yingyi Luo. [doi]
- Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS ModelsKyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi. [doi]
- Semi-Supervised Learning for Automatic Speech Recognition with Word Error Rate Estimation and Targeted Domain Data SelectionChanho Park, Thomas Hain. [doi]
- Automatic classification of stop realisation with wav2vec2.0James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Jeff Mielke, Tyler Kendall. [doi]
- Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLMDariia Puhach, Amir H. Payberah, Éva Székely. [doi]
- Reconstruction of the Complete Vocal Tract Contour Through Acoustic to Articulatory Inversion Using Real-Time MRI DataSofiane Azzouz, Pierre-André Vuissoz, Yves Laprie. [doi]
- A Silent Speech Decoding System from EEG and EMG with Heterogenous Electrode ConfigurationsMasakazu Inoue, Motoshige Sato, Kenichi Tomeoka, Nathania Nah, Eri Hatakeyama, Kai Arulkumaran, Ilya Horiguchi, Shuntaro Sasai. [doi]
- Skip-Salsa: Skip Synchronous Fusion of ASR LLM DecodersAshish R. Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi. [doi]
- An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applicationsSujith Pulikodan, Sahapthan K, Prasanta Kumar Ghosh, Visruth Sanka, Nihar Desai. [doi]
- Mitigating Overfitting During Speech Foundation Model Fine-tuning: Applications to Dysarthric Speech DetectionYan Xiong 0002, Visar Berisha, Julie Liss, Chaitali Chakrabarti. [doi]
- Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource ScenariosGerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando. [doi]
- H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical IndexingAkanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora 0001. [doi]
- Improving Automatic Speech Recognition for Children's Reading Assessment with Disfluency-aware Language ModelsJazmín Vidal, Luciana Ferrer, Juan Esteban Kamienkowski, Pablo Riera. [doi]
- Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI ApproachesBornali Phukon, Xiuwen Zheng 0003, Mark Hasegawa-Johnson. [doi]
- Advancing Pediatric ASR: The Role of Voice Generation in Disordered SpeechKaren Rosero, Ali N. Salman, Shreeram Suresh Chandra, Berrak Sisman, Cortney Van't Slot, Alex A. Kane, Rami R. Hallac, Carlos Busso. [doi]
- Subtyping Speech Errors in Childhood Speech Sound Disorders with Acoustic-to-Articulatory Speech InversionNina R. Benway, Saba Tabatabaee, Benjamin Munson, Jonathan Preston, Carol Y. Espy-Wilson. [doi]
- MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion RecognitionHyo Jin Jon, Longbin Jin, Hyuntaek Jung, Hyunseo Kim 0003, Donghun Min, Eun Yi Kim. [doi]
- Simultaneous Speech Translation Integrated Compact Multiple Sound Spot Synthesis System On A Laptop Carried Out With A BackpackTakuma Okamoto, Michiyo Kono. [doi]
- Enhancing Syllabic Recognition via Speech-EEG Phase Analysis and Non-Activity State ModelingRini A. Sharon, Hema A. Murthy. [doi]
- Evaluating ASR Robustness to Spontaneous Speech Errors: A Study of WhisperX Using a Speech Error DatabaseJohn Alderete, Macarious Kin Fung Hui, Aanchan Mohan. [doi]
- Multichannel Keyword Spotting for Noisy ConditionsDzmitry Saladukha, Ivan Koriabkin, Kanstantsin Artsiom, Aliaksei Rak, Nikita Ryzhikov. [doi]
- Hybrid Data Sampling for ASR: Integrating Acoustic Diversity and Transcription UncertaintyKomei Hiruta, Yosuke Yamano, Hideaki Tamori. [doi]
- SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcriptionRaymond Grossman, Taejin Park, Kunal Dhawan, Andrew Titus, Sophia Zhi, Yulia Shchadilova, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg. [doi]
- NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech DataTahir Javed, Kaushal Santosh Bhogale, Mitesh M. Khapra. [doi]
- Leveraging SSL Speech Features and Mamba for Enhanced DeepFake DetectionHoan My Tran, Damien Lolive, David Guennec, Aghilas Sini, Arnaud Delhay, Pierre-François Marteau. [doi]
- Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-TuningÖmer Tarik Özyilmaz, Matt Coler, Matias Valdenegro-Toro. [doi]
- MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASRDimitrios Damianos, Georgios Paraskevopoulos, Alexandros Potamianos. [doi]
- Speech Enhancement with Dual-path Multi-Channel Linear Prediction Filter and Multi-norm BeamformingChengyuan Qin, Wenmeng Xiong, Jing Zhou, Maoshen Jia, Changchun Bao. [doi]
- Representing Speech Through Autoregressive Prediction of Cochlear TokensGreta Tuckute, Klemen Kotar, Evelina Fedorenko, Daniel Yamins. [doi]
- HK-GenSpeech: A Generative AI Scene Creation Framework for Speech Based Cognitive AssessmentVi Jun Sean Yong, Serkan Kumyol, Pau Le Lisa Low, Winnie Suk Wai Leung, Tristan Braud. [doi]
- Supralaryngeal Kinematics of Implosives in Central Vietnamese: An EMA StudyPaul McGuire, Kye Shibata, Thanh Viet Cao, Feng-fan Hsieh, Yueh-Chin Chang. [doi]
- Vector Quantized Cross-lingual Unsupervised Domain Adaptation for Speech Emotion RecognitionPravin Mote, Donita Robinson, Elizabeth Richerson, Carlos Busso. [doi]
- On the Design of a Robust Superdirective Beamformer and Topology Parameter Optimization with Frustum-Shaped Microphone Arrays Featuring Multiple RingsKunlong Zhao, Gongping Huang, Xudong Zhao, Jingdong Chen, Jacob Benesty, Zoran Cvetkovic. [doi]
- Distilling a speech and music encoder with task arithmeticFabian Ritter Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M. Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee. [doi]
- Temporal organization of prenuclear glides in Hefei MandarinYifan Yang, Zhiheng Qian. [doi]
- Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic DataYu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang 0001, Xie Chen 0001. [doi]
- Attention Models and Auditory Transduction Features for Noise RobustnessCathal Ó Faoláin, Andrew Hines. [doi]
- How to Recover Long Audio Sequences Through Gradient Inversion Attack With Dynamic Segment-based ReconstructionXijie Zeng, Frank Rudzicz. [doi]
- Differentiable K-means for Fully-optimized Discrete Token-based ASRKentaro Onda, Yosuke Kashiwagi, Emiru Tsunoo, Hayato Futami, Shinji Watanabe 0001. [doi]
- Scalable Spontaneous Speech Dataset (SSSD): Crowdsourcing Data Collection to Promote Dialogue ResearchZaid Sheikh, Shuichiro Shimizu, Siddhant Arora, Jiatong Shi, Samuele Cornell, Xinjian Li, Shinji Watanabe 0001. [doi]
- Articulatory clarity and variability before and after surgery for tongue cancerThomas Tienkamp, Fleur van Ast, Roos van der Veen, Teja Rebernik, Raoul Buurke, Nikki Hoekzema, Katharina Polsterer, Hedwig Sekeres, Rob van Son, Martijn Wieling 0001, Max J. H. Witjes, Sebastiaan A. H. J. de Visscher, Defne Abur. [doi]
- Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative AnalysisTianyi Xu, Hongjie Chen 0001, Qing Wang 0039, Hang Lv 0006, Jian Kang 0006, Jie Li 0001, Zhennan Lin, Yongxiang Li, Lei Xie 0001. [doi]
- Evaluating the suitability of acoustic parameters for capturing breathy voice in non-pathological female speakersChloe Patman, Paul Foulkes, Kirsty McDougall. [doi]
- Investigating effects of sex hormones, cycle phases and age on female fundamental frequencyMelanie Weirich, Adrian P. Simpson. [doi]
- Analyzing the Impact of Accent on English Speech: Acoustic and Articulatory PerspectivesGowtham Premananth, Vinith Kugathasan, Carol Y. Espy-Wilson. [doi]
- Speaker Separation for an Unknown Number of Speakers with Encoder-Decoder-Based Contextual Information ModuleXue Yang, Guiru Shen, Yu Yang. [doi]
- Speech-guided Grapheme-to-Phoneme Conversion for Cantonese Text-to-SpeechTimothy Shin Heng Mak, King Yiu Suen, Albert Y. S. Lam. [doi]
- Automatic Dialectal Transcription: An Evaluation on Finnish and NorwegianOlli Kuparinen. [doi]
- Generalizable Audio Spoofing Detection using Non-Semantic RepresentationsArnab Das, Yassine El Kheir, Carlos Franzreb, Tim Herzig, Tim Polzehl, Sebastian Möller 0001. [doi]
- Turing's Echo: Investigating Linguistic Sensitivity of Deepfake Voice Detection via GamificationBinh Nguyen, Thai Le. [doi]
- Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text NormalizationLuong Ho, Khanh Le, Vinh Pham, Bao Nguyen, Tan Tran, Duc Chau. [doi]
- MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase ReconstructionMohammed Salah Al-Radhi, Géza Németh, Branislav Gerazov. [doi]
- The Faetar Speech Recognition BenchmarkMichael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar. [doi]
- Unified Audio-Visual Modeling for Recognizing Which Face Spoke When and What in Multi-Talker Overlapped Speech and VideoNaoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura. [doi]
- VIB-based Real Pre-emphasis Audio Deepfake Source TracingThien-Phuc Doan, Kihun Hong, Souhwan Jung. [doi]
- Exploring the Power of Empirical Mode Decomposition for Sensing the Sound of Silence: A Pilot Study on Mice Autism Detection via Ultrasonic VocalisationChenhao Wu 0004, Xiangjun Cai, Haojie Zhang, Tianrui Jia, Yilu Deng, Kun Qian 0003, Björn W. Schuller, Yoshiharu Yamamoto, Jiang Liu 0005. [doi]
- Revisiting WFST-based Hybrid Japanese Speech Recognition System for Individuals with Organic Speech DisordersNaoki Hojo, Ryoichi Takashima, Chihiro Sugiyama, Nobukazu Tanaka, Kanji Nohara, Kazunori Nozaki, Tetsuya Takiguchi. [doi]
- What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-trainingMarianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem H. Zuidema, Martijn Bentum. [doi]
- Multi-Channel Acoustic Echo Cancellation Based on Direction-of-Arrival EstimationFei Zhao, Xueliang Zhang 0001, Zhong-Qiu Wang 0001. [doi]
- Multimodal Silent Recognition of Phonemes Using Radar and Optopalatographic Silent Speech InterfacesJoão Menezes, Aubin Mouras, Arne-Lukas Fietkau, Dani Kazzy, Peter Birkholz. [doi]
- CBA: Backdoor Attack on Deep Speech Classification via Audio CompressionYuheng Huang, Ying Ren, Wenjie Zhang, Diqun Yan. [doi]
- Power Spectral Density Estimation for Acoustic Source Separation Using A Spherical Microphone ArrayLiang Tao, Maoshen Jia, Yonggang Hu. [doi]
- Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech CorporaKentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu. [doi]
- Online Audio-Visual Autoregressive Speaker ExtractionZexu Pan, Wupeng Wang, Shengkui Zhao, Chong Zhang 0003, Kun Zhou 0003, Yukun Ma, Bin Ma 0001. [doi]
- From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation ModelsAsim Ersoy, Basel Ahmad Mousi, Shammur Absar Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani. [doi]
- How sibilant spectra shape gender perception in prepubertal children: A voice morphing studyRiccarda Funk, Melanie Weirich, Adrian P. Simpson. [doi]
- Neural Speech Extraction with Human FeedbackMalek Itani, Ashton Graves, Sefik Emre Eskimez, Shyamnath Gollakota. [doi]
- A Chinese Heart Failure Status Speech Database with Universal and Personalised ClassificationYue Pan, Liwei Liu, Changxin Li, Xingyao Wang, Yili Xia, Hanyue Zhang, Ming Chu. [doi]
- Universal Semantic Disentangled Privacy-preserving Speech Representation LearningBiel Tura Vecino, Subhadeep Maji, Aravind Varier, Antonio Bonafonte, Ivan Valles, Michael Owen, Constantinos Papayiannis, Leif Rädel, Grant P. Strimel, Oluwaseyi Feyisetan, Roberto Barra-Chicote, Ariya Rastrow, Volker Leutnant, Trevor Wood. [doi]
- Beyond Traditional Speech Modifications: Utilizing Self Supervised Features for Enhanced Zero-Shot Children ASRAbhijit Sinha, Hemant Kumar Kathania, Mikko Kurimo. [doi]
- Sentence-Final Particles in Mandarin Child-Directed Speech: Frequency and Impact on Speech RateYizhi Liu, Luyuan Geng, Yan Gu, Mengru Han. [doi]
- Towards atypical speech transcription using LLM-based ASRJinda Zhang, Aanchan Mohan. [doi]
- InfiniteAudio: Infinite-Length Audio Generation with ConsistencyChaeyoung Jung, Hojoon Ki, Ji-Hoon Kim, Junmo Kim, Joon Son Chung. [doi]
- Understanding Dementia Speech Alignment with Diffusion-Based Image GenerationMansi, Anastasios Lepipas, Dominika C. Woszczyk, Yiying Guan, Soteris Demetriou. [doi]
- FUSE-MOS: Fusion of Speech Embeddings for MOS Prediction with Uncertainty QuantificationEnjamamul Hoq, Nikhil Gupta, Danielle Omondi, Ifeoma Nwogu. [doi]
- Advancing Emotion Recognition via Ensemble Learning: Integrating Speech, Context, and Text RepresentationsXiaohan Shi, Jinyi Mi, Xingfeng Li 0001, Tomoki Toda. [doi]
- Cross-corpus open-set Speech Emotion Recognition Method Based on Spatiotemporal Features with Inverse-Entropy RegularizationZhaohui Zhou, Hui Luo. [doi]
- AxLSTMs: learning self-supervised audio representations with xLSTMsSarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan. [doi]
- Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech RecognitionDominik Wagner 0002, Ilja Baumann, Natalie Engert, Seanie Lee, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet. [doi]
- Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-TuningLong Mai, Julie Carson-Berndsen. [doi]
- Restoring Harmonics: Enhancing Speech Quality with Deep Mask and Harmonic Restoration NetworkYu Zhao, Zengqiang Shang, Mou Wang, Xin Liu, Pengyuan Zhang. [doi]
- The State Of TTS: A Case Study with Human Fooling RatesPraveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra. [doi]
- The 1st SpeechWellness Challenge: Detecting Suicide Risk Among AdolescentsWen Wu 0007, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou 0001, Runsen Chen, Chao Zhang 0031. [doi]
- SpeechMLC: Speech Multi-label ClassificationMiseul Kim, Seyun Um, Hyeonjin Cha, Hong-Goo Kang. [doi]
- Meta-Learning Approaches for Speaker-Dependent Voice Fatigue ModelsRoseline Polle, Agnes Norbury, Alexandra Livia Georgescu, Nicholas Cummins, Stefano Goria. [doi]
- Analysis of Phonetic Level Similarities Across Languages in Emotional SpeechPravin Mote, Abinay Reddy Naini, Donita Robinson, Elizabeth Richerson, Carlos Busso. [doi]
- AttentiveMOS: A Lightweight Attention-Only Model for Speech Quality PredictionImran E. Kibria, Donald S. Williamson. [doi]
- The Effect of Word Predictability on Spoken Cross-Language IntelligibilityWei Xue, Iuliia Zaitova, Bernd Möbius. [doi]
- Multilingual Query-by-Example KWS for Indian Languages using TransliterationKirandevraj R, Vinod K. Kurmi, Vinay P. Namboodiri, C. V. Jawahar. [doi]
- Adaptive Across-Subcenter Representation Learning for Imbalanced Anomalous Sound DetectionDong Wang 0013, Jiqing Han 0001, Guibin Zheng, Tieran Zheng, Yongjun He 0002. [doi]
- Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech RecognitionRaphaël Bagat, Irina Illina, Emmanuel Vincent 0001. [doi]
- Adaptive Knowledge Distillation for Device-Directed Speech DetectionHyung-Gun Chi, Florian Pesce, Wonil Chang, Oggi Rudovic, Arturo Argueta, Stefan Braun, Vineet Garg, Ahmed Hussen Abdelaziz. [doi]
- Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missionsAnkush Raut, Projna Paromita, Sydney R. Begerowski, Suzanne T. Bell, Theodora Chaspari. [doi]
- Tonal Variation and Word Meaning in TaiwaneseYu-Ying Chuang, Sheng-Fu Wang. [doi]
- From Talking and Listening Devices to Intelligent Communicative MachinesRoger K. Moore. [doi]
- StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice ConversionFengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan 0001, Zhiyong Wu 0001. [doi]
- SaD: A Scenario-Aware Discriminator for Speech EnhancementXihao Yuan, Siqi Liu, Yan Chen, Hang Zhou, Chang Liu, Hanting Chen, Jie Hu 0021. [doi]
- Factors affecting the in-context learning abilities of LLMs for dialogue state trackingPradyoth Hegde, Santosh Kesiraju, Jan Svec, Simon Sedlácek, Bolaji Yusuf, Oldrich Plchot, Deepak K. T, Jan Cernocký. [doi]
- StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency SegmentationSuhita Ghosh, Mélanie Jouaiti, Jan-Ole Perschewski, Sebastian Stober. [doi]
- Analysis of ABC Frontend Audio Systems for the NIST-SRE24Sara Barahona, Anna Silnova, Ladislav Mosner, Junyi Peng, Oldrich Plchot, Johan Rohdin, Lin Zhang 0054, Jiangyu Han, Petr Pálka, Federico Landini, Lukás Burget, Themos Stafylakis, Sandro Cumani, Dominik Bobos, Miroslav Hlavácek, Martin Kodovsky, Tomás Pavlícek. [doi]
- LRBA: Stealthy Backdoor Attacks on Speech Classification via Latent Rearrangement in VITSZexin Li, Wenhan Yao, Ye Xiao, Jinsu Yang, Fen Xiao, Weiping Wen. [doi]
- Addressing Task Conflicts in Stuttering Detection via MMoE-Based Multi-Task LearningXiaokang Liu, Xingfeng Li, Yudong Yang, Lan Wang, Nan Yan. [doi]
- Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm DetectionZhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler. [doi]
- Ranking and Selection of Bias Words for Contextual Bias Speech RecognitionHaoxiang Hou, Xun Gong 0005, Wangyou Zhang, Wei Wang 0010, Yanmin Qian. [doi]
- Modeling Formant Dynamics in Mandarin /ai/: Effects of Speech Style and Speech RateYunzhuo Xiang, Jingyi Sun. [doi]
- Bridging Speech and Singing: Multi-stage Speech-Prompted Singing Voice Conversion with Speaker Embedding AdaptationMingda Liu, Jiatong Shi. [doi]
- Towards Accurate Phonetic Error Detection Through Phoneme Similarity ModelingXuanru Zhou, Jiachen Lian, Cheol Jun Cho, Tejas S. Prabhune, Shuhe Li, William Li, Rodrigo Ortiz, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli. [doi]
- Replay Attacks Against Audio Deepfake DetectionNicolas M. Müller, Piotr Kawa, Wei Herng Choong, Adriana Stan, Aditya Tirumala Bukkapatnam, Karla Pizzi, Alexander Wagner, Philip Sperl. [doi]
- Audio-Based Classification and Geographic Regression of Austrian DialectsLorenz Gutscher, Michael Pucher. [doi]
- Enhancing Target-speaker Automatic Speech Recognition Using Multiple Speaker Embedding Extractors with Virtual Speaker EmbeddingJu-Seok Seong, Jeong Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang. [doi]
- "Dyadosyncrasy", Idiosyncrasy and Demographic Factors in Turn-TakingJulio Cesar Cavalcanti, Gabriel Skantze. [doi]
- Robust Vocal Intensity Prediction: Overcoming Dataset Bias with Pretrained Deep ModelsQuentin Le Tellier, Marc Evrard, Albert Rilliard, Jean-Sylvain Liénard. [doi]
- AA-SLLM: An Acoustically Augmented Speech Large Language Model for Speech Emotion RecognitionJialong Mai, Xiaofen Xing, Weidong Chen, Yuanbo Fang, Xiangmin Xu. [doi]
- FreeCodec: A Disentangled Neural Speech Codec with Fewer TokensYouqiang Zheng, Weiping Tu, Yueteng Kang, Jie Chen, Yike Zhang, Li Xiao 0007, Yuhong Yang 0001, Long Ma. [doi]
- ArVoice: A Multi-Speaker Dataset for Arabic Speech SynthesisHawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki. [doi]
- Voice Adaptation for Swiss GermanSamuel Stucki, Jan Deriu, Mark Cieliebak. [doi]
- Infant Cry Emotion Recognition Using Improved ECAPA-TDNN with Multi-scale Feature Fusion and Attention EnhancementJunyu Zhou, Yanxiong Li, Haolin Yu. [doi]
- SupraDoRAL: Automatic Word Prominence Detection Using Suprasegmental Dependencies of Representations with Acoustic and Linguistic ContextJhansi Mallela, Upendra Vishwanath Y. S., Sankara Bharadwaj Rangavajjala, Bhaskar Bhatt, Chiranjeevi Yarra. [doi]
- TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement NetworkXiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu. [doi]
- FaiST: A Benchmark Dataset for Fairness in Speech TechnologyMaliha Jahan, Yinglun Sun, Priyam Mazumdar, Zsuzsanna Fagyal, Thomas Thebaud, Jesús Villalba 0001, Mark Hasegawa-Johnson, Najim Dehak, Laureano Moro-Velázquez. [doi]
- A Naturally Elicited Multimodal Stress Database and Speech Breathing Based Stress DetectionKarumannil Mohamed Ismail Yasar Arafath, Mohammed Abeer K. C., Aurobinda Routray. [doi]
- Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting ChallengeLongjie Luo, Shenghui Lu, Lin Li, Qingyang Hong. [doi]
- Towards Personalised Audio Visual Speech EnhancementMandar Gogate, Kia Dashtipour, Amir Hussain 0001. [doi]
- VisualSpeech: Enhancing Prosody Modeling in TTS Using VideoShumin Que, Anton Ragni. [doi]
- WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank DetectionHainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg. [doi]
- Parameter-Efficient Fine-tuning with Instance-Aware Prompt and Parallel Adapters for Speaker VerificationShengyu Peng, Wu Guo, Jie Zhang 0042, Yu Guan, Lipeng Dai, Zuoliang Li. [doi]
- Towards Sentence Level Imagined Speech Generation from EEG signalsSparsh Rastogi, Harsh Dadwal, Khushboo Modi, Jatin Bedi, Jasmeet Singh. [doi]
- A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao 0001, Yuki Mitsufuji. [doi]
- HiFiTTS-2: A Large-Scale High Bandwidth Speech DatasetRyan Langman, Xuesong Yang, Paarth Neekhara, Shehzeen Hussain, Edresson Casanova, Evelina Bakhturina, Jason Li. [doi]
- Multi-lingual and Zero-Shot Speech Recognition by Incorporating Classification of Language-Independent Articulatory FeaturesRyo Magoshi, Shinsuke Sakai, Jaeyoung Lee, Tatsuya Kawahara. [doi]
- Monotonic Attention for Robust Text-to-Speech Synthesis in Large Language Model FrameworksYike Zhang, Yiming Li, Jie Chen, Qinghua Wu, Songjun Cao, Long Ma. [doi]
- Speech Enhancement based on cascaded two flowsSeonggyu Lee, Sein Cheong, Sangwook Han, Kihyuk Kim, Jong Won Shin. [doi]
- Decoding Alzheimer's: Interpretable Visual and Logical Attention in Picture Description TasksNing Wang, Bingyang Wen, Minghui Wu, Yang Sun, Zongru Shao, Haojie Zhou, K. P. Subbalakshmi. [doi]
- Character Error Rate Estimation for Semi-Supervised Training of Speech Recognition for Arabic DialectsChanho Park, Oscar Saz. [doi]
- When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed SoundsMinsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho. [doi]
- Modeling Multi-Turn Spoken Language Understanding with Dynamic Graph Convolutional NetworksYi Huang 0017, Si Chen, Jingyu Yao, Junlan Feng. [doi]
- RELATE: Subjective evaluation dataset for automatic evaluation of relevance between text and audioYusuke Kanamori, Yuki Okamoto, Taisei Takano, Shinnosuke Takamichi, Yuki Saito 0001, Hiroshi Saruwatari. [doi]
- Causal Structure Discovery for Error Diagnostics of Children's ASRVishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen. [doi]
- TinyClick: Single-Turn Agent for Empowering GUI AutomationPawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Adam Wiacek, Marcin Skorupa, Sebastien Postansque, Jakub Hoscilowicz. [doi]
- Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource LanguagesSeraphina Fong, Marco Matassoni, Alessio Brutti. [doi]
- Intelligibility Prediction for Time-Modified Speech Signals Using Spectro-Temporal Modulation FeaturesAymen Bashir, Haolan Wang, Amin Edraki, Wai-Yip Chan, Jesper Jensen 0001. [doi]
- ReSepNet: A Unified-Light Model for Recursive Speech Separation with Unknown Speaker CountHadi Alizadeh, Rahil Mahdian Toroghi, Hassan Zareian. [doi]
- Self-supervised Optimality-Guided Learning of Speech ArticulationJuraj Simko, Benjamin Elie, Alice Turk. [doi]
- Bridging ASR and LLMs for Dysarthric Speech Recognition: Benchmarking Self-Supervised and Generative ApproachesAhmed Aboeitta, Ahmed Sharshar, Youssef Nafea, Shady Shehata. [doi]
- End-to-End Indian Language Dubbing with Zero-Shot Speaker PreservationGiri Raju, Sandeep Konam. [doi]
- Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation ModelsZhaoqing Li, Haoning Xu, Xurong Xie, Zengrui Jin, Tianzi Wang, Xunying Liu. [doi]
- Adversarial Attacks on Text-dependent Speaker Verification SystemSreekanth Sankala, Venkatesh Parvathala, Ramesh Gundluru, K. Sri Rama Murty. [doi]
- Training-Free Voice Conversion with Factorized Optimal TransportAlexander Lobashev, Assel Yermekova, Maria A. Larchenko. [doi]
- Enhancing Serialized Output Training for Multi-Talker ASR with Soft Monotonic Alignment and Utterance-level TimestampFengyun Tan, Tao Wei 0003, Kun Zou, Ning Cheng 0001, Shaojun Wang, Jing Xiao 0006. [doi]
- Anne Rowling Neurological Speech Corpus: clinically annotated longitudinal dataset for developing speech biomarkers in neurodegenerative disordersJohnny Tam, Christine Weaver, Oliver Watts, Siddharthan Chandran, Suvankar Pal, Rowling Speech Consortium. [doi]
- Efficient and Direct Duplex Modeling for Speech-to-Speech Language ModelKe Hu, Ehsan Hosseini-Asl, Chen Chen 0075, Edresson Casanova, Subhankar Ghosh, Piotr Zelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg. [doi]
- TELVID: A Multilingual Multi-modal Corpus for Speaker RecognitionKaren Jones, Kevin Walker, Christopher Caruso, Elliot Singer, Trang Nguyen, Robert B. Dunn, Stephanie M. Strassel. [doi]
- Gradual modeling of the Lombard effect by modifying speaker embeddings from a Text-To-Speech modelThiago Henrique Gomes Lobato, Magnus Schäfer. [doi]
- Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature FusionKumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik. [doi]
- Defending Speech-enabled LLMs Against Adversarial Jailbreak ThreatsAntonios Alexos, Raghuveer Peri, Sai Muralidhar Jayanthi, Metehan Cekic, Srikanth Vishnubhotla, Kyu J. Han, Srikanth Ronanki. [doi]
- Agent-based modelling, sound change, and metaphony in Southern Italian varieties of Italo-RomanceLilian von Bressensdorf, Pia Greca, Jonathan Harrington. [doi]
- Evaluation of a model for sound radiation from the vocal tract wallPeter Birkholz, Tianyi Zhang. [doi]
- Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice ConversionKaidi Wang 0001, Wenhao Guan, Ziyue Jiang 0001, Hukai Huang, Peijie Chen, Weijie Wu, Qingyang Hong, Lin Li. [doi]
- Simple and Effective Content Encoder for Singing Voice Conversion via SSL-Embedding Dimension ReductionWangjin Zhou, Tianjiao Du, Chenglin Xu, Sheng Li 0010, Yi Zhao 0006, Tatsuya Kawahara. [doi]
- Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?Bikash Dutta, Rishabh Ranjan, Shyam Sathvik, Mayank Vatsa, Richa Singh 0001. [doi]
- The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech modelsYi Wang, Oli Danyi Liu, Peter Bell 0001. [doi]
- Predicting Adolescent Suicidal Risk from Multi-task-based Speech: An Ensemble Learning ApproachXi Chen, Renzhe Yu, Yanshen Tan, Yiyi Li, Quan Qian, Ying Lin. [doi]
- FlowSE: Efficient and High-Quality Speech Enhancement via Flow MatchingZiqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie. [doi]
- Exploring Pre-trained models on Ultrasound Modeling for Mice Autism Detection with Uniform Filter Bank and Attentive ScoringYuchen Song, Yucong Zhang, Ming Li. [doi]
- A Study on The Impact of Foundation Models on Automatic Depression Detection from Speech SignalsBubai Maji, Monorama Swain, Shazia Nasreen, Debabrata Majumdar, Rajlakshmi Guha, Aurobinda Routray, Anders Søgaard. [doi]
- Knowledge Distillation Method for Pruned RNN-T Models via Pruning Bounds Sharing and Losses ConfusionXiaocan Zhang, WeiWei Jiang, Guibin Zheng, Chenhao Jing, Jiqing Han 0001, Tieran Zheng. [doi]
- SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation ToolkitWen-Chin Huang, Erica Cooper, Tomoki Toda. [doi]
- Interspeech 2025 URGENT Speech Enhancement ChallengeKohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar 0003, Marvin Sach, Yihui Fu, Wei Wang 0010, Tim Fingscheidt, Shinji Watanabe 0001. [doi]
- How do both phonological and syntactic complexity influence speech planning?Ivan Yuen, Katherine Demuth, Stefanie Shattuck-Hufnagel. [doi]
- Can Multimodal Foundation Models Help Analyze Child-Inclusive Autism Diagnostic Videos?Aditya Kommineni, Digbalay Bose, TianTian Feng, So-Hyun Kim, Helen Tager-Flusberg, Somer Bishop, Catherine Lord, Sudarsana Kadiri, Shrikanth Narayanan. [doi]
- Speech Multi-label Emotion Recognition Using Asymmetric Class Loss Function Based on Effective SamplesShanshan Xiang, Hankiz Yilahun, Askar Hamdulla. [doi]
- The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and RecognitionMing Gao, Shilong Wu, Hang Chen 0001, Jun Du 0002, Chin-Hui Lee 0001, Shinji Watanabe 0001, Jingdong Chen, Sabato Marco Siniscalchi, Odette Scharenborg. [doi]
- Temporal Convolutional Network with Smoothed and Weighted Losses for Distant Voice Activity and Overlapped Speech DetectionShaojie Li, Qintuya Si, De Hu. [doi]
- Improved Intelligibility of Dysarthric Speech using Conditional Flow MatchingShoutrik Das, Nishant Singh, Arjun Gangwar, S. Umesh. [doi]
- Dynamic Acoustic Model Architecture Optimization in Training for ASRJingjing Xu 0002, Zijian Yang, Albert Zeyer, Eugen Beck, Ralf Schlüter, Hermann Ney. [doi]
- Zero-Shot Speech-Based Depression and Anxiety Assessment with LLMsErfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz. [doi]
- DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic RegenerationSanberk Serbest, Tijana Stojkovic, Milos Cernak, Andrew Harper. [doi]
- Privacy-Preserving Speaker Verification via End-to-End Secure Representation LearningChenguang Hu, Yaqian Hao, Fulin Zhang, XiaoXue Luo, Yao Shen, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng. [doi]
- SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound ExtractionTuochao Chen, D. Shin, Hakan Erdogan, Sinan Hersek. [doi]
- Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment LossesSeung Gyu Jeong, Seong-Eun Kim. [doi]
- Synthetic Data Generation for Phrase Break Prediction with Large Language ModelHoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim. [doi]
- Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-SpeechNam Gyu Kim, Deok-Hyeon Cho, Seung-bin Kim, Seong-Whan Lee. [doi]
- FFD: Fine-Finger Diffusion Model for Music to Fine-grained Finger Dance GenerationBoya Dong, Wentao Lei, Li Liu. [doi]
- The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African LanguagesChris Emezue, NaijaVoices Community, Busayo Awobade, Abraham Toluwase Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal. [doi]
- Speech Annotation for A: Accuracy, Access, and ApplicationZirong Li, Hongchen Wu, Yixin Gu, Yao Du, Yang Yue. [doi]
- TSDT-Net: Ultra-Low-Complexity Two-Stage Model Combining Dual-Path-Transformer and Transform-Average-Concatenate Network for Speech EnhancementYi Gao, Hangting Chen, Siyu Zhang, Qingshan Yang, Jingcong Chen. [doi]
- 2D Immersed Boundary Method in Vocal Tract Acoustics: An Eulerian-Lagrangian Model for Simulation of DiphthongsRongshuai Wu, Debasish Ray Mohapatra, Sidney Fels. [doi]
- VoiceNoNG: Robust High-Quality Speech Editing Model without HallucinationsSung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Pin-Jui Ku, Ante Jukic, Huck Yang, Yu Tsao 0001, Yu-Chiang Frank Wang, Hung-yi Lee, Szu-Wei Fu. [doi]
- Automatic Speech Recognition of African American English: Lexical and Contextual EffectsHamid Mojarad, Kevin Tang. [doi]
- Tracking /r/ Deletion: Forced Alignment of Pronunciation Variants and Sociophonetic Insights into Post-Obstruent Final /r/ in FrenchAnisia Popescu, Lori Lamel, Marc Evrard, Ioana Vasilescu. [doi]
- Fine-tuning Parakeet-TDT for Dysarthric Speech Recognition in the Speech Accessibility Project ChallengeKaito Takahashi, Keigo Hojo, Toshimitsu Sakai, Yukoh Wakabayashi, Norihide Kitaoka. [doi]
- SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech RecognitionYuta Hirano, Sakriani Sakti. [doi]
- Adapting Whisper for low-resource Hindi-English Code-Mix speech with on-the-fly Augmentation & LLM-Synthesised DataAstik Biswas, Oleg Shevelev, Amine Abdaoui, Vivek Tyagi, Abdelmoumene Boumadane. [doi]
- ArticulateX: End-to-End Monolingual Speech Translation in Articulator SpaceVishal Kumar, Vinayak Abrol. [doi]
- MelRe: Vision-Based Mel-Spectrogram RestorationKaixuan Luan, Xiaoda Yang, Shile Cai, Ruofan Hu, Minghui Fang 0002, Wenrui Liu 0003, Jialong Zuo, Jiaqi Duan, Yuhang Ma, Junyu Lu. [doi]
- Multimodal Speech-Based Biomarkers Outperform the ALS Functional Rating Scale in Predicting Individual Disease Progression in ALSHardik Kothare, Michael Neumann, Vikram Ramanarayanan. [doi]
- SpokenNativQA: Multilingual Everyday Spoken Queries for LLMsFiroj Alam, Md. Arid Hasan, Shammur Absar Chowdhury. [doi]
- Unmasking real-world audio deepfakes: A data-centric approachDavid Combei, Adriana Stan, Dan Oneata, Nicolas M. Müller, Horia Cucu. [doi]
- Benchmarking Time-localized Explanations for Audio Classification ModelsCecilia Bolaños, Leonardo Pepino, Martín Meza, Luciana Ferrer. [doi]
- Score-Based Training for Energy-Based TTS ModelsWanli Sun, Anton Ragni. [doi]
- Investigating the Reasoning Abilities of Large Language Models for Understanding Spoken Language in Interpersonal InteractionsPranjal Aggarwal, Ghritachi Mahajani, Pavan Kumar Malasani, Vaibhav Jamadagni, Caroline J. Wendt, Ehsanul Haque Nirjhar, Theodora Chaspari. [doi]
- Deep learning based spatial aliasing reduction in beamforming for audio captureMateusz Guzik, Giulio Cengarle, Daniel Arteaga. [doi]
- AF-Vocoder: Artifact-Free Neural Vocoder with Global Artifact FilterZhuangqi Chen, Xianjun Xia, Xiaohuai Le, Siyu Sun, Chuanzeng Huang. [doi]
- EnvSDD: Benchmarking Environmental Sound Deepfake DetectionHan Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang 0001, Mark D. Plumbley. [doi]
- Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads DownYingzhi Wang, Anas Alhmoud, Saad Alsahly, Muhammad Alqurishi, Mirco Ravanelli. [doi]
- Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic AnalysisMiao Zhang, Aref Farhadipour, Annie Baker, Jiachen Ma, Bogdan Pricop, Eleanor Chodroff. [doi]
- FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue SystemsYizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma 0001, Eng Siong Chng. [doi]
- Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker CodesRogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya. [doi]
- Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AIShree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely. [doi]
- A Siamese Network-Based Framework for Voice Mimicry Proficiency Assessment Using X-Vector EmbeddingsBhasi K. C., Rajeev Rajan. [doi]
- Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASRZheng Xin Yong, Vineel Pratap, Michael Auli, Jean Maillard. [doi]
- Automatic detection of speech sound disorders in German-speaking children: augmenting the data with typically developed speechDarline Monika Marx, Marco Matassoni, Alessio Brutti. [doi]
- SiamCTC: Learning Speech Representations through Monotonic Temporal AlignmentSooHwan Eom, Mark Hasegawa-Johnson, Chang D. Yoo. [doi]
- Language and Accent Familiarity Effects on the Use of Acoustic Cues in Talker IdentificationShengyue Xiong, Zhe-chen Guo, Bharath Chandrasekaran. [doi]
- Label Semantic-Driven Contrastive Learning for Speech Emotion RecognitionJiaxi Hu, Leyuan Qu, Haoxun Li, Taihao Li. [doi]
- Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and AffectJaya Narain, Vasudha Kowtha, Colin Lea, Lauren Tooley, Dianna Yee, Vikramjit Mitra, Zifang Huang, Miquel Espi Marques, Jon Huang, Carlos Avendaño, Shirley Ren. [doi]
- Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech ExtractionWenxuan Wu, Shuai Wang 0016, Xixin Wu, Helen Meng, Haizhou Li 0001. [doi]
- Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity RatingsLinda Bakkouche, Charles McGhee, Emily Lau, Stephanie Cooper, Xinbing Luo, Madeleine Rees, Kai Alter, Brechtje Post, Julia Schwarz. [doi]
- DiffMV-ETS: Diffusion-based Multi-Voice Electromyography-to-Speech Conversion using Speaker-Independent Speech Training TargetsKevin Scheck, Tom Dombeck, Zhao Ren, Peter Wu, Michael Wand 0002, Tanja Schultz. [doi]
- FT-Boosted SV: Towards Noise Robust Speaker Verification for English Speaking Classroom EnvironmentsSaba Tabatabaee, Jing Liu 0064, Carol Y. Espy-Wilson. [doi]
- Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear ModulationMyeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park. [doi]
- HASRD: Hierarchical Acoustic and Semantic Representation DisentanglementAmir Hussein, Sameer Khurana, Gordon Wichern, François G. Germain, Jonathan Le Roux. [doi]
- Cocktail-Party Audio-Visual Speech RecognitionThai Binh Nguyen, Ngoc-Quan Pham, Alexander Waibel. [doi]
- Towards Adaptable and Intelligible Speech Synthesis in Noisy EnvironmentsLubos Marcinek, Jonas Beskow, Joakim Gustafson. [doi]
- Linguistic Masking and Its Release in Simulated Electric-acoustic HearingYuting Ding, Xuefei Wang, Fei Chen. [doi]
- Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological KnowledgeAditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik. [doi]
- The Development of Speech Rhythm in Putonghua-Learning Preschool Children in South Xinjiang Uyghur Autonomous Region of ChinaAijun Li, Zhiwei Wang, Jun Gao, Xin Zhou. [doi]
- Enhancing Lyrics Transcription on Music Mixtures with Consistency LossJiawen Huang, Felipe Sousa, Emir Demirel, Emmanouil Benetos, Igor Gadelha. [doi]
- Spot and Merge: A Hybrid Context Biasing Approach for Rare Word and Out of Vocabulary RecognitionJatin Agrawal, Bramhendra Koilakuntla, Srikanth Konjeti. [doi]
- GST-BERT-TTS: Prosody Prediction Without Accentual Labels For Multi-Speaker TTS Using BERT With Global Style TokensTadashi Ogura, Takuma Okamoto, Yamato Ohtani, Erica Cooper, Tomoki Toda, Hisashi Kawai. [doi]
- CHSER: A Dataset and Case Study on Generative Speech Error Correction for Child ASRNatarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang, Mohan Shi, Abeer Alwan. [doi]
- When The MOS Predictor Asks For Training Annotation In Cross Lingual/Domain AdaptationNatacha Miniconi, Meysam Shamsi, Anthony Larcher. [doi]
- Bridging the Training-Inference Gap in TTS: Training Strategies for Robust Generative Postprocessing for Low-Resource SpeakersFrank Zalkow, Paolo Sani, Kishor Kayyar Lakshminarayana, Emanuël A. P. Habets, Nicola Pia, Christian Dittmar. [doi]
- Developing a LeFF Transformer Model for Exacerbated Speech Detection in COPD and AsthmaYuyang Yan, Sami O. Simons, Visara Urovi. [doi]
- Spoken Question Answering for Visual QueriesNimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle. [doi]
- Synthetic Dysarthric Speech: A Supplement, Not a Substitute for Authentic Data in Dysarthric Speech RecognitionJingting Li, Keyi Feng, Xinran Zhao, Yan Wang, Su-Jing Wang. [doi]
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative SamplesChun-Yi Kuan, Hung-yi Lee. [doi]
- SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech RecognitionLongjie Luo, Lin Li, Qingyang Hong. [doi]
- ABHINAYA - A System for Speech Emotion Recognition In Naturalistic Conditions ChallengeSoumya Dutta, Smruthi Balaji, Varada R, Viveka Salinamakki, Sriram Ganapathy. [doi]
- DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-SpeechDeok-Hyeon Cho, Hyung-Seok Oh, Seung-bin Kim, Seong-Whan Lee. [doi]
- SepVAC: Multitask Learning of Speaker Separation, Speaker Localization, Microphone Array Localization, and Room Acoustic Parameter Estimation in Various Acoustic ConditionsRoland Hartanto, Sakriani Sakti, Koichi Shinoda. [doi]
- Jointly Improving Dialect Identification and ASR in Indian Languages using Multimodal Feature FusionSaurabh Kumar, Amartyaveer, Prasanta Kumar Ghosh. [doi]
- Rapport-Building Dialogue Strategies for Deeper Connection: Integrating Proactive Behavior, Personalization, and Aizuchi BackchannelsMuhammad Yeza Baihaqi, Angel F. Garcia Contreras, Seiya Kawano, Koichiro Yoshino. [doi]
- Towards Temporally Explainable Dysarthric Speech Clarity AssessmentSeohyun Park, Chitralekha Gupta, Michelle Kah Yian Kwan, Xinhui Fung, Alexander Wenjun Yip, Suranga Nanayakkara. [doi]
- LSPnet: an ultra-low bitrate hybrid neural codecBowen Zhang, Ian McLoughlin, Xiaoxiao Miao, A. S. Madhukumar. [doi]
- Speech transcription from South Tyrolean Dialect to Standard German with WhisperLuca Ducceschi, Greta H. Franzini. [doi]
- E2E-BPVC: End-to-End Background-Preserving Voice Conversion via In-Context LearningYihan Liu, Zhengyang Chen, Leying Zhang, Yanmin Qian. [doi]
- Speech power spectra: a window into neural oscillations in Parkinson's diseaseSevada Hovsepyan, Mathew Magimai-Doss. [doi]
- Towards Early Prediction of Self-Supervised Speech Model PerformanceRyan Whetten, Lucas Maison, Titouan Parcollet, Marco Dinarelli, Yannick Estève. [doi]
- Identification of Pathological Pronunciation Profiles in ASR Transcription ErrorsMargot Masson, Isabelle Ferrané, Julie Mauclair. [doi]
- Modality-Agnostic Multimodal Emotion Recognition using a Contrastive Masked AutoencoderGeorgios Chochlakis, Turab Iqbal, Woo Hyun Kang, Zhaocheng Huang. [doi]
- Speaker-Aware Multi-Task Learning for Speech Emotion RecognitionXiaohan Shi, Xingfeng Li 0001, Tomoki Toda. [doi]
- DLF-EEND: Dynamic Layer Fusion for End-to-End Speaker DiarizationWooil Kim, Bongsu Jung. [doi]
- CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech TranslationJiale Ou, Hongying Zan. [doi]
- Functional Connectivity and Hilbert-Based Features for Covert Speech EEG Variability Analysis and ClassificationSaravanakumar Duraisamy, Maurice Rekrut, Luis A. Leiva. [doi]
- Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASRMartin Ratajczak, Jean-Philippe Robichaud, Jennifer Drexler Fox. [doi]
- Articulatory modeling of the S-shaped F2 trajectories observed in Öhman's spectrographic analysis of VCV syllablesFrédéric Berthommier. [doi]
- Visual features of the oral region in Polish sibilants produced by children with various sibilance patternsAgata Sage, Zuzanna Miodonska, Michal Krecichwost, Ewa Kwasniok, Pawel Badura. [doi]
- Acoustic similarities, articulatory uniqueness: Speech production mechanisms in individuals with congenital lip paralysisAnne Hermes, Ivana Didirková, Philipp Buech, Gilles Vannuscorps. [doi]
- Optimizing CLAP Reward with LLM Feedback for Semantically Aligned and Diverse Automated Audio CaptioningSeyun Ahn, Pil Moo Byun, Won-Gook Choi, Joon-Hyuk Chang. [doi]
- Theoretical proposal for a unified Bayesian model of adaptation in non-interactive and interactive speech productionMélen Guillaume, Anahita Basirat, Julien Diard. [doi]
- Transcribing Oral History Recordings Using the Transcription PortalChristoph Draxler, Julian Pömp, Henk van den Heuvel, Fabio Ardolino, Arjan van Hessen. [doi]
- Real-Time Diffusion Buffer for Speech Enhancement On A LaptopBunlong Lay, Rostilav Makarov, Timo Gerkmann. [doi]
- ASR-based segmentation for the analysis of larger child-speech datasets: Performance evaluation on vowels from Australian-English speaking children aged 4 to 11 yearsRui Cai, Titia Benders. [doi]
- Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic InformationNicholas Sanders, Yuanchao Li, Korin Richmond, Simon King 0001. [doi]
- Can ASR generate valid measures of child reading fluency?Wieke Harmsen, Roeland Van Hout, Catia Cucchiarini, Helmer Strik. [doi]
- Beat gestures made by human-like avatars affect speech perceptionMatteo Maran, Renske Rötjes, Anna R. E. Schreurs, Hans Rutger Bosker. [doi]
- Leveraging LLMs for Written to Spoken Style Data Transformation to Enhance Spoken Dialog State TrackingHaris Gulzar, Monikka Roslianna Busto, Akiko Masaki, Takeharu Eda, Ryo Masumura. [doi]
- Are loan sequences different from foreign sequences? A perception study with Japanese listeners on coronal obstruent - high front vowel sequencesSilke Hamann, Andrea Alicehajic. [doi]
- Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech RecognitionLeonora Vesterbacka, Faton Rekathati, Robin Kurtz, Justyna Sikora, Agnes Toftgård. [doi]
- WTFormer: A Wavelet Conformer Network for MIMO Speech Enhancement with Spatial Cues PeservationLu Han, Junqi Zhao, Renhua Peng. [doi]
- I want a horror - comedy - movie: Slips-of-the-Tongue Impact Conversational Recommender System PerformanceMaria Teleki, Lingfeng Shi, Chengkai Liu, James Caverlee. [doi]
- Investigating continuous autoregressive generative speech enhancementHaici Yang, Gordon Wichern, Ryo Aihara, Yoshiki Masuyama, Sameer Khurana, François G. Germain, Jonathan Le Roux. [doi]
- Voxplorer: Voice data exploration and projection in an interactive dashboardAlessandro De Luca 0005, Srikanth Madikeri, Volker Dellwo. [doi]
- Improving Speech Emotion Recognition Through Cross Modal Attention Alignment and Balanced Stacking ModelLucas H. Ueda, João Lima, Leonardo Marques, Paula Dornhofer Paro Costa. [doi]
- Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality AlignmentJeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, Xie Chen 0001. [doi]
- A Novel Deep Learning Framework for Efficient Multichannel Acoustic Feedback ControlYuan-Kuei Wu, Juan Azcarreta Ortiz, Kashyap Patel, Buye Xu, Jung-Suk Lee, Sanha Lee, Ashutosh Pandey 0004. [doi]
- VoiceMark: Zero-Shot Voice Cloning-Resistant Watermarking Approach Leveraging Speaker-Specific LatentsHaiyun Li, Zhiyong Wu, Xiaofeng Xie, Jingran Xie, Yaoxun Xu, Hanyang Peng. [doi]
- CAGCRN: Real-Time Speech Enhancement with a Lightweight Model for Joint Acoustic Echo Cancellation and Noise SuppressionYuyang Wang, Yonghui Liu, Jianbing Liu, Kai Niu 0001, Zhiqiang He 0001. [doi]
- Thinking Fast and Slow: Robust Speech Recognition via Deep Filter-TuningDianwen Ng, Kun Zhou 0003, Bin Ma 0001, Eng Siong Chng. [doi]
- Robot-assisted Recognition of Vocal Emotions in Pseudospeech for Cochlear Implanted AdolescentsGloria Araiza-Illan, Luke Meyer, Bert Maat, Deniz Baskent. [doi]
- Semantic-Aware Interpretable Multimodal Music Auto-TaggingAndreas Patakis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou. [doi]
- Exploring Shared-Weight Mechanisms in Transformer and Conformer Architectures for Automatic Speech RecognitionThomas Rolland, Alberto Abad. [doi]
- Contextual predictability effects on acoustic distinctiveness in read Polish speechZofia Malisz, Jan Foremski, Malgorzata Kul. [doi]
- Perception of Emotional Speech by Individuals with High Borderline Personality FeaturesYizhou Chen, Xiyu Wu. [doi]
- Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum LearningYejin Jeon, Solee Im, Youngjae Kim, Gary Geunbae Lee. [doi]
- Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble FusionAjinkya Kulkarni, Sandipana Dowerah, Tanel Alumäe, Mathew Magimai-Doss. [doi]
- SCRIBAL: A Digital Transcription Tool in Higher EducationJavier Román, Pol Pastells, Mauro Vázquez Chas, Clara Puigventós, Montserrat Nofre, Mariona Taulé, Mireia Farrús. [doi]
- VoiceNet: Multilingual On-Device Phoneme-To-Audio AlignmentKun Jin, Siva Penke, Srinivasa Algubelli. [doi]
- Boundary-Conscious Pruning: Hard Set-Aware Model Compression for Efficient Speaker RecognitionSeongkyu Mun, Jubum Han. [doi]
- Crowdsourcing MUSHRA Tests in the Age of Generative Speech Technologies: A Comparative Analysis of Subjective and Objective Testing MethodsLaura Lechler, Chamran Moradi, Ivana Balic. [doi]
- Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human FeedbackJingyi Chen, Ju-Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault. [doi]
- Articulatory variations in Apical Vowels in Southwestern MandarinJing Huang, Feng-fan Hsieh, Yueh-Chin Chang. [doi]
- Rollback Speech: Smart Feedback Prompts for Lost Utterances in Unstable Online CallsYuni Amaloa Quintero Villalobos, Wafaa Wardah, Sebastian Möller 0001, Robert P. Spang. [doi]
- On-device Streaming Discrete Speech UnitsKwangHee Choi, Masao Someki, Emma Strubell, Shinji Watanabe 0001. [doi]
- Optimized Real-time Speech Enhancement with Deep SSMs on Raw AudioYan Ru Pei, Ritik Shrivastava, Sidharth. [doi]
- Transcribing Diverse Voices: Using Whisper for ICE corporaAndreas Weilinghoff. [doi]
- ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech RecognitionThai Binh Nguyen, Thi-Van Nguyen, Quoc Truong Do, Chi Mai Luong. [doi]
- Analysis and Extension of a Near-End Listening Enhancement Method Based on Long-Term Fractile Noise StatisticsFilippo Villani, Wai-Yip Chan, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen 0001. [doi]
- Mitigating Language Mismatch in SSL-Based Speaker AnonymizationZhe Zhang, Wen-Chin Huang, Xin Wang 0037, Xiaoxiao Miao, Junichi Yamagishi. [doi]
- Enabling the replicability of speech synthesis perceptual evaluationsSébastien Le Maguer, Gwénolé Lecorvé, Damien Lolive, Naomi Harte, Juraj Simko. [doi]
- EEG-based Voice Conversion : Hearing the Voice of Your BrainYizhong Geng, Wenxin Fu, Qihang Lu, Bingsong Bai, Cong Wang, Yingming Gao, Ya Li. [doi]
- Vocoder-Projected Feature DiscriminatorTakuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo. [doi]
- Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity ControlMasato Murata, Koichi Miyazaki, Tomoki Koriyama. [doi]
- PredTrAD - Prediction-based Transformer for Anomaly Detection in Multivariate Time Series DataJan Schuster, Alexander Wölfel, Fabian Brunner, Christian Bergler. [doi]
- Investigating Stochastic Methods for Prosody Modeling in Speech SynthesisPaul Mayer, Florian Lux, Alejandro Pérez González de Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu. [doi]
- The Role of Voiced Consonant Duration in Sung Vowel-Consonant and Consonant-Vowel RecognitionAllan Vurma, Einar Meister, Lya Meister, Jaan Ross, Marju Raju, Veeda Kala, Tuuri Dede. [doi]
- Pitfalls and Limits in Automatic Dementia AssessmentFranziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer. [doi]
- Do you read me? - flow of speech effect on speaker recognition systemsAlicja Martinek, Joanna Gajewska, Ewelina Bartuzi-Trokielewicz. [doi]
- TVC-MusicGen: Time-Varying Structure Control for Background Music Generation via Self-Supervised TrainingChenyu Yang, Hangting Chen, Shuai Wang 0016, Haina Zhu, Haizhou Li 0001. [doi]
- Band-SCNet: A Causal, Lightweight Model for High-Performance Real-Time Music Source SeparationJunqi Yang, Yuhong Yang 0001, Weiping Tu, Xin Zhao, Cedar Lin. [doi]
- Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect IdentificationBadr M. Abdullah, Matthew Baas, Bernd Möbius, Dietrich Klakow. [doi]
- Defend for Self-Vocoding: A Novel Enhanced Decoder Network for Watermark RecoveryYu-sheng Lin, Ching-Yu Yang, Hsing-Hang Chou, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee. [doi]
- Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language ModelYong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu 0001. [doi]
- A Neural Codec Approach for Noise-Robust Bandwidth ExpansionXi Liu, Mu Yang, Szu-Jui Chen, John H. L. Hansen. [doi]
- The Interspeech 2025 Challenge on Speech Emotion Recognition in Naturalistic ConditionsAbinay Reddy Naini, Lucas Goncalves, Ali N. Salman, Pravin Mote, Ismail Rasim Ulgen, Thomas Thebaud, Laureano Moro-Velázquez, Leibny Paola García, Najim Dehak, Berrak Sisman, Carlos Busso. [doi]
- Structured pruning for efficient systolic array accelerated cascade Speech-to-Text TranslationJean-Luc Rouas, Charles Brazier, Leila Ben Letaifa, Rafael Medina 0001, Pedro Palacios, David Atienza, Giovanni Ansaloni. [doi]
- Automatic Speech Recognition Biases in Newcastle English: an Error AnalysisDana Serditova, Kevin Tang, Jochen Steffens. [doi]
- SonarGuard2: Ultrasonic Face Liveness Detection Based on Adaptive Doppler Effect Feature ExtractionXiaoming Zhang, Ke-Yue Zhang, Taiping Yao, Songjun Cao, Shouhong Ding, Long Ma. [doi]
- WhisperD: Dementia Speech Recognition and Filler Word Detection with WhisperEmmanuel Akinrintoyo, Nadine Abdelhalim, Nicole Salomons. [doi]
- Evaluation of Three Automatic Alignment Tools for the Processing of Non-native FrenchQian Zhou, Mathilde Hutin. [doi]
- Bidirectional Spoken-Written Text Conversion with Large Language ModelsMuyeol Choi, HyunJung Choi, Yohan Lim, Jeong-Uk Bang, Minkyu Lee, Seon Hui Kim, Seung Yun, Donghyun Kim, Minsoo Kim, Sanghun Kim. [doi]
- Evaluating Parameter Sharing for Spoofing-Aware Speaker Verification: A Case Study on the ASVspoof 5 DatasetAykut Büker, Oguzhan Kurnaz, Sule Bekiryazici, Selim Can Demirtas, Cemal Hanilçi. [doi]
- Probing the Robustness Properties of Neural Speech CodecsWei-Cheng Tseng, David Harwath. [doi]
- An Exploratory Framework for LLM-assisted Human Annotation of Speech DatasetsAlexander Johnson, Harsh Deshpande, Emmy Phung, Ahmad Emami. [doi]
- From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate DataAhmed Adel Attia, Dorottya Demszky, Jing Liu 0064, Carol Y. Espy-Wilson. [doi]
- Pick and Summarize: Integrating Extractive and Abstractive Speech SummarizationTakatomo Kano, Atsunori Ogawa, Marc Delcroix, Ryo Fukuda, William Chen, Shinji Watanabe 0001. [doi]
- Feature Importance across Domains for Improving Non-Intrusive Speech Intelligibility Prediction in Hearing AidsRyandhimas E. Zezario, Sabato Marco Siniscalchi, Fei Chen 0011, Hsin-Min Wang, Yu Tsao 0001. [doi]
- Network of acoustic characteristics for the automatic detection of suicide risk from speech. Contribution to the 2025 SpeechWellness challenge by the Semawave teamVincent P. Martin, Charles Brazier, Maxime Amblard, Michel Musiol, Jean-Luc Rouas. [doi]
- Diarization-Guided Multi-Speaker EmbeddingsJoonas Kalda, Clément Pagés, Tanel Alumäe, Hervé Bredin. [doi]
- Pitch Target Realization in Putonghua Tone Production of Children from Dialect-Speaking RegionsMengxue Cao, Tianxin Zheng, Jiewen Zheng. [doi]
- SPCODEC: Split and Prediction for Neural Speech CodecLiang Wen, Lizhong Wang, Yuxing Zheng, Weijing Shi, Kwang-Pyo Choi. [doi]
- Context is all you need? Low-resource conversational ASR profits from context, coming from the same or from the other speakerJulian Linke, Jana Winkler, Barbara Schuppler. [doi]
- VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger BridgeZijing Zhao 0008, Kai Wang, Hao Huang 0009, Ying Hu 0005, Liang He 0003, Jichen Yang. [doi]
- Bringing Interpretability to Neural Audio CodecsSamir Sadok, Julien Hauret, Éric Bavu. [doi]
- Neural Spectral Band Generation for Audio CodingWoongjib Choi, Byeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang. [doi]
- Multi-Teacher Language-Aware Knowledge Distillation for Multilingual Speech Emotion RecognitionMehedi Hasan Bijoy, Dejan Porjazovski, Tamás Grósz, Mikko Kurimo. [doi]
- You Are What You Say: Exploiting Linguistic Content for VoicePrivacy AttacksÜnal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters. [doi]
- WhiStress: Enriching Transcriptions with Sentence Stress DetectionIddo Yosha, Dorin Shteyman, Yossi Adi. [doi]
- Self-Supervised Models of Speech Processing for Haitian CreoleWilliam N. Havard, Renauld Govain, Benjamin Lecouteux, Emmanuel Schang. [doi]
- REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice ConversionIshan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv Ratn Shah. [doi]
- CEREALES : a new dataset of Quebec French accented speech with applications to speech recognitionLucas Maison, Thomas Soulas, Marie-Jean Meurs. [doi]
- Fine-tuning Strategies for Automatic Speech Recognition of Low-Resource Speech with Autism Spectrum DisorderYeseul Park, Bowon Lee. [doi]
- Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token DenoisingYe-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai, Zhen-Hua Ling. [doi]
- Defending Unauthorized Voice Cloning with Watermark-Aware CodecsJiankun Zhao, Lingwei Meng, Chengxi Deng, Helen Meng, Xixin Wu. [doi]
- Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language ModelsChi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, Hung-yi Lee. [doi]
- Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech AnalysisAnna Seo Gyeong Choi, Alexander Richardson, Ryan Partlan, Sunny X. Tang, Sunghye Cho. [doi]
- OMPAL: Bridging Speech and Learning with an Open-Source Mandarin Pronunciation Assessment Corpus for Global LearnersWen-Wei Hsieh, Hao-Wei Chi, Kuan-Chen Wang, Ping-Cheng Yeh, Te-Hsin Liu, Chen-Yu Chiang. [doi]
- Focal Modulation Network: A Novel Solution for Polyphonic Music Instrument Recognition without Attention and Aggregation StrategyLekshmi Chandrika Reghunath, Rajeev Rajan. [doi]
- MDDM: A Multi-view Discriminative Enhanced Diffusion-based Model for Speech EnhancementNan Xu, Zhaolong Huang, Xiaonan Zhi. [doi]
- A Watermark for Auto-Regressive Speech Generation ModelsYihan Wu, Ruibo Chen, Georgios Milis, Junfeng Guo, Heng Huang. [doi]
- Exploring Generative Error Correction for Dysarthric Speech RecognitionMoreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi. [doi]
- Exploring auditory feedback mechanisms in speech recognitionLouise Coppieters de Gibson, Philip N. Garner. [doi]
- Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMsHayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe 0001. [doi]
- EmbedAug: An Augmentation Scheme for End-to-End Automatic Speech RecognitionAshish Panda, Sunil Kumar Kopparapu. [doi]
- EmoJudge: LLM Based Post-Hoc Refinement for Multimodal Speech Emotion RecognitionPrabhav Singh, Jesús Villalba 0001. [doi]
- Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean SpeechDanilo de Oliveira, Julius Richter, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann. [doi]
- ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise ExtractionMinu Kim 0001, Kangwook Jang, Hoirin Kim. [doi]
- Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue DetectionGriffin Dietz Smith, Dianna Yee, Jennifer King Chen, Leah Findlater. [doi]
- Beyond Conventional Metrics: using Entropic Triangles to Explain Balancing Methods in Acoustic Scene ClassificationClaudia Montero-Ramírez, Alba Martínez-Serrano, Jorge Garcelán-Gómez, Francisco J. Valverde-Albacete, Carmen Peláez-Moreno. [doi]
- Pretraining Multi-Speaker Identification for Neural Speaker DiarizationShota Horiguchi, Atsushi Ando, Naohiro Tawara, Marc Delcroix. [doi]
- Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation ImprovementTuan Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel. [doi]
- PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel DetectionOguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep Chinchali. [doi]
- AuralNet: Hierarchical Attention-based 3D Binaural Localization of Overlapping SpeakersLinya Fu, Yu Liu, Zhijie Liu, Zedong Yang, Zhong-qiu Wang, Youfu Li 0001, He Kong. [doi]
- Open-Set Source Tracing of Audio Deepfake SystemsNicholas Klein, Hemlata Tak, Elie Khoury 0001. [doi]
- BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLMXun Gong 0005, Anqi Lv, Wangyou Zhang, Zhiming Wang, Huijia Zhu, Yanmin Qian. [doi]
- LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech CodecYiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen 0001, Kai Yu 0004. [doi]
- ATMM-SAGA: Alternating Training for Multi-Module with Score-Aware Gated Attention SASV systemAmro Asali, Yehuda Ben-Shimol, Itshak Lapidot. [doi]
- Seamless Dysfluent Speech Text Alignment for Disordered Speech AnalysisZongli Ye, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Haodong Li, Shuhe Li, Chenxu Guo, Anaisha Das, Peter Park, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli. [doi]
- EmoDB 2.0: A Database of Emotional Speech in a World that is not Black or White but GreyFelix Burkhardt, Oliver Schrüfer, Uwe D. Reichel, Hagen Wierstorf, Anna Derington, Florian Eyben, Björn W. Schuller. [doi]
- A Bayesian Approach to L2 Fluency Ratings by Native and Nonnative ListenersKakeru Yazawa, Takayuki Konishi. [doi]
- SMARTMOS: Modeling Subjective Audio Quality Evaluation for Real-Time ApplicationsSivakumar Balasubramanian, Jose Antonio Jimenez Amador, Kaustubh Kalgaonkar, King-wei Hor, Sriram Srinivasan. [doi]
- Few-Shot Speech Deepfake Detection Adaptation with Gaussian ProcessesNeta Glazer, David Chernin, Idan Achituve, Sharon Gannot, Ethan Fetaya. [doi]
- Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion RecognitionYoujun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu. [doi]
- M3L: A Multi-Modal and Multi-Lingual Depression Detection FrameworkJiajun You, Shuai Wang, Xun Gong, Xiang Wan. [doi]
- Investigating Affect Mining Techniques for Annotation Sample Selection in the Creation of Finnish Affective Speech CorpusKalle Lahtinen, Einari Vaaras, Liisa Mustanoja, Okko Räsänen. [doi]
- The Text-to-speech in the Wild (TITW) DatabaseJee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang 0037, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas W. D. Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe 0001. [doi]
- From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTSJuliana Francis, Joakim Gustafsson, Éva Székely. [doi]
- Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion RecognitionZiwei Gong, Pengyuan Shi, Kaan Donbekci, Lin Ai, Run Chen, David Sasu, Zehui Wu, Julia Hirschberg. [doi]
- Characterization of voice cue sensitivity and vocal emotion recognition across the adult lifespanLaura Rachman, Deniz Baskent. [doi]
- Can Emotion Fool Anti-spoofing?Aurosweta Mahapatra, Ismail Rasim Ulgen, Abinay Reddy Naini, Carlos Busso, Berrak Sisman. [doi]
- Leveraging Multi-Level Features of ATST with Conformer-Based Dual-Branch Network for Sound Event DetectionLipeng Dai, Qing Wang, Jie Zhang, Shengyu Peng, Yu Guan, Wu Guo. [doi]
- Speech stimulus design to study the neural coding of speech and the impact of cochlear synaptopathyEtienne Gaudrain, Sarah Verhulst, Deniz Baskent. [doi]
- CAMER: Contribution-Aware Multimodal Emotion RecognitionSun-Kyung Lee 0001, Jong-Hwan Kim 0001. [doi]
- Variability in Intervocalic /t/ and Community Diversity in Australian EnglishHannah White, Joshua Penney, Felicity Cox. [doi]
- Multilingual Speech Assessment Using Cross-Attention and Multitask LearningSeHyun Oh, Minhwa Chung, SunHee Kim. [doi]
- Scaling pseudo-labeling data for end-to-end low-resource speech translation (the case of Kurdish language)Mohammad MohammadAmini, Aghilas Sini, Marie Tahon, Antoine Laurent. [doi]
- Lightweight Speech Enhancement Model Based on Harmonic Attention and Phase Estimation with Skin-Attachable AccelerometerYonghun Song, Yeeun Kim, Yoonyoung Chung. [doi]
- Chain-of-Thought Distillation with Fine-Grained Acoustic Cues for Speech Emotion RecognitionJialong Mai, Xiaofen Xing, Yangbiao Li, Xiangmin Xu. [doi]
- Can AI Understand Mandarin Speech Prosody? A Framework and Benchmark ShowcaseZilong Wang 0006, Xiaoxue Zhang, Xinyang Jiang, Kaitao Song, Jue Yu. [doi]
- Accessible Real-time Eye-gaze Tracking for Neurocognitive Health Assessment: A Multimodal Web-based ApproachDaniel Tisdale, Jackson Liscombe, David Pautler, Michael Neumann, Vikram Ramanarayanan. [doi]
- IDIR: Identifying and Distilling Informative Relations for Speaker VerificationChong-Xin Gan, Zhe Li 0030, Zezhong Jin, Zilong Huang, Man-Wai Mak, Kong-Aik Lee. [doi]
- PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech DialogsSho Inoue, Shuai Wang 0016, Haizhou Li 0001. [doi]
- Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated ApproachNick Rossenbach, Benedikt Hilmes, Leon Brackmann, Moritz Gunz, Ralf Schlüter. [doi]
- Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech RecognitionAsahi Sakuma, Hiroaki Sato, Ryuga Sugano, Tadashi Kumano, Yoshihiko Kawai, Tetsuji Ogawa. [doi]
- ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition SystemsAnand Kumar Rai, Satyam Rahangdale, Utkarsh Anand, Animesh Mukherjee 0001. [doi]
- ARiSE: Auto-Regressive Multi-Channel Speech EnhancementPengjie Shen, Xueliang Zhang, Zhong-qiu Wang. [doi]
- Extended High-frequency Cues to Phoneme Recognition: Insights from ASRZhe-chen Guo, Bharath Chandrasekaran. [doi]
- TS3-Codec: Transformer-Based Simple Streaming Single CodecHaibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li 0001. [doi]
- Developing a Top-tier Framework in Naturalistic Conditions Challenge for Categorized Emotion Prediction: From Speech Foundation Models and Learning Objective to Data Augmentation and Engineering ChoicesTianTian Feng, Thanathai Lertpetchpun, Dani Byrd, Shrikanth Narayanan. [doi]
- An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speechQingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia. [doi]
- Enhancing Audio Deepfake Detection by Improving Representation Similarity of Bonafide SpeechSeung-bin Kim, Hyun-seo Shin, Jungwoo Heo, Chan-yeong Lim, Kyo-Won Koo, Jisoo Son, Sanghyun Hong, Souhwan Jung, Ha-Jin Yu. [doi]
- The 2024 NIST Speaker Recognition EvaluationCraig S. Greenberg, Lukas L. Diduch, Audrey Tong, Elliot Singer, Trang Nguyen, Robert Dunn, Lisa P. Mason, Beth Matys. [doi]
- Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker PrivacyElvir Karimov, Alexander Varlamov, Danil Ivanov, Dmitrii Korzh, Oleg Rogov. [doi]
- Extended Loss: Incorporating Long Context into Training Models when using Short Audio FramesQuang Minh Dinh, Hoda Rezaee Kaviani, Mehrdad Hosseinzadeh, Yuanhao Yu. [doi]
- ExagTTS: An Approach Towards Controllable Word Stress Incorporated TTS for Exaggerated Synthesized Speech Aiding Second Language LearnersAnindita Mondal, Monica Surtani, Anil Kumar Vuppala, Parameswari Krishnamurthy, Chiranjeevi Yarra. [doi]
- Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice ConversionSeymanur Akti, Tuan Nam Nguyen, Alexander Waibel. [doi]
- Unified Text and Speaker Verification using SSL model for Text-Dependent Speaker VerificationNathan Griot, Driss Matrouf, Raphaël Blouet, Jean-François Bonastre, Ana Mantecon. [doi]
- SOMSRED-SVC: Sequential Output Modeling with Speaker Vector Constraints for Joint Multi-Talker Overlapped ASR and Speaker DiarizationNaoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura. [doi]
- Factorized RVQ-GAN For Disentangled Speech TokenizationSameer Khurana, Dominik Klement, Antoine Laurent, Dominik Bobos, Juraj Novosad, Peter Gazdik, Ellen Zhang, Zili Huang, Amir Hussein, Ricard Marxer, Yoshiki Masuyama, Ryo Aihara, Chiori Hori, François G. Germain, Gordon Wichern, Jonathan Le Roux. [doi]
- Towards Machine Unlearning for Paralinguistic Speech ProcessingOrchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Shubham Singh, Swarup Ranjan Behera, Vandana Rajan, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- SNIFR : Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-TransformerOrchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- Stress in Spoken and Whistled GreekAndre Batchelder-Schwab, Vasileios Michos, Jonathan Barnes. [doi]
- Leveraging Ordinal Information for Speech-based Depression ClassificationLishi Zuo, Man-Wai Mak. [doi]
- The Sub-3Sec Problem: From Text-Independent to Text-Dependent CorpusRuichen Zuo, Kong-Aik Lee, Zilong Huang, Man-Wai Mak. [doi]
- SpeechRefiner: Towards Perceptual Quality Refinement for Front-End AlgorithmsSirui Li, Shuai Wang, Zhijun Liu, Zhongjie Jiang, Yannan Wang, Haizhou Li. [doi]
- Leveraging AM and FM Rhythm Spectrograms for Dementia Classification and AssessmentParismita Gogoi, Vishwanath Pratap Singh, Seema Khadirnaikar, Soma Siddhartha, Sishir Kalita, Jagabandhu Mishra, Md. Sahidullah, Priyankoo Sarmah, S. R. M. Prasanna. [doi]
- From Context to Code-switching: Examining the Interplay of Language Proficiency and Multilingualism in SpeechDebasmita Bhattacharya, Aanya Tolat, Julia Hirschberg. [doi]
- ASVspoof2019 vs. ASVspoof5: Assessment and ComparisonAvishai Weizman, Yehuda Ben-Shimol, Itshak Lapidot. [doi]
- Alzheimer's Dementia Detection Using Perplexity from Paired Large Language ModelsYao Xiao, Heidi Christensen, Stefan Goetze. [doi]
- MIKU-PAL: An Automated and Standardized Multimodal Method for Speech Paralinguistic and Affect LabelingYifan Cheng, Ruoyi Zhang, Jiatong Shi. [doi]
- Learning Optimal Prosody Embedding Codebook based on F0 and EnergyDavid Portes, Ales Horák. [doi]
- Prediction of listening effort ratings for habitual and clear-Lombard speech presented in noiseEsther Janse, Chen Shen, Martin Cooke. [doi]
- Analysis of the ABC Classification Backends for NIST SRE24Sandro Cumani, Anna Silnova, Sara Barahona, Ladislav Mosner, Oldrich Plchot, Johan Rohdin. [doi]
- Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer's Disease DetectionYin-Long Liu, Rui Feng, Jia-xin Chen, Yi-Ming Wang, Jia-Hong Yuan, Zhen-Hua Ling. [doi]
- Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents GenerationSteffen Freisinger, Philipp Seeberger, Thomas Ranzenberger, Tobias Bocklet, Korbinian Riedhammer. [doi]
- Vo-Ve: An Explainable Voice-Vector for Speaker Identity EvaluationJaejun Lee, Kyogu Lee. [doi]
- Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker VerificationJin Sob Kim, Hyun-Joon Park, WooSeok Shin, Sung Won Han. [doi]
- Inter-Speaker Relative Cues for Text-Guided Target Speech ExtractionWang Dai, Archontis Politis, Tuomas Virtanen. [doi]
- Enhancing Transcripts of Open-Source Automatic Speech Recognition Models Through Fine-Tuning with Laughter and Speech-LaughPhuoc Hoang Ho, Dragos Alexandru Balan, Dirk K. J. Heylen, Khiet P. Truong. [doi]
- Pairwise Evaluation of Accent Similarity in Speech SynthesisJinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond. [doi]
- Switch Conformer with Universal Phonetic Experts for Multilingual ASRMasato Mimura, Jaeyoung Lee, Tatsuya Kawahara. [doi]
- Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained EnvironmentsReo Yoneyama, Masaya Kawamura, Ryo Terashima, Ryuichi Yamamoto, Tomoki Toda. [doi]
- 75-Speaker Annot-16: A benchmark dataset for speech articulatory rt-MRI annotation with articulator contours and phonetic alignmentXuan Shi, Yubin Zhang, Yijing Lu, Marcus Ma, TianTian Feng, Asterios Toutios, Haley Hsu, Louis Goldstein, Shrikanth Narayanan. [doi]
- Relative cue weighting in multilingual stop voicing productionLe Xuan Chan, Annika Heuser. [doi]
- A Study of Speech Embedding Similarities Between Australian Aboriginal and High-Resource LanguagesEliathamby Ambikairajah, Jingyao Wu 0002, Ting Dang, Vidhyasaharan Sethu. [doi]
- Is it all about race?: A Cross-examination of /s/ in a Multilingual (Nigerian) ContextOluwasegun Amoniyan. [doi]
- Multimodal Biomarkers for Schizophrenia: Towards Individual Symptom Severity EstimationGowtham Premananth, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Y. Espy-Wilson. [doi]
- Reddit FlairShare: A Human-Annotated Dataset of Gender-Progressive Online DiscourseCarlos Hartmann. [doi]
- CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition ChallengeZehua Liu, Xiaolou Li, Chen Chen 0075, Lantian Li, Dong Wang 0013. [doi]
- Effects of Prosodic Information on Dialect Classification Using Whisper FeaturesPhoebe Parsons, Heming Strømholt Bremnes, Knut Kvale, Torbjørn Svendsen, Giampiero Salvi. [doi]
- Lightweight Front-end Enhancement for Robust ASR via Frame Resampling and Sub-Band PruningSiyi Zhao, Wei Wang, Yanmin Qian. [doi]
- A Joint Network for Singing Melody Extraction from Polyphonic Music with Attention Aggregation and Self-Consistency TrainingJiabo Jing, Ying Hu, Hao Huang, Liang He, Zhijian Ou. [doi]
- MVP: Multi-source Voice Pathology detectionAlkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis. [doi]
- EAA: Emotion-Aware Audio Large Language Models with Dual Cross-Attention and Context-Aware Instruction TuningHongfei Du, Sidi Lu, Gang Zhou, Ye Gao. [doi]
- MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization ChallengeZijiang Yang 0007, Meishu Song, Xin Jing 0001, Haojie Zhang, Kun Qian 0003, Bin Hu 0001, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto. [doi]
- Individualized speech enhancement for hearing-impaired listenersChuan Wen, Sarah Verhulst. [doi]
- Speech UnlearningJiali Cheng, Hadi Amiri. [doi]
- Spatio-Spectral Diarization of Meetings by Combining TDOA-based Segmentation and Speaker Embedding-based ClusteringTobias Cord-Landwehr, Tobias Gburrek, Marc Deegen, Reinhold Haeb-Umbach. [doi]
- A semi-automatic pipeline for transcribing and segmenting child speechPolychronia Christodoulidou, James Tanner, Jane Stuart-Smith, Michael McAuliffe, Mridhula Murali, Amy Smith, Lauren Taylor, Joanne Cleland, Anja Kuschmann. [doi]
- Position also matters! Separating Same Instruments in String Quartet using Timbral and Positional CuesYuetonghui Xu, Yiwen Wang, Xihong Wu, Xiaobing Li, Feng Yu. [doi]
- Listen through the Sound: Generative Speech Restoration Leveraging Acoustic Context RepresentationSoo-Whan Chung, Min-Seok Choi. [doi]
- MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech RecognitionYinfeng Xia, Huiyan Li, Chenyang Le, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian. [doi]
- Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute PredictionThanathai Lertpetchpun, TianTian Feng, Dani Byrd, Shrikanth Narayanan. [doi]
- Pitch Contour Model (PCM) with Transformer Cross-Attention for Speech Emotion RecognitionMinji Ryu, Ji-Hyeon Hur, Sung Heuk Kim, Gahgene Gweon. [doi]
- Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning ApproachYi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee. [doi]
- Robust Personal Voice Activity Detection for Mitigating Domain Mismatch and False Acceptance ScenariosYuke Lin, Jun Chen, Wenjie Li, Longshuai Xiao, Chao Weng. [doi]
- Towards an Ultra-Low-Delay Neural Audio Coding with Computational EfficiencyByeong Hyeon Kim, Hyungseob Lim, Inseon Jang, Hong-Goo Kang. [doi]
- Analyzing the Importance of Blank for CTC-Based Knowledge DistillationBenedikt Hilmes, Nick Rossenbach, Ralf Schlüter. [doi]
- VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in SpeechHarm Lameris, Joakim Gustafsson, Éva Székely. [doi]
- Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient TheoremAndres Fernandez, Juan Azcarreta Ortiz, Çagdas Bilen, Jesus Monge-Alvarez. [doi]
- ADCeleb: A Longitudinal Speech Dataset from Public Figures for Early Detection of Alzheimer's DiseaseKunxiao Gao, Anna Favaro, Najim Dehak, Laureano Moro-Velázquez. [doi]
- Articulatory Vowel Distinctiveness in SpanishKristin Teplansky, Emily Rangel, Mimi LaValley, Jinuk Kwon, Beiming Cao, Jun Wang 0037. [doi]
- Modeling Vowel System Typology Using Iterated Confusion MinimizationJohn McGahay. [doi]
- REAL-T: Real Conversational Mixtures for Target Speaker ExtractionShaole Li, Shuai Wang 0016, Jiangyu Han, Ke Zhang, Wupeng Wang, Haizhou Li 0001. [doi]
- Evaluating Wav2Vec2-Bert for Computer-Assisted Pronunciation Training for isiZuluAlexandra Fort, Francis Tyers. [doi]
- On the Within-class Variation Issue in Alzheimer's Disease DetectionJiawen Kang 0002, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng. [doi]
- Model as Loss: A Self-Consistent Training ParadigmSaisamarth Rajesh Phaye, Milos Cernak, Andrew Harper. [doi]
- SEED: Speaker Embedding Enhancement Diffusion ModelKihyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung. [doi]
- Prolongation in RomanianOana Niculescu, Monica Vasileanu. [doi]
- Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using Speech and Large Language ModelsChanwoo Park, Anna Seo Gyeong Choi, Sunghye Cho, Chanwoo Kim. [doi]
- Robustness of F0 Ratio as a Diagnostic: Comparing Creaky Voice in Danish and Seoul KoreanMichaela Watkins, Rasmus Puggaard-Rode, Paul Boersma, Silke Hamann. [doi]
- Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLMZhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie. [doi]
- Towards the Objective Characterisation of Major Depressive Disorder Using Speech Data from a 12-week Observational Study with Daily MeasurementsRobert Lewis, Szymon Fedor, Nelson Hidalgo Julia, Joshua Curtiss, Jiyeon Kim, Noah Jones, David Mischoulon, Thomas F. Quatieri, Nicholas Cummins, Paola Pedrelli, Rosalind W. Picard. [doi]
- Bayesian Learning for Domain-Invariant Speaker Verification and Anti-SpoofingJin Li, Man-Wai Mak, Johan Rohdin, Kong-Aik Lee, Hynek Hermansky. [doi]
- Unifying Listener Scoring Scales: Comparison Learning Framework for Speech Quality Assessment and Continuous Speech Emotion RecognitionCheng-Hung Hu, Yusuke Yasuda, Akifumi Yoshimoto, Tomoki Toda. [doi]
- Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE DatasetRui Liu 0008, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li 0001. [doi]
- Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofingThanapat Trachu, Thanathai Lertpetchpun, Ekapol Chuangsuwanich. [doi]
- Lexical competition in the process of Cantonese tone merging: Diverse Impact Mechanisms Across Different Individuals and Tone PairsLishan Li, Yaolin Zhou, Xiaoying Xu. [doi]
- Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and DetectionChenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Peter Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli. [doi]
- Selective Channel Attention based Target Speaker Voice Activity Detection for Speaker Diarization under AD-HOC Microphone Array SettingsHongyu Zhang, Ming Cheng, Jing Feng, Ming Li. [doi]
- SPEAKtoCOPD: a flashmob study to collect COPD speechLoes van Bemmel, Lauren G. Reinders, Folkert Brijker, Bas Holverda, Frits M. E. Franssen, Hanneke van Helvoort, Visara Urovi, Marieke Spreeuwenberg, Sami O. Simons. [doi]
- Towards Human-like Multimodal Conversational Agent by Generating Engaging SpeechTaesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim. [doi]
- Dual Orthogonality Sub-center Loss for Enhanced Anomalous Sound DetectionDong Wang 0013, Jiqing Han 0001, Tieran Zheng, Guibin Zheng, Yongjun He 0002. [doi]
- J-j-j-just Stutter: Benchmarking Whisper's Performance Disparities on Different Stuttering PatternsCharan Sridhar, Shaomei Wu. [doi]
- The Interspeech 2025 Speech Accessibility Project ChallengeXiuwen Zheng 0003, Bornali Phukon, Jonghwan Na, Ed Cutrell, Kyu J. Han, Mark Hasegawa-Johnson, Pan-Pan Jiang, Aadhrik Kuila, Colin Lea, Bob MacDonald, Gautam Varma Mantena, Venkatesh Ravichandran, Leda Sari, Katrin Tomanek, Chang D. Yoo, Chris Zwilling. [doi]
- Adapting Whisper for Streaming Speech Recognition via Two-Pass DecodingHaoran Zhou, Xingchen Song, Brendan Fahy, Qiaochu Song, Binbin Zhang, Zhendong Peng, Anshul Wadhawan, Denglin Jiang, Apurv Verma, Vinay Ramesh, Srivas Prasad, Michele M. Franceschini. [doi]
- EASY: Emotion-aware Speaker Anonymization via Factorized DistillationJixun Yao, Hexin Liu, Eng Siong Chng, Lei Xie 0001. [doi]
- Multi-Channel Sequence-to-Sequence Neural Diarization: Experimental Results for The MISP 2025 ChallengeMing Cheng, Fei Su, Cancan Li, Juan Liu, Ming Li. [doi]
- Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM: Data Condensation and Spoken QA GenerationQiongqiong Wang, Hardik B. Sailor, Tianchi Liu 0004, Ai Ti Aw. [doi]
- Evaluating Progress of CALL System Users on Accentedness and Comprehensibility: An Acoustic and ASR-Based ApproachWenwei Dong, Catia Cucchiarini, Roeland Van Hout, Helmer Strik. [doi]
- Synonymity-Based Semantic Coding for Efficient Speech CompressionShanhui Gan, Zijian Liang, Kai Niu 0001, Ping Zhang 0003. [doi]
- SSF-DST: A Spectro-Spatial Features Enhanced Deep Spatiotemporal Network for EEG-Based Auditory Attention DetectionTong Zhu, Xiaoke Yang, Jian Zhou 0006, Lu Li, Zhao Lv, Cunhang Fan. [doi]
- A Data-Driven Diffusion-based Approach for Audio Deepfake ExplanationsPetr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj. [doi]
- Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame RateHanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu. [doi]
- Training Onset-and-Offset-Aware Sound Event Detection on a Heterogeneous Dataset via Probabilistic Sequential ModelingTomoya Yoshinaga, Yoshiaki Bando, Keitaro Tanaka, Keisuke Imoto, Masaki Onishi, Shigeo Morishima. [doi]
- Does effortful speech production indicate communication difficulty caused by noise and hearing aid support?Lena-Marie Huttner, Jeppe H. Christensen, Gitte Keidser, Tobias May, Torsten Dau, Sergi Rotger-Griful. [doi]
- Robust fine-tuning of speech recognition models via model merging: application to disordered speechAlexandre Ducorroy, Rachid Riad. [doi]
- Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency DetectionJinming Zhang, Xuanru Zhou, Jiachen Lian, Shuhe Li, William Li, Zoe Ezzes, Rian Bogley, Lisa Wauters, Zachary Miller, Jet Vonk, Brittany Morin, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli. [doi]
- Improving Child Speech Recognition and Reading Mistake Detection by Using PromptsLingyun Gao, Cristian Tejedor García, Catia Cucchiarini, Helmer Strik. [doi]
- Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword SpottingYoungmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, Hoon-Young Cho. [doi]
- LID Models are Actually Accent Classifiers: Implications and Solutions for LID on Accented SpeechNiyati Bafna, Matthew Wiesner. [doi]
- Unlearning LLM-Based Speech Recognition ModelsZhe Liu. [doi]
- Cryfish: On deep audio analysis with Large Language ModelsAnton Mitrofanov, Sergey Novoselov, Tatiana Prisyach, Vladislav Marchevskiy, Arseniy Karelin, Nikita Khmelev, Dmitry Dutov, Stepan Malykh, Igor Agafonov, Aleksandr Nikitin, Oleg Petrov. [doi]
- GTAnet: Geometry-Guided Temporal Attention for EEG-Based Sound Source Tracking in Cocktail Party ScenariosSaurav Pahuja, Gabriel Ivucic, Siqi Cai 0002, Dashanka De Silva, Haizhou Li 0001, Tanja Schultz. [doi]
- Weight Factorization and Centralization for Continual Learning in Speech RecognitionEnes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel. [doi]
- Conformer-based Ultrasound-to-Speech ConversionIbrahim Ibrahimov, Csaba Zainkó, Gábor Gosztolya. [doi]
- Speech and Text Foundation Models for Depression Detection: Cross-Task and Cross-Language EvaluationLucía Gómez-Zaragozá, Javier Marín-Morales, Mariano Alcañiz, Mohammad Soleymani. [doi]
- Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual RepresentationsTeng Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang. [doi]
- Intelligibility of Text-to-Speech Systems for Mathematical ExpressionsSujoy Roychowdhury, Ranjani H. G., Sumit Soman, Nishtha Paul, Subhadip Bandyopadhyay, Siddhanth Iyengar. [doi]
- An Investigative Study on Recent Sharpness- and Flatness-Based Optimizers for Enhanced Self-Supervised Speaker VerificationAbderrahim Fathan, Jahangir Alam 0001, Xiaolin Zhu. [doi]
- DRI-GAN: A Novel Dual Real Input GAN with Triplet Loss for Cross-Lingual and Noisy SLUAnkit Kumar, Munir Georges. [doi]
- Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASRMingyu Cui, Yifan Yang 0005, Jiajun Deng, Jiawen Kang 0002, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen 0001, Xunying Liu. [doi]
- Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTSAnuprabha M, Krishna Gurugubelli, Anil Kumar Vuppala. [doi]
- Few-step Adversarial Schrödinger Bridge for Generative Speech EnhancementSeungu Han, Sungho Lee, Juheon Lee, Kyogu Lee. [doi]
- Towards Few-Shot Training-Free Anomaly Sound DetectionHo-Hsiang Wu, Wei-Cheng Lin, Abinaya Kumar, Luca Bondi, Shabnam Ghaffarzadegan, Juan Pablo Bello. [doi]
- Modality-Specific Speech Enhancement and Noise-Adaptive Fusion for Acoustic and Body-Conduction Microphone FrameworkYunsik Kim, Yoonyoung Chung. [doi]
- A Deformable Convolution GAN Approach for Speech Dereverberation in Cochlear Implant UsersHsin-Tien Chiang, John H. L. Hansen. [doi]
- GLCLAP: A Novel Contrastive Learning Pre-trained Model for Contextual Biasing in ASRYuxiang Kong, Fan Cui, Liyong Guo, Heinrich Dinkel, Lichun Fan, Junbo Zhang, Jian Luan 0001. [doi]
- MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal PromptZhichao Wu, Yueteng Kang, Songjun Cao, Long Ma, Qiulin Li, Qun Yang. [doi]
- Data Augmentation using Speech Synthesis for Speaker-Independent Dysarthria Severity ClassificationMinseop Kim, Minsu Han, Seokyoung Hong, Myoung-Wan Koo. [doi]
- Private kNN-VC: Interpretable Anonymization of Converted SpeechCarlos Franzreb, Arnab Das, Tim Polzehl, Sebastian Möller 0001. [doi]
- GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning ConstraintsJiajun He, Jinyi Mi, Tomoki Toda. [doi]
- EmoSpeechAuth: Emotion-Aware Speaker VerificationMagdalena Golebiowska, Piotr Syga. [doi]
- R2S: Real-to-Synthetic Representation Learning for Training Speech Recognition Models on Synthetic DataMinh Tran, Debjyoti Paul, Yutong Pang, Laxmi Pandey, Jinxi Guo, Ke Li 0023, Shun Zhang, Xuedong Zhang, Xin Lei. [doi]
- Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second LatencyBunlong Lay, Rostilav Makarov, Timo Gerkmann. [doi]
- Domain Adaptation Method and Modality Gap Impact in Audio-Text Models for Prototypical Sound ClassificationEmiliano Acevedo, Martín Rocamora, Magdalena Fuentes. [doi]
- Brain-tuned Speech Models Better Reflect Speech Processing Stages in the BrainOmer Moussa, Mariya Toneva. [doi]
- French Listening Tests for the Assessment of Intelligibility, Quality, and Identity of Body-Conducted Speech EnhancementThomas Joubaud, Julien Hauret, Véronique Zimpfer, Éric Bavu. [doi]
- Using gender, phonation and age to interpret automatically discovered speech attributes for explainable speaker recognitionCarole Millot, Clara Ponchard, Cédric Gendrot, Jean-François Bonastre, Orane Dufour. [doi]
- Towards Diverse and Efficient Audio Captioning via Diffusion ModelsManjie Xu, Chenxing Li, Yong Ren, Xinyi Tu, Ruibo Fu, Wei Liang, Dong Yu 0001. [doi]
- Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised LearningSarenne Wallbridge, Christoph Minixhofer, Catherine Lai, Peter Bell 0001. [doi]
- Towards a Unified Benchmark for Arabic Pronunciation Assessment: Qur'anic Recitation as Case StudyYassine El Kheir, Omnia Ibrahim, Amit Meghanani, Nada AlMarwani, Hawau Olamide Toyin, Sadeen Alharbi, Modar Alfadly, Lamya Alkanhal, Ibrahim Selim, Shehab Elbatal, Salima Mdhaffar, Thomas Hain, Yasser Hifny, Mostafa Shahin, Ahmed Ali 0002. [doi]
- Pushing the Limits of Beam Search Decoding for Transducer-based ASR modelsLilit Grigoryan, Vladimir Bataev, Andrei Andrusenko, Hainan Xu, Vitaly Lavrukhin, Boris Ginsburg. [doi]
- Continuous prediction of backchannel timing for human-robot interactionMichael Paierl, Martin Hagmüller, Barbara Schuppler. [doi]
- Investigating Gender Bias in Text-to-Audio Generation ModelsAarish Shah Mohsin, Mohammad Nadeem, Shahab Saquib Sohail, Tughrul Arslan, Mandar Gogate, Nasir Saleem, Amir Hussain 0001. [doi]
- Revival with Voice: Multi-modal Controllable Text-to-Speech SynthesisMinsu Kim, Pingchuan Ma 0001, Honglie Chen, Stavros Petridis, Maja Pantic. [doi]
- A Hybrid Approach to Combining Role Diarization with ASR for Professional ConversationsBongjun Kim, Arindam Ghosh, Mark C. Fuhs, Anurag Chowdhury, Deblin Bagchi, Monika Woszczyna. [doi]
- JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking StylesYuto Kondo, Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko. [doi]
- Pushing the Limits of End-to-End DiarizationSamuel J. Broughton, Lahiru Samarakoon. [doi]
- FlowTSE: Target Speaker Extraction with Flow MatchingAviv Navon, Aviv Shamsian, Yael Segal-Feldman, Neta Glazer, Gil Hetz, Joseph Keshet. [doi]
- Acquiring Pronunciation from Speech Audio via Multi-task LearningSiqi Sun, Korin Richmond. [doi]
- LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic ContextNatsuo Yamashita, Masaaki Yamamoto, Hiroaki Kokubo, Yohei Kawaguchi. [doi]
- Extending the Fongbe to French Speech Translation Corpus: resources, models and benchmarkD. Fortune Kponou, Salima Mdhaffar, Fréjus A. A. Laleye, Eugène C. Ezin, Yannick Estève. [doi]
- Spoken Language Modeling with Duration-Penalized Self-Supervised UnitsNicol Visser, Herman Kamper. [doi]
- Granary: Speech Recognition and Translation Dataset in 25 European LanguagesNithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang 0012, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng 0003, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg. [doi]
- SynHate: Detecting Hate Speech in Synthetic Deepfake AudioRishabh Ranjan, Kishan Pipariya, Mayank Vatsa, Richa Singh 0001. [doi]
- Band-Split Self-supervised Mamba for Infant-centered Audio AnalysisXulin Fan, Jialu Li 0002, Mark Hasegawa-Johnson, Nancy L. McElwain. [doi]
- Iterative Refinement, Not Training Objective, Makes HuBERT Behave Differently from wav2vec 2.0Robin Huo, Ewan Dunbar. [doi]
- StarGAN-Aug: A Cross-domain Fault Audio Generation Method for High-performance Fault Diagnosis of Power TransformersBen Niu 0011, Yangjie Wei, Gang Yang, Yuqiao Wang, Shengling Yu. [doi]
- AC/DC: LLM-based Audio Comprehension via Dialogue ContinuationYusuke Fujita, Tomoya Mizumoto, Atsushi Kojima, Lianbo Liu, Yui Sudo. [doi]
- ASR Confidence Estimation using True Class Lexical Similarity ScoreNagarathna Ravi, Thishyan Raj T, Ravi Teja Chaganti, Vipul Arora 0001. [doi]
- SardinianVoxes: A Speech Recognition Dataset for the Sardinian LanguagesSalvatore Carta, Alessandro Giuliani, Marco Manolo Manca, Mirko Marras, Leonardo Piano. [doi]
- Conveying Gender Through Speech: Insights from Trans MenAlice Ross, Cliodhna Hughes, Eddie L. Ungless, Catherine Lai. [doi]
- ViToSA: Audio-Based Toxic Spans Detection on Vietnamese Speech UtterancesHuy Ba Do, Vy Le-Phuong Huynh, Luan Thanh Nguyen. [doi]
- PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural PruningTruong Do, Phuong Minh Nguyen 0001, Le-Minh Nguyen 0001. [doi]
- NAM-to-Speech Conversion with Multitask-Enhanced Autoregressive ModelsNeil Shah, Shirish Karande 0001, Vineet Gandhi. [doi]
- Multimodal Prosody Modeling: A Use Case for Multilingual Sentence Mode PredictionBogdan Vlasenko, Mathew Magimai-Doss. [doi]
- Generalizable Audio Deepfake Detection via Hierarchical Structure Learning and Feature Whitening in Poincaré sphereMingru Yang, Yanmei Gu, Qianhua He, Yanxiong Li, Peirong Zhang, Yongqiang Chen, Zhiming Wang, Huijia Zhu, Jian Liu, Weiqiang Wang. [doi]
- Benchmarking Neural Speech Codec Intelligibility with SIToolAnna Leschanowsky, Kishor Kayyar Lakshminarayana, Anjana Rajasekhar, Lyonel Behringer, Ibrahim Kilinc, Guillaume Fuchs, Emanuël A. P. Habets. [doi]
- Label-Context-Dependent Internal Language Model Estimation for CTCZijian Yang, Minh-Nghia Phan, Ralf Schlüter, Hermann Ney. [doi]
- Modeling Probabilistic Reduction using Information Theory and Naive Discriminative LearningAnna Stein, Kevin Tang. [doi]
- Medusa: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic ConditionsGeorgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos. [doi]
- DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion ModelXueyuan Chen, Dongchao Yang, Wenxuan Wu, Minglin Wu, Jing Xu, Xixin Wu, Zhiyong Wu 0001, Helen Meng. [doi]
- Self-supervised learning of speech representations with Dutch archival dataNik Vaessen, Roeland Ordelman, David A. van Leeuwen. [doi]
- FairASR: Fair Audio Contrastive Learning for Automatic Speech RecognitionJongsuk Kim, Jaemyung Yu, Minchan Kwon, Junmo Kim 0002. [doi]
- DepressGEN: Synthetic Data Generation Framework for Depression DetectionWenrui Liang, Rong Zhang, Xuezhen Zhang, Ying Ma, Wei-Qiang Zhang. [doi]
- Pinyin-Guided Chinese Speech Recognition with Large Language ModelJie Zhengjie, Gaofeng Cheng. [doi]
- AusKidTalk: Using Strategic Data Collection and Out-of-Domain Tools to Semi-Automate Novel Corpora AnnotationTünde Szalay, Mostafa Shahin, Tharmakulasingam Sirojan, Zheng Nan, Renata Huang, Kirrie J. Ballard, Beena Ahmed. [doi]
- OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and CleaningYifan Peng 0003, Muhammad Shakeel 0001, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe 0001. [doi]
- Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature AttributionDennis Fucci, Marco Gaido, Matteo Negri, Mauro Cettolo, Luisa Bentivogli. [doi]
- Who, When, and What: Leveraging the "Three Ws" Concept for Emotion Recognition in ConversationXiaohan Shi, Xingfeng Li 0001, Tomoki Toda. [doi]
- Are You Being Sarcastic? Prosodic Cues to Irony Perception in GermanSophia Fünfgeld, Angelika Braun, Katharina Zahner-Ritter. [doi]
- Speaker Normalization and Content Restoration for Zero-Shot Voice Conversion with Attention-Enhanced DiscriminatorDesheng Hu, Yang Xiang, Jian Lu, Xinhui Hu, Xinkang Xu. [doi]
- Disentangling Speaker and Content in Pre-trained Speech Models with Latent Diffusion for Robust Speaker VerificationZhe Li 0030, Man-Wai Mak, Jen-Tzung Chien, Mert Pilanci, Zezhong Jin, Helen Meng. [doi]
- End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled DataAishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan Devarkonda, Santosh Kesiraju, Anil Kumar Vuppala. [doi]
- SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum DomainZixiang Wan, Guochang Zhang, Yifeng He, Jianqiang Wei. [doi]
- The function of creaky voice in South Korean: A perception studyPatrik Hrabánek, Michaela Watkins, Silke Hamann. [doi]
- Unleashing the Inner Monster: Demonstrating High-Fidelity Human to Non-Human Voice ConversionNamhyun Cho, Sunmin Kim, Minsu Kang, Seolhee Lee, Choonghyeon Lee, Yangsun Lee. [doi]
- Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question AnsweringEbru Arisoy, Merve Ünlü Menevse, Yusufcan Manav, Arzucan Özgür. [doi]
- CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt TuningJiacheng Shi, Yanfu Zhang, Ye Gao. [doi]
- Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake DetectionFalih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia 0001. [doi]
- Neurodyne: Neural Pitch Manipulation with Representation Learning and Cycle-Consistency GANYicheng Gu, Chaoren Wang, Zhizheng Wu 0001, Lauri Juvela. [doi]
- MMLoRA: Multitask Memory Parameter-Efficient Fine-Tuning for Multimodal SERYuanbo Fang, Xiaofen Xing, Xueru Li, Weibin Zhang, Xiangmin Xu. [doi]
- FoleyMaster: High-Quality Video-to-Audio Synthesis via MLLM-Augmented Prompt Tuning and Joint Semantic-Temporal AdaptationLiming Liang, Luo Chen, Yuehan Jin, Xianwei Zhuang, Yuxin Xie, Yongkang Yin, Yuexian Zou. [doi]
- Regularizing Learnable Feature Extraction for Automatic Speech RecognitionPeter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney. [doi]
- CAPR: Confidence-Aware Prompt Refinement in Large Language ModelsJen-Tzung Chien, Po-Chun Huang. [doi]
- Word stress in self-supervised speech models: A cross-linguistic comparisonMartijn Bentum, Louis ten Bosch, Tomas O. Lentz. [doi]
- Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio DomainsTakanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Kohei Matsuura, Shota Horiguchi. [doi]
- Synthesizing Speech with Selected Perceptual Voice Qualities - A Case Study with Creaky VoiceFrederik Rautenberg, Fritz Seebauer, Jana Wiechmann, Michael Kuhlmann, Petra Wagner, Reinhold Haeb-Umbach. [doi]
- PeriodCodec: A Pitch-Controllable Neural Audio Codec Using Periodic Signals for Singing Voice SynthesisMasato Takagi, Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda. [doi]
- Multi-task learning for speech emotion recognition in naturalistic conditionsBartlomiej Zgórzynski, Juliusz Wójtowicz-Kruk, Piotr Masztalski, Wladyslaw Sredniawa. [doi]
- Improving User Impression of Spoken Dialogue Systems by Controlling Para-linguistic Expression Based on IntimacyShoki Kawanishi, Akinori Ito, Yuya Chiba, Takashi Nose. [doi]
- RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow MatchingHyun-Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han 0003, Eunwoo Song. [doi]
- EME-TTS: Unlocking the Emphasis and Emotion Link in Speech SynthesisHaoxun Li, Leyuan Qu, Jiaxi Hu, Taihao Li. [doi]
- Text-Enhanced Audio Encoder for Large Language Model based Speech Recognition via Cross-Modality Pre-training with Unpaired Audio-Text DataHang Su, Yuxiang Kong, Lichun Fan, Jian Luan 0001. [doi]
- LATE: Open Source Toolkit for Latvian and Latgalian Speech TranscriptionArturs Znotins, Didzis Gosko, Normunds Gruzitis. [doi]
- Efficient Multilingual ASR Finetuning via LoRA Language ExpertsJiahong Li, Yiwen Shao, Jianheng Zhuo, Chenda Li, Liliang Tang, Dong Yu 0001, Yanmin Qian. [doi]
- Length Aware Speech Translation for Video DubbingAswin Shanmugam Subramanian, Harveen Singh Chadha, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li 0001. [doi]
- CommissionsQC: a Québec French Speech Corpus for Automatic Speech RecognitionCoralie Serrand, Amira Morsli, Gilles Boulianne. [doi]
- Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASRWeiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang 0012, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg. [doi]
- Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic TransformXiangzhu Kong, Hao Huang, Zhijian Ou. [doi]
- Benchmarking and Confidence Evaluation of LALMs For Temporal ReasoningDebarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy. [doi]
- Multistage Universal Speech Enhancement System for URGENT ChallengeXiaohuai Le, Zhuangqi Chen, Siyu Sun, Xianjun Xia, Chuanzeng Huang. [doi]
- Effect of Loudspeaker Emitted Speech on ASR performanceVikram C. M, Sanjoy Pal, Nidhi Mantri, Gopal Kumar Agrawal. [doi]
- CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language ModelsJiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda. [doi]
- AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and RecognitionYuhang Dai, He Wang, Xingchen Li, Zihan Zhang, Shuiyuan Wang, Lei Xie, Xin Xu, Hongxiao Guo, Shaoji Zhang, Hui Bu, Wei Chen. [doi]
- Emotion-Guided Graph Attention Networks for Speech-Based Depression Detection under Emotion-Inducting TasksYuqiu Zhou, Yongjie Zhou, Yudong Yang, Yang Liu, Jun Huang, Shuzhi Zhao, Rongfeng Su, Lan Wang, Nan Yan. [doi]
- Acoustic Features of Mandarin Tone Production in Noise: A Comparison Between Chinese Native Speakers and Korean L2 LearnersJinxin Ji, Yiying Hu, Xiaohu Yang, Gang Peng. [doi]
- Zero-Shot Mono-to-Binaural Speech SynthesisAlon Levkovitch, Julian Salazar, Soroosh Mariooryad, R. J. Skerry-Ryan, Nadav Bar, W. Bastiaan Kleijn, Eliya Nachmani. [doi]
- Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial AnimationHyung Kyu Kim, Hak Gu Kim. [doi]
- Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker ExtractionZexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou 0003, Yukun Ma, Chong Zhang 0003, Bin Ma 0001. [doi]
- Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition DifficultyHongfei Xue, Yufeng Tang, Jun Zhang, Xuelong Geng, Lei Xie 0001. [doi]
- Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech RecognitionChanghan Oh, Kiyoung Park, Jeom Ja Kang, Woo-Yong Choi, Hwa Jeon Song. [doi]
- Large Language Models based ASR Error Correction for Child ConversationsAnfeng Xu, TianTian Feng, So-Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan. [doi]
- Decoding Listener's Identity: Person Identification from EEG Signals Using a Lightweight Spiking TransformerZheyuan Lin, Siqi Cai 0002, Haizhou Li 0001. [doi]
- A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource LanguagesSumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful. [doi]
- GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting ExamplesHarry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun-Jin Park, Dhruuv Agarwal, Quan Wang. [doi]
- VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice ConversionJoon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh, Seong-Whan Lee. [doi]
- Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial UseTitouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier C. van Dalen. [doi]
- Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature FusionHonghong Wang, Jing Deng, Fanqin Meng, Rong Zheng. [doi]
- TF-SkiMNet: Speech Enhancement Based on Inplace Modeling and Skipping Memory in Time-Frequency DomainZixuan Li, Shulin He, Jinglin Bai, Xueliang Zhang. [doi]
- Who knows best? Effects of speech disfluencies on incentivized decision-makingAmbika Kirkland, Jens Edlund. [doi]
- Building an Accurate Open-Source Hebrew ASR System through CrowdsourcingYanir Marmor, Yair Lifshitz, Yoad Snapir, Kinneret Misgav. [doi]
- Grammatical Error Detection on Spontaneous Children's Speech Using Iterative Pseudo LabelingChristopher Gebauer, Lars Rumberg, Lars Köhn, Hanna Ehlert, Edith Beaulac, Jörn Ostermann. [doi]
- Selective Auditory Attention Decoding in Naturalistic Conversations Using EEG-Based Speech Envelope Tracking in Multi-Speaker EnvironmentsGabriel Ivucic, Saurav Pahuja, Dashanka De Silva, Tanja Schultz. [doi]
- Generating Consistent Prosodic Patterns from Open-Source TTS SystemsHa Eun Shim, Olivia Yung, Paige Tuttösí, Boey Kwan, Angelica Lim, Yue Wang, H. Henny Yeung. [doi]
- Using Neurogram Similarity Index Measure (NSIM) to Model Hearing Loss and Cochlear Neural DegenerationAhsan J. Cheema, Sunil Puria. [doi]
- From Scarcity to Sufficiency: Speech Recognition Pipeline for Zero-resource LanguageNikolay Karpov, Sofia Kostandian, Nune Tadevosyan, Alexan Ayrapetyan, Andrei Andrusenko, Ara Yeroyan, Mher Yerznkanyan, Vitaly Lavrukhin. [doi]
- Fact-Controlled Diagnosis of Hallucinations in Medical Text SummarizationSuhas BN, Han-Chin Shing, Lei Xu, Mitch Strong, Jon Burnsky, Jessica Ofor, Jordan R. Mason, Susan Chen, Sundararajan Srinivasan, Chaitanya Shivade, Jack Moriarty, Joseph Paul Cohen. [doi]
- Count Your Speakers! Multitask Learning for Multimodal Speaker DiarizationPrabhav Singh, Jesús Villalba 0001, Najim Dehak. [doi]
- Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data FilteringAndrés Carofilis, Pradeep Rangappa, Srikanth Madikeri, Shashi Kumar, Sergio Burdisso, Jeena Prakash, Esaú Villatoro-Tello, Petr Motlícek, Bidisha Sharma, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke. [doi]
- Improving End-to-end Mixed-case ASR with Knowledge Distillation and Integration of Voice Activity CuesSashi Novitasari, Takashi Fukuda, Gakuto Kurata. [doi]
- Spatially Weighted Contrastive Learning for Robust Sound Source LocalizationHyun Soo Kim, Da-Hee Yang, Joon-Hyuk Chang. [doi]
- A Cascaded Multimodal Framework for Automatic Social Communication Severity Assessment in Children with Autism Spectrum DisorderJihyun Mun, SunHee Kim, Minhwa Chung. [doi]
- PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech EditingYou Zhang 0001, Baotong Tian, Lin Zhang 0054, Zhiyao Duan. [doi]
- Why is children's ASR so difficult? Analyzing children's phonological error patterns using SSL-based phoneme recognizersKoharu Horii, Naohiro Tawara, Atsunori Ogawa, Shoko Araki. [doi]
- FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 ChallengeNabarun Goswami, Tatsuya Harada. [doi]
- Attention-Free Dual-Mode ASR with Latency-Controlled Selective State SpacesTakafumi Moriya, Masato Mimura, Kiyoaki Matsui, Hiroshi Sato, Kohei Matsuura. [doi]
- Automatic Speech Recognition for Low-Resourced Middle Eastern LanguagesRazhan Hameed, Sina Ahmadi, Hanah Hadi, Rico Sennrich. [doi]
- HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion RecognitionOrchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- Decoding Speaker-Normalized Pitch from EEG for Mandarin PerceptionJia-xin Chen, Yi-Ming Wang, Ziyu Zhang, Jiayang Han, Yin-Long Liu, Rui Feng, Xiuyuan Liang, Zhen-Hua Ling, Jia-Hong Yuan. [doi]
- A Robust Hybrid ACC-PM Approach for Personal Sound ZonesYaqi Zhu, Lei Zhou, Hongqing Liu 0001, Liming Shi, Lu Gan 0002. [doi]
- SNR-Aligned Consistent Diffusion for Adaptive Speech EnhancementYonghyeon Jun, Beomjun Woo, Myeonghun Jeong, Nam Soo Kim. [doi]
- Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech AnnotationRui Hu, Xiaolong Lin, Jiawang Liu, Shixi Huang, Zhenpeng Zhan. [doi]
- Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASRLonghao Li, Yangze Li, Hongfei Xue, Jie Liu, Shuai Fang, Kai Wang, Lei Xie 0001. [doi]
- Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource LanguagesChin-Jou Li, Eunjung Yeo, KwangHee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen. [doi]
- Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost FunctionKwok Chin Yuen, Jia Qi Yip, Eng Siong Chng. [doi]
- The mutual exclusivity bias of bilingual visually grounded speech modelsDan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper. [doi]
- Efficient and Microphone-Fault-Tolerant 3D Sound Source LocalizationYiyuan Yang, Shitong Xu, Niki Trigoni, Andrew Markham. [doi]
- Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation ModelsOrchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- A Simple-Yet-Effective Data Augmentation Method for Speaker Identification in NovelsWenjie Zhong, Jason Naradowsky, Yusuke Miyao. [doi]
- Hybrid Expert Knowledge and Self-Supervised Learning for Diagnostic Modeling of Adductor Spasmodic and Primary Myotonic DysphoniaZhou Du, Hang Chen, Huijun Ding, Jun Du, Zhen Chen. [doi]
- MSFNet: A Nested Model for Multi-Sampling-Frequency Speech EnhancementVenkatesh Parvathala, K. Sri Rama Murty. [doi]
- OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic VocabularyYui Sudo, Yusuke Fujita, Atsushi Kojima, Tomoya Mizumoto, Lianbo Liu. [doi]
- Visually-Adaptive Guided Robust Speech Recognition with Parameter-Efficient AdaptationZhao Yang, Rui Jiang, Yue Heng Yeo, Xiao Fu, Wei Xi, Jizhong Zhao. [doi]
- "Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language UnderstandingAlkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis. [doi]
- Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme ConditioningHien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto. [doi]
- Towards Frame-level Quality Predictions of Synthetic SpeechMichael Kuhlmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach. [doi]
- BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-AttentionYassine El Kheir, Tim Polzehl, Sebastian Möller 0001. [doi]
- STCON NIST SRE24 System: Composite Speaker Recognition Solution for Challenging ScenariosStepan Malykh, Alexander Anikin, Nikita Khmelev, Anastasia Korenevskaya, Anastasia Zorkina, Sergey Novoselov, Vladislav Marchevskiy, Vladimir Volokhov, Andrey Shulipa, Alexander Kozlov, Alexander Melnikov, Vasiliy Galyuk, Timur Pekhovskiy. [doi]
- Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors ApproachUmberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti. [doi]
- End-to-End Speech Translation Guided by Robust Translation Capability of Large Language ModelYosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi. [doi]
- WhisperMSS: A Two-Stage Framework for Mandarin Singing Transcription and Segmentation Using Pretrained ModelsRuoxuan Liang, Xiangjian Zeng, Zhen Liu, QingQiang Wu, Ruichen Zhang, Le Ren. [doi]
- Breaking Resource Barriers in Speech Emotion Recognition via Data DistillationYi Chang 0004, Zhao Ren, Zhonghao Zhao, Thanh-Tam Nguyen, Kun Qian 0003, Tanja Schultz, Björn W. Schuller. [doi]
- Visual Cues Support Robust Turn-taking Prediction in NoiseSam O'Connor Russell, Naomi Harte. [doi]
- Hybrid HMM-SVM classifier using frication-based features for detection of non-normative sibilant articulation patterns in Polish children's speechZuzanna Miodonska. [doi]
- FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 AccentsSatu Hopponen, Tomi Kinnunen, Alexandre Nikolaev, Rosa González Hautamäki, Lauri Tavi, Einar Meister. [doi]
- Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face GenerationFang Kang, Yin Cao, Haoyu Chen. [doi]
- Multitalker Babble in English Vowel Perception Training: A Comparison between Humans and Neural ModelsWenwei Dong, Alif Silpachai, Catia Cucchiarini, Helmer Strik. [doi]
- Tungnaá In Live Performance: An Implementation Of Interactive Artistic Text-To-VoiceVictor Shepardson, Jonathan Reus, Thor Magnusson. [doi]
- Video-to-Audio Generation with Fine-grained Temporal SemanticsYuchen Hu, Yu Gu, Chenxing Li, Rilin Chen, Dong Yu 0001. [doi]
- Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion RecognitionJule Valendo Halim, Siyi Wang, Hong Jia, Ting Dang. [doi]
- On Retrieval of Long Audios with Complex Text QueriesRuochu Yang, Milind Rao, Harshavardhan Sundar, Anirudh Raju, Aparna Khare, Srinath Tankasala, Di He 0004, Venkatesh Ravichandran. [doi]
- Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt TuningHongli Yang, Yizhou Peng, Hao Huang, Sheng Li. [doi]
- ADI-20: Arabic Dialect Identification dataset and modelsHaroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares. [doi]
- Recreating Neural Activity During Speech Production with Language and Speech Model EmbeddingsOwais Mujtaba Khanday, Pablo Rodríguez San Esteban, Zubair Ahmad Lone, Marc Ouellet, Jose A. Gonzalez-Lopez. [doi]
- Exploiting Echo Path Priors for Enhanced Stereo Acoustic Echo CancellationJinfu Wang, Ziteng Wang, Xin Liu, Yang Liu, Qing Shi, Zhengqiang Luo, Feiran Yang. [doi]
- The Role of Syntactic Structures in Shaping Directionality in Trisyllabic Tone Sandhi: Evidence from Tianjin MandarinSiqi Lu, Hui Feng, Ziyu Xiong. [doi]
- Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection SystemsKwok Chin Yuen, Jia Qi Yip, Zhen Qiu, Chi-Hung Chi, Kwok-Yan Lam. [doi]
- CrossPhon: An Auto Phone Mapping Tool to Streamline Cross-language Modeling for Phone Alignment of Low-resource LanguagesHongchen Wu, Yixin Gu. [doi]
- Improving Bird Classification with Primary Color AdditivesEzhini Rasendiran R, Chandresh Kumar Maurya. [doi]
- LombardTokenizer: Disentanglement and Control of Vocal Effort in a Neural Speech CodecMaxime Jacquelin, Maëva Garnier, Laurent Girin, Rémy Vincent, Olivier Perrotin. [doi]
- Heart Rate as a Proxy Measure to Assess Human Confidence in Spoken SpeechHarish Battula, Gauri Deshpande, Yagna Gudipalli, Sachin Patel. [doi]
- Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic PrecisionZhaoqing Li, Haoning Xu, Zengrui Jin, Lingwei Meng, Tianzi Wang, Huimeng Wang, Youjun Chen, Mingyu Cui, Shujie Hu, Xunying Liu. [doi]
- Automatic Detection and Sub-typing of Primary Progressive Aphasia from Speech: Integrating Task-Specific Features and Spatio-Semantic GraphsFritz Peters, W. Richard Bevan-Jones, Grace Threlfall, Jenny M. Harris, Julie S. Snowden, Matthew Jones, Jennifer C. Thompson, Daniel J. Blackburn, Heidi Christensen. [doi]
- Lessons Learned from the URGENT 2024 Speech Enhancement ChallengeWangyou Zhang, Kohei Saijo, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar 0003, Marvin Sach, Wei Wang 0010, Yihui Fu, Shinji Watanabe 0001, Tim Fingscheidt, Yanmin Qian. [doi]
- Robust Neural Codec Language Modeling with Phoneme Position Prediction for Zero-Shot TTSChunhui Lu, Xue Wen 0002, Liming Song, Junkwang Oh. [doi]
- Beyond Attacks: Advancing Fake Speech Detection with Attack-Agnostic MethodsShilpa Chandra, Akansha Tyagi, Shiven Patel, Padmanabhan Rajan. [doi]
- A Multi-Stream Framework Utilizing 3D Human Reconstruction for Cued Speech RecognitionKaterina Papadimitriou, Gerasimos Potamianos. [doi]
- PERCEPT-US: A Multimodal American English Child Speech Corpus Specialized for Articulatory FeedbackAmanda Eads, Heather Kabakoff, Nina Benway, Elaine Hitchcock, Jonathan L. Preston, Tara McAllister. [doi]
- Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEGSiavash Shams, Richard J. Antonello, Gavin Mischler, Stephan Bickel, Ashesh D. Mehta, Nima Mesgarani. [doi]
- MTSE: Multi-Target Speaker Extraction for Conversation ScenariosThomas Serre, Mathieu Fontaine 0002, Eric Benhaim, Slim Essid. [doi]
- PromptEVC: Controllable Emotional Voice Conversion with Natural Language PromptsTianhua Qi, Shiyan Wang, Cheng Lu 0005, Tengfei Song, Hao Yang 0006, Zhanglin Wu, Wenming Zheng. [doi]
- PARROT: Synergizing Mamba and Attention-based SSL Pre-Trained Models via Parallel Branch Hadamard Optimal Transport for Speech Emotion RecognitionOrchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Jaya Sai Kiran Patibandla, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental SoundsAndrew Chang 0003, Yike Li, Iran R. Roman, David Poeppel. [doi]
- Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event DetectionShangkun Huang, Jing Deng, Jintao Kang, Rong Zheng. [doi]
- Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage FilteringPradeep Rangappa, Andrés Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth R. Madikeri, Esaú Villatoro-Tello, Bidisha Sharma, Petr Motlícek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke. [doi]
- Pre-aspiration in Iceland Is Conditioned by Gender/SexMeike Rommel, Mísa Hejná, Nicole Dehé. [doi]
- Corpus-Based Insights into Mandarin Neutral Tone: Effects of Tonal Context and Structural Patterns in Spontaneous SpeechJingyi Sun, Nicolas Audibert, Yaru Wu, Martine Adda-Decker. [doi]
- GALAXY: A Large-Scale Open-Domain Dataset for Multimodal LearningYihan Wu, Yichen Lu, Yijing Chen, Jiaqi Song, William Chen, Ruihua Song, Shinji Watanabe 0001. [doi]
- Frozen Large Language Models Can Perceive Paralinguistic Aspects of SpeechWonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli. [doi]
- Language-Agnostic Suicidal Risk Detection Using Large Language ModelsJune-Woo Kim, Wonkyo Oh, Haram Yoon, Sung Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang. [doi]
- Exploring the Limits of Conformer CTC-Encoder for Speech Emotion Recognition using Large Language ModelsEdmilson Da Silva Morais, Hagai Aronowitz, Aharon Satt, Ron Hoory, Avihu Dekel, Brian Kingsbury, George Saon. [doi]
- Text Entry for All: Towards Speech-based Multimodal Interaction for Inclusion, Accessibility and the Preservation of the World's Linguistic HeritageJulián Zapata, Lara Hanna. [doi]
- Leveraging Text and Speech Processing for Suicide Risk Classification in Chinese AdolescentsJustyna Krzywdziak, Bartlomiej Eljasiak, Joanna Stepien, Michal Swiatek, Agnieszka Pruszek. [doi]
- First Analyze Then Enhance: A Task-Aware System for Speech Separation, Denoising, and DereverberationShaoxiang Dang, Li Li 0063, Shogo Seki, Hiroaki Kudo. [doi]
- Real-time TSE demonstration via SoundBeam with KDKeigo Wakayama, Tomoko Kawase, Takafumi Moriya, Marc Delcroix, Hiroshi Sato, Tsubasa Ochiai, Masahiro Yasuda, Shoko Araki. [doi]
- ClearerVoice-Studio: Bridging Advanced Speech Processing Research and Practical DeploymentShengkui Zhao, Zexu Pan, Bin Ma 0001. [doi]
- Mitigating Audiovisual Mismatch in Visual-Guide Audio CaptioningLe Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu 0001. [doi]
- Mamba-based Hybrid Model for Speech EnhancementSe-Ha Kim, Tae-Gyeong Kim, Chang-Jae Chun. [doi]
- Better Pseudo-labeling with Multi-ASR Fusion and Error Correction by SpeechLLMJeena Prakash, Blessingh Kumar, Kadri Hacioglu, Bidisha Sharma, Sindhuja Gopalan, Malolan Chetlur, Shankar Venkatesan, Andreas Stolcke. [doi]
- Assessment of L2 Oral Proficiency using Speech Large Language ModelsRao Ma, Mengjie Qian 0001, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J. F. Gales. [doi]
- What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems. Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel F. Garcia Contreras, Koichiro Yoshino. [doi]
- Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR. Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie 0001. [doi]
- Uni-VERSA: Versatile Speech Assessment with a Unified Network. Jiatong Shi, Hye-jin Shim, Shinji Watanabe 0001. [doi]
- What the Filler? Both ASR Systems and Humans Struggle More With Other Kinds of Disfluencies Than With Filler Particles. Saskia Wepner, Lucas Eckert, Gernot Kubin, Barbara Schuppler. [doi]
- Apical vs. Regular Vowel Duration: A Corpus-based Analysis of Contextual Influences in Standard Mandarin. Jingyi Sun, Bowei Shao, Martine Adda-Decker. [doi]
- Pitch Accent Detection improves Pretrained Automatic Speech Recognition. David Sasu, Natalie Schluter. [doi]
- Does English fish sound like French fiche? Perceptual similarity judgments versus acoustic similarity. Rory Turnbull, Elisa Kiefer, Sharon Peperkamp. [doi]
- Leveraging Self-Supervised Learning Based Speaker Diarization for MISP 2025 AVSD Challenge. Zeyan Song, Tianchi Sun, Ronghui Hu, Kai Chen 0029, Jing Lu. [doi]
- LitMAS: A Lightweight and Generalized Multi-Modal Anti-Spoofing Framework for Biometric Security. Nidheesh Gorthi, Kartik Thakral, Rishabh Ranjan, Richa Singh 0001, Mayank Vatsa. [doi]
- French schwa is not acoustically distinct from its two lexical neighbors /ø/ and /œ/. Mathilde Hutin, Mélanie Lancien, Noam Faust. [doi]
- Towards Classification of Typical and Atypical Disfluencies: A Self Supervised Representation Approach. Priyanka Kommagouni, Pragya Khanna, Vamshiraghusimha Narasinga, Anirudh Bocha, Anil Kumar Vuppala. [doi]
- Differentiable Reward Optimization for LLM based TTS system. Changfeng Gao, Zhihao Du, Shiliang Zhang. [doi]
- Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts. Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho. [doi]
- Influence of wall coverings of 3D-printed vocal tract models on measured transfer functions. Peter Birkholz, Dominik Schäfer, Patrick Häsner, Jihyeon Yun, Iris Kruppke, Rémi Blandin. [doi]
- Speech Reference Intervals: An Assessment of Feasibility in Depression Symptom Severity Prediction. Lauren L. White, Ewan Carr, Judith Dineley, Catarina Botelho, Pauline Conde, Faith Matcham, Carolin Oetzmann, Amos Folarin, George Fairs, Agnes Norbury, Stefano Goria, Srinivasan Vairavan, Til Wykes, Richard J. B. Dobson, Vaibhav Naraya, Matthew Hotopf, Alberto Abad, Isabel Trancoso, Nicholas Cummins. [doi]
- MOVER: Combining Multiple Meeting Recognition Systems. Naoyuki Kamo, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani. [doi]
- Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier. Tarek Kunze, Marianne Métais, Hadrien Titeux, Lucas Elbert, Joseph Coffey, Emmanuel Dupoux, Alejandrina Cristià, Marvin Lavechin. [doi]
- REB-former: RWKV-enhanced E-branchformer for Speech Recognition. Jie Song, Wang Xiang, Jian Zhou 0006, Cunhang Fan, Zhao Lv. [doi]
- SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation. Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari. [doi]
- An Effective Training Framework for Light-Weight Automatic Speech Recognition Models. Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman. [doi]
- Evaluating Deep Speaker Embedding Robustness to Domain, Sampling Rate, and Codec Variations. Alexandre Ferro Filho, Diogo Fernandes Costa Silva, Pedro Elias Engelberg Silva Borges, Arlindo Rodrigues Galvão Filho. [doi]
- EATS-Speech: Emotion-Adaptive Transformation and Priority Synthesis for Zero-Shot Text-to-Speech. Jingyuan Xing, Zhipeng Li, Shuaiqi Chen, Xiaofen Xing, Xiangmin Xu. [doi]
- J-SPAW: Japanese speaker verification and spoofing attacks recorded in-the-wild dataset. Sayaka Shiota, Suzuka Horie, Kouta Kanno, Shinnosuke Takamichi. [doi]
- Impact of Background Noise on Turn-Taking Dynamics in Triadic Conversations. Valeska Slomianka, Tobias May, Torsten Dau. [doi]
- Physiologically-Informed Feature Analysis of Acquired Speech Disorders for Stroke Assessment. Giulia Sanguedolce, Jón Guðnason, Dragos-Cristian Gruia, Emilie D'Olne, Fatemeh Geranmayeh, Patrick A. Naylor. [doi]
- SGED-Probe: Probing E2E ASR decoder and aligner for spoken grammar error detection under three speaking practice conditions. Chowdam Venkata Thirumala Kumar, Chiranjeevi Yarra. [doi]
- Mispronunciation Detection Without L2 Pronunciation Dataset in Low-Resource Setting: A Case Study in Finland Swedish. Nhan Phan, Mikko Kuronen, Maria Kautonen, Riikka Ullakonoja, Anna von Zansen, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo. [doi]
- Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model. Kiyotada Mori, Seiya Kawano, Angel F. Garcia Contreras, Koichiro Yoshino. [doi]
- Code Mix TTS: An Approach to Infer Human Like Speech for Multi-Lingual Input Texts. Vishal Gourav, Phanindra Mankale. [doi]
- ClaritySpeech: Dementia Obfuscation in Speech. Dominika C. Woszczyk, Ranya Aloufi, Soteris Demetriou. [doi]
- A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater's Shadowing and Sequence-to-sequence Voice Conversion. Haopeng Geng, Daisuke Saito, Nobuaki Minematsu. [doi]
- EEG-based Speech Decoding Based on Multi-mode Joint Modeling. Peiran Li, Fei Chen, Xixin Wu. [doi]
- VCapAV: A Video-Caption Based Audio-Visual Deepfake Detection Dataset. Yuxi Wang, Yikang Wang, Qishan Zhang, Hiromitsu Nishizaki, Ming Li. [doi]
- Tonal Contrasts in the Malipo Variety of the Mienic Language. Changhong Du, Fang Hu. [doi]
- Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework. Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon. [doi]
- A Three-Stage Beamforming with Harmonic Guidance for Multi-Channel Speech Enhancement. Nurali Alip, Tianrui Wang, Rui Cao, Meng Ge, Jingru Lin, Longbiao Wang, Jianwu Dang 0001. [doi]
- SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification. Théo Lepage, Réda Dehak. [doi]
- GigaAM: Efficient Self-Supervised Learner for Speech Recognition. Aleksandr Kutsakov, Alexandr Maximenko, Georgii Gospodinov, Pavel Bogomolov, Fyodor Minkin. [doi]
- Boosting StoRM Convergence with Metric Guidance and Non-uniform State-Sampling for Optimal Dereverberation. Chandra Mohan Sharma, Arnab Kumar Roy, Anupam Mandal, Prasanta Kumar Ghosh, Prasanna Kumar Kr. [doi]
- GoP2Vec: A few shot learning for pronunciation assessment with goodness of pronunciation (GoP) based representations from an i-vector framework and augmentation. Meenakshi Sirigiraju, Chiranjeevi Yarra. [doi]
- Audio Deepfake Source Tracing using Multi-Attribute Open-Set Identification and Verification. Pierre Falez, Tony Marteau, Damien Lolive, Arnaud Delhay. [doi]
- Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion. Maxim Markitantov, Elena Ryumina, Heysem Kaya, Alexey Karpov 0001. [doi]
- A Domain Robust Pre-Training Method with Local Prototypes for Speaker Verification. Qing Gu, Yan Song, Haoyu Song, Nan Jiang, Lirong Dai, Ian McLoughlin 0001. [doi]
- Tonal Perception in Changde Mandarin. Zhenrui Zhang, Fang Hu. [doi]
- Multi-view Fusion and Parameter Perturbation for Few-Shot Class-Incremental Audio Classification. Yulu Fang, Mingyue He, Qisheng Xu, Jianqiao Zhao, Cheng Yang 0004, Kele Xu, Yong Dou. [doi]
- Test-Time Training for Speech-based Depression Detection. Sri Harsha Dumpala, Chandramouli Shama Sastry, Rudolf Uher, Sageev Oore. [doi]
- Efficient Noise-Robust Hybrid Audiovisual Encoder with Joint Distillation and Pruning for Audiovisual Speech Recognition. Zhengyang Li, Pascal Reichert, Thomas Graave, Patrick Blumenberg, Tim Fingscheidt. [doi]
- WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing. Oguzhan Baser, Ahmet Ege Tanriverdi, Kaan Kale, Sandeep Chinchali, Sriram Vishwanath. [doi]
- Synchronous analysis of abnormal acoustic and linguistic production in Parkinson's speech. Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Sabato Marco Siniscalchi, Adolfo M. García, Yamile Bocanegra, Leonardo Moreno, Elmar Nöth, Juan Rafael Orozco-Arroyave. [doi]
- Unsupervised Rhythm and Voice Conversion to Improve ASR on Dysarthric Speech. Karl El Hajal, Enno Hermann, Sevada Hovsepyan, Mathew Magimai-Doss. [doi]
- Towards Inclusive and Fair ASR: Insights from the SAPC Challenge for Optimizing Disordered Speech Recognition. Nada Gohider, Otman Basir. [doi]
- A Comparative Study on Proactive and Passive Detection of Deepfake Speech. Chia-Hua Wu, Wanying Ge, Xin Wang 0037, Junichi Yamagishi, Yu Tsao 0001, Hsin-Min Wang. [doi]
- Effective Context in Neural Speech Models. Yen Meng, Sharon Goldwater, Hao Tang 0002. [doi]
- Evaluating Speech Foundation Models for Automatic Speech Recognition in the Low-Resource Kanyen'kéha Language. Mengzhe Geng, Patrick Littell, Aidan Pine, Robbie Jimerson, Gilles Boulianne, Vishwa Gupta, Rolando Coto-Solano, Anna Kazantseva, Marc Tessier, Delaney Lothian, Akwiratékha' Martin, Eric Joanis, Samuel Larkin, Roland Kuhn 0001. [doi]
- Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction. Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- SQ-AST: A Transformer-Based Model for Speech Quality Prediction. Wafaa Wardah, Robert P. Spang, Vincent Barriac, Jan Reimes, Anna Llagostera, Jens Berger, Sebastian Möller 0001. [doi]
- On Apical Vowels in Eastern Zhenjiang Mandarin. Xuying Wang, Fang Hu. [doi]