- Visual scene display application for augmentative and alternative communication. Karthik Venkat Sridaran, Raja Praveen, Reuben T. Varghese, Ajish K. Abraham, Shankar R, Winnie Rachel Cherian.
- Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning. Arnav Goel, Medha Hira, Anubha Gupta.
- Ethnolinguistic Identification of Vietnamese-German Heritage Speech. Thanh Lan Truong, Andrea Weber.
- SWiBE: A Parameterized Stochastic Diffusion Process for Noise-Robust Bandwidth Expansion. Yin-Tse Lin, Shreya G. Upadhyay, Bo-Hao Su, Chi-Chun Lee.
- ParaCLAP - Towards a general language-audio model for computational paralinguistic tasks. Xin Jing, Andreas Triantafyllopoulos, Björn W. Schuller.
- Classification of Room Impulse Responses and its application for channel verification and diarization. Yuri Y. Khokhlov, Tatiana Prisyach, Anton Mitrofanov, Dmitry Dutov, Igor Agafonov, Tatiana Timofeeva, Aleksei Romanenko, Maxim Korenevsky.
- Leveraging Language Model Capabilities for Sound Event Detection. Hualei Wang, Jianguo Mao, Zhifang Guo, Jiarui Wan, Hong Liu 0007, Xiangdong Wang.
- Towards Speech-to-Pictograms Translation. Cécile Macaire, Chloé Dion, Didier Schwab, Benjamin Lecouteux, Emmanuelle Esperança-Rodier.
- Pre-training Feature Guided Diffusion Model for Speech Enhancement. Yiyuan Yang, Niki Trigoni, Andrew Markham.
- Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation. Tsun-An Hsieh, Heeyoul Choi, Minje Kim.
- Leveraging Universal Speech Representations for Detecting and Assessing the Severity of Mild Cognitive Impairment Across Languages. Anna Favaro, Tianyu Cao 0003, Najim Dehak, Laureano Moro-Velázquez.
- Just Because We Camp, Doesn't Mean We Should: The Ethics of Modelling Queer Voices. Atli Sigurgeirsson, Eddie L. Ungless.
- Enhancing No-Reference Speech Quality Assessment with Pairwise, Triplet Ranking Losses, and ASR Pretraining. Bao Thang Ta, Minh Tu Le, Van Hai Do, Huynh Thi Thanh Binh.
- Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations. Guillem Bonafos, Clara Bourot, Pierre Pudlo, Jean-Marc Freyermuth, Laurence Reboul, Samuel Tronçon, Arnaud Rey.
- Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice. Shubham Gupta, Mirco Ravanelli, Pascal Germain, Cem Subakan.
- Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss. Jesuraj Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh.
- STraDa: A Singer Traits Dataset. Yuexuan Kong, Viet-Anh Tran, Romain Hennequin.
- CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting. Sichen Jin, Youngmoon Jung, Seungjin Lee, Jaeyoung Roh, Changwoo Han, Hoonyoung Cho.
- Enhanced Feature Learning with Normalized Knowledge Distillation for Audio Tagging. Yuwu Tang, Ziang Ma, Haitao Zhang.
- Multimodal Continuous Fingerspelling Recognition via Visual Alignment Learning. Katerina Papadimitriou, Gerasimos Potamianos.
- CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge. Chen Chen, Zehua Liu, Xiaolou Li, Lantian Li, Dong Wang.
- Exploring the Benefits of Tokenization of Discrete Acoustic Units. Avihu Dekel, Raul Fernandez.
- On Disfluency and Non-lexical Sound Labeling for End-to-end Automatic Speech Recognition. Péter Mihajlik, Yan Meng, Mate S. Kadar, Julian Linke, Barbara Schuppler, Katalin Mády.
- Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection. Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao 0001, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi.
- Macro-descriptors for Alzheimer's disease detection using large language models. Catarina Botelho, John Mendonça, Anna Pompili, Tanja Schultz, Alberto Abad, Isabel Trancoso.
- Predicting Heart Activity from Speech using Data-driven and Knowledge-based features. Gasser Elbanna, Zohreh Mostaani, Mathew Magimai-Doss.
- Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models. Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang.
- Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata. Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao.
- Graph Attention Based Multi-Channel U-Net for Speech Dereverberation With Ad-Hoc Microphone Arrays. Hongmei Guo, Yijiang Chen, Xiaolei Zhang 0001, Xuelong Li 0001.
- Key Acoustic Cues for the Realization of Metrical Prominence in Tone Languages: A Cross-Dialect Study. Yiying Hu, Hui Feng.
- TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking. Junzuo Zhou, Jiangyan Yi, Tao Wang 0074, Jianhua Tao 0001, Ye Bai, Chu-Yuan Zhang, Yong Ren, Zhengqi Wen.
- Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling? TianTian Feng, Dimitrios Dimitriadis, Shrikanth S. Narayanan.
- Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions. Anfeng Xu, Kevin Huang, TianTian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan.
- DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance. Jinhyeok Yang, Junhyeok Lee, Hyeong-Seok Choi, Seunghoon Ji, Hyeongju Kim, Juheon Lee.
- Noise-robust Speech Separation with Fast Generative Correction. Helin Wang, Jesús Villalba 0001, Laureano Moro-Velázquez, Jiarui Hai, Thomas Thebaud, Najim Dehak.
- Impact of the tonal factor on diphthong realizations in Standard Mandarin with Generalized Additive Mixed Models. Chenyu Li, Jalal Al-Tamimi.
- Quantification of stylistic differences in human- and ASR-produced transcripts of African American English. Annika Heuser, Tyler Kendall, Miguel Del Rio 0001, Quinn McNamara, Nishchal Bhandari, Corey Miller, Migüel Jetté.
- Disentangling Age and Identity with a Mutual Information Minimization for Cross-Age Speaker Verification. Fengrun Zhang, Wangjin Zhou, Yiming Liu, Wang Geng, Yahui Shan, Chen Zhang.
- AG-LSEC: Audio Grounded Lexical Speaker Error Correction. Rohit Paturi, Xiang Li, Sundararajan Srinivasan.
- Towards Classifying Mother Tongue from Infant Cries - Findings Substantiating Prenatal Learning Theory. Tim Polzehl, Tim Herzig, Friedrich Wicke, Kathleen Wermke, Razieh Khamsehashari, Michiko Dahlem, Sebastian Möller 0001.
- A Transformer-Based Voice Activity Detector. Biswajit Karan, Joshua Jansen van Vüren, Febe de Wet, Thomas Niesler.
- Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment. Yiwen Shao, Shi-Xiong Zhang 0001, Yong Xu, Meng Yu, Dong Yu 0001, Daniel Povey, Sanjeev Khudanpur.
- An End-to-End Speech Summarization Using Large Language Model. Hengchao Shang, Zongyao Li, Jiaxin Guo, Shaojun Li, Zhiqiang Rao, Yuanchang Luo, Daimeng Wei, Hao Yang.
- Exploring the anatomy of articulation rate in spontaneous English speech: relationships between utterance length effects and social factors. James Tanner, Morgan Sonderegger, Jane Stuart-Smith, Tyler Kendall, Jeff Mielke, Robin Dodsworth, Erik Thomas.
- EEND-M2F: Masked-attention mask transformers for speaker diarization. Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon.
- An Uyghur Extension to the MASSIVE Multi-lingual Spoken Language Understanding Corpus with Comprehensive Evaluations. Ainikaerjiang Aimaiti, Di Wu, Liting Jiang, Gulinigeer Abudouwaili, Hao Huang, Wushour Silamu.
- Exploring Sentence Type Effects on the Lombard Effect and Intelligibility Enhancement: A Comparative Study of Natural and Grid Sentences. Hongyang Chen, Yuhong Yang 0001, Zhongyuan Wang 0001, Weiping Tu, Haojun Ai, Cedar Lin.
- SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding. Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya.
- An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS. Xiaofei Wang 0009, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li 0001, Naoyuki Kanda.
- WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark. Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie.
- MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research. Song Li, Yongbin You, Xuezhi Wang 0008, Zhengkun Tian, Ke Ding, Guanglu Wan.
- The reasonable effectiveness of speaker embeddings for violence detection. Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma 0002.
- DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation. Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan 0008, Ji Zhang 0011, Kai Yu 0004, Mengyue Wu.
- ZeroST: Zero-Shot Speech Translation. Sameer Khurana, Chiori Hori, Antoine Laurent, Gordon Wichern, Jonathan Le Roux.
- Switching Tongues, Sharing Hearts: Identifying the Relationship between Empathy and Code-switching in Speech. Debasmita Bhattacharya, Eleanor Lin, Run Chen, Julia Hirschberg.
- MaLa-ASR: Multimedia-Assisted LLM-Based ASR. Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen 0001.
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment. Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang 0012, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee.
- PhoneViz: exploring alignments at a glance. Margot Masson, Erfan A. Shams, Iona Gessinger, Julie Carson-Berndsen.
- Multimodal Fusion of Music Theory-Inspired and Self-Supervised Representations for Improved Emotion Recognition. Xiaohan Shi, Xingfeng Li 0001, Tomoki Toda.
- Multimodal Digital Biomarkers for Longitudinal Tracking of Speech Impairment Severity in ALS: An Investigation of Clinically Important Differences. Michael Neumann, Hardik Kothare, Jackson Liscombe, Emma C. L. Leschly, Oliver Roesler, Vikram Ramanarayanan.
- Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning. Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le 0001, Bowen Shi, Wei-Ning Hsu.
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition. Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang 0001.
- Missingness-resilient Video-enhanced Multimodal Disfluency Detection. Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu 0002.
- Temporal Co-Registration of Simultaneous Electromagnetic Articulography and Electroencephalography for Precise Articulatory and Neural Data Alignment. Daniel Friedrichs, Monica Lancheros, Sam Kirkham, Lei He 0021, Andrew Clark, Clemens Lutz, Volker Dellwo, Steven Moran.
- Participant-Pair-Wise Bottleneck Transformer for Engagement Estimation from Video Conversation. Keita Suzuki, Nobukatsu Hojo, Kazutoshi Shinoda, Saki Mizuno, Ryo Masumura.
- Hear Your Face: Face-based voice conversion with F0 estimation. Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee.
- Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection. Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li.
- ClariTTS: Feature-ratio Normalization and Duration Stabilization for Code-mixed Multi-speaker Speech Synthesis. ChangHwan Kim.
- Self-Supervised Learning with Multi-Head Multi-Mode Knowledge Distillation for Speaker Verification. Zezhong Jin, Youzhi Tu, Man-Wai Mak.
- Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps. Mattias Nilsson, Riccardo Miccini, Clement Laroche, Tobias Piechowiak, Friedemann Zenke.
- Non-Linear Inference Time Intervention: Improving LLM Truthfulness. Jakub Hoscilowicz, Adam Wiacek, Jan Chojnacki, Adam Cieslak, Leszek Michon, Artur Janicki.
- Bridging Emotions Across Languages: Low Rank Adaptation for Multilingual Speech Emotion Recognition. Lucas Goncalves, Donita Robinson, Elizabeth Richerson, Carlos Busso.
- 2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval. Jiajun He, Tomoki Toda.
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models. Minh Nguyen 0007, Franck Dernoncourt, Seunghyun Yoon 0002, Hanieh Deilamsalehy, Hao Tan 0002, Ryan A. Rossi, Quan Hung Tran, Trung Bui, Thien Huu Nguyen.
- LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation. Wenhao Guan, Kaidi Wang, Wangjin Zhou, Yang Wang, Feng Deng, Hui Wang, Lin Li, Qingyang Hong, Yong Qin.
- RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection. Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang 0006, Shunbo Dong, Siding Zeng, Jianhua Tao 0001, Zhao Lv, Cunhang Fan.
- Post-Net: A linguistically inspired sequence-dependent transformed neural architecture for automatic syllable stress detection. Sai Harshitha Aluru, Jhansi Mallela, Chiranjeevi Yarra.
- Enhancing CTC-based speech recognition with diverse modeling units. Shiyi Han, Mingbin Xu, Zhihong Lei, Zhen Huang 0001, Xingyu Na.
- LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition. Eunseop Yoon, Hee Suk Yoon, John B. Harvill, Mark Hasegawa-Johnson, Chang D. Yoo.
- Clever Hans Effect Found in Automatic Detection of Alzheimer's Disease through Speech. Yin-Long Liu, Rui Feng, Jia-Hong Yuan, Zhen-Hua Ling.
- Navigating the Minefield of MT Beam Search in Cascaded Streaming Speech Translation. Rastislav Rabatin, Frank Seide, Ernie Chang.
- FVTTS: Face Based Voice Synthesis for Text-to-Speech. Minyoung Lee 0003, Eunil Park, Sungeun Hong.
- Revisiting Pitch Jumps: F0 Ratio in Seoul Korean. Michaela Watkins, Paul Boersma, Silke Hamann.
- SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization. Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Atsushi Ando, Ryo Masumura.
- TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024. Joonas Kalda, Tanel Alumäe, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer.
- Transmitted and Aggregated Self-Attention for Automatic Speech Recognition. Tian-Hao Zhang, Xinyuan Qian 0001, Feng Chen 0040, Xu-Cheng Yin.
- Are Paralinguistic Representations all that is needed for Speech Emotion Recognition? Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma 0002.
- Self-Supervised Speech Representations are More Phonetic than Semantic. KwangHee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, Shinji Watanabe 0001.
- GPA: Global and Prototype Alignment for Audio-Text Retrieval. Yuxin Xie 0004, Zhihong Zhu, Xianwei Zhuang, Liming Liang, Zhichang Wang, Yuexian Zou.
- SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition. Tianhao Wang, Lantian Li, Dong Wang.
- Speech dereverberation constrained on room impulse response characteristics. Louis Bahrman, Mathieu Fontaine 0002, Jonathan Le Roux, Gaël Richard.
- FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching. Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung.
- LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks. Amit Meghanani, Thomas Hain.
- MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels. Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe 0001.
- Introduction To Partial Fine-tuning: A Comprehensive Evaluation Of End-to-end Children's Automatic Speech Recognition Adaptation. Thomas Rolland, Alberto Abad.
- Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation. Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li.
- Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses. David Doukhan, Lena Dodson, Manon Conan, Valentin Pelloin, Aurélien Clamouse, Mélina Lepape, Géraldine Van Hille, Cécile Méadel, Marlène Coulomb-Gully.
- Characterizing code-switching: Applying Linguistic Principles for Metric Assessment and Development. Jie Chi, Electra Wallington, Peter Bell 0001.
- Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers' ability to model hesitation phenomena. Vincenzo Norman Vitale, Loredana Schettino, Francesco Cutugno.
- Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention. Vladimir Despotovic, Abir Elbéji, Petr V. Nazarov, Guy Fagherazzi.
- The influence of L2 accent strength and different error types on personality trait ratings. Sarah Wesolek, Piotr Gulgowski, Joanna Blaszczak, Marzena Zygis.
- Form and Function in Prosodic Representation: In the Case of 'ma' in Tianjin Mandarin. Tianqi Geng, Hui Feng.
- X-Singer: Code-Mixed Singing Voice Synthesis via Cross-Lingual Learning. Ji-Sang Hwang, HyeongRae Noh, Yoonseok Hong, Insoo Oh.
- Spoofing Speech Detection by Modeling Local Spectro-Temporal and Long-term Dependency. Haochen Wu, Wu Guo, Zhentao Zhang, Wenting Zhao, Shengyu Peng, Jie Zhang.
- VSASV: a Vietnamese Dataset for Spoofing-Aware Speaker Verification. Vu Hoang, Viet-Thanh Pham, Hoa Nguyen Xuan, Pham Nhi, Phuong Dat, Thi Thu Trang Nguyen.
- DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion. Ziqian Ning, Shuai Wang, Pengcheng Zhu 0004, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi.
- Bilingual Rhotic Production Patterns: A Generational Comparison of Spanish-English Bilingual Speakers in Canada. Ioana Colgiu, Laura Spinu, Rajiv Rao, Yasaman Rafat.
- Whispering in Norwegian: Navigating Orthographic and Dialectic Challenges. Per Egil Kummervold, Javier de la Rosa 0001, Freddy Wetjen, Rolv-Arild Braaten, Per Erik Solberg.
- A Comparative Analysis of Federated Learning for Speech-Based Cognitive Decline Detection. Stefan Kalabakov, Monica González Machorro, Florian Eyben, Björn W. Schuller, Bert Arnrich.
- FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation. Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani.
- QHM-GAN: Neural Vocoder based on Quasi-Harmonic Modeling. Shaowen Chen, Tomoki Toda.
- Enhancing ECAPA-TDNN with Feature Processing Module and Attention Mechanism for Speaker Verification. Shiu-Hsiang Liou, Po-Cheng Chan, Chia-Ping Chen, Tzu-Chieh Lin, Chung-Li Lu, Yu-Han Cheng, Hsiang-Feng Chuang, Wei-Yu Chen.
- OR-TSE: An Overlap-Robust Speaker Encoder for Target Speech Extraction. Yiru Zhang, Linyu Yao, Qun Yang.
- Acquisition of high vowel devoicing in Japanese: A production experiment with three and four year olds. Hyun Kyung Hwang, Manami Hirayama.
- Self-Supervised Speaker Verification with Mini-Batch Prediction Correction. Junxu Wang, Zhihua Fang, Liang He.
- Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor. Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He.
- RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios. Yiwen Shao, Shi-Xiong Zhang 0001, Dong Yu 0001.
- A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting. Yuang Li, Min Zhang, Chang Su 0001, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang.
- Exploring the Complementary Nature of Speech and Eye Movements for Profiling Neurological Disorders. Yuzhe Wang, Anna Favaro, Thomas Thebaud, Jesús Villalba 0001, Najim Dehak, Laureano Moro-Velázquez.
- Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer. Slava Shechtman, Avihu Dekel.
- The Whole Is Bigger Than the Sum of Its Parts: Modeling Individual Annotators to Capture Emotional Variability. James Tavernor, Yara El-Tawil, Emily Mower Provost.
- As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research. Wiebke Hutiri, Tanvina Patel, Aaron Yi Ding, Odette Scharenborg.
- Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis. Xintong Wang, Mingqian Shi, Ye Wang.
- Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning. Zhenyu Wang, Shuyu Kong, Li Wan, Biqiao Zhang, Yiteng Huang, Mumin Jin, Ming Sun, Xin Lei, Zhaojun Yang.
- SWAN: SubWord Alignment Network for HMM-free word timing estimation in end-to-end automatic speech recognition. Woo Hyun Kang, Srikanth Vishnubhotla, Rudolf Braun, Yogesh Virkar, Raghuveer Peri, Kyu J. Han.
- AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning. Jongsuk Kim, Jiwon Shin, Junmo Kim 0002.
- HarmoNet: Partial DeepFake Detection Network based on Multi-scale HarmoF0 Feature Fusion. Liwei Liu, Huihui Wei, Dongya Liu, Zhonghua Fu.
- MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection. Da Mu, Zhicheng Zhang, Haobo Yue.
- Text-aware and Context-aware Expressive Audiobook Speech Synthesis. Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, WenJie Tian, Lei Xie.
- From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs. Minxue Niu, Mimansa Jaiswal, Emily Mower Provost.
- MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization. Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma 0001, Lu Yin 0006, Qiao Xiao, Stavros Petridis, Shiwei Liu 0003, Maja Pantic.
- A data-driven model of acoustic speech intelligibility for optimization-based models of speech production. Benjamin Elie, Juraj Simko, Alice Turk.
- MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance. Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim.
- MMSD-Net: Towards Multi-modal Stuttering Detection. Liangyu Nie, Sudarsana Reddy Kadiri, Ruchit Agrawal.
- Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer. Liming Wang, Yuan Gong 0001, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James R. Glass.
- Emo-bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition. Yi-Cheng Lin, Haibin Wu, Huang-Cheng Chou, Chi-Chun Lee, Hung-yi Lee.
- Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment. Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff.
- TM-PATHVQA: 90000+ Textless Multilingual Questions for Medical Visual Question Answering. Tonmoy Rajkhowa, Amartya Roy Chowdhury, Sankalp Nagaonkar, Achyut Mani Tripathi, S. R. Mahadeva Prasanna.
- Sparse Binarization for Fast Keyword Spotting. Jonathan Svirsky, Uri Shaham 0001, Ofir Lindenbaum.
- Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens. Haici Yang, Jiaqi Su, Minje Kim, Zeyu Jin.
- Interface Design for Self-Supervised Speech Models. Yi-Jen Shih, David Harwath.
- DeFTAN-AA: Array Geometry Agnostic Multichannel Speech Enhancement. Dongheon Lee, Jung-Woo Choi.
- Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation. Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu 0001, Eloi du Bois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoë Abrams, Morgan McGuire.
- ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models. Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Alex Gichamba, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe 0001.
- Study Selectively: An Adaptive Knowledge Distillation based on a Voting Network for Heart Sound Classification. Xihang Qiu, Lixian Zhu, Zikai Song, Zeyu Chen, Haojie Zhang, Kun Qian, Ye Zhang, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller.
- Interpretable Temporal Class Activation Representation for Audio Spoofing Detection. Menglu Li, Xiao-Ping Zhang 0002.
- CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection. Yongyi Zang, Jiatong Shi, You Zhang 0001, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan.
- Learn and Don't Forget: Adding a New Language to ASR Foundation Models. Mengjie Qian, Siyuan Tang, Rao Ma, Kate M. Knill, Mark J. F. Gales.
- Towards Explainable Monaural Speaker Separation with Auditory-based Training. Hassan Taherian, Vahid Ahmadi Kalkhorani, Ashutosh Pandey 0004, Daniel Wong, Buye Xu, DeLiang Wang.
- SALSA: Speedy ASR-LLM Synchronous Aggregation. Ashish R. Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi.
- Sample-Efficient Diffusion for Text-To-Speech Synthesis. Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu.
- Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds. Hanbin Bae, Pavel Andreev, Azat Saginbaev, Nicholas Babaev, Won Jun Lee, Hosang Sung, Hoon-Young Cho.
- Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content. Rémi Uro, Marie Tahon, David Doukhan, Antoine Laurent, Albert Rilliard.
- Self-Train Before You Transcribe. Robert Flynn, Anton Ragni.
- DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing. Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah.
- Signal processing algorithm effective for sound quality of hearing loss simulators. Toshio Irino, Shintaro Doan, Minami Ishikawa.
- How rhythm metrics are linked to produced and perceived speaker charisma. Oliver Niebuhr, Nafiseh Taghva.
- Unsupervised Online Continual Learning for Automatic Speech Recognition. Steven Vander Eeckt, Hugo Van Hamme.
- Investigation of Layer-Wise Speech Representations in Self-Supervised Learning Models: A Cross-Lingual Study in Detecting Depression. Bubai Maji, Rajlakshmi Guha, Aurobinda Routray, Shazia Nasreen, Debabrata Majumdar.
- Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment. Maurice Gerczuk, Shahin Amiriparian, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Björn W. Schuller.
- Low Complexity Echo Delay Estimator Based on Binarized Feature Matching. Yi Gao, Xiang Su.
- Novel-view Acoustic Synthesis From 3D Reconstructed Rooms. Byeongjoo Ahn, Karren D. Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang.
- When Whisper Listens to Aphasia: Advancing Robust Post-Stroke Speech Recognition. Giulia Sanguedolce, Sophie Brook, Dragos C. Gruia, Patrick A. Naylor, Fatemeh Geranmayeh.
- A toolkit for joint speaker diarization and identification with application to speaker-attributed ASR. Giovanni Morrone, Enrico Zovato, Fabio Brugnara, Enrico Sartori, Leonardo Badino.
- Age-related Differences in Acoustic Cues for the Perception of Checked Syllables in Shengzhou Wu. Bingliang Zhao, Jiangping Kong, Xiyu Wu.
- Harder or Different? Understanding Generalization of Audio Deepfake Detection. Nicolas M. Müller, Nicholas W. D. Evans, Hemlata Tak, Philip Sperl, Konstantin Böttinger.
- Source Tracing of Audio Deepfake Systems. Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury 0001.
- Cross-Attention-Guided WaveNet for EEG-to-MEL Spectrogram Reconstruction. Hao Li, Yuan Fang, Xueliang Zhang, Fei Chen 0011, Guanglai Gao.
- Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech. Tobias Weise, Philipp Klumpp, Kubilay Can Demir, Paula Andrea Pérez-Toro, Maria Schuster, Elmar Nöth, Björn Heismann, Andreas K. Maier, Seung-Hee Yang.
- Dysarthric Speech Recognition Using Curriculum Learning and Articulatory Feature Embedding. I-Ting Hsieh, Chung-Hsien Wu 0001.
- Automatic Longitudinal Investigation of Multiple Sclerosis Subjects. Gábor Gosztolya, Veronika Svindt, Judit Bóna, Ildikó Hoffmann.
- Information-theoretic hypothesis generation of relative cue weighting for the voicing contrast. Annika Heuser, Jianjing Kuang.
- From Sound to Meaning in the Auditory Cortex: A Neuronal Representation and Classification Analysis. Kumar Neelabh, Vishnu Sreekumar.
- TD-PLC: A Semantic-Aware Speech Encoding for Improved Packet Loss Concealment. Jinghong Zhang, Zugang Zhao, Yonghui Liu, Jianbing Liu, Zhiqiang He 0001, Kai Niu 0001.
- AdaRA: Adaptive Rank Allocation of Residual Adapters for Speech Foundation Model. Zhouyuan Huo, Dongseong Hwang, Gan Song, Khe Chai Sim, Weiran Wang.
- Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework. Hokuto Munakata, Ryo Terashima, Yusuke Fujita.
- Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis. Jialu Li 0002, Mark Hasegawa-Johnson, Karrie Karahalios.
- Highly Intelligible Speaker-Independent Articulatory Synthesis. Charles McGhee, Kate M. Knill, Mark J. F. Gales.
- A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer. Himanshu Maurya, Atli Sigurgeirsson.
- On the Encoding of Gender in Transformer-based ASR Representations. Aravind Krishnan, Badr M. Abdullah, Dietrich Klakow.
- LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems. Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho Ittan George, Kaushal Santosh Bhogale, Deovrat Mehendale, Mitesh M. Khapra.
- Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation. Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix.
- What happens in continued pre-training? Analysis of self-supervised speech models with continued pre-training for colloquial Finnish ASR. Yaroslav Getman, Tamás Grósz, Mikko Kurimo.
- DGPN: A Dual Graph Prototypical Network for Few-Shot Speech Spoofing Algorithm Recognition. Zirui Ge, Xinzhou Xu, Haiyan Guo, Tingting Wang, Zhen Yang, Björn W. Schuller.
- An End-to-End Approach for Chord-Conditioned Song Generation. Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu.
- SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR. Shuaishuai Ye, Shunfei Chen, Xinhui Hu, Xinkang Xu.
- Evaluating Italian Vowel Variation with the Recurrent Neural Network Phonet. Austin Jones, Margaret E. L. Renwick.
- RASU: Retrieval Augmented Speech Understanding through Generative Modeling. Hao Yang, Min Zhang, Minghan Wang, Jiaxin Guo.
- Toward Fully-End-to-End Listened Speech Decoding from EEG Signals. Jihwan Lee, Aditya Kommineni, TianTian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Reddy Kadiri, Shrikanth Narayanan.
- Rapport-Driven Virtual Agent: Rapport Building Dialogue Strategy for Improving User Experience at First Meeting. Muhammad Yeza Baihaqi, Angel F. Garcia Contreras, Seiya Kawano, Koichiro Yoshino.
- DualPure: An Efficient Adversarial Purification Method for Speech Command Recognition. Hao Tan, Xiaochen Liu, Huan Zhang, Junjian Zhang, Yaguan Qian, Zhaoquan Gu.
- Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection. Nana Lin, Youxiang Zhu, Xiaohui Liang, John A. Batsis, Caroline Summerour.
- Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment. Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari.
- Effects of talker and playback rate of reverberation-induced speech on speech intelligibility of older adults. Nao Hodoshima.
- Towards Responsible Speech Processing. Isabel Trancoso.
- M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation. Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto.
- Preservation, conservation and phonetic study of the voices of Italian poets: A study on the seven years of the VIP archive. Federico Lo Iacono, Valentina Colonna, Antonio Romano.
- Automated content assessment and feedback for Finnish L2 learners in a picture description speaking task. Nhan Phan, Anna von Zansen, Maria Kautonen, Ekaterina Voskoboinik, Tamás Grósz, Raili Hildén, Mikko Kurimo.
- The Interspeech 2024 TAUKADIAL Challenge: Multilingual Mild Cognitive Impairment Detection with Multimodal Approach. Benjamin Barrera-Altuna, Daeun Lee, Zaima Zarnaz, Jinyoung Han, Seungbae Kim.
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System. Lingwei Meng, Jiawen Kang 0002, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng.
- Balanced-Wav2Vec: Enhancing Stability and Robustness of Representation Learning Through Sample Reweighting Techniques. Mun-Hak Lee, Jae Hong Lee, Do-Hee Kim, Ye-Eun Ko, Joon-Hyuk Chang.
- The Use of Modifiers and f0 in Remote Referential Communication with Human and Computer Partners. Iona Gessinger, Bistra Andreeva, Benjamin R. Cowan.
- Textual-Driven Adversarial Purification for Speaker VerificationSizhou Chen, Yibo Bai, Jiadi Yao, Xiao-lei Zhang, Xuelong Li. [doi]
- Do Speaker-dependent Vowel Characteristics depend on Speech Style?Nicolas Audibert, Cécile Fougeron, Christine Meunier. [doi]
- Exploring Impact of Pausing and Lexical Stress Patterns on L2 English Comprehensibility in Real TimeSylvain Coulange, Tsuneo Kato, Solange Rossato, Monica Masperi. [doi]
- Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited AnnotationsBulat Khaertdinov, Pedro Jeuris, Annanda Sousa, Enrique Hortal. [doi]
- Well, what can you do with messy data? Exploring the prosody and pragmatic function of the discourse marker "well" with found data and speech synthesisJohannah O'Mahony, Catherine Lai, Éva Székely. [doi]
- Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot StudyPeikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie. [doi]
- Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEIAnaïs Rameau, Satrajit Ghosh, Alexandros Sigaras, Olivier Elemento, Jean-Christophe Bélisle-Pipon, Vardit Ravitsky, Maria Powell, Alistair Johnson, David Dorr, Philip R. O. Payne, Micah Boyer, Stephanie Watts, Ruth Bahr, Frank Rudzicz, Jordan Lerner-Ellis, Shaheen Awan, Don Bolser, Yael Bensoussan. [doi]
- Speaker Personalization for Automatic Speech Recognition using Weight-Decomposed Low-Rank AdaptationGeorge Joseph, Arun Baby. [doi]
- Unsupervised Improved MVDR Beamforming for Sound EnhancementJacob Kealey, John R. Hershey, François Grondin. [doi]
- Dynamic Gated Recurrent Neural Network for Compute-efficient Speech EnhancementLongbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Shih-Chii Liu. [doi]
- Learning from Multiple Annotator Biased Labels in Multimodal ConversationKazutoshi Shinoda, Nobukatsu Hojo, Saki Mizuno, Keita Suzuki, Satoshi Kobashikawa, Ryo Masumura. [doi]
- LungAdapter: Efficient Adapting Audio Spectrogram Transformer for Lung Sound ClassificationLi Xiao 0007, Lucheng Fang, Yuhong Yang 0001, Weiping Tu. [doi]
- Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech TranslationPeidong Wang, Jian Xue, Jinyu Li 0001, Junkun Chen, Aswin Shanmugam Subramanian. [doi]
- Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and InteractionYuan Gao, Hao Shi, Chenhui Chu, Tatsuya Kawahara. [doi]
- Enhancing Multilingual Voice Toxicity Detection with Speech-Text AlignmentJoseph Liu 0001, Mahesh Kumar Nandwana, Janne Pylkkönen, Hannes Heikinheimo, Morgan McGuire. [doi]
- Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target DatasetHideyuki Oiso, Yuto Matsunaga, Kazuya Kakizaki, Taiki Miyagawa. [doi]
- Audio Mamba: Selective State Spaces for Self-Supervised Audio RepresentationsSarthak Yadav, Zheng-Hua Tan. [doi]
- Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio PretrainingJinlong Xue, Yayue Deng, Yingming Gao, Ya Li. [doi]
- Key-Element-Informed sLLM Tuning for Document SummarizationSangwon Ryu, Heejin Do, Yunsu Kim 0001, Gary Geunbae Lee, Jungseul Ok. [doi]
- Adapter pre-training for improved speech recognition in unseen domains using low resource adapter tuning of self-supervised modelsSathvik Udupa, Jesuraj Bandekar, Saurabh Kumar, Deekshitha G, Sandhya Badiger, Abhayjeet Singh, Savitha Murthy, Priyanka Pai, Srinivasa Raghavan K. M., Raoul Nanavati, Prasanta Kumar Ghosh. [doi]
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language ModelsBolaji Yusuf, Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran. [doi]
- Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional SpeechMartina Valente, Fabio Brugnara, Giovanni Morrone, Enrico Zovato, Leonardo Badino. [doi]
- Expressive paragraph text-to-speech synthesis with multi-step variational autoencoderXuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang. [doi]
- Audio Editing with Non-Rigid Text PromptsFrancesco Paissan, Luca Della Libera, Zhepei Wang, Paris Smaragdis, Mirco Ravanelli, Cem Subakan. [doi]
- Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio EncodingJizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang. [doi]
- Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech RecognitionJingjing Xu 0002, Wei Zhou 0043, Zijian Yang, Eugen Beck, Ralf Schlüter. [doi]
- MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion RecognitionZiping Zhao 0001, Tian Gao, Haishuai Wang, Björn W. Schuller. [doi]
- Probing the Feasibility of Multilingual Speaker AnonymizationSarina Meyer, Florian Lux, Ngoc Thang Vu. [doi]
- Towards Self-Attention Understanding for Automatic Articulatory Processes Analysis in Cleft Lip and Palate SpeechIlja Baumann, Dominik Wagner 0002, Maria Schuster, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet. [doi]
- Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech EnhancementWangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe 0001, Yanmin Qian. [doi]
- Perceptual Learning in Lexical Tone: Phonetic Similarity vs. Phonological CategoriesAriëlle Reitsema, Chenxin Li, Leanne van Lambalgen, Laura Preining, Saskia Galindo Jong, Qing Yang, Xinyi Wen, Yiya Chen. [doi]
- Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive RepresentationRiyansha Singh, Parinita Nema, Vinod K. Kurmi. [doi]
- Cross-Modality Diffusion Modeling and Sampling for Speech RecognitionChia-Kai Yeh, Chih-Chun Chen, Ching-Hsien Hsu, Jen-Tzung Chien. [doi]
- Quantifying Unintended Memorization in BEST-RQ ASR EncodersVirat Shejwalkar, Om Thakkar 0001, Arun Narayanan. [doi]
- Analysis of articulatory setting for L1 and L2 English speakers using MRI dataKevin Huang, Jack Goldberg, Louis Goldstein, Shrikanth Narayanan. [doi]
- Speech and Language Recognition with Low-rank Adaptation of Pretrained ModelsAmrutha Prasad, Srikanth R. Madikeri, Driss Khalil, Petr Motlícek, Christof Schüpbach. [doi]
- Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End ModelsMatthew Perez, Aneesha Sampath, Minxue Niu, Emily Mower Provost. [doi]
- Speech Recognition for Greek Dialects: A Challenging BenchmarkSocrates Vakirtzian, Chara Tsoukala, Stavros Bompolas, Katerina Mouzou, Vivian Stamou, Georgios Paraskevopoulos, Antonios Dimakis, Stella Markantonatou, Angela Ralli, Antonios Anastasopoulos. [doi]
- Measuring acoustic dissimilarity of hierarchical markers in task-oriented dialogue with MFCC-based dynamic time warpingNatalia Morozova, Guanghao You, Sabine Stoll, Adrian Bangerter. [doi]
- Unified Audio Visual Cues for Target Speaker ExtractionTianci Wu, Shulin He, Jiahui Pan, Haifeng Huang, Zhijian Mo, Xueliang Zhang. [doi]
- Neural Compression Augmentation for Contrastive Audio Representation LearningZhaoyu Wang, Haohe Liu, Harry Coppock, Björn W. Schuller, Mark D. Plumbley. [doi]
- Speech-MASSIVE: A Multilingual Speech Dataset for SLU and BeyondBeomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier. [doi]
- A multimodal analysis of different types of laughter expression in conversational dialoguesKexin Wang, Carlos Ishi, Ryoko Hayashi. [doi]
- Boosting Cross-Corpus Speech Emotion Recognition using CycleGAN with Contrastive LearningJincen Wang, Yan Zhao, Cheng Lu 0005, Chuangao Tang, Sunan Li, Yuan Zong, Wenming Zheng. [doi]
- DGSRN: Noise-Robust Speech Recognition Method with Dual-Path Gated Spectral Refinement NetworkWenjun Wang, Shangbin Mo, Ling Dong, Zhengtao Yu 0001, Junjun Guo, Yuxin Huang. [doi]
- CaptainA self-study mobile app for practising speaking: task completion assessment and feedback with generative AINhan Phan, Anna von Zansen, Maria Kautonen, Tamás Grósz, Mikko Kurimo. [doi]
- Stream-based Active Learning for Anomalous Sound Detection in Machine Condition MonitoringTuan Vu Ho, Kota Dohi, Yohei Kawaguchi. [doi]
- Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-DistillationDawei Liang, Alice Zhang, David Harwath, Edison Thomaz. [doi]
- Transfer Learning from Whisper for Microscopic Intelligibility PredictionPaul Best, Santiago Cuervo, Ricard Marxer. [doi]
- MSDET: Multitask Speaker Separation and Direction-of-Arrival Estimation TrainingRoland Hartanto, Sakriani Sakti, Koichi Shinoda. [doi]
- Efficient CNNs with Quaternion Transformations and Pruning for Audio TaggingAryan Chaudhary, Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley. [doi]
- What if HAL breathed? Enhancing Empathy in Human-AI Interactions with Breathing Speech SynthesisNicolò Loddo, Francisca Pessanha, Almila Akdag Salah. [doi]
- LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete CodesTrung Dang 0002, David Aponte, Dung N. Tran, Kazuhito Koishida. [doi]
- Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image RetrievalLiFeng Zhou, Yuke Li, Rui Deng, Yuting Yang, Haoqi Zhu. [doi]
- Enhancing Partially Spoofed Audio Localization with Boundary-aware Attention MechanismJiafeng Zhong, Bin Li, Jiangyan Yi. [doi]
- Unmasking Neural Codecs: Forensic Identification of AI-compressed SpeechDenise Moussa, Sandra Bergmann, Christian Riess. [doi]
- PFCA-Net: Pyramid Feature Fusion and Cross Content Attention Network for Automated Audio CaptioningJianyuan Sun, Wenwu Wang 0001, Mark D. Plumbley. [doi]
- Phonetic Enhanced Language Modeling for Text-to-Speech SynthesisKun Zhou 0003, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Trung Hieu Nguyen 0001, Jia Qi Yip, Bin Ma 0001. [doi]
- Knowledge Distillation for Tiny Speech Enhancement with Latent Feature AugmentationBehnam Gholami, Mostafa El-Khamy, Kee-Bong Song. [doi]
- Automatic Children Speech Sound Disorder Detection with Age and Speaker Bias MitigationGahye Kim, Yunjung Eom, Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So. [doi]
- OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-BranchformerYifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel 0001, KwangHee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe 0001. [doi]
- Neural Network Augmented Kalman Filter for Robust Acoustic Howling SuppressionYixuan Zhang, Hao Zhang, Meng Yu, Dong Yu. [doi]
- Measurement and simulation of pressure losses due to airflow in vocal tract modelsPeter Birkholz, Patrick Häsner. [doi]
- Conformer without ConvolutionsMatthijs Van Keirsbilck, Alexander Keller 0001. [doi]
- GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech SynthesisZehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan. [doi]
- Whister: Using Whisper's representations for Stuttering detectionVrushank Changawala, Frank Rudzicz. [doi]
- Deep Prosodic Features in Tandem with Perceptual Judgments of Word Reduction for Tone Recognition in Conversed SpeechXiang-Li Lu, Yi-Fen Liu. [doi]
- MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech SynthesisQian Yang, Jialong Zuo, Zhe Su, Ziyue Jiang 0001, Mingze Li, Zhou Zhao 0001, Feiyang Chen, Zhefeng Wang 0001, Baoxing Huai. [doi]
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and DatasetsJiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe 0001. [doi]
- Iterative Prototype Refinement for Ambiguous Speech Emotion RecognitionHaoqin Sun, Shiwan Zhao, Xiangyu Kong, Xuechen Wang, Hui Wang, Jiaming Zhou, Yong Qin. [doi]
- NeuRO: an application for code-switched autism detection in childrenMohd Mujtaba Akhtar, Girish, Orchid Chetia Phukan, Muskaan Singh. [doi]
- VoiceDefense: Protecting Automatic Speaker Verification Models Against Black-box Adversarial AttacksYip Keng Kan, Ke Xu, Hao Li, Jie Shi. [doi]
- Uh, um and mh: Are filled pauses prone to conversational convergence?Mathilde Hutin, Junfei Hu, Liesbeth Degand. [doi]
- UNIQUE: Unsupervised Network for Integrated Speech Quality EvaluationJuhwan Yoon, WooSeok Ko, Seyun Um, Sungwoong Hwang, Soojoong Hwang, ChangHwan Kim, Hong-Goo Kang. [doi]
- Serialized Output Training by Learned DominanceYing Shi 0001, Lantian Li, Shi Yin, Dong Wang, Jiqing Han 0001. [doi]
- The sub-band cepstrum as a tool for locating local spectral regions of phonetic sensitivity: A first attempt with multi-speaker vowel dataMichael Lambropoulos, Frantz Clermont, Shunichi Ishihara. [doi]
- Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation NetworkYehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet. [doi]
- Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech DetectionDuc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong-Aik Lee, Eng Siong Chng. [doi]
- Investigating Decoder-only Large Language Models for Speech-to-text TranslationChao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri. [doi]
- LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASRZheshu Song, Jianheng Zhuo, Yifan Yang 0005, Ziyang Ma, Shixiong Zhang 0001, Xie Chen 0001. [doi]
- Neuromorphic Keyword Spotting with Pulse Density Modulation MEMS MicrophonesSidi Yaya Arnaud Yarga, Sean U. N. Wood. [doi]
- ConnecTone: a modular AAC system prototype with contextual generative text prediction and style-adaptive conversational TTSJuliana Francis, Éva Székely, Joakim Gustafson. [doi]
- A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech modelsAntón de la Fuente, Dan Jurafsky. [doi]
- Assessing the impact of contextual framing on subjective TTS qualityJens Edlund, Christina Tånnander, Sébastien Le Maguer, Petra Wagner. [doi]
- Intrusive schwa within French stop-liquid clusters: An acoustic analysisMinmin Yang, Rachid Ridouane. [doi]
- Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing PipelineAli N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ülgen, Carlos Busso, Berrak Sisman. [doi]
- WHiSER: White House Tapes Speech Emotion Recognition CorpusAbinay Reddy Naini, Lucas Goncalves, Mary A. Kohler, Donita Robinson, Elizabeth Richerson, Carlos Busso. [doi]
- Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASRShaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang 0006. [doi]
- Connected Speech-Based Cognitive Assessment in Chinese and EnglishSaturnino Luz, Sofia de la Fuente Garcia, Fasih Haider, Davida Fromm, Brian MacWhinney, Alyssa Lanzi, Ya-Ning Chang, Chia-Ju Chou, Yi-Chien Liu. [doi]
- Enhanced Deep Speech Separation in Clustered Ad Hoc Distributed Microphone EnvironmentsJihyun Kim, Stijn Kindt, Nilesh Madhu, Hong-Goo Kang. [doi]
- MinSpeech: A Corpus of Southern Min Dialect for Automatic Speech RecognitionJiayan Lin, Shenghui Lu, Hukai Huang, Wenhao Guan, Binbin Xu, Hui Bu, Qingyang Hong, Lin Li. [doi]
- ConvoCache: Smart Re-Use of Chatbot ResponsesConor Atkins, Ian D. Wood, Mohamed Ali Kâafar, Hassan Asghar 0001, Nardine Basta, Michal Kepkowski. [doi]
- Utilization of Text Data for Response Timing Detection in Attentive ListeningYu Watanabe, Koichiro Ito, Shigeki Matsubara. [doi]
- GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative ModelYingying Gao, Shilei Zhang, Chao Deng, Junlan Feng. [doi]
- Specializing Self-Supervised Speech Representations for Speaker SegmentationSéverin Baroudi, Thomas Pellegrini, Hervé Bredin. [doi]
- How Should We Extract Discrete Audio Tokens from Self-Supervised Models?Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli. [doi]
- Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised ModelsJing Xu, Minglin Wu, Xixin Wu, Helen Meng. [doi]
- Combining Acoustic Feature Sets for Detecting Mild Cognitive Impairment in the Interspeech'24 TAUKADIAL ChallengeGábor Gosztolya, László Tóth 0001. [doi]
- Learning Pronunciation from Other Accents via Pronunciation Knowledge TransferSiqi Sun, Korin Richmond. [doi]
- Towards interfacing large language models with ASR systems using confidence measures and promptingMaryam Naderi, Enno Hermann, Alexandre Nanchen, Sevada Hovsepyan, Mathew Magimai-Doss. [doi]
- Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing LossMuhammad Shakeel 0001, Yui Sudo, Yifan Peng, Shinji Watanabe 0001. [doi]
- SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice ConversionBingsong Bai, Fengping Wang, Yingming Gao, Ya Li. [doi]
- Preprocessing for acoustic-to-articulatory inversion using real-time MRI movies of Japanese speechAnna Oura, Hideaki Kikuchi, Tetsunori Kobayashi. [doi]
- Magnitude and timing of acceleration peaks in stressed and unstressed syllablesMalin Svensson Lundmark. [doi]
- Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker EmbeddingRui Wang, Liping Chen, Kong-Aik Lee, Zhen-Hua Ling. [doi]
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text PerturbationRuizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey. [doi]
- The Use of Phone Categories and Cross-Language Modeling for Phone Alignment of PanãraEmily P. Ahn, Eleanor Chodroff, Myriam Lapierre, Gina-Anne Levow. [doi]
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data AugmentationDena F. Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Caryn Herring, Jia-bin. [doi]
- An Exploration of Length Generalization in Transformer-Based Speech EnhancementQiquan Zhang, Hongxu Zhu, Xinyuan Qian 0001, Eliathamby Ambikairajah, Haizhou Li 0001. [doi]
- Leveraging large language models for post-transcription correction in contact centersBramhendra Koilakuntla, Prajesh Rana, Paras Ahuja, Srikanth Konjeti, Jithendra Vepa. [doi]
- On the social bias of speech self-supervised modelsYi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee. [doi]
- Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunctionDaryush D. Mehta, Jarrad H. Van Stan, Hamzeh Ghasemzadeh, Robert E. Hillman. [doi]
- Can Large Language Models Understand Spatial Audio?Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan 0019, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang. [doi]
- Target conversation extraction: Source separation using turn-taking dynamicsTuochao Chen, Qirui Wang, Bohan Wu, Malek Itani, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota. [doi]
- Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based AdaptationShiyao Wang, Shiwan Zhao, Jiaming Zhou, Aobo Kong, Yong Qin. [doi]
- Less is More: Accurate Speech Recognition & Translation without Web-Scale DataKrishna C. Puvvada, Piotr Zelasko, He Huang 0012, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg. [doi]
- A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker VerificationXujiang Xing, Mingxing Xu, Thomas Fang Zheng. [doi]
- SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion ModelsDongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng. [doi]
- Speak in the Scene: Diffusion-based Acoustic Scene Transfer toward Immersive Speech GenerationMiseul Kim, Soo-Whan Chung, Youna Ji, Hong-Goo Kang, Min-Seok Choi. [doi]
- QMixCAT: Unsupervised Speech Enhancement Using Quality-guided Signal Mixing and Competitive Alternating Model TrainingShilin Wang, Haixin Guan, Yanhua Long. [doi]
- Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading LearningLucas Block Medin, Thomas Pellegrini, Lucile Gelin. [doi]
- Who Finds This Voice Attractive? A Large-Scale Experiment Using In-the-Wild DataHitoshi Suda, Aya Watanabe, Shinnosuke Takamichi. [doi]
- Learning from memory-based modelsRhiannon Mogridge, Anton Ragni. [doi]
- Nasal Air Flow During Speech Production In KorebajuJenifer Vega Rodríguez, Nathalie Vallée, Christophe Savariaux, Silvain Gerber. [doi]
- Exploring In-Context Learning of Textless Speech Language Model for Speech Classification TasksKai-Wei Chang, Ming-Hao Hsu, Shan-Wen Li 0001, Hung-yi Lee. [doi]
- It's Time to Take Action: Acoustic Modeling of Motor Verbs to Detect Parkinson's DiseaseDaniel Escobar-Grisales, Cristian David Ríos-Urrego, Ilja Baumann, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet, Adolfo M. García, Juan Rafael Orozco-Arroyave. [doi]
- Contrastive Feedback Mechanism for Simultaneous Speech TranslationHaotian Tan, Sakriani Sakti. [doi]
- Speaker Change Detection with Weighted-sum Knowledge Distillation based on Self-supervised Pre-trained ModelsHang Su, Yuxiang Kong, Lichun Fan, Peng Gao 0013, Yujun Wang, Zhiyong Wu 0001. [doi]
- Investigating ASR Error Correction with Large Language Model and Multilingual 1-best HypothesesSheng Li, Chen Chen, Kwok Chin Yuen, Chenhui Chu, Eng Siong Chng, Hisashi Kawai. [doi]
- Hierarchical Distribution Adaptation for Unsupervised Cross-corpus Speech Emotion RecognitionCheng Lu 0005, Yuan Zong, Yan Zhao, Hailun Lian, Tianhua Qi, Björn W. Schuller, Wenming Zheng. [doi]
- Detecting Empathy in SpeechRun Chen, Haozhe Chen, Anushka Kulkarni, Eleanor Lin, Linda Pang, Divya Tadimeti, Jun Shin, Julia Hirschberg. [doi]
- Automatic Assessment of Speech Production Skills for Children with Cochlear Implants Using Wav2Vec2.0 Acoustic EmbeddingsSeonwoo Lee, SunHee Kim, Minhwa Chung. [doi]
- ASGIR: audio spectrogram transformer guided classification and information retrieval for birdsYashwardhan Chaudhuri, Paridhi Mundra, Arnesh Batra, Orchid Chetia Phukan, Arun Balaji Buduru. [doi]
- Deep Echo Path Modeling for Acoustic Echo CancellationFei Zhao, Chenggang Zhang, Shulin He, Jinjiang Liu, Xueliang Zhang. [doi]
- Speech enabled visual acuity testBoon Peng Yap, Kok Liang Tan, Zhenghao Li, Rong Tong. [doi]
- Familiar and Unfamiliar Speaker Identification in Speech and SingingKatelyn Taylor, Amelia Jane Gully, Helena Daffern. [doi]
- The MARRYS helmet: A new device for researching and training "jaw dancing"Vidar Freyr Gudmundsson, Keve Márton Gönczi, Malin Svensson Lundmark, Donna Erickson, Oliver Niebuhr. [doi]
- Text-only Domain Adaptation for CTC-based Speech Recognition through Substitution of Implicit Linguistic Information in the Search SpaceTatsunari Takagi, Yukoh Wakabayashi, Atsunori Ogawa, Norihide Kitaoka. [doi]
- Prosodic marking of syntactic boundaries in KhoekhoeKira Tulchynska, Sylvanus Job, Alena Witzlack-Makarevich, Margaret Zellers. [doi]
- Vision Transformer Segmentation for Visual Bird Sound DenoisingSahil Kumar, Jialu Li 0004, Youshan Zhang. [doi]
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language ModelsWeiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou 0002, Zhisheng Wang, Zhiyong Wu 0001, Xixin Wu, Helen Meng. [doi]
- Context-Aware Speech Recognition Using Prompts for Language LearnersJian Cheng. [doi]
- LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer GuidanceShihao Chen, Yu Gu, Jie Zhang 0042, Na Li, Rilin Chen, Liping Chen, Lirong Dai 0001. [doi]
- CrisperWhisper: Accurate Timestamps on Verbatim Speech TranscriptionsMario Zusag, Laurin Wagner, Bernhard Thallinger. [doi]
- Lightweight Dynamic Sparse Transformer for Monaural Speech EnhancementZehua Zhang, Xuyi Zhuang, Yukun Qian, Mingjiang Wang. [doi]
- Production of fricative consonants in French-speaking children with cochlear implants and typical hearing: acoustic and phonological analysesSophie Fagniart, Brigitte Charlier, Véronique Delvaux, Bernard Harmegnies, Anne Huberlant, Myriam Piccaluga, Kathy Huet. [doi]
- Speech Prefix-Tuning with RNNT Loss for Improving LLM PredictionsMurali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Neeraj Gaur, Zhong Meng. [doi]
- Audio Fingerprinting with Holographic Reduced RepresentationsYusuke Fujita, Tatsuya Komatsu. [doi]
- Textless Dependency Parsing by Labeled Sequence PredictionShunsuke Kando, Yusuke Miyao, Jason Naradowsky, Shinnosuke Takamichi. [doi]
- Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment QualityTina Raissi, Christoph Lüscher, Simon Berger, Ralf Schlüter, Hermann Ney. [doi]
- FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel FilterYuanjun Lv, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie. [doi]
- Getting More for Less: Using Weak Labels and AV-Mixup for Robust Audio-Visual Speaker VerificationAnith Selvakumar, Homa Fashandi. [doi]
- Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech RecognitionYicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu 0008, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian 0001. [doi]
- Enhancing Speech and Music Discrimination Through the Integration of Static and Dynamic FeaturesLiangwei Chen, Xiren Zhou, Qiang Tu, Huanhuan Chen. [doi]
- Towards a better understanding of receptive multilingualism: listening conditions and priming effectsWei Xue, Ivan Yuen, Bernd Möbius. [doi]
- BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound ClassificationJune-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung. [doi]
- Streamlining Speech Enhancement DNNs: an Automated Pruning Method Based on Dependency Graph with Advanced Regularized Loss StrategiesZugang Zhao, Jinghong Zhang, Yonghui Liu, Jianbing Liu, Kai Niu 0001, Zhiqiang He 0001. [doi]
- On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation ModelsJinchuan Tian, Yifan Peng, William Chen, KwangHee Choi, Karen Livescu, Shinji Watanabe 0001. [doi]
- Automatic Evaluation of a Sentence Memory Test for Preschool ChildrenIlja Baumann, Nicole Unger, Dominik Wagner 0002, Korbinian Riedhammer, Tobias Bocklet. [doi]
- Reading Miscue Detection in Primary School through Automatic Speech RecognitionLingyun Gao, Cristian Tejedor García, Helmer Strik, Catia Cucchiarini. [doi]
- Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systemsKwok Chin Yuen, Jia Qi Yip, Eng Siong Chng. [doi]
- Behavioral evidence for higher speech rate convergence following natural than artificial time altered speechJérémy Giroud, Jessica Lei, Kirsty Phillips, Matthew H. Davis. [doi]
- Real-Time Gaze-directed speech enhancement for audio-visual hearing-aidsArif Reza Anway, Bryony Buck, Mandar Gogate, Kia Dashtipour, Michael Akeroyd, Amir Hussain 0001. [doi]
- Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and TextJinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang. [doi]
- Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech RecognitionAndrés Piñeiro Martín, Carmen García-Mateo, Laura Docío Fernández, Maria del Carmen Lopez-Perez, Georg Rehm. [doi]
- Reference-Free Estimation of the Quality of Clinical Notes Generated from Doctor-Patient ConversationsMojtaba Kadkhodaie Elyaderani, John Glover, Thomas Schaaf. [doi]
- Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention MaskTianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu. [doi]
- The PESQetarian: On the Relevance of Goodhart's Law for Speech EnhancementDanilo de Oliveira, Simon Welker, Julius Richter, Timo Gerkmann. [doi]
- Investigation of look-ahead techniques to improve response time in spoken dialogue systemMasaya Ohagi, Tomoya Mizumoto, Katsumasa Yoshikawa. [doi]
- Translingual Language Markers for Cognitive Assessment from Spontaneous SpeechBao Hoang, Yijiang Pang, Hiroko H. Dodge, Jiayu Zhou. [doi]
- Empowering Low-Resource Language ASR via Large-Scale Pseudo LabelingKaushal Santosh Bhogale, Deovrat Mehendale, Niharika Parasa, Sathish Kumar Reddy G, Tahir Javed, Pratyush Kumar, Mitesh M. Khapra. [doi]
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech ReconstructionXueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng. [doi]
- Hybrid-Diarization System with Overlap Post-Processing for the DISPLACE 2024 ChallengeGabriel Pirlogeanu, Octavian Pascu, Alexandru-Lucian Georgescu, Horia Cucu. [doi]
- How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and MachinesAilin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung. [doi]
- Lightweight Zero-shot Text-to-Speech with Mixture of AdaptersKenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima. [doi]
- Speaking of Health: Leveraging Large Language Models to assess Exercise Motivation and Behavior of Rehabilitation PatientsSuhas BN, Amanda Rebar, Saeed Abdullah. [doi]
- Towards Speech Classification from Acoustic and Vocal Tract data in Real-time MRIYaoyao Yue, Michael Proctor, Luping Zhou, Rijul Gupta, Tharinda Piyadasa, Amelia Gully, Kirrie Ballard, Craig T. Jin. [doi]
- On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker DiarizationYiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia, Quan Wang. [doi]
- Production of phrases by mechanical models of the human vocal tractTakayuki Arai, Ryohei Suzuki, Chandler Earp, Shinya Tsuji, Keiko Ochi. [doi]
- Towards Realistic Emotional Voice Conversion using Controllable Emotional IntensityTianhua Qi, Shiyan Wang, Cheng Lu, Yan Zhao, Yuan Zong, Wenming Zheng. [doi]
- How Much Context Does My Attention-Based ASR System Need?Robert Flynn, Anton Ragni. [doi]
- Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech RecognitionJinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei. [doi]
- Knowledge boosting during low-latency inferenceVidya Srinivas, Malek Itani, Tuochao Chen, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota. [doi]
- A demonstrator for articulation-based command word recognitionJoão Vítor Possamai de Menezes, Arne-Lukas Fietkau, Tom Diener, Steffen Kürbis, Peter Birkholz. [doi]
- Neural Blind Source Separation and Diarization for Distant Speech RecognitionYoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe 0001. [doi]
- Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesisTakuma Okamoto, Yamato Ohtani, Hisashi Kawai. [doi]
- Towards Scalable Remote Assessment of Mild Cognitive Impairment Via Multimodal DialogOliver Roesler, Jackson Liscombe, Michael Neumann, Hardik Kothare, Abhishek Hosamath, Lakshmi Arbatti, Doug Habberstad, Christiane Suendermann-Oeft, Meredith Bartlett, Cathy Zhang, Nikhil Sukhdev, Kolja Wilms, Anusha Badathala, Sandrine Istas, Steve Ruhmel, Bryan Hansen, Madeline Hannan, David Henley, Arthur Wallace, Ira Shoulson, David Suendermann-Oeft, Vikram Ramanarayanan. [doi]
- Prompting Large Language Models with Audio for General-Purpose Speech SummarizationWonjune Kang, Deb Roy. [doi]
- Cognitive Insights Across Languages: Enhancing Multimodal Interview AnalysisDavid Ortiz-Perez, Jose Garcia-Rodriguez, David Tomás 0001. [doi]
- Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language ModelsZiyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang. [doi]
- Total-Duration-Aware Duration Modeling for Text-to-Speech SystemsSefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li 0001, Sheng Zhao, Naoyuki Kanda. [doi]
- SparseWAV: Fast and Accurate One-Shot Unstructured Pruning for Large Speech Foundation ModelsTianteng Gu, Bei Liu, Hang Shao 0005, Yanmin Qian. [doi]
- Enhanced Reverberation as Supervision for Unsupervised Speech SeparationKohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux. [doi]
- Effect of Complex Boundary Tones on Tone Identification: An Experimental Study with Mandarin-speaking Preschool ChildrenAijun Li, Jun Gao, Zhiwei Wang. [doi]
- PPPR: Portable Plug-in Prompt Refiner for Text to Audio GenerationShuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao 0001, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang. [doi]
- Reinforcement Learning from Answer Reranking Feedback for Retrieval-Augmented Answer GenerationMinh Nguyen 0007, Toàn Quoc Nguyên, Kishan KC, Zeyu Zhang 0002, Thuy Vu. [doi]
- Sound of Traffic: A Dataset for Acoustic Traffic Identification and CountingShabnam Ghaffarzadegan, Luca Bondi, Wei-Cheng Lin, Abinaya Kumar, Ho-Hsiang Wu, Hans-Georg Horst, Samarjit Das. [doi]
- NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting TranscriptionAlon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Peer, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong 0001, Min Tang, Huaming Wang, Eyal Krupka. [doi]
- Text-aware Speech Separation for Multi-talker Keyword SpottingHaoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan 0002, Hao Li, Kai Yu. [doi]
- MaskSR: Masked Language Model for Full-band Speech RestorationXu Li, Qirui Wang, Xiaoyu Liu. [doi]
- Speech Recognition Models are Strong Lip-readersK. R. Prajwal, Triantafyllos Afouras, Andrew Zisserman. [doi]
- RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape ClassificationJacob Bitterman, Daniel Levi, Hilel Hagai Diamandi, Sharon Gannot, Tal Rosenwein. [doi]
- Detection of background agents speech in contact centersAbhishek Kumar, Srikanth Konjeti, Jithendra Vepa. [doi]
- Autoregressive cross-interlocutor attention scores meaningfully capture conversational dynamicsMatthew McNeill, Rivka Levitan. [doi]
- What Does it Take to Generalize SER Model Across Datasets? A Comprehensive BenchmarkAdham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed. [doi]
- Collecting Mandible Movement in Brazilian PortugueseDonna Erickson, Albert Rilliard, Malin Svensson Lundmark, Adelaide Silva, Leticia Rebollo Couto, Oliver Niebuhr, João Antônio de Moraes. [doi]
- Generalized Fake Audio Detection via Deep Stable LearningZhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao 0001, Xin Qi, Yi Lu, Shuchen Shi. [doi]
- Tackling Missing Modalities in Audio-Visual Representation Learning Using Masked AutoencodersGeorgios Chochlakis, Chandrashekhar Lavania, Prashant Mathur, Kyu J. Han. [doi]
- Wave to Interlingua: Analyzing Representations of Multilingual Speech Transformers for Spoken Language TranslationBadr M. Abdullah, Mohammed Maqsood Shaik, Dietrich Klakow. [doi]
- Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speechShivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter. [doi]
- ElasticAST: An Audio Spectrogram Transformer for All Length and ResolutionsJiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak. [doi]
- Faster Vocoder: a multi threading approach to achieve low latency during TTS InferenceVishal Gourav, Ankit Tyagi, Phanindra Mankale. [doi]
- Speech ReaLLM - Real-time Speech Recognition with Multimodal Language Models by Teaching the Flow of TimeFrank Seide, Yangyang Shi, Morrie Doulaty, Yashesh Gaur, Junteng Jia, Chunyang Wu. [doi]
- A Functional Trade-off between Prosodic and Semantic Cues in Conveying SarcasmZhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler. [doi]
- Automatic Assessment of Dysarthria using Speech and synthetically generated Electroglottograph signalFathima Zaheera, Supritha Shetty, Gayadhar Pradhan, Deepak K. T. [doi]
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech RepresentationsKunal Dhawan, Nithin Rao Koluguri, Ante Jukic, Ryan Langman, Jagadeesh Balam, Boris Ginsburg. [doi]
- Text Injection for Neural Contextual BiasingZhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran. [doi]
- K-means and hierarchical clustering of f0 contoursConstantijn Kaland, Jeremy Steffman, Jennifer Cole 0001. [doi]
- Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-SpeechDong Yang, Tomoki Koriyama, Yuki Saito. [doi]
- PL-TTS: A Generalizable Prompt-based Diffusion TTS Augmented by Large Language ModelShuhua Li, Qirong Mao, Jiatong Shi. [doi]
- Investigating the Influence of Stance-Taking on Conversational Timing of Task-Oriented SpeechSara Ng, Gina-Anne Levow, Mari Ostendorf, Richard A. Wright. [doi]
- SimuSOE: A Simulated Snoring Dataset for Obstructive Sleep Apnea-Hypopnea Syndrome Evaluation during WakefulnessJie Lin, Xiuping Yang, Li Xiao 0007, Xinhong Li, Weiyan Yi, Yuhong Yang 0001, Weiping Tu, Xiong Chen. [doi]
- Mitigating Overfitting in Structured Pruning of ASR Models with Gradient-Guided Parameter RegularizationDong-hyun Kim, Joon-Hyuk Chang. [doi]
- Privacy PORCUPINE: Anonymization of Speaker Attributes Using Occurrence Normalization for Space-Filling Vector QuantizationMohammad Hassan Vali, Tom Bäckström. [doi]
- Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in ConversationHaoxiang Shi, Ziqi Liang, Jun Yu 0001. [doi]
- Parameter-Efficient Adapter Based on Pre-trained Models for Speech TranslationNan Chen, Yonghe Wang, Feilong Bao. [doi]
- Contrastive Learning and Inter-Speaker Distribution Alignment Based Unsupervised Domain Adaptation for Robust Speaker VerificationZuoliang Li, Wu Guo, Bin Gu, Shengyu Peng, Jie Zhang. [doi]
- No-Reference Speech Intelligibility Prediction Leveraging a Noisy-Speech ASR Pre-Trained ModelHaolan Wang, Amin Edraki, Wai-Yip Chan, Iván López-Espejo, Jesper Jensen 0001. [doi]
- PitchFlow: adding pitch control to a Flow-matching based TTS modelTasnima Sadekova, Mikhail A. Kudinov, Vadim Popov, Assel Yermekova, Artem Khrapov. [doi]
- MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event DetectionPengfei Cai, Yan Song 0001, Kang Li, Haoyu Song, Ian McLoughlin 0001. [doi]
- Challenging margin-based speaker embedding extractors by using the variational information bottleneckThemos Stafylakis, Anna Silnova, Johan Rohdin, Oldrich Plchot, Lukás Burget. [doi]
- Spoken-to-written text conversion with Large Language ModelHyunJung Choi, Muyeol Choi, Yohan Lim, Minkyu Lee, Seon Hui Kim, Seung Yun, Donghyun Kim, Sang-hun Kim. [doi]
- Quantifying the effect of speech pathology on automatic and human speaker verificationBence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan A. H. J. de Visscher, Max J. H. Witjes, Martijn Wieling 0001, Defne Abur, Tomoki Toda. [doi]
- SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture SpeechJingru Lin, Meng Ge, Junyi Ao, Liqun Deng, Haizhou Li 0001. [doi]
- Leveraging Adapter for Parameter-Efficient ASR EncoderKyuhong Shim, Jinkyu Lee 0004, Hyunjae Kim. [doi]
- Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech RecognitionKwangyoun Kim, Suwon Shon, Yi-Te Hsu, Prashant Sridhar, Karen Livescu, Shinji Watanabe 0001. [doi]
- Neural Codec-based Adversarial Sample Detection for Speaker VerificationXuanjun Chen, Jiawei Du, Haibin Wu, Jyh-Shing Roger Jang, Hung-yi Lee. [doi]
- Comparing ASR Systems in the Context of Speech DisfluenciesMaria Teleki, Xiangjue Dong, Soohwan Kim, James Caverlee. [doi]
- Sound Event Bounding BoxesJanek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux. [doi]
- MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword SpottingZhiqi Ai, Zhiyong Chen, Shugong Xu. [doi]
- Cross-Linguistic Intelligibility of Non-Compositional Expressions in Spoken ContextIuliia Zaitova, Irina Stenger, Wei Xue, Tania Avgustinova, Bernd Möbius, Dietrich Klakow. [doi]
- Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box PerformanceSi Ioi Ng, Lingfeng Xu, Kimberly D. Mueller, Julie Liss, Visar Berisha. [doi]
- Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and ProbingMartin Lebourdais, Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega Giménez. [doi]
- The prosody of the verbal prefix ge-: historical and experimental evidenceChiara Riegger, Tina Bögel, George Walkden. [doi]
- G2PA: G2P with Aligned Audio for Mandarin ChineseXingxing Yang 0005. [doi]
- Contemplative Mechanism for Speech Recognition: Speech Encoders can ThinkTien-Ju Yang, Andrew Rosenberg, Bhuvana Ramabhadran. [doi]
- Speakers Unembedded: Embedding-free Approach to Long-form Neural DiarizationXiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan. [doi]
- DiarizationLM: Speaker Diarization Post-Processing with Large Language ModelsQuan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, Hank Liao. [doi]
- Acceleration of Posteriorgram-based DTW by Distilling the Class-to-class Distances Encoded in the Classifier Used to Calculate PosteriorsHaitong Sun, Jaehyun Choi, Nobuaki Minematsu, Daisuke Saito. [doi]
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language UnderstandingSuwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe 0001, Karen Livescu. [doi]
- Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSingJiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe 0001. [doi]
- Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on WhisperTianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie. [doi]
- Enhancing Multimodal Emotion Recognition through ASR Error Compensation and LLM Fine-TuningJehyun Kyung, Serin Heo, Joon-Hyuk Chang. [doi]
- Gryannote open-source speaker diarization labeling toolClément Pages, Hervé Bredin. [doi]
- A novel experimental design for the study of listener-to-listener convergence in phoneme categorizationQingye Shen, Leonardo Lancia, Noël Nguyen. [doi]
- LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech RecognitionSreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha. [doi]
- Parameter-efficient Fine-tuning of Speaker-Aware Dynamic Prompts for Speaker VerificationZhe Li, Man-Wai Mak, Hung-yi Lee, Helen Meng. [doi]
- Scaling up masked audio encoder learning for general audio classificationHeinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang 0004. [doi]
- Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper ModelsYiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng. [doi]
- TokSing: Singing Voice Synthesis based on Discrete TokensYuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin. [doi]
- HybridVC: Efficient Voice Style Conversion with Text and Audio PromptsXinlei Niu, Jing Zhang 0052, Charles Patrick Martin. [doi]
- Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-ExampleSuhita Ghosh, Mélanie Jouaiti, Arnab Das, Yamini Sinha, Tim Polzehl, Ingo Siegert, Sebastian Stober. [doi]
- Improving Self-supervised Pre-training using Accent-Specific CodebooksDarshan Prabhu, Abhishek Gupta, Omkar Nitsure, Preethi Jyothi, Sriram Ganapathy. [doi]
- Binaural Selective Attention Model for Target Speaker ExtractionHanyu Meng, Qiquan Zhang, Xiangyu Zhang, Vidhyasaharan Sethu, Eliathamby Ambikairajah. [doi]
- Zero-Shot Fake Video Detection by Audio-Visual ConsistencyXiaolou Li, Zehua Liu, Chen Chen, Lantian Li, Li Guo, Dong Wang. [doi]
- Small-E: Small Language Model with Linear Attention for Efficient Speech SynthesisThéodor Lemerle, Nicolas Obin, Axel Roebel. [doi]
- A Pilot Study of GSLM-based Simulation of Foreign Accentuation Only Using Native Speech CorporaKentaro Onda, Joonyong Park, Nobuaki Minematsu, Daisuke Saito. [doi]
- Towards a General-Purpose Model of Perceived Pragmatic SimilarityNigel G. Ward, Andres Segura, Alejandro Ceballos, Divette Marco. [doi]
- MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveformsSeung-bin Kim, Chan-yeong Lim, Jungwoo Heo, Ju-ho Kim, Hyun-seo Shin, Kyo-Won Koo, Ha-Jin Yu. [doi]
- Modeling Vocal Tract Like Acoustic Tubes Using the Immersed Boundary MethodRongshuai Wu, Debasish Ray Mohapatra, Sidney Fels. [doi]
- Emotion Arithmetic: Emotional Speech Synthesis via Weight Space InterpolationPavan Kalyan, Preeti Rao, Preethi Jyothi, Pushpak Bhattacharyya. [doi]
- Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI DataTomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu 0001, Fangxu Xing, Maureen Stone 0001, Jiachen Zhuo, Jerry L. Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas K. Maier. [doi]
- Leveraging Phonemic Transcription and Whisper toward Clinically Significant Indices for Automatic Child Speech AssessmentYeh-Sheng Lin, Shu-Chuan Tseng, Jyh-Shing Roger Jang. [doi]
- All Neural Low-latency Directional Speech ExtractionAshutosh Pandey, Sanha Lee, Juan Azcarreta, Daniel Wong, Buye Xu. [doi]
- FLEURS-R: A Restored Multilingual Speech Corpus for Generation TasksMin Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani. [doi]
- On the impact of several regularization techniques on label noise robustness of self-supervised speaker verification systemsAbderrahim Fathan, Xiaolin Zhu, Jahangir Alam 0001. [doi]
- DB-PMAE: Dual-Branch Prototypical Masked AutoEncoder with locality for domain robust speaker verificationWei-Lin Xie, Yu-Xuan Xi, Yan Song, Jian-Tao Zhang, Hao-Yu Song, Ian McLoughlin 0001. [doi]
- Linear-Complexity Self-Supervised Learning for Speech ProcessingShucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya. [doi]
- Bird Whisperer: Leveraging Large Pre-trained Acoustic Model for Bird Call ClassificationMuhammad Umer Sheikh, Hassan Abid, Bhuiyan Sanjid Shafique, Asif Hanif, Muhammad Haris Khan. [doi]
- Improving Copy-Synthesis Anti-Spoofing Training Method with Rhythm and Speaker PerturbationJingze Lu, Yuxiang Zhang, Zhuo Li, Zengqiang Shang, Wenchao Wang, Pengyuan Zhang. [doi]
- Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow AnalysisKubilay Can Demir, Belén Lojo Rodríguez, Tobias Weise, Andreas K. Maier, Seung-Hee Yang. [doi]
- ComFeAT: combination of neural and spectral features for improved depression detectionOrchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- A New Approach to Voice AuthenticityNicolas M. Müller, Piotr Kawa, Shen Hu, Matthias Neu, Jennifer Williams 0001, Philip Sperl, Konstantin Böttinger. [doi]
- Unified Framework for Spoken Language Understanding and Summarization in Task-Based Human Dialog processingEunice Akani, Frédéric Bechet, Benoît Favre, Romain Gemignani. [doi]
- Uncertainty-Aware Mean Opinion Score PredictionHui Wang, Shiwan Zhao, Jiaming Zhou, Xiguang Zheng, Haoqin Sun, Xuechen Wang, Yong Qin. [doi]
- Hold Me Tight: Stable Encoder-Decoder Design for Speech EnhancementDaniel Haider, Felix Perfler, Vincent Lostanlen, Martin Ehler, Peter Balazs. [doi]
- RepCNN: Micro-sized, Mighty Models for Wakeword DetectionArnav Kundu, Prateeth Nayak, Priyanka Padmanabhan, Devang Naik. [doi]
- Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion RecognitionYing Hu 0005, Huamin Yang, Hao Huang 0009, Liang He 0003. [doi]
- Array Geometry-Robust Attention-Based Neural Beamformer for Moving SpeakersMarvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo. [doi]
- ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2vec2.0 Based ASRVishwanath Pratap Singh, Federico Malato, Ville Hautamäki, Md. Sahidullah, Tomi Kinnunen. [doi]
- Can Modelling Inter-Rater Ambiguity Lead To Noise-Robust Continuous Emotion Predictions?Ya-Tse Wu, Jingyao Wu 0002, Vidhyasaharan Sethu, Chi-Chun Lee. [doi]
- Examining Vocal Tract Coordination in Childhood Apraxia of Speech with Acoustic-to-Articulatory Speech Inversion Feature SetsNina R. Benway, Jonathan L. Preston, Carol Y. Espy-Wilson. [doi]
- Refining Self-supervised Learnt Speech Representation using Brain ActivationsHengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling. [doi]
- Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context MaskingKhanh Le, Duc Chau. [doi]
- URGENT Challenge: Universality, Robustness, and Generalizability For Speech EnhancementWangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Jan Pirklbauer, Marvin Sach, Shinji Watanabe 0001, Tim Fingscheidt, Yanmin Qian. [doi]
- Evaluating Speech Recognition Performance Towards Large Language Model Based Voice AssistantsZhe Liu 0011, Suyoun Kim, Ozlem Kalinli. [doi]
- PRVAE-VC2: Non-Parallel Voice Conversion by Distillation of Speech RepresentationsKou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Yuto Kondo. [doi]
- The Production of Contrastive Focus by 7 to 13-year-olds Learning Mandarin ChineseZimeng Li, Zhongxuan Mao, Shengting Shen, Ivan Yuen, Ping Tang. [doi]
- Extraction of interpretable and shared speaker-specific speech attributes through binary auto-encoderImen Ben Amor, Jean-François Bonastre, Salima Mdhaffar. [doi]
- SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space ModelingHiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix. [doi]
- FastLips: an End-to-End Audiovisual Text-to-Speech System with Lip Features Prediction for Virtual AvatarsMartin Lenglet, Olivier Perrotin, Gérard Bailly. [doi]
- Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution NetworkYanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He. [doi]
- Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice ConversionHiroki Kanagawa, Yusuke Ijima. [doi]
- Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus With FIRNet Source-Filter Neural VocoderTakuma Okamoto, Yamato Ohtani, Sota Shimizu, Tomoki Toda, Hisashi Kawai. [doi]
- What do people hear? Listeners' Perception of Conversational SpeechAdaeze Adigwe, Sarenne Wallbridge, Simon King. [doi]
- Voice quality in telephone speech: Comparing acoustic measures between VoIP telephone and high-quality recordingsChenzi Xu, Jessica Wormald, Paul Foulkes, Philip Harrison, Vincent Hughes, Poppy Welch, Finnian Kelly, David van der Vloed. [doi]
- Whisper Multilingual Downstream Task Tuning Using Task VectorsJi Hun Kang, Jae Hong Lee, Mun-Hak Lee, Joon-Hyuk Chang. [doi]
- Real-world PTSD Recognition: A Cross-corpus and Cross-linguistic EvaluationAlexander Kathan, Martin Bürger, Andreas Triantafyllopoulos, Sabrina Milkus, Jonas Hohmann, Pauline Muderlak, Jürgen Schottdorf, Richard Musil, Björn W. Schuller, Shahin Amiriparian. [doi]
- Schrödinger Bridge for Generative Speech EnhancementAnte Jukic, Roman Korostik, Jagadeesh Balam, Boris Ginsburg. [doi]
- Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech ModelsNeil Kumar Shah, Shirish S. Karande, Vineet Gandhi. [doi]
- Fine-Grained and Interpretable Neural Speech EditingMax Morrison, Cameron Churchwell, Nathan Pruyne, Bryan Pardo. [doi]
- A Language Modeling Approach to Diacritic-Free Hebrew TTSAmit Roth, Arnon Turetzky, Yossi Adi. [doi]
- 2.5D Vocal Tract Modeling: Bridging Low-Dimensional Efficiency with 3D AccuracyDebasish Ray Mohapatra, Victor Zappi, Sidney Fels. [doi]
- Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice SynthesisTaewoo Kim, Choongsang Cho, Young Han Lee. [doi]
- Efficient Audio Captioning with Encoder-Level Knowledge DistillationXuenan Xu, Haohe Liu, Mengyue Wu, Wenwu Wang 0001, Mark D. Plumbley. [doi]
- SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token SynchronizationYoungjin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim. [doi]
- Lifelong Learning MOS Prediction for Synthetic Speech Quality EvaluationFélix Saget, Meysam Shamsi, Marie Tahon. [doi]
- Self-Supervised Learning for ASR Pre-Training with Uniquely Determined Target Labels and Controlling Cepstrum Truncation for Speech AugmentationAkihiro Kato, Hiroyuki Nagano, Kohei Chike, Masaki Nose. [doi]
- Boosting CTC-based ASR using inter-layer attention-based CTC lossKeigo Hojo, Yukoh Wakabayashi, Kengo Ohta, Atsunori Ogawa, Norihide Kitaoka. [doi]
- A comparative analysis of sequential models that integrate syllable dependency for automatic syllable stress detectionJhansi Mallela, Sai Harshitha Aluru, Chiranjeevi Yarra. [doi]
- Phonological Feature Detection for US English using the Phonet LibraryHarsha Veena Tadavarthy, Austin Jones, Margaret E. L. Renwick. [doi]
- Dual-path Adaptation of Pretrained Feature Extraction Module for Robust Automatic Speech RecognitionHao Shi, Tatsuya Kawahara. [doi]
- SCDNet: Self-supervised Learning Feature based Speaker Change DetectionYue Li, Xinsheng Wang, Li Zhang, Lei Xie. [doi]
- COSMIC: Data Efficient Instruction-tuning For Speech In-Context LearningJing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu 0001, Jinyu Li. [doi]
- Glottal inverse filtering and vocal tract tuning for the numerical simulation of vowel /a/ with different levels of vocal effortMarc Freixes, Marc Arnela, Joan Claudi Socoró, Luis Joglar-Ongay, Oriol Guasch, Francesc Alías Pujol. [doi]
- YOLOPitch: A Time-Frequency Dual-Branch YOLO Model for Pitch EstimationXuefei Li, Hao Huang, Ying Hu, Liang He, Jiabao Zhang, Yuyi Wang. [doi]
- Comparing Discrete and Continuous Space LLMs for Speech RecognitionYaoxun Xu, Shi-Xiong Zhang 0001, Jianwei Yu 0001, Zhiyong Wu, Dong Yu 0001. [doi]
- Motion Based Audio-Visual SegmentationJiahao Li, Miao Liu, Shu Yang, Jing Wang, Xiang Xie. [doi]
- CreakVC: a voice conversion tool for modulating creaky voiceHarm Lameris, Joakim Gustafson, Éva Székely. [doi]
- Multi-Channel Extension of Pre-trained Models for Speaker VerificationLadislav Mosner, Romain Serizel, Lukás Burget, Oldrich Plchot, Emmanuel Vincent 0001, Junyi Peng, Jan Cernocký. [doi]
- Reduce, Reuse, Recycle: Is Perturbed Data Better than Other Language Augmentation for Low Resource Self-Supervised Speech ModelsAsad Ullah, Alessandro Ragano, Andrew Hines. [doi]
- Interference Aware Training Target for DNN based joint Acoustic Echo Cancellation and Noise SuppressionVahid Khanagha, Dimitris Koutsaidis, Kaustubh Kalgaonkar, Sriram Srinivasan. [doi]
- Reshape Dimensions Network for Speaker RecognitionIvan Yakovlev, Rostislav Makarov, Andrei Balykin, Pavel Malov, Anton Okhotnikov, Nikita Torgashov. [doi]
- A comparative study of the impact of voiceless alveolar and palato-alveolar sibilants in English on lip aperture and protrusion during VCV productionChetan Sharma, Vaishnavi Chandwanshi, Prasanta Kumar Ghosh. [doi]
- AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimatorsJaden Pieper, Stephen Voran. [doi]
- DeWinder: Single-Channel Wind Noise Reduction using Ultrasound SensingKuang Yuan, Shuo Han, Swarun Kumar, Bhiksha Raj. [doi]
- Exploring compressibility of transformer based text-to-music (TTM) modelsVasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan. [doi]
- Self-supervised speaker verification with relational mask predictionJu-ho Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, MinJae Lee, Ha-Jin Yu. [doi]
- EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-SpeechDeok-Hyeon Cho, Hyung-Seok Oh, Seung-bin Kim, Sang-Hoon Lee, Seong-Whan Lee. [doi]
- Investigating Confidence Estimation Measures for Speaker DiarizationAnurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna. [doi]
- Towards a Quantitative Analysis of Coarticulation with a Phoneme-to-Articulatory ModelChaofei Fan, Jaimie M. Henderson, Chris Manning, Francis R. Willett. [doi]
- Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free TrainingHiroki Kanagawa, Takafumi Moriya, Yusuke Ijima. [doi]
- Elucidating Clock-drift Using Real-world Audios In Wireless Mode For Time-offset Insensitive End-to-End Asynchronous Acoustic Echo CancellationPremanand Nayak, M. Ali Basha Shaik. [doi]
- X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice ConversionHoujian Guo, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro. [doi]
- SER Evals: In-domain and Out-of-domain benchmarking for speech emotion recognitionMohamed Osman, Daniel Z. Kaplan, Tamer Nadeem. [doi]
- Positional Description for Numerical NormalizationDeepanshu Gupta, Javier Latorre. [doi]
- The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised dataGeorgios Paraskevopoulos, Chara Tsoukala, Athanasios Katsamanis, Vassilis Katsouros. [doi]
- ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion DatasetsShahin Amiriparian, Filip Packan, Maurice Gerczuk, Björn W. Schuller. [doi]
- Instruction Data Generation and Unsupervised Adaptation for Speech Language ModelsVahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg. [doi]
- Leveraging Graphic and Convolutional Neural Networks for Auditory Attention Detection with EEGSaurav Pahuja, Gabriel Ivucic, Pascal Himmelmann, Siqi Cai, Tanja Schultz, Haizhou Li 0001. [doi]
- Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple featuresShaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo. [doi]
- Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data AnalysisXin Wang 0037, Tomi Kinnunen, Kong-Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi. [doi]
- Towards EMG-to-Speech with Necklace Form FactorPeter Wu, Ryan Kaveh, Raghav Nautiyal, Christine Zhang, Albert Guo, Anvitha Kachinthaya, Tavish Mishra, Bohan Yu, Alan W. Black, Rikky Muller, Gopala Krishna Anumanchipalli. [doi]
- Efficient Joint Beamforming and Acoustic Echo Cancellation Structure for Conference Call ScenariosOfer Schwartz, Sharon Gannot. [doi]
- Multi-speaker and multi-dialectal Catalan TTS models for video gamingAlex Peiró Lilja, José Giraldo, Martí Llopart-Font, Carme Armentano-Oller, Baybars Külebi, Mireia Farrús. [doi]
- Using wav2vec 2.0 for phonetic classification tasks: methodological aspectsLila Kim, Cédric Gendrot. [doi]
- Reducing Speech Distortion and Artifacts for Speech Enhancement by Loss Function. Haixin Guan, Wei Dai, Guangyong Wang, Xiaobin Tan, Peng Li, Jiaen Liang. [doi]
- InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions. Yu Nakagome, Michael Hentschel. [doi]
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter. Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg. [doi]
- Efficient Integrated Features Based on Pre-trained Models for Speaker Verification. Yishuang Li, Wenhao Guan, Hukai Huang, Shiyu Miao, Qi Su, Lin Li, Qingyang Hong. [doi]
- Enhancing Japanese Text-to-Speech Accuracy with a Novel Combination Transformer-BERT-based G2P: Integrating Pronunciation Dictionaries and Accent Sandhi. Kiyoshi Kurihara, Masanori Sano. [doi]
- Reinforcement Learning based Data Augmentation for Noise Robust Speech Emotion Recognition. Sumit Ranjan, Rupayan Chakraborty, Sunil Kumar Kopparapu. [doi]
- Contextual Biasing Speech Recognition in Speech-enhanced Large Language Model. Xun Gong 0005, Anqi Lv, Zhiming Wang, Yanmin Qian. [doi]
- Entrainment Analysis and Prosody Prediction of Subsequent Interlocutor's Backchannels in Dialogue. Keiko Ochi, Koji Inoue, Divesh Lala, Tatsuya Kawahara. [doi]
- PARIS: Pseudo-AutoRegressIve Siamese Training for Online Speech Separation. Zexu Pan, Gordon Wichern, François G. Germain, Kohei Saijo, Jonathan Le Roux. [doi]
- Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer. Peng Wang, Yifan Yang 0005, Zheng Liang, Tian Tan 0002, Shiliang Zhang, Xie Chen 0001. [doi]
- ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation. Yatong Bai, Trung Dang 0002, Dung N. Tran, Kazuhito Koishida, Somayeh Sojoudi. [doi]
- Universal Score-based Speech Enhancement with High Content Preservation. Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu. [doi]
- M2ASR: Multilingual Multi-task Automatic Speech Recognition via Multi-objective Optimization. A. F. M. Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen. [doi]
- A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives. Jan Lehecka, Josef V. Psutka, Lubos Smídl, Pavel Ircing, Josef Psutka. [doi]
- Adapter Learning from Pre-trained Model for Robust Spoof Speech Detection. Haochen Wu, Wu Guo, Shengyu Peng, Zhuhai Li, Jie Zhang. [doi]
- IIITH Ucchar e-Sudharak: an automatic English pronunciation corrector for school-going children with a teacher in the loop. Meenakshi Sirigiraju, Arjun Rajasekar, Abhishikth Meejuri, Chiranjeevi Yarra. [doi]
- Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge. Rui Liu, Zening Ma. [doi]
- A Dataset and Two-pass System for Reading Miscue Detection. Raj Gothi, Rahul Kumar, Mildred Pereira, Nagesh Nayak, Preeti Rao. [doi]
- MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement. Zizhen Lin, Xiaoting Chen, Junyu Wang. [doi]
- Speech Formants Integration for Generalized Detection of Synthetic Speech Spoofing Attacks. Kexu Liu, Yuanxin Wang, Shengchen Li, Xi Shao. [doi]
- Enrolment-based personalisation for improving individual-level fairness in speech emotion recognition. Andreas Triantafyllopoulos, Björn W. Schuller. [doi]
- HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition. Ji Won Yoon, Beom Jun Woo, Nam Soo Kim. [doi]
- tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models. Francesco Paissan, Elisabetta Farella. [doi]
- FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation. Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo. [doi]
- SOT Triggered Neural Clustering for Speaker Attributed ASR. Xianrui Zheng, Guangzhi Sun, Chao Zhang, Philip C. Woodland. [doi]
- Neural Codec Language Models for Disentangled and Textless Voice Conversion. Alan Baade, Puyuan Peng, David Harwath. [doi]
- The speech motor chaining web app for speech motor learning. Jonathan L. Preston, Nina R. Benway, Nathan R. Prestopnik, Nathan Preston. [doi]
- A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition. Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee. [doi]
- Spoofed Speech Detection with a Focus on Speaker Embedding. Hoan My Tran, David Guennec, Philippe Martin, Aghilas Sini, Damien Lolive, Arnaud Delhay, Pierre-François Marteau. [doi]
- SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis. Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari. [doi]
- INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition. Andreas Triantafyllopoulos, Anton Batliner, Simon David Noel Rampp, Manuel Milling, Björn W. Schuller. [doi]
- TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech. Donghyun Seong, Hoyoung Lee, Joon-Hyuk Chang. [doi]
- Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection. Juan M. Martín-Doñas, Aitor Álvarez, Eros Rosello, Angel M. Gomez, Antonio M. Peinado. [doi]
- An Effective Local Prototypical Mapping Network for Speech Emotion Recognition. Yuxuan Xi, Yan Song 0001, Lirong Dai 0001, Haoyu Song, Ian McLoughlin 0001. [doi]
- "So . . . my child . . . " - How Child ADHD Influences the Way Parents Talk. Anika A. Spiesberger, Andreas Triantafyllopoulos, Alexander Kathan, Anastasia Semertzidou, Caterina Gawrilow, Tilman Reinelt, Wolfgang A. Rauch, Björn W. Schuller. [doi]
- Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment. Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach. [doi]
- Neurocomputational model of speech recognition for pathological speech detection: a case study on Parkinson's disease speech detection. Sevada Hovsepyan, Mathew Magimai-Doss. [doi]
- SEQ-former: A context-enhanced and efficient automatic speech recognition framework. Qinglin Meng, Min Liu, Kaixun Huang, Kun Wei, Lei Xie, Zongfeng Quan, Weihong Deng, Quan Lu, Ning Jiang, Guoqing Zhao. [doi]
- Crosslinguistic Comparison of Acoustic Variation in the Vowel Sequences /ia/ and /io/ in Four Romance Languages. Johanna Cronenberg, Ioana Chitoran, Lori Lamel, Ioana Vasilescu. [doi]
- Pre-trained Feature Fusion and Matching for Mild Cognitive Impairment Detection. Junwen Duan, Fangyuan Wei, Hong-Dong Li, Jin Liu. [doi]
- Factor-Conditioned Speaking-Style Captioning. Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura. [doi]
- Cascaded Transfer Learning Strategy for Cross-Domain Alzheimer's Disease Recognition through Spontaneous Speech. Guanlin Chen, Yun Jin. [doi]
- Balance, Multiple Augmentation, and Re-synthesis: A Triad Training Strategy for Enhanced Audio Deepfake Detection. Thien-Phuc Doan, Long Nguyen-Vu, Kihun Hong, Souhwan Jung. [doi]
- Towards measuring fairness in speech recognition: Fair-Speech dataset. Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, Michael L. Seltzer. [doi]
- Singing Voice Graph Modeling for SingFake Detection. Xuanjun Chen, Haibin Wu, Roger Jang, Hung-yi Lee. [doi]
- Integrating Speech Self-Supervised Learning Models and Large Language Models for ASR. Ling Dong, Zhengtao Yu 0001, Wenjun Wang, Yuxin Huang, Shengxiang Gao, Guojiang Zhou. [doi]
- Exploring Pre-trained Speech Model for Articulatory Feature Extraction in Dysarthric Speech Using ASR. Yuqin Lin, Longbiao Wang, Jianwu Dang 0001, Nobuaki Minematsu. [doi]
- DNN-based monaural speech enhancement using alternate analysis windows for phase and magnitude modification. Xi Liu, John H. L. Hansen. [doi]
- Understanding "understanding": presenting a richly annotated multimodal corpus of dyadic interaction. Leonie Schade, Nico Dallmann, Olcay Türk, Stefan Lazarov, Petra Wagner. [doi]
- Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. Christina Tånnander, Shivam Mehta, Jonas Beskow, Jens Edlund. [doi]
- YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection. Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno-Tempini, Jiachen Lian, Gopala Anumanchipalli. [doi]
- FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis. Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang. [doi]
- Mandarin T3 Production by Chinese and Japanese Native Speakers. Qi Wu. [doi]
- wTIMIT2mix: A Cocktail Party Mixtures Database to Study Target Speaker Extraction for Normal and Whispered Speech. Marvin Borsdorf, Zexu Pan, Haizhou Li 0001, Tanja Schultz. [doi]
- VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech. Heeseung Kim, Sang Gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim 0001, Sungroh Yoon. [doi]
- On the Usefulness of Speaker Embeddings for Speaker Retrieval in the Wild: A Comparative Study of x-vector and ECAPA-TDNN Models. Erfan Loweimi, Mengjie Qian, Kate M. Knill, Mark J. F. Gales. [doi]
- MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios. Yu-Wen Chen, Zhou Yu, Julia Hirschberg. [doi]
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios. Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi. [doi]
- One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model. Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu. [doi]
- DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval. Yifei Xin, Xuxin Cheng, Zhihong Zhu, Xusheng Yang, Yuexian Zou. [doi]
- Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting. Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe 0001. [doi]
- Contextual Interactive Evaluation of TTS Models in Dialogue Systems. Siyang Wang, Éva Székely, Joakim Gustafson. [doi]
- On the relationship between speech production and vocabulary size in 3-5 year olds. Alexis DeMaere, Nicole van Rootselaar, Fangfang Li, Robbin Gibb, Claudia L. R. Gonzalez. [doi]
- Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer. Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara. [doi]
- LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning. Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana. [doi]
- Exploring the limits of decoder-only models trained on public speech recognition corpora. Ankit Gupta 0001, George Saon, Brian Kingsbury. [doi]
- Robust spread spectrum speech watermarking using linear prediction and deep spectral shaping. David Looney, Nikolay D. Gaubitch. [doi]
- Phonological-Level Mispronunciation Detection and Diagnosis. Mostafa Shahin, Beena Ahmed. [doi]
- Knowledge-Preserving Pluggable Modules for Multilingual Speech Translation Tasks. Nan Chen, Yonghe Wang, Feilong Bao. [doi]
- Towards Audio Codec-based Speech Separation. Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma 0001. [doi]
- Exploring adaptation techniques of large speech foundation models for low-resource ASR: a case study on Northern Sámi. Yaroslav Getman, Tamás Grósz, Katri Hiovain-Asikainen, Mikko Kurimo. [doi]
- Searching for Structure: Appraising the Organisation of Speech Features in wav2vec 2.0 Embeddings. Patrick Cormac English, John D. Kelleher, Julie Carson-Berndsen. [doi]
- Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe 0001, Barry-John Theobald. [doi]
- Speaker Conditional Sinc-Extractor for Personal VAD. En-Lun Yu, Kuan-Hsun Ho, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen. [doi]
- Real-time Speech Summarization for Medical Conversations. Khai Le-Duc, Khai-Nguyen Nguyen, Long Vo-Dang, Truong-Son Hy. [doi]
- Acoustical analysis of the initial phones in speech-laugh. Ryo Setoguchi, Yoshiko Arimoto. [doi]
- AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection. Anbai Jiang, Bing Han, Zhiqiang Lv, YuFeng Deng, Wei-Qiang Zhang, Xie Chen 0001, Yanmin Qian, Jia Liu 0001, Pingyi Fan. [doi]
- Self-Supervised Embeddings for Detecting Individual Symptoms of Depression. Sri Harsha Dumpala, Katerina Dikaios, Abraham Nunes, Frank Rudzicz, Rudolf Uher, Sageev Oore. [doi]
- Transformer-based Model for ASR N-Best Rescoring and Rewriting. Iwen E. Kang, Christophe Van Gysel, Man-Hung Siu. [doi]
- Prosody-Driven Privacy-Preserving Dementia Detection. Dominika Woszczyk, Ranya Aloufi, Soteris Demetriou. [doi]
- Lightweight Audio Segmentation for Long-form Speech Translation. Jaesong Lee, Soyoon Kim, Hanbyul Kim, Joon Son Chung. [doi]
- This Paper Had the Smartest Reviewers - Flattery Detection Utilising an Audio-Textual Transformer-Based Approach. Lukas Christ, Shahin Amiriparian, Friederike Hawighorst, Ann-Kathrin Schill, Angelo Boutalikakis, Lorenz Graf-Vlachy, Andreas König 0007, Björn W. Schuller. [doi]
- XANE: eXplainable Acoustic Neural Embeddings. Sri Harsha Dumpala, Dushyant Sharma, Chandramouli Shama Sastry, Stanislav Yu. Kruchinin, James Fosburgh, Patrick A. Naylor. [doi]
- The Interspeech 2024 Challenge on Speech Processing Using Discrete Units. Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe 0001, Yossi Adi, Xie Chen 0001, Qin Jin. [doi]
- USD-AC: Unsupervised Speech Disentanglement for Accent Conversion. Jen-Hung Huang, Wei-Tsung Lee, Chung-Hsien Wu 0001. [doi]
- Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings. Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra. [doi]
- Voice Disorder Analysis: a Transformer-based Approach. Alkis Koudounas, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Elena Baralis. [doi]
- VoxSim: A perceptual voice similarity dataset. Junseok Ahn, Youkyum Kim, Yeunju Choi, Doyeop Kwak, Ji-Hoon Kim, Seongkyu Mun, Joon Son Chung. [doi]
- Optimizing Large-Scale Context Retrieval for End-to-End ASR. Zhiqi Huang, Diamantino Caseiro, Kandarp Joshi, Christopher Li, Pat Rondon, Zelin Wu, Petr Zadrazil, Lillian Zhou. [doi]
- An efficient text augmentation approach for contextualized Mandarin speech recognition. Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Huan Zhou 0008. [doi]
- Prompt Tuning for Speech Recognition on Unknown Spoken Name Entities. Xizi Wei, Stephen McGregor. [doi]
- SeMaScore: A new evaluation metric for automatic speech recognition tasks. Zitha Sasindran, Harsha Yelchuri, T. Venkata Prabhakar. [doi]
- Oversampling, Augmentation and Curriculum Learning for Speaking Assessment with Limited Training Data. Tin Mei Lun, Ekaterina Voskoboinik, Ragheb Al-Ghezi, Tamás Grósz, Mikko Kurimo. [doi]
- Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies. Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, Mitesh M. Khapra. [doi]
- Hierarchical Multi-Task Learning with CTC and Recursive Operation. Nahomi Kusunoki, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi. [doi]
- Improving Noise Robustness in Self-supervised Pre-trained Model for Speaker Verification. Chan-yeong Lim, Hyun-seo Shin, Ju-ho Kim, Jungwoo Heo, Kyo-Won Koo, Seung-bin Kim, Ha-Jin Yu. [doi]
- CALL system using pitch-accent feature representations reflecting listeners' subjective adequacy. Ikuyo Masuda-Katsuse, Ayako Shirose. [doi]
- An Attribute Interpolation Method in Speech Synthesis by Model Merging. Masato Murata, Koichi Miyazaki, Tomoki Koriyama. [doi]
- Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models. Xuenan Xu, Pingyue Zhang, Ming Yan 0008, Ji Zhang, Mengyue Wu. [doi]
- Fine-tune Pre-Trained Models with Multi-Level Feature Fusion for Speaker Verification. Shengyu Peng, Wu Guo, Haochen Wu, Zuoliang Li, Jie Zhang. [doi]
- AraOffence: Detecting Offensive Speech Across Dialects in Arabic Media. Youssef Nafea, Shady Shehata, Zeerak Talat, Ahmed Aboeitta, Ahmed Sharshar, Preslav Nakov. [doi]
- BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation. Zihan Zhang, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie 0001. [doi]
- SilentCipher: Deep Audio Watermarking. Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji. [doi]
- DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models. Tzu-Quan Lin, Hung-yi Lee, Hao Tang 0002. [doi]
- A Framework for Phoneme-Level Pronunciation Assessment Using CTC. Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi. [doi]
- Prompting Whisper for QA-driven Zero-shot End-to-end Spoken Language Understanding. Mohan Li, Simon Keizer, Rama Doddipatla. [doi]
- IndicMOS: Multilingual MOS Prediction for 7 Indian languages. Sathvik Udupa, Soumi Maiti, Prasanta Kumar Ghosh. [doi]
- RW-VoiceShield: Raw Waveform-based Adversarial Attack on One-shot Voice Conversion. Ching-Yu Yang, Shreya G. Upadhyay, Ya-Tse Wu, Bo-Hao Su, Chi-Chun Lee. [doi]
- On the Use of Plausible Arguments in Explainable Conversational AI. Martina Di Bratto, Maria Di Maro, Antonio Origlia. [doi]
- Target Speaker Extraction with Curriculum Learning. Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi. [doi]
- ANIMAL-CLEAN - A Deep Denoising Toolkit for Animal-Independent Signal Enhancement. Alexander Barnhill, Elmar Nöth, Andreas K. Maier, Christian Bergler. [doi]
- Simulating articulatory trajectories with phonological feature interpolation. Angelo Ortiz Tandazo, Thomas Schatz, Thomas Hueber, Emmanuel Dupoux. [doi]
- Out-of-distribution generalisation in spoken language understanding. Dejan Porjazovski, Anssi Moisio, Mikko Kurimo. [doi]
- MSA-DPCRN: A Multi-Scale Asymmetric Dual-Path Convolution Recurrent Network with Attentional Feature Fusion for Acoustic Echo Cancellation. Ye Ni, Cong Pang, Chengwei Huang, Cairong Zou. [doi]
- Detection of Cognitive Impairment And Alzheimer's Disease Using a Speech- and Language-Based Protocol. Tanya Talkar, Sherman Charles, Chelsea Krantsevich, Kan Kawabata. [doi]
- Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of Speech-Silence and Word-Punctuation. Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu. [doi]
- Keep, Delete, or Substitute: Frame Selection Strategy for Noise-Robust Speech Emotion Recognition. Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso. [doi]
- FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses. Zhongweiyang Xu, Ali Aroudi, Ke Tan 0001, Ashutosh Pandey, Jung-Suk Lee, Buye Xu, Francesco Nesta. [doi]
- FakeSound: Deepfake General Audio Detection. Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu. [doi]
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition. William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs. [doi]
- Electroglottography for the assessment of dysphonia in Parkinson's disease and multiple system atrophy. Khalid Daoudi, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Margherita Fabbri, Anne Pavy-Le Traon, Olivier Rascol, Virginie Woisard, Wassilios G. Meissner. [doi]
- CogniVoice: Multimodal and Multilingual Fusion Networks for Mild Cognitive Impairment Assessment from Spontaneous Speech. Jiali Cheng, Mohamed Elgaar, Nidhi Vakil, Hadi Amiri. [doi]
- GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech. Wenbin Wang, Yang Song, Sanjay Jha 0001. [doi]
- VoxFlow AI: wearable voice converter for atypical speech. Grzegorz P. Mika, Konrad Zielinski, Pawel Cyrta, Marek Grzelec. [doi]
- Active Speaker Detection in Fisheye Meeting Scenes with Scene Spatial Spectrums. Xinghao Huang, WeiWei Jiang, Long Rao, Wei Xu, Wenqing Cheng. [doi]
- Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion. Pravin Mote, Berrak Sisman, Carlos Busso. [doi]
- Modelling Lexical Characteristics of the Healthy Aging Population: A Corpus-Based Study. Han Kunmei. [doi]
- Blind Zero-Shot Audio Restoration: A Variational Autoencoder Approach for Denoising and Inpainting. Veranika Boukun, Jakob Drefs, Jörg Lücke. [doi]
- Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio. Li Li 0063, Shogo Seki. [doi]
- Zero-shot Out-of-domain is No Joke: Lessons Learned in the VoiceMOS 2023 MOS Prediction Challenge. Marie Kunesová, Jan Lehecka, Josef Michálek, Jindrich Matousek, Jan Svec. [doi]
- Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert. Han EunGi, Oh Hyun-Bin, Kim Sung-Bin, Corentin Nivelet Etcheberry, Suekyeong Nam, Janghoon Ju, Tae Hyun Oh. [doi]
- Quantifying the Role of Textual Predictability in Automatic Speech Recognition. Sean Robertson, Gerald Penn, Ewan Dunbar. [doi]
- E-ODN: An Emotion Open Deep Network for Generalised and Adaptive Speech Emotion Recognition. Liuxian Ma, Lin Shen, Ruobing Li, Haojie Zhang, Kun Qian, Bin Hu, Björn W. Schuller, Yoshiharu Yamamoto. [doi]
- SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models. Chun Yin, Tai-Shih Chi, Yu Tsao 0001, Hsin-Min Wang. [doi]
- Large Language Models for Dysfluency Detection in Stuttered Speech. Dominik Wagner 0002, Sebastian P. Bayerl, Ilja Baumann, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet. [doi]
- SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models. Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin. [doi]
- Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment. Heejin Do, Wonjun Lee, Gary Geunbae Lee. [doi]
- Zero-Shot End-To-End Spoken Question Answering In Medical Domain. Yanis Labrak, Adel Moumen, Richard Dufour, Mickael Rouvier. [doi]
- Variable Segment Length and Domain-Adapted Feature Optimization for Speaker Diarization. Chenyuan Zhang, Linkai Luo, Hong Peng, Wei Wen. [doi]
- Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder. Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng. [doi]
- Fine-Tuning Strategies for Dutch Dysarthric Speech Recognition: Evaluating the Impact of Healthy, Disease-Specific, and Speaker-Specific Data. Spyretta Leivaditi, Tatsunari Matsushima, Matt Coler, Shekhar Nayak, Vass Verkhodanova. [doi]
- DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness. Vikentii Pankov, Valeria Pronina, Alexander Kuzmin, Maksim Borisov, Nikita Usoltsev, Xingshan Zeng, Alexander Golubkov, Nikolai Ermolenko, Aleksandra Shirshova, Yulia Matveeva. [doi]
- SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios. Hazim T. Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh. [doi]
- Decoding Human Language Acquisition: EEG Evidence for Predictive Probabilistic Statistics in Word Segmentation. Bin Zhao, Mingxuan Huang, Chenlu Ma, Jinyi Xue, Aijun Li, Kunyu Xu. [doi]
- RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation. Liam Kelley, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine 0002, Yoshiaki Bando, Kazuyoshi Yoshii. [doi]
- MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations. Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah. [doi]
- Are Articulatory Feature Overlaps Shrouded in Speech Embeddings? Erfan A. Shams, Iona Gessinger, Patrick Cormac English, Julie Carson-Berndsen. [doi]
- Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer. Jizhen Li, Xinmeng Xu, Weiping Tu, Yuhong Yang 0001, Rong Zhu. [doi]
- A comparison of voice similarity through acoustics, human perception and deep neural network (DNN) speaker verification systems. Suyuan Liu, Molly Babel, Jian Zhu. [doi]
- Residual Speaker Representation for One-Shot Voice Conversion. Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao 0001. [doi]
- Relational Proxy Loss for Audio-Text based Keyword Spotting. Youngmoon Jung, Seungjin Lee, Joon-Young Yang, Jaeyoung Roh, Chang Woo Han, Hoonyoung Cho. [doi]
- MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition. Haiyang Sun 0004, Fulin Zhang, Yingying Gao, Shilei Zhang, Zheng Lian, Junlan Feng. [doi]
- An Inter-Speaker Fairness-Aware Speech Emotion Regression Framework. Hsing-Hang Chou, Woan-Shiuan Chien, Ya-Tse Wu, Chi-Chun Lee. [doi]
- Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition. Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li. [doi]
- Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models. Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee. [doi]
- Sequential Editing for Lifelong Training of Speech Recognition Models. Devang Kulshreshtha, Nikolaos Pappas 0004, Brady Houston, Saket Dingliwal, Srikanth Ronanki. [doi]
- MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning. Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu 0015, Zejun Ma. [doi]
- DysArinVox: DYSphonia & DYSarthria mandARIN speech corpus. Haojie Zhang, Tao Zhang, Ganjun Liu, Dehui Fu, Xiaohui Hou, Ying Lv. [doi]
- Optimizing the role of human evaluation in LLM-based spoken document summarization systems. Margaret Kroll, Kelsey Kraus. [doi]
- Shared-Adapters: A Novel Transformer-based Parameter Efficient Transfer Learning Approach For Children's Automatic Speech Recognition. Thomas Rolland, Alberto Abad. [doi]
- DiffVC+: Improving Diffusion-based Voice Conversion for Speaker Anonymization. Fan Huang, Kun Zeng, Wei Zhu. [doi]
- Joint Learning of Context and Feedback Embeddings in Spoken Dialogue. Livia Qian, Gabriel Skantze. [doi]
- Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition. Chengxu Yang, Lin Zheng, Sanli Tian, Gaofeng Cheng, Sujie Xiao, Ta Li. [doi]
- Interleaved Audio/Audiovisual Transfer Learning for AV-ASR in Low-Resourced Languages. Zhengyang Li, Patrick Blumenberg, Jing Liu, Thomas Graave, Timo Lohrenz, Siegfried Kunzmann, Tim Fingscheidt. [doi]
- Adding User Feedback To Enhance CB-Whisper. Raul Monteiro. [doi]
- Online Knowledge Distillation of Decoder-Only Large Language Models for Efficient Speech Recognition. Jeehye Lee, Hyeji Seo. [doi]
- Leveraging Speech Data Diversity to Document Indigenous Heritage and Culture. Allahsera Tapo, Éric Le Ferrand, Zoey Liu, Christopher Homan, Emily Prud'hommeaux. [doi]
- Locally Aligned Rectified Flow Model for Speech Enhancement Towards Single-Step Diffusion. Zhengxiao Li, Nakamasa Inoue. [doi]
- DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition. Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard 0001, Alice Baird, Björn W. Schuller. [doi]
- Complex Image-Generative Diffusion Transformer for Audio Denoising. Junhui Li, Pu Wang, Jialu Li, Youshan Zhang. [doi]
- RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention. Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie 0001. [doi]
- Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments? Candy Olivia Mawalim, Shogo Okada, Masashi Unoki. [doi]
- Modeling probabilistic reduction across domains with Naive Discriminative Learning. Anna Stein, Kevin Tang. [doi]
- State-of-the-art speech production MRI protocol for new 0.55 Tesla scanners. Prakash Kumar, Ye Tian, Yongwan Lim, Sophia X. Cui, Christina Hagedorn, Dani Byrd, Uttam K. Sinha, Shrikanth Narayanan, Krishna S. Nayak. [doi]
- MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features. Katharina Anderer, Andreas Reich, Matthias Wölfel. [doi]
- Deciphering Assamese Vowel Harmony with Featural InfoWaveGAN. Sneha Ray Barman, Shakuntala Mahanta, Neeraj Kumar Sharma 0007. [doi]
- UY/CH-CHILD - A Public Chinese L2 Speech Database of Uyghur Children. Mewlude Nijat, Chen Chen, Dong Wang, Askar Hamdulla. [doi]
- Speech emotion recognition with deep learning beamforming on a distant human-robot interaction scenario. Ricardo García, Rodrigo Mahú, Nicolás Grágeda, Alejandro Luzanto, Nicolas Bohmer, Carlos Busso, Néstor Becerra Yoma. [doi]
- VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech. Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah. [doi]
- Efficient SQA from Long Audio Contexts: A Policy-driven Approach. Alexander Johnson, Peter Plantinga, Pheobe Sun, Swaroop Gadiyaram, Abenezer Girma, Ahmad Emami. [doi]
- Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction. Daniela A. Wiepert, Rene L. Utianski, Joseph R. Duffy, John L. Stricker, Leland R. Barnard, David T. Jones, Hugo Botha. [doi]
- Using articulated speech EEG signals for imagined speech decoding. Chris Bras, Tanvina Patel, Odette Scharenborg. [doi]
- Spoken-Term Discovery using Discrete Speech Units. Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau, Herman Kamper. [doi]
- Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems. Ajinkya Kulkarni, Atharva Kulkarni, Miguel Couceiro, Isabel Trancoso. [doi]
- Exploring the Robustness of Text-to-Speech Synthesis Based on Diffusion Probabilistic Models to Heavily Noisy Transcriptions. Jingyi Feng, Yusuke Yasuda, Tomoki Toda. [doi]
- Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals. Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari. [doi]
- ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency. Yafeng Chen, Siqi Zheng, Hui Wang 0030, Luyao Cheng, Qian Chen, Shiliang Zhang, Junjie Li. [doi]
- Self-training ASR Guided by Unsupervised ASR Teacher. Hyung Yong Kim, Byeong-Yeol Kim, Yunkyu Lim, Jihwan Park, Shukjae Choi, Yooncheol Ju, Jinseok Park, Youshin Lim, Seung Woo Yu, Hanbin Lee, Shinji Watanabe 0001. [doi]
- Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility. Xiuwen Zheng 0003, Bornali Phukon, Mark Hasegawa-Johnson. [doi]
- DBD-CI: Doubling the Band Density for Bilateral Cochlear Implants. Mingyue Shi, Huali Zhou, Qinglin Meng, Nengheng Zheng. [doi]
- Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis. Wing-Zin Leung, Mattias Cross, Anton Ragni, Stefan Goetze. [doi]
- TEEMI: a speaking practice tool for L2 English learners. Szu-Yu Chen, Tien-Hong Lo, Yao-Ting Sung, Ching-Yu Tseng, Berlin Chen. [doi]
- Towards objective and interpretable speech disorder assessment: a comparative analysis of CNN and transformer-based models. Malo Maisonneuve, Corinne Fredouille, Muriel Lalain, Alain Ghio, Virginie Woisard. [doi]
- High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model. Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim. [doi]
- Multimodal Segmentation for Vocal Tract Modeling. Rishi Jain, Bohan Yu, Peter Wu, Tejas S. Prabhune, Gopala Anumanchipalli. [doi]
- A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition. Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie. [doi]
- Centroid Estimation with Transformer-Based Speaker Embedder for Robust Target Speaker Extraction. Woon-Haeng Heo, Joongyu Maeng, Yoseb Kang, Namhyun Cho. [doi]
- Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations. Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater. [doi]
- Towards End-to-End Unified Recognition for Mandarin and Cantonese. Meiling Chen, Pengjie Liu, Heng Yang, Haofeng Wang. [doi]
- Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling. Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang. [doi]
- Automatic recognition and detection of aphasic natural speech. Mara Barberis, Pieter De Clercq, Bastiaan Tamm, Hugo Van Hamme, Maaike Vandermosten. [doi]
- Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition. Oliver Schrüfer, Manuel Milling, Felix Burkhardt, Florian Eyben, Björn W. Schuller. [doi]
- LingWav2Vec2: Linguistic-augmented wav2vec 2.0 for Vietnamese Mispronunciation Detection. Tuan Nguyen, Huy Dat Tran. [doi]
- A Small and Fast BERT for Chinese Medical Punctuation Restoration. Tongtao Ling, Yutao Lai, Lei Chen, Shilei Huang, Yi Liu. [doi]
- Prompting Large Language Models with Mispronunciation Detection and Diagnosis Abilities. Minglin Wu, Jing Xu, Xixin Wu, Helen Meng. [doi]
- Towards Multilingual Audio-Visual Question Answering. Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- Enhancing Neural Transducer for Multilingual ASR with Synchronized Language Diarization. Amir Hussein, Desh Raj, Matthew Wiesner, Daniel Povey, Paola Garcia, Sanjeev Khudanpur. [doi]
- Infusing Acoustic Pause Context into Text-Based Dementia Assessment. Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer. [doi]
- TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information. Yiwen Wang, Xihong Wu. [doi]
- A Contrastive Learning Approach to Mitigate Bias in Speech Models. Alkis Koudounas, Flavio Giobergia, Eliana Pastor, Elena Baralis. [doi]
- Frication noise features of Polish voiceless dental fricative and affricate produced by children with and without speech disorder. Zuzanna Miodonska, Michal Krecichwost, Ewa Kwasniok, Agata Sage, Pawel Badura. [doi]
- Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator. Woo Jin Chung, Hong-Goo Kang. [doi]
- LoRA-MER: Low-Rank Adaptation of Pre-Trained Speech Models for Multimodal Emotion Recognition Using Mutual Information. Yunrui Cai, Zhiyong Wu 0001, Jia Jia 0001, Helen Meng. [doi]
- Quantity-sensitivity affects recall performance of word stress. Constantijn Kaland, Maria Lialiou. [doi]
- GSQA: An End-to-End Model for Generative Spoken Question Answering. Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-wen Li 0001, Hung-yi Lee. [doi]
- ASA: An Auditory Spatial Attention Dataset with Multiple Speaking Locations. Zijie Lin, Tianyu He, Siqi Cai, Haizhou Li 0001. [doi]
- Affricates in Lushootseed. Ted Kye. [doi]
- The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments. Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K. T., S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy. [doi]
- Analyzing Speech Motor Movement using Surface Electromyography in Minimally Verbal Adults with Autism Spectrum Disorder. Wazeer Zulfikar, Nishat Protyasha, Camila Canales, Heli Patel, James Williamson, Laura Sarnie, Lisa Nowinski, Nataliya Kosmyna, Paige Townsend, Sophia Yuditskaya, Tanya Talkar, Utkarsh Oggy Sarawgi, Christopher J. Mcdougle, Thomas F. Quatieri, Pattie Maes, Maria Mody. [doi]
- Learning Representation of Therapist Empathy in Counseling Conversation Using Siamese Hierarchical Attention Network. Dehua Tao, Tan Lee 0001, Harold Chui, Sarah Luk. [doi]
- SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASRQiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng. [doi]
- VoxMed: one-step respiratory disease classifier using digital stethoscope soundsParidhi Mundra, Manik Sharma, Yashwardhan Chaudhuri, Orchid Chetia Phukan, Arun Balaji Buduru. [doi]
- Word-level Text Markup for Prosody Control in Speech SynthesisYuliya Korotkova, Ilya Kalinovskiy, Tatiana Vakhrusheva. [doi]
- Diffusion Gaussian Mixture Audio DenoisePu Wang, Junhui Li, Jialu Li, Liangdong Guo, Youshan Zhang. [doi]
- Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASRYerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang. [doi]
- Direct Speech Synthesis from Non-Invasive, Neuromagnetic SignalsJinuk Kwon, David Harwath, Debadatta Dash, Paul Ferrari, Jun Wang 0037. [doi]
- On Calibration of Speech Classification Models: Insights from Energy-Based Model InvestigationsYaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng. [doi]
- Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and EvaluationKe Chen 0021, Jiaqi Su, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Zeyu Jin. [doi]
- Echoes of Implicit Bias: Exploring Aesthetics and Social Meanings of Swiss German Dialect FeaturesTillmann Pistor, Adrian Leemann. [doi]
- Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect IdentificationYaqian Hao, Chenguang Hu, Yingying Gao, Shilei Zhang, Junlan Feng. [doi]
- Automatic pitch accent classification through image classificationNa Hu, Hugo Schnack, Amalia Arvaniti. [doi]
- An inclusive approach to creating a palette of synthetic voices for gender diversityÉva Székely, Maxwell Hope. [doi]
- All Ears: Building Self-Supervised Learning based ASR models for Indian Languages at scaleVasista Sai Lodagala, Abhishek Biswas, Shoutrik Das, Jordan Fernandes, Srinivasan Umesh. [doi]
- Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular SpeechAleksei Gusev, Anastasia Avdeeva. [doi]
- Neural ATSM: Fully Neural Network-based Adaptive Time-Scale Modification Using Sentence-Specific Dynamic ControlJaeuk Lee, Sohee Jang, Joon-Hyuk Chang. [doi]
- VN-SLU: A Vietnamese Spoken Language Understanding DatasetTuyen Tran, Khanh Le, Ngoc Dang Nguyen, Minh Vu, Huyen Ngo, Woomyoung Park, Thi Thu Trang Nguyen. [doi]
- Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language ModelJinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li. [doi]
- Few-Shot Keyword Spotting from Mixed SpeechJunming Yuan, Ying Shi, Lantian Li, Dong Wang, Askar Hamdulla. [doi]
- Mixed Children/Adult/Childrenized Fine-Tuning for Children's ASR: How to Reduce Age Mismatch and Speaking Style MismatchThomas Graave, Zhengyang Li, Timo Lohrenz, Tim Fingscheidt. [doi]
- Sustained Vowels for Pre- vs Post-Treatment COPD ClassificationAndreas Triantafyllopoulos, Anton Batliner, Wolfgang Mayr, Markus Fendler, Florian B. Pokorny, Maurice Gerczuk, Shahin Amiriparian, Thomas M. Berghaus, Björn W. Schuller. [doi]
- A multimodal approach to study the nature of coordinative patterns underlying speech rhythmJinyu Li, Leonardo Lancia. [doi]
- JenGAN: Stacked Shifted Filters in GAN-Based Speech SynthesisHyunjae Cho, Junhyeok Lee, Wonbin Jung. [doi]
- Retrieval-Augmented Classifier Guidance for Audio GenerationHo-Young Choi, Won-Gook Choi, Joon-Hyuk Chang. [doi]
- Spoof Diarization: "What Spoofed When" in Partially Spoofed AudioLin Zhang, Xin Wang, Erica Cooper, Mireia Díez, Federico Landini, Nicholas W. D. Evans, Junichi Yamagishi. [doi]
- Auditory Attention Decoding in Four-Talker Environment with EEGYujie Yan, Xiran Xu, Haolin Zhu, Pei Tian, Zhongshu Ge, Xihong Wu, Jing Chen 0019. [doi]
- Evaluating the Santa Barbara Corpus: Challenges of the Breadth of Conversational Spoken LanguageMatthew Maciejewski, Dominik Klement, Ruizhe Huang, Matthew Wiesner, Sanjeev Khudanpur. [doi]
- MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video DatasetKim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae Hyun Oh. [doi]
- E-Paraformer: A Faster and Better Parallel Transformer for Non-autoregressive End-to-End Mandarin Speech RecognitionKun Zou, Fengyun Tan, Ziyang Zhuang, Chenfeng Miao, Tao Wei, Shaodan Zhai, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao 0006. [doi]
- Automatic Detection of Hearing Loss from Children's Speech using wav2vec 2.0 FeaturesJessica Monaghan, Arun Sebastian, Nicky Chong-White, Vicky Zhang, Vijayalakshmi Easwar, Pádraig Kitterick. [doi]
- Disentangling prosody and timbre embeddings via voice conversionNicolas Gengembre, Olivier Le Blouch, Cédric Gendrot. [doi]
- Homograph Disambiguation with Text-to-Text Transfer TransformerMarkéta Rezácková, Daniel Tihelka, Jindrich Matousek. [doi]
- Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System DesignMing Gao, Hang Chen, Jun Du, Xin Xu, Hongxiao Guo, Hui Bu, Jianxing Yang, Ming Li, Chin-Hui Lee 0001. [doi]
- How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?Tianchi Liu 0004, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li 0001. [doi]
- Improving Speech-Based Dysarthria Detection using Multi-task Learning with Gradient ProjectionYan Xiong, Visar Berisha, Julie Liss, Chaitali Chakrabarti. [doi]
- Variability of speech timing features across repeated recordings: a comparison of open-source extraction techniquesJudith Dineley, Ewan Carr, Lauren L. White, Catriona Lucas, Zahia Rahman, Tian Pan 0004, Faith Matcham, Johnny Downs, Richard J. B. Dobson, Thomas F. Quatieri, Nicholas Cummins. [doi]
- Learning from Back Chunks: Acquiring More Future Knowledge for Streaming ASR Models via Self DistillationYuting Yang, Guodong Ma, Yuke Li, Binbin Du, Haoqi Zhu, Liang Ruan. [doi]
- On the Effectiveness of Acoustic BPE in Decoder-Only TTSBohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen 0001, Kai Yu 0004. [doi]
- Analysis of Pathological Speech - Pitfalls along the WayElmar Nöth. [doi]
- Domain Adaptation for Contrastive Audio-Language ModelsSoham Deshmukh, Rita Singh, Bhiksha Raj. [doi]
- XTTS: a Massively Multilingual Zero-Shot Text-to-Speech ModelEdresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber. [doi]
- Codecfake: An Initial Dataset for Detecting LLM-based Deepfake AudioYi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao 0001, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi. [doi]
- Phonological Symmetry Does Not Predict Generalization of Perceptual Adaptation to VowelsZuheyra Tokac, Jennifer Cole 0001. [doi]
- Improving Domain-Specific ASR with LLM-Generated Contextual DescriptionsJiwon Suh, Injae Na, Woohwan Jung. [doi]
- Speech quality evaluation of neural audio codecsThomas Muller, Stéphane Ragot, Laetitia Gros, Pierrick Philippe, Pascal Scalart. [doi]
- Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding ModelHayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe 0001. [doi]
- A ChatGPT-based oral Q&A practice system for first-time student participants in international conferencesMayuko Aiba, Daisuke Saito, Nobuaki Minematsu. [doi]
- MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in ConversationZilong Huang, Man-Wai Mak, Kong-Aik Lee. [doi]
- Joint prediction of subjective listening effort and speech intelligibility based on end-to-end learningDirk Eike Hoffner, Jana Roßbach, Bernd T. Meyer. [doi]
- Listeners' F0 preferences in quiet and stationary noiseOlympia Simantiraki, Martin Cooke. [doi]
- Multi-mic Echo Cancellation Coalesced with Beamforming for Real World Adverse Acoustic ConditionsPremanand Nayak, Kamini Sabu, M. Ali Basha Shaik. [doi]
- Voiced and voiceless laterals in AngamiViyazonuo Terhiija, Priyankoo Sarmah. [doi]
- Does the Lombard Effect Matter in Speech Separation? Introducing the Lombard-GRID-2mix DatasetIva Ewert, Marvin Borsdorf, Haizhou Li 0001, Tanja Schultz. [doi]
- EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual ScenariosTejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe 0001. [doi]
- Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR ModelsVictor Miara, Théo Lepage, Réda Dehak. [doi]
- A powerful and modern AAC composition tool for impaired speakersAanchan Mohan, Monideep Chakraborti, Katelyn Eng, Nailia Kushaeva, Mirjana Prpa, Jordan Lewis, Tianyi Zhang, Vince Geisler, Carol Geisler. [doi]
- Learnable Layer Selection and Model Fusion for Speech Self-Supervised Learning ModelsSheng-Chieh Chiu, Chia-Hua Wu, Jih-Kang Hsieh, Yu Tsao 0001, Hsin-Min Wang. [doi]
- Prosody of speech production in latent post-stroke aphasiaCong Zhang, Tong Li, Gayle DeDe, Christos Salis. [doi]
- Aerodynamics of Sakata labial-velar oral stopsLorenzo Maselli, Véronique Delvaux. [doi]
- The Difficulty and Importance of Estimating the Lower and Upper Bounds of Infant Speech ExposureJoseph Coffey, Okko Räsänen, Camila Scaff, Alejandrina Cristià. [doi]
- SAMSEMO: New dataset for multilingual and multimodal emotion recognitionPawel Bujnowski, Bartlomiej Kuzma, Bartlomiej Paziewski, Jacek Rutkowski, Joanna Marhula, Zuzanna Bordzicka, Piotr Andruszkiewicz. [doi]
- Applying Reinforcement Learning and Multi-Generators for Stage Transition in an Emotional Support Dialogue SystemJeremy Chang, Kuan-Yu Chen, Chung-Hsien Wu 0001. [doi]
- Frontier of Frontend for Conversational Speech ProcessingShoko Araki. [doi]
- ATTEST: an analytics tool for the testing and evaluation of speech technologiesDmitrii Obukhov, Marcel de Korte, Andrey Adaschik. [doi]
- Accent Conversion with Articulatory RepresentationsYashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Y. Espy-Wilson, Andrea Fanelli. [doi]
- Automated Human-Readable Label Generation in Open Intent DiscoveryGrant Anderson, Emma Hart, Dimitra Gkatzia, Ian Beaver. [doi]
- Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain DiscriminatorJaewon Kim, Won-Gook Choi, Seyun Ahn, Joon-Hyuk Chang. [doi]
- H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-SpeechDonghyun Seong, Joon-Hyuk Chang. [doi]
- RepTor: Re-parameterizable Temporal Convolution for Keyword Spotting via Differentiable Kernel SearchEunik Park, Daehyun Ahn, HyungJun Kim. [doi]
- Do we EXPECT TO find phonetic traces for syntactic traces?Jonathan Him Nok Lee, Mark Liberman, Martin Salzmann. [doi]
- Analysis and Visualization of Directional Diversity in Listening Fluency of World Englishes Speakers in the Framework of Mutual ShadowingYu Tomita, Yingxiang Gao, Nobuaki Minematsu, Noriko Nakanishi, Daisuke Saito. [doi]
- A Cluster-based Personalized Federated Learning Strategy for End-to-End ASR of Dementia PatientsWei-Tung Hsu, Chin-Po Chen, Yun-Shao Lin, Chi-Chun Lee. [doi]
- Performant ASR Models for Medical Entities in Accented SpeechTejumade Afonja, Tobi Olatunji, Sewade Ogun, Naome A. Etori, Abraham Toluwase Owodunni, Moshood Yekini. [doi]
- Multi-modal Adversarial Training for Zero-Shot Voice CloningJohn Janiczek, Dading Chong, Dongyang Dai, Arlo Faria, Chao Wang, Tao Wang, Yuzong Liu. [doi]
- Controlling Emotion in Text-to-Speech with Natural Language PromptsThomas Bott, Florian Lux, Ngoc Thang Vu. [doi]
- Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-SynthesisChin-Yun Yu, György Fazekas. [doi]
- Modality Translation Learning for Joint Speech-Text ModelPin-Yen Liu, Jen-Tzung Chien. [doi]
- Streaming Audio Transformers for Online Audio TaggingHeinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang. [doi]
- Personalized Speech Enhancement Without a Separate Speaker Embedding ModelTanel Pärnamaa, Ando Saabas. [doi]
- Reliable dialogue system for facilitating student-counselor communicationMahdin Rohmatillah, Bryan Gautama Ngo, Willianto Sulaiman, Po-Chuan Chen, Jen-Tzung Chien. [doi]
- CDSD: Chinese Dysarthria Speech DatabaseYan Wan, Mengyi Sun, Xinchen Kang, Jingting Li 0001, Pengfei Guo, Ming Gao, Su-Jing Wang. [doi]
- Attention-augmented X-vectors for the Evaluation of Mimicked Speech Using Sparse Autoencoder-LSTM frameworkBhasi K. C., Rajeev Rajan, Noumida Abdul Kareem. [doi]
- Using Large Language Model for End-to-End Chinese ASR and NERYuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang. [doi]
- Real-time scheme for rapid extraction of speaker embeddings in challenging recording conditionsKai Liu, Ziqing Du, Huan Zhou 0008, Xucheng Wan, Naijun Zheng. [doi]
- Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech DetectionShruti Palaskar, Ognjen Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed H. Tewfik. [doi]
- Exploring Syllable Discriminability during Diadochokinetic Task with Increasing Dysarthria Severity for Patients with Amyotrophic Lateral SclerosisNeelesh Samptur, Tanuka Bhattacharjee, Anirudh Chakravarty K, Seena Vengalil, Yamini Belur, Atchayaram Nalini, Prasanta Kumar Ghosh. [doi]
- RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech EnhancementHonglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic. [doi]
- Beam-search SIEVE for low-memory speech recognitionMartino Ciaperoni, Athanasios Katsamanis, Aristides Gionis, Panagiotis Karras. [doi]
- Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error ClassificationsKorbinian Kuhn, Verena Kersken, Gottfried Zimmermann. [doi]
- Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and RecognitionGuinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu. [doi]
- Exploring Self-Supervised Speech Representations for Cross-lingual Acoustic-to-Articulatory InversionYun Hao, Reihaneh Amooie, Wietse de Vries, Thomas Tienkamp, Rik van Noord, Martijn Wieling 0001. [doi]
- Prompt Link Multimodal Fusion in Multimodal Sentiment AnalysisKang Zhu, Cunhang Fan, Jianhua Tao 0001, Zhao Lv. [doi]
- Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training StrategyLinhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie. [doi]
- Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World EffectivenessSai Srujana Buddi, Satyam Kumar 0001, Utkarsh Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya. [doi]
- Contextual Biasing with the Knuth-Morris-Pratt Matching AlgorithmWeiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Yanzhang He, Pedro Moreno Mengibar. [doi]
- Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation ModelsDominik Wagner 0002, Ilja Baumann, Korbinian Riedhammer, Tobias Bocklet. [doi]
- A Comprehensive Investigation on Speaker Augmentation for Speaker RecognitionZhenyu Zhou, Shibiao Xu, Shi Yin, Lantian Li, Dong Wang. [doi]
- Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic UnitsBolaji Yusuf, Jan Honza Cernocký, Murat Saraçlar. [doi]
- EZTalking: English assessment platform for teachers and studentsYu-Sheng Tsao, Yung-Chang Hsu, Jiun-Ting Li, Siang-Hong Weng, Tien-Hong Lo, Berlin Chen. [doi]
- HebDB: a Weakly Supervised Dataset for Hebrew Speech ProcessingArnon Turetzky, Or Tal, Yael Segal, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya Roni Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi. [doi]
- mHuBERT-147: A Compact Multilingual HuBERT ModelMarcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu. [doi]
- Perception of music and speech: Focus on rhythm processingBarbara Tillmann. [doi]
- Robust Laughter Segmentation with Automatic Diverse Data SynthesisTaisei Omine, Kenta Akita, Reiji Tsuruno. [doi]
- Predicting Acute Pain Levels Implicitly from Vocal FeaturesJennifer Williams 0001, Eike Schneiders, Henry Card, Tina Seabrooke, Beatrice Pakenham-Walsh, Tayyaba Azim, Lucy Valls-Reed, Ganesh Vigneswaran, John Robert Bautista, Rohan Chandra, Arya Farahi. [doi]
- Dual-Constrained Dynamical Neural ODEs for Ambiguity-aware Continuous Emotion PredictionJingyao Wu 0002, Ting Dang, Vidhyasaharan Sethu, Eliathamby Ambikairajah. [doi]
- W-GVKT: Within-Global-View Knowledge Transfer for Speaker VerificationZezhong Jin, Youzhi Tu, Man-Wai Mak. [doi]
- Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech dataYuma Shirahata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana. [doi]
- A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language UnderstandingGaëlle Laperrière, Sahar Ghannay, Bassam Jabaian, Yannick Estève. [doi]
- On The Performance of EMA-synchronized Speech and Stand-alone Speech in Acoustic-to-articulatory InversionQiang Fang. [doi]
- AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detectionRong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia-bin, Ming Li. [doi]
- Revisiting Convolution-free Transformer for Speech RecognitionZejiang Hou, Goeric Huybrechts, Anshu Bhatia, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff. [doi]
- RAST: A Reference-Audio Synchronization Tool for Dubbed ContentDavid Meyer, Eitan Abecassis, Clara Fernandez-Labrador, Christopher Schroers. [doi]
- Noise-aware Speech Enhancement using Diffusion Probabilistic ModelYuchen Hu, Chen Chen 0075, Ruizhe Li, Qiushi Zhu, Eng Siong Chng. [doi]
- PERSONA: an application for emotion recognition, gender recognition and age estimationDevyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic ControlAlexander Blatt, Aravind Krishnan, Dietrich Klakow. [doi]
- SDAEC: Signal Decoupling for Advancing Acoustic Echo CancellationFei Zhao, Jinjiang Liu, Xueliang Zhang. [doi]
- Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal RepresentationYifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou. [doi]
- Utilizing Adaptive Global Response Normalization and Cluster-Based Pseudo Labels for Zero-Shot Voice ConversionJi Sub Um, Hoirin Kim. [doi]
- DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of SpeechFredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee. [doi]
- A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic LanguagesNikhil Jakhar, Sudhanshu Srivastava, Arun Baby. [doi]
- Stress transfer in speech-to-speech machine translationSai Akarsh C, Vamshiraghusimha Narasinga, Anil Kumar Vuppala. [doi]
- The Processing of Stress in End-to-End Automatic Speech Recognition ModelsMartijn Bentum, Louis ten Bosch, Tom Lentz. [doi]
- Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank ThresholdingTakafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami. [doi]
- Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing DetectionZihan Pan, Tianchi Liu 0004, Hardik B. Sailor, Qiongqiong Wang. [doi]
- SRC4VC: Smartphone-Recorded Corpus for Voice Conversion BenchmarkYuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari. [doi]
- Improving Speech Recognition with Prompt-based Contextualized ASR and LLM-based Re-predictorNguyen Manh Tien Anh, Thach Ho Sy. [doi]
- Acoustic changes in speech prosody produced by children with autism after robot-assisted speech trainingSi Chen, Bruce Xiao Wang, Yitian Hong, Fang Zhou, Angel Chan, Po-yi Tang, Bin Li, Chunyi Wen, James Cheung, Yan Liu, Zhuoming Chen. [doi]
- Guided conditioning with predictive network on score-based diffusion model for speech enhancementDail Kim, Da-Hee Yang, Donghyun Kim, Joon-Hyuk Chang, Jeonghwan Choi, Moa Lee, Jaemo Yang, Han-gil Moon. [doi]
- Improving Multilingual ASR Robustness to Errors in Language InputBrady Houston, Omid Sadjadi, Zejiang Hou, Srikanth Vishnubhotla, Kyu J. Han. [doi]
- A Multimodal Framework for the Assessment of the Schizophrenia SpectrumGowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Y. Espy-Wilson. [doi]
- BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform GenerationHui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling. [doi]
- Voice Quality Variation in AAE: An Additional Challenge for Addressing Bias in ASR Models?Li-Fang Lai, Nicole R. Holliday. [doi]
- Embedding Learning for Preference-based Speech Quality AssessmentCheng-Hung Hu, Yusuke Yasuda, Tomoki Toda. [doi]
- Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword RecognitionHao Yen, Pin-Jui Ku, Sabato Marco Siniscalchi, Chin-Hui Lee 0001. [doi]
- Translating speech with just imagesDan Oneata, Herman Kamper. [doi]
- Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event DetectionHyeonuk Nam, Seong-Hu Kim, Deokki Min, Junhyeok Lee, Yong-Hwa Park. [doi]
- Influences of Morphosyntax and Semantics on the Intonation of Mandarin Chinese Wh-indeterminatesHongchen Wu, Jiwon Yun. [doi]
- Visualization for improving foreign language pronunciationCharlotte Yoder, Karrie Karahalios, Mark Hasegawa-Johnson, Shreyansh Agrawal. [doi]
- How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech?Prad Kadambi, Tristan J. Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine C. Hustad, Visar Berisha. [doi]
- On Comparing Time- and Frequency-Domain Rhythm Measures in Classifying Assamese DialectsJoyshree Chakraborty, Leena Dihingia, Priyankoo Sarmah, Rohit Sinha 0003. [doi]
- Unified Multi-Talker ASR with and without Target-speaker EnrollmentRyo Masumura, Naoki Makishima, Tomohiro Tanaka, Mana Ihori, Naotaka Kawata, Shota Orihashi, Kazutoshi Shinoda, Taiga Yamane, Saki Mizuno, Keita Suzuki, Satoshi Suzuki, Nobukatsu Hojo, Takafumi Moriya, Atsushi Ando. [doi]
- Speaker Detection by the Individual Listener and the Crowd: Parametric Models Applicable to Bonafide and Deepfake SpeechTomi H. Kinnunen, Rosa González Hautamäki, Xin Wang, Junichi Yamagishi. [doi]
- VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic FeaturesTomoki Koriyama. [doi]
- One-class learning with adaptive centroid shift for audio deepfake detectionHyun-Myung Kim, Kangwook Jang, Hoirin Kim. [doi]
- Pragmatically similar utterance finder demonstrationNigel G. Ward, Andres Segura. [doi]
- In search of structure and correspondence in intra-speaker trial-to-trial variabilityVivian Guo Li. [doi]
- Edged based audio-visual speech enhancement demonstratorSong Chen 0005, Mandar Gogate, Kia Dashtipour, Jasper Kirton-Wingate, Adeel Hussain, Faiyaz Doctor, Tughrul Arslan, Amir Hussain 0001. [doi]
- Multilingual Speech and Language Analysis for the Assessment of Mild Cognitive Impairment: Outcomes from the Taukadial ChallengePaula Andrea Pérez-Toro, Tomás Arias-Vergara, Philipp Klumpp, Tobias Weise, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier. [doi]
- TfCleanformer: A streaming, array-agnostic, full- and sub-band modeling front-end for robust ASRJens Heitkaemper, Joe Caroselli, Arun Narayanan, Nathan Howard. [doi]
- Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion StrategyYuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonan Cheng, Long Ye, Jianhua Tao 0001. [doi]
- MultiStage Speech Bandwidth Extension with Flexible Sampling Rate ControlYe-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling. [doi]
- Bilingual and Code-switching TTS Enhanced with Denoising Diffusion Model and GANHuai-Zhe Yang, Chia-Ping Chen, Shan-Yun He, Cheng-Ruei Li. [doi]
- Synthesizing Long-Form Speech merely from Sentence-Level Corpus with Content Extrapolation and LLM Contextual EnrichmentShijie Lai, Minglu He, Zijing Zhao 0008, Kai Wang, Hao Huang, Jichen Yang. [doi]
- LiteFocus: Accelerated Diffusion Inference for Long Audio SynthesisZhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang. [doi]
- Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and TechnologyRobin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli. [doi]
- AR-NLU: A Framework for Enhancing Natural Language Understanding Model Robustness against ASR ErrorsEmmy Phung, Harsh Deshpande, Ahmad Emami, Kanishk Singh. [doi]
- Improving Multilingual Text-to-Speech with Mixture-of-Language-Experts and Accent DisentanglementJing Wu, Ting Chen, Minchuan Chen, Wei Hu, Shaojun Wang, Jing Xiao. [doi]
- Decoder-only Architecture for Streaming End-to-end Speech RecognitionEmiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe 0001. [doi]
- Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation ModelsRuchao Fan, Natarajan Balaji Shankar, Abeer Alwan. [doi]
- Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum DisorderJihyun Mun, SunHee Kim, Minhwa Chung. [doi]
- Predefined Prototypes for Intra-Class Separation and DisentanglementAntonio Almudévar, Théo Mariotte, Alfonso Ortega Giménez, Marie Tahon, Luis Vicente, Antonio Miguel, Eduardo Lleida. [doi]
- VoiCor: A Residual Iterative Voice Correction Framework for Monaural Speech EnhancementRui Cao, Tianrui Wang, Meng Ge, Andong Li, Longbiao Wang, Jianwu Dang 0001, Yungang Jia. [doi]
- Self-supervised Speech Representations Still Struggle with African American Vernacular EnglishKalvin Chang, Yi-Hui Chou, Jiatong Shi, Hsuan-Ming Chen, Nicole Holliday, Odette Scharenborg, David R. Mortensen. [doi]
- A Low-Bitrate Neural Audio Codec Framework with Bandwidth Reduction and Recovery for High-Sampling-Rate WaveformsYang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, Zhen-Hua Ling. [doi]
- CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis SystemsHaibin Wu, Yuan Tseng, Hung-yi Lee. [doi]
- Fine-tuning of Pre-trained Models for Classification of Vocal Intensity Category from Speech SignalsManila Kodali, Sudarsana Reddy Kadiri, Paavo Alku. [doi]
- Sub-PNWR: Speech Enhancement Based on Signal Sub-Band Splitting and Pseudo Noisy Waveform Reconstruction LossYuewei Zhang, Huanbin Zou, Jie Zhu. [doi]
- ASTRA: Aligning Speech and Text Representations for ASR without SamplingNeeraj Gaur, Rohan Agrawal, Gary Wang, Parisa Haghani, Andrew Rosenberg, Bhuvana Ramabhadran. [doi]
- Challenges of German Speech Recognition: A Study on Multi-ethnolectal Speech Among AdolescentsMartha Schubert, Daniel Duran 0001, Ingo Siegert. [doi]
- Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of AdaptersUmberto Cappellazzo, Daniele Falavigna, Alessio Brutti. [doi]
- An Analysis of the Variance of Diffusion-based Speech EnhancementBunlong Lay, Timo Gerkmann. [doi]
- Training speech-breathing coordination in computer-assisted readingDelphine Charuau, Andrea Briglia, Erika Godde, Gérard Bailly. [doi]
- Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation AnalysisValentin Pelloin, Lena Dodson, Émile Chapuis, Nicolas Hervé, David Doukhan. [doi]
- On Improving Error Resilience of Neural End-to-End Speech CodersKishan Gupta, Nicola Pia, Srikanth Korse, Andreas Brendel, Guillaume Fuchs, Markus Multrus. [doi]
- Frequency-mix Knowledge Distillation for Fake Speech DetectionCunhang Fan, Shunbo Dong, Jun Xue, Yujie Chen, Jiangyan Yi, Zhao Lv. [doi]
- NumberLie: a game-based experiment to understand the acoustics of deception and truthfulnessAlessandro De Luca, Andrew Clark, Volker Dellwo. [doi]
- Multi-label Bird Species Classification from Field Recordings using Mel_Graph-GCN FrameworkNoumida Abdul Kareem, Rajeev Rajan. [doi]
- Keyword-Guided Adaptation of Automatic Speech RecognitionAviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet. [doi]
- Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword SpottingShuai Wang, Dehao Zhang, Kexin Shi, Yuchen Wang, Wenjie Wei, Jibin Wu, Malu Zhang. [doi]
- PAM: Prompting Audio-Language Models for Audio Quality AssessmentSoham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang. [doi]
- Design of Feedback Active Noise Cancellation Filter Using Nested Recurrent Neural NetworksAlireza Bayestehtashk, Amit Kumar, Mike Wurtz. [doi]
- Leveraging Large Language Models to Refine Automatic Feedback Generation at Articulatory Level in Computer Aided Pronunciation TrainingHuihang Zhong, Yanlu Xie, ZiJin Yao. [doi]
- Gender and age based f0-variation in the German Plapper CorpusMelanie Weirich, Daniel Duran 0001, Stefanie Jannedy. [doi]
- Improving child speech recognition with augmented child-like speechYuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg. [doi]
- Optimizing Automatic Speech Assessment: W-RankSim Regularization and Hybrid Feature Fusion StrategiesChung-Wen Wu, Berlin Chen. [doi]
- Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPUDaniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey. [doi]
- How Consistent are Speech-Based Biomarkers in Remote Tracking of ALS Disease Progression Across Languages? A Case Study of English and DutchHardik Kothare, Michael Neumann, Cathy Zhang, Jackson Liscombe, Jordi W. J. van Unnik, Lianne C. M. Botman, Leonard H. van den Berg, Ruben P. A. van Eijk, Vikram Ramanarayanan. [doi]
- Spoken Word2Vec: Learning Skipgram Embeddings from SpeechMohammad Amaan Sayeed, Hanan Aldarmaki. [doi]
- ED-sKWS: Early-Decision Spiking Neural Networks for Rapid, and Energy-Efficient Keyword SpottingZeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li 0001. [doi]
- Auditory Spatial Attention Detection Based on Feature Disentanglement and Brain Connectivity-Informed Graph Neural NetworksYixiang Niu, Ning Chen 0007, Hongqing Zhu, Zhiying Zhu 0001, Guangqiang Li, Yibo Chen. [doi]
- Gender and Language Identification in Multilingual Models of Speech: Exploring the Genericity and Robustness of Speech RepresentationsSéverine Guillaume, Maxime Fily, Alexis Michaud, Guillaume Wisniewski. [doi]
- FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody ConsistencyRui Liu 0008, Jiatian Xi, Ziyue Jiang 0001, Haizhou Li 0001. [doi]
- SecureSpectra: Safeguarding Digital Identity from Deep Fake Threats via Intelligent SignaturesOguzhan Baser, Kaan Kale, Sandeep P. Chinchali. [doi]
- Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language ModelsSheng Feng, Heyang Liu, Yu Wang 0002, Yanfeng Wang 0001. [doi]
- Confidence Estimation for Automatic Detection of Depression and Alzheimer's Disease Based on Clinical InterviewsWen Wu 0007, Chao Zhang 0031, Philip C. Woodland. [doi]
- Bridging Language Gaps in Audio-Text RetrievalZhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang. [doi]
- Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core ClippingLun Wang 0001, Om Thakkar 0001, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan. [doi]
- Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speechPan-Pan Jiang, Jimmy Tobin, Katrin Tomanek, Robert L. MacDonald, Katie Seaver, Richard Cave, Marilyn A. Ladewig, Rus Heywood, Jordan R. Green. [doi]
- To what extent can ASV systems naturally defend against spoofing attacks?Jee-weon Jung, Xin Wang 0037, Nicholas W. D. Evans, Shinji Watanabe 0001, Hye-jin Shim, Hemlata Tak, Siddhant Arora, Junichi Yamagishi, Joon Son Chung. [doi]
- Boosting the Transferability of Adversarial Examples with Gradient-Aligned Ensemble Attack for Speaker RecognitionZhuhai Li, Jie Zhang, Wu Guo, Haochen Wu. [doi]
- Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech SystemsZhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian. [doi]
- Children's Speech Recognition through Discrete Token EnhancementVrunda N. Sukhadia, Shammur Absar Chowdhury. [doi]
- TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech SynthesizersYakun Song, Zhuo Chen 0006, Xiaofei Wang, Ziyang Ma, Guanrou Yang, Xie Chen 0001. [doi]
- Improved Factorized Neural Transducer Model For Text-only Domain AdaptationJunzhe Liu, Jianwei Yu 0001, Xie Chen 0001. [doi]
- Acoustic Effects of Facial Feminisation Surgery on Speech and Singing: A Case StudyCliodhna Hughes, Guy Brown, Ning Ma, Nicola Dibben. [doi]
- On the calibration of powerset speaker diarization modelsAlexis Plaquet, Hervé Bredin. [doi]
- MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning ModelJiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe 0001. [doi]
- A Parameter-efficient Language Extension Framework for Multilingual ASRWei Liu 0147, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee 0001. [doi]
- ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in MeetingsThéo Mariotte, Anthony Larcher, Silvio Montrésor, Jean-Hugh Thomas. [doi]
- Domain-Aware Data Selection for Speech Classification via Meta-ReweightingJunghun Kim, Ka-Hyun Park, Hoyoung Yoon, U Kang. [doi]
- An Investigation of Group versus Individual Fairness in Perceptually Fair Speech Emotion RecognitionWoan-Shiuan Chien, Chi-Chun Lee. [doi]
- Articulatory Configurations across Genders and Periods in French Radio and TV archivesBenjamin Elie, David Doukhan, Rémi Uro, Lucas Ondel Yang, Albert Rilliard, Simon Devauchelle. [doi]
- Efficient Speaker Embedding Extraction Using a Twofold Sliding Window Algorithm for Speaker DiarizationJeong Hwan Choi, Ye-Rin Jeoung, Ilseok Kim, Joon-Hyuk Chang. [doi]
- LUPET: Incorporating Hierarchical Information Path into Multilingual ASRWei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee 0001. [doi]
- DropFormer: A Dynamic Noise-Dropping Transformer for Speech Emotion RecognitionJialong Mai, Xiaofen Xing, Weidong Chen, Xiangmin Xu. [doi]
- Disentangled Representation Learning for Environment-agnostic Speaker RecognitionKihyun Nam, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung. [doi]
- A Study on the Information Mechanism of the 3rd Tone Sandhi Rule in Mandarin Disyllabic WordsXiaowang Liu, Jinsong Zhang. [doi]
- PLDNet: PLD-Guided Lightweight Deep Network Boosted by Efficient Attention for Handheld Dual-Microphone Speech EnhancementNan Zhou, Youhai Jiang, Jialin Tan, Chongmin Qi. [doi]
- Optical Flow Guided Tongue Trajectory Generation for Diffusion-based Acoustic to Articulatory InversionYudong Yang, Rongfeng Su, Rukiye Ruzi, Manwa L. Ng, Shaofeng Zhao, Nan Yan, Lan Wang. [doi]
- USM RNN-T model weights binarizationOleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng. [doi]
- VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification BenchmarkYuke Lin, Ming Cheng 0005, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li. [doi]
- Evaluating a 3-factor listener model for prediction of speech intelligibility to hearing-impaired listenersMark A. Huckvale, Gaston Hilkhuysen. [doi]
- Revealing Confounding Biases: A Novel Benchmarking Approach for Aggregate-Level Performance Metrics in Health AssessmentsStefano Goria, Roseline Polle, Salvatore Fara, Nicholas Cummins. [doi]
- Spatial Acoustic Enhancement Using Unbiased Relative Harmonic CoefficientsLiang Tao, Maoshen Jia, Yonggang Hu, Changchun Bao. [doi]
- Dynamic Data Pruning for Automatic Speech RecognitionQiao Xiao, Pingchuan Ma 0001, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin 0006, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu 0003. [doi]
- Pitch-driven adjustments in tongue positions: Insights from ultrasound imagingMay Pik Yu Chan, Jianjing Kuang. [doi]
- Sign Value Constraint Decomposition for Efficient 1-Bit Quantization of Speech Translation TasksNan Chen, Yonghe Wang, Feilong Bao. [doi]
- OCEAN-AI: open multimodal framework for personality traits assessment and HR-processes automatizationElena Ryumina, Dmitry Ryumin, Alexey Karpov 0001. [doi]
- Low-dimensional Style Token Control for Hyperarticulated Speech SynthesisMiku Nishihara, Dan Wells, Korin Richmond, Aidan Pine. [doi]
- Collaborative Contrastive Learning for Hypothesis Domain AdaptationJen-Tzung Chien, I-Ping Yeh, Man-Wai Mak. [doi]
- Audio Enhancement from Multiple Crowdsourced Recordings: A Simple and Effective BaselineShiran Aziz, Yossi Adi, Shmuel Peleg. [doi]
- Preliminary Investigation of Psychometric Properties of a Novel Multimodal Dialog Based Affect Production Task in Children and Adolescents with AutismCarly Demopoulos, Linnea Lampinen, Cristian Preciado, Hardik Kothare, Vikram Ramanarayanan. [doi]
- Exploiting Foundation Models and Speech Enhancement for Parkinson's Disease Detection from Speech in Real-World Operative ConditionsMoreno La Quatra, Maria Francesca Turco, Torbjørn Svendsen, Giampiero Salvi, Juan Rafael Orozco-Arroyave, Sabato Marco Siniscalchi. [doi]
- Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic AlignmentPaarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Boris Ginsburg. [doi]
- Bridging Child-Centered Speech Language Identification and Language Diarization via PhoneticsYujia Wang, Hexin Liu, Leibny Paola Garcia. [doi]
- Towards generalisable and calibrated audio deepfake detection with self-supervised representationsOctavian Pascu, Adriana Stan, Dan Oneata, Elisabeta Oneata, Horia Cucu. [doi]
- Backchannel prediction, based on who, when and whatYo-Han Park, Wencke Liermann, Yong-Seok Choi, Seung Hi Kim, Jeong-Uk Bang, Seung Yun, Kong-Joo Lee. [doi]
- Enabling Conversational Speech Synthesis using Noisy Spontaneous DataLiisa Rätsep, Rasmus Lellep, Mark Fishel. [doi]
- Evaluating Transformer-Enhanced Deep Reinforcement Learning for Speech Emotion RecognitionSiddique Latif, Raja Jurdak, Björn W. Schuller. [doi]
- Towards realtime co-speech gestures synthesis using STARGATELouis Abel, Vincent Colotte, Slim Ouni. [doi]
- WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker ExtractionShuai Wang, Ke Zhang, Shaoxiong Lin, Junjie Li, Xuefei Wang, Meng Ge, Jianwei Yu 0001, Yanmin Qian, Haizhou Li 0001. [doi]
- Meta Learning Text-to-Speech Synthesis in over 7000 LanguagesFlorian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu. [doi]
- DreamVoice: Text-Guided Voice ConversionJiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali. [doi]
- AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the WildYongkang Yin, Xu Li 0015, Ying Shan, Yuexian Zou. [doi]
- Exploiting Wavelet Scattering Transform for an Unsupervised Speaker Diarization in Deep Neural Network FrameworkArunav Arya, Murtiza Ali, Karan Nathwani. [doi]
- AVR: synergizing foundation models for audio-visual humor detectionSarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma 0002. [doi]
- Depression Enhances Internal Inconsistency between Spoken and Semantic Emotion: Evidence from the Analysis of Emotion Expression in ConversationXinyi Wu, Changqing Xu, Nan Li, Rongfeng Su, Lan Wang, Nan Yan. [doi]
- Guiding Frame-Level CTC Alignments Using Self-knowledge DistillationEungbeom Kim, Hantae Kim, Kyogu Lee. [doi]
- Dataset-Distillation Generative Model for Speech Emotion RecognitionFabian Ritter Gutierrez, Kuan-Po Huang, Jeremy H. M. Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng. [doi]
- HypR: A comprehensive study for ASR hypothesis revising with a reference corpusYiwei Wang, Ke-Han Lu, Kuan-Yu Chen 0001. [doi]
- Adversarial Robustness Analysis in Automatic Pathological Speech Detection ApproachesMahdi Amiri, Ina Kodrasi. [doi]
- LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker DiarizationZengrui Jin, Yifan Yang, Mohan Shi, Wei Kang 0006, Xiaoyu Yang 0005, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu 0004, Shi-Xiong Zhang 0001, Daniel Povey. [doi]
- Modelled Multivariate Overlap: A method for measuring vowel mergerIrene Smith, Morgan Sonderegger, The Spade Consortium. [doi]
- Online Subloop Search via Uncertainty Quantization for Efficient Test-Time AdaptationJae Hong Lee, Sang-Eon Lee, Dong-hyun Kim, Do-Hee Kim, Joon-Hyuk Chang. [doi]
- LLM-Driven Multimodal Opinion Expression IdentificationBonian Jia, Huiyao Chen, Yueheng Sun, Meishan Zhang, Min Zhang. [doi]
- Urdu Alternative Questions: A Hat PatternBenazir Mumtaz, Miriam Butt. [doi]
- Speech Topic Classification Based on Multi-Scale and Graph Attention NetworksFangjing Niu, Xiaozhe Qi, Xinya Chen, Liang He 0003. [doi]
- SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation MetricsTakaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe 0001, Hiroshi Saruwatari. [doi]
- CEC: A Noisy Label Detection Method for Speaker RecognitionYao Shen, Yingying Gao, Yaqian Hao, Chenguang Hu, Fulin Zhang, Junlan Feng, Shilei Zhang. [doi]
- Multi-latency look-ahead for streaming speaker segmentationBilal Rahou, Hervé Bredin. [doi]
- Lightweight Transducer Based on Frame-Level CriterionGenshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye. [doi]
- Exploring the Capability of Mamba in Speech ApplicationsKoichi Miyazaki, Yoshiki Masuyama, Masato Murata. [doi]
- QGAN: Low Footprint Quaternion Neural Vocoder for Speech SynthesisAryan Chaudhary, Vinayak Abrol. [doi]
- Wav2vec 2.0 Embeddings Are No Swiss Army Knife - A Case Study for Multiple SclerosisGábor Gosztolya, Mercedes Vetráb, Veronika Svindt, Judit Bóna, Ildikó Hoffmann. [doi]
- Investigating self-supervised speech models' ability to classify animal vocalizations: The case of gibbon's vocal signaturesJules Cauzinille, Benoît Favre, Ricard Marxer, Dena J. Clink, Abdul Hamid Ahmad, Arnaud Rey. [doi]
- A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous SpeechLoukas Ilias, Dimitris Askounis. [doi]
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and DereverberationJulius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe 0001, Alexander Richard, Timo Gerkmann. [doi]
- Enhancing Non-Matching Reference Speech Quality Assessment through Dynamic Weight AdaptationBao Thang Ta, Van Hai Do, Huynh Thi Thanh Binh. [doi]
- Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0Marianne de Heer Kloots, Willem H. Zuidema. [doi]
- Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlationPaige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim. [doi]
- Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and TranslationAndrew Rouditchenko, Yuan Gong 0001, Samuel Thomas 0001, Leonid Karlinsky, Hilde Kuehne, Rogério Feris, James Glass. [doi]
- Cross-transfer Knowledge between Speech and Text Encoders to Evaluate Customer SatisfactionLuis Felipe Parra-Gallego, Tilak Purohit, Bogdan Vlasenko, Juan Rafael Orozco-Arroyave, Mathew Magimai-Doss. [doi]
- Experimental evaluation of MOS, AB and BWS listening test designsDan Wells, Andrea Lorena Aldana Blanco, Cassia Valentini-Botinhao, Erica Cooper, Aidan Pine, Junichi Yamagishi, Korin Richmond. [doi]
- PARAN: Variational Autoencoder-based End-to-End Articulation-to-Speech System for Speech IntelligibilitySeyun Um, Doyeon Kim, Hong-Goo Kang. [doi]
- Automatic Speech Recognition with parallel L1 and L2 acoustic phone models to evaluate /l/ allophony in L2 English speech productionAnisia Popescu, Lori Lamel, Ioana Vasilescu, Laurence Devillers. [doi]
- LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous SpeechHaechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim. [doi]
- Multimodal Belief PredictionJohn Murzaku, Adil Soubki, Owen Rambow. [doi]
- Custom wake word detectionKesavaraj V, Charan Devarkonda, Vamshiraghusimha Narasinga, Anil Kumar Vuppala. [doi]
- BESST Dataset: A Multimodal Resource for Speech-based Stress Detection and AnalysisJan Pesán, Vojtech Jurík, Martin Karafiát, Jan Cernocký. [doi]
- FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based VocoderRubing Shen, Yanzhen Ren, Zongkun Sun. [doi]
- Personality-memory Gated Adaptation: An Efficient Speaker Adaptation for Personalized End-to-end Automatic Speech RecognitionYue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han 0001, Yongjun He. [doi]
- Few-Shot Keyword-Incremental Learning with Total CalibrationIlseok Kim, Ju-Seok Seong, Joon-Hyuk Chang. [doi]
- Confidence-aware Hypothesis Transfer Networks for Source-Free Cross-Corpus Speech Emotion RecognitionJincen Wang, Yan Zhao, Cheng Lu, Hailun Lian, Hongli Chang, Yuan Zong, Wenming Zheng. [doi]
- Anti-spoofing Ensembling Model: Dynamic Weight Allocation in Ensemble Models for Improved Voice Biometrics SecurityEros Rosello, Angel M. Gomez, Iván López-Espejo, Antonio M. Peinado, Juan M. Martín-Doñas. [doi]
- 1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesisSewade Ogun, Abraham Toluwase Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin P. Adewumi. [doi]
- EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and BenchmarkZiyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen 0001, Thomas Hain. [doi]
- Thunder: Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian BridgeThanapat Trachu, Chawan Piansaddhayanon, Ekapol Chuangsuwanich. [doi]
- NAST: Noise Aware Speech Tokenization for Speech Language ModelsShoval Messica, Yossi Adi. [doi]