- SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech. Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo. 1-5
- Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch. Hanbin Bae, Young-Sun Joo. 6-10
- Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings. Martin Lenglet, Olivier Perrotin, Gérard Bailly. 11-15
- TriniTTS: Pitch-controllable End-to-end TTS without External Aligner. Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, Shinji Watanabe 0001. 16-20
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech. Dan Lim, Sunghee Jung, Eesung Kim. 21-25
- Interpretable dysarthric speaker adaptation based on optimal-transport. Rosanna Turrisi, Leonardo Badino. 26-30
- Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs. Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic. 31-35
- The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition. Luke Prananta, Bence Mark Halpern, Siyuan Feng 0001, Odette Scharenborg. 36-40
- Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition. Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda. 41-45
- Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation. Chitralekha Bhat, Ashish Panda, Helmer Strik. 46-50
- Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition. Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas K. Maier, Seung-Hee Yang. 51-55
- Regularizing Transformer-based Acoustic Models by Penalizing Attention Weights. Mun-Hak Lee, Joon-Hyuk Chang, Sang-Eon Lee, Ju-Seok Seong, Chanhee Park, Haeyoung Kwon. 56-60
- Content-Context Factorized Representations for Automated Speech Recognition. David M. Chan, Shalini Ghosh. 61-65
- Comparison and Analysis of New Curriculum Criteria for End-to-End ASR. Georgios Karakasidis, Tamás Grósz, Mikko Kurimo. 66-70
- Incremental learning for RNN-Transducer based speech recognition models. Deepak Baby, Pasquale D'Alterio, Valentin Mendelev. 71-75
- Production federated keyword spotting via distillation, filtering, and joint federated-centralized training. Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun-Jin Park, Alex Park 0001, Sara Ng, Jessica Nguyen, Ignacio Lopez-Moreno, Rajiv Mathews, Françoise Beaufays. 76-80
- Use of prosodic and lexical cues for disambiguating wh-words in Korean. Jieun Song, Hae-Sung Jeon, Jieun Kiaer. 81-85
- Autoencoder-Based Tongue Shape Estimation During Continuous Speech. Vinicius Ribeiro, Yves Laprie. 86-90
- Phonetic erosion and information structure in function words: the case of mia. Giuseppe Magistro, Claudia Crocco. 91-95
- Dynamic Vertical Larynx Actions Under Prosodic Focus. Miran Oh, Yoon-Jeong Lee. 96-100
- Fundamental Frequency Variability over Time in Telephone Interactions. Leah Bradshaw, Eleanor Chodroff, Lena Jäger, Volker Dellwo. 101-105
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-Jussà. 106-110
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation. Jinming Zhao, Hao Yang, Gholamreza Haffari, Ehsan Shareghi. 111-115
- Cross-Modal Decision Regularization for Simultaneous Speech Translation. Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim 0002, Chanwoo Kim. 116-120
- Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation. Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura 0001. 121-125
- Generalized Keyword Spotting using ASR embeddings. Kirandevraj R, Vinod Kumar Kurmi, Vinay P. Namboodiri, C. V. Jawahar. 126-130
- Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss. Youngdo Ahn, Sung Joo Lee, Jong Won Shin. 131-135
- Improving Speech Emotion Recognition Through Focus and Calibration Attention Mechanisms. Junghun Kim, Yoojin An, Jihie Kim. 136-140
- The Emotion is Not One-hot Encoding: Learning with Grayscale Label for Emotion Recognition in Conversation. Joosung Lee. 141-145
- Probing speech emotion recognition transformers for linguistic knowledge. Andreas Triantafyllopoulos, Johannes Wagner 0001, Hagen Wierstorf, Maximilian Schmitt, Uwe Reichel, Florian Eyben, Felix Burkhardt, Björn W. Schuller. 146-150
- End-To-End Label Uncertainty Modeling for Speech-based Arousal Recognition Using Bayesian Neural Networks. Navin Raj Prabhu, Guillaume Carbajal, Nale Lehmann-Willenbrock, Timo Gerkmann. 151-155
- Mind the gap: On the value of silence representations to lexical-based speech emotion recognition. Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost. 156-160
- Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier. Huang-Cheng Chou, Chi-Chun Lee, Carlos Busso. 161-165
- Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection. Hira Dhamyal, Bhiksha Raj, Rita Singh. 166-170
- Speak Like a Professional: Increasing Speech Intelligibility by Mimicking Professional Announcer Voice with Voice Conversion. Tuan Vu Ho, Maori Kobayashi, Masato Akagi. 171-175
- Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement. Tuan Vu Ho, Quoc Huy Nguyen, Masato Akagi, Masashi Unoki. 176-180
- iDeepMMSE: An improved deep learning approach to MMSE speech and noise power spectrum estimation for speech enhancement. Minseung Kim, Hyungchan Song, Sein Cheong, Jong Won Shin. 181-185
- Boosting Self-Supervised Embeddings for Speech Enhancement. Kuo-Hsuan Hung, Szu-Wei Fu, Huan-Hsin Tseng, Hsin-Tien Chiang, Yu Tsao 0001, Chii-Wann Lin. 186-190
- Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections. Seorim Hwang, Youngcheol Park, Sungwook Park. 191-195
- CycleGAN-based Unpaired Speech Dereverberation. Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey. 196-200
- Attentive Training: A New Training Framework for Talker-independent Speaker Extraction. Ashutosh Pandey 0004, DeLiang Wang. 201-205
- Improved Modulation-Domain Loss for Neural-Network-based Speech Enhancement. Tyler Vuong, Richard M. Stern. 206-210
- Perceptual Characteristics Based Multi-objective Model for Speech Enhancement. Chiang-Jen Peng, Yun-Ju Chan, Yih-Liang Shen, Cheng Yu, Yu Tsao 0001, Tai-Shih Chi. 211-215
- Listen only to me! How well can target speech extraction handle false alarms? Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolíková, Hiroshi Sato, Tomohiro Nakatani. 216-220
- Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction. Hao Shi, Longbiao Wang, Sheng Li 0010, Jianwu Dang, Tatsuya Kawahara. 221-225
- Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments. Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann. 226-230
- PodcastMix: A dataset for separating music and speech in podcasts. Nicolás Schmidt, Jordi Pons, Marius Miron. 231-235
- Independence-based Joint Dereverberation and Separation with Neural Source Model. Kohei Saijo, Robin Scheibler. 236-240
- Spatial Loss for Unsupervised Multi-channel Source Separation. Kohei Saijo, Robin Scheibler. 241-245
- Effect of Head Orientation on Speech Directivity. Samuel Bellows, Timothy W. Leishman. 246-250
- Unsupervised Training of Sequential Neural Beamformer Using Coarsely-separated and Non-separated Signals. Kohei Saijo, Tetsuji Ogawa. 251-255
- Blind Language Separation: Disentangling Multilingual Cocktail Party Voices by Language. Marvin Borsdorf, Kevin Scheck, Haizhou Li 0001, Tanja Schultz. 256-260
- NTF of Spectral and Spatial Features for Tracking and Separation of Moving Sound Sources in Spherical Harmonic Domain. Mateusz Guzik, Konrad Kowalczyk. 261-265
- Modelling Turn-taking in Multispeaker Parties for Realistic Data Simulation. Jack Deadman, Jon Barker. 266-270
- An Initialization Scheme for Meeting Separation with Spatial Mixture Models. Christoph Böddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach. 271-275
- Prototypical speaker-interference loss for target voice separation using non-parallel audio samples. Seongkyu Mun, Dhananjaya Gowda, Jihwan Lee, Changwoo Han, Dokyun Lee, Chanwoo Kim. 276-280
- Reliability criterion based on learning-phase entropy for speaker recognition with neural network. Pierre-Michel Bousquet, Mickael Rouvier, Jean-François Bonastre. 281-285
- Attentive Feature Fusion for Robust Speaker Verification. Bei Liu, Zhengyang Chen, Yanmin Qian. 286-290
- Dual Path Embedding Learning for Speaker Verification with Triplet Attention. Bei Liu, Zhengyang Chen, Yanmin Qian. 291-295
- DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design. Bei Liu, Zhengyang Chen, Shuai Wang, Haoyu Wang, Bing Han, Yanmin Qian. 296-300
- Adaptive Rectangle Loss for Speaker Verification. Ruida Li, Shuo Fang, Chenguang Ma, Liang Li. 301-305
- MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification. Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu 0001, Hung-yi Lee, Helen Meng. 306-310
- Enroll-Aware Attentive Statistics Pooling for Target Speaker Verification. Leying Zhang, Zhengyang Chen, Yanmin Qian. 311-315
- Transport-Oriented Feature Aggregation for Speaker Embedding Learning. Yusheng Tian, Jingyu Li, Tan Lee. 316-320
- Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning. Mufan Sang, John H. L. Hansen. 321-325
- CS-CTCSCONV1D: Small footprint speaker verification with channel split time-channel-time separable 1-dimensional convolution. Linjun Cai, Yuhong Yang, Xufeng Chen, Weiping Tu, Hongyang Chen. 326-330
- Reliable Visualization for Deep Speaker Recognition. Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang. 331-335
- Unifying Cosine and PLDA Back-ends for Speaker Verification. Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan. 336-340
- CTFALite: Lightweight Channel-specific Temporal and Frequency Attention Mechanism for Enhancing the Speaker Embedding Extractor. Yuheng Wei, Junzhao Du, Hui Liu, Qian Wang. 341-345
- SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech. Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du. 346-350
- VoiceLab: Software for Fully Reproducible Automated Voice Analysis. David Feinberg. 351-355
- TRILLsson: Distilled Universal Paralinguistic Speech Representations. Joel Shor, Subhashini Venugopalan. 356-360
- Global Signal-to-noise Ratio Estimation Based on Multi-subband Processing Using Convolutional Neural Network. Nan Li, Meng Ge, Longbiao Wang, Masashi Unoki, Sheng Li 0010, Jianwu Dang. 361-365
- A Sparsity-promoting Dictionary Model for Variational Autoencoders. Mostafa Sadeghi, Paul Magron. 366-370
- Deep Transductive Transfer Regression Network for Cross-Corpus Speech Emotion Recognition. Yan Zhao, Jincen Wang, Ru Ye, Yuan Zong, Wenming Zheng, Li Zhao. 371-375
- Audio Anti-spoofing Using Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning. John H. L. Hansen, Zhenyu Wang. 376-380
- PEAF: Learnable Power Efficient Analog Acoustic Features for Audio Recognition. Boris Bergsma, Minhao Yang, Milos Cernak. 381-385
- Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load. Gasser Elbanna, Alice Biryukov, Neil Scheidwasser-Clow, Lara Orlandic, Pablo Mainar, Mikolaj Kegler, Pierre Beckmann, Milos Cernak. 386-390
- Generative Data Augmentation Guided by Triplet Loss for Speech Emotion Recognition. Shijun Wang, Hamed Hemati, Jón Guðnason, Damian Borth. 391-395
- Learning neural audio features without supervision. Sarthak Yadav, Neil Zeghidour. 396-400
- Densely-connected Convolutional Recurrent Network for Fundamental Frequency Estimation in Noisy Speech. Yixuan Zhang, Heming Wang, DeLiang Wang. 401-405
- Predicting label distribution improves non-intrusive speech quality estimation. Abu Zaher Md Faridee, Hannes Gamper. 406-410
- Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models. Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka. 411-415
- Dataset Pruning for Resource-constrained Spoofed Audio Detection. Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza. 416-420
- EdiTTS: Score-based Editing for Controllable Text-to-Speech. Jaesung Tae, Hyeongju Kim, Taesu Kim. 421-425
- Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information. Jie Chen, Changhe Song, Deyi Tuo, Xixin Wu, Shiyin Kang, Zhiyong Wu 0001, Helen Meng. 426-430
- SpeechPainter: Text-conditioned Speech Inpainting. Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi. 431-435
- A polyphone BERT for Polyphone Disambiguation in Mandarin Chinese. Song Zhang, Ken Zheng, Xiaoxu Zhu, Baoxiang Li. 436-440
- Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge. Mutian He 0001, Jingzhou Yang, Lei He 0005, Frank K. Soong. 441-445
- ByT5 model for massively multilingual grapheme-to-phoneme conversion. Jian Zhu, Cong Zhang, David Jurgens. 446-450
- DocLayoutTTS: Dataset and Baselines for Layout-informed Document-level Neural Speech Synthesis. Puneet Mathur, Franck Dernoncourt, Quan Hung Tran, Jiuxiang Gu, Ani Nenkova, Vlad I. Morariu, Rajiv Jain, Dinesh Manocha. 451-455
- Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech. Guangyan Zhang, Kaitao Song, Xu Tan 0003, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao. 456-460
- Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition. Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang 0001, Shiyu Chang, Mark Hasegawa-Johnson. 461-465
- An Efficient and High Fidelity Vietnamese Streaming End-to-End Speech Synthesis. Tho Nguyen Duc Tran, The Chuong Chu, Vu Hoang, Trung Huu Bui, Steven Hung Quoc Truong. 466-470
- Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks. Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter. 471-475
- An Automatic Soundtracking System for Text-to-Speech Audiobooks. Zikai Chen, Lin Wu, Junjie Pan, Xiang Yin 0006. 476-480
- Environment Aware Text-to-Speech Synthesis. Daxin Tan, Guangyan Zhang, Tan Lee. 481-485
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation. Artem Ploujnikov, Mirco Ravanelli. 486-490
- Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization. Evelina Bakhturina, Yang Zhang, Boris Ginsburg. 491-495
- Prosodic alignment for off-screen automatic dubbing. Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote. 496-500
- A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis. Qibing Bai, Tom Ko, Yu Zhang. 501-505
- CAUSE: Crossmodal Action Unit Sequence Estimation from Speech. Hirokazu Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka. 506-510
- Visualising Model Training via Vowel Space for Text-To-Speech Systems. Binu Nisal Abeysinghe, Jesin James, Catherine I. Watson 0001, Felix Marattukalam. 511-515
- Binary Early-Exit Network for Adaptive Inference on Low-Resource Devices. Aaqib Saeed. 516-520
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings. Naoyuki Kanda, Jian Wu 0027, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen 0006, Jinyu Li 0001, Takuya Yoshioka. 521-525
- Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data. Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura. 526-530
- Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation. Yi-Kai Zhang, Da-Wei Zhou 0001, Han-Jia Ye, De-Chuan Zhan. 531-535
- Federated Domain Adaptation for ASR with Full Self-Supervision. Junteng Jia, Jay Mahadeokar, Weiyi Zheng, Yuan Shangguan, Ozlem Kalinli, Frank Seide. 536-540
- Augmented Adversarial Self-Supervised Learning for Early-Stage Alzheimer's Speech Detection. Longfei Yang, Wenqing Wei, Sheng Li 0010, Jiyi Li, Takahiro Shinozaki. 541-545
- Extending RNN-T-based speech recognition systems with emotion and language classification. Zvi Kons, Hagai Aronowitz, Edmilson Da Silva Morais, Matheus Damasceno, Hong-Kwang Kuo, Samuel Thomas 0001, George Saon. 546-549
- Thutmose Tagger: Single-pass neural model for Inverse Text Normalization. Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg. 550-554
- Leveraging Prosody for Punctuation Prediction of Spontaneous Speech. Yeonjin Cho, Sara Ng, Trang Tran, Mari Ostendorf. 555-559
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings. Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie 0001. 560-564
- TMGAN-PLC: Audio Packet Loss Concealment using Temporal Memory Generative Adversarial Network. Yuansheng Guan, Guochen Yu, Andong Li, Chengshi Zheng, Jie Wang. 565-569
- Real-Time Packet Loss Concealment With Mixed Generative and Predictive Model. Jean-Marc Valin, Ahmed Mustafa, Christopher Montgomery, Timothy B. Terriberry, Michael Klingbeil, Paris Smaragdis, Arvindh Krishnaswamy. 570-574
- PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network. Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang. 575-579
- INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. Lorenz Diener, Sten Sootla, Solomiya Branets, Ando Saabas, Robert Aichner, Ross Cutler. 580-584
- End-to-End Multi-Loss Training for Low Delay Packet Loss Concealment. Nan Li, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu. 585-589
- Extended U-Net for Speaker Verification in Noisy Environments. Ju-ho Kim, Jungwoo Heo, Hye-jin Shim, Ha-Jin Yu. 590-594
- Domain Agnostic Few-shot Learning for Speaker Verification. Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, Sungrack Yun. 595-599
- Scoring of Large-Margin Embeddings for Speaker Verification: Cosine or PLDA? Qiongqiong Wang, Kong-Aik Lee, Tianchi Liu 0004. 600-604
- Training speaker embedding extractors using multi-speaker audio with unknown speaker boundaries. Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Anna Silnova, Lukás Burget, Jan Cernocký. 605-609
- Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations. Chau Luu, Steve Renals, Peter Bell 0001. 610-614
- Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification. Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak. 615-619
- Variability in Production of Non-Sibilant Fricative [ç] in /hi/. Tsukasa Yoshinaga, Kikuo Maekawa, Akiyoshi Iida. 620-624
- Streaming model for Acoustic to Articulatory Inversion with transformer networks. Sathvik Udupa, Aravind Illa, Prasanta Kumar Ghosh. 625-629
- Trajectories predicted by optimal speech motor control using LSTM networks. Tsiky Rakotomalala, Pierre Baraduc, Pascal Perrier. 630-634
- Exploration strategies for articulatory synthesis of complex syllable onsets. Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul Konstantin Krug, Peter Birkholz, Yi Xu. 635-639
- Linguistic versus biological factors governing acoustic voice variation. Yoonjeong Lee, Jody Kreiman. 640-643
- Acquisition of allophonic variation in second language speech: An acoustic and articulatory study of English laterals by Japanese speakers. Takayuki Nagamine. 644-648
- SAQAM: Spatial Audio Quality Assessment Metric. Pranay Manocha, Anurag Kumar 0003, Buye Xu, Anjali Menon, Israel Dejene Gebru, Vamsi Krishna Ithapu, Paul Calamia. 649-653
- Speech Quality Assessment through MOS using Non-Matching References. Pranay Manocha, Anurag Kumar 0003. 654-658
- An objective test tool for pitch extractors' response attributes. Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Tatsuya Kitamura, Hideki Banno, Masanori Morise. 659-663
- Data Augmentation Using McAdams-Coefficient-Based Speaker Anonymization for Fake Audio Detection. Kai Li, Sheng Li 0010, Xugang Lu, Masato Akagi, Meng Liu, Lin Zhang, Chang Zeng, Longbiao Wang, Jianwu Dang, Masashi Unoki. 664-668
- Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning. Salah Zaiem, Titouan Parcollet, Slim Essid. 669-673
- Transformer-based quality assessment model for generalized user-generated multimedia audio content. Deebha Mumtaz, Ajit Jena, Vinit Jakhetiya, Karan Nathwani, Sharath Chandra Guntuku. 674-678
- Space-Efficient Representation of Entity-centric Query Language Models. Christophe Van Gysel, Mirko Hannemann, Ernest Pusateri, Youssef Oualil, Ilya Oparin. 679-683
- Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems. Saket Dingliwal, Ashish Shenoy, Sravan Bodapati, Ankur Gandhe, Ravi Teja Gadde, Katrin Kirchhoff. 684-688
- Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition. W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor D. Strohman, Shankar Kumar. 689-693
- UserLibri: A Dataset for ASR Personalization Using Only Text. Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey. 694-698
- A BERT-based Language Modeling Framework. Chin-Yueh Chien, Kuan-Yu Chen. 699-703
- Joint Optimization of Sampling Rate Offsets Based on Entire Signal Relationship Among Distributed Microphones. Yoshiki Masuyama, Kouei Yamaoka, Nobutaka Ono. 704-708
- Challenges and Opportunities in Multi-device Speech Processing. Gregory Ciccarelli, Jarred Barber, Arun Nair, Israel Cohen, Tao Zhang. 709-713
- Practical Over-the-air Perceptual Acoustic Watermarking. Ameya Agaskar. 714-718
- Clustering-based Wake Word Detection in Privacy-aware Acoustic Sensor Networks. Timm Koppelmann, Luca Becker, Alexandru Nelus, Rene Glitza, Lea Schönherr, Rainer Martin 0001. 719-723
- Relative Acoustic Features for Distance Estimation in Smart-Homes. Francesco Nespoli, Daniel Barreda, Patrick A. Naylor. 724-728
- Time-domain Ad-hoc Array Speech Enhancement Using a Triple-path Network. Ashutosh Pandey 0004, Buye Xu, Anurag Kumar 0003, Jacob Donley, Paul Calamia, DeLiang Wang. 729-733
- Relationship between the acoustic time intervals and tongue movements of German diphthongs. Arne-Lukas Fietkau, Simon Stone, Peter Birkholz. 734-738
- Development of allophonic realization until adolescence: A production study of the affricate-fricative variation of /z/ among Japanese children. Sanae Matsui, Kyoji Iwamoto, Reiko Mazuka. 739-743
- Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition. Chung Soo Ahn, Chamara Kasun, Sunil Sivadas, Jagath C. Rajapakse. 744-748
- Low-Level Physiological Implications of End-to-End Learning for Speech Recognition. Louise Coppieters de Gibson, Philip N. Garner. 749-753
- Idiosyncratic lingual articulation of American English /æ/ and /ɑ/ using network analysis. Carolina Lins Machado, Volker Dellwo, Lei He. 754-758
- Method for improving the word intelligibility of presented speech using bone-conduction headphones. Teruki Toya, Wenyu Zhu, Maori Kobayashi, Kenichi Nakamura, Masashi Unoki. 759-763
- Three-dimensional finite-difference time-domain acoustic analysis of simplified vocal tract shapes. Debasish Ray Mohapatra, Mario Fleischer, Victor Zappi, Peter Birkholz, Sidney S. Fels. 764-768
- Speech imitation skills predict automatic phonetic convergence: a GMM-UBM study on L2. Dorina De Jong, Aldo Pastore, Noël Nguyen, Alessandro D'Ausilio. 769-773
- Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE. Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber. 774-778
- Deep Speech Synthesis from Articulatory Representations. Peter Wu, Shinji Watanabe 0001, Louis Goldstein, Alan W. Black, Gopala Krishna Anumanchipalli. 779-783
- Orofacial somatosensory inputs in speech perceptual training modulate speech production. Monica Ashokumar, Jean-Luc Schwartz, Takayuki Ito 0002. 784-787
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus. Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, SungHwan Ahn, Joun Yeop Lee, Nam Soo Kim. 788-792
- DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning. Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto. 793-797
- MSR-NV: Neural Vocoder Using Multiple Sampling Rates. Kentaro Mitsui, Kei Sawada. 798-802
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping. Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani. 803-807
- Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge. Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V. Porov, Konstantin Osipov, June Sig Sung. 808-812
- Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech. Jae-Sung Bae, Jinhyeok Yang, Taejun Bak, Young-Sun Joo. 813-817
- End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation. Krishna Subramani, Jean-Marc Valin, Umut Isik, Paris Smaragdis, Arvindh Krishnaswamy. 818-822
- EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models. Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman. 823-827
- Fine-grained Noise Control for Multispeaker Speech Synthesis. Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis. 828-832
- WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby. 833-837
- Fast Grad-TTS: Towards Efficient Diffusion-Based Speech Generation on CPU. Ivan Vovk, Tasnima Sadekova, Vladimir Gogoryan, Vadim Popov, Mikhail A. Kudinov, Jiansheng Wei. 838-842
- Simple and Effective Unsupervised Speech Synthesis. Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James R. Glass. 843-847
- Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation. Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda. 848-852
- NeMo Open Source Speaker Diarization System. Taejin Park, Nithin Rao Koluguri, Fei Jia, Jagadeesh Balam, Boris Ginsburg. 853-854
- Voice2Alliance: Automatic Speaker Diarization and Quality Assurance of Conversational Alignment. Baihan Lin. 855-856
- VAgyojaka: An Annotating and Post-Editing Tool for Automatic Speech Recognition. Rishabh Kumar, Devaraja Adiga, Mayank Kothyari, Jatin Dalal, Ganesh Ramakrishnan, Preethi Jyothi. 857-858
- SKYE: More than a conversational AI. Alzahra Badi, Chungho Park, Min-Seok Keum, Miguel Alba, Youngsuk Ryu, Jeongmin Bae. 859-860
- Training Data Generation with DOA-based Selecting and Remixing for Unsupervised Training of Deep Separation Models. Hokuto Munakata, Ryu Takeda, Kazunori Komatani. 861-865
- Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output. Hangting Chen, Yi Yang, Feng Dang, Pengyuan Zhang. 866-870
- Joint Estimation of Direction-of-Arrival and Distance for Arrays with Directional Sensors based on Sparse Bayesian Learning. Feifei Xiong, Pengyu Wang, Zhongfu Ye, Jinwei Feng. 871-875
- How to Listen? Rethinking Visual Sound Localization. Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello. 876-880
- Small Footprint Neural Networks for Acoustic Direction of Arrival Estimation. Zhiheng Ouyang, Miao Wang, Wei-Ping Zhu 0001. 881-885
- Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation. Xiaoyu Wang, Xiangyu Kong, Xiulian Peng, Yan Lu. 886-890
- MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources. Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang. 891-895
- Iterative Sound Source Localization for Unknown Number of Sources. Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang. 896-900
- Distance-Based Sound Separation. Katharine Patterson, Kevin W. Wilson, Scott Wisdom, John R. Hershey. 901-905
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network. Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang. 906-910
- TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation. Ali Aroudi, Stefan Uhlich, Marc Ferras Font. 911-915
- PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement. Xiaofeng Ge, Jiangyu Han, Yanhua Long, Haixin Guan. 916-920
- Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement. Zhuangqi Chen, Pingjian Zhang. 921-925
- Cross-Layer Similarity Knowledge Distillation for Speech Enhancement. Jiaming Cheng, Ruiyu Liang, Yue Xie, Li Zhao, Björn W. Schuller, Jie Jia, Yiyuan Peng. 926-930
- Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation. Feifei Xiong, Weiguang Chen, Pengyu Wang, Xiaofei Li, Jinwei Feng. 931-935
- CMGAN: Conformer-based Metric GAN for Speech Enhancement. Ruizhe Cao, Sherif Abdulatif, Bin Yang. 936-940
- Model Compression by Iterative Pruning with Knowledge Distillation and Its Application to Speech Enhancement. Zeyuan Wei, Li Hao, Xueliang Zhang. 941-945
- Single-channel speech enhancement using Graph Fourier Transform. Chenhui Zhang, Xiang Pan. 946-950
- Joint Optimization of the Module and Sign of the Spectral Real Part Based on CRN for Speech Denoising. Zilu Guo, Xu Xu 0003, Zhongfu Ye. 951-955
- Attentive Recurrent Network for Low-Latency Active Noise Control. Hao Zhang, Ashutosh Pandey 0004, DeLiang Wang. 956-960
- Memory-Efficient Multi-Step Speech Enhancement with Neural ODE. Jen-Hung Huang, Chung-Hsien Wu. 961-965
- GLD-Net: Improving Monaural Speech Enhancement by Learning Global and Local Dependency Features with GLD Block. Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Jianjun Hao. 966-970
- Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention. Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li. 971-975
- Speech Enhancement with Fullband-Subband Cross-Attention Network. Jun Chen, Wei Rao, Zilin Wang, Zhiyong Wu 0001, Yannan Wang, Tao Yu, Shidong Shang, Helen Meng. 976-980
- OSSEM: one-shot speaker adaptive speech enhancement using meta learning. Cheng Yu, Szu-Wei Fu, Tsun-An Hsieh, Yu Tsao 0001, Mirco Ravanelli. 981-985
- Efficient Speech Enhancement with Neural Homomorphic Synthesis. Wenbin Jiang, Tao Liu, Kai Yu. 986-990
- Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation. Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang. 991-995
- Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations. Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura. 996-1000
- FedNST: Federated Noisy Student Training for Automatic Speech Recognition. Haaris Mehmood, Agnieszka Dobrowolska, Karthikeyan Saravanan, Mete Ozay. 1001-1005
- SCaLa: Supervised Contrastive Learning for End-to-End Speech Recognition. Li Fu, Xiaoxiao Li, Runyu Wang, Lu Fan, Zhengchen Zhang, Meng Chen 0006, Youzheng Wu, Xiaodong He 0002. 1006-1010
- NAS-SCAE: Searching Compact Attention-based Encoders For End-to-end Automatic Speech Recognition. Yukun Liu, Ta Li, Pengyuan Zhang, Yonghong Yan 0002. 1011-1015
- Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR. Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma. 1016-1020
- PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition. Guodong Ma, Pengfei Hu, Nurmemet Yolwas, Shen Huang, Hao Huang. 1021-1025
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition. Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno. 1026-1030
- Improving Rare Word Recognition with LM-aware MWER Training. Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach. 1031-1035
- Improving the Training Recipe for a Robust Conformer-based Hybrid Model. Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney. 1036-1040
- CTC Variations Through New WFST TopologiesAleksandr Laptev, Somshubra Majumdar, Boris Ginsburg. 1041-1045 [doi]
- Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech RecognitionMartin Sustek, Samik Sadhu, Hynek Hermansky. 1046-1050 [doi]
- Towards Efficiently Learning Monotonic Alignments for Attention-based End-to-End Speech RecognitionChenfeng Miao, Kun Zou, Ziyang Zhuang, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao 0006. 1051-1055 [doi]
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant trainingJisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker. 1056-1060 [doi]
- From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech RecognitionSelen Hande Kabil, Hervé Bourlard. 1061-1065 [doi]
- Domain Adversarial Self-Supervised Speech Representation Learning for Improving Unknown Domain Downstream TasksTomohiro Tanaka, Ryo Masumura, Hiroshi Sato, Mana Ihori, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya. 1066-1070 [doi]
- Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASRTakashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe 0001. 1071-1075 [doi]
- Reducing Offensive Replies in Open Domain Dialogue SystemsNaokazu Uchida, Takeshi Homma, Makoto Iwayama, Yasuhiro Sogawa. 1076-1080 [doi]
- Induce Spoken Dialog Intents via Deep Unsupervised Context Contrastive ClusteringTing-Wei Wu, Biing-Hwang Juang. 1081-1085 [doi]
- Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal InformationFumio Nihei, Ryo Ishii, Yukiko I. Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura. 1086-1090 [doi]
- Contextual Acoustic Barge-In Classification for Spoken Dialog SystemsDhanush Bekal, Sundararajan Srinivasan, Srikanth Ronanki, Sravan Bodapati, Katrin Kirchhoff. 1091-1095 [doi]
- Calibrate and Refine! A Novel and Agile Framework for ASR Error Robust Intent DetectionPeilin Zhou, Dading Chong, Helin Wang, Qingcheng Zeng. 1096-1100 [doi]
- ASR-Robust Natural Language Understanding on ASR-GLUE datasetLingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai 0002, Haitao Zheng. 1101-1105 [doi]
- From Disfluency Detection to Intent Detection and Slot FillingMai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen. 1106-1110 [doi]
- Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep AnalysisHengshun Zhou, Jun Du, Gongzhen Zou, Zhaoxu Nian, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Shifu Xiong, Jianqing Gao. 1111-1115 [doi]
- Extending Compositional Attention Networks for Social Reasoning in VideosChristina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos. 1116-1120 [doi]
- TopicKS: Topic-driven Knowledge Selection for Knowledge-grounded Dialogue GenerationShiquan Wang, Yuke Si, Xiao Wei, Longbiao Wang, Zhiqiang Zhuang, Xiaowang Zhang, Jianwu Dang. 1121-1125 [doi]
- Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languagesAndreas Liesenfeld, Mark Dingemanse. 1126-1130 [doi]
- Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language UnderstandingYi Zhu, Zexun Wang, Hang Liu, PeiYing Wang, Mingchao Feng, Meng Chen, Xiaodong He 0002. 1131-1135 [doi]
- Use of Nods Less Synchronized with Turn-Taking and Prosody During Conversations in Adults with AutismKeiko Ochi, Nobutaka Ono, Keiho Owada, Miho Kuroda, Shigeki Sagayama, Hidenori Yamasue. 1136-1140 [doi]
- DAVIS: Driver's Audio-Visual Speech recognitionDenis Ivanko, Dmitry Ryumin, Alexey M. Kashevnik, Alexandr Axyonov, Andrey Kitenko, Igor Lashkov, Alexey Karpov 0001. 1141-1142 [doi]
- Analysis of Self-Supervised Learning and Dimensionality Reduction Methods in Clustering-Based Active Learning for Speech Emotion RecognitionEinari Vaaras, Manu Airaksinen, Okko Räsänen. 1143-1147 [doi]
- Emotion-Shift Aware CRF for Decoding Emotion Sequence in ConversationChun-Yu Chen, Yun-Shao Lin, Chi-Chun Lee. 1148-1152 [doi]
- Vaccinating SER to Neutralize Adversarial Attacks with Self-Supervised Augmentation StrategyBo-Hao Su, Chi-Chun Lee. 1153-1157 [doi]
- Speech Emotion Recognition in the Wild using Multi-task and Adversarial LearningJack Parry, Eric DeMattos, Anita Klementiev, Axel Ind, Daniela Morse-Kopp, Georgia Clarke, Dimitri Palaz. 1158-1162 [doi]
- The Magnitude and Phase based Speech Representation Learning using Autoencoder for Classifying Speech Emotions using Deep Canonical Correlation AnalysisAshishkumar Prabhakar Gudmalwar, Biplove Basel, Anirban Dutta, Ch V. Rama Rao. 1163-1167 [doi]
- Improving Speech Emotion Recognition Using Self-Supervised Learning with Domain-Specific Audiovisual TasksLucas Goncalves, Carlos Busso. 1168-1172 [doi]
- SNRi Target Training for Joint Speech Enhancement and RecognitionYuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani. 1173-1177 [doi]
- Deep Self-Supervised Learning of Speech Denoising from Noisy SpeechesYutaro Sanada, Takumi Nakagawa, Yuichiro Wada, Kosaku Takanashi, Yuhui Zhang, Kiichi Tokuyama, Takafumi Kanamori, Tomonori Yamada. 1178-1182 [doi]
- NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional ResamplingChi-Chang Lee, Cheng-Hung Hu, Yu-Chen Lin, Chu-Song Chen, Hsin-Min Wang, Yu Tsao 0001. 1183-1187 [doi]
- FFC-SE: Fast Fourier Convolution for Speech EnhancementIvan Shchekotov, Pavel K. Andreev, Oleg Ivanov, Aibek Alanov, Dmitry Vetrov. 1188-1192 [doi]
- A Systematic Comparison of Phonetic Aware Techniques for Speech EnhancementOr Tal, Moshe Mandel, Felix Kreuk, Yossi Adi. 1193-1197 [doi]
- Multi-View Attention Transfer for Efficient Speech EnhancementWooSeok Shin, Hyun-Joon Park, Jin Sob Kim, Byung-Hoon Lee, Sung Won Han. 1198-1202 [doi]
- SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to SeparateNabarun Goswami, Tatsuya Harada. 1203-1207 [doi]
- Correcting Mispronunciations in Speech using Spectrogram InpaintingTalia Ben Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet. 1208-1212 [doi]
- Speech Audio Corrector: using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speechJason Fong, Daniel Lyth, Gustav Eje Henter, Hao Tang, Simon King. 1213-1217 [doi]
- End-to-End Binaural Speech SynthesisWen-Chin Huang, Dejan Markovic, Alexander Richard, Israel Dejene Gebru, Anjali Menon. 1218-1222 [doi]
- PoeticTTS - Controllable Poetry Reading for Literary StudiesJulia Koch, Florian Lux, Nadja Schauffler, Toni Bernhart, Felix Dieterle, Jonas Kuhn, Sandra Richter, Gabriel Viehhauser, Ngoc Thang Vu. 1223-1227 [doi]
- Articulatory Synthesis for Data Augmentation in Phoneme RecognitionPaul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu. 1228-1232 [doi]
- SF-DST: Few-Shot Self-Feeding Reading Comprehension Dialogue State Tracking with Auxiliary TaskJihyun Lee, Gary Geunbae Lee. 1233-1237 [doi]
- Benchmarking Transformers-based models on French Spoken Language Understanding tasksOralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset. 1238-1242 [doi]
- mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot FillingSeong-Hwan Heo, WonKee Lee, Jong-Hyeok Lee. 1243-1247 [doi]
- Bottleneck Low-rank Transformers for Low-resource Spoken Language UnderstandingPu Wang, Hugo Van Hamme. 1248-1252 [doi]
- On joint training with interfaces for spoken language understandingAnirudh Raju, Milind Rao, Gautam Tiwari, Pranav Dheram, Bryan Anderson, Zhe Zhang, Chul Lee, Bach Bui, Ariya Rastrow. 1253-1257 [doi]
- Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised ModelsVineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed Hussen Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed H. Tewfik. 1258-1262 [doi]
- Building African VoicesPerez Ogayo, Graham Neubig, Alan W. Black. 1263-1267 [doi]
- Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparitiesPranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke. 1268-1272 [doi]
- Training and typological bias in ASR performance for world EnglishesMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen 0017, Xin Gao, Nicole R. Holliday. 1273-1277 [doi]
- A Study of Gender Impact in Self-supervised Models for Speech-to-Text SystemsMarcely Zanon Boito, Laurent Besacier, Natalia A. Tomashenko, Yannick Estève. 1278-1282 [doi]
- Automatic Dialect Density Estimation for African American EnglishAlexander Johnson, Kevin Everson, Vijay Ravi, Anissa Gladney, Mari Ostendorf, Abeer Alwan. 1283-1287 [doi]
- Improving Language Identification of Accented SpeechKunnar Kukk, Tanel Alumäe. 1288-1292 [doi]
- Design Guidelines for Inclusive Speaker Verification Evaluation DatasetsWiebke Toussaint, Lauriane Gorce, Aaron Yi Ding. 1293-1297 [doi]
- Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight ConsolidationViet Anh Trinh, Pegah Ghahremani, Brian King, Jasha Droppo, Andreas Stolcke, Roland Maas. 1298-1302 [doi]
- Gradual Improvements Observed in Learners' Perception and Production of L2 Sounds Through Continuing Shadowing Practices on a Daily BasisTakuya Kunihara, Chuanbo Zhu, Nobuaki Minematsu, Noriko Nakanishi. 1303-1307 [doi]
- Spoofed speech from the perspective of a forensic phoneticianChristin Kirchhübel, Georgina Brown. 1308-1312 [doi]
- Investigating Prosodic Variation in British English Varieties using ProPerHae-Sung Jeon, Stephen Nichols. 1313-1317 [doi]
- Perceived prominence and downstep in JapaneseHyun Kyung Hwang, Manami Hirayama, Takaomi Kato. 1318-1321 [doi]
- The discrimination of [zi]-[dʑi] by Japanese listeners and the prospective phonologization of /zi/Andrea Alicehajic, Silke Hamann. 1322-1326 [doi]
- Glottal inverse filtering based on articulatory synthesis and deep learningIngo Langheinrich, Simon Stone, Xinyu Zhang, Peter Birkholz. 1327-1331 [doi]
- Investigating phonetic convergence of laughter in conversationBogdan Ludusan, Marin Schröer, Petra Wagner. 1332-1336 [doi]
- Telling self-defining memories: An acoustic study of natural emotional speech productionsVéronique Delvaux, Audrey Lavallée, Fanny Degouis, Xavier Saloppe, Jean-Louis Nandrino, Thierry Pham. 1337-1341 [doi]
- Voicing neutralization in Romanian fricatives across different speech stylesLaura Spinu, Ioana Vasilescu, Lori Lamel, Jason Lilley. 1342-1346 [doi]
- Nasal Coda Loss in the Chengdu Dialect of Mandarin: Evidence from RT-MRISishi Liao, Phil Hoole, Conceição Cunha, Esther Kunay, Aletheia Cui, Lia Saki Bucar Shigemori, Felicitas Kleber, Dirk Voit, Jens Frahm, Jonathan Harrington. 1347-1351 [doi]
- ema2wav: doing articulation by PraatPhilipp Buech, Simon Roessig, Lena Pagel, Doris Mücke, Anne Hermes. 1352-1356 [doi]
- Improving Phonetic Transcriptions of Children's Speech by Pronunciation Modelling with Constrained CTC-DecodingLars Rumberg, Christopher Gebauer, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann. 1357-1361 [doi]
- Leveraging Simultaneous Translation for Enhancing Transcription of Low-resource Language via Cross Attention MechanismSoky Kak, Sheng Li 0010, Masato Mimura, Chenhui Chu, Tatsuya Kawahara. 1362-1366 [doi]
- KSC2: An Industrial-Scale Open-Source Kazakh Speech CorpusSaida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol. 1367-1371 [doi]
- Knowledge of accent differences can be used to predict speech recognitionTuende Szalay, Mostafa Ali Shahin, Beena Ahmed, Kirrie J. Ballard. 1372-1376 [doi]
- Lombard Effect for Bilingual Speakers in Cantonese and English: importance of spectro-temporal featuresMaximilian Karl Scharf, Sabine Hochmuth, Lena L. N. Wong, Birger Kollmeier, Anna Warzybok. 1377-1381 [doi]
- End-to-end speech recognition modeling from de-identified dataMartin Flechl, Shou-Chun Yin, Junho Park, Peter Skala. 1382-1386 [doi]
- Multi-Task End-to-End Model for Telugu Dialect and Speech RecognitionAditya Yadavalli, Mirishkar Sai Ganesh, Anil Kumar Vuppala. 1387-1391 [doi]
- DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech RecognitionJiamin Xie, John H. L. Hansen. 1392-1396 [doi]
- Keyword Spotting with Synthetic Data using Heterogeneous Knowledge DistillationYuna Lee, Seung Jun Baek. 1397-1401 [doi]
- Probing phoneme, language and speaker information in unsupervised speech representationsMaureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski. 1402-1406 [doi]
- Automatic Detection of Reactive Attachment Disorder Through Turn-Taking Analysis in Clinical Child-Caregiver SessionsAndrei Bîrladeanu, Helen Minnis, Alessandro Vinciarelli. 1407-1410 [doi]
- Automatic Pronunciation Assessment using Self-Supervised Speech Representation LearningEesung Kim, Jae-Jin Jeon, Hyeji Seo, Hoon Kim. 1411-1415 [doi]
- Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded SpeechTyler Miller, David Harwath. 1416-1420 [doi]
- Pseudo Label Is Better Than Human LabelDongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman. 1421-1425 [doi]
- A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit DiscoveryWerner van der Merwe, Herman Kamper, Johan Adam du Preez. 1426-1430 [doi]
- PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker VerificationSiqi Zheng, Hongbin Suo, Qian Chen. 1431-1435 [doi]
- Cross-Age Speaker Verification: Learning Age-Invariant Speaker EmbeddingsXiaoyi Qin, Na Li 0012, Chao Weng, Dan Su 0002, Ming Li 0026. 1436-1440 [doi]
- Online Target Speaker Voice Activity Detection for Speaker DiarizationWeiqing Wang, Ming Li, Qingjian Lin. 1441-1445 [doi]
- Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddingsNiko Brummer, Albert Swart, Ladislav Mosner, Anna Silnova, Oldrich Plchot, Themos Stafylakis, Lukás Burget. 1446-1450 [doi]
- Deep speaker embedding with frame-constrained training strategy for speaker verificationBin Gu. 1451-1455 [doi]
- Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker DiarizationYifan Chen, Yifan Guo, Qingxuan Li, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan 0002. 1456-1460 [doi]
- End-to-End Audio-Visual Neural Speaker DiarizationMao-Kui He, Jun Du, Chin-Hui Lee. 1461-1465 [doi]
- Online Speaker Diarization with Core Samples SelectionYanyan Yue, Jun Du, Mao-Kui He, Yu Ting Yeung, Renyu Wang. 1466-1470 [doi]
- Robust End-to-end Speaker Diarization with Generic Neural ClusteringChenyu Yang, Yu Wang. 1471-1475 [doi]
- MSDWild: Multi-modal Speaker Diarization Dataset in the WildTao Liu, Shuai Fan 0005, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Yanmin Qian, Kai Yu 0004. 1476-1480 [doi]
- Unsupervised Speaker Diarization that is Agnostic to Language, Overlap-Aware, and Tuning FreeMd. Iftekhar Tanveer, Diego Casabuena, Jussi Karlgren, Rosie Jones. 1481-1485 [doi]
- Utterance-by-utterance overlap-aware neural diarization with Graph-PITKeisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Böddeker, Reinhold Haeb-Umbach. 1486-1490 [doi]
- Spatial-aware Speaker Diarization for Multi-channel Multi-party MeetingJie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Qingyang Hong. 1491-1495 [doi]
- Selective Pseudo-labeling and Class-wise Discriminative Fusion for Sound Event DetectionYunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang. 1496-1500 [doi]
- An End-to-End Macaque Voiceprint Verification Method Based on Channel Fusion MechanismPeng Liu, Songbin Li, Jigang Tang. 1501-1505 [doi]
- Human Sound Classification based on Feature Fusion Method with Air and Bone Conducted SignalLiang Xu, Jing Wang, Lizhong Wang, Sijun Bi, Jianqian Zhang, Qiuyue Ma. 1506-1510 [doi]
- RaDur: A Reference-aware and Duration-robust Network for Target Sound DetectionDongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, Wenwu Wang. 1511-1515 [doi]
- Temporal Self Attention-Based Residual Network for Environmental Sound ClassificationAchyut Mani Tripathi, Konark Paul. 1516-1520 [doi]
- AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classificationJuncheng Li 0001, Shuhui Qu, Po-Yao Huang 0001, Florian Metze. 1521-1525 [doi]
- Improving Target Sound Extraction with Timestamp InformationHelin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou. 1526-1530 [doi]
- A Multi-grained based Attention Network for Semi-supervised Sound Event DetectionYing Hu, Xiujuan Zhu, Yunlong Li, Hao Huang, Liang He. 1531-1535 [doi]
- Temporal coding with magnitude-phase regularization for sound event detectionSangwook Park, Sandeep Reddy Kothinti, Mounya Elhilali. 1536-1540 [doi]
- RCT: Random consistency training for semi-supervised sound event detectionNian Shao, Erfan Loweimi, Xiaofei Li. 1541-1545 [doi]
- Audio Pyramid Transformer with Domain Adaption for Weakly Supervised Sound Event Detection and Audio ClassificationYifei Xin, Dongchao Yang, Yuexian Zou. 1546-1550 [doi]
- Active Few-Shot Learning for Sound Event DetectionYu Wang 0105, Mark Cartwright, Juan Pablo Bello. 1551-1555 [doi]
- Uncertainty Calibration for Deep Audio ClassifiersTong Ye, Shijing Si, Jianzong Wang, Ning Cheng, Jing Xiao. 1556-1560 [doi]
- Event-related data conditioning for acoustic event classificationYuanbo Hou, Dick Botteldooren. 1561-1565 [doi]
- A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTSHaohan Guo, Hui Lu, Xixin Wu, Helen Meng. 1566-1570 [doi]
- RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech InsertionDacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo. 1571-1575 [doi]
- FlowVocoder: A small Footprint Neural Vocoder based Normalizing Flow for Speech SynthesisManh Luong, Viet-Anh Tran. 1576-1580 [doi]
- DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-EncodersYanqing Liu, Ruiqing Xue, Lei He, Xu Tan 0003, Sheng Zhao. 1581-1585 [doi]
- AdaVocoder: Adaptive Vocoder for Custom VoiceXin Yuan, Robin Feng, Mingming Ye, Cheng Tuo, Minghang Zhang. 1586-1590 [doi]
- RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity ResponsesShengyuan Xu, Wenxiao Zhao, Jing Guo. 1591-1595 [doi]
- VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic FeatureChenpeng Du, Yiwei Guo, Xie Chen, Kai Yu 0004. 1596-1600 [doi]
- Improving GAN-based vocoder for fast and high-quality speech synthesisMengnan He, Tingwei Guo, Zhenxing Lu, Ruixiong Zhang, Caixia Gong. 1601-1605 [doi]
- SoftSpeech: Unsupervised Duration Model in FastSpeech 2Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yuchao Zhang. 1606-1610 [doi]
- A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTSHaohan Guo, Feng-Long Xie, Frank K. Soong, Xixin Wu, Helen Meng. 1611-1615 [doi]
- SiD-WaveFlow: A Low-Resource Vocoder Independent of Prior KnowledgeYuhan Li, Ying Shen 0005, Dongqing Wang, Lin Zhang 0014. 1616-1620 [doi]
- Text-to-speech synthesis using spectral modeling based on non-negative autoencoderTakeru Gorai, Daisuke Saito, Nobuaki Minematsu. 1621-1625 [doi]
- Joint Modeling of Multi-Sample and Subband Signals for Fast Neural Vocoding on CPUHiroki Kanagawa, Yusuke Ijima, Hiroyuki Toda. 1626-1630 [doi]
- MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual BlocksTakuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki. 1631-1635 [doi]
- A compact transformer-based GAN vocoderChenfeng Miao, Ting Chen, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao. 1636-1640 [doi]
- Diffusion Generative Vocoder for Fullband Speech Synthesis Based on Weak Third-order SDE SolverHideyuki Tachibana, Muneyoshi Inahara, Mocho Go, Yotaro Katayama, Yotaro Watanabe. 1641-1645 [doi]
- On Adaptive Weight Interpolation of the Hybrid Autoregressive TransducerEhsan Variani, Michael Riley 0001, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran. 1646-1650 [doi]
- Learning to rank with BERT-based confidence models in ASR rescoringTing-Wei Wu, I-Fan Chen, Ankur Gandhe. 1651-1655 [doi]
- VQ-T: RNN Transducers using Vector-Quantized Prediction Network StatesJiatong Shi, George Saon, David Haws, Shinji Watanabe 0001, Brian Kingsbury. 1656-1660 [doi]
- WeNet 2.0: More Productive End-to-End Speech Recognition ToolkitBinbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv 0001, Lei Xie 0001, Chao Yang, Fuping Pan, Jianwei Niu 0002. 1661-1665 [doi]
- Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASRYufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, Weibin Zhang. 1666-1670 [doi]
- Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision StrategiesZehan Li, Haoran Miao, Keqi Deng, Gaofeng Cheng, Sanli Tian, Ta Li, Yonghong Yan 0002. 1671-1675 [doi]
- Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech RecognitionYe Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi, Xiaorui Wang. 1676-1680 [doi]
- CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-TransformerZhanheng Yang, Sining Sun, Jin Li, Xiaoming Zhang, Xiong Wang, Long Ma, Lei Xie. 1681-1685 [doi]
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERTRui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li 0001. 1686-1690 [doi]
- Multi-stage Progressive Compression of Conformer Transducer for On-device Speech RecognitionJash Rathod, Nauman Dawalatabad, Shatrughan Singh, Dhananjaya Gowda. 1691-1695 [doi]
- Streaming Align-Refine for Non-autoregressive DeliberationWeiran Wang, Ke Hu, Tara N. Sainath. 1696-1700 [doi]
- Federated Pruning: Improving Neural Network Efficiency with Federated LearningRongmei Lin, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong 0001, Giovanni Motta, Françoise Beaufays. 1701-1705 [doi]
- A Unified Cascaded Encoder ASR Model for Dynamic Model SizesShaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman. 1706-1710 [doi]
- 4-bit Conformer with Native Quantization Aware Training for Speech RecognitionShaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov. 1711-1715 [doi]
- Self-Distillation Based on High-level Information Supervision for Compressing End-to-End ASR ModelQiang Xu, Tongtong Song, Longbiao Wang, Hao Shi, Yuqin Lin, Yongjie Lv, Meng Ge, Qiang Yu 0005, Jianwu Dang. 1716-1720 [doi]
- Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translationYe Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang 0033, Alexis Conneau, Nobu Morioka. 1721-1725 [doi]
- A High-Quality and Large-Scale Dataset for English-Vietnamese Speech TranslationLinh The Nguyen, Nguyen Luong Tran, Long Doan, Manh Luong, Dat Quoc Nguyen. 1726-1730 [doi]
- Investigating Parameter Sharing in Multilingual Speech TranslationQian Wang, Chen Wang, Jiajun Zhang. 1731-1735 [doi]
- Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech DatasetZehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, Lei Xie, Yonghong Yan 0002. 1736-1740 [doi]
- TALCS: An open-source Mandarin-English code-switching corpus and a speech recognition baselineChengfei Li, Shuhao Deng, Yaoping Wang, Guangjing Wang, Yaguang Gong, Changbin Chen, Jinfeng Bai. 1741-1745 [doi]
- Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech TranslationKeqi Deng, Shinji Watanabe 0001, Jiatong Shi, Siddhant Arora. 1746-1750 [doi]
- BARTpho: Pre-trained Sequence-to-Sequence Models for VietnameseNguyen Luong Tran, Duong Minh Le, Dat Quoc Nguyen. 1751-1755 [doi]
- Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition TaskMaxim Markitantov, Elena Ryumina, Dmitry Ryumin, Alexey Karpov 0001. 1756-1760 [doi]
- Bayesian Transformer Using Disentangled Mask AttentionJen-Tzung Chien, Yu-Han Huang. 1761-1765 [doi]
- Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep AnalysisHang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe 0001, Odette Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan. 1766-1770 [doi]
- From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech TranslationDanni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Pino. 1771-1775 [doi]
- Isochrony-Aware Neural Machine Translation for Automatic DubbingDerek Tam, Surafel Melaku Lakew, Yogesh Virkar, Prashant Mathur, Marcello Federico. 1776-1780 [doi]
- Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech TranslationQianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang. 1781-1785 [doi]
- A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker ExtractionZexu Pan, Meng Ge, Haizhou Li 0001. 1786-1790 [doi]
- Extending GCC-PHAT using Shift Equivariant Neural NetworksAxel Berg, Mark O'Connor, Kalle Åström, Magnus Oskarsson. 1791-1795 [doi]
- Heterogeneous Target Speech SeparationEfthymios Tzinis, Gordon Wichern, Aswin Shanmugam Subramanian, Paris Smaragdis, Jonathan Le Roux. 1796-1800 [doi]
- Separate What You Describe: Language-Queried Audio Source SeparationXubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang. 1801-1805 [doi]
- Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform DomainDejan Markovic, Alexandre Défossez, Alexander Richard. 1806-1810 [doi]
- End-to-end Speech-to-Punctuated-Text RecognitionJumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto. 1811-1815 [doi]
- End-to-End Dependency Parsing of Spoken FrenchAdrien Pupier, Maximin Coavoux, Benjamin Lecouteux, Jérôme Goulian. 1816-1820 [doi]
- Turn-Taking Prediction for Natural Conversational SpeechShuo-Yiin Chang, Bo Li, Tara N. Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He. 1821-1825 [doi]
- Streaming Intended Query Detection using E2E Modeling for Continued ConversationShuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara N. Sainath, Bo Li 0028, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman. 1826-1830 [doi]
- Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of CzechJan Lehecka, Jan Svec, Ales Prazák, Josef Psutka. 1831-1835 [doi]
- SVTS: Scalable Video-to-Speech SynthesisRodrigo Schoburg Carrillo de Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic. 1836-1840 [doi]
- One-step models in pitch perception: Experimental evidence from JapaneseTakeshi Kishiyama, Chuyu Huang, Yuki Hirose. 1841-1845 [doi]
- Generating iso-accented stimuli for second language research: methodology and a dataset for Spanish-accented EnglishRubén Pérez Ramón, Martin Cooke, María Luisa García Lecumberri. 1846-1850 [doi]
- Factors affecting the percept of Yanny v. Laurel (or mixed): Insights from a large-scale study on Swiss German listenersAdrian Leemann, Péter Jeszenszky, Carina Steiner, Corinne Lanthemann. 1851-1855 [doi]
- Effects of laryngeal manipulations on voice gender perceptionZhaoyan Zhang, Jason Zhang, Jody Kreiman. 1856-1860 [doi]
- Why is Korean lenis stop difficult to perceive for L2 Korean learners?Boram Lee, Naomi Yamaguchi, Cécile Fougeron. 1861-1865 [doi]
- Lexical stress in Spanish word segmentationAlvaro Martin Iturralde Zurita, Meghan Clayards. 1866-1870 [doi]
- Learning Audio-Text Agreement for Open-vocabulary Keyword SpottingHyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, Hong-Goo Kang. 1871-1875 [doi]
- Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word EmbeddingsBadr M. Abdullah, Bernd Möbius, Dietrich Klakow. 1876-1880 [doi]
- Personalized Keyword Spotting through Multi-task LearningSeunghan Yang, Byeonggeun Kim, Inseop Chung, Simyung Chang. 1881-1885 [doi]
- Deep LSTM Spoken Term Detection using Wav2Vec 2.0 RecognizerJan Svec, Jan Lehecka, Lubos Smídl. 1886-1890 [doi]
- Latency Control for Keyword SpottingChristin Jose, Joe Wang, Grant P. Strimel, Mohammad Omar Khursheed, Yuriy Mishchenko, Brian Kulis. 1891-1895 [doi]
- Improving Voice Trigger Detection with Metric LearningPrateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed H. Tewfik. 1896-1900 [doi]
- RNN Transducers for Named Entity Recognition with constraints on alignment for understanding medical conversationsHagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey. 1901-1905 [doi]
- Towards Automated Counselling Decision-Making: Remarks on Therapist Action Forecasting on the AnnoMI DatasetZixiu Wu, Rim Helaoui, Diego Reforgiato Recupero, Daniele Riboni. 1906-1910 [doi]
- Speech and the n-Back task as a lens into depression. How combining both may allow us to isolate different core symptoms of depressionSalvatore Fara, Stefano Goria, Emilia Molimpakis, Nicholas Cummins. 1911-1915 [doi]
- Enabling Off-the-Shelf Disfluency Detection and Categorization for Pathological SpeechAmrit Romana, Minxue Niu, Matthew Perez, Angela Roberts, Emily Mower Provost. 1916-1920 [doi]
- Challenges of using longitudinal and cross-domain corpora on studies of pathological speechCatarina Botelho, Tanja Schultz, Alberto Abad, Isabel Trancoso. 1921-1925 [doi]
- g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in MandarinYi-Chang Chen, Yu-Chuan Steven, Yen-Cheng Chang, Yi-Ren Yeh. 1926-1930 [doi]
- A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-SpeechByeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana. 1931-1935 [doi]
- Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noiseTuomo Raitio, Petko Petkov, Jiangchuan Li, P. V. Muhammed Shifas, Andrea Davis, Yannis Stylianou. 1936-1940 [doi]
- TTS-by-TTS 2: Data-Selective Augmentation for Neural Speech Synthesis Using Ranking Support Vector Machine with Variational AutoencoderEunwoo Song, Ryuichi Yamamoto, Ohsung Kwon, Chan Ho Song, Min-Jae Hwang, Suhyeon Oh, Hyun-Wook Yoon, Jin Seob Kim, Jae Min Kim. 1941-1945 [doi]
- Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentationGiulia Comini, Goeric Huybrechts, Manuel Sam Ribeiro, Adam Gabrys, Jaime Lorenzo-Trueba. 1946-1950 [doi]
- Real-Time Monitoring of Silences in Contact Center ConversationsDigvijay Ingle, Ayush Kumar, Krishnachaitanya Gogineni, Jithendra Vepa. 1951-1952 [doi]
- Humanizing bionic voice: interactive demonstration of aesthetic design and control factors influencing the devices assembly and waveshape engineeringKonrad Zielinski, Marek Grzelec, Martin Hagmüller. 1953-1954 [doi]
- Application for Real-time Personalized Speaker ExtractionDamien Ronssin, Milos Cernak. 1955-1956 [doi]
- Coswara: A website application enabling COVID-19 screening by analysing respiratory sound samples and health symptomsDebarpan Bhattacharya, Debottam Dutta, Neeraj Kumar Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K. K, Sadhana Gonuguntla, Murali Alagesan. 1957-1958 [doi]
- CoachLea: an Android Application to Evaluate the Speech Production and Perception of Children with Hearing LossP. Schäfer, Paula Andrea Pérez-Toro, Philipp Klumpp, Juan Rafael Orozco-Arroyave, Elmar Nöth, Andreas K. Maier, A. Abad, Maria Schuster, Tomás Arias-Vergara. 1959-1960 [doi]
- An Automated Mood Diary for Older User's using Ambient Assisted Living Recorded SpeechFasih Haider, Saturnino Luz. 1961-1962 [doi]
- Differential Time-frequency Log-mel Spectrogram Features for Vision Transformer Based Infant Cry RecognitionHai-tao Xu, Jie Zhang, Li-Rong Dai 0001. 1963-1967 [doi]
- Towards Automated Dialog Personalization using MBTI Personality IndicatorsDaniel Fernau, Stefan Hillmann, Nils Feldhus, Tim Polzehl. 1968-1972 [doi]
- Word-wise Sparse Attention for Multimodal Sentiment AnalysisFan Qian, Hongwei Song, Jiqing Han. 1973-1977 [doi]
- Estimation of speaker age and height from speech signal using bi-encoder transformer mixture modelTarun Gupta, Duc-Tuan Truong, Tran The Anh, Eng Siong Chng. 1978-1982 [doi]
- Exploring Multi-task Learning Based Gender Recognition and Age Estimation for Class-imbalanced DataWeiqiao Zheng, Ping Yang, Rongfeng Lai, Kongyang Zhu, Tao Zhang, Junpeng Zhang, Hongcheng Fu. 1983-1987 [doi]
- Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion RecognitionJie Wei, Guanyu Hu, Xinyu Yang, Anh Tuan Luu, Yizhuo Dong. 1988-1992 [doi]
- Impact of Background Noise and Contribution of Visual Information in Emotion Identification by Native Mandarin SpeakersMinyue Zhang, Hongwei Ding. 1993-1997 [doi]
- Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion RecognitionWei Yang, Satoru Fukayama, Panikos Heracleous, Jun Ogata. 1998-2002 [doi]
- Characterizing Therapist's Speaking Style in Relation to Empathy in PsychotherapyDehua Tao, Tan Lee, Harold Chui, Sarah Luk. 2003-2007 [doi]
- Hierarchical Attention Network for Evaluating Therapist Empathy in Counseling SessionDehua Tao, Tan Lee, Harold Chui, Sarah Luk. 2008-2012 [doi]
- Context-aware Multimodal Fusion for Emotion RecognitionJinchao Li, Shuai Wang, Yang Chao, Xunying Liu, Helen Meng. 2013-2017 [doi]
- Unsupervised Instance Discriminative Learning for Depression Detection from Speech SignalsJinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan. 2018-2022 [doi]
- How do our eyebrows respond to masks and whispering? The case of PersiansNasim Mahdinazhad Sardhaei, Marzena Zygis, Hamid Sharifzadeh. 2023-2027 [doi]
- State & Trait Measurement from Nonverbal Vocalizations: A Multi-Task Joint Learning ApproachAlice Baird, Panagiotis Tzirakis, Jeffrey A. Brooks, Lauren Kim, Michael Opara, Christopher B. Gregory, Jacob Metrick, Garrett Boseck, Dacher Keltner, Alan Cowen. 2028-2032 [doi]
- Confidence Measure for Automatic Age Estimation From SpeechAmruta Saraf, Ganesh Sivaraman, Elie Khoury 0001. 2033-2037 [doi]
- Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit QuantizationAndrea Fasoli, Chia-Yu Chen, Mauricio J. Serrano, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Kailash Gopalakrishnan. 2038-2042 [doi]
- Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech RecognitionGuangzhi Sun, Chao Zhang, Philip C. Woodland. 2043-2047 [doi]
- Bring dialogue-context into RNN-T for streaming ASRJunfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma. 2048-2052 [doi]
- Conformer with dual-mode chunked attention for joint online and offline ASRFelix Weninger, Marco Gaudesi, Md. Akmal Haidar, Nicola Ferri, Jesús Andrés-Ferrer, Puming Zhan. 2053-2057 [doi]
- Efficient Training of Neural Transducer for Speech RecognitionWei Zhou, Wilfried Michel, Ralf Schlüter, Hermann Ney. 2058-2062 [doi]
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech RecognitionZhifu Gao, Shiliang Zhang, Ian McLoughlin 0001, Zhijie Yan. 2063-2067 [doi]
- Pruned RNN-T for fast, memory-efficient ASR trainingFangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey. 2068-2072 [doi]
- Deep Sparse Conformer for Speech RecognitionXianchao Wu. 2073-2077 [doi]
- Chain-based Discriminative Autoencoders for Speech RecognitionHung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang. 2078-2082 [doi]
- Streaming parallel transducer beam search with fast slow cascaded encodersJay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L. Seltzer. 2083-2087 [doi]
- Self-regularised Minimum Latency Training for Streaming Transformer-based Speech RecognitionMohan Li, Rama Sanand Doddipatla, Catalin Zorila. 2088-2092 [doi]
- On the Prediction Network Architecture in RNN-T for ASRDario Albesano, Jesús Andrés-Ferrer, Nicola Ferri, Puming Zhan. 2093-2097 [doi]
- Minimum latency training of sequence transducers for streaming end-to-end speech recognitionYusuke Shinohara, Shinji Watanabe 0001. 2098-2102 [doi]
- CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASRKeyu An, Huahuan Zheng, Zhijian Ou, Hongyu Xiang, Ke Ding, Guanglu Wan. 2103-2107 [doi]
- Attention Enhanced Citrinet for Speech RecognitionXianchao Wu. 2108-2112 [doi]
- Simple and Effective Zero-shot Cross-lingual Phoneme RecognitionQiantong Xu, Alexei Baevski, Michael Auli. 2113-2117 [doi]
- Robust Self-Supervised Audio-Visual Speech RecognitionBowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed. 2118-2122 [doi]
- Speech Sequence Embeddings using Nearest Neighbors Contrastive LearningRobin Algayres, Adel Nabli, Benoît Sagot, Emmanuel Dupoux. 2123-2127 [doi]
- Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard CorpusJunhao Xu, Shoukang Hu, Xunying Liu, Helen Meng. 2128-2132 [doi]
- Finer-grained Modeling units-based Meta-Learning for Low-resource Tibetan Speech RecognitionSiqing Qin, Longbiao Wang, Sheng Li 0010, Yuqin Lin, Jianwu Dang. 2133-2137 [doi]
- Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech ClassificationParvaneh Janbakhshi, Ina Kodrasi. 2138-2142 [doi]
- Automated Detection of Wilson's Disease Based on Improved Mel-frequency Cepstral Coefficients with Signal DecompositionZhenglin Zhang, Lizhuang Yang, Xun Wang, Hai Li 0006. 2143-2147 [doi]
- The effect of backward noise on lexical tone discrimination in Mandarin-speaking amusicsZixia Fan, Jing Shao, Weigong Pan, Min Xu, Lan Wang. 2148-2152 [doi]
- Automatic Selection of Discriminative Features for Dementia Detection in Cantonese-Speaking PeopleXiaoquan Ke, Man-Wai Mak, Helen M. Meng. 2153-2157 [doi]
- Automated Voice Pathology Discrimination from Continuous Speech Benefits from Analysis by Phonetic ContextZhuoya Liu, Mark A. Huckvale, Julian McGlashan. 2158-2162 [doi]
- Multi-Type Outer Product-Based Fusion of Respiratory Sounds for Detecting COVID-19Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller. 2163-2167 [doi]
- Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization CharacteristicsXueshuai Zhang, Jiakun Shen, Jun Zhou, Pengyuan Zhang, Yonghong Yan 0002, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shaoxing Zhang, Aijun Sun. 2168-2172 [doi]
- Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiersFarhad Javanmardi, Sudarsana Reddy Kadiri, Manila Kodali, Paavo Alku. 2173-2177 [doi]
- Zero-Shot Cross-lingual Aphasia Detection using Automatic Speech RecognitionGerasimos Chatzoudis, Manos Plitsis, Spyridoula Stamouli, Athanasia-Lida Dimou, Nassos Katsamanis, Vassilis Katsouros. 2178-2182 [doi]
- Domain-aware Intermediate Pretraining for Dementia Detection with Limited DataYouxiang Zhu, Xiaohui Liang, John A. Batsis, Robert M. Roth. 2183-2187 [doi]
- Comparison of 5 methods for the evaluation of intelligibility in mild to moderate French dysarthric speechCécile Fougeron, Nicolas Audibert, Ina Kodrasi, Parvaneh Janbakhshi, Michaela Pernon, Nathalie Lévêque, Stephanie Borel, Marina Laganaro, Hervé Bourlard, Frédéric Assal. 2188-2192 [doi]
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain AdaptationKuan-Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee. 2193-2197 [doi]
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech RecognitionGuan-Ting Lin, Shang-wen Li 0001, Hung-yi Lee. 2198-2202 [doi]
- Distilling a Pretrained Language Model to a Multilingual ASR ModelKwangHee Choi, Hyung-Min Park. 2203-2207 [doi]
- Text-Only Domain Adaptation Based on Intermediate CTCHiroaki Sato, Tomoyasu Komori, Takeshi Mishima, Yoshihiko Kawai, Takahiro Mochizuki, Shoei Sato, Tetsuji Ogawa. 2208-2212 [doi]
- Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter WarpingJenthe Thienpondt, Kris Demuynck. 2213-2217 [doi]
- Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR ModelsYuki Takashima, Shota Horiguchi, Shinji Watanabe 0001, Leibny Paola García-Perera, Yohei Kawaguchi. 2218-2222 [doi]
- Improved CNN-Transformer using Broadcasted Residual Learning for Text-Independent Speaker VerificationJeong Hwan Choi, Joon-Young Yang, Ye-Rin Jeoung, Joon-Hyuk Chang. 2223-2227 [doi]
- Pushing the limits of raw waveform speaker recognitionJee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung. 2228-2232 [doi]
- PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language IdentificationHexin Liu, Leibny Paola García-Perera, Andy W. H. Khong, Suzy J. Styles, Sanjeev Khudanpur. 2233-2237 [doi]
- Prosodic Information in Dialect Identification of a Tonal Language: The case of AoMoakala Tzudir, Priyankoo Sarmah, S. R. Mahadeva Prasanna. 2238-2242 [doi]
- A Multimodal Strategy for Singing Language IdentificationWo Jae Lee, Emanuele Coviello. 2243-2247 [doi]
- A comparative study on vowel articulation in Parkinson's disease and multiple system atrophyKhalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Margherita Fabbri, Anne Pavy-Le Traon, Olivier Rascol, Virginie Woisard, Wassilios G. Meissner. 2248-2252 [doi]
- Voicing decision based on phonemes classification and spectral moments for whisper-to-speech conversionLuc Ardaillon, Nathalie Henrich, Olivier Perrotin. 2253-2257 [doi]
- Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing TasksTanya Talkar, Christina Manxhari, James J. Williamson, Kara M. Smith, Thomas F. Quatieri. 2258-2262 [doi]
- Investigating the Impact of Speech Compression on the Acoustics of Dysarthric SpeechKelvin Tran, Lingfeng Xu, Gabriela Stegmann, Julie Liss, Visar Berisha, Rene Utianski. 2263-2267 [doi]
- Speaker Trait Enhancement for Cochlear Implant Users: A Case Study for Speaker Emotion PerceptionAvamarie Brueggeman, John H. L. Hansen. 2268-2272 [doi]
- Optimal thyroplasty implant shape and stiffness for treatment of acute unilateral vocal fold paralysis: Evidence from a canine in vivo phonation modelNeha Reddy, Yoonjeong Lee, Zhaoyan Zhang, Dinesh K. Chhetri. 2273-2277 [doi]
- XLS-R: Self-supervised Cross-lingual Speech Representation Learning at ScaleArun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. 2278-2282 [doi]
- Semantically Meaningful Metrics for Norwegian ASR SystemsJanine Rugayan, Torbjørn Svendsen, Giampiero Salvi. 2283-2287 [doi]
- Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASROndrej Klejch, Electra Wallington, Peter Bell 0001. 2288-2292 [doi]
- Linguistically Informed Post-processing for ASR Error correction in SanskritRishabh Kumar, Devaraja Adiga, Rishav Ranjan, Amrith Krishna, Ganesh Ramakrishnan, Pawan Goyal 0002, Preethi Jyothi. 2293-2297 [doi]
- Cross-lingual articulatory feature information transfer for speech recognition using recurrent progressive neural networksMahir Morshed, Mark Hasegawa-Johnson. 2298-2302 [doi]
- Comparison of Models for Detecting Off-Putting Speaking StylesDiego Aguirre, Nigel Ward, Jonathan E. Avila, Heike Lehnert-LeHouillier. 2303-2307 [doi]
- Multimodal Persuasive Dialogue Corpus using Teleoperated AndroidSeiya Kawano, Muteki Arioka, Akishige Yuguchi, Kenta Yamamoto, Koji Inoue, Tatsuya Kawahara, Satoshi Nakamura 0001, Koichiro Yoshino. 2308-2312 [doi]
- Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTSYookyung Shin, Younggun Lee, Suhee Jo, Yeongtae Hwang, Taesu Kim. 2313-2317 [doi]
- Strategies for developing a Conversational Speech Dataset for Text-To-Speech SynthesisAdaeze O. Adigwe, Esther Klabbers. 2318-2322 [doi]
- Deep CNN-based Inductive Transfer Learning for Sarcasm Detection in SpeechXiyuan Gao, Shekhar Nayak, Matt Coler. 2323-2327 [doi]
- End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous DialogueKentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda. 2328-2332 [doi]
- Attention-based conditioning methods using variable frame rate for style-robust speaker verificationAmber Afshan, Abeer Alwan. 2333-2337 [doi]
- Learning from human perception to improve automatic speaker verification in style-mismatched conditionsAmber Afshan, Abeer Alwan. 2338-2342 [doi]
- Exploring audio-based stylistic variation in podcastsKatariina Martikainen, Jussi Karlgren, Khiet Truong. 2343-2347 [doi]
- Automatic Evaluation of Speaker SimilarityKamil Deja, Ariadna Sanchez, Julian Roth, Marius Cotescu. 2348-2352 [doi]
- Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)Ziyao Zhang, Alessio Falai, Ariadna Sanchez, Orazio Angelini, Kayoko Yanagisawa. 2353-2357 [doi]
- J-MAC: Japanese multi-speaker audiobook corpus for speech synthesisShinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari. 2358-2362 [doi]
- REYD - The First Yiddish Text-to-Speech Dataset and SystemJacob Webber, Samuel K. Lo, Isaac L. Bleaman. 2363-2367 [doi]
- Data-augmented cross-lingual synthesis in a teacher-student frameworkMarcel de Korte, Jaebok Kim, Aki Kunikoshi, Adaeze Adigwe, Esther Klabbers. 2368-2372 [doi]
- Production characteristics of obstruents in WaveNet and older TTS systemsAyushi Pandey, Sébastien Le Maguer, Julie Carson-Berndsen, Naomi Harte. 2373-2377 [doi]
- Back to the Future: Extending the Blizzard Challenge 2013Sébastien Le Maguer, Simon King, Naomi Harte. 2378-2382 [doi]
- BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpusJosh Meyer, David Ifeoluwa Adelani, Edresson Casanova, Alp Öktem, Daniel Whitenack, Julian Weber, Salomon Kabongo, Elizabeth Salesky, Iroro Orife, Colin Leong, Perez Ogayo, Chris Chinenye Emezue, Jonathan Mukiibi, Salomey Osei, Apelete Agbolo, Victor Akinode, Bernard Opoku, Samuel Olanrewaju, Jesujoba Alabi, Shamsuddeen Hassan Muhammad. 2383-2387 [doi]
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech SynthesisGeorgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis. 2388-2392 [doi]
- Domain Generalization with Relaxed Instance Frequency-wise Normalization for Multi-device Acoustic Scene ClassificationByeonggeun Kim, Seunghan Yang, Jangho Kim, Hyunsin Park, Juntae Lee, Simyung Chang. 2393-2397 [doi]
- Couple learning for semi-supervised sound event detectionRui Tao, Long Yan, Kazushige Ouchi, Xiangdong Wang. 2398-2402 [doi]
- Oktoechos Classification in Liturgical Music Using SBU-LSTM/GRURajeev Rajan, Ananya Ayasi. 2403-2407 [doi]
- SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw WaveformsYuhang He, Andrew Markham. 2408-2412 [doi]
- ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep LearningChristian Bergler, Alexander Barnhill, Dominik Perrin, Manuel Schmitt, Andreas K. Maier, Elmar Nöth. 2413-2417 [doi]
- Convolutional Recurrent Neural Network with Auxiliary Stream for Robust Variable-Length Acoustic Scene ClassificationJoon-Hyuk Chang, Won-Gook Choi. 2418-2422 [doi]
- Unsupervised Symbolic Music Segmentation using Ensemble Temporal Prediction ErrorsShahaf Bassan, Yossi Adi, Jeffrey S. Rosenschein. 2423-2427 [doi]
- Visually-aware Acoustic Event Detection using Heterogeneous GraphsAmir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha. 2428-2432 [doi]
- A Passive Similarity based CNN Filter Pruning for Efficient Acoustic Scene ClassificationArshdeep Singh, Mark D. Plumbley. 2433-2437 [doi]
- MAE-AST: Masked Autoencoding Audio Spectrogram TransformerAlan Baade, Puyuan Peng, David Harwath. 2438-2442 [doi]
- What can Speech and Language Tell us About the Working Alliance in PsychotherapySebastian Peter Bayerl, Gabriel Roccabruna, Shammur Absar Chowdhury, Tommaso Ciulli, Morena Danieli, Korbinian Riedhammer, Giuseppe Riccardi. 2443-2447 [doi]
- TB or not TB? Acoustic cough analysis for tuberculosis classificationGeoffrey T. Frost, Grant Theron, Thomas Niesler. 2448-2452 [doi]
- Are reported accuracies in the clinical speech machine learning literature overoptimistic?Visar Berisha, Chelsea Krantsevich, Gabriela Stegmann, Shira Hahn, Julie Liss. 2453-2457 [doi]
- Automatic Detection of Expressed Emotion from Five-Minute Speech Samples: Challenges and OpportunitiesBahman Mirheidari, André Bittar, Nicholas Cummins, Johnny Downs, Helen L. Fisher, Heidi Christensen. 2458-2462 [doi]
- Automatic cognitive assessment: Combining sparse datasets with disparate cognitive scoresBahman Mirheidari, Daniel Blackburn, Heidi Christensen. 2463-2467 [doi]
- Exploring Semi-supervised Learning for Audio-based COVID-19 Detection using FixMatchTing Dang, Thomas Quinnell, Cecilia Mascolo. 2468-2472 [doi]
- Analyzing the impact of SARS-CoV-2 variants on respiratory sound signalsDebarpan Bhattacharya, Debottam Dutta, Neeraj Sharma, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, Chandrakiran C, Sahiti Nori, Suhail K. K, Sadhana Gonuguntla, Murali Alagesan. 2473-2477 [doi]
- Automated Evaluation of Standardized Dementia Screening TestsFranziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer. 2478-2482 [doi]
- Alzheimer's Detection from English to Spanish Using Acoustic and Linguistic EmbeddingsPaula Andrea Pérez-Toro, Philipp Klumpp, Abner Hernandez, Tomas Arias, Patricia Lillo, Andrea Slachevsky, Adolfo Martín García, Maria Schuster, Andreas K. Maier, Elmar Nöth, Juan Rafael Orozco-Arroyave. 2483-2487 [doi]
- Extract and Abstract with BART for Clinical Notes from Doctor-Patient ConversationsJing Su, Longxiang Zhang, Hamid Reza Hassanzadeh, Thomas Schaaf. 2488-2492 [doi]
- Dyadic Interaction Assessment from Free-living Audio for Depression Severity AssessmentBishal Lamichhane, Nidal Moukaddam, Ankit B. Patel, Ashutosh Sabharwal. 2493-2497 [doi]
- COVID-19 detection based on respiratory sensing from speechVenkata Srikanth Nallanthighal, Aki Härmä, Helmer Strik. 2498-2502 [doi]
- Bifurcation and Reunion: A Loss-Guided Two-Stage Approach for Monaural Speech DereverberationXiaoXue Luo, Chengshi Zheng, Andong Li, Yuxuan Ke, Xiaodong Li 0002. 2503-2507 [doi]
- A deep complex multi-frame filtering network for stereophonic acoustic echo cancellationLinjuan Cheng, Chengshi Zheng, Andong Li, Yuquan Wu, Renhua Peng, Xiaodong Li 0002. 2508-2512 [doi]
- Speaker- and Phone-aware Convolutional Transformer Network for Acoustic Echo CancellationChang-Han, Weiping Tu, Yuhong Yang, Jingyi Li, Xinhong Li. 2513-2517 [doi]
- Personalized Acoustic Echo Cancellation for Full-duplex CommunicationsShimin Zhang, Ziteng Wang, Yukai Ju, Yihui Fu, Yueyue Na, Qiang Fu, Lei Xie. 2518-2522 [doi]
- LCSM: A Lightweight Complex Spectral Mapping Framework for Stereophonic Acoustic Echo CancellationChenggang Zhang, Jinjiang Liu, Xueliang Zhang 0001. 2523-2527 [doi]
- Joint Neural AEC and Beamforming with Double-Talk DetectionVinay Kothapally, Yong Xu 0004, Meng Yu 0003, Shi-Xiong Zhang, Dong Yu 0001. 2528-2532 [doi]
- Clock Skew Robust Acoustic Echo CancellationKarim Helwani, Erfan Soltanmohammadi, Michael Mark Goodwin, Arvindh Krishnaswamy. 2533-2537 [doi]
- A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR AccuracySankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park 0001, James Walker, Alexander Gruenstein. 2538-2542 [doi]
- Complex-Valued Time-Frequency Self-Attention for Speech DereverberationVinay Kothapally, John H. L. Hansen. 2543-2547 [doi]
- Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target SpeakersLiumeng Xue, Shan Yang, Na Hu, Dan Su 0002, Lei Xie 0001. 2548-2552 [doi]
- Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice ConversionSicheng Yang, Methawee Tantrawenith, Haolin Zhuang, Zhiyong Wu 0001, Aolan Sun, Jianzong Wang, Ning Cheng, Huaizhen Tang, Xintao Zhao, Jie Wang, Helen Meng. 2553-2557 [doi]
- FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice ConversionJiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang. 2558-2562 [doi]
- Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice ConversionYi Lei, Shan Yang, Jian Cong, Lei Xie 0001, Dan Su 0002. 2563-2567 [doi]
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot ScenariosYihan Wu, Xu Tan 0003, Bohan Li 0003, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, Tie-Yan Liu. 2568-2572 [doi]
- Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech SynthesisYixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu 0001, Yanyao Bian, Dan Su 0002, Helen Meng. 2573-2577 [doi]
- Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice ConversionHaoquan Yang, Liqun Deng, Yu Ting Yeung, Nianzu Zheng, Yong Xu. 2578-2582 [doi]
- Accent Conversion using Pre-trained Model and Synthesized Data from Voice ConversionTuan Nam Nguyen, Ngoc-Quan Pham, Alexander Waibel. 2583-2587 [doi]
- VoiceMe: Personalized voice generation in TTSPol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André, Nori Jacoby. 2588-2592 [doi]
- DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice ConversionRuibin Yuan, Yuxuan Wu, Jacob Li, Jaxter Kim. 2593-2597 [doi]
- Towards Improved Zero-shot Voice Conversion with Conditional DSVAEJiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu 0001. 2598-2602 [doi]
- Disentanglement of Emotional Style and Speaker Identity for Expressive Voice ConversionZongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li 0001. 2603-2607 [doi]
- Internal Language Model Adaptation with Text-Only Data for End-to-End Speech RecognitionZhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li 0001, Xie Chen, Yu Wu, Yifan Gong 0001. 2608-2612 [doi]
- A Complementary Joint Training Approach Using Unpaired Speech and TextYe-Qian Du, Jie Zhang, Qiu-Shi Zhu, Lirong Dai 0001, Ming-hui Wu, Xin Fang, Zhou-Wang Yang. 2613-2617 [doi]
- Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech RecognitionXun Gong 0005, Zhikai Zhou, Yanmin Qia