
Graduate Student: 卓祐詮 (Jhuo, You-Cyuan)
Thesis Title: 使用基於語音後驗機率的語音編輯進行資料擴增以實現構音障礙語音識別
Data Augmentation with PPG-Based Phone Editing for Dysarthric Speech Recognition
Advisor: 吳宗憲 (Wu, Chung-Hsien)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2024
Academic Year of Graduation: 112 (ROC calendar)
Language: English
Number of Pages: 77
Keywords (Chinese): 構音障礙資料擴增、構音障礙語音辨識、語音編輯、語音後驗機率
Keywords (English): Dysarthric Data Augmentation, Dysarthric Speech Recognition, Phone Editing, Phonetic Posteriorgrams
    Individuals with dysarthria suffer from impaired articulation and low speech intelligibility, so using automatic speech recognition systems to support their communication is of great importance. However, mispronunciations and distinctive speech characteristics make dysarthric speech recognition highly challenging. Techniques tailored to dysarthric speech recognition, such as transfer learning and data augmentation, are therefore needed to improve recognition accuracy.
    This thesis proposes a data augmentation method that edits the phones of typical speech according to speaker-specific pronunciation-variation information, generating synthetic speech that imitates the pronunciation patterns of dysarthric speakers. It also examines the effectiveness of integrating other existing speaker-dependent dysarthric augmentation techniques, including speaking style conversion and phone-level speed perturbation. In experiments on the dysarthric speech corpus UASpeech, fine-tuning the HuBERT speech recognition model on the augmented dataset effectively improves recognition of dysarthric speech, achieving an overall word error rate of 19.53%. Compared with the previous best system using a single recognition model, which augmented dysarthric speech with a generative adversarial network, our overall word error rate is 0.25% lower. On the low-intelligibility (Low) and very-low-intelligibility (Very-Low) subsets, our word error rates reach 20.45% and 53.71%, 0.28% and 1.05% lower than the previous system, respectively.
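    As a rough illustration of the phone-editing idea described above (a minimal sketch under assumed representations, not the thesis implementation): a PPG can be viewed as a frames-by-phones matrix of posterior probabilities, and a speaker-dependent phonetic mapping (here a hypothetical /s/ to /t/ substitution) moves the posterior mass of the canonical phone to its substituted phone in the frames aligned to that phone, after which the edited PPG would be fed to a PPG-to-speech synthesizer. All names and values below (edit_ppg, phone2idx, the toy posteriors) are invented for illustration.

        # Minimal sketch of PPG-based phone editing on a toy posteriorgram.
        import numpy as np

        def edit_ppg(ppg, frame_phones, phone_map, phone2idx):
            """ppg: (T, P) frame-level phone posteriors; frame_phones: length-T canonical
            phone labels from forced alignment; phone_map: speaker-specific substitutions."""
            edited = ppg.copy()
            for t, ph in enumerate(frame_phones):
                if ph in phone_map:
                    src, dst = phone2idx[ph], phone2idx[phone_map[ph]]
                    edited[t, dst] += edited[t, src]   # transfer posterior mass to the substituted phone
                    edited[t, src] = 0.0
            return edited / edited.sum(axis=1, keepdims=True)  # keep each frame a distribution

        phone2idx = {'s': 0, 't': 1, 'ah': 2}
        ppg = np.array([[0.80, 0.10, 0.10],    # frames aligned to /s s ah ah/
                        [0.70, 0.20, 0.10],
                        [0.10, 0.10, 0.80],
                        [0.05, 0.05, 0.90]])
        edited = edit_ppg(ppg, ['s', 's', 'ah', 'ah'], {'s': 't'}, phone2idx)
        print(edited.round(2))  # the /s/ frames are now dominated by /t/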

    Individuals suffering from dysarthria experience impaired articulation and reduced speech intelligibility. Automatic Speech Recognition (ASR) systems, crucial for improving communication, face challenges in recognizing dysarthric speech due to mispronunciations and unique speech characteristics. Specialized ASR techniques, including transfer learning and data augmentation, are essential for enhancing recognition accuracy.
    This dissertation proposes a data augmentation approach that utilizes knowledge of speaker-dependent pronunciation variations to edit phones of typical speech, creating synthetic speech that resembles the pronunciation patterns of dysarthric speakers. We also explore the effectiveness of integrating existing personalized augmentation techniques, including speaking style conversion and phonetic-level speed perturbation. In experiments using the dysarthric corpus UASpeech, fine-tuning the HuBERT ASR model on the augmented dataset demonstrates improved recognition performance, achieving an overall word error rate (WER) of 19.53%. For the representative subgroups with low and very-low intelligibility, WERs of 20.45% and 53.71% were achieved. The previous best-performing system using a single ASR model augmented data with a generative adversarial network (GAN). Our results surpass that system by 0.25% overall, and by 0.28% and 1.05% for the low and very-low intelligibility groups.
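    Taking the reported differences at face value, the comparison with the GAN-augmentation baseline can be sanity-checked with a few additions; the snippet below assumes the stated reductions are absolute percentage points, which the abstracts do not state explicitly.

        # Implied WERs of the previous GAN-based augmentation system, assuming
        # the reported gains are absolute percentage-point reductions.
        ours      = {"overall": 19.53, "low": 20.45, "very_low": 53.71}  # this thesis
        reduction = {"overall":  0.25, "low":  0.28, "very_low":  1.05}  # reported gains
        baseline  = {k: round(v + reduction[k], 2) for k, v in ours.items()}
        print(baseline)  # {'overall': 19.78, 'low': 20.73, 'very_low': 54.76}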

    摘要 (Abstract)
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Background
        1.2 Motivation and Goals
        1.3 Literature Review
            1.3.1 Dysarthric Speech Recognition
            1.3.2 Dysarthric Data Augmentation
            1.3.3 Phonetic Posteriorgrams and Phone Editing
        1.4 Problems
        1.5 Brief Description of Research Methods
    Chapter 2 Research Methods
        2.1 Pronunciation Variation Analysis
            2.1.1 Phonetic Mapping
            2.1.2 Phonetic Mapping Assignments
            2.1.3 Phonetic Mapping Matrix
        2.2 PPG-Based Phone Editing
        2.3 PPG-to-Speech Synthesizer
            2.3.1 Speech Decomposition
            2.3.2 Source-Filter Model
            2.3.3 Self-Supervised Learning with Joint Optimization
        2.4 Speed Perturbation
            2.4.1 Waveform Similarity Overlap-Add
            2.4.2 Phonetic-Level Speed Perturbation
        2.5 HuBERT ASR
            2.5.1 Masked Prediction
            2.5.2 CTC Loss
            2.5.3 Viterbi Decoding
    Chapter 3 Experimental Setup and Result
        3.1 Dataset
            3.1.1 UASpeech
            3.1.2 VCTK
        3.2 Baseline
            3.2.1 GAN-Based Data Augmentation
            3.2.2 Setting
        3.3 Experimental Setting
            3.3.1 Phone Editing
            3.3.2 PPG-to-Speech Synthesizer
            3.3.3 Speed Perturbation
            3.3.4 HuBERT ASR
        3.4 Results
            3.4.1 Evaluations of PPG-to-Speech Synthesizer
            3.4.2 Comparison with Baseline
            3.4.3 Ablation Study
            3.4.4 Comparison with Competitive Systems
    Chapter 4 Conclusion and Future Work
    Reference

