Graduate Student: Hsieh, I-Ting (謝宜廷)
Thesis Title: Dynamic Sampling Meta-learning and Hierarchical Curriculum Learning for Under-resourced Automatic Speech Recognition (動態取樣元學習及階層式課程學習應用於低資源語音辨識)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Doctor (博士)
Department: Multimedia System and Intelligent Computing Ph.D. Degree Program, College of Electrical Engineering and Computer Science (電機資訊學院 - 多媒體系統與智慧型運算工程博士學位學程)
Year of Publication: 2026
Graduation Academic Year: 114
Language: English
Number of Pages: 121
Keywords (Chinese): 低資源語音辨識、台語自動語音辨識、病態語音辨識、課程學習、元學習
Keywords (English): Low-Resource Speech Recognition, Taiwanese Automatic Speech Recognition, Pathological Speech Recognition, Curriculum Learning, Meta-learning
    隨著深度學習與自監督式語音模型(Self-Supervised Speech Models)的迅速發展,自動語音辨識(Automatic Speech Recognition, ASR)技術已在高資源語言及一般語者語音上取得顯著成果。然而,當面對資料稀缺或語音特性高度變異的低資源情境時,ASR效能仍明顯受限。本論文以「低資源語音辨識」為核心研究主題,針對兩個具有代表性的應用場景:台語(Taiwanese)與病態語音(Pathological Speech),系統性地探討在資料受限條件下如何提升語音辨識的穩定性與泛化能力。這兩者雖同屬低資源問題,但挑戰性質不同:台語ASR主要受到語料不足與語言差異性的限制;而病態語音ASR則因語者間的生理差異與構音變異過大,導致語音特徵分佈極為不均。本研究針對這兩種挑戰提出多層次的學習策略,以達成資料效率與辨識效能的平衡。
    在台語ASR部分,本研究初期著重於如何利用普通話語料來輔助台語ASR的建模。由於台語與普通話之間存在大量共通音素(Common Phonemes),本研究藉由此語音對應關係提出聲學與文字層面的資料擴增(Acoustic and Textual Data Augmentation)方法,將普通話語音的聲學特徵與詞彙資訊轉化為可應用於台語ASR的輔助訓練資料,以彌補台語語料不足的問題。進一步地,為了最大化不同輔助語言對台語ASR的效益,本研究提出基於動態採樣的元學習(Dynamic Sampling-based Meta-Learning)策略,從多語音資料中挑選與台語音韻結構最為相近的語言語料,進行元學習訓練。此方法能讓模型在多語環境中自動學習具遷移性的聲學知識,提升跨語言的可適應性與建模能力。實驗結果顯示,所提出的方法能有效提升台語ASR在低資源條件下的辨識正確率與訓練穩定性,證實語音共通性與跨語言學習的結合能顯著改善低資源語言辨識效能。
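    To make the common-phoneme idea above concrete, the sketch below (Python) shows how Mandarin training utterances could be relabelled into a Taiwanese phone set through a shared-phoneme table and kept only when every phone is shared; the phone symbols, the mapping, and the filtering rule are illustrative assumptions, not the phoneme inventory or augmentation pipeline used in this dissertation.

        # Illustrative common-phoneme relabelling; the table below is a toy
        # mapping, not the thesis's Mandarin-Taiwanese phoneme correspondence.
        MANDARIN_TO_TAIWANESE = {
            "p": "p", "ph": "ph", "t": "t", "th": "th", "k": "k", "kh": "kh",
            "m": "m", "n": "n", "ng": "ng", "l": "l", "s": "s",
            "a": "a", "i": "i", "u": "u", "e": "e", "o": "o",
        }

        def relabel_utterance(mandarin_phones):
            """Return the utterance transcribed in the Taiwanese phone set,
            or None if any phone has no Taiwanese counterpart."""
            mapped = [MANDARIN_TO_TAIWANESE.get(p) for p in mandarin_phones]
            return mapped if all(m is not None for m in mapped) else None

        print(relabel_utterance(["th", "a", "n"]))  # ['th', 'a', 'n'] -> usable
        print(relabel_utterance(["zh", "a", "n"]))  # None -> unshared phone, skip

    Utterances that survive the filter would then serve as auxiliary acoustic training data; the textual side of the augmentation described above is not shown here.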
    相較之下,病態語音ASR面臨更為嚴峻的挑戰。除了同樣受到資料稀缺的限制外,病態語音因構音失調與語者可懂度(Intelligibility)差異極大,使模型難以擷取穩定的聲學特徵。本研究首先針對電子喉(Electrolarynx)語音提出基於音素親和矩陣(Phoneme Affinity Matrix)的資料選擇方法,透過分析音素分佈的相似度挑選具代表性的訓練樣本,以提高模型對特殊語音型態的辨識穩定性。進一步地,針對構音障礙語音辨識(Dysarthric Speech Recognition),本研究提出課程式學習(Curriculum Learning, CL)策略,依據語音可懂度分級,從高可懂度語音逐步訓練至低可懂度語音,使模型能以漸進方式習得不同發音難度下的聲學特徵,達到更穩健的訓練收斂。
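    As a rough illustration of intelligibility-ordered curriculum learning, the sketch below groups training speakers into stages from high to low intelligibility so that later stages accumulate progressively harder speech; the threshold values, the speaker records, and the field names are assumptions made for illustration only.

        # Toy curriculum construction: stage k contains all speakers whose
        # intelligibility score is at or above thresholds[k].
        def build_curriculum(speakers, thresholds=(75, 50, 25, 0)):
            ordered = sorted(speakers, key=lambda s: s["intelligibility"], reverse=True)
            return [[s for s in ordered if s["intelligibility"] >= t] for t in thresholds]

        speakers = [{"id": "F03", "intelligibility": 95},
                    {"id": "M05", "intelligibility": 58},
                    {"id": "M01", "intelligibility": 15}]
        for k, stage in enumerate(build_curriculum(speakers)):
            print(k, [s["id"] for s in stage])
        # 0 ['F03']   1 ['F03', 'M05']   2 ['F03', 'M05']   3 ['F03', 'M05', 'M01']

    Training then proceeds stage by stage, so the model first converges on highly intelligible speech before being exposed to the most impaired speakers.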
    為進一步促進不同可懂度群體間的知識共享,本研究提出階層式課程學習(Hierarchical Curriculum Learning, HCL)與多層知識蒸餾(Multi-Level Knowledge Distillation, ML-KD)架構。HCL透過階層化訓練順序,使模型能分層學習由高可懂度到低可懂度的語音特徵;ML-KD則在不同層級模型間建立教師—學生(Teacher–Student)關係,將高可懂度語音的知識逐層傳遞至低可懂度模型。實驗結果顯示,結合HCL與ML-KD後,模型在UASpeech資料庫上能顯著降低字錯率(Word Error Rate, WER),平均改善超過10%,並有效縮小不同可懂度群體間的辨識落差。
    綜上所述,本論文從語言資料稀缺與語音變異過高兩個面向出發,提出多項具體策略以提升ASR的資料利用效率與模型適應能力。研究結果顯示,低資源語言(如台語)的問題可藉由基於音素共通性的跨語言資料擴增與元學習導向的語料選擇有效解決;而病態語音的高度變異性則可透過階層式課程訓練與知識蒸餾機制進行穩定化學習。整體而言,本研究建立了一條從語料選擇、模型學習到知識遷移的完整研究路徑,為低資源語音與病態語音辨識提供具系統性與可延展性的解決方案,並對語音技術於少數語言及臨床應用的發展具有重要貢獻。

    With the rapid progress of deep learning and self-supervised speech models, automatic speech recognition (ASR) has achieved remarkable success in high-resource languages and normal speech. However, its performance remains limited in low-resource scenarios, where data are scarce or speech exhibits high variability. This dissertation focuses on low-resource speech recognition, addressing two representative cases—Taiwanese and pathological speech—to systematically explore how to enhance recognition robustness and generalization under data-limited conditions. Although both fall under low-resource problems, their challenges differ: Taiwanese ASR suffers from insufficient training data and linguistic divergence, whereas pathological ASR faces extreme articulatory and speaker-dependent variability. To address these challenges, this study proposes multi-level learning strategies that achieve a balance between data efficiency and recognition performance.
    For Taiwanese ASR, this research initially investigates how to leverage Mandarin as an auxiliary resource, taking advantage of the shared phonemes between the two languages. By exploiting this phonetic correspondence, an acoustic and textual data augmentation framework is proposed to transfer acoustic and lexical information from Mandarin to Taiwanese, thereby compensating for data scarcity. Furthermore, to maximize the contribution of auxiliary languages, a dynamic sampling-based meta-learning strategy is developed to identify and utilize the most phonetically compatible multilingual corpora for training. This enables the model to acquire transferable acoustic representations in a multilingual setting. Experimental results show that the proposed approaches significantly improve recognition accuracy and stability for Taiwanese ASR under low-resource conditions, demonstrating that phonetic similarity and cross-lingual learning effectively enhance low-resource language recognition.
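    The following sketch shows one way such dynamic sampling-based meta-learning could be organized: a first-order MAML-style loop in which each meta-step samples an auxiliary corpus with probability proportional to a precomputed phonetic-similarity score to the target language. The task interface (support_batches, query_batch), the similarity scores, and all hyperparameters are assumptions for illustration, not the implementation evaluated in this dissertation.

        # First-order MAML with similarity-weighted task sampling (PyTorch).
        import copy
        import random
        import torch
        import torch.nn as nn

        def sample_task(tasks, scores, temperature=1.0):
            # Higher phonetic similarity to the target language -> higher
            # probability of being sampled for this meta-step.
            weights = torch.softmax(
                torch.tensor(scores, dtype=torch.float32) / temperature, dim=0
            ).tolist()
            return random.choices(tasks, weights=weights, k=1)[0]

        def meta_train(model, tasks, scores, meta_steps=1000,
                       inner_lr=1e-3, meta_lr=1e-4, inner_steps=3):
            meta_opt = torch.optim.Adam(model.parameters(), lr=meta_lr)
            loss_fn = nn.CrossEntropyLoss()
            for _ in range(meta_steps):
                task = sample_task(tasks, scores)
                # Inner loop: adapt a copy of the shared model to the corpus.
                fast = copy.deepcopy(model)
                inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
                for feats, labels in task.support_batches(inner_steps):
                    inner_opt.zero_grad()
                    loss_fn(fast(feats), labels).backward()
                    inner_opt.step()
                # Outer loop (first-order): gradients of the adapted copy on a
                # held-out batch update the shared initialization.
                fast.zero_grad()
                feats, labels = task.query_batch()
                loss_fn(fast(feats), labels).backward()
                for p, fp in zip(model.parameters(), fast.parameters()):
                    p.grad = fp.grad.clone()
                meta_opt.step()
            return model

    Note that this sketch keeps the similarity scores fixed, whereas a dynamic scheme would update the sampling weights as training progresses.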
    In contrast, pathological speech ASR poses more severe challenges. Besides limited data, the articulatory impairments and large intelligibility variations among speakers make it difficult to capture consistent acoustic features. This study first introduces a phoneme affinity matrix-based data selection method for electrolaryngeal speech, which selects representative samples by analyzing phoneme distribution similarity to improve model robustness. For dysarthric speech recognition, a curriculum learning (CL) strategy is employed, in which the model is trained progressively from high- to low-intelligibility speech, enabling gradual adaptation to articulation difficulty and improving training stability.
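    As a minimal illustration of selecting training data by phoneme-distribution similarity, the snippet below ranks candidate utterances by the cosine similarity between their phoneme-frequency vectors and the target (electrolaryngeal) distribution and keeps the closest ones; collapsing the affinity computation to plain cosine similarity is a simplification made here, not the exact affinity-matrix formulation of the dissertation.

        # Rank candidates by similarity of phoneme distributions (NumPy).
        import numpy as np

        def select_utterances(candidate_vectors, target_vector, top_k):
            cands = np.asarray(candidate_vectors, dtype=float)
            target = np.asarray(target_vector, dtype=float)
            cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
            target = target / np.linalg.norm(target)
            similarity = cands @ target          # cosine similarity per utterance
            return np.argsort(similarity)[::-1][:top_k]

        # Example: 4 candidates over a 3-phoneme inventory, keep the 2 closest.
        picked = select_utterances([[5, 1, 0], [1, 5, 1], [4, 2, 1], [0, 0, 6]],
                                   target_vector=[6, 2, 1], top_k=2)
        print(picked)  # [2 0] -> candidates 2 and 0 match the target best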
    To promote knowledge sharing across intelligibility levels, this dissertation proposes a hierarchical curriculum learning (HCL) framework with multi-level knowledge distillation (ML-KD). HCL structures the training process hierarchically, allowing the model to learn representations from high- to low-intelligibility speech, while ML-KD transfers knowledge through teacher–student pairs across levels. Experimental results on the UASpeech database show that combining HCL and ML-KD substantially reduces the word error rate (WER) by more than 10% on average and effectively narrows performance gaps among intelligibility groups.
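    A minimal sketch of the teacher-student distillation component is given below, assuming frame-level logits of matching shape from a teacher trained on higher-intelligibility speech; the interpolation weight, temperature, and tensor shapes are illustrative and are not the settings reported for UASpeech.

        # Soft-label distillation from a higher-intelligibility teacher (PyTorch).
        import torch
        import torch.nn.functional as F

        def distillation_loss(student_logits, teacher_logits, targets,
                              alpha=0.5, temperature=2.0):
            # Hard-label cross-entropy keeps the student anchored to the
            # transcriptions; the KL term pulls its distribution toward the
            # teacher trained on more intelligible speech.
            hard = F.cross_entropy(student_logits, targets)
            soft = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            return alpha * hard + (1.0 - alpha) * soft

        s = torch.randn(8, 30)           # student logits: 8 frames, 30 classes
        t = torch.randn(8, 30)           # teacher logits of the same shape
        y = torch.randint(0, 30, (8,))   # hard frame labels
        print(distillation_loss(s, t, y))

    Chaining such pairs level by level, so that each intelligibility group learns from the next more intelligible one, is the multi-level aspect described above.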
    In summary, this dissertation addresses data scarcity and acoustic variability in low-resource ASR through cross-lingual learning, curriculum-based training, and knowledge distillation. The results demonstrate that low-resource languages such as Taiwanese can benefit from phonetic similarity-based data augmentation and meta-learning-guided database selection, while pathological speech requires hierarchical and knowledge-driven training for stable adaptation. Overall, this work establishes a comprehensive and extensible framework for low-resource and pathological speech recognition, contributing to the advancement of speech technology for minority languages and clinical applications.

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Background
        1.2 Motivation
        1.3 Contributions
        1.4 The Organization of this Dissertation
    Chapter 2 Literature Review
        2.1 Data Augmentation
            2.1.1 Data Selection
            2.1.2 Perturbation and Data Generation
        2.2 Model Architecture
            2.2.1 Hybrid Models
            2.2.2 End-to-End Models
        2.3 Training Strategy
            2.3.1 Fine-tuning
            2.3.2 Multi-task Learning
            2.3.3 Meta-learning
            2.3.4 Curriculum Learning
    Chapter 3 Taiwanese Speech Recognition
        3.1 Common Phoneme based Data Augmentation
            3.1.1 Data Selection
            3.1.2 Phoneme Set and Lexicon Definition
            3.1.3 Experiment
        3.2 Dynamic Sampling based Meta-learning
            3.2.1 Supplementary Database Selection
            3.2.2 Meta-learning
            3.2.3 Dynamic Sampling Meta-learning
            3.2.4 Experiment
        3.3 Summary
    Chapter 4 Pathological Speech Recognition
        4.1 Data Augmentation for Electrolaryngeal Speech Recognition
            4.1.1 Data Description
            4.1.2 Data Selection based on Phoneme Affinity Matrix
            4.1.3 Experiment
        4.2 Curriculum Learning and Articulatory Feature for Dysarthric Speech Recognition
            4.2.1 Curriculum Learning
            4.2.2 Speech Feature Embedding
            4.2.3 Experiment
        4.3 Hierarchical Curriculum Learning for Dysarthric Speech Recognition
            4.3.1 Fast Adaptation to Each Speech Intelligibility ASR by MAML
            4.3.2 Multi-Level Knowledge Distillation
            4.3.3 Fusion Model Training
            4.3.4 Experiment
        4.4 Summary
    Chapter 5 Conclusions and Future Work
    Bibliography
    Publications

