| Graduate Student: | Zhao, Zhe-Hong (趙哲宏) |
|---|---|
| Thesis Title: | Dynamic-Sampling Based Meta-Learning Using Multilingual Acoustic Data for Under-Resourced Speech Recognition (使用多語言聲學資料之動態資料取樣的元學習於低資源語言自動語音辨識) |
| Advisor: | Wu, Chung-Hsien (吳宗憲) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2021 |
| Graduation Academic Year: | 109 |
| Language: | English |
| Number of Pages: | 53 |
| Keywords: | automatic speech recognition, model-agnostic meta-learning, under-resourced language |
In recent years, research on automatic speech recognition (ASR) has flourished: recognition performance has improved, and ASR applications have been widely adopted in many fields. In ASR research, both the amount of training data and the training strategy affect recognition performance. However, unlike resource-rich languages such as Chinese and English, under-resourced languages cannot easily reach a sufficient amount of data, and collecting and recording corpora is labor-intensive and time-consuming. To improve the accuracy of under-resourced speech recognizers, we therefore need to turn to other data-augmentation methods and training strategies.
In this thesis, Taiwanese is studied as an under-resourced language in order to address its lack of corpora. Since collecting corpora is labor-intensive and time-consuming, augmenting the training data with labeled corpora of other languages is a feasible alternative. Starting from a phoneme-level perspective, this thesis collects speech corpora of five languages, namely Chinese, English, Japanese, Cantonese, and Thai, as additional training data for the acoustic model, and adopts model-agnostic meta-learning (MAML) as the training strategy to extract helpful information from these auxiliary languages. This thesis further proposes a dynamic sampling scheme for meta-learning that uses phonemes, pronunciation, and the speech recognition model as criteria to dynamically allocate the proportion of each auxiliary language and to select the more helpful utterances from each language, so that the information provided by the auxiliary languages is used effectively.
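To make the training strategy more concrete, the sketch below shows what a first-order MAML loop with dynamic language-level sampling might look like. It is a minimal illustration only: the acoustic model, the per-language task samplers, and the query-loss-based rule for updating the sampling weights are assumptions for exposition, not the thesis's actual criteria (which combine phoneme-, pronunciation-, and model-based measures).

```python
# Minimal first-order MAML sketch with language-level dynamic sampling.
# The model, per-language task samplers, and the loss-based weight update
# are illustrative assumptions, not the exact procedure used in the thesis.
import copy
import random
import torch

LANGS = ["zh", "en", "ja", "yue", "th"]              # auxiliary languages

def inner_adapt(model, support_batch, loss_fn, lr=1e-2, steps=1):
    """Clone the meta-model and take a few SGD steps on one language's support set."""
    learner = copy.deepcopy(model)
    opt = torch.optim.SGD(learner.parameters(), lr=lr)
    for _ in range(steps):
        x, y = support_batch
        opt.zero_grad()
        loss_fn(learner(x), y).backward()
        opt.step()
    return learner

def meta_step(model, meta_opt, tasks, loss_fn, lang_weights):
    """One meta-update; auxiliary languages are drawn according to dynamic weights."""
    lang = random.choices(LANGS, weights=[lang_weights[l] for l in LANGS])[0]
    support, query = tasks[lang]()                   # per-language task sampler (assumed)
    learner = inner_adapt(model, support, loss_fn)
    learner.zero_grad()
    x_q, y_q = query
    q_loss = loss_fn(learner(x_q), y_q)
    q_loss.backward()                                # gradients live on the adapted copy
    meta_opt.zero_grad()
    for p, p_adapted in zip(model.parameters(), learner.parameters()):
        if p_adapted.grad is not None:
            p.grad = p_adapted.grad.clone()          # first-order approximation
    meta_opt.step()
    # Illustrative dynamic sampling: languages with higher query loss are assumed
    # to carry more useful information and are sampled more often next round.
    lang_weights[lang] = 0.9 * lang_weights[lang] + 0.1 * q_loss.item()
    return q_loss.item()

# Usage (assuming `model`, `tasks`, and `loss_fn` are defined):
#   lang_weights = {l: 1.0 for l in LANGS}
#   meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
#   for _ in range(num_meta_steps):
#       meta_step(model, meta_opt, tasks, loss_fn, lang_weights)
```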
The experimental results on Taiwanese show that, by training with auxiliary data from other languages, the model can learn information from other domains and obtain better initial parameters. In addition, the proposed information-content measure and dynamic sampling are used to quantify the differences among the utterances of each language, and the more helpful utterances are then selected from the auxiliary data for training. On the Taiwanese test set, the proposed approach achieved a word error rate of 20.68% and a syllable error rate of 8.35%, outperforming both the baseline model trained only on the Taiwanese corpus and the model-agnostic meta-learning method with ordinary random sampling.
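For reference, the reported word error rate and syllable error rate are both edit-distance-based metrics. The sketch below shows a minimal token-level implementation, assuming whitespace tokenization (splitting on words for WER, and on syllables for SER); the thesis's actual scoring pipeline is not shown here.

```python
# Minimal sketch of the edit-distance-based error rate used to score ASR output.
# WER splits on words; SER would split the Taiwanese transcripts into syllables
# instead (the whitespace tokenization here is an assumption for illustration).
def error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming Levenshtein distance over tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)       # (S + D + I) / N

# Example: one substitution and one deletion over five reference tokens -> 0.4 (40%).
print(error_rate("i want to go home", "i want go house"))
```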