| Student: | 王竣煌 Wang, Chun-Huang |
|---|---|
| Thesis title: | 應用文本資料增強於低資源語言之語碼轉換語音辨識 Textual Data Augmentation for Code-Switching Speech Recognition with Under-Resourced Language |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 (ROC calendar) |
| Language: | English |
| Pages: | 57 |
| Keywords: | automatic speech recognition, under-resourced language, code-switching, data augmentation |
Automatic speech recognition (ASR) is one of the most active topics in speech-related research, and it is applied in daily life far more often than in the past. For a speech recognizer to achieve accurate recognition, both the method used to build it and the training corpora are crucial. However, not all languages have as many speakers as Mandarin or English, so in many cases it is difficult to collect a speech corpus large enough to train a recognizer. To train a recognizer for an under-resourced language, the method itself must be adapted, favoring approaches that require less training data.
In this thesis, Taiwanese is chosen as the under-resourced language; both its speech and text corpora are relatively scarce. Moreover, in practical use, code-switching between Taiwanese and Mandarin occurs quite frequently, which is another problem this thesis must address. Fortunately, Taiwanese and Mandarin share similarities in both pronunciation and grammar. This thesis therefore proposes a shared-phone scheme so that Mandarin speech can also be used to train the acoustic models. To address the lack of text data, Mandarin text is translated into Taiwanese word by word with additional hand-crafted rules, augmenting the Taiwanese text corpus. Furthermore, dedicated translation rules for code-switching are defined, and the resulting code-switched text is used to train the language models.
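To make the textual augmentation step concrete, the following is a minimal Python sketch, not the thesis's actual implementation: it assumes a small word-level Mandarin-to-Taiwanese dictionary (`MANDARIN_TO_TAIWANESE`, with hypothetical toy entries) and emulates a code-switching rule by leaving a word untranslated with some probability, so one segmented Mandarin sentence can yield either synthetic Taiwanese or synthetic code-switched training text.

```python
import random

# Toy word-level Mandarin -> Taiwanese dictionary (hypothetical entries;
# the thesis relies on a real lexical resource plus hand-crafted rules).
MANDARIN_TO_TAIWANESE = {
    "我": "gua2",
    "想": "siunn7",
    "吃": "tsiah8",
    "飯": "png7",
}

def translate_word_by_word(words, switch_prob=0.0, rng=random):
    """Translate a segmented Mandarin sentence word by word.

    With probability `switch_prob`, a translatable word is kept in
    Mandarin instead, producing synthetic code-switched text; words
    missing from the dictionary are always kept as-is.
    """
    out = []
    for w in words:
        tw = MANDARIN_TO_TAIWANESE.get(w)
        if tw is None or rng.random() < switch_prob:
            out.append(w)           # keep the Mandarin word (code-switch)
        else:
            out.append(tw)          # use the Taiwanese translation
    return out

if __name__ == "__main__":
    sentence = ["我", "想", "吃", "飯"]
    print(translate_word_by_word(sentence))                   # pure Taiwanese
    print(translate_word_by_word(sentence, switch_prob=0.3))  # code-switched
```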
In terms of experiments, this thesis adopted a lexicon containing shared phones, and used Taiwanese and Mandarin speech to jointly train the acoustic models. The performance of word error rate was 26.02%, which was better than that trained by the pure Taiwanese corpus used as the baseline. In addition, this thesis used the code-switching text corpus to train the language model and combined it with the acoustic model of the shared phone. The performance of the word error rate was 29.05%, and the experimental results showed that the speech recognizer had the ability to recognize the code-switching vocabulary.
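For reference, the word error rates above follow the standard definition: the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the hypothesis, divided by the number of reference words. Below is a minimal sketch of that computation using word-level Levenshtein alignment; it is the textbook metric, not tied to any particular toolkit.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: (substitutions + deletions + insertions) / len(reference),
    computed with word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Example: 1 substitution over 4 reference words -> WER = 0.25 (25%).
print(word_error_rate("gua2 siunn7 tsiah8 png7", "gua2 想 tsiah8 png7"))
```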