
Graduate Student: Kao, Hua-Cen (高驊岑)
Thesis Title: Data Augmentation and Selection for Noisy Student Training on Under-Resourced Speech Recognition
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Academic Year of Graduation: 110 (2021-2022)
Language: English
Number of Pages: 71
Keywords: Automatic speech recognition, Under-resourced language, Data augmentation, Noisy Student Training

    In recent years, rapid progress in automatic speech recognition (ASR) research has greatly improved recognition performance, and ASR is now widely deployed across many fields. Prior work has shown that both the amount of training data and the training strategy affect recognition performance. However, not all languages have resources as abundant as Mandarin and English, and manually recording a corpus is costly. To improve the performance of under-resourced speech recognizers, alternative data augmentation methods and training strategies are therefore needed.
    In this thesis, Taiwanese is chosen as the target under-resourced language. To address the scarcity of transcribed corpora, a large amount of drama speech published on video streaming websites is collected as unlabeled data. A text selection method picks a suitable subset of sentences from the collected texts, and a speech synthesis model together with a voice conversion model generates multi-speaker synthetic speech from that subset, diversifying the augmented data. Data selection methods tailored to each kind of augmented data then filter out low-quality samples, and Noisy Student Training is adopted as the training strategy so that the selected data can be folded into the training process; a sketch of this loop follows below.
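    To make the training strategy concrete, the following Python sketch outlines a generic Noisy Student Training loop of the kind described above. It is a minimal sketch, not the thesis's actual code: train_asr and transcribe are hypothetical placeholders standing in for the supervised training routine and the teacher's decoder, and the data containers are simplified to path/transcript pairs.

        from typing import Callable, List, Tuple

        def noisy_student_training(
            labeled: List[Tuple[str, str]],    # (audio_path, transcript) pairs
            unlabeled: List[str],              # audio without transcripts (drama and synthetic speech)
            train_asr: Callable[..., object],  # placeholder: trains and returns an ASR model
            transcribe: Callable[[object, str], Tuple[str, float]],  # placeholder: (hypothesis, confidence)
            generations: int = 3,
            threshold: float = 0.8,
        ) -> object:
            # Train the baseline teacher on labeled data only.
            teacher = train_asr(labeled)
            for _ in range(generations):
                # The teacher pseudo-labels the unlabeled pool; only
                # confident hypotheses survive the selection step.
                pseudo: List[Tuple[str, str]] = []
                for audio in unlabeled:
                    hypothesis, confidence = transcribe(teacher, audio)
                    if confidence >= threshold:
                        pseudo.append((audio, hypothesis))
                # The student is trained on labeled data plus the selected
                # pseudo-labels, with input noise (e.g., SpecAugment) applied.
                student = train_asr(labeled + pseudo, noise=True)
                teacher = student  # the student becomes the next generation's teacher
            return teacher

    Because the filter is re-applied with the stronger teacher at each generation, the pool of usable pseudo-labels can grow as recognition improves.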
    The data selection method proposed in this thesis assigns a confidence score to each augmented utterance and then extracts the more helpful samples from the training pool. Experimental results show that the model benefits from training on these selected samples: on the Taiwanese test set, it reaches a word error rate of 23.6% and a syllable error rate of 10.9%, both lower than those of the baseline model trained purely on Taiwanese labeled data.
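    One common way to realize such an utterance-level confidence, sketched below, is the length-normalized log-probability of the decoded hypothesis. This particular formula and the token_logprobs field are assumptions for illustration, not necessarily the quantity computed by the thesis's confidence estimation module.

        import math
        from typing import Dict, List

        def utterance_confidence(token_logprobs: List[float]) -> float:
            # exp of the mean per-token log-probability, mapping the
            # hypothesis score back into (0, 1] regardless of length.
            if not token_logprobs:
                return 0.0
            return math.exp(sum(token_logprobs) / len(token_logprobs))

        def select_confident(utterances: List[Dict], threshold: float = 0.8) -> List[Dict]:
            # Keep only utterances whose pseudo-label clears the confidence bar.
            return [u for u in utterances
                    if utterance_confidence(u["token_logprobs"]) >= threshold]

    Length normalization matters here: raw sequence log-probabilities shrink as utterances get longer, so an unnormalized score would bias selection toward short clips.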

    Abstract (Chinese) I
    Abstract III
    Acknowledgements V
    Contents VI
    List of Tables IX
    List of Figures XI
    Chapter 1 Introduction 1
      1.1 Background 1
      1.2 Motivation 2
      1.3 Literature Review 4
        1.3.1 Speech Recognition 4
        1.3.2 Model Adaptation 7
        1.3.3 Data Selection 9
      1.4 Problems 11
      1.5 System Framework 13
    Chapter 2 Data Collection 15
      2.1 Taiwanese Speech Corpora 15
      2.2 Taiwanese Drama Corpora 16
      2.3 Taiwanese Text Corpora 18
      2.4 Auxiliary Language Corpora 18
    Chapter 3 Proposed Methods 20
      3.1 Basic Taiwanese ASR Training 21
        3.1.1 End-to-End Model Architecture 21
        3.1.2 Hybrid Training and Decoding 26
      3.2 Unlabeled Data Processing 28
        3.2.1 Text Selection 28
        3.2.2 Speech Synthesis Model 29
        3.2.3 Voice Conversion Model 35
      3.3 Data Selection 42
        3.3.1 Confidence Estimation Module 43
        3.3.2 Pseudo-label Selection 46
        3.3.3 ASR-Confidence-Based Selection 48
      3.4 Joint Training of Labeled Data and Selected Data 49
    Chapter 4 Experimental Results and Discussion 51
      4.1 Experimental Settings 51
        4.1.1 Testing Dataset 51
        4.1.2 Model Settings 52
      4.2 Evaluation of Under-Resourced ASR 55
      4.3 Discussion 64
    Chapter 5 Conclusion and Future Work 65
    References 67


    Full text publicly available on campus from 2024-09-01 and off campus from 2024-09-01.