
Author: 葉育輔 (Yeh, Yu-Fu)
Thesis Title: Taiwanese Speech Recognition Based on Hybrid Deep Neural Network Architecture (基於混合深度神經網路架構之台語語音辨識)
Advisor: 王駿發 (Wang, Jhing-Fa)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108 (2019-2020)
Language: English
Number of Pages: 55
Keywords: Speech Recognition, Taiwanese, Data Augmentation, Deep Neural Network, Acoustic Model, Language Model
Views: 71; Downloads: 0
Abstract:
In recent years, deep learning techniques have been widely applied to speech recognition with steadily improving results, but most of these systems target English or Mandarin Chinese, so many elderly people who speak only Taiwanese cannot use them. To address this problem, this study develops a speech recognition system for Taiwanese. The Kaldi Speech Recognition Toolkit is used to implement a Taiwanese speech recognition system based on a hybrid deep neural network. The Taiwanese corpus, about 11 hours of audio in total, was collected from the Taiwanese (Minnan) read-aloud texts of Taiwan's national language competition and from recordings made by classmates. Because the training data form a small corpus, two audio augmentation methods are used to enlarge it so that the acoustic model can be trained more robustly and effectively. The first is speed perturbation, which produces copies of the original recordings sped up by a factor of 1.1 and slowed down to 0.9 times the original speed. The second is multi-condition training, in which the original speech is corrupted with simulated reverberation and added background noise consisting of music, babble, and other noise. In addition, because speed is critical for online decoding, the language model is a traditional n-gram model, which is much faster than deep-learning language models. The acoustic model is trained with several deep neural network architectures, including TDNN, CNN-TDNN, and CNN-LSTM-TDNN, to find the architecture best suited to this corpus. For decoding, a weighted finite-state transducer combines three levels of transduction, the HMM topology, the lexicon, and the language model, into a single decoding graph; the decoder then uses this graph together with the deep neural network model to transcribe the speech signal. In the experiments, the system achieves a character error rate of 7.61% on test audio outside the language-model domain, 3.95% within the language-model domain, and 3.06% in an actual online decoding test.
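
The two augmentation steps described above can be pictured with a short, self-contained sketch. This is an illustration only, assuming NumPy; the function names (speed_perturb, add_reverb_and_noise) and the synthetic signals are hypothetical placeholders rather than code from the thesis, which carries out these steps through Kaldi's data-preparation recipes.

```python
# Minimal sketch of the two augmentation ideas: speed perturbation by
# resampling, and multi-condition corruption by adding reverberation and
# background noise at a target SNR. All signals below are synthetic.
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample the waveform so it plays `factor` times faster
    (changes both tempo and pitch, as resampling-based perturbation does)."""
    n_out = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)

def add_reverb_and_noise(wave: np.ndarray, rir: np.ndarray,
                         noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Convolve with a room impulse response, then mix in background noise
    (music / babble / noise in the thesis) scaled to the requested SNR."""
    reverberant = np.convolve(wave, rir)[: len(wave)]
    noise = np.resize(noise, len(reverberant))            # loop or trim the noise
    speech_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + gain * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)        # 1 s dummy "speech"
    rir = np.exp(-np.arange(800) / 200.0) * rng.standard_normal(800)  # toy impulse response
    noise = rng.standard_normal(16000)                                # toy background noise

    fast, slow = speed_perturb(clean, 1.1), speed_perturb(clean, 0.9)
    noisy = add_reverb_and_noise(clean, rir, noise, snr_db=10.0)
    print(len(clean), len(fast), len(slow), noisy.shape)
```

Note that speed perturbation here simply resamples the waveform, so tempo and pitch change together; it is not pitch-preserving time stretching.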

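The results above are quoted as character error rates (CER). As a point of reference, the following minimal sketch, not taken from the thesis, shows the conventional computation: the Levenshtein edit distance between the reference and hypothesis character sequences divided by the number of reference characters. The example strings are made up.

```python
# Illustrative character error rate (CER): edit distance over reference length.
def character_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    # One substitution among six reference characters -> CER of about 0.167.
    print(character_error_rate("kamsia", "kamsja"))
```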

Table of Contents:
Chinese Abstract
Abstract
Acknowledgements
Content
Table List
Figure List
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Objectives
  1.4 Organization
Chapter 2 Related Work
  2.1 Review of Recent Research on Taiwanese Speech Recognition
  2.2 Kaldi Speech Recognition Toolkit
  2.3 Speech Recognition System
    2.3.1 Pre-processing
    2.3.2 Acoustic Model
    2.3.3 Language Model
    2.3.4 Discriminative Training and LF-MMI
  2.4 Weighted Finite State Transducer
Chapter 3 Taiwanese Speech Recognition System
  3.1 System Overview
  3.2 Pre-processing
    3.2.1 Data Preparing
    3.2.2 Audio Augmentation
    3.2.3 Feature Extraction
  3.3 Deep Neural Network Acoustic Model
    3.3.1 GMM-HMM System
    3.3.2 Time Delay Neural Network Architecture
    3.3.3 CNN-TDNN Architecture
    3.3.4 CNN-LSTM-TDNN Architecture
  3.4 Decoding Graph and Recognition
Chapter 4 Experimental Results
  4.1 Dataset
    4.1.1 Taiwanese Dataset
    4.1.2 Data Augmentation Dataset
  4.2 Experiment for Taiwanese Speech Recognition
    4.2.1 Experimental Settings
    4.2.2 Evaluation Method
    4.2.3 Experimental Results
Chapter 5 Conclusions
  5.1 Conclusions
  5.2 Future Works
References


Availability:
On campus: full text available from 2025-08-31.
Off campus: not available.
The electronic thesis has not yet been authorized for public release; please consult the library catalog for the print copy.