
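Graduate student: Wang, Wei-Xuan (王偉軒)
Thesis title: Application of Human Auditory Filters and Power-Normalized Cepstral Coefficients for Robust Speech Recognition System
Advisor: Lei, Sheau-Fang (雷曉方)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: Chinese
Number of pages: 89
Keywords (translated from Chinese): human basilar membrane filterbank, robust speech recognition system, feature extraction algorithm, cepstrum, power-normalized cepstral coefficients
Keywords (English): auditory-based filterbank, robust speech recognition, feature extraction, cepstral coefficient, Power-Normalized Cepstral Coefficient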
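    This thesis uses an auditory filterbank that better matches the characteristics of human hearing, the Simplified Gammachirp Filterbank, to replace the Mel triangular filters conventionally used in Mel Frequency Cepstral Coefficients (MFCC) and the Gammatone filterbank used in Power-Normalized Cepstral Coefficients (PNCC). On this basis it proposes a feature extraction algorithm to improve an existing robust speech recognition system. The algorithm is compared with MFCC, Gammatone Frequency Cepstral Coefficients (GFCC), Gammachirp Frequency Cepstral Coefficients (GcFCC), and Normalized Gammachirp Cepstral Coefficients (NGcFCC). The Simplified Gammachirp Filterbank is also combined with the original PNCC in order to improve it. The idea for the Simplified Gammachirp Filterbank comes from GcFCC, which uses the sound pressure level of the speech signal to modulate the filterbank and thereby improve the recognition rate; its computational complexity, however, is so high that real-time operation is very difficult. This thesis examines why the Gammachirp filterbank improves the recognition rate and proposes a way to improve the Gammachirp filterbank that does not rely on the sound pressure level of the speech signal for modulation. Applying this filterbank to PNCC yields a better recognition rate. AURORA 2.0 serves as the database for both training and testing, with eight noise types: subway, babble, car, exhibition hall, restaurant, street, airport, and train station. Averaged over the eight noise types, SGcFCC without PNCC improves the recognition rate by 1.4% over GcFCC and by 2.8% over MFCC, and SGcFCC combined with PNCC improves on the original PNCC by 1.19%.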

    This thesis improves an auditory filterbank based on the characteristics of human hearing and calls the result the Simplified Gammachirp Filterbank. The proposed filterbank replaces the Mel-frequency triangular filterbank in conventional Mel Frequency Cepstral Coefficients (MFCC), and replaces the Gammatone filterbank in Gammatone Frequency Cepstral Coefficients (GFCC) and in Power-Normalized Cepstral Coefficients (PNCC). This gives two feature extraction algorithms for robust speech recognition: Simplified Gammachirp Frequency Cepstral Coefficients (SGcFCC), and the Simplified Gammachirp Filterbank combined with PNCC. On Aurora 2, SGcFCC compares favorably with four algorithms: MFCC, GFCC, Gammachirp Frequency Cepstral Coefficients (GcFCC), and Normalized Gammachirp Cepstral Coefficients (NGcFCC). The Simplified Gammachirp Filterbank with PNCC likewise compares favorably with the original PNCC. GcFCC uses the sound pressure level of the speech signal to modify the Gammachirp filterbank, but its high computational complexity prevents its use in a real-time causal system. This thesis examines why the Gammachirp filterbank improves the recognition rate and, on that basis, proposes the Simplified Gammachirp Filterbank algorithm.
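    For orientation, the gammachirp filter named in the abstract is commonly written as the impulse response g(t) = t^(n-1) exp(-2π b ERB(fc) t) cos(2π fc t + c ln t), which reduces to the gammatone filter when the chirp term c is zero. The sketch below evaluates that textbook form with a fixed c, so no sound pressure level is needed. It is not the thesis's Simplified Gammachirp algorithm, whose details are not given in this record. The function names `erb_hz` and `gammachirp_ir` and all default parameter values are illustrative assumptions.

```python
import math


def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth in Hz (common Glasberg-Moore approximation)."""
    return 24.7 + 0.108 * fc_hz


def gammachirp_ir(fc_hz, fs_hz, duration_s=0.025, order=4, b=1.019, c=-2.0):
    """Peak-normalized gammachirp impulse response with a fixed chirp term c.

    Assumed, illustrative defaults (not taken from the thesis):
    order=4, b=1.019, c=-2.0. Setting c=0.0 gives a gammatone filter.
    """
    n_samples = int(round(duration_s * fs_hz))
    ir = []
    # Start at the first sample after t = 0 so that ln(t) is defined.
    for k in range(1, n_samples + 1):
        t = k / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * erb_hz(fc_hz) * t)
        phase = 2.0 * math.pi * fc_hz * t + c * math.log(t)
        ir.append(envelope * math.cos(phase))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]


# One channel centered at 1 kHz, sampled at 8 kHz.
channel = gammachirp_ir(1000.0, 8000.0)
```

    A full filterbank would repeat this for a set of center frequencies and convolve each response with the speech signal; how the thesis chooses those frequencies and the chirp term is not stated here.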
    We used the Aurora 2 database to build the speech recognition system and to evaluate our algorithms. Without PNCC, the proposed scheme improves Word Accuracy by 2.16% over NGcFCC, 1.4% over GcFCC, 1.42% over GFCC, and 2.98% over MFCC; with PNCC, it improves on the original PNCC by 1.19%.
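    As a reading aid for the Word Accuracy figures, the sketch below assumes the HTK-style definition, Acc = (N - D - S - I) / N × 100, where N is the number of reference words and D, S, I are deletions, substitutions, and insertions. The record does not state which definition the thesis uses, so this is an assumption. The helper names `word_accuracy` and `mean_accuracy` are hypothetical, and the numbers in the usage lines are made up for illustration, not results from the thesis.

```python
def word_accuracy(n_ref, deletions, substitutions, insertions):
    """HTK-style word accuracy in percent: (N - D - S - I) / N * 100."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref


def mean_accuracy(accuracies):
    """Unweighted mean of per-condition accuracies, e.g. one value per noise type."""
    if not accuracies:
        raise ValueError("accuracies must not be empty")
    return sum(accuracies) / len(accuracies)


# Illustrative numbers only: 100 reference words with 2 deletions,
# 5 substitutions, and 3 insertions.
acc = word_accuracy(100, 2, 5, 3)
# Averaging two hypothetical noise conditions.
avg = mean_accuracy([acc, 80.0])
```

    Because insertions are subtracted, this measure can be lower than the plain percentage of correctly recognized words.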

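    Chinese Abstract
    Extended Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Motivation and Objectives
      1.2 Overview of Noise Types
        1.2.1 Convolutional Noise
        1.2.2 Additive Noise
      1.3 Overview of Approaches to Robust Speech Recognition
        1.3.1 Model domain
        1.3.2 Feature domain
      1.4 Thesis Organization
    Chapter 2 Review and Analysis of Related Literature
      2.1 Characteristics of Human Hearing
        2.1.1 Physiology of the Human Ear
        2.1.2 Critical Bands
        2.1.3 Masking Effects
        2.1.4 Characteristics of the Human Cochlea
      2.2 Feature Extraction Methods for Speech Recognition
        2.2.1 Mel Frequency Cepstral Coefficients and Gammatone Frequency Cepstral Coefficients
        2.2.2 Power-Normalized Cepstral Coefficients for Real-Time Implementation
        2.2.3 Dynamic Feature Parameters
      2.3 Aurora 2 Speech Recognition System
        Noisy speech database
        Definition of the training and test sets
        Introduction to the recognition system
    Chapter 3 Improving Power-Normalized Cepstral Coefficients with Human Auditory Filters
      3.1 Noise Suppression Algorithm
        Medium-Time Power Calculation
        Asymmetric Noise Suppression with temporal masking
        Spectral Weight Smoothing and time-frequency normalization
      3.2 Gammachirp filterbank
        Algorithm for modulating the Gammachirp filterbank using the sound pressure level
        Characteristics of the Gammachirp filterbank
        Proposed Gammachirp filterbank Algorithm
      3.3 Summary of the Proposed Feature Extraction Algorithms
    Chapter 4 Analysis, Comparison, and Results of the Algorithms
      4.1 Error Comparison on a Single Utterance
        Signal fidelity measures
      4.2 Comparison on the Aurora 2 Recognition System
        Recognition rate metrics
        Experimental results
      4.3 Computational Complexity Comparison
        Comparison of the proposed Gammachirp and Gammatone filterbanks
        Difference in computational complexity between the proposed PNCC and conventional MFCC
    Chapter 5 Conclusions and Future Work
    References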


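    On campus: public from 2023-08-01
    Off campus: not public
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.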