| Graduate student: | 王偉軒 Wang, Wei-Xuan |
|---|---|
| Thesis title: | 應用人耳聽覺濾波器及功率正規化倒譜係數於強健性語音辨識系統 (Application of Human Auditory Filters and Power-Normalized Cepstral Coefficients for a Robust Speech Recognition System) |
| Advisor: | 雷曉方 Lei, Sheau-Fang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Year of publication: | 2018 |
| Academic year of graduation: | 106 |
| Language: | Chinese |
| Pages: | 89 |
| Keywords (Chinese, translated): | basilar-membrane auditory filterbank, robust speech recognition system, feature extraction algorithm, cepstrum, power-normalized cepstral coefficients |
| Keywords (English): | auditory-based filterbank, robust speech recognition, feature extraction, cepstral coefficient, Power-Normalized Cepstral Coefficient |
In this thesis, an auditory filterbank that better matches the characteristics of human hearing, called the Simplified Gammachirp Filterbank, is used to replace the Mel triangular filterbank in conventional Mel-Frequency Cepstral Coefficients (MFCC) and the gammatone filterbank in Power-Normalized Cepstral Coefficients (PNCC). On this basis, a feature-extraction algorithm is proposed to improve the original robust speech recognition system; it is compared against MFCC, Gammatone Frequency Cepstral Coefficients (GFCC), Gammachirp Frequency Cepstral Coefficients (GcFCC), and Normalized Gammachirp Cepstral Coefficients (NGcFCC), and the Simplified Gammachirp Filterbank is additionally combined with the original PNCC to improve PNCC itself. The idea behind the Simplified Gammachirp Filterbank comes from GcFCC, which modulates the filterbank according to the sound pressure level of the speech signal to improve recognition accuracy; however, its computational complexity makes real-time operation very difficult. This thesis investigates why the gammachirp filterbank improves recognition accuracy, proposes a method that improves the gammachirp filterbank without relying on sound-pressure-level modulation, and applies the resulting filterbank within PNCC to obtain better recognition accuracy. AURORA 2.0 is used as the database for both training and testing, with eight noise types: subway, babble, car, exhibition hall, restaurant, street, airport, and train station. Averaged over the eight noise types, SGcFCC without PNCC improves recognition accuracy over GcFCC and MFCC by 1.4% and 2.8%, respectively, and SGcFCC combined with PNCC improves on the original PNCC by 1.19%.
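The gammachirp filterbank described above can be illustrated with a short sketch. This is not the thesis's implementation; the order `n`, bandwidth factor `b`, chirp factor `c`, duration, and center frequencies below are assumed demonstration values, and the standard Glasberg-Moore ERB approximation is used for the bandwidth term.

```python
import numpy as np

def erb(fc):
    # Equivalent Rectangular Bandwidth, Glasberg-Moore approximation (Hz)
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammachirp(fc, fs, n=4, b=1.019, c=-2.0, dur=0.025):
    # Illustrative gammachirp impulse response:
    #   g(t) = t^(n-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t + c*ln(t))
    # The c*ln(t) term is the "chirp" that distinguishes it from a
    # gammatone filter (recovered when c = 0).
    t = np.arange(1, int(dur * fs) + 1) / fs  # start at 1/fs to avoid ln(0)
    env = t ** (n - 1) * np.exp(-2 * np.pi * b * erb(fc) * t)
    g = env * np.cos(2 * np.pi * fc * t + c * np.log(t))
    return g / np.max(np.abs(g))  # peak-normalize for display

# A small bank of filters at a few center frequencies (sketch only;
# a real front end would space many channels on an ERB scale)
fs = 8000
fcs = [250, 500, 1000, 2000]
bank = [gammachirp(fc, fs) for fc in fcs]
```

Filtering each speech frame through such a bank and taking per-channel energies would replace the Mel triangular filterbank stage of MFCC, which is the substitution the thesis explores.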
This thesis improves an auditory filterbank based on the characteristics of human hearing and calls the improved filterbank the Simplified Gammachirp Filterbank. It substitutes the Simplified Gammachirp Filterbank for the Mel triangular filterbank in conventional Mel-Frequency Cepstral Coefficients (MFCC), and for the gammatone filterbank in Gammatone Frequency Cepstral Coefficients (GFCC) and Power-Normalized Cepstral Coefficients (PNCC). It then proposes two feature-extraction algorithms: Simplified Gammachirp Frequency Cepstral Coefficients (SGcFCC), and the Simplified Gammachirp Filterbank combined with PNCC, both intended to improve the robust speech recognition system. On Aurora 2, SGcFCC compares favorably with four algorithms, namely MFCC, GFCC, Gammachirp Frequency Cepstral Coefficients (GcFCC), and Normalized Gammachirp Cepstral Coefficients (NGcFCC), and the Simplified Gammachirp Filterbank with PNCC compares favorably with the original PNCC. GcFCC uses the sound pressure level of the speech signal to modify the gammachirp filterbank, but its high computational complexity prevents real-time use. This thesis examines why the gammachirp filterbank improves the recognition rate and proposes the Simplified Gammachirp Filterbank accordingly.
The Aurora 2 database is used to build the speech recognition system and to evaluate the proposed algorithms. Without PNCC, the proposed scheme improves word accuracy by 2.16% over NGcFCC, 1.4% over GcFCC, 1.42% over GFCC, and 2.98% over MFCC; with PNCC, it improves on the original PNCC by 1.19%.
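The PNCC-style processing that the abstract contrasts with MFCC can be sketched as follows. This is a heavily simplified illustration, not the full PNCC chain (which also includes medium-time asymmetric noise suppression and temporal masking): the normalization here is reduced to division by the per-channel mean, followed by PNCC's power-law nonlinearity, which replaces the logarithm used in MFCC. The array shapes and the `alpha = 1/15` exponent follow the common PNCC formulation.

```python
import numpy as np

def power_law_features(power_spec, alpha=1.0 / 15.0, eps=1e-10):
    # power_spec: filterbank energies, shape (frames, channels)
    # Simplified power normalization: divide by the per-channel mean
    # (standing in for PNCC's medium-time power bias subtraction),
    # then apply the power-law nonlinearity (.)^(1/15) instead of log.
    norm = power_spec / (np.mean(power_spec, axis=0, keepdims=True) + eps)
    return norm ** alpha

# Toy filterbank energies for 10 frames and 23 channels (random demo data)
P = np.abs(np.random.default_rng(0).normal(size=(10, 23))) + 0.1
feat = power_law_features(P)
```

In a complete front end, a DCT over the channel axis would then yield the cepstral coefficients; swapping the gammatone filterbank that produces `power_spec` for the Simplified Gammachirp Filterbank is the combination the thesis evaluates.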
On-campus access: available from 2023-08-01