
Graduate Student: Kuo, Hsien-Shun (郭先舜)
Thesis Title: Auditory-Based Features for Robust Speech Recognition System
Advisor: Wang, Jhing-Fa (王駿發)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2014
Graduation Academic Year: 102 (2013-2014)
Language: English
Pages: 57
Chinese Keywords: basilar-membrane filter, speech recognition, cepstral coefficients, human auditory model
Keywords: gammachirp filterbank, speech recognition, cepstral coefficients, auditory modeling
    In this thesis, an auditory-based feature extraction algorithm is proposed for a robust speech recognition system. The speech signal is characterized by a new feature algorithm, the Basilar-membrane Frequency-band Cepstral Coefficient (BFCC). Whereas the widely used Mel Frequency Cepstral Coefficient (MFCC) method generates its spectrum with the Fourier transform, BFCC generates its spectrum with a wavelet transform based on a basilar-membrane (gammachirp) filter. Because the gammachirp filter differs from the Mel triangular filter, and the wavelet transform differs in character from the Fourier transform, the spectrum produced by BFCC mimics the characteristics of human hearing more accurately and better withstands noise interference. In addition, the HTK toolkit is used to build Hidden Markov Models (HMMs) for training and testing. AURORA 2.0 serves as the training and testing database, with test set A of AURORA 2.0 used for evaluation; the noise types are train, babble, car, and exhibition hall. The recognition results show that, averaged over the four noise types at signal-to-noise ratios from -5 dB to 20 dB, the proposed BFCC method improves the recognition rate by 13% over MFCC. Compared with other auditory features, the Gammatone Wavelet Cepstral Coefficient (GWCC) and the Gammatone Frequency Cepstral Coefficient (GFCC), the average recognition rate improves by about 17% and about 0.5%, respectively.
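
    The basilar-membrane filter underlying BFCC is the gammachirp of Irino and Patterson, whose impulse response is g(t) = a t^(n-1) exp(-2*pi*b*ERB(fc)*t) cos(2*pi*fc*t + c*ln t), where ERB(fc) is the equivalent rectangular bandwidth of the channel centered at fc; setting c = 0 recovers the gammatone filter. Below is a minimal sketch of this impulse response; the parameter values (n = 4, b = 1.019, c = -2) are typical defaults from the gammachirp literature, not the thesis's own configuration.

```python
# A minimal sketch of the gammachirp impulse response (Irino & Patterson, 1997).
# Parameters n=4, b=1.019, c=-2 are common defaults in the literature, not
# values taken from the thesis; the chirp term c*ln(t) is what distinguishes
# the gammachirp from the gammatone (c = 0 gives a gammatone).
import numpy as np

def erb(fc_hz):
    """Equivalent Rectangular Bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammachirp(fc_hz, fs_hz, c=-2.0, n=4, b=1.019, duration_s=0.025):
    """g(t) = t^(n-1) exp(-2*pi*b*ERB(fc)*t) cos(2*pi*fc*t + c*ln t), peak-normalized."""
    t = np.arange(1, int(duration_s * fs_hz)) / fs_hz   # start at t > 0 (ln t undefined at 0)
    env = t ** (n - 1) * np.exp(-2.0 * np.pi * b * erb(fc_hz) * t)
    carrier = np.cos(2.0 * np.pi * fc_hz * t + c * np.log(t))
    g = env * carrier
    return g / np.max(np.abs(g))

# Example: a 1 kHz channel at the 8 kHz sampling rate of AURORA 2 speech.
ir = gammachirp(1000.0, 8000.0)
```

    Convolving the speech signal with a bank of such filters at different center frequencies yields the basilar-membrane-style spectrogram from which the cepstral features are then computed.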

    An auditory-based feature extraction algorithm is proposed for enhancing the robustness of automatic speech recognition. In the proposed approach, the speech signal is characterized using a new feature referred to as the Basilar-membrane Frequency-band Cepstral Coefficient (BFCC). In contrast to the conventional Mel-Frequency Cepstral Coefficient (MFCC) method, which is based on a Fourier spectrogram, the proposed BFCC method uses an auditory spectrogram based on a gammachirp wavelet transform in order to mimic the auditory response of the human ear more accurately and to improve noise immunity. In addition, Hidden Markov Models (HMMs) are used for both training and testing. The evaluation results obtained using the AURORA 2 noisy speech database show that, compared with the MFCC, Gammatone Wavelet Cepstral Coefficient (GWCC), and Gammatone Frequency Cepstral Coefficient (GFCC) methods, the proposed scheme improves the speech recognition rate by 13%, 17%, and 0.5% on average, respectively, for speech samples with Signal-to-Noise Ratios (SNRs) ranging from -5 to 20 dB.
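
    To make the described pipeline concrete, the sketch below strings a gammachirp filterbank, per-frame log channel energies, and a DCT into cepstral coefficients, mirroring how MFCCs apply a DCT to log Mel-filterbank energies. It reuses the gammachirp() helper above; the channel spacing, frame sizes, and coefficient count are illustrative assumptions rather than the thesis's exact BFCC settings, and the time-domain convolution merely stands in for the cochlear wavelet transform developed in Chapter 3.

```python
# A minimal sketch of an auditory cepstral front end in the spirit of BFCC.
# Assumes the gammachirp() helper defined in the previous sketch; channel
# count, frame length, hop, and coefficient count are illustrative choices.
import numpy as np
from scipy.fftpack import dct
from scipy.signal import fftconvolve

def auditory_cepstra(signal, fs=8000, n_channels=32, frame_len=200, hop=80, n_ceps=13):
    # Center frequencies spaced logarithmically across the speech band.
    fcs = np.geomspace(100.0, 0.9 * fs / 2.0, n_channels)
    # Filter the signal with each gammachirp channel to form a cochlea-like
    # decomposition (a stand-in for the thesis's cochlear wavelet transform).
    channels = np.stack([fftconvolve(signal, gammachirp(fc, fs), mode="same")
                         for fc in fcs])
    # Per-frame log energy in each channel -> auditory log-spectrogram.
    n_frames = 1 + (channels.shape[1] - frame_len) // hop
    logspec = np.empty((n_frames, n_channels))
    for i in range(n_frames):
        seg = channels[:, i * hop : i * hop + frame_len]
        logspec[i] = np.log(np.sum(seg ** 2, axis=1) + 1e-10)
    # The DCT decorrelates the log energies, yielding cepstral coefficients,
    # exactly as the DCT of log Mel energies yields MFCCs.
    return dct(logspec, type=2, axis=1, norm="ortho")[:, :n_ceps]
```

    The resulting coefficient vectors, typically augmented with delta and acceleration terms, would then be fed to HTK-trained HMMs in the same way as MFCC features.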

    Chinese Abstract I
    Abstract II
    Acknowledgements III
    Contents IV
    Table List VI
    Figure List VII
    Chapter 1 Introduction 1
      1.1 Background 1
      1.2 Motivation 2
      1.3 Previous Works 2
      1.4 Objectives 3
      1.5 Organization 4
    Chapter 2 Related Works 6
      2.1 System Overview 6
      2.2 Preprocessing 7
        2.2.1 Framing 7
        2.2.2 Fourier Transform 8
        2.2.3 Wavelet Transform 8
      2.3 Feature Extraction 9
        2.3.1 Mel Frequency Cepstral Coefficients 9
        2.3.2 Gammatone Wavelet Cepstral Coefficients 11
        2.3.3 Cochlear Frequency Cepstral Coefficients 14
      2.4 Dynamic Features 16
      2.5 Gaussian Mixture Model 17
      2.6 Hidden Markov Model 20
    Chapter 3 Proposed Feature Extraction Algorithm for Speech Recognition System 22
      3.1 System Overview 22
      3.2 Basilar-membrane Frequency-band Cepstral Coefficient 23
        3.2.1 Gammachirp Filter 24
        3.2.2 Cochlear Wavelet Transform 27
      3.3 Cochlear Frequency Instantaneous Frequency 29
        3.3.1 Instantaneous Frequency Estimation Using the Hilbert Spectrum 30
        3.3.2 Instantaneous Frequency Estimation Using DESA 31
      3.4 Cochlear Frequency Spectrogram Energy 32
    Chapter 4 Experiments 34
      4.1 Corpus 34
        4.1.1 AURORA 2.0 34
        4.1.2 Binaural AURORA 2.0 35
      4.2 HTK Toolkit Setting 38
      4.3 Word Accuracy 40
        4.3.1 Word Accuracy of AURORA 2.0 40
        4.3.2 Word Accuracy of Binaural AURORA 2.0 46
    Chapter 5 Conclusions and Future Works 52
      5.1 Conclusions and Discussion 52
        5.1.1 Discussion 52
        5.1.2 Conclusions 53
      5.2 Future Works 54
    References 55

    Full-text availability: on campus, open access from 2024-01-01; off campus, not available.
    The electronic thesis has not been authorized for public release; please consult the library catalog for the print copy.