
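Graduate student: 張恭銘 (Chang, Gung-Ming)
Thesis title: 基於發聲起始時間與重音類別偵測之台灣腔英語鑑別性語音特徵分析之研究
Discriminative Feature Analysis based on Voice Onset Time and Stress Detection for Taiwanese-accented English Speech
Advisor: 王駿發 (Wang, Jhing-Fa)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 49
Keywords (Chinese): 發聲起始時間 (voice onset time), 重音類別偵測 (stress detection)
Keywords (English): Stress Detection, Voice Onset Time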
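  • In recent years, automatic speech recognition systems have achieved remarkable results. Conventional speech recognition techniques process a speaker's speech signal in either a speaker-dependent or a speaker-independent manner, extracting the relevant speech attributes to build recognition models. However, speaker-dependent recognition mostly cannot handle speech signals affected by accent characteristics, such as mispronunciations, so conventional recognizers do not work well for speakers with a particular accent. In addition, data show that people from the same region generally share similar accents and intonation tendencies, and that speakers with a heavy accent tend to make more pronunciation errors than speakers with standard pronunciation. Because conversing and interacting with others is one way to improve foreign-language ability, accent-adaptive automatic speech recognition systems that help speakers effectively improve their accent problems will be an important topic in the future.
    In this thesis, we use the English Across Taiwan (EAT) and Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpora to represent Taiwanese-accented and foreign (native-speaker) accented English speech, respectively, and compare and analyze the two. We propose methods for automatically measuring voice onset time and for detecting stress on English syllables. We use the Teager Energy Operator (TEO) and the wavelet transform to measure the voice onset time of Taiwanese-accented speech, and a support vector machine (SVM) to detect stress on Taiwanese-accented English syllables. Perceptual features, MFCC, delta-MFCC, and delta-delta-MFCC serve as the input audio features, and a probabilistic SVM serves as the stress classifier. Together, the voice onset time and syllable stress features can help improve the recognition performance of automatic speech recognition systems on accented speech.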

    In recent years, automatic speech recognition (ASR) systems have improved greatly. ASR systems fall into two categories: speaker-dependent and speaker-independent. However, speaker variability still affects ASR performance considerably, and among the factors contributing to this variability, gender and accent are the most important. In addition, speakers from the same accent region have been observed to share similar tendencies, and speakers with heavy accents tend to make more pronunciation errors relative to the standard pronunciation. Because speaking and interacting with others is one way to improve English skills, accent-adaptive ASR would benefit speakers with different accents and would be significant progress toward next-generation automatic speech recognition.
    In this thesis, we use the English Across Taiwan (EAT) and Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpora to represent Taiwanese-accented and native-speaker English speech, respectively, for comparison and accent analysis. We propose methods for automatically detecting the voice onset time (VOT) of English speech and for detecting stress on English syllables. We use the Teager energy operator (TEO) and the wavelet transform to measure the voice onset time of Taiwanese-accented English speech. We also propose an SVM-based stress detector for Taiwanese-accented English syllables. The feature set consists of perceptual features, MFCC, delta-MFCC, and delta-delta-MFCC, and probabilistic SVMs implement the stress classifier. Applied together, the proposed voice onset time and syllable stress detection methods can help automatic speech recognition systems improve their recognition performance on accented speech.
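
    The Teager energy operator named in the abstract has a standard discrete form due to Kaiser: psi[x(n)] = x(n)^2 - x(n-1) * x(n+1). The sketch below shows only that textbook operator, not the thesis's VOT detection algorithm, whose details are not given in this record. The function name `teager_energy` and the test sinusoid are illustrative assumptions.

```python
import math


def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns values for interior samples only (n = 1 .. len(x) - 2),
    because the operator needs one neighbour on each side.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Illustrative check (assumed test signal): for a pure sinusoid
# A * cos(omega * n), the operator output is the constant A^2 * sin(omega)^2.
A, omega = 2.0, 0.3
signal = [A * math.cos(omega * n) for n in range(50)]
energy = teager_energy(signal)
expected = (A * math.sin(omega)) ** 2
```

    For a steady sinusoid the output is constant and grows with both amplitude and frequency, so abrupt changes in the output mark abrupt changes in the signal. That sensitivity is a plausible reason for using the operator to locate boundaries such as the end of a consonant burst.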

    摘要 (Chinese abstract)
    ABSTRACT
    ACKNOWLEDGEMENT
    CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
    1.1 Background
    1.2 Motivation
    1.3 Organization of the Thesis
    CHAPTER 2 FRAMEWORK OF THE PROPOSED SYSTEM
    2.1 Design Flow of Voice Onset Time Detection
    2.2 Design Flow of Stress Detection
    2.2.1 Using Entire Word in Stress Detection
    2.2.2 Using Syllable Vowel in Stress Detection
    CHAPTER 3 VOICE ONSET TIME DETECTION FOR TAIWANESE-ACCENTED ENGLISH SPEECH
    3.1 Definition of Voice Onset Time
    3.2 Analysis of Voice Onset Time
    3.2.1 Analysis of Voice Onset Time by using the Teager Energy Operator
    3.2.2 Analysis of Vowel and Voice Onset Region
    3.2.3 Spectrogram Analysis of Voice Onset Time
    3.3 Algorithm for Voice Onset Time Detection
    3.3.1 Vowel Onset Point Detection using Wavelet Transforms Method
    3.3.2 End Detection of Consonant using Teager Energy Operator
    CHAPTER 4 STRESS DETECTION BASED ON SVMS FOR TAIWANESE-ACCENTED ENGLISH SPEECH
    4.1 Categories of Stress
    4.2 Prosodic Features for Stress Detection
    4.2.1 Preprocessing
    4.2.2 Frame Based and File Based Stress Classification
    4.2.3 Perceptual Feature
    4.2.4 Mel-Frequency Cepstral Coefficient and Delta-MFCC
    4.2.5 Normalization and Weighting
    4.3 The SVM-based Detection Method
    4.3.1 Method of Support Vector Machine
    4.3.2 Method of Multi-Class Support Vector Machine
    CHAPTER 5 EXPERIMENTAL RESULTS
    5.1 Evaluation on Voice Onset Time Detection
    5.2 Evaluation on Stress Detection
    5.2.1 Stress Detection on Words
    5.2.2 Stress Detection on Vowels
    CHAPTER 6 CONCLUSIONS AND FUTURE WORKS
    REFERENCES
    AUTHOR’S BIOGRAPHICAL NOTES


    Public release: on campus 2008-08-21; off campus 2009-08-21