| Graduate Student: | 蔡佩珊 Tsai, Pei-Shan |
|---|---|
| Thesis Title: | 利用發音事件為基礎之狀態單元驗證於多語辨識發音變異模型之產生 (Pronunciation Variation Model Generation for Multilingual ASR Using Pronunciation Event-Based Senone Verification) |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of Publication: | 2010 |
| Academic Year: | 98 (ROC calendar) |
| Language: | Chinese |
| Pages: | 61 |
| Keywords (Chinese): | 多語辨識、發音變異、轉換函式、發音事件 |
| Keywords (English): | Multilingual speech recognition, pronunciation variations, transformation function, pronunciation events |
With the internationalization of society, the demand for multilingual speech recognizers is steadily increasing, and pronunciation variation is a key factor that degrades recognition accuracy in multilingual recognition. Speech recognizers based on hidden Markov models have in recent years reached a considerable level of accuracy on canonical speech, but their accuracy on speech containing pronunciation variations is still insufficient. This study is therefore based on the clustered-state senone model (SEM): it performs a two-stage pronunciation variation verification that considers both acoustic characteristics and pronunciation events, uses transformation functions to generate pronunciation variation models, and uses linguistic features to predict pronunciation variation models. Pronunciation events serve to further verify candidate variations and reduce the false detections that arise when only acoustic characteristics are used; the transformation functions generate acoustic models for the pronunciation variations, with the aim of mitigating the impact of pronunciation variation on recognition accuracy; and the linguistic features cluster pronunciation variations by their acoustic characteristics, compensating for the shortage of pronunciation variation training data. By generating pronunciation variation acoustic models, the accuracy of HMM-based speech recognition is improved.

This thesis addresses pronunciation variation verification, the application of transformation functions, and pronunciation variation prediction through three research topics: 1) considering acoustic characteristics and pronunciation events in pronunciation variation verification; 2) introducing transformation functions to build pronunciation variation SEMs; and 3) using a decision tree to predict pronunciation variation models. Experimental results on the pronunciation variation models show that the proposed methods yield a considerable improvement in recognition accuracy on utterances containing pronunciation variations.
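The abstract does not specify the mathematical form of the transformation function. As a minimal sketch only, assuming an MLLR-style affine transform of a senone's Gaussian parameters (a common choice for state-level model transformation, not necessarily the form used in this thesis), the variant senone could be obtained as follows, where $A$ and $b$ are assumed transform parameters:

```latex
% Illustrative sketch only (assumed form, not taken from the thesis):
% affine transformation of a canonical senone's Gaussian parameters
% toward a pronunciation-variant senone.
\begin{aligned}
\hat{\mu}_s    &= A\,\mu_s + b                 && \text{(transformed mean)} \\
\hat{\Sigma}_s &= A\,\Sigma_s\,A^{\top}        && \text{(transformed covariance)} \\
b_{\hat{s}}(\mathbf{o}_t) &= \mathcal{N}\!\left(\mathbf{o}_t;\ \hat{\mu}_s,\ \hat{\Sigma}_s\right) && \text{(variant senone likelihood)}
\end{aligned}
```

Under this assumed form, $A$ and $b$ would be estimated from speech exhibiting the targeted variation, and the transformed senone would replace or augment the canonical one during decoding.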
Owing to the internationalization of society, the demand for automatic multilingual speech recognition systems is increasing. Pronunciation variation plays an important role in speech recognition because it seriously degrades recognition accuracy. In recent years, speech recognition systems based on hidden Markov models have achieved good performance on canonical speech; however, their accuracy on speech with pronunciation variations is still not good enough. In our approach, acoustic features and pronunciation events are considered in a two-stage pronunciation variation verification based on the senone model. Pronunciation events are used to reduce the false alarms caused by detection that relies on acoustic features alone. A transformation function is adopted to generate the variation models, addressing the degraded performance on speech with pronunciation variations, and linguistic features are applied to predict pronunciation variation models. The lack of training data is alleviated by clustering acoustic variations using linguistic features. In this way, recognition accuracy is improved by generating pronunciation variation models.

This research aims to: 1) consider acoustic features and pronunciation events for variation verification, 2) introduce a transformation function to model pronunciation variations, and 3) predict pronunciation variations using a decision tree.

The experimental results show that the proposed method achieves a significant improvement in multilingual speech recognition on speech with pronunciation variations.
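To make the two-stage verification described above concrete, the following is a minimal sketch in Python, assuming hypothetical scoring functions: an acoustic confidence score for flagging candidate variations and pronunciation-event detector scores for confirming them. The names and thresholds are placeholders for illustration and do not reproduce the thesis's implementation.

```python
# Illustrative sketch only: a simplified two-stage pronunciation variation
# verification loop in the spirit of the abstract. All names below
# (acoustic_score, event_detectors, thresholds) are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Segment:
    senone_id: str                      # canonical senone aligned to this segment
    frames: Sequence[Sequence[float]]   # acoustic feature vectors for the segment


def verify_variations(
    segments: List[Segment],
    acoustic_score: Callable[[Segment], float],               # stage 1: acoustic evidence
    event_detectors: Dict[str, Callable[[Segment], float]],   # stage 2: pronunciation events
    acoustic_threshold: float = 0.5,
    event_threshold: float = 0.5,
) -> List[Segment]:
    """Return the segments accepted as genuine pronunciation variations."""
    accepted = []
    for seg in segments:
        # Stage 1: flag candidate variations from acoustic features alone.
        if acoustic_score(seg) < acoustic_threshold:
            continue
        # Stage 2: confirm with pronunciation-event evidence, rejecting
        # candidates that only look like variations acoustically.
        best_event = max(
            (detector(seg) for detector in event_detectors.values()),
            default=0.0,
        )
        if best_event >= event_threshold:
            accepted.append(seg)
    return accepted
```

The point of the second stage is that a segment flagged only by acoustic mismatch is rejected unless at least one pronunciation-event detector also supports it, which is how the abstract describes reducing false alarms.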