| Graduate student: | Lee, Kau-Yuan (李國源) |
|---|---|
| Thesis title: | 自適性隱藏式馬可夫模型拓撲於語音辨識之應用 (Self-Organized Hidden Markov Model Topology for Speech Recognition) |
| Advisor: | Chien, Jen-Tzung (簡仁宗) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 |
| Language: | Chinese |
| Number of pages: | 121 |
| Keywords (Chinese): | 隱藏式馬可夫模型 (hidden Markov model), 拓撲 (topology) |
| Keywords (English): | HMM, topology |
This thesis presents a novel hidden Markov model (HMM) adaptation algorithm whose main goal is to resolve the problems that pronunciation variation causes in speech recognition. Given adaptation data from a particular dialect region or speaker, the HMM state topology is adjusted and the HMM parameters are updated so that the adapted model better reflects the pronunciation variations present in the adaptation data.
In learning the adapted HMM topology, this thesis proposes two hypothesis-testing schemes that detect pronunciation variations at the state level and at the phone level, respectively. Both schemes are realized through a likelihood-ratio test whose test statistic is approximated by a chi-square distribution, so that, under a given confidence level, the model automatically generates the optimal HMM state topology from the test statistics measured at the state and mixture levels. At the same time, given the newly generated HMM state topology, the HMM parameters and hyperparameters are incrementally updated by maximum likelihood (ML) estimation and by maximum a posteriori (MAP) estimation under the Bayesian learning framework. With these methods, the HMM adapted to the acoustic characteristics of different regions or speakers represents pronunciation differences in finer detail and achieves better recognition results. In the experiments, the TIMIT corpus is used to evaluate the performance of the proposed approach. Experimental results show that, with a comparable number of parameters, the proposed method yields a significant improvement in recognition accuracy over the standard HMM.
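To make the state/phone-level detection concrete, the following is a minimal sketch, not the thesis implementation, of a likelihood-ratio test with a chi-square threshold that decides whether the adaptation frames assigned to one HMM state are better explained by two Gaussians than by one. The diagonal-Gaussian assumption and the crude temporal split are illustrative only, and all names are hypothetical.

```python
# Illustrative sketch only: likelihood-ratio test with a chi-square threshold
# for detecting a pronunciation variation at a single HMM state.
import numpy as np
from scipy.stats import chi2

def gaussian_loglik(x, mean, var):
    """Total log-likelihood of frames x under a diagonal Gaussian."""
    return np.sum(-0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))

def detect_variation(frames, alpha=0.05):
    """Return True if two Gaussians fit the adaptation frames significantly
    better than one (H0: a single Gaussian suffices)."""
    frames = np.asarray(frames, dtype=float)
    if len(frames) < 4:                  # too little data to test
        return False
    dim = frames.shape[1]

    # H0: one diagonal Gaussian over all frames.
    ll0 = gaussian_loglik(frames, frames.mean(axis=0), frames.var(axis=0) + 1e-6)

    # H1: two diagonal Gaussians, here crudely split in time
    # (a real system would use EM or state-level alignment).
    half = len(frames) // 2
    ll1 = sum(
        gaussian_loglik(part, part.mean(axis=0), part.var(axis=0) + 1e-6)
        for part in (frames[:half], frames[half:])
    )

    # The statistic 2*(ll1 - ll0) is treated as chi-square distributed with
    # degrees of freedom equal to the extra free parameters under H1.
    extra_params = 2 * dim               # one extra mean and variance vector
    threshold = chi2.ppf(1.0 - alpha, df=extra_params)
    return 2.0 * (ll1 - ll0) > threshold
```

If the test fires, the state topology would be grown (e.g. the state split or a parallel path added) before the parameters are re-estimated.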
This thesis presents an adaptive algorithm for compensating pronunciation variations in hidden Markov model (HMM) based speech recognition. The proposed method adapts the HMM state topology, the mixture topology and the corresponding HMM parameters to accommodate variations due to speaker dialects or other speaker characteristics. In the adaptive HMM topology learning algorithm, two hypothesis-test schemes are designed to detect whether a new speaking variation occurs at the state or phone level. The test statistics are approximated by chi-square densities, and a new HMM topology is automatically generated under a given significance level. Simultaneously, according to the newly generated HMM topology, maximum likelihood (ML) estimation and maximum a posteriori (MAP) estimation under the Bayesian learning framework are used to incrementally update the HMM parameters and their hyperparameters. With the proposed learning algorithm, pronunciation variations are handled by a dialect/speaker-adaptive HMM topology, which yields better recognition results. We develop an incremental algorithm for corrective training of the HMM topology and parameters. In the experiments, the TIMIT corpus is used to evaluate the effectiveness of the proposed approach. Experimental results show that the proposed adaptive topology learning algorithm performs substantially better than the standard HMM with a comparable number of parameters.
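As one concrete instance of the MAP update mentioned above, the classical formulation interpolates a Gaussian mean between its prior and the adaptation statistics. The sketch below uses hypothetical names and assumes frame-level occupation probabilities are already available from a forward-backward pass; the exact update rules used in the thesis may differ.

```python
# Illustrative sketch of an incremental MAP-style update of a Gaussian mean;
# not the thesis's exact update equations.
import numpy as np

def map_mean(mu0, tau, frames, gammas):
    """MAP estimate of a Gaussian mean.
    mu0   : prior mean (hyperparameter), shape (D,)
    tau   : prior weight (hyperparameter), scalar
    frames: adaptation frames, shape (T, D)
    gammas: occupation probabilities of this Gaussian, shape (T,)
    """
    frames = np.asarray(frames, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    occ = gammas.sum()                                   # soft frame count
    weighted_sum = (gammas[:, None] * frames).sum(axis=0)
    # With little data the prior dominates; with much data the estimate
    # approaches the ML (occupancy-weighted sample) mean.
    return (tau * mu0 + weighted_sum) / (tau + occ)

def incremental_map_step(mu0, tau, frames, gammas):
    """One incremental step: update the mean, then fold the accumulated
    occupation into the hyperparameters for the next adaptation batch."""
    occ = np.asarray(gammas, dtype=float).sum()
    return map_mean(mu0, tau, frames, gammas), tau + occ
```

Applying `incremental_map_step` batch by batch lets the updated mean and weight serve as the prior for the next portion of adaptation data, which is the sense in which the parameters and hyperparameters are updated incrementally.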