| Graduate Student: | 錢鐸樟 Chien, To-Chang |
|---|---|
| Thesis Title: | 以最大熵準則結合語音及語言特徵於語音辨識之研究 (Integration of Acoustic and Linguistic Features for Maximum Entropy Speech Recognition) |
| Advisor: | 簡仁宗 Chien, Jen-Tzung |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of Publication: | 2005 |
| Graduation Academic Year: | 93 (ROC calendar; 2004-2005) |
| Language: | Chinese |
| Pages: | 92 |
| Chinese Keywords: | 語言模型 (language model), 語音模型 (acoustic model), 最大交互資訊 (maximum mutual information), 最大熵 (maximum entropy), 鑑別式 (discriminative) |
| English Keywords: | discriminative training, maximum entropy, speech recognition, maximum mutual information |
In a conventional speech recognition system, the acoustic and linguistic information sources are usually assumed to be mutually independent, and the parameters of each model are trained separately; at recognition time, the probabilities of the acoustic model and the language model are combined to form the final decision rule. However, the candidate word strings extracted during recognition and the input speech signal influence each other, and the acoustic and language models should take this relationship into account. We therefore propose an integrated maximum entropy (ME) model as the core framework of the speech recognizer, and show how linguistic and acoustic features can be combined and trained under a single consistent model. Within this framework, the covariation between acoustic and linguistic features is properly characterized inside the integrated model. On the topic of discriminative ME modeling, we establish, through theoretical analysis, the connection between the integrated model and discriminative training criteria. Furthermore, a speech recognition system built on the ME framework can effectively merge additional information sources, such as semantic topics and long-distance association patterns, into the unified acoustic and language models. We implement the new acoustic and language models in a spontaneous broadcast news transcription system and compare them against a conventional maximum-likelihood system in which the acoustic and language models are trained independently.
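To make the conventional decision rule concrete, here is a minimal sketch in standard notation; the symbols $X$, $W$, and the language-model scale $\alpha$ are illustrative assumptions, not taken from the thesis:

$$\hat{W} = \arg\max_{W} \; p(X \mid W)\, p(W)^{\alpha}$$

Here $p(X \mid W)$ is the HMM acoustic likelihood of the observations $X$ given a candidate word string $W$, and $p(W)$ is the n-gram language model probability; the two factors come from independently trained models, which is exactly the plug-in MAP independence assumption the thesis relaxes.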
In a traditional speech recognition system, the acoustic and linguistic information sources are assumed to be independent. The parameters of the acoustic hidden Markov model (HMM) and the linguistic n-gram model are estimated individually and then combined to build a plug-in maximum a posteriori (MAP) classification rule. However, the acoustic model and the language model are correlated in essence, and we should relax the independence assumption so as to improve recognition performance. In this study, we propose an integrated approach based on the maximum entropy (ME) principle, in which acoustic and linguistic features are optimally combined in a unified framework. Using this approach, the associations between acoustic and linguistic features are explored and merged in the integrated model. On the issue of discriminative training, we also establish the relationship between the ME model and the discriminative maximum mutual information (MMI) criterion. In addition, the integrated ME model is general, so semantic topics and long-distance association patterns can be further incorporated. In the experiments, we apply the proposed ME model to broadcast news transcription using the MATBN database. Preliminary results show an improvement over a conventional speech recognition system based on the plug-in MAP classification rule.
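As a hedged sketch of the integrated model described above, a conditional maximum entropy (log-linear) model over word strings can be written as follows; the notation is standard for ME modeling, and the feature functions $f_i$ are illustrative assumptions rather than the thesis' actual feature definitions:

$$p_{\Lambda}(W \mid X) = \frac{1}{Z_{\Lambda}(X)} \exp\!\Big(\sum_{i} \lambda_i f_i(X, W)\Big), \qquad Z_{\Lambda}(X) = \sum_{W'} \exp\!\Big(\sum_{i} \lambda_i f_i(X, W')\Big)$$

Because each feature $f_i(X, W)$ may depend jointly on the acoustic observations and the word string, acoustic scores, n-gram events, semantic-topic indicators, and long-distance trigger pairs can all enter a single exponential model, with the weights $\lambda_i$ typically fitted by generalized iterative scaling. Maximizing the conditional likelihood of the reference transcriptions under such a model coincides with the MMI criterion, which is the ME-MMI connection the abstract refers to.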