| Graduate Student: | Wu, Ya-Wen (吳雅雯) |
|---|---|
| Thesis Title: | Bayesian Nonparametric Learning for Topic-Based Language Models (貝氏非參數學習應用於主題性語言模型) |
| Advisor: | Chien, Jen-Tzung (簡仁宗) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2012 |
| Academic Year of Graduation: | 100 |
| Language: | English |
| Number of Pages: | 96 |
| Chinese Keywords: | 機器學習、語音識別、語言模型、貝氏非參數學習、主題模型、自然語言處理 |
| English Keywords: | Machine Learning, Speech Recognition, Language Model, Bayesian Nonparametric Learning, Topic Model, Natural Language Processing |
A statistical language model is a probabilistic model that estimates the probability of the next word given the history of preceding words. Language models play an important role in many information systems, including automatic speech recognition, machine translation, handwriting recognition, and spelling assistance. In general, language models suffer from data sparseness and insufficient long-distance information, which significantly degrade system performance. This thesis proposes a Bayesian nonparametric learning algorithm that extracts the latent topic information behind words and uses it to train flexible and adaptive language models from training corpora of different sizes. Through this Bayesian nonparametric learning method we perform topic extraction and back-off smoothing for language modeling. Attractively, we develop a novel topic-based language model in which the numbers of topics and parameters are determined automatically from the training corpus. To control model complexity, we introduce nonparametric priors for the topic and back-off models. By combining the hierarchical Dirichlet process and the Pitman-Yor process, we build an infinite language model whose numbers of topics and parameters can grow without bound as unlimited training data arrive. We develop the topic-based hierarchical Pitman-Yor language model (THPY-LM), which preserves the power-law property of natural language and reduces to the hierarchical Pitman-Yor language model (HPY-LM) when topic information is ignored. The THPY-LM parameters are inferred through a Gibbs sampling procedure. The topic-based hierarchical Pitman-Yor language model is evaluated on the Wall Street Journal large-vocabulary continuous speech corpus, where the proposed THPY-LM achieves lower perplexity and a lower speech recognition word error rate than the state-of-the-art modified Kneser-Ney LM and the HPY-LM.
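The back-off smoothing and power-law behaviour described above follow the hierarchical Pitman-Yor construction. As a minimal sketch, stated from the standard HPY-LM formulation rather than quoted from the thesis, the predictive probability of a word $w$ after a history $\mathbf{u}$ interpolates a discounted count with the back-off context $\pi(\mathbf{u})$:

$$
P(w \mid \mathbf{u}) \;=\; \frac{\max\!\bigl(c_{\mathbf{u}w} - d_{|\mathbf{u}|}\,t_{\mathbf{u}w},\,0\bigr)}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot}} \;+\; \frac{\theta_{|\mathbf{u}|} + d_{|\mathbf{u}|}\,t_{\mathbf{u}\cdot}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot}}\; P\bigl(w \mid \pi(\mathbf{u})\bigr),
$$

where $c_{\mathbf{u}w}$ and $t_{\mathbf{u}w}$ are the customer and table counts of the Chinese-restaurant representation, and $d_{|\mathbf{u}|}$, $\theta_{|\mathbf{u}|}$ are the discount and strength parameters shared by contexts of the same length. The nonzero discount $d$ is what yields the power-law property mentioned in the abstract, and setting $d = 0$ recovers a hierarchical Dirichlet-style smoother.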
A statistical n-gram language model aims to predict a new word given a sequence of n-1 history words. This technology plays an important role in many information systems, including automatic speech recognition, machine translation, optical character recognition, spelling assistance and many others. In general, n-gram models suffer from the problems of data sparseness and insufficient long-distance information. In this thesis, we present a Bayesian nonparametric approach to extract latent topic information and establish a scalable language model from different amounts of training data. We perform Bayesian nonparametric learning and conduct topic extraction and back-off smoothing for language modeling. Attractively, we develop a topic-based language model where the numbers of topics and n-grams are automatically determined from training data. To cope with the issue of model selection, we introduce nonparametric priors for topics and back-off n-grams. The infinite language models are constructed through the hierarchical Dirichlet process compound Pitman-Yor (PY) process. We develop the topic-based hierarchical PY language model (THPY-LM), where the power-law property is held and the hierarchical PY (HPY) LM is realizable by disregarding the topic information. A Gibbs sampling procedure is implemented for model inference. In experiments on the Wall Street Journal continuous speech corpus, the proposed THPY-LM outperforms state-of-the-art methods based on the modified Kneser-Ney LM and the HPY-LM in terms of model perplexity and word error rate.
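To make the back-off recursion concrete, the following is a minimal, illustrative Python sketch of a hierarchical Pitman-Yor restaurant node. The class name `HPYNode`, its seating-count fields, and the uniform base distribution are assumptions for illustration; the sketch omits the Gibbs sampling of seating arrangements and the topic layer that distinguish the THPY-LM described in the thesis.

```python
from collections import defaultdict

class HPYNode:
    """One restaurant (one n-gram context) in a hierarchical Pitman-Yor LM sketch."""

    def __init__(self, discount, strength, parent=None):
        self.d = discount          # Pitman-Yor discount at this context depth
        self.theta = strength      # Pitman-Yor strength (concentration)
        self.parent = parent       # restaurant of the back-off (shorter) context
        self.c = defaultdict(int)  # customers (token counts) per word type
        self.t = defaultdict(int)  # tables per word type

    def prob(self, w, vocab_size):
        """Predictive probability of word w; backs off to the parent context."""
        base = (1.0 / vocab_size if self.parent is None
                else self.parent.prob(w, vocab_size))
        c_total = sum(self.c.values())
        if c_total == 0:
            return base
        t_total = sum(self.t.values())
        word_term = max(self.c[w] - self.d * self.t[w], 0.0)
        backoff_mass = self.theta + self.d * t_total
        return (word_term + backoff_mass * base) / (self.theta + c_total)


# Illustrative usage: a bigram restaurant backing off to a unigram restaurant,
# which in turn backs off to a uniform base distribution over the vocabulary.
unigram = HPYNode(discount=0.5, strength=1.0)
bigram = HPYNode(discount=0.7, strength=1.0, parent=unigram)
unigram.c["the"] += 3; unigram.t["the"] += 1
bigram.c["the"] += 2;  bigram.t["the"] += 1
print(bigram.prob("the", vocab_size=10000))
```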