Graduate Student: 吳佳昇 (Wu, Chia-Sheng)
Thesis Title: 使用貝氏潛在語意分析於文件分類及資訊檢索 (Bayesian Latent Semantic Analysis for Text Categorization and Information Retrieval)
Advisor: 簡仁宗 (Chien, Jen-Tzung)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2005
Graduation Academic Year: 93 (ROC calendar, i.e. 2004-2005)
Language: Chinese
Pages: 85
Keywords: Language Model, Latent Semantic Analysis, Bayesian Theory, Text Categorization, Information Retrieval
Abstract:
Due to the vast growth of data collections, statistical text modeling is increasingly important for information retrieval. Probabilistic latent semantic analysis (PLSA) is a popular text modeling approach in which the semantics and statistics of documents can be effectively captured. However, PLSA is highly sensitive to the task domain, which is continuously updated in real-world applications. In this thesis, a novel Bayesian PLSA framework is presented. We focus on exploiting an incremental learning algorithm to solve the model updating problem when articles from new domains arrive. The algorithm improves text modeling by incrementally extracting up-to-date latent semantic information to match the changing domains at run time. By representing the priors of the PLSA parameters with Dirichlet densities, the posterior densities belong to the same distribution family, so that a reproducible prior/posterior mechanism is established to fulfill incremental learning from constantly accumulated data. The expectation-maximization (EM) algorithm is applied to resolve the quasi-Bayes (QB) estimates of the PLSA parameters. The resulting on-line PLSA retrieval system accomplishes parameter estimation as well as hyperparameter updating, yielding more robust estimates. Compared to standard PLSA using the maximum likelihood estimate, the proposed QB approach is capable of dynamically indexing newly added documents. We also present maximum a posteriori (MAP) PLSA for corrective model training in batch mode. Experiments on document classification and retrieval demonstrate the superiority of Bayesian PLSA.
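The abstract compresses several algorithmic steps: EM training of PLSA, a MAP variant with Dirichlet priors over the multinomial parameters, and a quasi-Bayes update that carries hyperparameters across document batches. As a rough illustration of the first two, the following Python sketch runs EM on a term-document count matrix and adds (alpha - 1) Dirichlet pseudo-counts in the M-step; the function names, the symmetric hyperparameter, and the dense tensor layout are readability assumptions, not the thesis's exact derivation.

```python
# A minimal sketch of PLSA trained by EM with Dirichlet priors (a MAP
# estimate), assuming a dense term-document count matrix. `alpha` and
# the dense (V, Z, D) responsibility tensor are illustrative choices.
import numpy as np

def plsa_map_em(counts, n_topics, alpha=1.01, n_iters=50, seed=0):
    """counts: (V, D) term-document counts; returns P(w|z), P(z|d)."""
    rng = np.random.default_rng(seed)
    V, D = counts.shape
    # Random normalized initialization of P(w|z) and P(z|d).
    p_w_z = rng.random((V, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, D))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|w,d) over the shared topics.
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]      # (V, Z, D)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        weighted = joint * counts[:, None, :]              # n(w,d) P(z|w,d)
        # M-step: ML counts plus (alpha - 1) Dirichlet pseudo-counts,
        # i.e. the mode of the Dirichlet posterior rather than the MLE.
        nw_z = weighted.sum(axis=2) + (alpha - 1.0)
        p_w_z = nw_z / nw_z.sum(axis=0, keepdims=True)
        nz_d = weighted.sum(axis=0) + (alpha - 1.0)
        p_z_d = nz_d / nz_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```

On the same representation, the incremental quasi-Bayes step described in the abstract might look as follows; again a sketch under stated assumptions, not the author's implementation.

```python
# A hedged sketch of the quasi-Bayes incremental step: the Dirichlet
# hyperparameters absorb the expected topic-word counts of each new
# batch, so the posterior after one batch is the prior for the next
# (the reproducible prior/posterior mechanism the abstract describes).
# All names here are hypothetical; `hyper_w_z` should stay >= 1.
def qb_incremental_update(counts_new, p_w_z, p_z_d_new, hyper_w_z):
    """counts_new: (V, D_new) counts of incoming documents;
    p_z_d_new: (Z, D_new) topic mixtures for them (e.g. obtained by a
    few folding-in EM iterations); hyper_w_z: (V, Z) hyperparameters."""
    joint = p_w_z[:, :, None] * p_z_d_new[None, :, :]
    joint /= joint.sum(axis=1, keepdims=True) + 1e-12
    expected = (joint * counts_new[:, None, :]).sum(axis=2)  # E[n(w,z)]
    hyper_w_z = hyper_w_z + expected        # posterior becomes next prior
    pseudo = hyper_w_z - 1.0
    p_w_z = pseudo / pseudo.sum(axis=0, keepdims=True)       # posterior mode
    return p_w_z, hyper_w_z
```

The point of the pseudo-count form is that previously seen data survive only through `hyper_w_z`, so new batches can be folded in without revisiting already-indexed documents, which is what makes dynamic document indexing feasible.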