Graduate Student: 吳佳昇 (Wu, Chia-Sheng)
Thesis Title: 使用貝氏潛在語意分析於文件分類及資訊檢索 (Bayesian Latent Semantic Analysis for Text Categorization and Information Retrieval)
Advisor: 簡仁宗 (Chien, Jen-Tzung)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2005
Graduation Academic Year: 93 (ROC calendar, i.e. 2004-2005)
Language: Chinese
Pages: 85
Keywords: Language Model, Latent Semantic Analysis, Bayesian Theory, Text Categorization, Information Retrieval
Abstract:
Due to the vast growth of data collections, statistical text modeling is increasingly important for information retrieval. Probabilistic latent semantic analysis (PLSA) is a popular text modeling approach in which the semantics and statistics of documents can be effectively captured. However, PLSA is highly sensitive to the task domain, which is continuously updated in real-world applications. In this thesis, a novel Bayesian PLSA framework is presented. We focus on exploiting an incremental learning algorithm to solve the model updating problem when articles from new domains arrive. The algorithm improves text modeling by incrementally extracting up-to-date latent semantic information to match the changing domains at run time. By representing the priors of the PLSA parameters with Dirichlet densities, the posterior densities belong to the same distribution family, so that a reproducible prior/posterior mechanism is established to fulfill incremental learning from constantly accumulated data. The expectation-maximization (EM) algorithm is applied to resolve the quasi-Bayes (QB) estimates of the PLSA parameters. The resulting on-line PLSA retrieval system accomplishes parameter estimation as well as hyperparameter updating, yielding more robust estimates. Compared to standard PLSA using the maximum likelihood estimate, the proposed QB approach is capable of dynamically indexing newly added documents. We also present maximum a posteriori (MAP) PLSA for corrective model training in batch mode. Experiments on document classification and retrieval demonstrate the superiority of Bayesian PLSA.
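The abstract compresses several algorithmic steps: EM training of PLSA, a MAP variant with Dirichlet priors over the multinomial parameters, and a quasi-Bayes update that carries hyperparameters across document batches. As a rough illustration of the first two, the following Python sketch runs EM on a term-document count matrix and adds (alpha - 1) Dirichlet pseudo-counts in the M-step; the function names, the symmetric hyperparameter, and the dense tensor layout are readability assumptions, not the thesis's exact derivation.

```python
# A minimal sketch of PLSA trained by EM with Dirichlet priors (a MAP
# estimate), assuming a dense term-document count matrix. `alpha` and
# the dense (V, Z, D) responsibility tensor are illustrative choices.
import numpy as np

def plsa_map_em(counts, n_topics, alpha=1.01, n_iters=50, seed=0):
    """counts: (V, D) term-document counts; returns P(w|z), P(z|d)."""
    rng = np.random.default_rng(seed)
    V, D = counts.shape
    # Random normalized initialization of P(w|z) and P(z|d).
    p_w_z = rng.random((V, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((n_topics, D))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|w,d) over the shared topics.
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]      # (V, Z, D)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        weighted = joint * counts[:, None, :]              # n(w,d) P(z|w,d)
        # M-step: ML counts plus (alpha - 1) Dirichlet pseudo-counts,
        # i.e. the mode of the Dirichlet posterior rather than the MLE.
        nw_z = weighted.sum(axis=2) + (alpha - 1.0)
        p_w_z = nw_z / nw_z.sum(axis=0, keepdims=True)
        nz_d = weighted.sum(axis=0) + (alpha - 1.0)
        p_z_d = nz_d / nz_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```

On the same representation, the incremental quasi-Bayes step described in the abstract might look as follows; again a sketch under stated assumptions, not the author's implementation.

```python
# A hedged sketch of the quasi-Bayes incremental step: the Dirichlet
# hyperparameters absorb the expected topic-word counts of each new
# batch, so the posterior after one batch is the prior for the next
# (the reproducible prior/posterior mechanism the abstract describes).
# All names here are hypothetical; `hyper_w_z` should stay >= 1.
def qb_incremental_update(counts_new, p_w_z, p_z_d_new, hyper_w_z):
    """counts_new: (V, D_new) counts of incoming documents;
    p_z_d_new: (Z, D_new) topic mixtures for them (e.g. obtained by a
    few folding-in EM iterations); hyper_w_z: (V, Z) hyperparameters."""
    joint = p_w_z[:, :, None] * p_z_d_new[None, :, :]
    joint /= joint.sum(axis=1, keepdims=True) + 1e-12
    expected = (joint * counts_new[:, None, :]).sum(axis=2)  # E[n(w,z)]
    hyper_w_z = hyper_w_z + expected        # posterior becomes next prior
    pseudo = hyper_w_z - 1.0
    p_w_z = pseudo / pseudo.sum(axis=0, keepdims=True)       # posterior mode
    return p_w_z, hyper_w_z
```

The point of the pseudo-count form is that previously seen data survive only through `hyper_w_z`, so new batches can be folded in without revisiting already-indexed documents, which is what makes dynamic document indexing feasible.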