| Graduate Student: | 蔡昀哲 Tsai, Yun-Che |
|---|---|
| Thesis Title: | 具韌性之階層式主題模型 Hierarchical and Flexible Topic Modeling |
| Advisor: | 李同益 Lee, Tong-Yee |
| Co-Advisor: | 簡仁宗 Chien, Jen-Tzung |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2013 |
| Graduation Academic Year: | 101 (ROC calendar) |
| Language: | English |
| Pages: | 101 |
| Chinese Keywords: | 機器學習, 貝氏非參數學習, 主題模型, 印度晚宴程序 |
| Keywords: | Machine Learning, Bayesian Nonparametric Learning, Topic Model, Indian Buffet Process |
In recent years, the enormous volume of multimedia documents shared on the Internet has grown beyond what human effort can easily handle, so the development of automatic document indexing and automatic document clustering techniques has become very important. Unsupervised learning techniques from machine learning can automatically extract the key content or topics of each document and effectively assign documents to their labels or groups, enabling well-organized information systems for document summarization, document classification and document retrieval, and saving users a great deal of time in reading and searching. Statistical document models based on latent topics have therefore become a critical research topic in natural language processing. However, traditional topic models are limited in that (1) the number of topics is fixed, (2) topics are assumed to be mutually independent, and (3) topic selection is inflexible. This thesis proposes a novel method that tackles all three limitations at once, achieving a flexible hierarchical topic model with flexible topic selection from heterogeneous documents.
The topic model in this thesis is built on latent Dirichlet allocation (LDA). With this kind of topic model, when new documents are collected, the probability model after incorporating a new document can be estimated from the existing model parameters without retraining. We first capture the dependencies between topics through hierarchical latent Dirichlet allocation (hierarchical LDA, hLDA), and then follow Bayesian nonparametric learning to relax the constraint on the number of topics: the number of branches and layers in the tree structure of the topic mixture model is extended automatically and without limit from the training corpus. This establishes an infinite topic model, solves the model selection problem for topic models, and effectively assigns the words of different documents in the corpus to the layers and branches of the tree structure, distinguishing the characteristics of each word and extracting hierarchical clustering of words. Bayesian nonparametric learning and inference are realized through the Indian buffet process (IBP) and the tree-structured stick-breaking process, establishing a flexible topic model with flexible topic selection. The tree-structured stick-breaking process is used to obtain the topic proportions, or weights, that represent the words of a document, while the Indian buffet process uses Beta distributions to automatically detect from the words of a document whether the topic at each tree node is activated to represent the target word. This removes the limitation of the traditional nested Chinese restaurant process (nCRP), in which only the node topics along a single fixed tree path can be used, and realizes a highly adaptive hierarchical model of word topics. This thesis solves the model inference problem through a novel Indian buffet scenario using Gibbs sampling: we derive the posterior probabilities of the latent variables and model parameters in this scenario, sample the model parameters and each word's layer branch or path from these probabilities, and obtain the word clustering results. In the experiments, different document collections are used to evaluate the proposed method's performance in document representation and document retrieval.
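To make the role of the Indian buffet process above concrete, here is a minimal Python sketch of drawing a binary item-by-topic activation matrix from an IBP prior. The function name `sample_ibp` and its parameters are illustrative assumptions for this example; the sketch shows the prior only, not the inference procedure developed in this thesis.

```python
import numpy as np

def sample_ibp(num_items, alpha, rng=None):
    """Draw a binary item-by-topic activation matrix Z from an Indian
    buffet process prior with concentration parameter alpha.

    Z[i, k] = 1 means candidate topic k is activated for item i.  The
    number of columns (topics) is not fixed in advance: it grows as new
    items arrive, which is what removes the fixed-topic-number limit.
    """
    rng = np.random.default_rng() if rng is None else rng
    Z = np.zeros((num_items, 0), dtype=int)
    for i in range(num_items):
        if Z.shape[1] > 0:
            # Reuse an existing topic k with probability m_k / (i + 1),
            # where m_k counts the previous items that activated it.
            m = Z[:i].sum(axis=0)
            Z[i] = rng.random(Z.shape[1]) < m / (i + 1)
        # Activate a Poisson(alpha / (i + 1)) number of brand-new topics.
        k_new = rng.poisson(alpha / (i + 1))
        if k_new > 0:
            new_cols = np.zeros((num_items, k_new), dtype=int)
            new_cols[i] = 1
            Z = np.hstack([Z, new_cols])
    return Z

# The realized number of topics is data-driven, not preset.
Z = sample_ibp(num_items=10, alpha=2.0, rng=np.random.default_rng(0))
print(Z.shape, Z.sum(axis=0))
```

Because each row may activate any subset of columns, and new columns keep appearing as items arrive, the topics representing a word are neither fixed in number nor restricted to a single tree path.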
In the information age, with the explosion of data and the huge variety and uncertainty of media content on the Internet, it is difficult for users to extract useful information within a short period of time. To save manpower and tackle information extraction from vast document collections, the technologies of document representation, word clustering and document indexing have become much more important than before. Users can save a great deal of time in searching for what they want to know and in extracting the documents that fit their interests. For this reason, unsupervised learning plays a crucial role in the construction of information systems, e.g. document summarization, text categorization and information retrieval. In the literature, statistical document representation based on latent topic models has had a strong impact on natural language processing and has been developed for many applications. However, the traditional topic model is constrained by the assumptions that (1) the number of topics is fixed, (2) different topics are independent and (3) topics are selected along a single tree path. This thesis presents a new approach that relaxes these three assumptions and builds a flexible topic model with adaptive topic selection from heterogeneous documents.
In this thesis, we start from the topic-based document model using latent Dirichlet allocation (LDA), where latent topics are assumed to be independent. The predictive distribution of a new document is calculated without model retraining. By considering the dependencies between topics, we extend LDA to the hierarchical LDA, or the nested Chinese restaurant process (nCRP), where a hierarchical tree structure of topics, ranging from the global topic in the root layer to specific topics in the leaf layer, is constructed for document representation. The tree nodes represent word clusters at different levels. We further relax the limitation of a topic model with a fixed number of topics. A Bayesian nonparametric method is developed for data representation, where the model selection problem is tackled by generating new topics or mixture components without bound as more documents are observed. The infinite topic model is established without limitation on the number of tree layers and tree branches. When new documents are enrolled, the complexity of the probability model is gradually and automatically enlarged according to Bayesian nonparametric theory, so the issues of model selection and over-estimation are resolved. In addition, we relax the limitation of the nCRP, in which only the topics along a fixed tree path are used to represent a document. The Indian buffet process (IBP) and the tree-structured stick-breaking process are introduced, and the nested IBP (nIBP), or the nCRP compound IBP, is proposed to conduct flexible topic selection for word representation. A new scenario is designed to infer the proposed nIBP based on a collapsed Gibbs sampling procedure. We derive the posterior probabilities of the latent variables and use them to sample model parameters as well as to draw the associated branches or select the tree paths for document representation. In the experiments, different text corpora are used to evaluate the performance of the nIBP for document representation and document retrieval.
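As a companion sketch, the following Python code draws node probabilities from a truncated version of the tree-structured stick-breaking process mentioned above. The truncation to a finite depth and branching factor, the last child absorbing the remaining stick, and the function name `tree_stick_breaking` are all conveniences assumed for this illustration, not the thesis's implementation.

```python
import numpy as np

def tree_stick_breaking(depth, branching, alpha0, gamma, rng=None):
    """Assign probability mass to every node of a depth- and
    branching-truncated tree via tree-structured stick breaking.

    At each node, nu ~ Beta(1, alpha0) is the fraction of the incoming
    mass that stops at the node; the remainder is divided among the
    child subtrees by a stick-breaking sequence psi ~ Beta(1, gamma).
    Returns a dict mapping a node's path (tuple of child indices) to
    its probability mass; the masses sum to one.
    """
    rng = np.random.default_rng() if rng is None else rng
    masses = {}

    def descend(path, mass, level):
        nu = rng.beta(1.0, alpha0)
        masses[path] = mass * nu               # mass stopping at this node
        leftover = mass * (1.0 - nu)
        if level == depth:                     # truncation: absorb leftover
            masses[path] += leftover
            return
        stick = 1.0
        for c in range(branching):
            # Last child takes the rest so no mass escapes the truncation.
            psi = rng.beta(1.0, gamma) if c < branching - 1 else 1.0
            descend(path + (c,), leftover * stick * psi, level + 1)
            stick *= 1.0 - psi

    descend(path=(), mass=1.0, level=0)
    return masses

props = tree_stick_breaking(depth=3, branching=2, alpha0=1.0, gamma=1.0,
                            rng=np.random.default_rng(0))
print(round(sum(props.values()), 6))  # 1.0: a valid distribution over nodes
```

In the proposed model, such node masses play the role of topic proportions over the tree, while an IBP-style binary mask, as in the earlier sketch, decides which node topics are activated for each word.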
Full text available on campus from 2021-12-31.