| Author: | 陳冠斌 Chen, Guan-Bin |
|---|---|
| Thesis Title: | 強化字詞共現性之短文本主題模型改進 (Word Co-occurrence Augmented Topic Model in Short Text) |
| Advisor: | 高宏宇 Kao, Hung-Yu |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2015 |
| Graduation Academic Year: | 103 (ROC calendar; 2014–2015) |
| Language: | English |
| Pages: | 81 |
| Keywords (Chinese): | 短文本, 主題模型, 文件分類, 文件分群 |
| Keywords (English): | Short Text, Topic Model, Document Clustering, Document Classification |
The sheer volume of text on the Internet makes it difficult for people to absorb and understand it in a limited amount of time. Topic models such as pLSA and LDA were proposed to summarize long documents into a few representative topic words. In recent years, the rise of social networks such as Twitter has also produced a rapidly growing number of short documents, and summarizing and organizing such a large collection of short texts has become an important problem, motivating the application of topic models to short text. However, directly applying topic models to short texts often yields incoherent topics, because a single short document does not contain enough words to reliably estimate the word co-occurrence patterns of its topics. According to the literature we reviewed, the biterm topic model (BTM) directly models word co-occurrence through the biterms of the entire corpus, which effectively alleviates the word-sparsity problem of individual documents. However, BTM considers only the raw co-occurrence frequency of biterms during estimation, so the resulting topics are easily dominated by single high-frequency words.
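To make the biterm idea concrete, the following is a minimal sketch of how BTM-style biterm extraction works: every unordered word pair within a short document is enumerated and counts are pooled over the whole corpus. The function name and toy corpus are our own illustration, not code from the thesis.

```python
from itertools import combinations
from collections import Counter

def extract_biterms(doc_tokens):
    """Enumerate all unordered word pairs (biterms) in one short document.

    BTM pools these pairs over the entire corpus, so co-occurrence is
    estimated globally rather than within each sparse document.
    """
    # Sorting each pair makes (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

corpus = [["apple", "iphone", "release"],
          ["apple", "iphone", "review"]]
biterm_counts = Counter(b for doc in corpus for b in extract_biterms(doc))
print(biterm_counts.most_common(3))  # ('apple', 'iphone') appears twice
```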
This work proposes two word co-occurrence based improvements to topic models: one addresses the word-sparsity problem of LDA on individual documents, and the other mitigates BTM's tendency to let high-frequency words dominate topics. For the LDA problem, we propose RO-LDA, which reorganizes all co-occurring words in the corpus and filters out noise to form virtual documents, on which the original LDA is then trained. For the BTM problem, our PMI-β-BTM method incorporates pointwise mutual information (PMI) scores into the prior distribution over topic words, reducing the influence of single high-frequency words. Experimental results show that RO-LDA achieves better topic coherence on noisy tweets, while PMI-β-BTM performs better on more regular news titles. Moreover, neither method modifies the original topic model itself, so both can be applied directly to LDA- or BTM-based derivative models.
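The standard PMI score that PMI-β-BTM builds on is PMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) p(w_j)) ), estimated here from document-level co-occurrence counts. The thesis defines exactly how these scores enter the β prior; the `asymmetric_beta` reweighting rule below is only our illustrative assumption of the general idea (boosting words that participate in strongly associated pairs, which damps lone high-frequency words).

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(corpus):
    """PMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) * p(w_j)) ),
    with probabilities estimated from document frequencies."""
    n_docs = len(corpus)
    word_df = Counter(w for doc in corpus for w in set(doc))
    pair_df = Counter(p for doc in corpus
                      for p in combinations(sorted(set(doc)), 2))
    return {
        (a, b): math.log((cnt / n_docs) /
                         ((word_df[a] / n_docs) * (word_df[b] / n_docs)))
        for (a, b), cnt in pair_df.items()
    }

def asymmetric_beta(vocab, pmi, base=0.01):
    """Illustrative only: give extra prior mass to words appearing in
    positively associated pairs, instead of a flat symmetric beta."""
    boost = Counter()
    for (a, b), score in pmi.items():
        if score > 0:            # keep only positively associated pairs
            boost[a] += score
            boost[b] += score
    return [base * (1.0 + boost[w]) for w in vocab]
```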
The large amount of text on the Internet makes it hard for people to digest its meaning in a limited time. Topic models (e.g., pLSA and LDA) have been proposed to summarize long texts into several topic terms. In recent years, short-text media such as tweets have become very popular. However, directly applying a traditional topic model to a short-text corpus usually yields non-coherent topics, because a short document does not contain enough words to reveal its word co-occurrence patterns. The biterm topic model (BTM) has been proposed to alleviate this problem, but BTM considers only raw biterm frequency, which causes the generated topics to be dominated by common words. In this thesis, we address the lack of local word co-occurrence in LDA and the frequent-biterm problem in BTM by proposing two word co-occurrence based enhancements to topic models. First, we inject word co-occurrence information into BTM (PMI-β-BTM). Second, we generate new virtual documents by reorganizing the words in the corpus and then apply the traditional, unmodified LDA (RO-LDA). Experimental results show that RO-LDA performs well on the noisy tweet dataset, while PMI-β-BTM performs well on the more regular short news titles. Our methods have two further advantages: they require no external data, and they are built on the original topic models without modifying the models themselves, so they can easily be applied to other existing LDA- or BTM-based models.
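The thesis specifies RO-LDA's exact reorganization and noise-filtering procedure in its method chapters; as a rough sketch of the idea only, one might group each pivot word with the words that co-occur with it across the corpus and treat each group as a virtual document for standard LDA. The grouping rule and the `min_cooccur` threshold below are our assumptions, not the thesis's procedure.

```python
from collections import defaultdict
from itertools import combinations

def build_virtual_documents(corpus, min_cooccur=2):
    """Illustrative RO-LDA-style reorganization: one virtual document per
    pivot word, holding the words that co-occur with it at least
    `min_cooccur` times corpus-wide (a simple noise filter)."""
    cooccur = defaultdict(int)
    for doc in corpus:
        for a, b in combinations(sorted(set(doc)), 2):
            cooccur[(a, b)] += 1
    virtual = defaultdict(list)
    for (a, b), cnt in cooccur.items():
        if cnt >= min_cooccur:   # drop rare, likely noisy pairs
            virtual[a].append(b)
            virtual[b].append(a)
    # Each value is a virtual document fed to standard, unmodified LDA.
    return list(virtual.values())
```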