| Graduate student: | 陳吉德 Chen, Ji-De |
|---|---|
| Thesis title: | 主題模型化之半監督學習法於短文本流之主題標籤推薦 Hashtag Recommendation of Streaming Short Text by Topic Model Enhanced Semi-Supervised Learning |
| Advisor: | 高宏宇 Kao, Hung-Yu |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2015 |
| Academic year of graduation: | 103 (ROC calendar) |
| Language: | English |
| Number of pages: | 44 |
| Keywords (Chinese): | 主題標籤推薦、社群媒體、半監督式學習 |
| Keywords (English): | Hashtag Recommendation, Social Media, Semi-Supervised Learning |
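With the growth of real-time social media such as Twitter, many users share and discuss topics that interest them on such platforms. A hashtag is a metadata tag that lets users mark the topic of the tweets they post on Twitter. Hashtags also have research uses: for example, observing how hashtags trend over time makes it possible to detect important events in Twitter data more accurately. Although Twitter use has grown very quickly, hashtag usage has not grown as expected: in the dataset we collected, fewer than 20% of tweets contain a hashtag. We believe that most users may not know which hashtags their tweets could be tagged with, and that recommending suitable hashtags to users can improve the low hashtag usage rate. Hashtag recommendation can be viewed as a supervised learning problem: the more labeled data available to train the prediction model, the better the predictions. For hashtag recommendation, however, low hashtag usage means that labeled data are not plentiful, so we want to further exploit tweets without hashtags (non-hashtag tweets) to overcome this problem. Directly adding all non-hashtag tweets does not necessarily help model training, so we adopt a weight-updating semi-supervised learning mechanism to select the non-hashtag tweets that are actually useful for training. Because Twitter is real-time in nature, this mechanism must also take the temporal characteristics of hashtags into account. The experimental results in this study show that effectively using non-hashtag tweets to expand the original training data achieves better results than methods that use only labeled data.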
With the rapid growth of real-time social media such as Twitter, many users share and discuss topics of interest on these platforms. A hashtag is a type of metadata tag that lets users annotate the topics of their tweets. Hashtags are also useful in research: for example, observing hashtag trends over time can improve event detection. Although Twitter use has grown rapidly, hashtag use has not kept pace. In our dataset, fewer than 20% of tweets contain a hashtag. We attribute this to most users possibly not knowing which hashtags suit the tweets they post. Recommending suitable hashtags to users is therefore one way to raise the low hashtag usage rate. Hashtag recommendation can be treated as a supervised learning problem, in which more labeled training data generally yields better prediction performance. However, because hashtag usage is low, labeled data are scarce. To address this, we exploit unlabeled data, i.e., non-hashtag tweets: the topic model self-labels them with virtual hashtags, and they are then used to extend the training data. Directly adding all non-hashtag tweets may not help training, because some of them are noisy. We therefore apply weight-updating mechanisms to filter out the non-hashtag tweets that are not useful, namely those that may have no appropriate hashtag. Because of the real-time nature of Twitter, these mechanisms must also take the temporal characteristics of hashtags into account. Experimental results show that extending the original training data with effective non-hashtag tweets outperforms baseline methods that train only on labeled data.
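The abstract does not spell out the topic model or the weight-update rule, so the following is only a rough sketch of the general idea (self-label non-hashtag tweets, weight them, and drop low-weight ones before retraining), not the thesis's actual method. It makes several assumptions: a smoothed per-hashtag word-count model stands in for the topic model, the posterior probability of the chosen virtual hashtag stands in for the weight, and temporal characteristics are not modelled. The function names, the `min_weight` threshold, and the example tweets are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer that drops hashtag tokens."""
    return [w for w in text.lower().split() if not w.startswith("#")]


def fit(examples):
    """Accumulate weighted word counts per hashtag from (tokens, tag, weight)."""
    counts = defaultdict(Counter)
    for tokens, tag, weight in examples:
        for tok in tokens:
            counts[tag][tok] += weight
    return counts


def posterior(counts, tokens, vocab_size):
    """P(tag | tokens) with add-one smoothing and a uniform prior over tags."""
    log_scores = {}
    for tag, c in counts.items():
        total = sum(c.values())
        log_scores[tag] = sum(
            math.log((c[tok] + 1.0) / (total + vocab_size)) for tok in tokens
        )
    top = max(log_scores.values())
    exp_scores = {tag: math.exp(s - top) for tag, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {tag: s / norm for tag, s in exp_scores.items()}


def self_train(labeled, unlabeled, rounds=3, min_weight=0.7):
    """Self-label unlabeled tweets and re-weight them each round.

    Returns (text, virtual_tag, weight) for every unlabeled tweet. Tweets
    whose weight is below min_weight are left out of the next retraining.
    """
    gold = [(tokenize(text), tag, 1.0) for text, tag in labeled]
    pool = [(text, tokenize(text)) for text in unlabeled]
    vocab = {tok for tokens, _, _ in gold for tok in tokens}
    vocab |= {tok for _, tokens in pool for tok in tokens}

    virtual = []  # entries are (text, tokens, virtual_tag, weight)
    for _ in range(rounds):
        # Retrain on labeled tweets plus the self-labeled tweets that passed.
        kept = [(toks, tag, w) for _, toks, tag, w in virtual if w >= min_weight]
        model = fit(gold + kept)
        # Re-assign virtual hashtags and update every weight.
        virtual = []
        for text, tokens in pool:
            probs = posterior(model, tokens, len(vocab))
            tag = max(sorted(probs), key=probs.get)  # ties: alphabetical
            virtual.append((text, tokens, tag, probs[tag]))
    return [(text, tag, w) for text, _, tag, w in virtual]


# Hypothetical toy data, for illustration only.
labeled = [
    ("goal match football team #sports", "#sports"),
    ("election vote president debate #politics", "#politics"),
]
unlabeled = [
    "football team wins the match",
    "vote for the president",
    "good morning everyone",
]
for text, tag, weight in self_train(labeled, unlabeled):
    print(f"{weight:.2f} {tag} {text}")
```

In this toy run the greeting tweet shares no words with either hashtag, so its weight stays close to 0.5 and it never enters the retraining set, while the other two tweets pass the threshold and reinforce their virtual hashtags.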