
Graduate student: 柳沃辰
Liu, Wo-Chen
Thesis title: 基於討論參與度與非正規網路語言增強模型之微網誌內容融合系統
A Microblog Content Fusion System Based on User Participation Degree and Enhanced NIL Model
Advisor: 郭耀煌
Kuo, Yau-Hwang
Degree: Master
College and department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: English
Number of pages: 46
Chinese keywords: 微網誌, 使用者生成的內容, 短文, 過濾, 內容融合
English keywords: Microblog, User-generated Content, Short Text, Filtering, Content Fusion
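  •   On microblogs, users often rely on short text (for example, abbreviations) and non-textual elements (for example, hyperlinks, videos, and emoticons) to work around the limit on content length. However, responses on microblogs often contain ambiguous, unnecessary, or off-topic messages, and these messages affect the results of analysis. In addition, microblog posts and responses often contain Network Informal Language (NIL), such as misspelled words, phonetic substitutions, and abbreviations. This thesis proposes a new method that performs the following steps for each post: it filters out responses unrelated to the original post and finds the Maximum Discussion Group (MDG) based on the degree of discussion participation. The detected MDG serves as the basis for computing the post's score. After scoring, the higher-ranked posts are selected, and the modified NIL model is combined with the Lexical Chain model to pick meaningful and important keywords from them. To enrich the final fused content, related content is selected from multiple microblog platforms for content fusion.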
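      In the experiments, data for three keywords, “林書豪”, “馬英九”, and “蔡英文”, were collected from Plurk and Facebook, and a keyword-related ENIL dictionary was built. The word segmentation precision of the ENIL model was also compared with that of the Academia Sinica word segmentation system (CKIP). The results show that the proposed approach improves segmentation precision by 7.4% to 17.5% over CKIP. For the overall performance evaluation, NDCG was used to measure users' satisfaction with the relevance between the results and the query terms; the results show that most users consider the system capable of providing good fusion results.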

    Because of the limit on content length, microblog users publish their opinions as condensed text mixed with non-textual content. User-generated content also often includes chaotic messages, useless information, or information unrelated to the theme of the original post. In addition, microblog posts and responses contain Network Informal Language (NIL), such as abbreviations, misspelled words, and phonetic substitutions. This thesis proposes a novel approach, Maximum Discussion Group Detection (MDGD), which is applied to each post and its responses. Briefly, the MDGs with a higher user participation degree are selected, and significant terms are extracted from the unconventional expressions in microblog posts using modified NIL and Lexical Chain models. To enrich the fusion results, related content is retrieved from multiple microblog platforms according to the previously extracted terms.
    In the experiments, the test data set was collected from the microblog platforms Plurk and Facebook and covers the terms “林書豪”, “馬英九”, and “蔡英文”. An NIL dictionary was then constructed for the ENIL model. Compared with CKIP, the segmentation precision of ENIL is 7.4% to 17.5% higher. Finally, the NDCG metric is used to evaluate user satisfaction with the fusion results. The user satisfaction results show that the system is capable of providing good fused results.
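The NDCG evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes graded relevance ratings collected from users for a ranked list of fused results and uses the common log2(rank + 1) discount. The rating scale, the discount variant, and the function names `dcg` and `ndcg` are illustrative assumptions, not details taken from the thesis.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of ratings listed in ranked order.

    Assumption: the log2(rank + 1) discount, a common NDCG variant.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances: Sequence[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        # No relevant items (or an empty list): define NDCG as 0.
        return 0.0
    return dcg(relevances) / ideal


if __name__ == "__main__":
    # Hypothetical user ratings (0 = irrelevant, 3 = highly relevant)
    # for the top five fused results of one query.
    ratings = [3, 2, 3, 0, 1]
    print(f"NDCG = {ndcg(ratings):.4f}")
```

A score of 1.0 means the system ranked the results exactly in the order users would prefer; lower scores indicate that highly rated results appear further down the list.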

    List of Tables
    List of Figures
    Chapter 1 Introduction
    1.1 Motivation
    1.2 Contributions
    1.3 Organization
    Chapter 2 Background and Related Work
    2.1 Summarization
    2.1.1 Document Summarization
    2.2 Segmentation
    2.3 Microblog
    Chapter 3 Multi-Feature Analysis for Microblog Content Fusion
    3.1 Behavior-based Feature Extraction
    3.2 Feature-based Filtering
    3.3 Maximum Discussion Group Detection
    3.4 A Novel Term Extraction Method
    3.4.1 Notations
    3.4.2 Enhanced Network Informal Language Model
    3.4.3 Word Segmentation
    3.4.4 Term Frequency Weighting
    3.4.5 Singular Vector Decomposition
    3.4.6 Candidate Posts Selection
    3.5 Multiple Post Selection
    3.6 Multi-source Fusion
    3.7 Time Complexity Analysis
    Chapter 4 Experiment
    4.1 NDCG
    4.2 Results and Analysis
    4.2.1 Parameters
    4.2.2 Feature-based Filtering
    4.2.3 Maximum Discussion Group Detection
    4.2.4 A Novel Keyword Extraction Method
    4.2.5 Multi-source Fusion
    Chapter 5 Conclusion and Future Work
    References


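    On campus: open access from 2017-09-06
    Off campus: not available
    The electronic thesis has not been authorized for public release; for the print copy, please consult the library catalog.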