| Graduate student: | 陳建成 Chen, Chien-Cheng |
|---|---|
| Thesis title: | 熱門搜尋:從新聞和部落格尋找熱門事件 HOT Search: Finding Hot Events from News and Blog |
| Advisor: | 盧文祥 Lu, Wen-Hsiang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 |
| Language: | Chinese |
| Number of pages: | 41 |
| Keywords (Chinese): | 摘要、搜尋 |
| Keywords (English): | summarization, search |
News is an indispensable source of knowledge and information in daily life, and the move to electronic publishing has sharply increased the volume of news available. Faced with so much information, users often lose track of the knowledge they are looking for. We also believe that news retrieval based only on time and relevance cannot fully satisfy users' needs, so we aim to provide a more effective search mechanism that helps users find the information they want more conveniently. In addition, we observed the rise of blogs, whose content appears to be helpful for news retrieval. In this thesis, we use the rich information contributed by Web users in blogs to enhance news search.
We therefore built a news search system based on a HOT Event Extraction mechanism. The system has four main parts. The first part uses time intervals to focus the search results. The second part finds related terms that serve as seeds for locating hot events. The third part applies the HOT Event Extraction mechanism, which is built from three factors that we define (Novelty, Bursty, and Popularity), to identify hot events. The fourth part is the Application, in which we introduce a Clarity mechanism to perform Sentence Selection and Title Generation; its main purpose is to condense the information in the selected hot events.
The system combines two major information sources, news and blogs, and uses them together with the HOT Event Extraction mechanism so that users can find the relevant information they want. The additional Sentence Selection and Title Generation steps condense the results so that users can see at a glance what has happened. Finally, we use experimental results to validate the effectiveness of the HOT Event Extraction mechanism: it performs better than baseline methods such as TF, TFIDF, and Ordering by Time. We also discuss how differences in user perception affect the results of HOT Event Extraction.
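The three baselines named in the abstract (TF, TFIDF, and Ordering by Time) can be illustrated with a minimal sketch. The abstract does not say how the thesis implemented them, so everything below is an assumption made for illustration: the toy corpus, the function names, the length-normalized term frequency, and the smoothed IDF formula are all hypothetical. The sketch does not implement the HOT Event Extraction mechanism itself, whose Novelty, Bursty, and Popularity factors are not specified in this record.

```python
import math
from collections import Counter
from datetime import date

# Hypothetical toy corpus: (publication date, tokenized text).
docs = [
    (date(2007, 5, 1), ["election", "candidate", "debate"]),
    (date(2007, 5, 2), ["election", "poll", "election"]),
    (date(2007, 5, 3), ["typhoon", "warning", "coast"]),
]


def tf_score(query, tokens):
    """Sum of length-normalized term frequencies of the query terms."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[term] for term in query) / len(tokens)


def idf(term, corpus):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    doc_freq = sum(1 for _, tokens in corpus if term in tokens)
    return math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0


def tfidf_score(query, tokens, corpus):
    """Sum over query terms of normalized TF multiplied by IDF."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[term] / len(tokens) * idf(term, corpus) for term in query)


def rank(corpus, key):
    """Return document indices sorted by descending key value."""
    return sorted(range(len(corpus)), key=key, reverse=True)


query = ["election"]
by_tf = rank(docs, lambda i: tf_score(query, docs[i][1]))
by_tfidf = rank(docs, lambda i: tfidf_score(query, docs[i][1], docs))
by_time = rank(docs, lambda i: docs[i][0])  # newest first

print("TF order:    ", by_tf)
print("TF-IDF order:", by_tfidf)
print("Time order:  ", by_time)
```

On this toy corpus the two term-based baselines rank the document that repeats the query term first, while ordering by time simply returns the newest document first regardless of its content. A mechanism that also weighs novelty, burstiness, and popularity is meant to improve on exactly this kind of ranking.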