| Graduate student: | Hsiao, Hsin-Wei (蕭新維) |
|---|---|
| Thesis title: | An Entropy-Based Hierarchical Search Result Clustering Method by Utilizing Augmented Information (利用增廣資訊的一個資訊熵值為基礎的階層式搜尋結果分群方法) |
| Advisor: | Kao, Hung-Yu (高宏宇) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 |
| Language: | English |
| Pages: | 51 |
| Keywords (Chinese): | 增廣資訊, 資訊熵值, 片段內文, 分群, 搜尋引擎 |
| Keywords (English): | Clustering, Snippet, Entropy, Augmented Information, Search Engine |
Advances in search engine technology and the massive growth in the number of web pages mean that the results returned by search engines are often mixed and disordered. The problem is most visible for queries whose terms can refer to several topics. Techniques for clustering search results by topic have therefore been developed extensively. Among traditional clustering methods, some researchers cluster document sets by the similarity between two or more documents, while others use machine-learning approaches that train on a set of documents to derive clustering rules. However, the structure of web pages is not always the same as that of general documents, so it cannot be assumed that techniques that perform well on general document clustering will perform equally well on web page clustering.
Search engines can return several hundred to several thousand results, each with a page title, a snippet, and a URL. Almost all search result clustering techniques must derive further information from these returned lists. Efficiency is also a crucial issue. Web page clustering cannot analyze the full content of every page, as general document clustering does; applying document clustering methods to web pages directly would likely take a long time to produce the clustered results, and long execution times are unacceptable for a real-time clustering system. More efficient methods are therefore needed.
In this thesis we propose several more efficient methods to address these issues. We improve on a previously proposed method by using augmented information returned by search engines and integrating it with entropy calculation. These methods yield better clustered search results and reduce execution time, and our experiments indicate that they improve the overall quality of the clustering results.