Graduate student: | 林宜嫻 Lin, Yi-Xian |
---|---|
Thesis title: | 自建構式網頁分群法於資訊檢索之應用 Adaptive Page Clustering for Information Retrieval |
Advisor: | 高宏宇 Kao, Hung-Yu |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2013 |
Academic year of graduation: | 101 |
Language: | English |
Number of pages: | 50 |
Keywords (Chinese): | 網頁分群, 相關係數, 資訊檢索, 主題關鍵字 |
Keywords (English): | Page Clustering, Correlation Coefficient, Information Retrieval, Topic Feature |
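With the arrival of the era of information explosion, the Internet has become the most convenient channel for obtaining information. Search engines are the clearest example: after entering keywords, a user can retrieve a large amount of related information. However, most results returned by a search engine are obtained through keyword matching, and, to increase the amount of data returned, they are usually not filtered or screened. The resulting excess of data raises the complexity of the results and makes it harder for users to find the data that meets their needs. In addition, each search result is independent of the others, so there is no way to tell which results are homogeneous and which are completely unrelated. If web pages are systematically organized before the search, divided into categories or clusters represented in several forms, and users then choose the category or cluster that matches their information need, the complexity of the search data can be reduced and users can be guided to information that is genuinely helpful.
This study proposes a self-constructing web page clustering method. It extracts web page features to reduce feature dimensionality, automatically filters pages into their appropriate clusters, strengthens the topic features of the pages, and assigns different coefficients to different page features to improve the representation, giving users a more accurate model of the search data. In addition, the method does not require the number of clusters to be specified in advance, and for a new page added to the data set, only its similarity to all existing page clusters needs to be computed. When a user issues a query, the retrieved pages are grouped, which makes it easier for users to find the information they are actually looking for and also gives them more choices, which is equivalent to increasing the amount of search results. Experimental results show that, compared with traditional TF-IDF, our method finds the needed web pages more effectively, and the pages within the corresponding cluster have highly similar topics.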
With the arrival of the era of information explosion, the Internet has become the most convenient channel for obtaining information. Search engines are the best example: we can obtain a lot of relevant information after entering a keyword. However, the information found through search engines is mostly based on keyword matching, and search engines generally do not filter or screen the results, in order to increase the amount returned. In addition, because each search result is independent, there is no way of knowing which results are homogeneous and which are completely irrelevant. If web pages are systematically organized into multiple categories or clusters and this clustering result is displayed to users, the users will be guided to information that is genuinely helpful.
We propose an adaptive web page clustering algorithm. It extracts features to reduce the feature dimensionality, automatically filters and screens web pages into their appropriate clusters, and enhances the topic features of the pages by assigning different coefficients to different features, which improves the representation and provides users with a more accurate model of the search data. Moreover, adaptive page clustering does not need the number of clusters to be specified, and for a new page it only needs to calculate the similarity with the existing clusters. When users conduct a keyword search, the retrieved pages are grouped, which allows users to obtain the genuinely helpful information more easily and also gives them more choices, which is equivalent to increasing the amount of search results. The experimental results show that, compared with traditional TF-IDF, the proposed approach finds the needed web pages more effectively, and the topics of the web pages in the corresponding cluster are highly similar.
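The abstract gives only the outline of the method, so the following is a minimal sketch of the general idea rather than the thesis's algorithm: pages are assigned one at a time to the most similar existing cluster, and a new cluster is opened when no cluster is similar enough, so the number of clusters is never fixed in advance. The TF-IDF weighting, the cosine similarity measure, the `threshold` parameter, and all function names are assumptions made for illustration; the thesis itself may use a correlation-based measure and a different feature weighting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        # Smoothed IDF (assumption) so that no term gets a zero weight.
        vectors.append(
            {t: (c / length) * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_cluster(vectors, threshold=0.3):
    """Assign each vector to its most similar cluster, or open a new one.

    Each cluster keeps the sum of its member vectors; cosine similarity is
    scale-invariant, so the sum behaves like a centroid here.
    Returns a list of clusters, each a list of document indices.
    """
    sums = []      # one summed vector per cluster
    members = []   # one list of document indices per cluster
    for idx, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cid, centroid in enumerate(sums):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = cid, sim
        if best is not None and best_sim >= threshold:
            members[best].append(idx)
            for t, w in vec.items():
                sums[best][t] = sums[best].get(t, 0.0) + w
        else:
            # No existing cluster is similar enough: start a new one.
            sums.append(dict(vec))
            members.append([idx])
    return members


if __name__ == "__main__":
    pages = [
        "python code programming language".split(),
        "python programming code tutorial".split(),
        "soccer football match goal".split(),
        "football goal soccer league".split(),
    ]
    print(incremental_cluster(tfidf_vectors(pages)))  # [[0, 1], [2, 3]]
```

In this sketch a newly added page is compared only against the existing cluster summaries, which mirrors the property stated in the abstract. The result depends on the order in which pages arrive and on the chosen threshold, both of which are simplifications of this illustration.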