研究生: |
林聖峰 Sheng-Feng, Lin |
---|---|
論文名稱: |
網站階層與群組之關係與排名在網頁聲望影響之研究 Toward the Impact Analysis of the Ranking and Relationship between the Hierarchy and Clustering in Web Sites on the Page Popularity |
指導教授: |
高宏宇
Kao, Hung-Yu |
學位類別: |
碩士 Master |
系所名稱: |
電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
論文出版年: | 2006 |
畢業學年度: | 94 |
語文別: | 英文 |
論文頁數: | 47 |
中文關鍵詞: | 階層結構 、鏈結分析 |
外文關鍵詞: | Hierarchical Structure, Web Graph, PageRank, Page Quality, Link Analysis |
相關次數: | 點閱:121 下載:1 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
近年來,搜尋引擎在全球資訊網的應用中扮演了極為重要的角色,並且多數的知名搜尋引擎都使用一種 稱為鏈結分析的演算法來計算一個網頁的重要程度。這種演算法利用網頁的鏈結關係以及傳統的平面圖表示法去計算出網頁間相對的重要程度,並且在這些鏈結分析演算法中以Google所提出的PageRank演算法最為知名。而近年來有許多研究者發現類似於PageRank的演算法天生地就會對新產生的網頁有不公平的對待。為了解決這個問題,一個新的演算法Page Quality於2005年被提出,這個演算法是一種可以預測網頁未來重要程度的演算法,它利用兩次時間點間網頁PageRank值的差異比率去推算出網頁未來可能的合理重要程度。在本篇論文中,我們提出了一個新的演算法稱為DRank,這個演算法同樣的可以去減緩上面所提的問題。在我們的方法中,我們利用URL與生俱來的階層結構以及網頁的鏈結關係將傳統的Web Graph改建成為一種三個階層的Web Graph,它包含了host level,directory level以及page level。並且有些學者已經提出在全球資訊網中鏈結關係會有一種聚集的現象,而在我們的資料中我們也驗證這個現象,並且這個現象在host level中極為顯著。有了上述的觀察,我們將三階層的Web Graph中的鏈結關係分成多種類別,並且分派不同的權重給不同類別的鏈結。有了三個階層的Web Graph以及鏈結的權重關係我們可以分別計算出host,directory以及page的重要值。從實驗數據的發現,我們可以知道若是我們可以分配給產生在重要程度高的directories內的新網頁合理的重要值,那我們就可以對於大多數優秀的新網頁去減緩前面所提到不公平現象。我們也利用這些發現去做一種預測的動作,我們可以利用重要程度分佈的狀態去決定一個新網頁未來所可能擁有重要程度值。我們所提出的DRank演算法結合了Page Quality以及上述的兩種發現,並且實驗結果證明我們的DRank演算法可以扮演一個成功的預測角色,在效能上我們可以超過Page Quality。
In recent years, search engines have already played the key roles among Web applications, and link analysis algorithms are the major methods to measure the important values of Web pages. They employ the conventional flat Web graph built by Web pages and link relations of Web pages to obtain the relative importance of Web objects. Previous researches have observed that PageRank-like link analysis algorithms have a bias against newly created Web pages. A new ranking algorithm called Page Quality was proposed to save this issue. Page Quality anticipates future ranking values by the difference rate between current ranking values and previous ranking values. In this paper, we propose a new algorithm called DRank to diminish the bias of PageRank-like link analysis, and attain the better performance of Page Quality. In this algorithm, we model Web graph as a three-layer graph which includes Host Graph, Directory Graph and Page Graph by using the hierarchical structure of URLs and the structure of link relation of Web pages. At first, we discuss the aggregated phenomena of link relations within host level and directory level and according to what we observe we assign different weight to different types of links. We then calculate the importance of hosts, Directories and Pages by weighted graph we built. We find two phenomena: One is that hosts or directories that have higher rank value contain the majority of important pages and we observe that directory level is a better block level to prove new pages created within an important blocks have the higher probability to be important pages. The other is that there are ladder-graphs within directories while we sort ranking values within directories in the decreasing order. By combining Page Quality algorithm and the two phenomena we state above, we can predicate the more accurate values of page importance to diminish the bias of newly created pages. Experiment results on our data shows that DRank algorithm works well on anticipating future ranking values of pages, and the performance of DRank is better than Page Quality.
[1]R. A. Baeza-Yates, F. Saint-Jean, C. Castillo. Web Dynamics, Structure, and Page Quality. In the Proceedings of SPIRE, 2002
[2]K. Bharat, B. W. Chang, M. R. Henzinger, and M. Ruhl. Who Links to Whom: Mining Linkage between Web Sites. In the Proceedings of 1st International Conference on Data Mining (ICDM) p51-58, 2001.
[3]S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In the Proceedings of WWW Conference, April 1998.
[4]D. Cai, X. F. He, J. R. Wen and W.Y. Ma. Block-level Link Analysis. In the Proceedings of the 27th Annual International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR'2004), July 2004.
[5]J. Cho and S. Roy. Impact of search engines on page popularity. In the Proceedings of the International World-Wide Web Conference, May 2004.
[6]J. Cho, S. Roy, R. E. Adams "Page Quality: In Search of an Unbiased Web Ranking." In the Proceedings of 2005 ACM International Conference on Management of Data (SIGMOD), June 2005.
[7]N. Eiron and K. S. McCurley. Locality, Hierarchy, and Bidirectionality on the Web. In the Workshop on Web Algorithms and Models, 2003.
[8]Google Directory: The web organized by topic into categories. http://directory.google.com/
[9]T. H. Haveliwala. Topic-Sensitive PageRank. In the Proceeding of the Intetnational World-Wide Web Conference, May 2002.
[10]X. M. Jiang, G. R. Xue, H. J. Zeng, Z. Chen, W.-G. Song, and W.-Y. Ma. Exploiting PageRank Analysis at Different Block Level. In the Proceedings of Conference of WISE 2004.
[11]S. D. Kamvar, T. H. Haveliwala C. D. Manning and G. H. Golub. Exploiting the Block Structure of the Web for Computing PageRank. In the Proceedings of 12th International World Wide Web Conference, 2003.
[12]R. Kumar, P. Raghavan, S. Rajagopalan, and D. Sivakumar. Stochastic models for the Web graph. In the Proceedings of the 41st IEEE Symposium on Foundations of Comp.Sci., pages 57–65, 2000.
[13]L. Laura, S. Leonardi, G. Caldarelli, and P. D. L. Rios. A multi-layer model for the web graph. In 2nd International Workshop on Web Dynamics, Honolulu, 2002. Also presented in 33rd Annual Conference of the Operational Research Society of Italy, September, 2002.
[14]T. Meng, H. Yan, J. Wang, X. Li . The Evolution of Link-Attributes for Pages and Its Implications on Web Crawling. . In the Proceedings of the Web Intelligence Conference, 2004.
[15]Nielsen NetRatings :Search Engine Ratings http://searchenginewatch.com/reports/article.php/2156451
[16]L. Page, S. Brin, R. Motwani, and T. Winogrand, The PageRank Citation Ranking: Bringing Order to the Web. Technical report , Stanford University, Stanford, CA, 1998.
[17]D. M. Pennock, G. W. Flake, S. Lawrence, E. J. Glover, and C. L. Giles. Winners don’t take all: Characterizing the competition for links on the web. PNAS, pages 5207–5211, 2002.
[18]S. Raghavan and H. Garcia-Molina. Representing web graphs. In Proceedings of the IEEE Intl. Conference on Data Engineering, March 2003.
[19]Search Engine Users: Internet searchers are confident, satisfied and trusting – but they are also unaware and naïve. Embargoed for publication until 4pm, January 23, 2005 , http://www.pewinternet.org/
[20]Search Engine Watch: http://www.searchenginewatch.com/
[21]W. Xing and A Ghorbani. Weighted PageRank Algorithm. In: Proceedings Of the 2nd Annual Conference on Communication Networks and Services Research. 305-314, 2004.
[22]GR. Xue, Q. Yang, HJ. Zeng, Y. Yu, Z. Chen. Exploiting the Hierarchical Structure for Link Analysis. In the proceedings of the 28th annual international ACM SIGIR, 2005.