簡易檢索 / 詳目顯示

研究生: 劉家升
Liu, Chia-Sheng
論文名稱: DRank+: 一個加速網頁聲望收斂的目錄特徵網頁聲望預測方法
DRank+: A Directory based PageRank Prediction Method for Fast PageRank Convergence
指導教授: 高宏宇
Kao, Hung-Yu
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 資訊工程學系
Department of Computer Science and Information Engineering
論文出版年: 2007
畢業學年度: 95
語文別: 英文
論文頁數: 41
中文關鍵詞: 網路圖形Page Quality階層結構PageRank鏈結分析
外文關鍵詞: Page Quality, PageRank, Link Analysis, Hierarchical Structure, Web Graph
相關次數: 點閱:59下載:7
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 隨著搜尋引擎的重要性逐漸增加,網際網路使用者逐漸的改善他們瀏覽網路的習慣。近年來,多數的知名搜尋引擎都使用一種稱為鏈結分析的演算法來衡量一個網頁的重要性。此種演算法採用網頁之間的鏈結關係以及傳統平面圖表示法來計算網頁之間相對的重要程度,在這眾多的鏈結分析演算法中又以Google 所採用的PageRank 演算法最為著名。然而,近幾年幾位學者的研究成果發現PageRank 演算法潛在性地會對新產生的網頁有不公平的對待。針對這樣的問題,幾位學者提出了Page Quality 演算法來解決,此演算法利用了前後兩個時間點網頁的PageRank變化才預測下個時間點該網頁可能的重要程度。去年,一個新的演算法稱為DRank 被提出。此演算法藉由網頁的網址與生俱來的階層結構特徵來減緩新網頁遭受搜尋引擎的不公平待遇的現象。此法乃是針對網頁所位在的目錄其他網頁的PageRank群聚現象,來預測下個時間點網頁的重要程度。在本篇論文中,我們將原始的DRank 演算法做修正,並且提出一個新的方法稱為MDRank 來補足DRank 在目錄中網頁數目過少時可能產生失效而無法預測的情況,而DRank+就是結合了DRank 與MDRank 的一個綜合方法。從實驗結果當中我們發現,修正過後的DRank比起原始的DRank在預測網頁下個時間點的重要性分數上精確許多,而MDRank能夠在目錄中網頁個數較少的情況下更精確的預測到網頁未來的重要性分數。這也說明了我們的DRank+演算法的確扮演了一個成功的預測腳色。不但有效的減緩了新網頁所受到的不公平待遇,在預測新網頁的重要程度上也比原始DRank以及Page Quality演算法來的準確許多。

    As the increasing of importance in search engines, Internet users change their behavior browsing the Internet
    little by little. In recent years, most part of search engines use link analysis algorithms to measure the importance of web pages. They employ the conventional flat web graph constructed by web pages and link relation of web
    pages to measure the relative importance of web pages. The most famous link analysis algorithm is PageRank algorithm. However, previous researches in recent years have found that there exists an inherent bias against newly created pages in PageRank. For this issue, some researchers have proposed a new ranking algorithm called Page Quality to solve it. Page Quality utilizes the difference of PageRank at continuous time stages to predict a reasonable importance score of pages at next time stage. We also have proposed a new ranking algorithm called DRank to solve the same issue last year. It utilizes the intrinsic characteristic of hierarchical structure embedded in URL and the cluster phenomenon of PageRank in a directory to predict the possible importance of pages in the future and to diminish the inherent bias of search engines to new pages. In this paper, we modify the original DRank algorithm and propose a new ranking algorithm called MDRank to complement the weaker part of DRank which could fail while the number of pages in directory is not enough. The integrated algorithm is called DRank+, which combines DRank and MDRank. In our experiments, the modified DRank algorithm obtains more accuracy in predicting the importance score of pages at next time stage than the original DRank algorithm. Furthermore, MDRank can also obtain more accuracy in predicting the future importance score of pages while the number of pages in directories is few. It also interprets that DRank+ not only alleviates the bias of newly created pages successfully but also reaches more accuracy than Page Quality and original DRank in predicting the importance of newly created pages.

    中文摘要................ IV ABSTRACT......................V CONTENT...................... VI TABLE LISTING............X 1. INTRODUCTION........................1 2. RELATED WORK....................4 2.1 PREVIOUS WORK ON LINK ANALYSIS............4 2.1.1 PageRank Algorithm..............4 2.1.2 Weighted PageRank Algorithm...........6 2.2 PAGE POPULARITY ................7 2.2.1 Page Quality .....................8 3. WEB EVOLUTION AND HIERARCHICAL MODEL OF WEB GRAPH..........9 3.1 VARIATION OF WEB GRAPH ............9 3.2 REPUTATION OF NEW PAGES.............10 3.3 HIERARCHICAL AND MULTI-LAYER WEB GRAPH ..............11 3.3.1 Previous Work on Web Graph Structure.........11 3.3.2 Aggregation Graph...................11 3.3.3 Multi-Layer Hierarchical Graph .........13 3.3.4 The Page Level Graph and PageRank Calculation ..............13 4. OUR PROPOSED METHOD..............15 4.1 RELATIONS BETWEEN DIRECTORY AND PAGE ..............15 4.1.1 Ranking Distribution of Pages in a Directory (PageRank Distribution) ..........16 4.1.2 Directory Feature Based Rank (DRank0).............17 4.2 EXTENDED DRANK ALGORITHM..............20 4.2.1 Weakness of DRank Algorithm............21 4.2.2 Modified Directory Feature based Rank (DRank) ...........23 4.2.3 Multiple Directory Feature based Rank (MDRank)..........24 5. EXPERIMENT...........26 5.1 SYSTEM FRAMEWORK.........26 5.2 DATASET DESCRIPTION.............27 5.3 EVALUATION METRICS...............27 5.4 EXPERIMENT RESULTS ....................27 5.4.1 Pre-analysis of Dataset.............27 5.4.2 Prediction Accuracy..............28 5.4.3 Parameter Tuning .................35 5.5 A CASE STUDY..............37 6. CONCLUSION AND FUTURE WORK.............39 7. REFERENCES...........................40

    [1] http://www.pewinternet.org/pdfs/PIP_Data_Memo
    _Searchengines.pdf.
    [2] http://www.iprospect.com/.
    [3] Google Directory: http://directory.google.com/.
    [4] Search Engine Watch: http://searchenginewatch.com/.
    [5] S. Abiteboul, M. Preda, and G. Cobna. Adaptive on-line page importance computation. In Proceedings of the
    International World-Wide Web Conference, May 2003.
    [6] K. Bharat, B.-W. Chang, M. Henzinger, M. Ruhl. Who Links to Whom: Mining Linkage between Web Sites. In
    Proceedings of the International Conference on Data Mining , November 2001.
    [7] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of WWW
    Conference, April 1998.
    [8] J. Cho, S. Roy and R. E. Adams. Page Quality: In Search of an Unbiased Web Ranking. In Proceedings of the
    SIGMOD Conference, June 2005.
    [9] J. Cho and S. Roy. Impact of Search Engines on Page Popularity. In Proceedings of the International World-Wide
    Web Conference, May 2004.
    [10] D. Cai, X. F. He, J. R. Wen and W. Y. Ma. Block-level Link Analysis. In Proceedings of the 27th Annual
    International ACM SIGIR Conference on Research and Development in Information Retrieval , July 2004.
    [11] N. Eiron and K. S. McCurley. Locality, Hierarchy, and Bidirectionality on the Web. In Workshop on Web Algorithms
    and Models, May 2003.
    [12] D. Fetterly, M. Manasse, M. Najork and J. Wiener. A Large-Scale of the Evolution of Web Pages. In Proceedings of
    the International World-Wide Web Conference, May 2003.
    [13] T. H. Haveliwala. Topic-sensitive pagerank. In Proceedings of the International World-Wide Web Conference , May
    2002.
    [14] X. M. Jiang, G. R. Xue, H. J. Zeng, Z. Chen, W.-G. Song and W.-Y. Ma. Exploiting PageRank Analysis at Different
    Block Level. In Proceedings of Conference of WISE 2004.
    [15] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Extrapolation methods for accelerating pagerank computations.
    In Proceedings of the International World-Wide Web Conference , May 2003.
    [16] R. Kumar, P. Raghavan, S. Rajagopalan and D. Sivakumar. Stochastic models for the Web graph. In Proceedings of
    the 41st Annual Symposium on Foundations of Computer Science , November 2000.
    [17] S.-F. Lin and H.-Y. Kao. Toward the Impact Analysis of the Ranking and Relationship between the Hierarchy and
    Clustering in Web Sites on the Page Popularity. A Master Thesis in Department of Computer Science and
    Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. , July 2006.
    41
    [18] L. Laura, S. Leonardi, G. Caldarelli and P. D. L. Rios. A multi-layer model for the web graph. In 2nd International
    Workshop on Web Dynamics, March, 2002.
    [19] T. Meng, H. Yan, J. Wang and X. Li. The Evolution of Link-Attributes for Pages and Its Implications on Web
    Crawling. In Proceedings of the Web Intelligence Conference, September 2004.
    [20] A. Ntoulas, J. Cho and C. Olston. What’s New on the Web? The Evolution of the Web from a Search Engine
    Perspective. In Proceedings of the International World-Wide Web Conference , May 2004.
    [21] L. Page, S. Brin, R. Motwani, and T. Winogrand. The PageRank Citation Ranking: Bringing Order to the Web.
    Technical report, Stanford University, Stanford, CA, 1998.
    [22] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983
    [23] G. Salton, A. Wong, and C. Yang. A vector space model for information retrieval. Journal for the American Society
    for Information Retrieval, 1975.
    [24] W. Wang, J. Yang, R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Ming. In Proceedings
    of the 23rd VLDB Conference, August 1997.
    [25] W. Xing and A. Ghorbani. Weighted PageRank Algorithm. In Proceedings of the 2nd Annual Conference on
    Communication Networks and Services Research. 305-314, 2004.
    [26] G.-R Xue, Q. Yang, H.-J. Zeng, Y. Yu, Z. Chen. Exploiting the Hierarchical Structure for Link Analysis. In
    Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information
    Retrieval, August 2005.
    [27] R. B.-Yates, C. Castillo and F. S.-Jean. Web Dynamics, Structure, and Page Quality. In Proceedings of SPIRE
    Conference, September 2002.

    下載圖示 校內:2008-08-20公開
    校外:2008-08-20公開
    QR CODE