Author: 官咨含
Kuan, Tzu-Han
Thesis title: 基於區塊類型之網頁客製化
Customizing the Layout of Web Pages by Block Type
Advisor: 盧文祥
Lu, Wen-Hsiang
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: English
Number of pages: 46
Keywords (Chinese): 區塊切割, 區塊分類, CSS Selectors
Keywords (English): Block Segmentation, Block Classification, CSS Selectors
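    As the Internet has become people's largest source of information, web page content has grown increasingly rich. A web page usually contains many different types of content, such as navigation, interaction, and information blocks. When a user visits a page never seen before, the user usually has to spend considerable time adapting to and becoming familiar with the structure of the whole page before finding the needed part. Bernard (2002) proposed that users always expect blocks with particular functions (e.g. navigation bars, advertisement areas) to appear at particular positions on a web page. Based on this, building a customized web page template for the user helps improve the browsing experience. To build the layout structure of a web page, we propose a web page block type identification model. The blocks fall into five categories, derived from user needs and our observation of web page content: navigational, informational, transactional, advertising, and social.
    We use many features to identify the type of a block, the most important being the CSS selectors in the block. CSS selectors control how an HTML element is presented and are usually semantically related to the block they belong to. We observed that about half (52%) of CSS selectors have this property. We propose the CSS Selector Model, which uses CSS selectors to identify the category of a web page block. The model uses three methods to extract keywords from CSS selectors: dictionary lookup, probability of n-gram sequence, and branching entropy. We also use other elements in the block to aid classification, such as hyperlink text, the size of web page elements, and the text in the block. Classification is a two-stage method: the CSS Selector Model first predicts the type of the block, and its result is then combined with the other features and classified with an SVM.
    The experimental results show that our CSS Selector Model can effectively extract block-related keywords from CSS selectors, and that these keywords are the most important feature in our classification. The classification results can be used with our proposed method to build the structure of a web page.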

    Today the Web has become the largest information source for people, and web page content has become richer. A web page usually contains various kinds of content, such as navigation, interaction, and information. When a user browses an unfamiliar web page, he usually needs to spend a lot of time adapting to its layout. Bernard (2002) shows that users expect certain functional parts of a web page (e.g. navigational links, advertisement bar) to appear at certain positions on the page. Building a customized template for a category of web pages can therefore improve the user's browsing experience. In order to build the layout of a web page, we propose a model to identify the types of blocks. The types are based on the need behind the user's visit (navigational, informational, and transactional) plus our observation of web pages (advertising and social).
    We use various features to identify the type of a block. The major feature is the CSS selectors extracted from the block. CSS selectors control how an HTML element is displayed and usually have a semantic relation with the block. In our observation, 52% of the CSS selectors in web pages have such a relation. We propose the CSS Selector Model to extract keywords from the CSS selectors. The model uses three methods (dictionary lookup, probability of n-gram sequence, and branching entropy) to segment a sequence of CSS selectors into keywords. We also use features such as hypertext, sizes of HTML elements, and context to help classify the blocks. For block classification, we propose a two-stage method: it first predicts the type of a block with the CSS Selector Model and then feeds that result, together with the other features, into an SVM classifier.
    The experiments show that our CSS Selector Model can extract keywords relevant to the block's type, which are the most important feature in our classification. The results of block classification can be used to build the layout of a web page.
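    The keyword-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the mini-dictionary, the function names (split_selector, segment, keywords, branching_entropy), and the sample corpus are all hypothetical, and the n-gram probability method is omitted. The sketch shows dictionary lookup splitting a concatenated selector name into known words, and branching entropy measuring how unpredictable the next character is after a given prefix, which tends to rise at word boundaries.

```python
import math
import re
from collections import Counter

# Hypothetical mini-dictionary; the thesis's actual dictionary is not given.
DICTIONARY = {"nav", "bar", "menu", "side", "ad", "banner", "cart",
              "login", "comment", "content", "main", "footer", "header"}


def split_selector(selector):
    """Split a CSS selector into lowercase alphabetic tokens.

    Breaks on non-letters (#, ., -, _) and on camelCase boundaries.
    """
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+", selector)]


def segment(token, dictionary=DICTIONARY):
    """Dictionary lookup: cover `token` with the fewest dictionary words.

    Returns [token] unchanged when no full cover exists.
    """
    best = [None] * (len(token) + 1)  # best[i] = best split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in dictionary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


def keywords(selector):
    """Extract candidate keywords from one CSS selector."""
    return [word for token in split_selector(selector) for word in segment(token)]


def branching_entropy(prefix, corpus):
    """Entropy (in bits) of the character following `prefix` in `corpus`."""
    followers = Counter()
    for word in corpus:
        start = word.find(prefix)
        while start != -1:
            nxt = start + len(prefix)
            # "$" marks the end of the string
            followers[word[nxt] if nxt < len(word) else "$"] += 1
            start = word.find(prefix, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


if __name__ == "__main__":
    print(keywords("#mainNav .side-bar"))
    print(keywords("div.navbar"))
    sample = ["navbar", "navmenu", "navlink", "navbar"]  # hypothetical corpus
    print(branching_entropy("na", sample), branching_entropy("nav", sample))
```

    In this toy corpus the entropy after "nav" is higher than after "na", which is the kind of signal a branching-entropy segmenter would use to place a boundary after "nav" when the dictionary has no entry for a token.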

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    1. Introduction
        1.1 Motivation
        1.2 Blocks
        1.3 Contribution
        1.4 Paper Organization
    2. Related Work
        2.1 Page Segmentation
        2.2 Finding Informative Blocks
        2.3 Search Goals
    3. Methods
        3.1 System Architecture
        3.2 Page Segmentation
        3.3 Semantic Labels Extraction
            3.3.1 Observation of CSS Selectors
            3.3.2 CSS Selector Model
        3.4 Block Features Extraction
            3.4.1 HTML tags
            3.4.2 Context
        3.5 Block Classification
    4. Experiments and Results
        4.1 Experiments Setup
        4.2 Model Performance Evaluation
        4.3 Parameters Effects
        4.4 CSS Selector Model Evaluation
        4.5 Discussion
    5. Application
        5.1 Page Layout Builder
        5.2 Default User-Goal Layout
    6. Conclusion
    7. References

    1. Bernard, M.L., Criteria for optimal web design (designing for usability). http://psychology.wichita.edu/optimalweb/position.htm, 2002.
    2. Broder, A., A taxonomy of web search. SIGIR Forum, 2002. 36(2): p. 3-10.
    3. Cho, H.-P., Improving the Display of Search Result Using Search Goal Type. 2008.
    4. Cai, D., et al. VIPS: a Vision-based Page Segmentation Algorithm. 2003.
    5. Liu, B., R. Grossman, and Y. Zhai, Mining data records in Web pages, in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003, ACM: Washington, D.C. p. 601-606.
    6. Chakrabarti, D., R. Kumar, and K. Punera, A graph-theoretic approach to webpage segmentation, in Proceeding of the 17th international conference on World Wide Web. 2008, ACM: Beijing, China. p. 377-386.
    7. Kohlschütter, C. and W. Nejdl, A densitometric approach to web page segmentation, in Proceeding of the 17th ACM conference on Information and knowledge management. 2008, ACM: Napa Valley, California, USA. p. 1173-1182.
    8. Song, R., et al., Learning block importance models for web pages, in Proceedings of the 13th international conference on World Wide Web. 2004, ACM: New York, NY, USA. p. 203-211.
    9. Lin, S.-H. and J.-M. Ho, Discovering informative content blocks from Web documents, in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002, ACM: Edmonton, Alberta, Canada. p. 588-593.
    10. Kao, H.-Y., J.-M. Ho, and M.-S. Chen, DOMISA: DOM-based Information Space Adsorption for Web Information Hierarchy Mining, in Proceedings of the 4th SIAM International Conference on Data Mining (SDM-04). 2004.
    11. Cho, W.-T., Y.-M. Lin, and H.-Y. Kao, Entropy-Based Visual Tree Evaluation on Block Extraction, in Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. 2009, IEEE Computer Society. p. 580-583.
    12. Lee, U., Z. Liu, and J. Cho, Automatic identification of user goals in Web search, in Proceedings of the 14th international conference on World Wide Web. 2005, ACM: Chiba, Japan. p. 391-400.
    13. He, K.-Y., Y.-S. Chang, and W.-H. Lu, Improving Identification of Latent User Goals through Search-Result Snippet Classification, in Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence. 2007, IEEE Computer Society. p. 683-686.
    14. Cai, D., S. Yu, J.-R. Wen, and W.-Y. Ma, VIPS: a Vision-based Page Segmentation Algorithm. 2003.
    15. Kao, H.-Y., et al. Entropy-Based Link Analysis for Mining Web Informative Structures. in CIKM. 2002.
    16. Cascading Style Sheets (CSS). Available from: http://www.w3.org/Style/CSS/.
    17. Chan, Y.-C. Google-Based Two-Stage Text Segmentation and Learning Question Type Identification from Wikipedia for a Multilingual QA System. 2007.
    18. Tanaka-Ishii, K. and H. Nakagawa, A Multilingual Usage Consultation Tool Based on Internet Searching: More than a Search Engine, Less than QA, in Proceedings of the 14th International World Wide Web Conference. 2005.
    19. Jin, Z. and K. Tanaka-Ishii. Unsupervised Segmentation of Chinese Text by Use of Branching Entropy. in Proceedings of the COLING/ACL Main Conference Poster Sessions. 2006.
    20. Kohlschütter, C. and W. Nejdl, A Densitometric Approach to Web Page Segmentation, in CIKM. 2008, Napa Valley.
    21. Cosine similarity. Available from: http://en.wikipedia.org/wiki/Cosine_similarity.
    22. Cortes, C. and V. Vapnik. Support-Vector Networks. in Machine Learning. 1995.
    23. Chang, C.-C. and C.-J. Lin, LIBSVM: a library for support vector machines. 2001.
    24. Precision and recall. Available from: http://en.wikipedia.org/wiki/Precision_%28information_retrieval%29.
