
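Graduate student: Tseng, Yi-Feng (曾一峰)
Thesis title: The Mining and Extraction of Primary Informative Blocks and Data Objects from Systematic Web Pages (在系統化網頁中的主要資訊區塊與資料物件之探勘與擷取)
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 58
Keywords (from Chinese): Information Extraction, Document Object Model
Keywords (English): DOM, Block Importance, Information Extraction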
  • With the rapid development of the Internet, the Web has become an enormous database containing abundant information. Many Web pages present their content as a list of objects, such as search engine results or product listings on shopping sites, and these objects form the primary information of each page. This thesis focuses on mining that primary information and the groups of objects that constitute it. Our approach has three major phases: (1) By transforming each Web page into its corresponding tree structure, the system can visit all regions of the page efficiently and detect the informative parts. (2) We design and quantify several novel features based on the characteristics of the regions of a Web page; these features help us further identify the primary information region. (3) We propose a weighting model that calculates the importance of each region, and we then extract the regions that form the primary information of the page. An object identification component then yields the list of objects that make up each informative region. Experimental results show that the system can be applied to a large number of Web pages with different themes and styles, finding the correct primary information and the corresponding list of objects.
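    The three phases in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the record does not give the feature definitions or the weighting model, so the two features (`regularity`, `density`), the weights, the `min_children` threshold, and all names (`TreeBuilder`, `rank_blocks`, `SAMPLE`) are assumptions chosen only to show the general shape of the idea.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this node


class TreeBuilder(HTMLParser):
    """Phase 1 (assumed form): turn HTML into a simple DOM-like tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the matching open element; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text_len += len(data.strip())


def signature(node):
    """Pre-order tag string of a subtree, e.g. 'li a p'."""
    return " ".join([node.tag] + [signature(c) for c in node.children])


def total_text(node):
    return node.text_len + sum(total_text(c) for c in node.children)


def features(node, page_text):
    """Phase 2 (hypothetical features): structural regularity and text density."""
    sigs = [signature(c) for c in node.children]
    if not sigs:
        return 0.0, 0.0
    regularity = max(sigs.count(s) for s in set(sigs)) / len(sigs)
    density = total_text(node) / page_text if page_text else 0.0
    return regularity, density


def rank_blocks(html, w_regularity=0.6, w_density=0.4, min_children=3):
    """Phase 3 (assumed weighting): score every block, highest score first."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    page_text = total_text(builder.root)
    scored, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if len(node.children) >= min_children:
            reg, den = features(node, page_text)
            scored.append((w_regularity * reg + w_density * den, node))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


SAMPLE = """<html><body>
<div><a href="/">Home</a></div>
<ul>
<li><a href="/1">First result</a><p>Snippet for the first result</p></li>
<li><a href="/2">Second result</a><p>Snippet for the second result</p></li>
<li><a href="/3">Third result</a><p>Snippet for the third result</p></li>
</ul>
<div><span>Copyright</span></div>
</body></html>"""

best_score, best_block = rank_blocks(SAMPLE)[0]
objects = best_block.children  # the repeated items inside the top-ranked block
```

    On this sample the list of results outranks the page body, because its children share one tag structure and hold most of the page text; the children of the top-ranked block then stand in for the extracted objects.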

    Contents
    Chinese Abstract
    Abstract
    Figure Listing
    Table Listing
    1. Introduction
        1.1 Problem Overview
        1.2 Motivation
        1.3 Our Approach
    2. Related Work
        2.1 Record-Boundary Discovery in Web Documents
        2.2 RoadRunner
        2.3 IEPAD
        2.4 DeLa
        2.5 EXALG
        2.6 MDR
        2.7 ViNTs
    3. Intelligent Knowledge Mining System
        3.1 System Architecture
            3.1.1 Building a DOM Tree and Detecting the Candidate IBs
            3.1.2 Features Quantifying and Objects Mining
            3.1.3 Blocks Importance Calculating and Ranking
        3.2 Candidate IBs Detection
            3.2.1 Document Object Model
            3.2.2 An Efficient Detection of Candidate IB
            3.2.3 Edit Distance
        3.3 Block Features Designing
            3.3.1 Measuring the Similarity of a Set of RP
            3.3.2 Spatial Feature, the Calculation of Density
            3.3.3 The Distribution of Items Style of Objects
        3.4 Blocks Importance and IBs Mining
            3.4.1 The Block Importance Model
            3.4.2 Identifying Correct Data Objects
    4. Experiments and Result
        4.1 Edit Distance Threshold
        4.2 The Impact on Storage Requirements
            4.2.1 Document String of Whole Pages and IBs
            4.2.2 The Number of Extracted Objects in Different Mechanisms
        4.3 Overall Performance
            4.3.1 Search Engine Results, Shopping Sites and Other Topics of Systematic Pages
            4.3.2 Multiple Data Objects
            4.3.3 Multiple IBs
            4.3.4 Overall Comparison
        4.4 Features Selection
    5. Conclusion and Future Work
    References

    Figure Listing
    Figure 1-1: An example of a systematic Web page in Buy.com
    Figure 3-1: System architecture of the Intelligent Knowledge Mining system
    Figure 3-2: An HTML code and corresponding DOM tree
    Figure 3-3: An example of traverse and comparison
    Figure 3-4: An example of regular patterns group
    Figure 3-5: Example of the items style of RPs. (a) The original Web page. (b) The corresponding DOM tree of the first RP of (a).
    Figure 3-6: An example of RP containing multiple objects
    Figure 3-7: Algorithm of identifying objects
    Figure 3-8: Algorithm of finding next candidate IB
    Figure 4-1: The distribution of number of objects
    Figure 4-2: The distribution of ratio of data objects
    Figure 4-3: Total storage requirements for the original page and IBs by IKM. (a) Result of each page. (b) Result of the number of pages.
    Figure 4-4: Total extracted objects by three mechanisms. (a) The result of each page. (b) The result of the number of pages.
    Figure 4-5: Performance of IBs from IKM and MDR. (a) Recall. (b) Precision.
    Figure 4-6: Performance of data objects from IKM and MDR. (a) Recall. (b) Precision.
    Figure 4-7: Performance of extracted IBs of single feature selection. (a) Recall. (b) Precision.
    Figure 4-8: Performance of extracted data objects of single feature selection. (a) Recall. (b) Precision.
    Figure 4-9: Performance of extracted IBs of two features selection. (a) Recall. (b) Precision.
    Figure 4-10: Performance of extracted data objects of two features selection. (a) Recall. (b) Precision.

    Table Listing
    Table 4-1: Performance of search engine result
    Table 4-2: Performance of shopping sites
    Table 4-3: Performance of other topics
    Table 4-4: RPs containing multiple objects
    Table 4-5: Multiple IBs
    Table 4-6: Overall performance of comparison of IBs
    Table 4-7: Overall performance of comparison of data objects

    REFERENCES
    [1] Document Object Model (DOM), W3C Recommendation. http://www.w3.org/DOM.
    [2] HTML fixing tool developed by Dave Raggett of the W3C team. http://www.w3.org/People/Raggett/tidy/.
    [3] Arasu, A. and Garcia-Molina, H. "Extracting Structured Data from Web Pages." SIGMOD'03, 2003.
    [4] Crescenzi, V., Mecca, G. and Merialdo, P. "ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites." VLDB-2001, 2001.
    [5] Chang, C. and Lui, S.-L. "IEPAD: Information Extraction Based on Pattern Discovery." WWW-2001, 2001.
    [6] Cohen, W., Hurst, M. and Jensen, L. "A Flexible Learning System for Wrapping Tables and Lists in HTML Documents." WWW-2002, 2002.
    [7] Embley, D., Jiang, Y. and Ng, Y. "Record-Boundary Discovery in Web Documents." SIGMOD-1999, 1999.
    [8] Gupta, S., Kaiser, G., Neistadt, D. and Grimm, P. "DOM-based Content Extraction of HTML Documents." WWW-2003, 2003.
    [9] Gusfield, D. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
    [10] Kao, H.-Y., Ho, J.-M. and Chen, M.-S. "WISDOM: Web Intra-page Informative Structure Mining Based on Document Object Model." IEEE Trans. on Knowledge and Data Engineering, 2005.
    [11] Lin, S.-H. and Ho, J.-M. "Discovering Informative Content Blocks from Web Documents." SIGKDD'02, 2002.
    [12] Liu, B., Grossman, R. and Zhai, Y. "Mining Data Records in Web Pages." KDD-2003, 2003.
    [13] Muslea, I., Minton, S. and Knoblock, C. "A Hierarchical Approach to Wrapper Induction." Agents-1999, 1999.
    [14] Ramaswamy, L., Iyengar, A., Liu, L. and Douglis, F. "Automatic Detection of Fragments in Dynamically Generated Web Pages." WWW-2004, 2004.
    [15] Song, R., Liu, H., Wen, J.-R. and Ma, W.-Y. "Learning Block Importance Models for Web Pages." WWW-2004, 2004.
    [16] Wang, J. and Lochovsky, F.H. "Data Extraction and Label Assignment for Web Databases." WWW-2003, 2003.
    [17] Zhai, Y. and Liu, B. "Web Data Extraction Based on Partial Tree Alignment." WWW-2005, 2005.
    [18] Zhao, H., Meng, W., Wu, Z., Raghavan, V. and Yu, C. "Fully Automatic Wrapper Generation for Search Engines." WWW-2005, 2005.

    Full-text download, on campus: available immediately
    Full-text download, off campus: available from 2006-08-23