簡易檢索 / 詳目顯示

研究生: 余宗達
Yu, Chung-da
論文名稱: 在系統化網頁中一種基於文件物件模型中子樹成對比較之有效率的資訊子樹探勘方法
An Efficient Mining Algorithm of Informative Subtrees Based on Pairwise Comparison on Document Object Model in Systematic Web Pages
指導教授: 高宏宇
Kao, Hung-Yu
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 資訊工程學系
Department of Computer Science and Information Engineering
論文出版年: 2007
畢業學年度: 95
語文別: 英文
論文頁數: 42
中文關鍵詞: 網頁探勘網頁資訊擷取文件物件模型樹
外文關鍵詞: DOM tree, web data records, web information extraction, Web mining
相關次數: 點閱:118下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 隨著網際網路的發達,很常見藉由伺服器端程式語言所產生的連續網頁標籤的結構化網頁,也稱作系統化網頁,如拍賣網站的清單,搜尋引擎的結果等等。包含豐富的資訊和大量的資料,因此學者們專注於從網頁的內容中探勘出附有資訊的資料。由於網頁只有提供HTML 未經處理的文字資料. Liu, Bing 提出一種探勘演算法MDR ,在系統化的網頁中,利用計算網頁當中子樹的相似度的方式,找到相似的資料在連續的文件物件模型樹。此種方法能夠有效的找出系統化網頁當中連續的網頁結構樹。來當作網頁中的資料區域。
    然而,MDR的效能不佳,需要計算一個網頁當中所有連續的子樹的相似度。花費大量的時間計算編輯距離來獲得相似度,因此,本篇論文提出了一種有效率的演算法來改進之前的方法。根續系統化網頁的特性,設計出從網頁結構樹當中,成對比較在同一個父親節點之下的子樹,減少編輯距離的計算量,並且獲得與之前方法相同的結果。另外,我們也提出了一個快速找尋前k個重要區塊的演算碼,利用前面我們互相比較區塊的概念,我們利用塞選的方式將不重要的小區塊濾除,實驗也證實我們的方法在前k個重要區塊的擷取上能和MDR結果接近但在效能上卻提高許多。

    With the internet growing larger and larger, the structured Web pages which consist of contiguous HTML tags generated by server side programming language are often seen. These pages are called systematic pages. For example, the listing page of the auction web site and search result pages from search engines. These pages contain rich information and large amount of data. Thus, researchers put their attention on mining informative data records from the web content. Due to Web pages only containing semi-structured HTML format data, Liu, Bing has proposed an algorithm, called MDR to extract the informative data records. By computing the similarity of sub-trees in the page, this method finds similar data records in a continuous DOM tree. This method can find continuous structured trees effectively in a systematic web page. These extracted sub-trees are considered as data regions in a page.
    However, MDR in real Web pages does not perform well. It needs to compute the similarities of all of continuous sub-trees. It spends large of time to compute the edit distance to obtain corresponding similarity values. Thus, this paper proposes an efficient mining algorithm to improve MDR by reducing the redundant computation. According to the property of systematic web pages, the proposed method compares the sub-trees under the same parent node pairwisely. The method reduces the number of computation of edit distance. It obtains the same result with the previous method. We also propose a Top-k extraction algorithm to mine the top-k informative blocks. According to our experimental results, the algorithm can attain the high precision rate among the selected k blocks and greatly reduce the computation cost of MDR.

    CONTENT 中文摘要 IV ABSTRACT V CONTENT VII 1. INTRODUCTION 1 2. ISSUES AND RELATED WORKS 5 2.1 MDR 5 2.1.1 The recomputation cases of MDR 7 2.2 DEPTA 8 2.3 NET 8 2.4 IKM 8 2.5 RELATED OPEN SOURCE TECHNIQUE AND W3C STANDARD 9 3. THE PROPOSE TECHNIQUE 11 3.1 EDIT DISTANCE 11 3.2 MOTIVATION 11 3.3 THE SECOND OBSERVATION OF MDR 11 3.4 MIPS: MINING INFORMATIVE DATA REGIONS BY PAIRWISE SUBTREE COMPARISON 12 3.5 THE SELECTED TOP-K GENERALIZED NODES ALGORITHM 17 3.6 TIME COMPLEXITY ANALYSIS 19 3.6.1 Time Complexity of MDR 19 3.6.2 Time Complexity of MIPS 20 3.6.3 Time Complexity analysis of two algorithms 21 4. WEB INFORMATION FILTERING AND KNOWLEDGE ACQUISITION SYSTEM 23 4.1 WEB PAGE STRUCTURE AND CONTENT VIEWER 24 4.2 REPEATED PATTERNS VIEWER 26 5. DIGITAL ARCHIVE WEB SITE 27 5.1 EXTRACTION OF QUERY INTERFACE OF DIGITAL ARCHIVE WEB SITES 27 5.2 THE EXTRACTION OF QUERYING RESULT OF DIGITAL ARCHIVE WEB PAGE 28 6. EXPERIMENT 30 6.1 EXPERIMENT OF DIGITAL ARCHIVE 30 6.2 EXECUTION TIME OF MIPS AND MDR 32 6.3 THE EXPERIMENT OF TOP-K ALGORITHM 37 6.4 THE VERIFICATION OF PARAMETER K AND REAL EXECUTION TIME 38 7. CONCLUSION 40 8. REFERENCES 41

    REFERENCES
    [1]. Andrew,H., David, K. Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web. In Proc. WWW 2005.
    [2]. Baeza-Yates, R. “Algorithms for string matching: A survey.” ACM SIGIR Forum, 23(3-4):34--58, 1989
    [3]. Cascading Style Sheets (CSS),http://www.w3.org/Style/CSS
    [4]. Chang, C.-H., and Liu, S.-L. IEPAD: Information Extraction Based on Pattern Discovery. In Proc. of WWW, 2001.
    [5]. Document Object Model (DOM),http://www.w3.org/DOM/
    [6]. Extensible Markup Language (XML),http://www.w3.org/XML/
    [7]. Gusfield, D. Algorithms on strings, tree, and sequence, Cambridge. 1997.
    [8]. Marini, J., The Document Object Model, Processing Structured Documents, McGraw-Hill/Osborne, 2002.
    [9]. Hypertext Markup Language (HTML), http://www.w3.org/html/
    [10]. Jie,Z., Danial,L., George,R,T Combining DOM tree and geometric layout analysis for online medical journal article segmentation. Proceedings of the 6th ACM/IEEE-CS joint conference on Digital Libraries (JCDL ’06),
    [11]. Liu, B., Grossman, R. and Zhai, Y. Mining Data Records in Web Pages. KDD-03, 2003.
    [12]. Liu, B. and Zhai, Y. NET - A System for Extracting Web Data from Flat and Nested Data Records Proceedings of 6th International Conference on Web Information Systems Engineering (WISE-05), 2005.
    [13]. Mozilla Firefox, http://www.mozilla.com/firefox/
    [14]. mozStorage Extension, http://hacks.atrus.org/mozStorage/
    [15]. Resource Description Framework (RDF), http://www.w3.org/RDF/
    [16]. Simon, K., and Lausen, G. 2005. ViPER: augmenting automatic information extraction with visual perceptions. In Proc. CIKM’05, 381–388. ACM.
    [17]. SQLite, http://www.sqlite.org/
    [18]. Stuart Langridge, DHTML Utopia Modern Web Design Using JavaScript & DOM, sitepoint, 2005.
    [19]. Web Services Description Language, http://www.w3.org/TR/wsdl
    [20]. XML Binding Language (XBL),http://www.w3.org/TR/xbl/
    [21]. XML User Interface Language (XUL) 1.0, http://www.mozilla.org/projects/xul/xul.html,2001
    [22]. XPCOM, http://www.mozilla.org/projects/xpcom/
    [23]. Ye, S., Chua T.S. Learning Object Models from Semistructured Web Documents. IEEE Transactions on Knowledge and Data Engineering (January 2006), 18/3, 334- 349.
    [24]. Tseng YF, Kao HY, The Mining and Extraction of Primary Informative Blocks and Data Objects from Systematic Web Pages , 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI-2006).
    [25]. Zhai, Y., and Liu, B. Web Data Extraction Based on Partial Tree Alignment. In Proc. of WWW, 2005.
    [26]. Zhao, H.; Meng, W.; Wu, Z.; Raghavan, V.; and Yu, C. 2005. Fully automatic wrapper generation for search engines. In Proc. of WWW, 2005.
    [27]. Zhao, H., Meng, W., and Yu, C. Automatic extraction of dynamic record sections from search engine result pages. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 989–1000. VLDB Endowment, 2006.
    [28]. 彭建翔,精通W3C之DOM 2: 文件物件模型導論, 文魁,2002

    下載圖示 校內:2008-08-28公開
    校外:2008-08-28公開
    QR CODE