National Cheng Kung University Electronic Theses and Dissertations System


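Graduate student:	曾一峰 Tseng, Yi-Feng
Thesis title:	在系統化網頁中的主要資訊區塊與資料物件之探勘與擷取 The Mining and Extraction of Primary Informative Blocks and Data Objects from Systematic Web Pages
Advisor:	高宏宇 Kao, Hung-Yu
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication:	2006
Academic year of graduation:	94 (ROC calendar, 2005-2006)
Language:	English
Number of pages:	58
Keywords (Chinese, translated):	Information Extraction, Document Object Model
Keywords (English):	DOM, Block Importance, Information Extraction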

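Abstract (translated from the Chinese): With the rapid growth of the Internet, the Web has become an enormous database holding a wealth of information. Many Web pages present their content as a list of objects, such as the individual results returned by a search engine or the product entries on a shopping site, and these groups of objects form the primary information of the page. This thesis focuses on extracting the primary information of a Web page and the object groups that constitute it, in three major steps. First, by converting a Web page into its corresponding tree structure, the system can quickly traverse every region of the page and detect the parts that carry information. Second, we design and quantify several novel features based on the characteristics of page regions; these feature values let us further identify the primary information region. Finally, we propose a weighting model that computes the importance of each region and extracts the regions forming the page's primary information; an object identification component then yields the list of objects that make up each such region. Experimental results show that the system can be applied to Web pages on a wide variety of topics and extracts the correct primary information and object lists.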

With the rapid development of the Internet, the Web has become an enormous database containing abundant information. Many Web pages present their content as a list of objects, such as search engine results or product listings on shopping sites, and these objects form the primary information of each page. In this thesis, we focus on mining the primary information of a page and the object groups that constitute it. Our approach is divided into three major phases: (1) By transforming each Web page into its corresponding tree structure, our system can visit all regions of the page efficiently and detect the informative parts. (2) We design and quantify several novel features based on the characteristics of the regions of a Web page; these features help us further identify the primary information region. (3) We propose a weighting model that calculates the importance of each region, and we then extract the primary information of the page. Using an object identification component, we obtain the list of objects that form the informative region. The experimental results show that our system can be applied to a large number of Web pages with different themes and styles to find the correct primary information and the corresponding list of objects.
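The first phase described above (walking a tree form of the page and flagging regions whose children repeat a similar structure) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the names `TreeBuilder`, `tag_sequence`, and `candidate_blocks`, the adjacent-sibling comparison, and the `threshold` and `min_children` values are all assumptions made for illustration, and the sketch assumes well-formed HTML.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a minimal tag-only tree (hypothetical helper, not a full DOM)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get a closing tag
            self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

def tag_sequence(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def candidate_blocks(root, threshold=0.3, min_children=3):
    """Return (node, similar_pair_count) for nodes whose adjacent
    children all have similar tag sequences (assumed criterion)."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        seqs = [tag_sequence(c) for c in node.children]
        if len(seqs) < min_children:
            continue
        similar = sum(
            edit_distance(s, t) / max(len(s), len(t)) <= threshold
            for s, t in zip(seqs, seqs[1:])
        )
        if similar == len(seqs) - 1:
            found.append((node, similar))
    return found

html = """
<html><body>
<div id="nav"><a>Home</a><span>About</span></div>
<ul>
<li><a>Item 1</a><span>$1</span></li>
<li><a>Item 2</a><span>$2</span></li>
<li><a>Item 3</a><span>$3</span><b>new</b></li>
</ul>
</body></html>
"""
builder = TreeBuilder()
builder.feed(html)
blocks = candidate_blocks(builder.root)
print([(n.tag, k) for n, k in blocks])  # prints [('ul', 2)]
```

In this toy page only the `ul` node is flagged, because its three `li` children differ by at most one tag. A fuller system along the lines of the abstract would go on to score such candidates with additional features and a weighting model before choosing the primary region.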

Contents
CHINESE ABSTRACT
ABSTRACT
FIGURE LISTING
TABLE LISTING
1.	INTRODUCTION
1.1	PROBLEM OVERVIEW
1.2	MOTIVATION
1.3	OUR APPROACH
2.	RELATED WORK
2.1	RECORD-BOUNDARY DISCOVERY IN WEB DOCUMENT
2.2	ROADRUNNER
2.3	IEPAD
2.4	DELA
2.5	EXALG
2.6	MDR
2.7	VINTS
3.	INTELLIGENT KNOWLEDGE MINING SYSTEM
3.1	SYSTEM ARCHITECTURE
3.1.1	Building a DOM tree and Detecting the Candidate IBs
3.1.2	Features Quantifying and Objects Mining
3.1.3	Blocks Importance Calculating and Ranking
3.2	CANDIDATE IBS DETECTION
3.2.1	Document Object Model
3.2.2	An Efficient Detection of Candidate IB
3.2.3	Edit Distance
3.3	BLOCK FEATURES DESIGNING
3.3.1	Measuring the Similarity of a set of RP
3.3.2	Spatial Feature, the Calculation of Density
3.3.3	The Distribution of Items Style of Objects
3.4	BLOCKS IMPORTANCE AND IBS MINING
3.4.1	The Block Importance Model
3.4.2	Identifying Correct Data Objects
4.	EXPERIMENTS AND RESULT
4.1	EDIT DISTANCE THRESHOLD
4.2	THE IMPACT ON STORAGE REQUIREMENTS
4.2.1	Document String of whole pages and IBs
4.2.2	The number of extracted objects in different mechanisms
4.3	OVERALL PERFORMANCE
4.3.1	Search engine results, Shopping sites and other topic of systematic pages
4.3.2	Multiple Data Objects
4.3.3	Multiple IBs
4.3.4	Overall Comparison
4.4	FEATURES SELECTION
5.	CONCLUSION AND FUTURE WORK
REFERENCES

FIGURE LISTING
FIGURE 1-1: AN EXAMPLE OF A SYSTEMATIC WEB PAGE IN BUY.COM
FIGURE 3-1: SYSTEM ARCHITECTURE OF INTELLIGENT KNOWLEDGE MINING SYSTEM
FIGURE 3-2: AN HTML CODE AND CORRESPONDING DOM TREE
FIGURE 3-3: AN EXAMPLE OF TRAVERSE AND COMPARISON
FIGURE 3-4: AN EXAMPLE OF REGULAR PATTERNS GROUP
FIGURE 3-5: EXAMPLE OF THE ITEMS STYLE OF RPS. (A) THE ORIGINAL WEB PAGE. (B) THE CORRESPONDING DOM TREE OF THE FIRST RP OF (A).
FIGURE 3-6: AN EXAMPLE OF RP CONTAINING MULTIPLE OBJECTS
FIGURE 3-7: ALGORITHM OF IDENTIFYING OBJECTS
FIGURE 3-8: ALGORITHM OF FINDING NEXT CANDIDATE IB
FIGURE 4-1: THE DISTRIBUTION OF NUMBER OF OBJECTS
FIGURE 4-2: THE DISTRIBUTION OF RATIO OF DATA OBJECTS
FIGURE 4-3: TOTAL STORAGE REQUIREMENTS FOR THE ORIGINAL PAGE AND IBS BY IKM. (A) RESULT OF EACH PAGE. (B) RESULT OF THE NUMBER OF PAGES.
FIGURE 4-4: TOTAL EXTRACTED OBJECTS BY THREE MECHANISMS. (A) THE RESULT OF EACH PAGE. (B) THE RESULT OF THE NUMBER OF PAGES.
FIGURE 4-5: PERFORMANCE OF IBS FROM IKM AND MDR. (A) RECALL (B) PRECISION.
FIGURE 4-6: PERFORMANCE OF DATA-OBJECTS FROM IKM AND MDR. (A) RECALL (B) PRECISION.
FIGURE 4-7: PERFORMANCE OF EXTRACTED IBS OF SINGLE FEATURE SELECTION. (A) RECALL (B) PRECISION.
FIGURE 4-8: PERFORMANCE OF EXTRACTED DATA OBJECTS OF SINGLE FEATURES SELECTION. (A) RECALL (B) PRECISION.
FIGURE 4-9: PERFORMANCE OF EXTRACTED IBS OF TWO FEATURES SELECTION. (A) RECALL (B) PRECISION.
FIGURE 4-10: PERFORMANCE OF EXTRACTED DATA OBJECTS OF TWO FEATURES SELECTION. (A) RECALL (B) PRECISION.

TABLE LISTING
TABLE 4-1: PERFORMANCE OF SEARCH ENGINE RESULT
TABLE 4-2: PERFORMANCE OF SHOPPING SITES
TABLE 4-3: PERFORMANCE OF OTHER TOPICS
TABLE 4-4: RPS CONTAINING MULTIPLE OBJECTS
TABLE 4-5: MULTIPLE IBS
TABLE 4-6: OVERALL PERFORMANCE OF COMPARISON OF IBS
TABLE 4-7: OVERALL PERFORMANCE OF COMPARISON OF DATA OBJECTS


                                    

Made public: 2006-08-23
