| Graduate student: | 曾一峰 Tseng, Yi-Feng |
|---|---|
| Thesis title: | 在系統化網頁中的主要資訊區塊與資料物件之探勘與擷取 The Mining and Extraction of Primary Informative Blocks and Data Objects from Systematic Web Pages |
| Advisor: | 高宏宇 Kao, Hung-Yu |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2006 |
| Academic year of graduation: | 94 |
| Language: | English |
| Number of pages: | 58 |
| Keywords (Chinese): | 資訊擷取 (information extraction), 文件物件模型 (Document Object Model) |
| Keywords (English): | DOM, Block Importance, Information Extraction |
With the rapid growth of the Internet, the Web has become an enormous database containing abundant information. Many Web pages present their content as a list of objects, such as the results returned by a search engine or the product listings of a shopping site, and these objects form the primary information of each page. This thesis focuses on extracting the primary information of a Web page and the group of objects that constitute it. Our approach is divided into three major phases: (1) By transforming each Web page into its corresponding tree structure, the system can traverse all regions of the page efficiently and detect the informative parts. (2) We design and quantify several novel features based on the characteristics of the regions of a Web page; these features help us further identify the primary information region. (3) We propose a weighting model that calculates the importance of each region and use it to extract the regions that form the primary information of the page. An object identification component then yields the list of objects that make up this region. The experimental results show that the system can be applied to a large number of Web pages with different themes and styles and extracts the correct primary information and the corresponding object lists.
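The three phases can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not name the actual features or weights, so the two features used here (share of page text and repetition of child tags), the weights `w_text` and `w_rep`, and all function names are assumptions chosen only for illustration. The tree is built with the standard-library `html.parser` module as a stand-in for a full DOM.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element


class TreeBuilder(HTMLParser):
    """Phase 1: turn the page into a simple tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def collect_text(node):
    parts = [node.text] + [collect_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


# Phase 2: two hypothetical features, each scaled to [0, 1].
def text_length(node):
    return len(node.text) + sum(text_length(c) for c in node.children)


def repetition(node):
    """Share of children that carry the most common child tag."""
    if len(node.children) < 2:
        return 0.0
    counts = Counter(c.tag for c in node.children)
    return counts.most_common(1)[0][1] / len(node.children)


# Phase 3: weighted sum of the features (weights are assumptions).
def importance(node, total_text, w_text=0.4, w_rep=0.6):
    text_share = text_length(node) / total_text if total_text else 0.0
    return w_text * text_share + w_rep * repetition(node)


def extract_primary_block(html):
    builder = TreeBuilder()
    builder.feed(html)
    total = text_length(builder.root)
    block = max(walk(builder.root), key=lambda n: importance(n, total))
    # Object identification: children sharing the dominant tag are records.
    tags = Counter(c.tag for c in block.children)
    dominant = tags.most_common(1)[0][0] if tags else None
    records = [collect_text(c) for c in block.children if c.tag == dominant]
    return block, records


SAMPLE_HTML = """<html><body>
<div id="nav"><a>Home</a><a>About</a></div>
<ul id="results">
<li><a>Result one</a><p>Snippet text for the first result</p></li>
<li><a>Result two</a><p>Snippet text for the second result</p></li>
<li><a>Result three</a><p>Snippet text for the third result</p></li>
</ul>
<div id="footer">Copyright</div>
</body></html>"""

if __name__ == "__main__":
    block, records = extract_primary_block(SAMPLE_HTML)
    print(block.tag, block.attrs)
    for record in records:
        print("-", record)
```

On the sample page the result list outranks both the navigation bar, which is repetitive but holds little text, and the `body` element, which holds all the text but has mixed children. A weighted sum is used so that neither feature alone decides the outcome; real pages would need more features and tuned weights.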