| Graduate student: | 程彥輔 Cheng, Yan-Fu |
|---|---|
| Thesis title: | 以Bootstrapping方法萃取網路優惠摘要 Extracting Network Preferential Summary with Bootstrapping Method |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2015 |
| Academic year of graduation: | 103 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 56 |
| Chinese keywords: | 文字探勘, 資訊萃取, XML路徑語言, 自助法, 逐點交互訊息 |
| English keywords: | Text mining, Information extraction, XPath, Bootstrapping, Point-Wise Mutual Information |
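The output value of Taiwan's e-commerce industry has grown markedly and continuously since 2008, and consumers are typically most interested in shopping discounts and other promotional information. With today's widespread Internet access, users usually rely on search engines to find the information they need. However, the explosive growth of online information and the freedom of web page design leave web pages full of noise, so it is hard for a search engine to keep its results both current and comprehensive. This is especially true of topic-specific searches, where users often have to judge for themselves whether a result is what they need. These needs make an intelligent mechanism for document retrieval important.
This study uses a bootstrapping method combined with text mining techniques. After promotion-related keywords are identified, deal websites with relatively complete promotional information are taken as seed pages. XML Path Language (XPath) is used to locate the Document Object Model (DOM) positions that hold promotional information, which yields a template for extracting it. Using this template, all pages of the selected websites are downloaded, processed by a word segmentation system, and analyzed with Distance Point-Wise Mutual Information (DPMI), a measure designed to take the distance between words into account. Once this information is stored, the bootstrapping method continues to learn new keywords. Combinations of the learned keywords with store or product names are submitted to a search engine to find more deal websites, and the preceding steps are repeated to extract promotional summaries. Finally, a user interface is built that lets users query promotional information by keyword, for example "buy one get one free", "companion goes free", or "second item half price".
In the experiments, using eight seed keywords gave the best recall and F-measure, and precision after noun merging was 10.7% higher than before merging. With DPMI, a distance of 2 gave the highest precision, 29.4%, which is 9.4 percentage points above the 20% obtained with PMI. In the final experiment, which used keywords combined with store or product names to find new deal websites, precision reached as high as 59%, with a recall of 32.9%.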
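The template step described above (using XPath to locate the DOM positions that hold promotional information) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the sample markup, the `deal` class name, the `TEMPLATE` dictionary, and the `extract_deals` helper do not come from the thesis. The sketch uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, so it only handles well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a seed page.
PAGE = """
<html>
  <body>
    <div class="nav">Home</div>
    <div class="deal"><h2>Store A</h2><p>buy one get one free</p></div>
    <div class="deal"><h2>Store B</h2><p>second item half price</p></div>
  </body>
</html>
"""

# Hypothetical extraction template: one XPath for the deal nodes,
# plus one relative path per field inside each node.
TEMPLATE = {
    "node": ".//div[@class='deal']",
    "store": "h2",
    "offer": "p",
}


def extract_deals(page, template):
    """Apply the template to one page and return a list of deal records."""
    root = ET.fromstring(page.strip())
    deals = []
    for node in root.findall(template["node"]):
        deals.append({
            "store": node.findtext(template["store"]),
            "offer": node.findtext(template["offer"]),
        })
    return deals


deals = extract_deals(PAGE, TEMPLATE)
for deal in deals:
    print(deal["store"], "-", deal["offer"])
```

Real web pages are rarely well-formed XML, so a practical version would probably need a tolerant HTML parser in front of the XPath step.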
The output value of Taiwan's e-commerce industry has grown markedly since 2008. Consumers are most interested in discounts and other promotional ("preferential") information, yet it is difficult for a search engine to keep its results both current and comprehensive.
This research combines a bootstrapping method with text mining. After promotional keywords are determined, websites with complete promotional information are set as seed pages. XML Path Language (XPath) is used to find the Document Object Model (DOM) positions of promotional information, which yields a pattern for extracting it. Using this pattern, the web pages of the chosen websites are downloaded and analyzed with a word segmentation system and Distance Point-Wise Mutual Information (DPMI), and new promotional keywords are learned with the bootstrapping method. The promotional keywords are then combined with store or product names and submitted to a search engine to find new promotional websites. Finally, a user interface is developed that lets users look up promotional information such as "buy one, get one free" and "buy one, get one at half price".
The experimental results show that DPMI with a word distance of two achieves the highest precision, 29.4%, which is 9.4 percentage points higher than the 20% obtained with PMI.
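To make the comparison between PMI and a distance-aware variant more concrete, the sketch below computes pointwise mutual information, log2(P(x, y) / (P(x) P(y))), over ordered word pairs that lie at most `max_distance` tokens apart. The abstract does not give the DPMI formula, so this window-based counting is only one hypothetical way to take word distance into account and should not be read as the thesis's definition. The function name `windowed_pmi` and the sample sentences are likewise assumptions.

```python
import math
from collections import Counter


def windowed_pmi(sentences, max_distance):
    """PMI for ordered word pairs at most `max_distance` tokens apart.

    Assumption: co-occurrence is counted inside a forward window.
    This is an illustrative stand-in, not the thesis's DPMI.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Pair each word with the next `max_distance` words.
            for right in tokens[i + 1 : i + 1 + max_distance]:
                pair_counts[(left, right)] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (left, right), count in pair_counts.items():
        p_pair = count / total_pairs
        p_left = word_counts[left] / total_words
        p_right = word_counts[right] / total_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


# Hypothetical, already segmented sentences.
sentences = [
    ["buy", "one", "get", "one", "free"],
    ["second", "item", "half", "price"],
]
adjacent_only = windowed_pmi(sentences, max_distance=1)
within_two = windowed_pmi(sentences, max_distance=2)
```

Widening the window adds pairs such as ("buy", "get") that adjacent-only counting misses, which is the kind of effect a distance parameter is meant to control.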
On campus: publicly available from 2020-07-13