
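Graduate student: 邱志傑 (Chiu, Chih-Chieh)
Thesis title: 不適合存取網站自動分類系統 (Inappropriate Material Websites Automatic Classification System)
Advisor: 王明習 (Wang, Ming-Shi)
Degree: Master
Department: College of Engineering - Department of Engineering Science
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar, 2004-2005)
Language: Chinese
Number of pages: 84
Keywords (translated from Chinese): pornographic websites, inappropriate websites, support vector machine, classification
Keywords (English): Porn website, Inappropriate material websites, SVM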
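     The greatest change the Internet has brought to everyday life is the fast, convenient delivery of information. When that information has a harmful influence on children who are not yet mentally mature, however, it can give rise to many social and moral problems. This thesis is motivated by the need to keep minors from browsing improper websites, and it develops an automatic classification system for inappropriate websites. An intelligent search-engine agent first searches automatically for websites that may be inappropriate. The Website Analysis System (WAS) then extracts each site's directory structure and downloads its complete content, after which the Website Analysis Core (WAC) computes the relevant information and produces seven databases. Next, each website is given a classification vector: the Keyword Agent, the Graphic Agent, and the Link Agent each compute a weight for the site, and all vectors are passed to the Website Rating and Classifying Engine (WRACE), which uses the SVM algorithm to decide whether the site is inappropriate. The main contribution of this thesis is the completed automatic website classification system. It detected 88 of 100 inappropriate websites and misjudged 4 of 100 normal websites. In a random sample of 100 websites it flagged 46 as inappropriate, and manual inspection confirmed that all 46 were; among the remaining sites classified as normal, inspection found only two that were inappropriate. The thesis also built a database of about ten thousand inappropriate websites with a precision of about 99.2%, which educational institutions and schools can later use to block inappropriate websites.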
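     The detection counts reported in the abstract above can be turned into precision and recall figures. The following is a minimal sketch of that arithmetic, not the thesis's own evaluation code. The names `Counts`, `precision`, and `recall` are hypothetical, and the mapping of the reported counts to true positives, false positives, and false negatives is an assumption (for example, the 12 undetected sites in the first test are treated as false negatives).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for the 'inappropriate' class (hypothetical helper)."""
    tp: int  # inappropriate sites flagged as inappropriate
    fp: int  # normal sites flagged as inappropriate
    fn: int  # inappropriate sites classified as normal


def precision(c: Counts) -> float:
    # Share of flagged sites that really are inappropriate.
    return c.tp / (c.tp + c.fp)


def recall(c: Counts) -> float:
    # Share of inappropriate sites that were flagged.
    return c.tp / (c.tp + c.fn)


# Labeled test (assumed mapping): 88 of 100 inappropriate sites detected,
# 4 of 100 normal sites misjudged as inappropriate.
labeled = Counts(tp=88, fp=4, fn=100 - 88)

# Random sample of 100 sites (assumed mapping): 46 flagged and all confirmed,
# 2 inappropriate sites found among those classified as normal.
random_sample = Counts(tp=46, fp=0, fn=2)

for name, counts in (("labeled", labeled), ("random sample", random_sample)):
    print(f"{name}: precision={precision(counts):.3f} recall={recall(counts):.3f}")
```

     Keeping the two test sets separate matters here because they answer different questions: the labeled test measures behavior on known sites of each class, while the random sample approximates the mix of sites the system would meet in practice.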

     One of the greatest changes the Internet has brought to human life is the rapid exchange of information. The information communicated over the Internet is varied. Some content accessible on the Internet, such as sexual or violent material, is not suitable for young people who are still developing. How to keep such content, called inappropriate material, from reaching these young people is an important issue.
     The purpose of this thesis is to develop a website classifier that filters out websites containing inappropriate material. The websites identified this way serve as the targets for Internet content filtering. The classifier consists of several parts. First, an intelligent search-engine agent automatically searches for candidate inappropriate websites through a commercial Internet search engine. Next, the Website Analysis Core generates seven databases from the website structure downloaded by the Website Analysis System. Then the Keyword Agent, the Graphic Agent, and the Link Agent generate weighting vectors from the website's databases. Finally, the Website Rating and Classifying Engine uses a Support Vector Machine to classify whether the website is inappropriate.
     The main contribution of this research is an automatic website classifier that finds inappropriate websites on the Internet as precisely as possible. The precision and recall of the proposed system are 0.96 and 0.95, respectively.
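     The final step of the pipeline described in the abstract, in which the three agents' weights are combined into a vector and classified by a Support Vector Machine, can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's implementation: the record does not give the kernel, the feature scaling, or the trained parameters, so the RBF kernel, the [0, 1] weight range, and the two support vectors in `MODEL` are all hypothetical, as are the names `SiteFeatures`, `SupportVector`, and `is_inappropriate`.

```python
import math
from typing import NamedTuple, Sequence


class SiteFeatures(NamedTuple):
    """One weight per agent, assumed here to be scaled to [0, 1]."""
    keyword: float  # weight from the Keyword Agent
    graphic: float  # weight from the Graphic Agent
    link: float     # weight from the Link Agent


class SupportVector(NamedTuple):
    x: Sequence[float]  # feature vector retained from training
    label: int          # +1 = inappropriate, -1 = normal
    alpha: float        # dual coefficient found during training


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    # K(a, b) = exp(-gamma * ||a - b||^2); the kernel choice is an assumption.
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def decision_value(x, svs, bias: float, gamma: float) -> float:
    # Standard SVM decision function: sum_i alpha_i * y_i * K(x_i, x) + b.
    return sum(sv.alpha * sv.label * rbf_kernel(sv.x, x, gamma) for sv in svs) + bias


def is_inappropriate(features, svs, bias: float = 0.0, gamma: float = 1.0) -> bool:
    # A positive decision value places the site on the 'inappropriate' side.
    return decision_value(features, svs, bias, gamma) > 0


# Hypothetical trained model: hand-picked values for illustration only.
MODEL = [
    SupportVector(x=(0.9, 0.8, 0.7), label=+1, alpha=1.0),
    SupportVector(x=(0.1, 0.1, 0.2), label=-1, alpha=1.0),
]

suspect = SiteFeatures(keyword=0.8, graphic=0.9, link=0.6)
benign = SiteFeatures(keyword=0.05, graphic=0.2, link=0.1)

print(is_inappropriate(suspect, MODEL), is_inappropriate(benign, MODEL))
```

     In a real system the support vectors, coefficients, bias, and kernel parameter would come from training an SVM on labeled websites instead of being written by hand.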

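    Chapter 1 Introduction
      1.1 Current State of Internet Use
      1.2 Research Motivation and Focus
    Chapter 2 Related Work
      2.1 A Negative By-product of Online Information: Inappropriate Information
      2.2 Effects of Inappropriate Websites on Adolescents Under Eighteen
      2.3 Methods of Preventing Access to Inappropriate Websites
        2.3.1 Site Blocking
        2.3.2 Internet Service Blocking
        2.3.3 Content Filtering
        2.3.4 Comparison of the Advantages and Disadvantages of Prevention Methods
      2.4 Literature Review on Preventing Inappropriate Websites
    Chapter 3 Research Method
      3.1 Background Discussion
        3.1.1 Website and Webpage
        3.1.2 Interconnectivity of Inappropriate Websites
        3.1.3 System Architecture Diagram
      3.2 Intelligent Search-Engine Agent (ISEA)
      3.3 Website Analysis System (WAS)
        3.3.1 Website Analysis Format
        3.3.2 Extraction of Website Content
        3.3.3 Website Analysis Core (WAC)
      3.4 Keyword Agent
        3.4.1 Shortcomings of Traditional Keyword Matching
        3.4.2 Selecting Inappropriate Keywords
        3.4.3 Weight Calculation for Inappropriate Keywords
        3.4.4 Keyword Weight of a Website
      3.5 Graphic Agent
        3.5.1 Characteristics of Website Images
        3.5.2 Color Space Conversion
        3.5.3 Skin-Color Features of Images
        3.5.4 Image Formats
        3.5.5 Image Detection Processing Flow
        3.5.6 Weight Calculation for Image Detection
      3.6 Link Agent
        3.6.1 Operation of the Link Agent
        3.6.2 Calculation of Link Weights
        3.6.3 Website Link Analysis
      3.7 ICRA Agent (Website Rating Agent)
      3.8 SVM Algorithm
    Chapter 4 Experimental Results
      4.1 Experimental Environment and Test Samples
        4.1.1 Experimental Environment
        4.1.2 Test Samples
      4.2 Experimental Results of the Intelligent Search-Engine Agent
      4.3 Experimental Results of the Website Analysis System
        4.3.1 Extracting Website Information
        4.3.2 Website Analysis Core (WAC)
        4.3.3 Parallel Processing Support in the Website Analysis System
        4.3.4 Efficiency Analysis of the Website Analysis System
      4.4 Experimental Results of the Keyword Agent
        4.4.1 Sampling Results for Inappropriate Keywords
        4.4.2 Analysis of Keyword Strength Weights for Inappropriate Websites
        4.4.3 Keyword Experiment Results for Inappropriate Websites
      4.5 Experimental Results of the Graphic Agent
        4.5.1 Skin-Color Features
        4.5.2 Skin-Color Distribution of Inappropriate Images
      4.6 Experimental Results of the Link Agent
      4.7 Experimental Results of the Website Rating Label Agent
      4.8 SVM Experimental Results
        4.8.1 SVM Preprocessing
        4.8.2 SVM Experimental Procedure
        4.8.3 SVM Test Results
        4.8.4 Database of Inappropriate Websites
    Chapter 5 Conclusions and Future Work
    References
    Appendix 1: ICRA Website Rating Information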


    Full-text availability: on campus, public from 2006-08-25
    off campus, public from 2006-08-25