Graduate student: | 邱志傑 Chiu, Chih-Chieh |
---|---|
Thesis title: | 不適合存取網站自動分類系統 Inappropriate Material Websites Automatic Classification System |
Advisor: | 王明習 Wang, Ming-Shi |
Degree: | Master |
Department: | College of Engineering - Department of Engineering Science |
Year of publication: | 2005 |
Graduating academic year: | 93 |
Language: | Chinese |
Number of pages: | 84 |
Keywords (Chinese): | Pornographic websites, inappropriate websites, support vector machine classification |
Keywords (English): | Porn website, Inappropriate material websites, SVM |
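The biggest change the Internet has brought to everyday life is the fast, convenient transmission of information. When that information has a harmful influence on children who are not yet mentally mature, however, it can give rise to many social and moral problems. The motivation of this thesis is to develop an automatic classification system for inappropriate websites, so that minors can be kept from browsing improper sites. An intelligent search-engine agent first automatically searches for websites that may be inappropriate. The Website Analysis System (WAS) then extracts each site's directory structure and downloads its complete content, after which the Web Analysis Core (WAC) computes the relevant information and produces seven databases. Next, the site is given a suitable classification vector: the Keyword Agent, the Graphic Agent, and the Link Agent each compute the site's weights, and all of the vectors are passed to the Website Rating and Classifying Engine (WRACE), which uses the SVM algorithm to decide whether the site is inappropriate. The main contribution of this thesis is the completed automatic website classification system. It detected 88 of 100 inappropriate websites and misjudged 4 of 100 normal websites. Among 100 randomly sampled websites it flagged 46 as inappropriate, and manual inspection confirmed that all 46 were indeed inappropriate; among the remaining sites classified as normal, inspection found only two that were inappropriate. This thesis also built a database of about ten thousand inappropriate websites with an accuracy of about 99.2%, which can later be provided to educational institutions and schools for blocking inappropriate websites.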
One of the greatest changes the Internet has brought to human life is the rapid exchange of information. The information communicated over the Internet is varied. Some of the content accessible on the Internet, such as sexual or violent material, is not suitable for young people who are still developing. How to keep this content, called inappropriate material, from reaching these young people is an important issue.
The purpose of this thesis is to develop a website classifier that filters out websites with inappropriate material. The filtered websites serve as the subjects for Internet content filtering. The classifier consists of several parts. First, an intelligent search-engine agent automatically searches for candidate inappropriate websites through a commercial Internet search engine. Next, the Web Analysis Core (WAC) generates seven databases from the website architecture downloaded by the Website Analysis System (WAS). Then the Keyword Agent, the Graphic Agent, and the Link Agent generate the weighting vectors from the website's databases. Finally, the Website Rating and Classifying Engine (WRACE) uses a Support Vector Machine to classify whether the website is inappropriate.
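The last two steps of the pipeline (agent weights combined into one vector, then an SVM decision) can be illustrated with a minimal sketch. The abstract gives no feature dimensions, kernel, or trained parameters, so the class `SiteFeatures`, the weight vector `w`, the bias `b`, and every number below are hypothetical. A trained SVM would learn `w` and `b` from labeled sites, and a kernel SVM would replace the plain dot product.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SiteFeatures:
    """Hypothetical per-site weights, one list per agent."""
    keyword: List[float]  # from the Keyword Agent
    graphic: List[float]  # from the Graphic Agent
    link: List[float]     # from the Link Agent

    def vector(self) -> List[float]:
        # Concatenate the three agents' weights into one feature vector.
        return self.keyword + self.graphic + self.link


def linear_svm_decision(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Linear SVM decision value f(x) = w . x + b."""
    if len(x) != len(w):
        raise ValueError("feature vector and weight vector differ in length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def is_inappropriate(site: SiteFeatures, w: Sequence[float], b: float) -> bool:
    # A positive decision value puts the site on the "inappropriate" side.
    return linear_svm_decision(site.vector(), w, b) > 0


# Invented parameters and sites, for illustration only.
w = [1.0, 1.0, 1.0]
b = -1.5
suspect = SiteFeatures(keyword=[0.8], graphic=[0.6], link=[0.7])
benign = SiteFeatures(keyword=[0.1], graphic=[0.2], link=[0.1])

print(is_inappropriate(suspect, w, b))
print(is_inappropriate(benign, w, b))
```

Keeping each agent's output separate until `vector()` is called mirrors the description above, in which the three agents score a site independently before the classifier sees a single combined vector.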
The main contribution of this research is an automatic website classifier that finds inappropriate websites on the Internet as precisely as possible. The precision and recall of the proposed system are 0.96 and 0.95, respectively.
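For readers who want to relate precision and recall to raw counts, the short sketch below applies the standard definitions to the two sets of counts given in the Chinese abstract (88 of 100 inappropriate sites detected with 4 of 100 normal sites misjudged, and 46 confirmed detections with 2 misses in a random sample of 100). The abstract does not say which test set the reported 0.96 and 0.95 come from, so this only illustrates the definitions. The function name and the mapping of counts to true positives, false positives, and false negatives are assumptions.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Labeled test sets: 88 of 100 inappropriate sites detected (12 missed),
# 4 of 100 normal sites misjudged as inappropriate.
labeled = precision_recall(tp=88, fp=4, fn=12)

# Random sample of 100 sites: 46 flagged and all confirmed by manual inspection,
# 2 inappropriate sites found among those classified as normal.
sampled = precision_recall(tp=46, fp=0, fn=2)

print(f"labeled sets:  precision={labeled[0]:.3f} recall={labeled[1]:.3f}")
print(f"random sample: precision={sampled[0]:.3f} recall={sampled[1]:.3f}")
```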