Graduate student: | Huang, Jie-Fu (黃皆富) |
---|---|
Thesis title: | Classification of Web Documents Using a GA-based KNN Method (網路文件分類系統之建置與探討) |
Advisor: | Tsai, Chang-Chun (蔡長鈞) |
Degree: | Master |
Department: | College of Management - Institute of Information Management |
Year of publication: | 2005 |
Academic year of graduation: | 93 (ROC calendar) |
Language: | Chinese |
Pages: | 54 |
Chinese keywords: | 鄰近鄰居法, 遺傳演算法, 文件分類 |
English keywords: | k-nearest neighbors, document classification, genetic algorithm |
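With the growth of the Internet, advances in computer hardware, and the spread of broadband technology, the flow of information has gradually moved from traditional newspapers, radio, television, and film to computer networks. Information on these networks is presented in many forms and media, including text, images, audio, and video. How to give users the information they need amid such abundance is therefore an urgent problem.
Solving this problem requires good document classification. Having domain experts classify documents produces good results, but experts' capacity is limited, so automated document classification is important.
This study combines a genetic algorithm with KNN (K-Nearest Neighbors) to propose a classification algorithm. The genetic algorithm screens the training documents and removes samples that do not help classification, which produces a training pattern.
For the empirical study, a classification system was implemented for Traditional Chinese web pages containing pornographic content. The experimental data show that the proposed method does improve on the accuracy of unmodified KNN.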
With the prevalence of the Internet, improvements in computer architecture, and the development of broadband technology, the channels of information communication have shifted from newspapers, radio, TV, and movies to computer networks. Computer networks can transmit information as text, pictures, audio, video, and so on. Finding what you want in such rich media is a hard problem.
Document categorization is one way to solve this problem. Documents can be classified correctly using experts' domain knowledge, but the effort experts can afford is limited, and classifying documents manually is not time-efficient. Automatic document categorization has therefore become important.
This study combined a genetic algorithm (GA) with k-nearest neighbors (KNN) to build a novel document categorization algorithm, GA-based KNN (GKN). First, the genetic algorithm was applied to select training samples that are useful for classifying documents and to discard those that are not. The selected samples formed a pattern. Finally, KNN classified documents using the pattern generated by the GA.
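The select-then-classify pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not specify the chromosome encoding, fitness function, or genetic operators, so the binary keep/discard mask, validation-set accuracy as fitness, tournament selection, one-point crossover, bit-flip mutation, Euclidean distance, and every function name and parameter below are hypothetical choices.

```python
import random
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(
        train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k=3):
    """Fraction of validation samples that KNN over `train` labels correctly."""
    if not train:
        return 0.0
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

def apply_mask(mask, samples):
    """Keep only the samples whose mask bit is 1."""
    return [s for keep, s in zip(mask, samples) if keep]

def ga_select_samples(samples, validation, k=3, pop_size=20,
                      generations=30, mutation_rate=0.05, seed=0):
    """Evolve a keep/discard mask over `samples` (assumed encoding)."""
    rng = random.Random(seed)
    n = len(samples)

    def fitness(mask):
        # Assumed fitness: KNN accuracy on a held-out validation set.
        return accuracy(apply_mask(mask, samples), validation, k)

    # Seed the population with the all-ones mask (plain KNN) plus random masks.
    population = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: the best mask always survives
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
    return apply_mask(best, samples)

# Toy 2-D data: label 0 near (0, 0), label 1 near (5, 5), plus two mislabeled points.
training = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0),
            ((5, 5), 1), ((6, 5), 1), ((5, 6), 1),
            ((0.5, 0.5), 1), ((5.5, 5.5), 0)]
validation = [((0.2, 0.3), 0), ((0.8, 0.4), 0), ((5.2, 5.4), 1), ((5.7, 5.1), 1)]

pattern = ga_select_samples(training, validation)
print(len(pattern), accuracy(pattern, validation), accuracy(training, validation))
```

Because the initial population contains the all-ones mask and the best mask is carried over unchanged each generation, the selected pattern in this sketch never scores below plain KNN on the validation set. Real documents would need a text representation (for example term-weight vectors) in place of the toy 2-D points.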
To evaluate GKN, we collected web pages containing pornographic content from the Internet, classified them with GKN and with KNN separately, and compared the two accuracy rates. GKN was more accurate than KNN.