| Graduate student: | 呂文蓁 Lu, Wen-Jane |
|---|---|
| Thesis title: | 新聞文件階層式分類方法之研究 Finding a Suitable Hierarchical Classification for News |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2008 |
| Graduation academic year: | 96 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 52 |
| Chinese keywords: | 文件分類 (text classification), 階層式分類 (hierarchical classification) |
| Foreign-language keywords: | hierarchical classification, text classification |
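With the rapid development of the Internet and the growing availability of broadband, the information circulating online has become increasingly diverse, and the move to electronic news has become a trend. More and more people use electronic news to gather data and compile information. This large volume of electronic information has created a problem of information overload on the Internet. Many print newspapers now offer electronic news so that readers can browse it quickly online, and some news providers offer searchable electronic databases, but most support only keyword queries, so users still face large numbers of search results. Document classification therefore plays an extremely important role in processing and organizing large amounts of news data. Many studies apply statistical and machine learning methods to news document classification; others classify news hierarchically, which lets users narrow down the number of retrieved documents more precisely. However, existing research on hierarchical news classification uses the same classification method from the first level of the hierarchy through to the last.
To help people who collect news data classify it more quickly and accurately, this study classifies news documents hierarchically, using different classification methods and different document content at each level. It seeks the best combination of classification methods for electronic news in order to improve classification accuracy. The experiments show that SVM remains the better classification method, although different combinations perform best under certain evaluation conditions.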
Online news has become a trend. Manual text classification assigns text documents to one or more predefined categories of similar documents, so it is essential to develop an automatic classification method to reduce manual work. News is currently organized in a hierarchical structure, and we ask whether a single classification method is suitable for every level.
In this thesis, we first survey the most popular classification methods. Several suitable combinations are then selected and applied to different levels of the hierarchy. This differs from other studies, which apply one method at all levels to classify news. We use three of the most popular text classification algorithms, Support Vector Machine, Naïve Bayes, and K-Nearest-Neighbor, to classify the Reuters Corpus Volume 1 (RCV1). We expect to find a better combination of classification methods that improves classification accuracy and performance.
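The idea of pairing a different classifier with each hierarchy level can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a two-level hierarchy, uses a small Naïve Bayes model at the top level and a 1-nearest-neighbor model at the second level (SVM is omitted for brevity), and runs on made-up toy sentences rather than RCV1. The class names `NaiveBayes`, `NearestNeighbor`, and `HierarchicalClassifier` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total = sum(self.class_counts.values())

        def log_posterior(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            for w in tokenize(doc):
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            return score

        return max(self.class_counts, key=log_posterior)


class NearestNeighbor:
    """1-nearest-neighbor using cosine similarity over raw term counts."""

    def fit(self, docs, labels):
        self.examples = [(Counter(tokenize(d)), l) for d, l in zip(docs, labels)]
        return self

    def predict(self, doc):
        query = Counter(tokenize(doc))

        def cosine(a, b):
            dot = sum(a[w] * b[w] for w in a)
            norm = math.sqrt(sum(v * v for v in a.values())) * \
                math.sqrt(sum(v * v for v in b.values()))
            return dot / norm if norm else 0.0

        return max(self.examples, key=lambda ex: cosine(query, ex[0]))[1]


class HierarchicalClassifier:
    """Top-down classifier: one model type per level (assumed two levels)."""

    def __init__(self, top_factory, sub_factory):
        self.top_factory = top_factory
        self.sub_factory = sub_factory

    def fit(self, docs, paths):
        # Level 1: a single model over all documents.
        self.top = self.top_factory().fit(docs, [p[0] for p in paths])
        # Level 2: one model per top-level category, trained on its documents only.
        self.sub = {}
        for top_label in {p[0] for p in paths}:
            subset = [(d, p[1]) for d, p in zip(docs, paths) if p[0] == top_label]
            self.sub[top_label] = self.sub_factory().fit(
                [d for d, _ in subset], [s for _, s in subset])
        return self

    def predict(self, doc):
        top_label = self.top.predict(doc)
        return top_label, self.sub[top_label].predict(doc)


# Toy training data (invented for illustration, not taken from RCV1).
TRAIN = [
    ("stocks fell as markets closed lower", ("economy", "markets")),
    ("shares rallied on the stock exchange", ("economy", "markets")),
    ("central bank raised interest rates", ("economy", "policy")),
    ("bank cut rates to fight inflation", ("economy", "policy")),
    ("team won the football match", ("sports", "football")),
    ("striker scored in the football final", ("sports", "football")),
    ("player won the tennis open", ("sports", "tennis")),
    ("tennis champion served an ace", ("sports", "tennis")),
]

docs = [d for d, _ in TRAIN]
paths = [p for _, p in TRAIN]
model = HierarchicalClassifier(NaiveBayes, NearestNeighbor).fit(docs, paths)

if __name__ == "__main__":
    print(model.predict("the bank raised rates"))
```

Swapping the two constructor arguments, for example `HierarchicalClassifier(NearestNeighbor, NaiveBayes)`, gives a different per-level combination, and comparing such combinations on held-out data is the kind of experiment the abstract describes.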