| Graduate student: | 潘柏璇 Pan, Po-Hsuan |
|---|---|
| Thesis title: | 以樹狀結構及新詞判斷分類XML文件之研究 A XML document classification with tree structure and new word class |
| Advisor: | 黃宇翔 Huang, Yeu-Shiang |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2005 |
| Academic year of graduation: | 93 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 60 |
| Chinese keywords: | 文件分類, 樹狀結構, 延伸標記語言, 關聯資訊萃取, 新詞 |
| Foreign-language keywords: | relation extraction, new word, document classification, tree structure, XML |
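The eXtensible Markup Language (XML) specification was developed by the World Wide Web Consortium (W3C) and became a W3C Recommendation in February 1998. XML has gradually become the new standard for exchanging information between different systems and databases on the Web, and because XML documents are structured, classifying large volumes of them has become an important problem. Existing approaches to XML document classification include the Naïve Bayes algorithm, pattern recognition combined with image-segmentation techniques, part-of-speech tagging with rule-based techniques, and TFIDF. Because earlier studies seldom analyze the content of the documents themselves, ambiguous documents and derived, related documents may be misclassified. This study first uses the tree structure of a document to determine an importance level for each item, then applies TFIDF to obtain feature items; a document can then be classified by matching it against the feature items of each category. During classification, important new words in the document are also taken into account to raise classification accuracy. So that the classifier is not confined to its existing feature items, the study also proposes a mechanism for adding important feature items, allowing the classifier to adapt to documents with a wide range of content. Finally, the method is compared with another XML document classification method that also exploits hierarchical structure, and the results show that it significantly improves classification accuracy.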
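The pipeline outlined in the abstract (tree-based importance levels, TFIDF feature items, matching against per-category features, and a check for new words) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The depth weight `1 / (depth + 1)`, the smoothed IDF computed over categories, the `top_k` cutoff, the `min_weight` threshold, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def weighted_terms(xml_text):
    """Return {term: depth-weighted frequency} for one XML document.

    Assumption: a term at depth d (root = 0) counts 1 / (d + 1), so text
    near the root weighs more. Tail text after child elements is ignored.
    """
    weights = Counter()

    def walk(node, depth):
        for term in re.findall(r"\w+", (node.text or "").lower()):
            weights[term] += 1.0 / (depth + 1)
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return weights


def train(labelled_docs, top_k=5):
    """Build {label: {term: score}} keeping the top_k terms per category."""
    per_class = defaultdict(Counter)
    for label, xml_text in labelled_docs:
        per_class[label].update(weighted_terms(xml_text))
    n_classes = len(per_class)
    # Number of categories in which each term occurs (assumed IDF basis).
    class_freq = Counter(t for terms in per_class.values() for t in terms)
    features = {}
    for label, terms in per_class.items():
        scored = {
            t: w * math.log((1 + n_classes) / class_freq[t])
            for t, w in terms.items()
        }
        best = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        features[label] = dict(best)
    return features


def classify(xml_text, features):
    """Pick the category whose feature items best match the document."""
    doc = weighted_terms(xml_text)
    scores = {
        label: sum(doc[t] * w for t, w in feats.items())
        for label, feats in features.items()
    }
    return max(sorted(scores), key=scores.get)


def unseen_terms(xml_text, features, min_weight=1.0):
    """Heavily weighted terms that no category lists yet (new-word candidates)."""
    known = {t for feats in features.values() for t in feats}
    return sorted(
        t for t, w in weighted_terms(xml_text).items()
        if t not in known and w >= min_weight
    )


training = [
    ("sports", "<doc><title>football match</title>"
               "<body><p>the team won the football league</p></body></doc>"),
    ("finance", "<doc><title>stock market</title>"
                "<body><p>the bank raised the stock price</p></body></doc>"),
]
features = train(training)
query = ("<doc><title>football news</title>"
         "<body><p>stock photo</p></body></doc>")
print(classify(query, features))
print(unseen_terms(query, features, min_weight=0.5))
```

In this toy input, "football" sits in the title and "stock" in a deeper paragraph, so the depth weighting favors the sports category. The candidates returned by `unseen_terms` could be added to a category's feature items, which loosely mirrors the mechanism for extending the classifier that the abstract mentions.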