Author: Pan, Po-Hsuan (潘柏璇)
Thesis title: 以樹狀結構及新詞判斷分類XML文件之研究
A XML document classification with tree structure and new word class
Advisor: Huang, Yeu-Shiang (黃宇翔)
Degree: Master
Department: College of Management, Institute of Information Management
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: Chinese
Number of pages: 60
Keywords: document classification (文件分類), tree structure (樹狀結構), XML (延伸標記語言), relation extraction (關聯資訊萃取), new word (新詞)
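  •   The eXtensible Markup Language (XML) specification was developed by the World Wide Web Consortium (W3C) and became a W3C Recommendation in February 1998. XML has gradually become a new standard for exchanging information among different systems and databases on the Web, and because of its structured nature, classifying large volumes of XML documents has become an important problem. Existing approaches to XML document classification include the Naïve Bayes algorithm, template recognition with image-segmentation techniques, part-of-speech tagging with rule-based techniques, and TFIDF. Because earlier studies rarely analyze the content of the document itself, ambiguous documents or derived related documents may not be classified correctly. This study first uses the tree structure of a document to determine the importance level of each item, then applies the TFIDF method to obtain feature items; a document is then classified by matching it against the feature items of each category. During classification, important new words in the document are also taken into account to improve accuracy. So that the classifier is not restricted to its existing feature items, the study also proposes a mechanism for adding important feature items, allowing the classifier to adapt to documents with a wide range of content. Finally, the method is compared with another XML document classification method that also exploits hierarchical structure; the results show that the proposed method significantly improves classification accuracy.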
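
The pipeline summarized in the abstract (importance derived from the document tree, TFIDF feature selection, and matching against per-category feature items) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the 1/depth weighting, the category-level IDF, the `top_k` cutoff, and every function name are hypothetical choices made here. New-word handling and the feature-update mechanism are omitted.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def weighted_terms(xml_text):
    """Count the words of one XML document, weighting each by 1/depth.

    Assumption: text in shallower elements is more important. Only the
    leading text of each element is read; tail text is ignored.
    """
    counts = Counter()

    def walk(node, depth):
        for word in (node.text or "").lower().split():
            counts[word] += 1.0 / depth
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 1)
    return counts


def class_features(training, top_k=3):
    """training maps a category label to a list of XML strings.

    Returns {label: {term: score}} holding the top_k terms per category,
    scored as weighted frequency times log(N / number of categories
    containing the term), where N is the number of categories.
    """
    per_class = {}
    for label, docs in training.items():
        total = Counter()
        for doc in docs:
            total.update(weighted_terms(doc))
        per_class[label] = total

    n_classes = len(per_class)
    class_freq = Counter()
    for total in per_class.values():
        class_freq.update(total.keys())

    features = {}
    for label, total in per_class.items():
        scored = {t: w * math.log(n_classes / class_freq[t]) for t, w in total.items()}
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        features[label] = {t: s for t, s in ranked[:top_k] if s > 0}
    return features


def classify(xml_text, features):
    """Pick the category whose feature items best match the document."""
    doc = weighted_terms(xml_text)
    scores = {
        label: sum(doc[t] * s for t, s in feats.items())
        for label, feats in features.items()
    }
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
training = {
    "sports": ["<doc><title>football match</title><body><p>team wins football cup</p></body></doc>"],
    "finance": ["<doc><title>stock market</title><body><p>bank raises stock rates</p></body></doc>"],
}
features = class_features(training)
label = classify("<doc><title>football news</title><body><p>the team played</p></body></doc>", features)
print(label)
```

In this sketch a word that occurs in every category receives a score of zero and is never selected as a feature item, which is the usual effect of an IDF-style factor.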

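    Table of Contents

    Abstract
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1: Introduction
      Section 1: Research background
      Section 2: Research motivation
      Section 3: Research objectives
    Chapter 2: Literature review
      Section 1: XML and its applications in information retrieval
      Section 2: Document classification models
        1. Boolean model
        2. Vector model
        3. Probabilistic model
        4. Back-propagation neural network
      Section 3: Text processing
        1. TFIDF keyword weight calculation
        2. Association strength
        3. Part-of-speech tagging
      Section 4: Document classification applied to XML documents
    Chapter 3: An XML document classification method based on tree structure and new-word determination
      Section 1: Problem description
      Section 2: Processing flow of XML document classification
      Section 3: XML document classification method
        1. Deriving weights from the tree structure of the document
        2. Category feature words
        3. Classification
        4. New-word handling
        5. Updating feature words
    Chapter 4: Empirical study
      Section 1: XML document classification method
        1. XML document collection
        2. XML document preprocessing
        3. Generating category feature words from the training documents
        4. Document classification
      Section 2: Analysis and comparison of empirical results
      Section 3: Discussion
    Chapter 5: Conclusions and suggestions
      Section 1: Research results
      Section 2: Research limitations
      Section 3: Future research directions
    References

    List of Figures

    Figure 3-1: Content-based XML document classification method
    Figure 3-2: Flow of the XML document classification method
    Figure 3-3: Example XML document
    Figure 3-4: TFIDF algorithm
    Figure 4-1: Partial original content of a test document
    Figure 4-2: Accuracy comparison between the proposed method and the classification method of Wang (王常威, 2004)
    Figure 4-3: Relationship between the number of feature items in each category and the number of correctly classified documents
    Figure 4-4: Number of correctly classified documents per category in each fold

    List of Tables

    Table 3-1: Feature items generated from the training documents
    Table 3-2: Representative items generated from the test documents
    Table 3-3: Importance score of each word in an XML document
    Table 3-4: Stop-word list (斷字表) used in this study
    Table 4-1: Category names and number of XML documents in this study
    Table 4-2: Items in the training documents and their importance levels
    Table 4-3: XML document IDs in each fold for each category
    Table 4-4: Feature items of each category
    Table 4-5: Top 15% representative items of a test document
    Table 4-6: Weight of each category in a test document
    Table 4-7: Empirical comparison between the proposed method and the method of Wang (王常威, 2004), 9 folds
    Table 4-8: Analysis of the difference in accuracy between the proposed method and the classification method of Wang (王常威, 2004)
    Table 4-9: Weight of each category in a test document
    Table 4-10: Classification results for the test documents
    Table 4-11: Accuracy of determining the category of important new words in different folds
    Table 4-12: Comparison of classification accuracy between an 80%/20% and a 60%/40% training/test split
    Table 4-13: Accuracy of determining the category of important new words
    Table 4-14: Important new words of category D generated from the test documents of the third fold
    Table 4-15: Updated feature items of category D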

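    References

    Chinese-language sources

    王常威 (2004), “以內容為基礎之XML文件分類方法之研究” [A study of a content-based XML document classification method], Master's thesis, Institute of Information Management, National Cheng Kung University

    巫啟台 (2002), “文件之關聯資訊萃取及其概念圖自動建構” [Relation extraction from documents and automatic construction of concept maps], Master's thesis, Institute of Computer Science and Information Engineering, National Cheng Kung University

    English-language sources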

    Aizawa, A.(2000), “The feature quantity: an information theoretic perspective of Tfidf-like measures”, Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, July, pp.104 - 111

    Brill, E.(1992), “A simple rule-based part of speech tagger”, In Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy, pp.152-155

    Berger, H., Dittenbach, M. and Merkl, D.(2004), “An adaptive information retrieval system based on associative networks”, Proceedings of the first Asian-Pacific conference on Conceptual modeling, Vol.31, January, pp.27 - 36

    Bernstein, A., Provost, F. and Clearwater, S.(2003), “The Relational Vector-space Model and Industry Classification”, Working Notes of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data (SRL-2003), In L. Getoor and D. Jensen, editors, Acapulco, Mexico, August, pp. 8-18

    Bray, T., Paoli, J., Sperberg-McQueen, C. M. and Maler, E.(2000), “Extensible Markup Language (XML) 1.0, 2nd edn.”, W3C recommendation, Technical Report REC-xml-20001006, World Wide Web Consortium

    Bruzzone, L. and Melgani, F.(2003), “An advanced classification system based on the back-propagation of consensus”, IEEE International Geoscience and Remote Sensing Symposium (IGARSS'03), Toulouse, 21-25 July, pp.1785-1787

    Califf, M.E. and Mooney, R.J.(2003), “Bottom-up relational learning of pattern matching rules for information extraction”, The Journal of Machine Learning Research, Vol.4, September, pp.177-210

    Calvo, R.A. and Ceccatto, H.A.(2000), “Intelligent document classification”, Intelligent Data Analysis, Vol.4, February, pp.411-420

    Calvo, R.A., Lee, J.M. and Li, X.(2004), “Managing Content with Automatic Document Classification”, Journal of Digital Information, Vol.5, No.282, June, http://jodi.ecs.soton.ac.uk/Articles/v05/i02/Calvo/calvo-final.pdf

    Carr, O. and Estival, D.(2002), “Text classification of formatted text documents”, Australasian Natural Language Processing Workshop In conjunction with the 15th Australian Joint Conference on Artificial Intelligence, DSTO, Adelaide (AU), pp.49-54

    Chen, Y.S. and Chu, T.H.(1995), “A neural network classification tree”, IEEE International Conference on Neural Networks, Vol.1, pp.409-413

    Chen, L., Tokuda, N. and Nagai, A.(2003), “A new differential LSI space-based probabilistic document classifier”, Information Processing Letters, Vol.88, No.5, December, pp.203-212

    Chung, Y. D., Kim, J.W. and Kim, M.H.(2003), “Efficient preprocessing of XML queries using structured signatures”, Information Processing Letters, Vol.87, pp. 257-264

    Delden, S.V. and Gomez, F. (2004), “Retrieving NASA problem reports: a case study in natural language information retrieval”, Data & Knowledge Engineering, Vol.48, No.2, February, pp. 231-246

    Denoyer, L. and Gallinari P. (2004), “Bayesian Network Model for Semi-Structured Document Classification”, Information Processing and Management, September, pp.807-827

    Denoyer, L., Vittaut, J.N., Gallinari, P., Brunessaux, S. and Brunessaux, S.(2003), “Structured multimedia document classification”, Proceedings of the ACM symposium on Document engineering, November, pp.153-160

    Dhillon, I.S., Mallela, S. and Kumar, R.(2003), “A divisive information theoretic feature clustering algorithm for text classification”, The Journal of Machine Learning Research, Vol.3, pp.1265-1287

    Fuhr, N.(1992), “Probabilistic models in information retrieval”, The Computer Journal, 35(3), pp.243-255

    Fuhr, N. and Pfeifer, U.(1994), “Probabilistic information retrieval as a combination of abstraction, inductive learning, and probabilistic assumptions”, ACM Transactions on Information Systems (TOIS), Vol.12, No.1, January, pp.92-115

    Gershenson, C.(2002), “Classification of Random Boolean Networks”, Artificial Life VIII: Proceedings of the Eighth International Conference on Artificial Life, Sydney, Australia, MIT Press, pp.1-8

    Godbole, S.(2001), “Document Classification as an Internet service: choosing the best classifier”, School of Information Technology, Bombay, September, http://www.it.iitb.ac.in/~shantanu/work/mtpsg.pdf

    Goldman, R., McHugh, J. and Widom, J.(1999), “From Semistructured Data to XML: Migrating the Lore Data Model and Query Language”, In Proc. of the 2nd International Workshop on the Web and Databases, June, pp.25-30

    Guillaume, D. and Murtagh, F.(2000), “Clustering of XML documents”, Computer Physics Communications, Vol.127, May , pp. 215-227

    Han, C.C.(2002), “A supervised classification scheme using positive Boolean function”, 16th International Conference on Pattern Recognition (ICPR'02), Vol.2 , pp.100-103

    Jenkins, C. and Inman, D.(2000), “Adaptive Automatic Classification on the Web”, 11th International Workshop on Database and Expert Systems Application, pp.504-511

    Jing, L.P., Huang, H.K. and Shi, H.B.(2002), “Improved feature selection approach TFIDF in text mining”, Proceedings of the First International Conference on Machine Learning and Cybernetics, Vol.2, Beijing, 4-5 November, pp.944-946

    Joachims, T.(1996), “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization”, Proceedings of the Fourteenth International Conference on Machine Learning, pp.143-151

    Kim, D., Jung, H. and Lee, G.G.(2003), “Unsupervised learning of mDTD extraction patterns for Web text mining”, Information Processing & Management, Vol.39, No.4, July, pp.623-637

    Liu, S., Dong, M., Zhang, H. and Shi, Z. (2002), “An Approach of Multi-hierarchy Text Classification”, Journal of Chinese Information, Vol.16, No.3, pp.95-100

    Mehler, A.(2000), “Text Mining with the Help of Cohesion Trees, Classification, Automation, and New Media”, Proceedings of the 24th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Passau, March 15-17, pp.199-206

    Meteer, M., Schwartz, R. and Weischedel, R.(1991), “Empirical Studies in Part of Speech Labelling”, Proceedings of the DARPA Speech and Natural Language Workshop, Morgan Kaufmann

    Mihalcea, R. and Moldovan, D.(2000), “Semantic indexing using wordnet senses”, In Proceedings of ACL 2000 Workshop on Recent Advances in NLP and IR, http://acl.ldc.upenn.edu/W/W00/W00-1104.pdf

    Mostafa, J. and Lam, W.(2000), “Automatic Classification Using Supervised Learning in a Medical Document Filtering Application”, Information Processing and Management, 36(3), pp.415-444

    Mostafa, J., Mukhopadhyay, S., Palakal, M. and Lam, W.(1997), “A multilevel approach to intelligent information filtering: model, system, and evaluation”, ACM Transactions on Information Systems (TOIS), Vol.15, No.4, pp.368-399

    Ng, Y.K., Tang, J. and Goodrich, M.(2001), “A binary-categorization approach for classifying multiple-record Web documents using application ontologies and a probabilistic model”, Proceedings of the 7th International Conference on Database Systems for Advanced Applications (DASFAA 2001), Hong Kong, April 18-20, pp.58-65

    Paulson, P. and Tzanavari, A.(2003), “Combining Collaborative and Content-Based Filtering Using Conceptual Graphs”, Modelling with Words, pp.168-185

    Rajman, M. and Besan, R., “Text mining - knowledge extraction from unstructured textual data”, In Proceedings of the 6th Conference of International Federation of Classification Societies (IFCS-98), Roma, Italy, pp.473-480

    Baeza-Yates, R. and Ribeiro-Neto, B.(1999), Modern Information Retrieval, Addison-Wesley, ACM Press, New York

    Robertson, S. E. and Sparck Jones, K.(1976), “Relevance weighting of search terms”, Journal of the American Society for Information Science, Vol.27, No.3, pp.129-146

    Rocchio, J.J. (1971), “Relevance Feedback in Information Retrieval”, Chap. 14 in The SMART retrieval system-Experiments in automatic document processing, ed. G. Salton, Englewood Cliffs, New Jersey, pp. 313-323

    Roesner, D. and Kunze, M.(2002), “An XML-based document suite”, Proceedings of COLING, pp.1278-1282

    Ruiz, M.E. and Srinivasan, P.(2002), “Hierarchical text categorization using neural networks”, Information Retrieval, Vol.5, No.1, pp.87-118

    Sadohara, K.(2002), “On a capacity control using Boolean kernels for the learning of Boolean functions”, IEEE International Conference on Data Mining (ICDM'02), pp. 410 - 417

    Salton, G.(1991), “Developments in Automatic Text Retrieval”, Science, Vol.253, 30 August, pp.974-979

    Salton, G., Fox, E.A., Buckley, C. and Voorhees, E.M.(1983), “Boolean Query Formulation with Relevance Feedback”, Communications of the ACM, Vol.26, January

    Salton, G., Wong, A. and Yang, C.S.(1975), “A vector space model for automatic indexing”, Communications of the ACM, Vol.18, No.11, November, pp.613-620

    Schlieder, T. and Meuss, H.(2002), “Querying and ranking XML documents”, Journal of the American Society for Information Science and Technology (JASIST), Vol.53, No.6, April, pp.489-503

    Scott, S. and Matwin, S.(1999), “Feature engineering for text classification”, In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 379-388

    Sebastiani, F.(2002), “Machine learning in automated text categorization”, ACM Computing Surveys (CSUR), Vol.34, pp.1-47

    Stokoe, C., Oakes, M.P. and Tait, J.(2003), “Word sense disambiguation in information retrieval revisited”, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, July, pp.159-166

    Strzalkowski, T.(1994), “Robust text processing in automated information retrieval”, In Proceedings of the 4th Conference on Applied Natural Language Processing, Stuttgart, Germany, ACL, pp.168-173

    Sung, L.C., Chen, M.C. and Kuo, C.H.(2002), “Web Document Classification based on Tagged-Region Progressive Analysis”, http://ranger.uta.edu/~alp/ix/readings/webDocClassification.pdf

    Trotman, A.(2004), “Searching structured documents”, Information Processing and Management Vol.40, pp.619-632

    Wang, Q., Park, I. and Zhang, P.(2003), “Automatic extraction of the unlisted terms in the field of information technology based on the dynamic circulation corpus”, Natural Language Processing and Knowledge Engineering Proceedings, October, pp.452-458

    Williams, K. and Calvo, R.A.(2002), “A Framework for Text Categorization”, Proceedings of the 7th Australasian Document Computing Symposium, Sydney, Australia, December, pp.13-19

    Wong, P.C., Whitney P., and Thomas, J.(1999), “Visualizing Association Rules for Text Mining”, Proceedings of the 1999 IEEE Symposium on Information Visualization, San Francisco, CA, October 24-29, pp.120-123

    Yang, Y. and Pedersen, J.O. (1997), “A comparative study on feature selection in text categorization”. In Proceedings of ICML-97, International Conference on Machine Learning, pp.412-420.

    Full-text availability: on campus, public from 2006-06-24
    Off campus, public from 2006-06-24