
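Author: 石俊麟 (Shih, Chun-Lin)
Thesis title: Efficient Association Rules Mining with Multi-Granularities and the Application on Document Analysis (高效率多重單位關聯式規則探勘與文件分析之應用)
Advisor: 曾新穆 (Tseng, Shin-Mu)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2003
Academic year of graduation: 91 (ROC calendar)
Language: Chinese
Number of pages: 69
Chinese keywords: 文件分類 (text categorization), 多重單位 (multi-granularity), 關聯式規則 (association rules), 資料探勘 (data mining)
English keywords: Association Rule, Text Categorization, Multi-Granularities, Data Mining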
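      As data mining technology has matured, its techniques have been applied in a growing number of fields. Beyond traditional market analysis, applications now include mobility pattern mining in mobile computing, analysis of text and multimedia data, and microarray and bioinformatics analysis in medicine. Among these techniques, association rule mining was the earliest problem to be posed, and several variants of the problem and corresponding algorithms have since been developed. None of the problems proposed so far, however, considers the size of the unit that makes up a transaction record. In this thesis we therefore introduce a new multi-granularity concept for the association rule problem and adapt an existing algorithm to the multi-granularity framework. We then further improve the algorithm's time and space efficiency. In addition, because different granularities need a new standard for defining support so that each yields a comparable number of association rules, we propose a reasonable support definition strategy to address this problem.
      After defining the multi-granularity framework and improving the algorithm, we apply the framework to text categorization to verify its practical usefulness. Our experiments show that text categorization under the multi-granularity framework outperforms the traditional single-granularity framework that uses the document as the unit, which demonstrates the practical value of the multi-granularity framework.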

      Data mining technologies have matured and are now widely applied in various fields. Well-known applications include market analysis, mobility pattern mining, and text and multimedia data analysis. Among data mining techniques, association rule discovery was the first to be explored and is the most extensively studied because of its wide applicability. However, no existing research on association rule mining considers the “granularity” of transaction records. In this research, we propose a new concept for association rules, namely Multi-Granularities Association Rules (MGAR). We also enhance the existing algorithm to incorporate the multi-granularity concept and improve the time and space efficiency of the new algorithm. Furthermore, we propose a reasonable support strategy for discovering large itemsets across different granularities.

      Based on the concept of MGAR, we propose an application to text categorization to verify its practicality. We propose a set of new document categorization methods, and the empirical results show that the MGAR-based methods substantially outperform the traditional single-granularity methods in classification accuracy.
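      The idea of transaction "granularity" in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's Apriori-MG or Apriori-IG algorithm, which this record does not describe. It assumes three granularities (document, paragraph, sentence), treats each unit's set of words as one transaction, and measures the support of an itemset as the fraction of transactions containing it. The function names `transactions` and `support`, the splitting rules, and the sample documents are all hypothetical.

```python
import re


def transactions(docs, granularity):
    """Turn documents into transactions (word sets) at one granularity."""
    if granularity == "document":
        units = list(docs)
    elif granularity == "paragraph":
        # Assumption: paragraphs are separated by a blank line.
        units = [p for d in docs for p in d.split("\n\n")]
    elif granularity == "sentence":
        # Assumption: sentences end with '.', '!' or '?'.
        units = [s for d in docs for s in re.split(r"[.!?]", d)]
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    word_sets = [frozenset(re.findall(r"[a-z]+", u.lower())) for u in units]
    return [w for w in word_sets if w]  # drop empty units


def support(itemset, txns):
    """Fraction of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(items <= t for t in txns) / len(txns)


# Hypothetical two-document corpus, for illustration only.
docs = [
    "Oil prices rose. Crude exports fell.\n\nThe bank cut rates.",
    "The bank raised rates. Oil prices fell.",
]

for g in ("document", "paragraph", "sentence"):
    txns = transactions(docs, g)
    print(g, len(txns),
          support({"oil", "bank"}, txns),
          support({"oil", "prices"}, txns))
```

      In this toy corpus the pair {oil, bank} appears in every document but in no single sentence, so its support falls from 1.0 at document granularity to 0.0 at sentence granularity. A single minimum-support threshold would therefore select very different numbers of itemsets at each granularity, which is the kind of imbalance a per-granularity support strategy would need to address.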

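    English Abstract
    Chinese Abstract
    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Objectives
        1.4 Thesis Organization
    Chapter 2 Related Work
        2.1 Association Rule Mining
            2.1.1 Definition of Association Rules
            2.1.2 Problem Decomposition
            2.1.3 The Apriori Algorithm
            2.1.4 Variations of the Association Rule Problem
        2.2 Association-Rule-Based Text Categorization
            2.2.1 Overview of Text Categorization Methods
            2.2.2 Text Categorization with Association Rules
            2.2.3 Comparison of the Characteristics of Text Categorization Methods
    Chapter 3 Efficient Multi-Granularity Association Rule Mining
        3.1 The Multi-Granularity Framework
        3.2 Multi-Granularity Association Rule Mining: Apriori-MG
        3.3 The Integrated Multi-Granularity Framework
        3.4 Efficient Multi-Granularity Association Rule Mining: Apriori-IG
    Chapter 4 Application of Multi-Granularity Association Rules to Text Categorization
        4.1 Selection of Multi-Granularity Association Rules
        4.2 Text Categorization with Multi-Granularity Association Rules
    Chapter 5 Performance Analysis
        5.1 Experimental Setup
            5.1.1 Test Environment
            5.1.2 Test Data
            5.1.3 Evaluation Metrics
        5.2 Evaluation and Analysis of Multi-Granularity Association Rule Mining Efficiency
            5.2.1 Effect of Multiple Granularities on the Number of Large Itemsets
            5.2.2 Time Efficiency of Efficient Multi-Granularity Association Rule Mining
            5.2.3 Storage Space under the Multi-Granularity Framework
            5.2.4 Summary of the Multi-Granularity Association Rule Mining Experiments
        5.3 Evaluation and Analysis of Multi-Granularity Association Rules Applied to Text Categorization
            5.3.1 Choosing a Suitable Dominance Factor
            5.3.2 Effect of Multiple Granularities and Classification Methods on Precision
            5.3.3 Effect of Multiple Granularities and Classification Methods on Recall
            5.3.4 Effect of Multiple Granularities and Classification Methods on F1-Measure
            5.3.5 Comparison of the Classification Methods in This Study
            5.3.6 Text Categorization Time
            5.3.7 Effect of Training Set Size on Per-Category Classification Results
            5.3.8 Summary of the Multi-Granularity Text Categorization Experiments
    Chapter 6 Conclusions and Future Work
        6.1 Conclusions
        6.2 Future Work
    References
    Vita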


    Full-text availability: on campus, immediately; off campus, from 2003-07-18.