| Graduate student: | 石俊麟 Shih, Chun-Lin |
|---|---|
| Thesis title: | Efficient Association Rules Mining with Multi-Granularities and the Application on Document Analysis (高效率多重單位關聯式規則探勘與文件分析之應用) |
| Advisor: | 曾新穆 Tseng, Shin-Mu |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2003 |
| Academic year of graduation: | 91 |
| Language: | Chinese |
| Number of pages: | 69 |
| Keywords (Chinese): | 文件分類, 多重單位, 關聯式規則, 資料探勘 |
| Keywords (English): | Association Rule, Text Categorization, Multi-Granularities, Data Mining |
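As data mining technology has matured, more and more data mining techniques have been applied to a wide range of fields. Beyond traditional market analysis, applications now include mobility pattern mining in mobile computing, analysis of text and multimedia data, and microarrays and bioinformatics in medicine. Among these, association rule mining was the earliest problem to be proposed, and it has since given rise to several problem variants and related algorithms. However, none of the problems proposed so far considers the size of the unit (the granularity) of a transaction record. In this thesis, we therefore introduce a new multi-granularity concept for the association rule problem and adapt the existing algorithm to a multi-granularity framework. We then further improve the time and space efficiency of the algorithm. In addition, because different granularities require a new standard for defining support in order to yield comparable numbers of association rules, we also propose a reasonable support definition strategy to address this problem.
After defining the multi-granularity framework and improving the algorithm, we apply the framework to text categorization to verify its practical usefulness. Our experiments show that classification under the multi-granularity framework outperforms the traditional single-granularity framework, in which the document is the only unit. This demonstrates the practical value of the multi-granularity framework.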
Data mining technologies have matured and are now widely applied in various fields. Well-known applications include market analysis, mobility pattern mining, and text and multimedia data analysis. Among all data mining techniques, association rule discovery was the first to be explored and remains the most extensively studied because of its wide applicability. However, no existing research on association rule mining considers the “granularity” of transaction records. In this research, we propose a new concept for association rules, namely Multi-Granularities Association Rules (MGAR). We also enhance the existing algorithm to incorporate the Multi-Granularities concept and improve the time and space efficiency of the new algorithm. Furthermore, we propose a reasonable support strategy for discovering large itemsets across different granularities.
Based on the concept of MGAR, we propose an application to text categorization to verify its practicality. We propose a set of new document categorization methods, and the empirical results show that the MGAR-based methods substantially outperform the traditional Single-Granularity-based methods in classification accuracy.
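To make the notion of granularity concrete, the following is a minimal Python sketch and not the algorithm developed in the thesis. It assumes three illustrative granularities (document, paragraph, sentence), although the abstract only names the document as the traditional unit. The helper names `transactions` and `support` and the sample texts are hypothetical. The sketch cuts the same texts into transactions at each granularity and measures the support of one itemset.

```python
import re

def transactions(docs, granularity):
    """Cut documents into transactions (sets of terms) at one granularity.

    Assumption: paragraphs are separated by blank lines and sentences
    end with a period; these units are illustrative only.
    """
    units = []
    for doc in docs:
        if granularity == "document":
            parts = [doc]
        elif granularity == "paragraph":
            parts = doc.split("\n\n")
        elif granularity == "sentence":
            parts = doc.split(".")
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        for part in parts:
            terms = set(re.findall(r"[a-z]+", part.lower()))
            if terms:  # skip empty fragments left by splitting
                units.append(terms)
    return units

def support(itemset, units):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= unit for unit in units) / len(units)

# Hypothetical sample data, for illustration only.
docs = [
    "data mining finds rules. rules describe data.\n\ntext needs mining.",
    "text categorization uses rules.",
]

for level in ("document", "paragraph", "sentence"):
    units = transactions(docs, level)
    print(level, len(units), support({"text", "rules"}, units))
```

In this toy data the support of the same itemset falls as the unit gets finer, which suggests why a single minimum-support threshold may not yield comparable numbers of rules at every granularity.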