研究生: |
李昭輝 Lee, Chao-Hui |
---|---|
論文名稱: |
以規則為基礎的時序資料分類探勘技術之研究 A Study on Rule-Based Classification Techniques for Temporal Data Mining |
指導教授: |
曾新穆
Tseng, Vincent S. |
學位類別: |
博士 Doctor |
系所名稱: |
電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
論文出版年: | 2010 |
畢業學年度: | 99 |
語文別: | 英文 |
論文頁數: | 95 |
中文關鍵詞: | 規則式分類 、時序資料分類 、多變數時序資料 、時序樣式 |
外文關鍵詞: | rule-based classification, temporal data classification, multivariate temporal data, sequential patterns |
相關次數: | 點閱:131 下載:0 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
隨著科技的進步資料的記錄技術越來越發達,各種領域的資料也越來越龐大。而當中隨著時間記錄的資料格式更是逐漸成為目前資料儲存的主流。就如同從照相到錄影一般,錄影所能記錄的資料以及證實的事件比照片更廣範。在醫療照護上也是一樣的情況,心電圖和腦波圖的資料都是為了讓重要的訊息能更準確的被顯示。這樣的資料被稱為時序資料(temporal data)。
資料探勘最主要的目的就是從資料中找出重要的訊息,並期望這些訊息可以給使用者更多的幫助。資料分類是資料探勘方法最常被應用的技術之一,但目前在時序資料的探勘上仍缺少能有效提供分類資訊的資料探勘方法。因此我們在第一個研究主題中我們以序列樣式探勘(sequential pattern mining)的方式為基礎發展CBS (Classification-By-Sequences)演算法,針對時序資料做分類探勘。從資料中找出具有區分類別功能的高頻率序列,利用這些序列以分數累計的方式做分類。這些分類序列除了能達到分類的目的外也可輔助使用者做分類原因的分析。
雖然時序資料已經比靜態的值能更完整的描述事件,但在許多研究領域為了能從更多的角度觀查事件採用多觀察者的方式做長期記錄。比如:氣象觀測從溫度,濕度,風向和氣壓等角度記錄環境天氣。這樣記錄的方式能比較準確且客觀的描述一個事件,但資料的格式相對的也比較複雜不容易被處理。醫學相關的資料中有許多生理訊號資料就常常以這樣多變數時序資料(multivariate temporal data)的型態被記錄,為得是能更仔細的分析或更準確的診斷。在本論文的第二個研究主題,我們將時序資料分類探勘的技術延伸到多變數時序資料的分析上。我們試著從資料的順序以及不同時序變數間的相關度找出與分類相關的資訊。但也因為資料量的龐大,我們加入了純化(purification)的概念來簡化探勘過程的複雜度及加強探勘結果的可讀性。根據此理念我們提出了PTCR-Miner (Progressive Temporal Class Rule Miner)演算法,使得找出的分類規則會因為資訊描述得越清楚,而對所支持的類別越單純且越肯定。整體而言,純化概念限制了部份分類資訊的探勘,雖然讓找到的訊息減少,但是相對比較重要的分類訊息。由實驗也證實了這樣的觀念的加入所找出的特徵確實有分類的價值,並且使用簡單的評分累計的機制就能有相當的準確度。
在多變數時序資料的探勘中,變數總是預設為相同的性質的觀察項目。但在現在跨領域的研究蓬勃發展的研究環境中,多變數資料的變數往往是由不同性質的資料所構成的。因此資料的記錄型態以及資料的取樣頻率都會有所不同。在最後一個研究主題中,我們延伸第二個主題的研究,提出PETR-Miner (Progressive Essential Temporal Rule Miner)演算法,針對多變數資料分類探勘的方法做改進,以可分散式的運算架構在純化的概念下探勘序列特徵。同時也考慮了不受時間影響的分類因素,並以封閉式序列探勘的原理精簡探勘出的分類規則。在分類機制上也考慮了以時序特徵做為分類的不足之處,以比較完整的分類策略進行資料分類。實驗結果證實PETR-Miner演算法在異質性多變數時序資料的分類上能有出色的表現。
With the advances in science, data recording technology has developed rapidly. The information of various fields is also generated at a greatly increased pace. In which data recording with time is becoming the mainstream type of current data storage. Like from photographs to video, one period of video can record and collect more event and truth than a photograph. In medical data, the situation still exists, such as EEG data, which represent series of measured values from a brain. Dynamic bio-signals reveal more information and that make doctors easier to diagnose illnesses and more obvious to annotate events. This kind of data is named temporal data in this dissertation.
The major purpose of data mining technology is to extract important and useful information from data. Classification mining is a popular data mining method for applications. However, there are very few temporal data classification mining methods which can offer useful information for classification results. Hence, we proposed a temporal data classification mining method, named CBS (Classify-By-Sequence) based on sequential pattern mining. CBS method discovers highly frequent temporal sequences with class relations as rules. Temporal data classification is constructed by scoring possible class of data with these classifiable rules. Our method follows the purpose of data mining and all discovered rules can not only be used for classification but also as the reasons of each classification result.
Although temporal data is more complete than attribute-based data for event description, many study areas adopts multiple observations to do long term record to seek more details, such as temperature, humidity, wind and pressure in meteorological observation. In this dissertation, we name it as multivariate temporal data. An event can be recorded more acutely and objectively with this data type, but the data format is relatively complex to be processed. Many medical bio-signal data is usually collected in the format of multivariate temporal data to expect more completely analysis and accurately diagnoses. In the second research topic, we extend our temporal data classification study on multivariate temporal data classification. We attempted to class related information from the order of data and the relations between temporal variables. We defined the purification concept to simplify class rule mining and to make discovered rules more comprehensible. The PTCR-Miner algorithm is proposed following this concept for multivariate temporal data classification. It causes that the more description of the discovered rule has fewer and more confident supported class set. In sum, the purification limits the quantity of discovered rules but all discovered rules are really important information. Experimental results shown the proposed method can be performed effectively and efficiently on multivariate temporal data classification.
Generally, temporal variables of multivariate temporal data are regarded as similar. Nevertheless, in many interdisciplinary researches, temporal variables of multivariate temporal data are very different. Each temporal variable of a dataset may be recorded with different sample period even from different machines. In the final topic of this dissertation, we propose a novel algorithm named PETR-Miner for generalizing our temporal data classification method on the multivariate temporal data with heterogeneous temporal variables. Besides mining the time ordered features between temporal variables, we consider class related item sets and refer the concept of closed sequential pattern mining together to enhance whole mining process. This makes all discovered rules be simple and clear for explanation in each classification. Experimental results show that our method delivers good performance on heterogeneous multivariate temporal datasets.
[1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” in: Proc. 20th Int’l Conf. on Very Large Databases (VLDB’94), pp. 487-499, Santiago, Chile, 1994.
[2] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” in: Proc. 11th Int’l Conf. on Data Engineering (ICDE’95), pp. 3-14, Taiwan, 1995.
[3] K. Ali, S. Manganaris, R. Srikant, “Partial Classification using Association Rules,” in: Proc. 3rd Int’l Conf. on Knowledge Discovery in Databases and Data Mining (KDD’97), pp.115-118, Newport Beach, California, USA, 1997.
[4] R. J. Bayardo Jr., "Brute-Force Mining of High-Confidence Classification Rules," in: Proc. 3rd Int'l Conf. on Knowledge Discovery and Data Mining (KDD'97), pp. 123-126, Newport Beach, California, USA, 1997.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, “Classification and Regression Trees,” Wadsworth International Group, (Belmont, California, USA, 1984).
[6] M. Botsch, J. A. Nossek, "Feature Selection for Change Detection in Multivariate Time Series," in: Proc. of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), pp.590-597, 2007.
[7] R.W. Chang, K.S. Weng and L. Yao, "Clustering of Incomplete Data Based on Ellipsoids with Adaptive Volumes," ICIC Express Letters, vol.3, no.4 (A), pp.1037-1042, 2009.
[8] J. H. Chen and C. S. Chen, “Reducing Classification Time of SVM with Multiple Mirror Classifiers,” IEEE Transactions on Systems Man and Cybernetics, Part B, pp. 1173-1183, 2004.
[9] R. Butterworth, D. A. Simovici, G. S. Santos, Lucila Ohno-Machado, "A greedy algorithm for supervised discretization," Special issue: Biomedical machine learning, Journal of Biomedical Informatics, Vol. 37, pp.285-292, 2004.
[10] H. Cheng, X. Yan, J. Han, and P. S. Yu, "Direct Discriminative Pattern Mining for Effective Classification," in Proceedings of the 24th IEEE International Conference on Data Engineering(ICDE), pp. 169-178, 2008.
[11] B. Chiu, E. Keogh, and S. Lonardi, “Probabilistic discovery of time series motifs,” in: Proc. 9th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, Washington, 2003.
[12] U. M. Fayyad, K. B. Irani, “Multi-interval discretization of continuous valued attributes for classification learning,” in Proc. 13th Int’l Joint Conf. on Uncertainty in Artificial Intelligence (UAI’93), pp. 1022-1027, Morgan Kaufmann, 1993.
[13] E. Frank, M. A. Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H, "Weka - a machine learning workbench for data mining," The Data Mining and Knowledge Discovery Handbook, pp. 1305-1314, Springer, 2005.
[14] A. K. Ghosh, P. Chaudhuri, and C. A. Murthy, “Multiscale Classification Using Nearest Neighbor Density Estimates,” IEEE Transactions on Systems Man and Cybernetics, Part B , pp. 1139-1148, 2006.
[15] S. Hengpraprohm and P. Chongstitvatana, "Feature Selection by Weighted-SNR for Cancer Microarray Data Classification, International Journal of Innovative Computing," Information and Control, vol.5, no.12(A), pp.4627-4636, 2009.
[16] E. Keogh, S. Lonardi, and W. Chiu, “Finding surprising patterns in a time series database in linear time and space,” in: Proc. 8th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pp. 550-556, 2002.
[17] D. Kudenko, H. Hirsh, “Feature Generation for Sequence Categorization,” in: Proc. 15th National/10th Conf. on Artificial Intelligence/Innovative applications of artificial intelligence (AAAI/IAAI’98), pp. 733-738, Menlo Park, California, USA, 1998.
[18] A. Lendasse, M. Verleysen, E. de Bodt, M. Cottrell, and Ph. Grgoire, "Forecasting Time-Series by Kohonen Classification," European Symposium on Artificial Neural Networks 1998, D-Facto Publications (Brussels), pp. 221-226, Bruges (Belgium), April, 1998.
[19] N. Lesh, M. J. Zaki, and M. Ogihara, “Mining Features for Sequence Classification,” in: Proc. 5th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pp. 342-246, San Diego, California, USA, 1999.
[20] C. Li, L. Khan, B. Prabhakaran, "Real-time classification of variable length multi-attribute motions," Knowledge an information Systems, 2006.
[21] J. Lin, E. Keogh, S. Lonardi, B. Chiu, "A Symbolic Representation of Time Series, with Implications for Streaming Algorithms," In proc. 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (KDD2003), pp. 2-11, San Diego, CA, 2003.
[22] B. Liu, W. Hsu and S. Chen, “Using General Impressions to Analyze Discovered Classification Rules,” in Proc. 3rd Int’l Conf. on Knowledge Discovery and Data Mining (KDD’97), pp. 31-36, Newport Beach, California, USA, 1997.
[23] B. Liu, W. Hsu, and Y. Ma, “Integrating Classification and Association Rule Mining,” in: Proc. 4th Int’l Conf. on Knowledge Discovery and Data Mining (KDD’98), New York, USA, 1998.
[24] B. Liu, Y. Ma, C. K. Wong, and P. S. Yu., "Scoring the Data Using Association Rules," Applied Intelligence, Volume 18, Issue 2, pp. 119-135, 2003.
[25] H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovery of Frequent Episodes in Event Sequences,” Data Mining and Knowledge Discovery, Vol. 1, Issue 3, pp.259-285, 1997.
[26] F. Masseglia, F. Cathala, and P. Poncelet, “The PSP Approach for Mining Sequential Patterns,” in: Proc. 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, LNAI, Vol. 1510, pp. 176-184, Nantes, France, 1998.
[27] Y. Matsumoto and J. Watada, "Knowledge Acquisition from Time Series Data through Rough Sets Analysis, International Journal of Innovative Computing," Information and Control, vol.5, no.12(B), pp.4885-4898, 2009.
[28] M. Mehta, R. Agrawal, and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining,” in: Proc. 5th Int’l Conf. on Extending Database Technology, pp.18-32, Avignon, France, 1996.
[29] F. Morchen, A. Ultsch, "Mining Hierarchical Temporal Patterns in Multivariate Time Series," In KI 2004: Advances in Artificial Intelligence, Vol. 3238, pp. 127-140, 2004.
[30] A. Nanopoulos, R. Alcock, and Y. Manolopoulos, “Feature-based Classification of Time-series Data,“ Information processing and technology, Nona Science Publishers, pp. 49-61, 2001.
[31] J. Pei, J. Han, and R. Mao, "An efficient algorithm for mining frequent closed itemsets," in proceeding 2000 ACM-SIGMOD International Workshop Data Mining and Knowledge Discovery (DMKD'00), pp. 11-20, 2000.
[32] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu, "Mining Sequential Patterns by Pattern-growth: The PrefixSpan Approach," IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, pp. 1424-1440, November 2004, IEEE Computer Society.
[33] H. Pinto, J. Han, J. Pei, K. Wang, Q. Chen, U. Dayal, "Multi-Dimensional Sequential Pattern Mining," in: Int'l Conf. on Information and Knowledge Management (CIKM 2001), pp. 81-88, 2001.
[34] J.R. Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufman Publisher Inc., pp. 302, San Francisco, CA, USA, 1993.
[35] P. Sætrom, M. L. Hetland, “Unsupervised Temporal Rule Mining with Genetic Programming and Specialized Hardware,” in: Proc. Int’l Conf. on Machine Learning and Applications (ICMLA’03), Association for Machine Learning and Applications, 2003.
[36] C. A. Ratanamahatana, E. Keogh, "Making Time-series Classification More Accurate Using Learned Constraints," In proceedings of SIAM International Conference on Data Mining (SDM '04), pp. 11-22, Lake Buena Vista, Florida, April, 2004.
[37] C. A. Ratanamahatana, E. Keogh, Everything you know about Dynamic Time Warping is Wrong, 3rd Workshop on Mining Temporal and Sequential Data, in conjunction with the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Seattle, WA, 2004.
[38] D. Roverso, "Multivariate Temporal Classification By Windowed Wavelet Decomposition And Recurrent Neural Networks," In 3 rd ANS International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface, 2000.
[39] M. V. Sudhamani and C. R. Venugopal, "Nonparametric Classification of Data viz. Clustering for Extracting Color Features: An Application for Image Retrieval," ICIC Express Letters, vol.1, no. 1, pp.15-20, 2007.
[40] V. S. Tseng and C. P. Kao, “A Novel Parameter-less Clustering Method for Mining Gene Expression Data,” in: Proc. 8th Pacific Asia Conf. on Knowledge Discovery and Data Mining (PAKDD’04), Sydney, Australia, 2004.
[41] V. S. Tseng and C. P. Kao, “A Novel Similarity-based Fuzzy Clustering Algorithm by Integrating PCM and Mountain Method,” in: IEEE Transactions on Fuzzy Systems, Volume 15, NO. 6, 2007.
[42] V. S. Tseng, C. H. Lee, “Effective Temporal Data Classification by Integrating Sequential Pattern Mining and Probabilistic Induction,” in: Expert System with Applications, Vol. 36, Issue 5, pp. 9524-9532, 2009.
[43] V. S. Tseng, C. H. Lee and J. C. Chen, An Integrated Data Mining System for Patient Monitoring with Applications on Asthma Care, Proc. of the 21th IEEE International Symposium on Computer-Based Medical Systems (CBMS 2008), pp. 290-292, Finland, 2008.
[44] J. Wang, J. Han, "BIDE: efficient mining of frequent closed sequences," in Proceedings of the 20th International Conference on Data Engineering, pp. 79, 2004.
[45] H. Wang, C. Zaniolo, “CMP: A Fast Decision Tree Classifier Using Multivariate Predictions,” in: Proc. 16th Int’l Conf. on Data Engineering (ICDE’00), pp. 449-460, 2000.
[46] X. Weng, J. Shen, Classification of multivariate time series using two-dimensional singular value decomposition, Knowledge-Based Systems, Vol. 21, Issue 7, pp.535-539, 2008.
[47] X. Weng, J. Shen, Classification of multivariate time series using locality preserving projections, Knowledge-Based Systems, Vol. 21, Issue 7, pp. 581-587, 2008.
[48] X. Xi, E. Keogh, C. Shelton, L. Wei & Chotirat Ann Ratanamahatana, "Fast Time Series Classification Using Numerosity Reduction," in: Proc. 23rd Int’l Conf. on Machine Learning (ICML’06), pp. 1033-1040 , New York, USA, 2006.
[49] D. Xin, J. Han, X. Yan, and H. Cheng, “Mining Compressed Frequent-Pattern Sets,” in: Proc. 31st Int’l Conf. on Very Large Data Bases (VLDB’05), pp. 709-720, 2005.
[50] X. Yan, J. Han and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets," In proceeding of SIAM International Conference on Data Mining, pp. 166-177, 2003.
[51] J. Yang, P. Yu, W. Wang, and J. Han, “Mining long sequential patterns in a noisy environment,” in: Proc. SIGMOD, Int’l Conf. on Management of Data, Madison, pp.406-417, Wisconsin, 2002.
[52] M. J. Zaki, “Efficient Enumeration of Frequent Sequences,” in: Proc. 7th Int’l Conf. on Information and Knowledge Management (CIKM’98), pp. 68-75, Washington DC, 1998.
[53] M. J. Zaki, "Generating Non-Redundant Association Rules," in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 34-43, 2000.
[54] M. Zhang, W. Hsu, M. L. Lee, Mining progressive confident rules, Proc. of the 12th ACM SIGKDD international conference on Knowledge Discovery and Data Mining (KDD'06), pp. 803-808, Philadelphia, PA, USA, 2006.
[55] Central Weather Bureau, http://www.cwb.gov.tw/, R.O.C., 2005.
[56] Environmental Protection Administration Executive Yuan, [http://edb.epa.gov.tw/], R.O.C., 2005.
[57] Physionet 2001 Challenge, “http://www.physionet.org/challenge/2001“.
[58] Tainan Asthma Allergic Children Health Association, http://140.116.58.191/asthma/index.php, R.O.C., 2005.
[59] UCI KDD Archive, “http://kdd.ics.uci.edu”.