研究生: |
余謝輝 Yu, Hsieh-Hui |
---|---|
論文名稱: |
時間序列前處理與樣式探勘技術 Time Series Data Preprocessing and Pattern Mining Techniques |
指導教授: |
曾新穆
Tseng, Shin-Mu |
學位類別: |
博士 Doctor |
系所名稱: |
電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
論文出版年: | 2010 |
畢業學年度: | 99 |
語文別: | 英文 |
論文頁數: | 92 |
中文關鍵詞: | 資料探勘 、資料前處理 、資料精簡 、資料離散化 、遺傳演算法 、分群 、時間序列分析 |
外文關鍵詞: | data mining, data preprocessing, data reduction, data discretization, genetic algorithm, clustering, time series analysis |
相關次數: | 點閱:141 下載:1 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
近年來,由於資訊的爆炸性發展,巨量的資料愈易取得,人們對於如何從中獲取有用的知識也顯得愈形迫切,造就了更多資料探勘技術被廣泛的發展及應用於資料的分析上。這些挖掘出來的資訊就像是所羅門王的寶藏一樣,在各領域,如商業戰略、社群分析、科學探索及醫學研究上都顯得相當珍貴。然而,真實世界的資料,因為大量而廣泛的來源,造成了許多的雜訊和不一致性。這些低品質的資料就像是一張模糊的藏寶圖, 探索家永遠不可能找到正確的寶藏位置。因此,如何改善資料的品質,藉以提升資料探勘結果的可信度就變成了一個很大的挑戰。另外,如何從特殊形式的資料庫(例如具有高維度與少量樣本)中探勘出有用的樣式也一直是一個重要議題。
在本論文中,針對時間序列資料庫和交易資料庫,我們分別提出了幾個資料前處理的方法來提升資料分析及樣式探勘的準確性。對於時間序列資料庫而言,我們提出了兩種資料離散化方法並應用於樣式的探勘。第一種方為以切割技術為基礎之資料離散方法,此方法結合了分群技術與遺傳演算法,可以自動的產生樣式。第二種方法,稱為PIP-SAX,是一種結合精簡化與符號化概念的資料離散化方法。我們利用這個方法來做顯露式樣式(Emerging Patterns)的探勘。上述兩個方法,透過真實財金資料的實驗分析與驗證,的確可提供實用且有效的分析結果。
此外,對於擁有極高維度且少量樣本的特殊資料,我們則提了出一個可結合專業知識庫與多階層資訊精簡技術的方法。所提出的方法具有下列兩個優點,第一點為:所提出的方法可以利用不同階層順序同時考慮多種不同的科學算術標準。第二點為:透過多元知識庫的整合,提供使用者更具有相關知識根據的分析結果。
In recent year, data mining techniques have been extensively used in data analysis due to the wide availability of huge amounts of data and imminent need for mining useful information from such data. The useful information is like King Solomon’s treasure which can be greatly helpful in many fields, such as business strategies, society analysis, scientific exploration, medical research and etc. However, real-world databases are highly susceptible to noisy and inconsistent due to the heterogeneous sources and huge size. Low-quality data is just like a blurring treasure map, explorers can never find the correct site. How to improve the quality of data and enhance the mining results thus presents a challenge. In addition, how to mine interesting patterns from the special case of transaction databases (i.e. data with small samples and large number of features) is also a critical issue.
In this dissertation, some effective preprocessing approaches are proposed for data analysis and pattern discovery. We address the issue on two types of data, time-series databases and transaction databases.
For the time series databases, we propose two discretization methods for pattern discovery. The first one is a segmentation-based discretization method, which combines the clustering techniques and genetic algorithm to derive patterns automatically. The second one is a reduction and symbolization-based discretization method, namely PIP-SAX. The proposed method is utilized in Emerging Patterns (EPs) mining. Experimental results on real financial dataset also show the effectiveness of the two proposed approaches.
Moreover, for the special case of transaction databases which has small samples and large number of features, a novel multi-information-based data reduction approach hybrid with a knowledge-based data integration approach is proposed. The proposed method has two main advantages. The first one is that the proposed approach can take multi-criterion into consideration in different order. The second one is that the proposed approach can get better results by integrating with heterogeneous knowledge databases.
[1] J. Aach and G. Church, Aligning gene expression time series with time warping algorithms, Bioinformatics, Vol. 17, pp 495-508, 2001.
[2] R. Agrawal, C. Faloutsos, & A. Swami, Efficient similarity search in sequence databases, Proc. of the 4th Conf. on Foundations of Data Organization and Algorithms, pp.69-84, 1993.
[3] R. Agrawal, T. Imielinski, & A. Swami, Mining association rules between sets of items in large databases, Proc. of 1993 ACM SIGMOD Int'l Conf. on Management of Data, pp.207-216, 1993.
[4] R. Agrawal, & R. Srikant, Fast algorithms for mining association rules, Proc. of the 20th VLDB Conference, pp.478–499, 1994.
[5] M. S. Aldenderfer and R. K. Blashfield, Cluster Analysis, Sage Publications, Beverly Hills, USA, 1984.
[6] J. M. Ale, & G. H. Rossi, An approach to discovering temporal association rules, Proc. of the 2000 ACM Symposium on Applied Computing, pp.294-300, 2000.
[7] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays, in Proceeding of the National Academy Sciences USA 96 (1999) 6745-6750.
[8] F. Al-Shahrour, R. Daz-Uriarte, J. Dopazo, Fatigo: a web tool for finding significant associations of gene ontology terms with groups of genes, Bioinformatics 20 (2004)578-580.
[9] J. Ayres, J. Flannick, J. Gehrke, & T. Yiu, Sequential pattern mining using a bitmap representation, Proc. of the 8th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp.429-435, 2002.
[10] F. Azuaje, O. Bodenreider, Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study, in Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE’04) (2004) 317-324.
[11] J. Bailey, T. Manoukian, & K. Ramamohanarao, Fast algorithms for mining emerging patterns, Proc. of 6th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pp.39-50, 2002.
[12] S. D. Bay, & M. J. Pazzani, Detecting change in categorical data: mining contrast sets, Proc. of the 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp.302-306, 1999.
[13] R. Bellman, On the approximation of curves by line segments using dynamic programming, Communications of the ACM, Vol. 4, No. 6, pp. 284, 1961.
[14] A. Ben-dor, R. Shamir, Z. Yakhini, Clustering gene expression patterns, Journal of Computational Biology 6 (1999) 281-297.
[15] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, Z. Yakhini, Tissue classification with gene expression profiles, Journal of Computational Biology 7 (2000) 559-584.
[16] A. Ben-Dor, N. Friedman, Z. Yakhini, Overabundance analysis and class discovery in gene expression data, Agilent Technical Report (2002).
[17] B. J. Berme, P. Pechukas, Gaussian model potentials for molecular interactions, Journal of Chemical Physics 56 (1972) 4213-4216.
[18] D. Berrar, W. Dubitzky, M. Granzow, R. Ells, in Proceedings of Oral and Poster Presenters' Abstracts, Critical Assessment of Microarray Data Analysis (2001) 23-28.
[19] K. Chan, & W. Fu, Efficient time series matching by wavelets, Proc. of the 15th IEEE Int'l Conf. on Data Engineering, pp.126-133, 1999.
[20] S. Chan, B. Kao, C. L. Yip, & M. Tang, Mining emerging substrings, Proc. of the 8th Int'l Conf. on Database Systems for Advanced Applications, pp.119- 126, 2003.
[21] J. R. Chen, “Making subsequence time series clustering meaningful,” The IEEE International Conference on Data Mining, pp. 114-121, 2005.
[22] H. Y. Chuang, H.F. Liu, S. Brown, C. McMunn-Coffran, C.Y. Kao, D.F. Hsu, Identifying significant genes from microarray data, in Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE’04) (2004) 358-365.
[23] F. L. Chung, T. C. Fu, V. T. Y. Ng, & R. W. P. Luk, An evolutionary approach to pattern-based time series segmentation, IEEE Trans. Evolutionary Computation, 8 (5): pp.471-489, 2004.
[24] P. Cohen and N. Adams, Unsupervised Segmentation of Categorical Time Series into Episodes, Proceeding of IEEE International Conference on Data Mining, pp. 99-106, 2002.
[25] C. Creighton, S. Hanash, Mining gene expression databases for association rules, Bioinformatics 19 (2003) 79-86.
[26] X. Cui, G.A. Churchill, Statistical tests for differential expression in cdna microarray experiments, Genome Biology 4 (2003) pp. 210.1-210.10.
[27] D. L. Davies and D. W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1, No. 4, pp. 224 - 227, 2000.
[28] G. Dong, & J. Li, Efficient mining of emerging patterns: Discovering trends and differences, Proc. of the 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp.43–52, 1999.
[29] G. Dong, J. Li, & X. Zhang, Discovering jumping emerging patterns and experiments on real datasets, Proc. of the 9th Int'l Database Conference, pp.155-168, 1999.
[30] G. Dong, & J. Li, Mining border descriptions of emerging patterns from dataset pairs, Knowledge and Information Systems, 8, pp.178–202, 2005.
[31] G. Dong, X. Zhang, L. Wong, & J. Li, CAEP: classification by aggregating emerging patterns, Discovery Science, pp.30-42, 1999.
[32] J. Dunn, Well separated clusters and optimal fuzzy partitions, Journal of Cybernetics, Vol. 4, pp. 95 - 104, 1974.
[33] M. B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, in Proceeding of the National Academy Sciences USA 95 (1998) 14863-14868.
[34] S. Erdal, O. Ozturk, D. Armbruster, H. Ferhatosmanoglu and W.C. Ray, A time series analysis of microarray data, The IEEE Symposium on Bioinformatics and Bioengineering, pp. 366-375, 2004.
[35] H. Fan, M. Fan, K. Ramamohanarao, & M. Liu, Further improving emerging pattern based classifiers via bagging, Proc. of the 10th Pacific-Asia Conf. on Knowledge Discovery and Data, pp.91-96, 2006.
[36] C. L. Fancoua and J. C. Principe, A Neighborhood Map of Competing One Step Predictors for Piecewise Segmentation and Identification of Time Series, IEEE Internal Conference on Neural Networks, Vol. 4, pp.1906-1911, 1996.
[37] L. Feng, K. Ju and K. H. Chon, A method for segmentation of switching dynamic modes in time series, IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol. 35, No. 5, pp. 1058-1064, 2005.
[38] T. C. Fu, F. L. Chung, V. Ng and R. Luk, Evolutionary segmentation of financial time series into subsequences, The Congress on Evolutionary Computation, Vol. 1, pp. 426 - 430, 2001.
[39] G. Getz, H. Gal, I. Kela, D.A. Notterman, E. Domany, Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data, Bioinformatics 19 (2003) 1079-1089.
[40] V. Guralnik and J. Srivastava, Event detection from time series data, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 33-42, 1999.
[41] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Machine Learning 46 (2002) 389-422.
[42] J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, The Internal Conference on Data Engineering, pp. 106-115, 1999.
[43] J. Himberg, K. Korpiaho, H. Mannila,J. Tikanmaki and H.T.T. Toivonen, Time series segmentation for context recognition in mobile devices, The IEEE International Conference on Data Mining, pp. 203-210, 2001.
[44] Y. W. Huang and P. S. Yu, Adaptive Query Processing for Time-Series Data, in Proceeding of the 5th Int’l Conference on Knowledge Discovery and Data Mining, pp. 282-286, San Diego, CA, Aug 15-18, 1999.
[45] L. Hubert and J. Schultz, Quadratic assignment as a general data-analysis strategy, British Journal of Mathematical and Statistical Psychology, Vol. 29, pp. 190 - 241, 1976.
[46] A. Icev, C. Ruiz, E. F. Ryder, Distance-enhanced association rules for gene expression, in Proceedings of the 3rd International Workshop on Data Mining in Bioinformatics,(BIOKDD’03) (2003) 34-40.
[47] W. Jiang, Y. Xu, D. Shi and K. Xia, A novel dynamic clustering algorithm and its application in fuzzy modeling, IEEE International Conference on Granular Computing, pp. 284 – 289, 2009.
[48] E. Keogh, http://www.cs.ucr.edu/~eamonn/discords/, 2005.
[49] E. Keogh, K. Chakrabarti, M. Pazzani, & S. Mehrotra, Dimensionality reduction for fast similarity search in large time series databases, Knowledge and Information Systems, 3 (3): pp.263-286, 2001.
[50] E. Keogh, S. Chu, D. Hart and M. Pazzani, An online algorithm for segmenting time series, The IEEE Internal Conference Data Mining, pp. 289-296, 2001.
[51] E. Keogh, J. Lin and W. Truppel, Clustering of time series subsequences is meaningless: implications for previous and future research, The IEEE International Conference on Data Mining, pp. 115 – 122, 2003.
[52] P. Kralj, N. Lavrac, D. Gamberger, & A. Krstacic, Contrast set mining for distinguishing between similar diseases, Proc. of the 11th Conf. on Artificial Intelligence in Medicine, pp.109-118, 2007.
[53] A. Lendasse, J. Lee, E. de Bodt, V. Wertz and M. Verleysen, Dimension reduction of technical indicators for the prediction of financial time series - Application to the BEL20 Market Index, European Journal of Economic and Social Systems, Vol. 15, No. 2, pp. 31-48, 2001.
[54] H. Li, R. Li, & F. Wang, Blind separation algorithm for spatial independent time series, ICIC Express Letters, vol.4, no.2, pp.553-558, 2010.
[55] J. Li, G. Dong, & K. Ramamohanarao, Making use of the most expressive jumping emerging patterns for classification, Knowledge and Information Systems, 3 (2): pp.131-145, 2001.
[56] J. Li, L. Wong, Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns, Bioinformatics 18 (2002) 725-734.
[57] J. Li, & Q. Yang, Strong compound-risk factors: efficient discovery through emerging patterns and contrast sets, IEEE Transactions on Information Technology in Biomedicine, 11 (5): pp.544-552, 2007.
[58] J. Lin, E. Keogh, S. Lonardi, & B. Chiu, A symbolic representation of time series, with implications for streaming algorithms, Proc. of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp.2-11, 2003.
[59] J. Lin, & E. Keogh, Group SAX: extending the notion of contrast sets to time series and multimedia data, Proc. of the 10th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pp.284-296, 2006.
[60] B. Lkhagva, Y. Suzuki, & K. Kawagoe, Extended SAX: extension of symbolic aggregate approximation for financial time series data representation, Proc. of IEICE the 17th Data Engineering Workshop, 2006.
[61] Y. Matsumoto, & J. Watada, Knowledge acquisition from time series data through rough sets analysis, International Journal of Innovative Computing, Information and Control, vol.5, no.12(B), pp.4885-4898, 2009.
[62] J. B. McQueen, Some methods of classification and analysis of mutivariate observations, The Symposium on Mathematical Satistics and Probability, 1967, pp. 281-297.
[63] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 1994.
[64] R. Ming, H. Ting, & J. Bailey, Mining minimal contrast sub-graph patterns, Proc. of the 3rd VLDB Workshop on Secure Data Management, pp.639-643, 2006.
[65] J. J. Oliver, R. A. Baxter and C. S. Wallace, Minimum Message Length Segmentation, The. Pacific-Asia Conference Knowledge Discovery Data Mining, pp. 222-233, 1998.
[66] W. Pan, A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments, Bioinformatics 18 (2002) 546-554.
[67] P. Pavlidis, C. Tang, W.S. Noble, Classification of genes using probabilistic models of microarray expression profiles, in Proceedings of the first international Workshop on Data Mining in Bioinformatics (BIOKDD’01) (2001) 15-21.
[68] W. Penny and S. Roberts, Dynamic models for nonstationary signal segmentation, Computers and Biomedical Research, Vol. 32, No. 6, pp. 483-502, 1999.
[69] D. B. Percival and A. T. Walden, Wavelet methods for time series analysis, published by Cambridge University Press, 2000.
[70] Jianlong Qi, Jian Tang, Integrating gene ontology into discriminative powers of genes for feature selection in microarray data, in Proceedings of the 22nd ACM Symposium on Applied Computing (SAC’07) (2007) 430-434.
[71] J. Rahnenfuhrer, F.S. Domingues, J. Maydt, T. Lengauer, Calculating the statistical significance of changes in pathway activity from gene expression data, Statistical Applications in Genetics and Molecular Biology 3 (2004) Article 16.
[72] H. Shatkay, Approximate Queries and Representations for Large Data Sequences, Technical Report cs-95-03, Department of Computer Science, Brown University, 1995.
[73] D.K. Slonim, P. Tamayo, J.P. Mesirov, T.R. Golub, E.S. Lander, Class prediction and discovery using gene expression data, in Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB’00) (2000) 263-272.
[74] URL of Taiwan Stock Exchange (TWSE). http://www.twse.com.tw/en/
[75] H. Tamura, K. Tanno, H. Tanaka, & Z. Zhang, Bi-directional function learning method for time series prediction, ICIC Express Letters, vol.2, no.4, pp.401-407, 2008.
[76] The Gene Ontology Consortium, The gene ontology (go) database and informatics resource, Nucleic Acids Research 32 (2004) D258-D261.
[77] V. S. Tseng, C. H. Chen, P. C. Huang and T. P. Hong, Segmentation of time series by the clustering and genetic algorithms, Pattern Recognition Letters, Vol. 30, 1190-1197, 2009.
[78] V. S. Tseng, C. H. Chen, P. C. Huang and T. P. Hong, A cluster-based genetic approach for segmentation of time series and pattern discovery, The IEEE Congress on Evolutionary Computation, pp. 1949-1953, 2008.
[79] Vincent S. Tseng and Ching-Ping Kao, Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 2, No. 4, pp. 355 - 365, 2005.
[80] J. P. C. Valente and I. L. Chavarrias, Discovering similar patterns in time series, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 497-505, 2000.
[81] L. J. van't Veer, H. Dai, Gene expression profiling predicts clinical outcome of breast cancer, Nature, 415 (2002) 530-536.
[82] H. Wakuya, Enrichment of inner information representations in bi-directional computing architecture for time series prediction, International Journal of Innovative Computing, Information and Control, vol.4, no.11, pp.3079-3090, 2008.
[83] C. Wang and S. Wang, Supporting content-based searches on time Series via approximation, Proceedings of the 12th International Conference on Scientific and Statistical Database Management, 2000.
[84] H. Wang, F. Azuaje, Gene expression correlation and gene ontology-based similarity: An assessment of quantitative relationships, in Proceeding of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB’04) (2004) 25-31.
[85] X. Y. Wang and Z. O. Wang, A Structure-Adaptive Piece-Wise Linear Segments Representation for Time Series, Proceeding of IEEE International Conference on Information Reuse and Integration, pp. 433 - 437, 2004.
[86] Weka [ http://www.cs.waikato.ac.nz/ml/weka/].
[87] X. Xu, A. Zhang, Selecting informative genes from microarray dataset by incorporating gene ontology, in Proceedings of IEEE Fifth Symposium on Bioinformatics and Bioengineering (BIBE’05) (2005) 241-245.
[88] L. Yan and Q. Liu, Granular resolution and granular reasoning, IEEE International Conference on Granular Computing, pp. 668 – 671, 2009.
[89] Y. H. Yang, S. Dudoit, P. Luu, T. P. Speed, Normalization for cDNA microarray data, in Proceeding of International Biomedical Optics Symposium (2001) 141-152.
[90] Y. Yao, Interpreting concept learning in cognitive informatics and granular computing, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 39 , No. 4, pp. 855 – 866, 2009.
[91] B. K. Yi, & C. Faloutsos, Fast Time Sequence Indexing for Arbitrary Lp Norms, Proc. of the VLDB, pp.385-394, 2000.
[92] L. T. H. Yu, F. L. Chung, S. C. F. Chan, & S. M. C. Yuen, Using emerging pattern based projected clustering and gene expression data for cancer detection, Proc. of 2nd conf. on Asia-Pacific bioinformatics, pp.75-84, 2004.
[93] H. H. Yu, Vincent S. Tseng, J. H. Chuang, A multi-Information based gene scoring model with applications on analysis of hepatocellular carcinoma, in Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE’04) (2004) 345-350.
[94] M. J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning, 42, pp.31-60, 2001.
[95] Z. Zhang and H. Wang, A fast subspace partition clustering algorithm for high dimensional data streams, IEEE International Conference on Intelligent Computing and Intelligent Systems, Vol. 1, pp. 491 – 495, 2009.