| 研究生: |
賴政坪 Lai, Cheng-Ping |
|---|---|
| 論文名稱: |
時序及效益性資料庫探勘技術之研究 A Study on Temporal and Utility-based Database Mining Techniques |
| 指導教授: |
詹寶珠
Chung, Pao-Choo |
| 共同指導教授: |
曾新穆
Tseng, Shin-Mu |
| 學位類別: |
博士 Doctor |
| 系所名稱: |
電機資訊學院 - 電機工程學系 Department of Electrical Engineering |
| 論文出版年: | 2010 |
| 畢業學年度: | 98 |
| 語文別: | 英文 |
| 論文頁數: | 89 |
| 中文關鍵詞: | 時間序列分群 、維度降低 、高效益項目探勘 、模糊理論 、模糊高效益項目探勘 、糢糊時序高效益項目探勘 |
| 外文關鍵詞: | Time series clustering, Data sampling, Dimension reduction, High utility itemsets mining, Fuzzy set theory, Fuzzy high utility itemsets, Fuzzy temporal high utility itemsets |
| 相關次數: | 點閱:105 下載:1 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
資料探勘技術可以用來發現資料庫中許多有用之資訊,如物件變化之趨勢及演變之特徵等。在不同的資料庫型態中,時序及效益性資料庫之探勘,已是相當重要之議題。我們針對這些性質之資料庫,提出了三個創新之觀念及演算法。首先,我們探討在不同時間維度下,時間序列分群演算法的問題;再則,我們提出了一高效益項目探勘演算法及時序性高效益項目探勘演算法,解決目前既有演算法之缺點。在時序資料庫中,時間序列分群之分析已廣泛應用於不同之領域,如生物學、醫學、經濟等。在時間序列之分群前,都會利用一些維度降低之演算法。但經過降低維度後,一些子序列(sub sequence)中之訊息可能被忽略。若考慮這些子序列之訊息,可能會得到不一樣之分群結果。因此,我們提出了一個創新的二階分群的演算法,稱為2LTSC(Two-Level Time Series Clustering)。第一階段稱為level-1,進行整個序列之分群;第二階段稱為level-2,進行子序列之分群。在第二階段分群時,亦考慮到子序列之長度不一樣之情況。實驗結果顯示,所提出之演算法在時間序列分群之分析上,提供了不同及更深入之觀點。
隨著經濟之發展,市場獲利需求之增加,已有許多之研究致力於在效益性資料庫中,探勘出高效益項目集。高效益項目集的探勘,乃探討項目集所能得到效益的高低以提供決策者更多之決策支援。高效益項目在現實生活中可應用在許多領域裡,譬如超級市場裡探勘出高獲利項目、股票市場之分析等。然而過去之研究多著重於在交易資料庫中如何減少候選項目集的產生,快速找出高效益項目但是皆未反映出高效益項目數量及獲利之程度,這些訊息在許多應用方面是很重要之決策資訊,例如:庫存及銷售之分析等。此外,如何去定義一獲利之臨界點,讓比預期獲利點只低一點點之項目亦能被探勘出來(此項目亦可被視為潛力之高效益項目),亦非容易之事。本研究亦針對效益性資料庫之高效益探勘提出一新演算法,以解決過去高效益項目探勘演算法之缺點。我們將模糊理論應用在效益探勘之問題中,提出了一個模糊高效益項目探勘演算法FHUI(Fuzzy High Utility Itemsets)-Mine。除了可反應高效益項目購買數量及獲益之程度外,並提供了一模糊的獲益臨界點,使得即使效益只比定義之獲益臨界點低一點點之項目亦能被探勘出來。為了證明此演算法之可行性,實驗結果與一眾所熟知之 Two-Phase 演算法做比較。結果顯示,FTHUI-Mine 提供了較高之探勘能力,不僅能夠探勘出Two-Phase演算法所找出之高效益項目,更能發現具有潛力之高效益項目。
某些資料庫之特性同時具備時序及效益性資料。針對此項特徵,我們也提出一個糢糊時序高效益項目探勘的演算法FTHUI (Fuzzy Temporal High Utility Itemsets)-Mine,將模糊理論應用於時序資料庫,探勘在不同時間周期之高效益項目。此演算法利用分割移動視窗法與交易權重項目集來減少候選項目集的產生,以減少搜尋高效益項目集與資料庫次數所需花費的時間。為了證明此演算法之可行性,我們以IBM資料產生器產生實驗資料,並將結果與 THUI-Mine 演算法做比較。結果顯示,FTHUI-Mine 亦提供了較高之探勘能力,不僅能夠探勘出THUI-Mine演算法所找出之高效益項目,更能發現具有潛力之效益項目。
Data mining is the process of discovering useful information in large databases. Among different types of data mining methods, temporal data mining and utility-based dataset mining have recently become important topics. In this thesis, we propose novel algorithms so as to provide an alternative perspective on temporal mining and utility-based dataset mining. First, we study the problem of time series clustering under different time-granules. Further, we explore the problems of utility mining and temporal utility mining to deal with the major weaknesses of existing algorithms.
On the analysis of temporal databases, clustering analysis has been applied in a wild variety of fields such as biology, medicine, economics, etc. For time series clustering, dimension reduction methods are often applied to reduce data dimension before clustering. Consequently, the information of subsequence may be overlooked. Nevertheless, some properties of time series with the same sampling data may result in different clustering results after considering the subsequence information. In this thesis, we propose a novel two-level clustering method named 2LTSC (Two-Level Time Series Clustering), which considers both the whole time series, denoted as level-1 in the first level, and the subsequence information of time series, denoted as level-2 in the second level. The data length of level-2 could be different and thus is also considered in the second level in the proposed 2LTSC method. Through experimental evaluation, it is shown that 2LTSC, which considers two different time-granules, can provide different and deeper viewpoints for time series clustering analysis.
Utility mining is to find the itemsets in a transaction database with high utility values e. g., profits. Although a number of algorithms on high utility mining have been proposed, they do not reflect the fuzzy degree of purchased quantity and profit level for mined high utility itemsets, which are essential for decision making in various applications like stock control and sales analysis. In this thesis, we explore to apply fuzzy sets theory to the utility mining problem and propose a novel method, namely FHUI (Fuzzy High Utility Itemsets)-Mine, for mining fuzzy high utility itemsets. In addition to reflect the fuzzy degree for quantity and profit regions of high utility itemsets, FHUI-Mine also provides a fuzzy threshold range that may include itemsets with profits slightly less than the designated threshold value. Through experimental evaluation, FHUI-Mine was shown to deliver higher mining capability than Two-Phase algorithm since it can not only mine all high utility itemsets found by Two-Phase algorithm but also discover additionally the itemsets that are potentially high utility ones.
Some datasets consist of both temporal and utility-based properties. For this kind of datasets, we also propose a novel algorithm named FTHUI (Fuzzy Temporal High Utility Itemsets)-Mine to mine high utility itemsets on temporal databases by utilizing fuzzy sets theory. In addition to providing a fuzzy threshold range that may include itemsets with utilities slightly less than the user specified threshold value, FTHUI-Mine also reflects the fuzzy quantity and utility regions of high utility itemsets. To prove the feasibility of FTHUI-Mine, it is compared with a representative method THUI-Mine through experimental evaluation. It was shown that FTHUI-Mine can not only discover all high utility itemsets found by THUI-Mine algorithm but also detects additional potential high utility itemsets. Moreover, FTHUI-Mine can provide the information of fuzzy degrees of quantity and utility for high utility itemsets, which is valuable to various real applications.
[1] A. Banerjee and J. Ghosh, Clickstream Clustering using Weighted Longest Common Subsequences, Workshop on Web Mining of the first SIAM International Conference on Data Mining, Chicago, USA. 2001.
[2] A. Ben-Dor and Z. Yakhini, Clustering Gene Expression Pattern. Technique report of Hewlett-Packard Laboratories Israel HPL-98-190. 1998.
[3] D. J. Berndt and J. Clifford, Using Dynamic Time Warping to Find Patterns in Time Series. Working Notes of the Knowledge Discovery in Databases Workshop, Washington, USA, 359-370, 1994.
[4] C. Bettini, X. S. Wang and S. Jajodia, Testing Complex Temporal Relationships involving Multiple Granularities and its Application to Data Mining. Proceedings of ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, 1996, Montreal, Canada, pp. 68–78. ACM Press, 1996.
[5] K. C. C. Chan and W. H. Au, Mining Fuzzy Association Rules. Conference on Information and Knowledge Management archive Proceeding of the 6th international conference. pp. 209-215, 1997.
[6] R. Chan, Q. Yang and Y. D. Shen, Mining High Utility Itemsets. Proceedings of the Third IEEE International Conference on Data Mining, pp.19, 2003.
[7] C. H. Lee, C. R. Lin and M. S. Chen, On Mining General Temporal Association Rules in A Publication Database. In Proceedings of 2001 IEEE Int. Conf. on Data Mining. 337-344.
[8] H. Cheng, X. Yan and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. In Proceedings of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. pp. 527-532, 2004.
[9] M. Delgado, N. Marin, M.J. Matrin-Bautista and D. Sanchez, M.-A. Vila, Mining Fuzzy Association Rules: An Overview. Soft Computing for Information Processing and Analysis. Vol. 164, pp. 351-373, 2005.
[10] E. Dermatas and G. Kokkinakis, Algorithm for Clustering Continuous Density HMM by Recognition Error. IEEE Transactions on Speech and Audio Processing, vol. 4, no. 3, 231 – 234, 1996.
[11] G. Duan, Y. Suzuki and K. Kawagoe, Grid Representation of Time Series Data for Similarity Search. 22ND International Conference on Data Engineering Workshops(ICDEW’06) p.x123, 2006.
[12] A. Erwin, R. P. Gopalan and N.R. Achuthan, CTU-Mine: An Efficient High Utility Itemset Mining Algorithm Using the Pattern Growth Approach, 7th IEEE International Conference on Computer and Information Technology (CIT 2007). pp. 71-76.
[13] S. M. Focardi, Clustering Economic and Financial Time Series: Exploring the Existence of Stable Correlation Conditions. Discussion paper 2001-04, The Intertek Group 94, rue de Javel F-75015 Paris, 2001.
[14] E. Ghyselsy and R. Valkanov, Linear Time Series Processes with Mixed Data Sampling and MIDAS Regression Models. Social Science Research Network. 2006.
[15] I. Graham and P. L. Jones, Expert systems-Knowledge, Uncertainty and Decision, Chapman and Computing, Boston, pp. 117-158, 1988.
[16] C. Guo, H. Jia and N. Zhang, Time Series Clustering Based on ICA for Stock Data Analysis. Wireless Communications, Networking and Mobile Computing, WiCOM '08. 4th International Conference. 12-14 Oct. 1-4. 2008.
[17] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning; Data mining, inference and prediction. Springer-Verlag: New York. 2001.
[18] R. J. Hathaway and J. C. Bezdek, Visual Cluster Validity for Prototype Generator Clustering Models. Pattern Recognition Letters, vol. 24, nos. 9-10, 1563-1569, 2003.
[19] V. Hautamaki, P. Nykanen and P. Franti, Time-series Clustering by Approximate Prototypes. Pattern Recognition, 19th International Conference. 8-11 Dec. 1-4, 2008.
[20] T. P. Hong and C.S. Kuo and S. C. Chi, Mining Association Rules from Quantitative Data, Intelligent Data analysis, VOl.3, No. 5, pp. 363-376, 1999.
[21] T. P. Hong, C.S. Kuo and S.C. Chi, A Fuzzy Data Mining Algorithm for Quantitative Values. 3rd International Conference on Knowledge-Based Intelligent Information Engineering Systems, pp. 480-483, 1999.
[22] T. P. Hong, C.S. Kuo and S. C. Chi, Trade-off between Computation Time and Number of Rules for Fuzzy Mining from Quantitative Data. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 9(5): 587-604 (2001)
[23] K. Kalpakis, D. Gada and V. Puttagunta, Distance Measures for Effective Clustering of ARIMA Time-Series. Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, 273-280, Nov. 2001.
[24] A. Kandel, Fuzzy Expert Systems, CRC Press, Boca Raton, pp. 8-19, 1992
[25] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons. 1990.
[26] E. Keogh, J. Lin and A. Fu, HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In Proc. of the 5th IEEE International Conference on Data Mining, 226 - 233., Houston, Texas, USA, Nov 27-30, 2005.
[27] E. Keogh, J. Lin and W. Truppel, Clustering of Time Series Subsequences is Meaningless: Implications for Past and Future Research. In Proc. of the 3rd IEEE International Conference on Data Mining, Melbourne, FL, USA, 115–122, 2003.
[28] E. Keogh, S. Lonardi and Y. C. Chiu, Finding Surprising Patterns in a Time Series Database in Linear Time and Space. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada. 550-556, July 23-26, 2002.
[29] E. Keogh and C. A. Ratanamahatana, Exact Indexing of Dynamic Time Warping, Knowledge and Information Systems. 258-386. Springer London. 2004.
[30] E. M. Kornfield, R. Manmatha and J. Allan, Text Alignment with Handwritten Documents. Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04) 195-209, 2004.
[31] J. B. Kruskall and M. Liberman, The Symmetric Time Warping Algorithm: From Continuous to Discrete. Time Warps, String Edits and Macromolecules. Addison-Wesley, 1983.
[32] C. M. Kuok, A. Fu and M. H. Wong, Mining Fuzzy Association Rules in Databases. ACM SIGMOD vol. 27, pp. 41-46, 1998.
[33] C. H. Lee, C. R. Lin and M. S. Chen, Sliding-window Filtering: An efficient Algorithm for Incremental Mining. International Conference on Information and Knowledge Management (CIKM01), pp. 263-270, November 2001.
[34] J. Lin, E. Keogh, S. Lonardi and B. Chiu, A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
[35] J. Lin, E. Keogh, L. Wei and S. Lonardi, Experiencing SAX: a Novel Symbolic Representation of Time Series. Data Mining and Knowledge Discovery Journal, vol. 15, Number 2, pp. 107-144, 2007.
[36] Y. Liu, W. K. Liao and A. Choudhary, A Fast High Utility Itemsets Mining Algorithm. In Proceedings of the Utility-Based Data Mining Workshop, August 2005.
[37] E. H. Mamdani, Applications of Fuzzy Algorithms for Control of Simple Dynamic Plants, IEEE Proceedings, pp. 1585-1588, 1974.
[38] F. Masseglia, P. Poncelet and M. Teisseire, Incremental Mining of Sequential Patterns in Large Databases. Data and Knowledge Engineering. 46:97-121, 2003.
[39] J. Qian, M. Dolled-Filhart, J. Lin, H. Yu and M. Gerstein, synexpression relationships: Local clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions. Journal of Molecular Biology, vol. 314, 1053-1066, 2001.
[40] J. Qiu, T. Sun and Y. Shi, The Absolute Continuity of Fuzzy Complex Measures, ICIC Express Letters, vol.1, no.1, pp.27-32, 2007.
[41] Y. Shou, N. Mamoulis and W. Cheung. Fast and Exact Warping of Time Series Using Adaptive Segmental Approximations. Machine Learning, vol 58, pp. 231–267, 2005.
[42] H. Su, Y. Zhang, F. Zhao and Q. Li, An Ensemble Deterministic Model Based on Rough Set and Fuzzy Set and Bayesian Optimal Classifier, International Journal of Innovative Computing, Information and Control, vol.3, no.4, pp.977-986, 2007.
[43] Vincent S. Tseng, Y. H. Chen, C. H. Chen and J. W. Shin, Mining Fuzzy Association Patterns in Gene Expression Patterns, International Journal of Fuzzy Systems, vol. 8, no. 2, 2006.
[44] Vincent S. Tseng, C. J. Chu and T. Liang, An Efficient Algorithm for Mining Temporal High Utility Itemsets from Data Streams, Journal of Systems and Software, Volume 81, Issue 7, pp. 1105-1117, July 2008.
[45] Vincent S. Tseng, C. J. Chu and T. Liang, Mining Temporal Rare Utility Itemsets in Large Dataset, International Journal of Innovative Computing, Information and Control. Vol. 4, Number 11, November 2008.
[46] Vincent S. Tseng and C. P. Kao, A Novel Similarity-based Fuzzy Clustering Algorithm by Integrating PCM and Mountain Method, IEEE Transactions on Fuzzy Systems, vol. 15, Issue 6, pp.1188-1196, Dec. 2007.
[47] Vincent S. Tseng and C. P. Kao, Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method. IEEE/ACM Transactions on Computational Biology And Bioinformatics, vol. 2, No. 4, 355-365, 2005.
[48] Vincent S. Tseng, K. H. Wang and C. I. Lee, A Preprocessing Method to Deal with Missing Values by Integrating Clustering and Regression Techniques. Applied Artificial Intelligence, vol. 17, no. 5, 535-544, 2003.
[49] M. Vlachos, G.. Kollios and D. Gunopulos, Discovering Similar Multidimensional Trajectories. In proceedings of the 18th International Conference on Data Engineering, San Jose, CA, Feb 26-Mar 1, 2002.
[50] Y. Wang, Fuzzy Clustering Analysis by Using Genetic Algorithm, ICIC Express Letters, vol.2, no.4, pp.331-337, 2008.
[51] T. Wang, Y. Chen and S. Tong, Fuzzy Reasoning Models and Algorithms on Type-2 Fuzzy Sets, International Journal of Innovative Computing, Information and Control, vol.4, no.10, pp.2451-2460, 2008.
[52] J. Wang, Y. Liu, L. Zhou, Y. Shi and X. Zhu, Pushing Frequency Constraint to Utility Mining Model. International Conference on Computational Science (3) 2007. pp.685-692.
[53] R. Weber, Fuzzy-ID3: A Class of Methods for Automatic Knowledge Acquisition, The second international conference of Fuzzy Logic and Neural Networks, Iizuka, Japan, pp.265-268, 1992.
[54] H. Yao, H. J. Hamilton and C. J. Butz, A Foundational Approach to Mining Itemset Utilities from Databases. Proc. of the4th SIAM International Conference on Data Mining, Florida, USA, 2004.
[55] J. S. Yeh, C. Y. Chang and Y. T. Wang, Efficient Algorithms for Incremental Utility Mining. Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, pp. 212-217, 2008
[56] J. S. Yeh, Y. C. Li and C. C. Chang, Two-Phase Algorithms for a Novel Utility-Frequent Mining Model. Lecture Notes in Computer Science - the 2007 International Workshop on High Performance Data Mining and Applications (HPDMA 2007). USA:Springer.
[57] H. J. Zimmermann, Fuzzy Set Theory and Its Application, Kluwer Academic Publisher, Boston, 1991.
[58] L. Zadeh, Fuzzy sets Information and Control. p338-353.
[59]IBM, IBM synthetic data generation code. http://www.almaden.ibm.com/software/quest/Resources/index.shtml
校內:2012-06-08公開