| 研究生: |
王品鈞 Wang, Ping-Chung |
|---|---|
| 論文名稱: |
針對混合資料型態的k-means分群演算法 A k-means-based clustering algorithm for datasets with mixed attributes |
| 指導教授: |
黃宇翔
Huang, Yu-Hsiang |
| 學位類別: |
碩士 Master |
| 系所名稱: |
管理學院 - 資訊管理研究所 Institute of Information Management |
| 論文出版年: | 2010 |
| 畢業學年度: | 98 |
| 語文別: | 中文 |
| 論文頁數: | 55 |
| 中文關鍵詞: | 叢集分析 、k-均值分群 、順序屬性 、相似測度 |
| 外文關鍵詞: | Clustering, k-means, Ordinal Attribute, Similarity |
| 相關次數: | 點閱:124 下載:6 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
叢集分析為資料探勘問題的技術之一,在目前的網路環境快速發展下,資料屬性的數目與種類也大量增加,導致傳統分群技術執行的效能大幅降低。因為分割式分群的執行效率相較於其他分群法為高且最為簡單如k-means,因此相關方面的研究也最多,不過由於資料屬性種類繁多,大多研究都只針對數值與類別屬性同時存在的情況,或是單純對順序屬性著墨,但往往資料分群的結果不是正確性不佳,就是分群動作所需耗費的時間成本太高,在資料分群效能以及效率上仍有待加強。本研究架構以k-means為基礎,針對三種屬性同時存在的資料集做分析,並以各自不同的相似度定義作為分群考量,類別屬性以共同發生值為距離考量,以及順序屬性以媒合度的觀點來計算之,並以群內距離最小,群間差異最大的觀念來訂定分群的衡量指標,把三種屬性各視為一個區塊,對群心做各種排列組合,將所有組合都做分群動作,最後將衡量指標最好的群心組合當做下一次分群的初始群心,直到衡量指標變動量小於限定值或不變化則停止。並且提出相對應的基因演算法,本研究考慮三種屬性同時存在的資料進行分群,將順序屬性從數值屬性獨立出來,並由此得知將基因演算法的確對叢集分析效能有所幫助。
Clustering is one of the most important technology in data mining. With the fast development of networks, large numbers of data attributes and items reduce the efficiency of data processing substantially. Among various clustering technologies, partitional clustering is easier to implement and can be employed faster than other methods of clustering. However, different types of data attributes can make clustering complicated. Most of literature concentrates on numerical and categorical, or only ordinal attributes. But the results seem not very well. In terms of accuracy or execution time. The proposed model, based on k-means, has a major objective to deal with the three attributes’ numeric, categorical and ordinal simultaneously. Numerical similarity is counted with Euclidean distance, categorical similarity is derived by the times of each value’s rank, and ordinal similarity is defined by the match degree. The effectiveness of the proposed approach is evaluated by the use of essential concept of clustering which is to minimum the ratio of the within cluster errors to the between cluster errors. A genetic algorithm is also proposed for reducing the execution time, to deal with the three types of attributes at the same time.
Ahmad, A. and Dey, L. (2007). "A k-mean clustering algorithm for mixed numeric and categorical data." Data and Knowledge Engineering 63: 503-527.
Ahmad, A. and Dey, L. (2007). "A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set." Pattern Recognition Letters 28: 110-118.
Ben-Hur,A., Elisseeff, A. and Guyon, I. (2002). "A Stability Based Method for Discovering Structure in clustered data." Pacific Symposium on Biocomputing. 7: 6-17.
Treshansky, A. and McGraw, RM (2001) "An overview of clustering algorithms" Proceedings of SPIE 4367, 41 p.41-51.
Aggarwal, CC. and Wolf, JL , Yu, PS. , Procopiuc, C. and Park, JS. (1999). "Fast Algorithms for Projected Clustering." Proceedings of the 1999 ACM SIGMOD international conference
Li, C. , Chang, E. , Garcia-Molina, H. and Wiederhold, G. (2002). "Clustering for Approximate Similarity Search in High-Dimensional Spaces." IEEE Transactions on Knowledge and Data Engineering 14(4): 792-808.
Ding, CHQ. , He, X. , Zha, H. , Gu, M. and Simon, HD. (2001). "A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering." Proceedings IEEE International Conference: 107-114.
Hsu, CC. , Chen, CL. and Su, YW. (2007). "Hierarchical clustering of mixed data based on distance hierarchy." Information Sciences 177: 4474-4492.
Domingos, P. (2007). "Toward Knowledge-rich Data mining." Data Mining and Knowledge Discovery 15: 21-28.
Chan, EY. , Ching, WK. , Ng, MK. and Huang, JZ. (2004). "An Optimization Algorithm for Clustering Using Weighted Dissimilarity Measures." Pattern Recognition 37: 943-952.
Gokcay, E. and Principe, JC. (2000). "A New Clustering Evaluation Function Using Renyi’s Information Potential." IEEE International Conference:
Wang, H. , Wang, W. , Yang, J. and Yu, PS. (2002). "Clustering by Pattern Similarity in Large Data Sets." ACM SIGMOD international conference on Management of data: 394-405.
Dunn, JC. (1973). "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters." Journal of Cybernetics 3: 32-57.
He, J. , Lan, M. , Tan, CL. , Sung, SY. and Low, HB. (2004). "Initialization of Cluster Refinement Algorithms: Areview and Comparative Study." IEEE International Joint Conference on Neural Networks
Lee, JWT. , Yeung, DS. and Tsang, ECC. (2005). "Hierarchical clustering based on ordinal consistency." Pattern Recognition 38: 1913-1925.
Huang, Z. (1998). "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values." Data Mining and Knowledge Discovery 2: 283-304.
Gowda, KC. and Diday, E. (1992). "Symbolic Clustering Using a New Similarity Measure." IEEE transactions on Systems
Gowda, KC. and Ravi, TV. (1995). "Agglomerative clustering of symbolic objects using the concepts of both similarity and dissimilarity." Pattern Recognition Letters.
Wang, K. , Xu, C. and Liu, B. (1999). "Clustering Transactions Using Large Items." Conference on Information and Knowledge Management: 483-490.
Hinneburg, A. and Keim, DA. (1998). "An Efficient Approach to Clustering in Large Multimedia Databases with Noise." Knowledge Discovery and Data Mining: 58-65.
Wagsta, K. , Cardie, C. and Schroedl, S. (2001). "Constrained K-means Clustering with Background Knowledge. " Proceeding of the 18 International Conference on Machine Learning: 577-584.
Parsons, L. , Haque, E. and Liu, H. (2004). "Subspace Clustering for High Dimensional Data : A Review." ACM SIGKDD Explorations Newsletter 6(1): 90-105.
Dale, MB. , Anand, M. and Desrochers, RE. (2007). " Measuring information-based complexity across scales using cluster analysis." Ecological Informatics 2: 121-127.
Ester, M. , Kriegel, HP. , Sander, J. and Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." 2nd Int. Conf. on Knowledge Discovery and Data Mining 226-231.
MacQueen, JB. (1967). "Some methods for classification and analysis of multivariate observations" Symposium on Mathematics and Probability, 5th, Berkely, vol. 1, University of California Press, Berkeley, CA (1967), pp. 281–297.
Lee, M. and Brouwer, RK. (2007). "Fuzzy Clustering and Mapping of Ordinal Values to Numerical." Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence: 538-543.
Ankerst, M. , Breunig, MM. , Kriegel, HP. and Sander, J. (1999). "OPTICS: Ordering Points To Identify the Clustering Structure." ACM SIGMOD International Conference Management of Data: 49-60.
Kim, M. and Ramakrishna, RS. (2005). "New Indices for Cluster Validity Assessment." Pattern Recognition Letters 26: 2353-2363.
Zaki, MJ. , Peters, M. , Assent, I. and Seidl, T. (2007). "CLICKS: An effective algorithm for mining subspace clusters in categorical datasets" Data and Knowledge Engineering 60: 51-70.
Bolshakova, N. , Azuaje, F. and Cunningham, P. (2005). "An integrated tool for microarray data clustering and cluster validity assessment." Bioinformatics, vol.21 issue:4: p451-455
Huang, Z. and Ng, MK. (1999). "A Fuzzy k-Modes Algorithm for Clustering Categorical Data." IEEE transactions on fuzzy systems 7(4): 446-452.
Corsini, P. , Lazzerini, B.and Marcelloni, F. (2006). "Combining Supervised and Unsupervised Learning for Data Clustering." Neural Computation and Application 15: 289-297.
Bradley, PS. and Fayyad, UM. (1998). "Refining Initial Points for K-means Clustering." Proceeding of 15th International Conf. on Machine Learning (ICML98), Morgan Kaufmann, San Francisco.
Piatetsky-Shapiro, G. (2007). "Data Mining and Knowledge Discovery 1996 to 2005: Overcoming the Hype and Moving from “university” to “business” and ”analytics”." Data Mining and Knowledge Discovery 15: 99-105.
Ralambondrainy, H. (1995). "A conceptual version of the K-means algoritm." Pattern Recognition Letters vol.16, issue11, 1147-1157.
Rand, WM. (1971). "Objective Criteria for the Evaluation of Clustering Methods." Journal of the American Statistical Association 66(336): 846-850.
Roubens, M. (2003). "Multiple Criteria Choice, Ranking, and Sorting in the Presence of Ordinal Data and Interactive Points of View." Lecture Notes in Computer Science 2715: 30-38.
Gelbard, R. , Goldman, O. and Spiegler, I. (2007). " Investigating diversity of clustering methods: An empirical comparison." Data and Knowledge Engineering 63: 155-166.
Ong, SH. and Zhao, X. (2000). "On Post-Clustering Evaluation and Modification." Pattern Recognition Letters 21: 365-373.
Sato-Ilic, M. (1998). "Dynamic Clustering Model for Ordinal Similarity. " Fuzzy Information Processing Society-NAFIPS, 1998 Conference of the North American, p91-95.
Sbai, EH. (2001). "Cluster analysis by adaptive rank-order filters." Pattern Recognition 34: 2015-2027.
Zhang, S. and Zaki, MJ. (2006). "Mining Multiple Data Sources : Local Pattern Analysis." Data Mining and Knowledge Discovery 12: 121-125.
Guha, S. , Rastogi, R. and Shim, K. (2000). "ROCK: A robust clustering algorithm for categorical attributes" Information Systems 25(5): 345-366.
Guha, S. , Rastogi, R. and Shim, K. (2001). "CURE: An efficient clustering algorithm for large databases" Information Systems 26(1): 35-58.
Kanungo, T. , Mount, DM. , Netanyahu, NS. , Piatko, CD. , Silverman, R. and Wu, A.Y. (2002). "An Efficient k-Means Clustering Algorithm : Analysis and Implementation." IEEE Transaction on Pattern Analysis and Machine Intelligence 24(27): 881-892.
Maulik, U. and Bandyopadhyay, S. (2002). "Performance Evaluation of Some Clustering Algorithms and Validity Indices." IEEE Transactions o Pattern Analysis and Machine Intelligence. 24: 1650-1654.
Estivill-Castro, V. and Yang, J. (2004). "Fast and Robust General Purpose Clustering Algorithms." Data Mining and Knowledge Discovery 8: 127-150.
Robnik-Šikonja, M. and Vanhoof, K. (2007). "Evaluation of ordinal attributes at value level." Data Mining and knowledge Discovery 14: 225-243.
Frawley, WJ. , Piatetsky-Shapiro, G. and Matheus, CJ. (1992). "Knowledge discovery in databases: An overview." AI Magazine: 57-70.
Xiong, X. , Chan, KL. and Tan, KL. (2004). "Similarity-Driven Cluster Merging Method for Unsupervised Fuzzy Clustering." ACM International Conference Proceeding Series. 70: 611-618.
Sun, Y. , Zhu, Q. and Chen, Z. (2002). "An Iterative Initial-points Refinement Algorithm for Categorical Data Clustering." Pattern Recognition Letters 23: 875-884.
He, Z. , Deng, S. and Xu, X. (2005). "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode." Lecture notes in computer science 3801: 157-162.
He, Z. , Xu, X. and Deng, S. (2005). "Clustering Mixed Numeric and Categorical Data : A Cluster Ensemble Approach." Arxiv preprint cs.