| Author: | 方耀輝 Fang, Yao-hwei |
|---|---|
| Thesis title: | 以資料複雜度指標建構效率型交互驗證方法 The Data Complexity Index to Construct an Efficient Cross-validation Method |
| Advisor: | 利德江 Li, Der-chiang |
| Degree: | Doctor |
| Department: | College of Management - Department of Industrial and Information Management |
| Publication year: | 2009 |
| Academic year of graduation: | 97 (ROC calendar) |
| Language: | English |
| Pages: | 49 |
| Keywords (Chinese): | 交互驗證, 資料複雜度, 二元分類 |
| Keywords (English): | Binary Classification, Cross-validation, Data Complexity |
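Cross-validation is commonly used for model validation in data mining. In practice, however, several important parameters must be chosen, such as the number of training samples and the number of experiment runs. For binary classification problems, this study develops a new cross-validation scheme called "Complexity-based Efficient (CBE)" cross-validation. CBE cross-validation builds a CBE complexity index that is positively correlated with classification accuracy. We use the CBE index together with the statistical concept of sample size determination to compute the optimal number of training samples and experiment runs, which reduces model validation time for large and complex classification data.
The experimental results show that the CBE index is highly correlated with classification accuracy, that CBE cross-validation is as effective as K-fold cross-validation and Repeated Random Sub-sampling Validation, and that its validation time is shorter than that of both methods. CBE cross-validation not only computes the optimal number of training samples and experiment runs, but also helps reveal the characteristics and structure of the classification data.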
Cross-validation is a widely used model evaluation method in data mining applications. However, it usually takes considerable effort to determine appropriate parameter values, such as the training data size or the number of experiment runs, for a valid evaluation. This research develops an efficient cross-validation method for binary classification problems, called Complexity-based Efficient (CBE) cross-validation. CBE cross-validation establishes a complexity index, the CBE index, which is highly correlated with classification accuracy. The CBE index and sample size determination are used together to calculate the optimal training data size and number of experiment runs, reducing model evaluation time for complex and computationally expensive classification data sets.
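The record does not give the formula the thesis uses, so the following is only a minimal sketch of the textbook sample size determination idea the abstract refers to: choosing the number of experiment runs so that the confidence interval of the mean accuracy has a half-width of at most a chosen margin, n = (z·s/E)². The function name `runs_needed`, the pilot accuracies, and the margin are all hypothetical, and the CBE index itself is not modelled here.

```python
import math
from statistics import NormalDist, stdev


def runs_needed(pilot_accuracies, margin, confidence=0.95):
    """Estimate how many runs are needed so that the mean accuracy's
    confidence interval has a half-width of at most `margin`.

    Uses the normal-approximation formula n = (z * s / E)^2, where s is
    the sample standard deviation of a few pilot runs (an assumption of
    this sketch, not the thesis's stated procedure).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-value
    s = stdev(pilot_accuracies)                          # pilot variability
    return max(1, math.ceil((z * s / margin) ** 2))


# Hypothetical accuracies from five pilot runs.
pilot = [0.81, 0.84, 0.79, 0.86, 0.82]
print(runs_needed(pilot, margin=0.01))
print(runs_needed(pilot, margin=0.02))
```

A tighter margin or noisier pilot runs increases the required number of runs quadratically, which is why fixing the run count in advance can waste evaluation time on easy data sets.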
The experimental results show that the CBE index is highly correlated with classification accuracy, that CBE cross-validation performs similarly to K-fold cross-validation and Repeated Random Sub-sampling Validation, and that it requires less training time than either method. The CBE index helps users understand the characteristics of the analyzed data in advance, and CBE cross-validation helps users find the optimal training data size and number of experiment runs, reducing model evaluation time.
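For readers unfamiliar with the two baselines named above, the sketch below shows how K-fold cross-validation and Repeated Random Sub-sampling Validation differ in how they split a data set. It uses only the Python standard library; the function names, the seed handling, and the example sizes are assumptions made for illustration and are not taken from the thesis.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists for K-fold cross-validation.

    Every sample lands in exactly one test fold, so the number of runs
    is fixed at k and the training size at roughly n * (k - 1) / k.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def random_subsampling_splits(n, train_size, runs, seed=0):
    """Yield (train, test) index lists for repeated random sub-sampling.

    Each run draws a fresh random split, so the training size and the
    number of runs are free parameters the user has to choose.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[:train_size], idx[train_size:]


for train, test in k_fold_splits(10, 5):
    print("k-fold", len(train), len(test))
for train, test in random_subsampling_splits(10, 7, 3):
    print("sub-sampling", len(train), len(test))
```

The two free parameters of the second function, `train_size` and `runs`, are the quantities the abstract says CBE cross-validation aims to set automatically.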