Graduate student: | 張智傑 Chang, Chih-Chieh |
---|---|
Thesis title: | 透過資料分群技術進行屬性延伸提升小樣本預測能力 Using Data Clustering Techniques to Extend Attributes for Small Data Set Predictions |
Advisor: | 利德江 Li, Der-Chiang |
Degree: | Doctor |
Department: | College of Management - Department of Industrial and Information Management |
Year of publication: | 2012 |
Academic year of graduation: | 100 |
Language: | English |
Number of pages: | 66 |
Keywords (Chinese): | 小樣本, 密度空間分群法, K-means分群法, 屬性延伸 |
Keywords (English): | Small data set, DBSCAN, K-means, Attribute extension |
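Small-sample problems have been discussed continually in recent years, in settings such as rare diseases and the testing of new products, and how to obtain additional information when only a small sample is available has become an important research topic. The difficulty of small-sample problems is that general statistical theory cannot be used for estimation, which makes both classification and prediction considerably harder. Because the small-sample data that can be collected are so valuable, additional information has to be extracted from the few records available. This study proposes extending attributes by means of clustering, since clustering can reveal the structure that exists within the data. The method has two steps. The first step applies a clustering technique to partition the original data into clusters. The second step applies the mega-trend-diffusion function to the resulting clusters to build a membership function for each cluster. The membership value of each original data point under these membership functions is then computed, and these membership values become the new attributes. Combining the original attributes with the newly generated ones yields a new data set. Finally, common prediction models (regression, support vector regression, and backpropagation neural networks) are used for validation, comparing the prediction ability of the original data with that of the attribute-extended data. This study uses density-based spatial clustering (DBSCAN) and K-means for clustering, and validates the approach experimentally on four cases. The results show that, with DBSCAN clustering, the proposed method effectively reduces both the prediction error and the standard deviation, and effectively improves prediction ability for small samples.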
Small data set problems have been widely considered in many fields, where increasing prediction ability is the most important goal. This study considers the data structure in order to identify new data points more precisely, and is thus able to achieve improved prediction capability. The proposed method consists of two steps. The first step uses clustering techniques to separate a data set into clusters. The second step builds the data attribute extension function, in which fuzzy membership functions are built for each cluster and the corresponding membership grades of the data become the new attributes. This study applies density-based spatial clustering of applications with noise (DBSCAN) and K-means as clustering techniques. Four real cases are selected to compare the proposed forecasting model with the linear regression (LR), backpropagation neural network (BPNN), and support vector machine for regression (SVR) methods. The results show that the proposed method with DBSCAN clustering performs better than using the raw data with regard to the error improving rate, mean square error (MSE), and standard deviation (STD).
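The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several choices are assumptions made only for illustration: a plain K-means with farthest-first initialization stands in for the clustering step, a triangular membership function whose support is widened by a hypothetical `margin` stands in for the trend-diffusion bounds (which the abstract does not spell out), and one new attribute is generated per (cluster, original attribute) pair. The function names are likewise hypothetical.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the old center for an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def triangular_membership(values, margin=0.5):
    """Triangular membership function for one cluster on one attribute.

    Assumption: the peak is the cluster mean and the support is [min, max]
    widened by `margin` times the range on each side.
    """
    lo, hi = min(values), max(values)
    peak = sum(values) / len(values)
    spread = (hi - lo) or 1.0  # avoid a zero-width support for one-point clusters
    left, right = lo - margin * spread, hi + margin * spread

    def mu(x):
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)

    return mu

def extend_attributes(points, labels):
    """Append one membership grade per (cluster, attribute) pair to every point."""
    funcs = []
    for c in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == c]
        for j in range(len(points[0])):
            funcs.append((j, triangular_membership([m[j] for m in members])))
    return [tuple(p) + tuple(mu(p[j]) for j, mu in funcs) for p in points]

# Toy data: two well-separated groups with two attributes each.
data = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1), (5.0, 6.0), (5.2, 5.9), (4.8, 6.2)]
labels = kmeans(data, k=2)
extended = extend_attributes(data, labels)
for row in extended:
    print([round(v, 3) for v in row])
```

Each extended row keeps the two original attributes and gains four membership grades. The extended rows could then be passed to any regression model in place of the raw rows to compare prediction error, which is the comparison the abstract describes.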