Author: Lin, Liang-Sian (林良憲)
Thesis title: Generating Multi-modal Virtual Samples to Assess Product Lifetime Performance for Small Data Sets (生成多峰性虛擬樣本以評估小資料集之產品壽命性能)
Advisor: Li, Der-Chiang (利德江)
Degree: Doctor
Department: College of Management - Department of Industrial and Information Management
Year of publication: 2014
Graduation academic year: 102 (ROC calendar, i.e. 2013-2014)
Language: English
Number of pages: 66
Keywords: Maximal P-Value, Multi-modality attribute, Small data set, Virtual sample generation, Virtual sample size
  • Virtual sample generation is widely used to improve learning performance on small data sets. Estimating the data distribution appropriately plays an important role in this process: methods that assume a simple distribution perform well when the data are in fact simply distributed, but real data may follow a complex distribution. In particular, mixed data sets often have a multi-modal distribution rather than a simple, uni-modal one. To address this problem, this study assumes that the data follow a two-parameter Weibull distribution and proposes the Maximal P-Value method to estimate the two parameters, so that a nonlinear, asymmetric distribution can be constructed for a small data set. This study also proposes a new approach to detect multi-modality in data sets, which avoids inappropriately assuming a uni-modal distribution. The common k-means clustering method is used to find possible clusters, and a Weibull variate is estimated for the samples in each cluster to produce multi-modal virtual samples. The number of virtual samples is determined by a criterion that measures the degree of variation in the error of the Weibull skewness between the original and virtual samples. Simulated data sets and two practical examples are used to show that the Maximal P-Value method is a more appropriate technique for improving the accuracy of distribution estimation with small sample sizes. In addition, six data sets with different training data sizes are used to evaluate the proposed method, with comparisons based on classification accuracy. Finally, a non-parametric test on the experimental results shows that the proposed method achieves better classification performance than the Mega-Trend-Diffusion method.
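
    The pipeline described in the abstract (cluster one attribute with k-means, fit a two-parameter Weibull distribution to each cluster, then draw virtual samples from each fitted distribution) can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the thesis's method: the abstract does not give the details of the Maximal P-Value method, so the sketch substitutes the least-squares (median rank regression) Weibull fit as a stand-in; the number of virtual samples is passed in as a fixed parameter instead of being chosen by the Weibull skewness criterion; and all function names are hypothetical.

```python
import math
import random


def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on a single attribute; returns the non-empty clusters."""
    xs = sorted(values)
    n = len(xs)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return [c for c in clusters if c]


def fit_weibull_lsq(sample):
    """Least-squares Weibull fit using Benard's median ranks.

    Stand-in for the Maximal P-Value method (an assumption of this sketch).
    Needs at least two distinct, strictly positive observations.
    """
    xs = sorted(sample)
    n = len(xs)
    u = [math.log(x) for x in xs]
    # Linearised CDF: ln(-ln(1 - F)) = shape * ln(x) - shape * ln(scale)
    v = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4)))
         for i in range(1, n + 1)]
    u_bar, v_bar = sum(u) / n, sum(v) / n
    sxx = sum((a - u_bar) ** 2 for a in u)
    sxy = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    shape = sxy / sxx
    scale = math.exp(u_bar - v_bar / shape)
    return shape, scale


def weibull_inverse_cdf(u, shape, scale):
    """Inversion method: map a uniform draw u in [0, 1) to a Weibull variate."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def generate_virtual_samples(values, k, n_virtual, seed=0):
    """Draw about n_virtual samples, split across clusters by cluster size."""
    rng = random.Random(seed)
    clusters = kmeans_1d(values, k)
    total = sum(len(c) for c in clusters)
    virtual = []
    for cluster in clusters:
        shape, scale = fit_weibull_lsq(cluster)
        m = round(n_virtual * len(cluster) / total)
        virtual.extend(weibull_inverse_cdf(rng.random(), shape, scale)
                       for _ in range(m))
    return virtual


# Illustrative bimodal attribute (made-up numbers, not data from the thesis).
data = [1.0, 1.2, 1.5, 1.1, 0.9, 1.3, 9.5, 10.2, 10.8, 9.9, 10.5, 11.0]
virtual = generate_virtual_samples(data, k=2, n_virtual=20)
print(len(virtual), min(virtual), max(virtual))
```

    Fitting one Weibull distribution per cluster, rather than one for the whole attribute, is what lets the generated samples reproduce both modes; a single uni-modal fit to the same data would place much of its mass in the empty region between them.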

    ABSTRACT (Chinese)
    ABSTRACT
    ACKNOWLEDGEMENTS
    CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    1. INTRODUCTION
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Purposes
      1.4 Research Structure
    2. LITERATURE REVIEW
      2.1 Related Studies
        2.1.1 Virtual Sample Generation
        2.1.2 The Mega-Trend-Diffusion Method
        2.1.3 Least-squares Estimation for a Weibull Distribution
        2.1.4 The Lifetime Performance Testing Procedure
      2.2 Modality Tests
        2.2.1 The Dip Test
        2.2.2 The Excess Mass Test
      2.3 Related Techniques for Clustering and Classification
        2.3.1 K-means Clustering
        2.3.2 Linear Discriminant Analysis
        2.3.3 K-nearest Neighbors
        2.3.4 Support Vector Machine
    3. METHODOLOGY
      3.1 The Scheme for Virtual Sample Generation
      3.2 The Maximal P-Value Method
      3.3 The Proposed Modality Test
        3.3.1 The Relationship between PDF and CDF
        3.3.2 The Procedure of Modality Test
      3.4 The Decision of Virtual Sample Size
      3.5 Multi-modal Virtual Sample Generation
        3.5.1 Virtual Sample Generation
        3.5.2 The Inversion Method
        3.5.3 K-modality Selection for Attributes
      3.6 The Detailed Steps of the Proposed Method
    4. EXPERIMENTS
      4.1 The Performance of Maximal P-Value Method
        4.1.1 Simulated Data Sets
        4.1.2 Two Types of Real Numerical Data
        4.1.3 Experimental Results
      4.2 The Six Data Sets
      4.3 An Example of the Proposed Method
      4.4 The Experiment Design
      4.5 The Results for the Selection of Classifiers
      4.6 The Results of the Experiment to Compare Methods
      4.7 Summary
    5. CONCLUSIONS AND SUGGESTIONS
      5.1 Conclusions
      5.2 Suggestions
    REFERENCES

    On-campus access: open from 2024-07-07
    Off-campus access: open from 2024-07-07