
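Graduate student: Chen, Tai-Lin (陳泰霖)
Thesis title: A Study of MapReduce-Based Distributed Monotonic SVM Model (基於MapReduce分散式單調性支援向量機之研究)
Advisor: Li, Sheng-Tun (李昇暾)
Degree: Master
Department: College of Management, Institute of Information Management
Year of publication: 2014
Academic year of graduation: 102
Language: English
Number of pages: 56
Chinese keywords: 支援向量機 (support vector machine), Hadoop, MapReduce, 單調性先驗知識 (monotonic prior knowledge)
English keywords: SVM, Hadoop, MapReduce, Monotonic prior knowledge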
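  • The support vector machine (SVM) is a computationally expensive machine learning algorithm. A traditional SVM classifies data by solving a quadratic program over a matrix, which incurs considerable computational cost. This study therefore uses the Hadoop framework and its MapReduce parallel computing architecture to distribute the SVM algorithm. The proposed method increases computation speed, reduces the memory burden, and makes it more feasible for SVM to handle large volumes of data.
    With the rapid rise of the Internet, cloud computing has matured in recent years. Cloud operating systems integrate infrastructure with high computing capability and overcome earlier limits on data processing. Data is being generated at an ever faster rate, and in some situations processing it on a single host takes too long. Combining cloud computing with machine learning makes it possible to extract more valuable information from large volumes of data.
    This study uses the open-source software Hadoop, which combines the MapReduce computing structure with a file structure that pools the storage capacity of the machines in a cluster. MapReduce automatically allocates the computing resources of the cluster, so developers can concentrate on the data processing itself.
    This study proposes an SVM-based model that takes monotonic data into account, called Monotonic SVM (MCSVM). Prior knowledge is supplied for the data that has a monotonic property in order to improve the model's accuracy. The model requires solving a quadratic program over the entire data matrix, so the monotonic SVM has high complexity and a long solution time. Using MapReduce parallel computation, this study proposes MapReduce MCSVM, which partitions the data to address this high complexity, substantially reduces training time, and makes the monotonic SVM more feasible in practice.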

    The support vector machine (SVM) is a computationally expensive algorithm. A traditional SVM solves the classification problem with quadratic programming, which incurs a high computational cost. To address this problem, this study proposes the use of MapReduce in Hadoop. To increase classification accuracy, we also incorporate monotonic prior knowledge from experts during the training phase.
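For orientation, the quadratic program referred to above can be written in its textbook soft-margin form. This is the standard SVM formulation, not necessarily the exact notation used in the thesis; the symbols $w$, $b$, $\xi_i$, $\alpha_i$, $C$, and $K$ are the usual ones and are assumptions here.

```latex
% Primal soft-margin SVM for training pairs (x_i, y_i), y_i \in \{-1, +1\}, i = 1, \dots, n
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 .

% Dual problem: the objective involves the n x n kernel matrix K(x_i, x_j)
\max_{\alpha} \quad \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 .
```

Because the dual involves an $n \times n$ kernel matrix, memory use grows quadratically with the number of training samples, which is one common reason for splitting the training data across machines.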
    Due to the rapid development of the Internet and storage infrastructure, cloud computing has matured in recent years. Some cloud operating systems integrate the high computation ability of cloud infrastructure to break through past limitations of data processing. Data has been produced at a growing rate in recent years, and in some cases the volume has become too large to be processed by a single machine in reasonable time. By combining cloud computing and machine learning, we can obtain more valuable information from large-scale data.
    This study uses Hadoop, an open-source framework that provides both a MapReduce distributed computing environment and a distributed file system. MapReduce automatically allocates computing resources across the cluster, which allows developers to focus on data processing.
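The division of labor described above can be illustrated with a minimal in-process sketch of the map, shuffle, and reduce phases. It is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `label_mapper`, `count_reducer`) and the toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    emitted = []
    for record in records:
        emitted.extend(mapper(record))
    return emitted


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in grouped.items()}


# The two functions below are the only parts a developer would write.
def label_mapper(record):
    """Emit (class label, 1) for one training record (features, label)."""
    _features, label = record
    yield (label, 1)


def count_reducer(_key, values):
    """Sum the counts emitted for one class label."""
    return sum(values)


# Toy training records: (feature vector, class label). Hypothetical data.
records = [((0.2, 1.0), 1), ((0.5, 0.1), -1), ((0.9, 0.7), 1)]
counts = reduce_phase(shuffle(map_phase(records, label_mapper)), count_reducer)
print(counts)
```

In a real Hadoop job the framework, rather than these helper functions, splits the input, schedules the map and reduce tasks across the cluster, and performs the shuffle; only the mapper and reducer logic is supplied by the developer.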
    This study proposes an SVM model called MCSVM that takes the monotonic property of data into account. Prior knowledge of the monotonic property is given to the model to increase classification accuracy. MCSVM uses quadratic programming over the entire data matrix to find the optimal solution, which results in high complexity and long training time. This study therefore proposes a MapReduce MCSVM that partitions the data and trains in parallel, which significantly reduces the required training time and increases the feasibility of MCSVM in real-world applications.
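To make the idea of a monotonicity constraint concrete, here is a minimal sketch of one common way to express it: for every pair of samples where one dominates the other in all features, a monotone classifier should not assign the dominating sample a lower label. This record does not show how the thesis actually constructs its constraints, so everything below is an assumption for illustration: all features are taken to be monotonically increasing, and the helper names and toy data are hypothetical.

```python
from itertools import combinations


def dominates(a, b):
    """True if a >= b in every feature (componentwise comparison)."""
    return all(x >= y for x, y in zip(a, b))


def monotonic_pairs(points):
    """Return index pairs (i, j) such that points[i] dominates points[j].

    For each pair, a monotone classifier f should satisfy f(x_i) >= f(x_j).
    """
    pairs = []
    for i, j in combinations(range(len(points)), 2):
        if dominates(points[i], points[j]):
            pairs.append((i, j))
        elif dominates(points[j], points[i]):
            pairs.append((j, i))
    return pairs


def violations(labels, pairs):
    """Pairs whose dominating sample carries the smaller label."""
    return [(i, j) for i, j in pairs if labels[i] < labels[j]]


def partition(items, n_parts):
    """Split items round-robin into n_parts lists (a stand-in for data splitting)."""
    return [items[k::n_parts] for k in range(n_parts)]


# Hypothetical toy data: two features, labels in {-1, +1}.
points = [(1, 1), (2, 3), (3, 2), (4, 4)]
labels = [1, 1, -1, 1]

all_pairs = monotonic_pairs(points)
print(len(all_pairs), violations(labels, all_pairs))

# Number of constraint pairs found inside each partition.
per_part = [len(monotonic_pairs(part)) for part in partition(points, 2)]
print(per_part)
```

The number of candidate pairs grows quadratically with the number of samples, so each partition carries far fewer constraints than the full data set. This is one plausible reading of why partitioning shortens training; the sketch does not reproduce the thesis's training or model-merging steps.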

    Abstract (Chinese)
    Abstract
    Acknowledgements
    List of Table
    List of Figure
    Chapter 1 Introduction
        1.1 Background and motivation
        1.2 Objectives of Research
        1.3 Organization of Research
    Chapter 2 Literature Review
        2.1 Cloud Computing
        2.2 Hadoop
            2.2.1 Hadoop Distributed File System (HDFS)
            2.2.2 MapReduce
        2.3 Support vector machine (SVM)
            2.3.1 Construction of SVM
        2.4 Classification with Monotonicity Constraints
    Chapter 3 Research Methodology
        3.1 Concept of Monotonicity
        3.2 Derivate Monotonicity to SVM
        3.3 Constructing Monotonicity Constraints
        3.4 Solve MC-SVM in subSVM
        3.5 MCSVM with MapReduce Framework
            3.5.1 Data preprocessing
            3.5.2 MapReduce training module
            3.5.3 Testing module
    Chapter 4 Experiment and Result analysis
        4.1 Environment of Experiments and Data Collection
            4.1.1 Experiment environment
            4.1.2 Data Collection
        4.2 Experiment step
        4.3 Performance measures
        4.4 Experiment result
    Chapter 5 Conclusions and Future Works
        5.1 Conclusions
        5.2 Recommendations for future works
    Reference


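    Availability: on campus, open from 2024-12-31; off campus, not open.
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.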