
Graduate student: 蔡幸辰 (Tsai, Hsing-Chen)
Thesis title: Generalized Dirichlet Priors for Multinomial Naïve Bayesian Classifier on High-Dimensional Imbalanced Data Sets
Original title: 具廣義狄氏先驗分配之多項式簡易貝氏分類器於高維度不平衡資料集之研究
Advisor: 翁慈宗 (Wong, Tzu-Tsung)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: Chinese
Number of pages: 84
Keywords: multinomial naïve Bayesian classifier, class imbalance, high dimensionality, generalized Dirichlet distribution
  • The multinomial naïve Bayesian classifier (MNB) is widely used in many fields because it is simple, computationally efficient, and classifies well. These strengths, however, hold only when the classes are roughly balanced. On an imbalanced data set the positive (minority) class has few instances, so the probability estimates for that class value can be unreliable, and the mechanism of MNB leads to worse predictions than other classification algorithms. When imbalance is accompanied by high dimensionality, the class imbalance problem becomes more severe; the resulting high-dimensional imbalanced data sets are hard to handle, and classifiers become both less accurate and less efficient. Most earlier work addresses class imbalance at the data level, typically by sampling to change the class distribution. Such methods may discard potentially useful information, alter the distribution of the original data, or increase the computational burden of the algorithm, and can thereby introduce unexpected errors. This study therefore takes an algorithm-level approach to improving the performance of MNB on high-dimensional imbalanced data sets. Generalized Dirichlet priors are added to the classifier, with a different parameter-setting method for each class (parameter estimation for the majority class and parameter search for the minority class), and the learning mechanism is modified by removing the class value probability in order to weaken the bias toward the majority class. Feature grouping is used to speed up parameter setting. The experimental results show that the parameter-setting methods and the removal of the class value probability greatly improve the overall classification performance of MNB. The proposed classifier takes longer to run, but feature grouping keeps the computation time within a reasonable range. Its recall and precision are good on every data set and close to each other in value. In experiments on seven high-dimensional imbalanced data sets, the proposed classifier achieves the best F-measure on six of them and the highest average F-measure, showing better and more stable performance than MNB, RIPPER, and Random Forest.
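    For orientation, the baseline that the abstract modifies can be sketched in a few lines of Python. This is a minimal sketch of a standard multinomial naïve Bayes classifier with a symmetric Dirichlet prior (additive smoothing), not the generalized Dirichlet method of the thesis, whose parameter-setting procedure is not given in this record. The function names, the toy data, and the `use_class_prior` flag are all assumptions made for illustration; in particular, reading "removing the class value probability" as dropping the class prior term from the score is an interpretation, not something the abstract spells out.

```python
import math


def train_mnb(docs, labels, vocab_size, alpha=1.0):
    """Fit multinomial naive Bayes with a symmetric Dirichlet(alpha) prior.

    docs: list of {term_index: term_frequency} dicts (hypothetical format).
    Returns {class: (log_class_prior, [log term probabilities])}.
    """
    model = {}
    n_docs = len(labels)
    for c in sorted(set(labels)):
        term_counts = [0.0] * vocab_size
        n_c = 0
        for doc, y in zip(docs, labels):
            if y != c:
                continue
            n_c += 1
            for term, tf in doc.items():
                term_counts[term] += tf
        total = sum(term_counts)
        # Posterior mean of the term probabilities under the Dirichlet prior.
        log_theta = [
            math.log((cnt + alpha) / (total + alpha * vocab_size))
            for cnt in term_counts
        ]
        model[c] = (math.log(n_c / n_docs), log_theta)
    return model


def predict(model, doc, use_class_prior=True):
    """Return the class with the highest log score for one document.

    use_class_prior=False drops the class prior term (an assumed reading
    of the abstract's "removing the class value probability").
    """
    scores = {}
    for c, (log_prior, log_theta) in model.items():
        score = log_prior if use_class_prior else 0.0
        score += sum(tf * log_theta[term] for term, tf in doc.items())
        scores[c] = score
    return max(scores, key=scores.get)


def f_measure(y_true, y_pred, positive=1):
    """F-measure (harmonic mean of precision and recall) for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy imbalanced data: four majority (0) documents, one minority (1) document.
toy_docs = [{0: 3, 1: 1}, {0: 2, 1: 2}, {0: 4}, {0: 1, 1: 1, 2: 1}, {1: 1, 2: 2}]
toy_labels = [0, 0, 0, 0, 1]
toy_model = train_mnb(toy_docs, toy_labels, vocab_size=3)
```

    On this toy data, `predict(toy_model, {1: 1})` returns the majority class 0, while `predict(toy_model, {1: 1}, use_class_prior=False)` returns the minority class 1. The term evidence slightly favors the minority class, and the 4:1 class prior is what overrides it, which is the kind of majority bias the abstract sets out to weaken.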

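    Table of contents

    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research background and motivation
      1.2 Research objectives
      1.3 Research process
    Chapter 2 Literature review
      2.1 Naïve Bayesian classifier
        2.1.1 How the naïve Bayesian classifier works
        2.1.2 Probability models of the naïve Bayesian classifier
        2.1.3 Classification performance of the naïve Bayesian classifier
      2.2 High-dimensional imbalanced data sets
      2.3 Methods for improving classification performance on imbalanced data sets
        2.3.1 Data level
        2.3.2 Algorithm level
        2.3.3 Applications to high-dimensional imbalanced data sets
      2.4 Prior distributions
        2.4.1 Dirichlet distribution
        2.4.2 Generalized Dirichlet distribution
      2.5 Evaluation measures
      2.6 Summary
    Chapter 3 Research method
      3.1 Data preprocessing
      3.2 Parameter setting for the generalized Dirichlet distribution
        3.2.1 Majority class: parameter estimation for the generalized Dirichlet distribution
        3.2.2 Minority class: parameter search for the generalized Dirichlet distribution
      3.3 Application of the multinomial naïve Bayesian classifier
      3.4 Evaluation of experimental results
    Chapter 4 Empirical study
      4.1 Data sets
      4.2 Empirical results
        4.2.1 Effect of the number of groups on classification results
        4.2.2 Parameter-setting methods
        4.2.3 Method of removing the class value probability
      4.3 Comparison of performance and time
      4.4 Summary
    Chapter 5 Conclusions and suggestions
      5.1 Conclusions
      5.2 Future research and development
    References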


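    Full-text availability: on campus, open from 2024-05-25; off campus, not available.
    The electronic thesis has not been authorized for public release; for the print copy, consult the library catalog.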