| Author: | 劉超瑞 (Liu, Chao-Rui) |
|---|---|
| Thesis Title: | 應用多項式簡易貝氏分類器於文件分類的推導廣義狄氏分配參數之方法 (Methods for Setting Parameters of Generalized Dirichlet Priors for Multinomial Naïve Bayesian Classifiers in Document Classification) |
| Advisor: | 翁慈宗 (Wong, Tzu-Tsung) |
| Degree: | Master |
| Department: | Institute of Information Management, College of Management |
| Publication Year: | 2013 |
| Graduation Academic Year: | 101 |
| Language: | Chinese |
| Number of Pages: | 59 |
| Keywords (Chinese): | 共變異數矩陣, 文件分類, 廣義狄氏分配, 簡易貝氏分類器, 多項式模型 |
| Keywords (English): | covariance matrix, document classification, generalized Dirichlet distribution, multinomial model, naïve Bayesian classifier |
Because naïve Bayesian classifiers are efficient and easy to use, they have been widely applied in document classification. Generalized Dirichlet priors have been shown to improve the performance of naïve Bayesian classifiers with multinomial models in document classification. To keep the computation manageable, the documents to be classified are preprocessed so that the tens of thousands of distinct words they contain are divided into several groups, and the parameters of the generalized Dirichlet prior are then estimated group by group. However, finding suitable parameters for a generalized Dirichlet distribution requires testing a large number of candidate values before parameters that yield a good classification accuracy are obtained, so the execution time becomes excessively long and the computational efficiency deteriorates. In this thesis, a covariance matrix is first computed from the data for each group of distinct words; a parameter estimation method and four strategies for selecting parameter combinations are then applied, row by row, to the statistics available in each row of the covariance matrix to estimate suitable parameters for that row.
Finally, the experimental results on two data sets show that selecting the largest parameter value from each row achieves a better classification accuracy. The proposed parameter estimation method can effectively identify the parameters of noninformative generalized Dirichlet priors and thereby improve the classification accuracy of the multinomial naïve Bayesian classifier.
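The first step of the procedure described above, computing a covariance matrix for each group of distinct words, can be illustrated with a minimal sketch. The snippet below is not the implementation used in the thesis; it merely assumes that every document is represented by the proportions of a group's words it contains and shows how the sample mean vector and covariance matrix of those proportions could be obtained with NumPy. The toy counts and the names `counts`, `proportions`, and `group_cov` are purely illustrative.

```python
import numpy as np

# Hypothetical term-count matrix for one group of distinct words:
# rows are documents, columns are the words assigned to this group.
counts = np.array([
    [3, 0, 1, 2],
    [0, 4, 2, 0],
    [1, 1, 0, 5],
    [2, 2, 3, 1],
], dtype=float)

# Convert raw counts into within-group word proportions per document.
proportions = counts / counts.sum(axis=1, keepdims=True)

# Sample mean vector and covariance matrix of the proportions; each row
# of group_cov supplies the statistics from which a row-wise prior
# parameter would be derived.
group_mean = proportions.mean(axis=0)
group_cov = np.cov(proportions, rowvar=False)

print(group_mean)
print(group_cov)
```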
Naïve Bayesian classifiers are a popular tool for classifying documents because of their computational efficiency and ease of implementation. Generalized Dirichlet priors have been shown to be an effective way of improving the performance of naïve Bayesian classifiers with multinomial models, called multinomial naïve Bayesian classifiers, in document classification. For the sake of computational efficiency, the distinct words in a document set are divided into groups, and the parameters of a generalized Dirichlet prior are determined group by group. The parameters corresponding to a group are searched for within an interval to find the values that achieve the highest prediction accuracy, and this searching process increases the computational cost of the multinomial naïve Bayesian classifier. In this thesis, the covariance matrices for the groups are first calculated from the available documents. A parameter estimation method and four strategies for choosing the value of the parameter corresponding to a row are then proposed to solve for the parameters of noninformative generalized Dirichlet priors from the covariance matrices. The experimental results on two document sets show that the best strategy is to choose the largest value calculated from a row, and that the proposed parameter estimation method can efficiently solve for the parameters of noninformative generalized Dirichlet priors and significantly improve the prediction accuracy of the multinomial naïve Bayesian classifier.
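For readers unfamiliar with how a generalized Dirichlet prior enters a multinomial naïve Bayesian classifier, the sketch below shows the standard conjugate update of the Connor and Mosimann generalized Dirichlet distribution and the resulting posterior-mean word probabilities that would be plugged into the log-likelihood score of each class. It is only an illustration with assumed placeholder parameters `alpha` and `beta`; the thesis's own procedure for deriving noninformative parameter values from the rows of the covariance matrices is not reproduced here.

```python
import numpy as np

def gd_posterior_mean(alpha, beta, counts):
    """Posterior-mean word probabilities under a generalized Dirichlet
    prior GD(alpha, beta) given multinomial word counts.

    alpha, beta : arrays of length k (there are k + 1 word categories).
    counts      : array of length k + 1 with the observed word counts.
    """
    k = len(alpha)
    # Conjugate update: each stick-breaking Beta(alpha_i, beta_i) in the
    # Connor-Mosimann construction becomes
    # Beta(alpha_i + n_i, beta_i + n_{i+1} + ... + n_{k+1}).
    tail = np.cumsum(counts[::-1])[::-1]      # tail[i] = n_i + ... + n_{k+1}
    a_post = alpha + counts[:k]
    b_post = beta + tail[1:k + 1]
    # Posterior means via the stick-breaking representation.
    z_mean = a_post / (a_post + b_post)       # E[Z_i | data]
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - z_mean)))
    probs = np.empty(k + 1)
    probs[:k] = z_mean * stick_left[:k]
    probs[k] = stick_left[k]                  # probability left for the last word
    return probs

# Placeholder prior parameters and per-class word counts (illustrative only).
alpha = np.array([1.0, 1.0, 1.0])
beta = np.array([3.0, 2.0, 1.0])
class_counts = np.array([5.0, 0.0, 2.0, 1.0])

word_probs = gd_posterior_mean(alpha, beta, class_counts)
# A document with word-count vector x would then be scored for class c as
# log P(c) + x @ np.log(word_probs), and the class with the highest score wins.
print(word_probs, word_probs.sum())
```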