| 研究生 (Graduate Student): | 莊瑋恩 Chuang, Wei-en |
|---|---|
| 論文名稱 (Thesis Title): | 使用多項式簡易貝氏模型進行文件分類之先驗分配參數設定方法 (A Method for Setting the Parameters of Prior Distributions in the Multinomial Naïve Bayes Model for Text Classification) |
| 指導教授 (Advisor): | 翁慈宗 Wong, Tzu-Tsung |
| 學位類別 (Degree): | 碩士 Master |
| 系所名稱 (Department): | 管理學院 (College of Management) - 工業與資訊管理學系 (Department of Industrial and Information Management) |
| 論文出版年 (Year of Publication): | 2008 |
| 畢業學年度 (Academic Year of Graduation): | 96 |
| 語文別 (Language): | 中文 (Chinese) |
| 論文頁數 (Number of Pages): | 47 |
| 中文關鍵詞 (Chinese Keywords): | 簡易貝氏分類器 (naïve Bayes classifier), 先驗分配 (prior distribution), 參數設定 (parameter setting), 狄氏分配 (Dirichlet distribution), 廣義狄氏分配 (generalized Dirichlet distribution) |
| 外文關鍵詞 (English Keywords): | multinomial naïve Bayes classifier, prior distribution, generalized Dirichlet distribution, Dirichlet distribution |
Because of its computational speed, the naïve Bayes classifier has been widely applied to text classification. Since the distribution of words varies across documents, researchers have proposed several models for applying the naïve Bayes classifier, such as the binary independence model, the multinomial model, the Poisson model, and the negative binomial model. Studies have shown, however, that the multinomial model achieves higher classification accuracy than the other models, so this study focuses on setting the parameters of prior distributions for the multinomial model. When the naïve Bayes classifier is used for text classification, the Dirichlet distribution and the generalized Dirichlet distribution serve as its prior distributions; the troublesome aspect of using the generalized Dirichlet distribution is setting its thousands of parameters. This study first sorts the distinct words of each class by occurrence frequency, from high to low, and then sets the parameters group by group, aiming to reduce the complexity of parameter setting while improving classification accuracy and computational efficiency. The MDR88 data set is used in the empirical study. The goals are to achieve higher classification accuracy with the proposed parameter setting methods, to identify which prior distribution yields the highest classification accuracy on the MDR88 data set, and to examine how the parameter adjustment methods and changes in parameter values affect classification accuracy.
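The rank-and-group step described above can be sketched in a few lines. The following Python fragment is a minimal illustration, not the thesis's actual implementation; the function name, the number of groups, and the input layout (`docs_by_class` mapping each class label to its tokenized documents) are assumptions made for the example.

```python
from collections import Counter

def rank_and_group(docs_by_class, n_groups=10):
    """For each class, rank the distinct words by occurrence frequency
    (high to low) and split the ranked list into n_groups groups.
    Returns {class_label: [group_0_words, group_1_words, ...]}."""
    groups_by_class = {}
    for label, docs in docs_by_class.items():
        counts = Counter(word for doc in docs for word in doc)
        ranked = [w for w, _ in counts.most_common()]  # highest frequency first
        size = max(1, -(-len(ranked) // n_groups))     # ceiling division
        groups_by_class[label] = [ranked[i:i + size]
                                  for i in range(0, len(ranked), size)]
    return groups_by_class
```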
The experimental results show that when a parameter adjustment method cannot account for the importance of the documents in each class, it fails to improve classification accuracy. In contrast, adjusting the parameters individually for each class does account for this importance, and the experiments confirm that this adjustment method indeed improves accuracy.
Moreover, each word influences the classes to a different degree, and this influence should be taken into account. This study therefore introduces the generalized Dirichlet distribution and proposes the concept of grouping, adjusting the parameters group by group. This not only improves the efficiency of parameter adjustment but also reflects the differing influence that each group of words has on the classes. The empirical results show that this adjustment method further improves classification accuracy.
The naïve Bayes classifier is a popular technique for text classification because it performs well and has low computational complexity. Because the distribution of words in documents can take various forms, several probabilistic models have been proposed, such as the binary independence model, the multinomial model, the Poisson model, and the negative binomial model. Previous studies have found that the multinomial model usually gives higher classification accuracy than the binary independence model. In this study, we use the multinomial naïve Bayes classifier for text classification and focus on the impact of setting the parameters of prior distributions.
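To make the role of the prior parameters concrete, recall the standard smoothed estimates in the multinomial model. Under a Dirichlet(α₁, …, α_V) prior over the word probabilities of a class, the posterior mean estimate for word w_j in class c, and the resulting classification rule, are as follows (the count notation n_{cj} and f_{dj} is ours, not the thesis's):

```latex
\hat{P}(w_j \mid c) = \frac{n_{cj} + \alpha_j}{\sum_{v=1}^{V} \left( n_{cv} + \alpha_v \right)},
\qquad
P(c \mid d) \propto P(c) \prod_{j=1}^{V} \hat{P}(w_j \mid c)^{f_{dj}},
```

where n_{cj} is the count of word w_j in the training documents of class c and f_{dj} is its frequency in document d. Setting every α_j = 1 recovers Laplace smoothing, so choosing the α_j is exactly the parameter setting problem studied here.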
In the multinomial naïve Bayes model, we assume the prior distribution to be either a Dirichlet or a generalized Dirichlet distribution. Setting the large number of parameters becomes an issue when generalized Dirichlet distributions are used as priors. To reduce the computational complexity and obtain higher accuracy, we separate the parameters into several groups and propose five methods that systematically change the parameters corresponding to each group. We use the data set MDR88 in our analysis.
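A brief sketch of where the large number of parameters comes from, using the standard Connor-Mosimann form of the generalized Dirichlet distribution: for a vocabulary of V words it has 2(V − 1) parameters (α_i, β_i), two for every word but the last, and it is conjugate to the multinomial likelihood. Writing n₁, …, n_V for the word counts of a class (our notation), the posterior update and the posterior mean of the i-th word probability θ_i are:

```latex
\alpha_i' = \alpha_i + n_i,
\qquad
\beta_i' = \beta_i + \sum_{v=i+1}^{V} n_v,
\qquad i = 1, \dots, V-1,

E[\theta_i \mid n_1, \dots, n_V]
  = \frac{\alpha_i'}{\alpha_i' + \beta_i'}
    \prod_{v=1}^{i-1} \frac{\beta_v'}{\alpha_v' + \beta_v'}.
```

With thousands of distinct words per class, setting the α_i and β_i one by one is impractical, which motivates adjusting them group by group.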
The experimental results show that the concurrent prior setting method cannot achieve better classification accuracy because it ignores the influence of the documents in each class. In contrast, the individual prior setting method, which takes this influence into account, does improve the accuracy.
Since every word may play a different role in each class, it is improper to adjust all parameters of a prior concurrently. We relax this restriction by using generalized Dirichlet distributions as priors and by separating the parameters into groups. The experimental results show that individual prior setting by group achieves higher classification accuracy.
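The five adjustment methods are not spelled out in the abstract, so the following Python fragment is only a hypothetical sketch of the general idea of group-wise, per-class tuning: scale the prior parameters of one group at a time within one class and keep the change only when validation accuracy improves. All names (`priors`, `evaluate_accuracy`, `scales`) are illustrative assumptions, not the thesis's interface.

```python
def tune_group_priors(priors, groups_by_class, evaluate_accuracy,
                      scales=(0.5, 2.0, 4.0)):
    """Hypothetical group-wise greedy search.  `priors[label][word]` holds
    the prior parameter of `word` in class `label`; `evaluate_accuracy(priors)`
    returns classification accuracy on a validation set."""
    best_acc = evaluate_accuracy(priors)
    for label, groups in groups_by_class.items():
        for group in groups:
            best_values = {w: priors[label][w] for w in group}
            for s in scales:
                for w in group:                      # scale the whole group at once
                    priors[label][w] = best_values[w] * s
                acc = evaluate_accuracy(priors)
                if acc > best_acc:                   # keep an improving change
                    best_acc = acc
                    best_values = {w: priors[label][w] for w in group}
                else:                                # revert to best-so-far values
                    for w in group:
                        priors[label][w] = best_values[w]
    return priors, best_acc
```

A greedy accept-if-better loop like this is only one plausible search strategy; the thesis's five methods may differ in how the group parameters are changed.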