| Author: | 黃于珊 (Huang, Yu-shan) |
|---|---|
| Title: | 多項式簡易貝氏分類器中廣義狄氏先驗分配之參數設定方法 (A Method for Setting the Parameters of Generalized Dirichlet Priors in Multinomial Naive Bayes Model) |
| Advisor: | 翁慈宗 (Wong, Tzu-tsung) |
| Degree: | Master |
| Department: | College of Management, Institute of Information Management |
| Year of publication: | 2009 |
| Graduating academic year: | 97 (ROC calendar, 2008–09) |
| Language: | Chinese |
| Pages: | 67 |
| Keywords: | prior distribution, Dirichlet distribution, generalized Dirichlet distribution, text classification, naive Bayesian classifier |
In data mining, text classification has long been a research area of great interest. Among the many classification algorithms, the naive Bayesian classifier is widely applied because of its high efficiency and low complexity, and when the differing distributions of the distinct words in documents are taken into account, it has been shown that a naive Bayesian classifier based on the multinomial model achieves better classification results than those based on other probability models. This study therefore focuses on the multinomial naive Bayesian classifier. For text classification, the Dirichlet distribution and the generalized Dirichlet distribution are each adopted as the prior distribution of the multinomial naive Bayesian classifier, and the most appropriate parameter combinations are sought. However, the ordering and grouping of the distinct words in documents strongly affect the performance of the algorithm, and the generalized Dirichlet distribution tends to suffer from poor computational efficiency because of its long execution time. This thesis addresses both issues: the algorithm for the generalized Dirichlet distribution is refined to improve computational efficiency, and the ordering and grouping of distinct words are handled so as to reduce the complexity of parameter setting and, it is hoped, to raise classification accuracy effectively.
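The abstract does not reproduce the formulas it relies on, so as a point of reference the standard conjugacy results for the two priors are sketched below in our own notation, where $n_{wc}$ is the count of word $w$ in the training documents of class $c$ and $n_c$ is the total word count of class $c$. This is a summary of textbook results, not the thesis's specific parameter-setting rules.

```latex
% Multinomial naive Bayes decision rule for a document d with word counts f_{w,d}:
\[
c^{*} = \arg\max_{c} \; P(c) \prod_{w=1}^{k+1} \theta_{w \mid c}^{\, f_{w,d}}
\]
% Dirichlet prior with parameters (\alpha_1, \dots, \alpha_{k+1}):
% the posterior mean gives the smoothed word-probability estimate
\[
\hat{\theta}_{w \mid c} = \frac{n_{wc} + \alpha_w}{n_c + \sum_{v} \alpha_v}
\]
% Generalized Dirichlet prior with parameters (\alpha_i, \beta_i), i = 1, \dots, k,
% defined over an ordered word list; posterior parameters and posterior mean:
\[
\alpha_i' = \alpha_i + n_{ic}, \qquad
\beta_i'  = \beta_i + \sum_{j > i} n_{jc}, \qquad
\hat{\theta}_{i \mid c} = \frac{\alpha_i'}{\alpha_i' + \beta_i'}
\prod_{j=1}^{i-1} \frac{\beta_j'}{\alpha_j' + \beta_j'}
\]
```

The dependence of $\hat{\theta}_{i \mid c}$ on all parameters earlier in the ordering is why, as the abstract notes, the ordering and grouping of words matters under a generalized Dirichlet prior but not under a Dirichlet one.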
Finally, this study uses the MDR88 data set for empirical analysis. The experimental results show that, in terms of classification accuracy, adjusting the parameters individually for each class yields better accuracy when the Dirichlet distribution serves as the prior, whereas adjusting the parameters individually for each group within each class gives better results when the generalized Dirichlet distribution serves as the prior; overall, the generalized Dirichlet distribution outperforms the Dirichlet distribution. In terms of efficiency, although the total execution time of the improved generalized Dirichlet algorithm is still slightly longer than that of the Dirichlet distribution, its computational efficiency is greatly improved over the unimproved version.
In the area of data mining, many scholars are interested in research on text classification. Among the various types of classification algorithms, naive Bayesian classifiers are widely used because of their high prediction accuracy and low computational complexity. It has been shown that the naive Bayesian classifier with a multinomial model can achieve competitive classification accuracy with respect to naive Bayesian classifiers with other models. Hence, we use the multinomial naive Bayesian classifier in this study. Both Dirichlet and generalized Dirichlet distributions are assumed as the prior distributions for the naive Bayesian classifier, and methods for setting the parameters of these priors are then proposed. When a prior follows a generalized Dirichlet distribution, computational efficiency and the order of words can become serious problems in the operation of naive Bayesian classifiers. Words are therefore divided into groups, and some properties of the generalized Dirichlet distribution are established to make it applicable for text classification.
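To make the smoothing concrete, here is a minimal sketch in Python of the posterior-mean word-probability estimates under the two priors, assuming the Connor-Mosimann parameterization of the generalized Dirichlet. The function names and NumPy-based interface are ours, not the thesis's, and the thesis's actual parameter-setting and grouping procedure is only summarized above.

```python
import numpy as np

def dirichlet_mean(counts, alpha):
    """Posterior-mean word probabilities for one class under a
    Dirichlet(alpha) prior: (n_w + alpha_w) / (n + sum(alpha))."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

def generalized_dirichlet_mean(counts, alpha, beta):
    """Posterior-mean word probabilities for one class under a
    generalized Dirichlet prior (Connor-Mosimann form).

    `counts` has length k+1 and must follow the word order assumed by
    the prior; `alpha` and `beta` have length k, and the last word acts
    as the implicit remainder category.
    """
    counts = np.asarray(counts, dtype=float)
    k = len(alpha)
    tail = np.cumsum(counts[::-1])[::-1]              # tail[i] = n_i + ... + n_{k+1}
    a = np.asarray(alpha, dtype=float) + counts[:k]   # alpha_i' = alpha_i + n_i
    b = np.asarray(beta, dtype=float) + tail[1:]      # beta_i' = beta_i + sum_{j>i} n_j
    probs = np.empty(k + 1)
    remaining = 1.0            # running product of beta_j' / (alpha_j' + beta_j')
    for i in range(k):
        probs[i] = remaining * a[i] / (a[i] + b[i])
        remaining *= b[i] / (a[i] + b[i])
    probs[k] = remaining       # posterior mean of the remainder word
    return probs

# Toy usage: word counts for one class's training documents.
counts = [5, 3, 2, 0]
print(dirichlet_mean(counts, alpha=[1.0] * 4))                 # Laplace-style smoothing
print(generalized_dirichlet_mean(counts, [1.0] * 3, [1.0] * 3))
```

A classifier would compute these per-class probability vectors once from training counts (with words ordered and, in the thesis, grouped beforehand) and then score each document with the usual sum of count-weighted log-probabilities plus the log class prior.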
The data set MDR88 is chosen for verifying the applicability of the generalized Dirichlet distribution. According to the experimental results, adjusting the parameters individually for each class achieves higher prediction accuracy when the priors follow Dirichlet distributions. When generalized Dirichlet priors are employed, setting the parameters group by group for each class individually is the better approach. Generally speaking, the generalized Dirichlet distribution achieves higher prediction accuracy, and its computational efficiency is greatly improved by dividing words into groups.