
Author: 吳沐穎 (Wu, Mu-Ying)
Thesis title: 簡易貝氏分類器中廣義狄氏先驗分配應用於基因序列資料分類之研究
Generalized Dirichlet Priors for Naïve Bayesian Classifiers with Multinomial Models in Classifying Gene Sequence Data
Advisor: 翁慈宗 (Wong, Tzu-Tsung)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2012
Graduation academic year: 100 (ROC calendar)
Language: Chinese
Number of pages: 54
Keywords: Dirichlet distribution, gene sequence data classification, generalized Dirichlet distribution, naïve Bayesian classifier
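  • Biologists are no longer limited to observing species in laboratory Petri dishes. With the development of metagenomics and related technologies, obtaining species samples from everyday environments is no longer difficult, but the problems that accompany this technology trouble researchers in the field. Although the newer approach gives a more realistic view of the relationships among species and of the communities they depend on, the gene sequence samples it yields are no longer as simple as those obtained by the older approach. New sequencing technology lets scientists collect large numbers of sequence samples from natural environments, but these samples usually mix gene sequence fragments from many species, so identifying the species each fragment originally belongs to has become a new challenge. This study aims to use the fast computation of the naïve Bayesian classifier to help researchers classify gene sequences and determine the source of each sequence fragment. However, gene sequence data have different species classifications under different taxonomic ranks, the number of species to be distinguished is especially large at the lower ranks, and the features that can be extracted from each sequence are not very distinctive. Under these conditions, the study seeks to improve the classification accuracy of the naïve Bayesian classifier through a more suitable way of setting the parameters of the prior distribution. The priors adopted here, the Dirichlet distribution and the generalized Dirichlet distribution, have been shown to be well suited to prior estimation in the naïve Bayesian classifier. Building on the principles of these distributions, the study implements a method suited to classification problems that have many classes and many feature values whose probability of occurrence is small or nearly zero. An empirical study on two data sets shows that these priors do improve classification accuracy, with the larger improvement on the data set whose accuracy was lower before adjustment.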

    Biologists are no longer limited to observing organisms on Petri dishes in the laboratory. They can now easily obtain samples from the natural world by using the technology developed for metagenomics. Although this technology is helpful in studying the relationships among species and the places where they live, samples obtained in this way cannot be analyzed by traditional methods. This research proposes a new operational mechanism for naïve Bayesian classifiers to classify gene sequence data for biologists. Since the number of class values (species) is generally over one hundred, and the number of features extracted from gene sequence data can exceed ten thousand, each feature carries relatively little information for classification. In this case, priors can play an important role in the operation of the naïve Bayesian classifier. This research adopts Dirichlet and generalized Dirichlet distributions, which have been shown to be appropriate priors for the naïve Bayesian classifier, to enhance its prediction accuracy on gene sequence data. The experimental results on two gene sequence data sets demonstrate that priors are indeed helpful in classifying gene sequence instances, and that a significant improvement can be achieved on a data set for which the original prediction accuracy is poor.
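    To make the role of the prior concrete, the following is a minimal Python sketch of a multinomial naïve Bayesian classifier on k-mer counts with a symmetric Dirichlet prior. It is an illustration under stated assumptions, not the thesis's method: the feature definition (overlapping k-mers over the alphabet ACGT), the function names `kmer_counts`, `train`, and `classify`, and the defaults `k=2` and `alpha=1.0` are all hypothetical. The thesis's parameter search and the generalized Dirichlet prior are not shown.

```python
import math
from collections import Counter, defaultdict
from itertools import product


def kmer_counts(seq, k=2):
    """Count overlapping k-mers (length-k substrings) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train(samples, k=2, alpha=1.0):
    """Fit a multinomial naive Bayes model with a symmetric Dirichlet(alpha) prior.

    samples: list of (sequence, label) pairs; alpha must be positive.
    Returns (log_class_prior, log_feature_prob) as dictionaries.
    """
    # Assumed feature space: every k-mer over the four DNA bases.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    class_sizes = Counter(label for _, label in samples)
    counts = defaultdict(Counter)
    for seq, label in samples:
        counts[label].update(kmer_counts(seq, k))

    log_prior = {c: math.log(n / len(samples)) for c, n in class_sizes.items()}
    log_prob = {}
    for c in class_sizes:
        total = sum(counts[c][w] for w in vocab)
        denom = total + alpha * len(vocab)
        # Posterior mean under the Dirichlet prior: (n_w + alpha) / (N + alpha * V).
        # Unseen k-mers get a small nonzero probability instead of zero.
        log_prob[c] = {w: math.log((counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_prob


def classify(seq, model, k=2):
    """Return the class with the highest posterior log score for seq."""
    log_prior, log_prob = model
    feats = kmer_counts(seq, k)
    scores = {}
    for c in log_prior:
        # k-mers outside the assumed vocabulary (e.g. containing N) are skipped.
        scores[c] = log_prior[c] + sum(
            n * log_prob[c][w] for w, n in feats.items() if w in log_prob[c]
        )
    return max(scores, key=scores.get)
```

    Calling `train` on a list of labeled sequences and passing the result to `classify` assigns a new fragment to one class. Varying `alpha` changes how strongly rare or unseen k-mers are smoothed, which is the kind of prior setting the abstract describes as important when each feature carries little information.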

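    Abstract (Chinese)
    Abstract
    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    List of Symbols
    Chapter 1 Introduction
    1.1 Research Background and Motivation
    1.2 Research Objectives
    1.3 Research Process
    Chapter 2 Literature Review
    2.1 Naïve Bayesian Classifier
    2.1.1 Basic Principles of Operation
    2.1.2 Applications of the Naïve Bayesian Classifier
    2.1.2.1 Application of the Naïve Bayesian Classifier to Document Classification
    2.1.2.2 Application of the Naïve Bayesian Classifier to Gene Sequence Classification
    2.2 Smoothing Constants
    2.3 Dirichlet and Generalized Dirichlet Distributions
    2.3.1 Formulas for the Dirichlet Distribution
    2.3.2 Formulas for the Generalized Dirichlet Distribution
    2.3.3 Relationship between the Dirichlet and Generalized Dirichlet Distributions
    Chapter 3 Research Method
    3.1 Gene Sequence Classification Process and Description
    3.2 Preprocessing of Gene Sequence Data
    3.3 Multinomial Model
    3.4 Adjusting and Correcting Prior Distribution Parameters
    3.5 Methods for Finding the Best Prior Distribution Parameters
    3.5.1 Search Method for Dirichlet Distribution Parameters
    3.5.2 Search Method for Generalized Dirichlet Distribution Parameters
    3.6 Validation Method
    Chapter 4 Empirical Study
    4.1 Data Sets
    4.2 Empirical Results for the Dirichlet Distribution
    4.3 Empirical Results for the Generalized Dirichlet Distribution
    4.4 Summary
    Chapter 5 Conclusions and Suggestions
    References
    Appendix 1 Accuracy Variation Table for the Dirichlet Distribution: Bacteria Data Set
    Appendix 2 Accuracy Variation Table for the Dirichlet Distribution: Fungi Data Set
    Appendix 3 Accuracy Variation Table for the Generalized Dirichlet Distribution: Bacteria Data Set
    Appendix 4 Accuracy Variation Table for the Generalized Dirichlet Distribution: Fungi Data Set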


    Full text release date, on campus: 2017-07-10
    Full text release date, off campus: 2017-07-10