Graduate student: | 陳泓宇 Chen, Hung-Yu |
---|---|
Thesis title: | 結合階層式概念與特徵選取技術於基因序列分類之研究 A Classification Method with Taxonomy and Feature Selection for Gene Sequence Data |
Advisor: | 翁慈宗 Weng, Tzu-Tsung |
Degree: | 碩士 Master |
Department: | 管理學院 - 資訊管理研究所 Institute of Information Management |
Year of publication: | 2012 |
Academic year of graduation: | 100 |
Language: | Chinese |
Number of pages: | 52 |
Keywords (Chinese): | 多源基因體學, 階層式架構, 特徵選取, 簡易貝氏分類器 |
Keywords (English): | Feature selection, metagenomics, naïve Bayesian classifier, taxonomy |
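To understand what an ecological environment looks like, microbial research has traditionally relied on laboratory cultures. However, the number of species that can be cultured in a laboratory is quite limited, roughly one percent of those found in the natural environment. To further understand ecological diversity, metagenomics research has flourished: researchers can sample and sequence directly from the environment and then use classification techniques to reorganize the sequence data into an ecological environment database. During classification, the sequence data have many class values, and genomic signatures produce high-dimensional features, so improving computational efficiency and raising classification accuracy at the levels with many classes are issues worth investigating. This study combines the hierarchical (taxonomy) concept with feature selection to address them. The hierarchical structure establishes the membership relations between each class and the classes at the levels above and below it, which helps exclude unnecessary class values, while feature selection turns a high-dimensional feature problem into a low-dimensional one. By combining the two, this study proposes an adjusted method for classifying gene sequences, which we tested on two gene sequence data sets. In terms of classification accuracy, the proposed method improved accuracy at every level compared with the method that uses the hierarchical structure alone, but it remained slightly behind the method that uses feature selection alone at the family and genus levels. In terms of computational efficiency, because the proposed method considers fewer class values and fewer features, it classifies considerably faster than the other methods.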
Biologists have traditionally studied the microbes in an ecological environment by culturing them in laboratories. However, fewer than one percent of all microbes can be cultured this way. Metagenomics, now a popular topic in biology, lets researchers extract and sequence samples directly from the environment. Classification techniques then reorganize the gene sequence data to build databases of ecological environments. Because gene sequence data have large numbers of both class values and features, methods that improve the accuracy and computational efficiency of classifying such data are worth developing. In this study, we combine the taxonomy concept with feature selection for this purpose. Because a taxonomy records the relations between class values at different levels, it can be used to exclude unnecessary class values during class prediction, while feature selection reduces the dimensionality of a data set. Experimental results on two gene sequence data sets show that our classification method outperforms the taxonomy-only method in both prediction accuracy and computational efficiency. Compared with the feature-selection-only method, our method greatly improves the computational efficiency of the naïve Bayesian classifier, although its prediction accuracy is slightly lower at the family and genus levels.
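The combination described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the 2-mer signature, the toy sequences, the labels `F1`, `F2`, `G1a`, `G1b`, `G2a`, the number of kept features, and the "spread" feature score are all assumptions chosen for illustration. The sketch only shows how a taxonomy can restrict the candidate classes at a lower level while feature selection restricts the k-mers that a naïve Bayesian classifier scores.

```python
import math
from collections import Counter, defaultdict
from itertools import product

K = 2  # k-mer length used as the genomic signature (assumed for this toy example)

def kmer_counts(seq):
    """Count overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + K] for i in range(len(seq) - K + 1))

def train(samples):
    """samples: iterable of (sequence, label) pairs for one taxonomic level."""
    kmer_totals, class_counts = defaultdict(Counter), Counter()
    for seq, label in samples:
        kmer_totals[label].update(kmer_counts(seq))
        class_counts[label] += 1
    return kmer_totals, class_counts

def select_features(kmer_totals, m):
    """Keep the m k-mers whose relative frequency differs most across classes.
    This spread score is a simple stand-in, not the thesis's selection criterion."""
    vocab = ["".join(p) for p in product("ACGT", repeat=K)]
    def spread(kmer):
        freqs = [c[kmer] / max(sum(c.values()), 1) for c in kmer_totals.values()]
        return max(freqs) - min(freqs)
    return sorted(vocab, key=spread, reverse=True)[:m]

def classify(seq, kmer_totals, class_counts, features, candidates):
    """Multinomial naive Bayes over the selected features and candidate classes only."""
    counts = kmer_counts(seq)
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in candidates:
        total = sum(kmer_totals[c][f] for f in features)
        score = math.log(class_counts[c] / n)  # log prior
        for f in features:
            # Laplace-smoothed likelihood of k-mer f in class c
            score += counts[f] * math.log((kmer_totals[c][f] + 1) / (total + len(features)))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy data: (sequence, family, genus)
data = [
    ("AAAAAAAAAA", "F1", "G1a"),
    ("ATATATATAT", "F1", "G1b"),
    ("GCGCGCGCGC", "F2", "G2a"),
    ("GGCCGGCCGG", "F2", "G2a"),
]

# Taxonomy: each family maps to the genera observed under it.
taxonomy = defaultdict(set)
for _, fam, gen in data:
    taxonomy[fam].add(gen)

family_model = train((s, f) for s, f, _ in data)
genus_model = train((s, g) for s, _, g in data)
family_feats = select_features(family_model[0], m=4)
genus_feats = select_features(genus_model[0], m=4)

def predict(seq):
    """Predict the family first, then consider only genera under that family."""
    family = classify(seq, *family_model, family_feats, candidates=family_model[1])
    genus = classify(seq, *genus_model, genus_feats, candidates=taxonomy[family])
    return family, genus

print(predict("AAAAAAATAA"))
```

In this sketch the genus-level step scores only the genera listed under the predicted family and only the selected k-mers, which is the source of the speed-up the abstract describes. The trade-off is that an error at the family level cannot be corrected at the genus level.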