| Graduate student: | 盧鵬羽 Lu, Peng-Yu |
|---|---|
| Thesis title: | 階層式概念應用於多類別基因序列資料分類之研究 A Study of Applying Taxonomy to Classify Multiclass Gene Sequence Data |
| Advisor: | 翁慈宗 Weng, Tzu-Tsung |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2011 |
| Academic year of graduation: | 99 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 48 |
| Chinese keywords: | 基因序列資料 (gene sequence data), 多源基因體學 (metagenomics), 簡易貝氏分類器 (naïve Bayesian classifier), 分類 (classification) |
| English keywords: | gene sequence data, metagenomics, naïve Bayesian classifier, taxonomy |
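In the past, biologists who wanted a full picture of an ecological environment could only learn which species were present by culturing them in the laboratory. However, the species that can be cultured in the laboratory account for only about one percent of those on Earth, and the remaining ninety-nine percent cannot be examined this way. Because of this ecological diversity, biologists have worked to make up for the shortcomings of current methods. Metagenomics makes it possible to take samples directly from an ecological environment, which not only allows a complete microbial database for that environment to be built but also gives further insight into the interactions among species. Building such a database requires sequencing, which produces sequence data composed of a, t, c, and g. Results reported in the literature, however, show that classification accuracy is better at levels with fewer class values, such as the kingdom level, and lower at levels with more class values. This study therefore proposes a method to improve this situation. The method first builds a hierarchical parent-child structure over the class values in the original data file. The reason is that in a traditional classification model, classifying at a given level means considering every class value at that level, even though not all of them are relevant. For example, when a sequence already assigned to the animal kingdom is classified at the phylum level, class values belonging to other kingdoms do not need to be considered. The original data are then divided into a training set and a test set, and cross-validation is used to verify the classification results. With this concept, classification becomes faster, but classification accuracy is not effectively improved.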
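The hierarchical restriction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the lineages are made up, and the names `build_taxonomy`, `candidates`, and `ROOT` are hypothetical.

```python
from collections import defaultdict

ROOT = "<root>"  # hypothetical marker for the parent of the top level


def build_taxonomy(lineages):
    """Map each class value to the set of its child class values.

    Each lineage is a tuple ordered from the top level down,
    e.g. (kingdom, phylum).
    """
    children = defaultdict(set)
    for lineage in lineages:
        parent = ROOT
        for value in lineage:
            children[parent].add(value)
            parent = value
    return children


def candidates(children, parent=ROOT):
    """Return the class values that still need to be considered below `parent`."""
    return sorted(children.get(parent, ()))


# Made-up lineages, for illustration only.
lineages = [
    ("Animalia", "Chordata"),
    ("Animalia", "Arthropoda"),
    ("Bacteria", "Firmicutes"),
    ("Bacteria", "Proteobacteria"),
    ("Bacteria", "Actinobacteria"),
]
taxonomy = build_taxonomy(lineages)

# A flat classifier at the phylum level considers every phylum.
all_phyla = sorted({lineage[1] for lineage in lineages})
# With the taxonomy, a sequence assigned to Animalia considers only its phyla.
animal_phyla = candidates(taxonomy, "Animalia")
```

In this toy data a flat classifier would score all five phyla, while a sequence already assigned to `Animalia` only needs the two phyla below it, which is where the speed-up comes from.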
The traditional way for biologists to explore the microbes in an ecological environment is to culture them in a laboratory. However, fewer than one percent of the microbes living on Earth can be cultured. For this reason, biologists are looking for new technology that can make up for the shortcomings of current methods. Metagenomics allows biologists to obtain gene samples directly from an ecological environment. This technology helps not only to establish a complete database of microbes but also to understand the interactions among ecological species. Sequence data for metagenomics are composed of four basic symbols: a, t, c, and g. Many studies have shown that a level with few class values generally has higher classification accuracy than a level with many class values, and the computational cost of the naïve Bayesian classifier is proportional to the number of class values. For example, the ‘kingdom’ level, which has fewer class values, has higher prediction accuracy than the ‘genus’ level, and the computational cost for the ‘genus’ level is higher. This research proposes a taxonomy, summarized from gene sequence data, for the class values at the various levels. The taxonomy is then embedded in the naïve Bayesian classifier to determine the class value of a test gene sequence instance. The Laplace estimator for the naïve Bayesian classifier is also modified to make the taxonomy applicable. The experimental results on two gene sequence data sets show that the taxonomy improves the computational efficiency of the naïve Bayesian classifier but does not improve its prediction accuracy.
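One way to picture a taxonomy embedded in a naïve Bayesian classifier is the top-down sketch below. It rests on several assumptions that the abstract does not confirm: sequences are represented by counts of overlapping k-mers with `K = 2`, probabilities use the standard add-one Laplace estimate (the thesis's modified estimator is not described here), and the class names and training sequences are invented. `HierarchicalNB` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict
from itertools import product

K = 2  # assumed k-mer length; the abstract does not state the features used
VOCAB = ["".join(p) for p in product("acgt", repeat=K)]
ROOT = "<root>"  # hypothetical marker for the parent of the top level


def kmers(seq):
    """Split a sequence into its overlapping k-mers."""
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]


class HierarchicalNB:
    def __init__(self):
        self.children = defaultdict(set)         # parent -> child class values
        self.kmer_counts = defaultdict(Counter)  # class value -> k-mer counts
        self.n_seqs = Counter()                  # class value -> training sequences

    def fit(self, sequences, lineages):
        for seq, lineage in zip(sequences, lineages):
            parent = ROOT
            for value in lineage:
                self.children[parent].add(value)
                self.kmer_counts[value].update(kmers(seq))
                self.n_seqs[value] += 1
                parent = value
        return self

    def _log_score(self, seq, value, siblings):
        # Add-one (Laplace) prior computed over the sibling class values only.
        total_seqs = sum(self.n_seqs[s] for s in siblings)
        score = math.log((self.n_seqs[value] + 1) / (total_seqs + len(siblings)))
        counts = self.kmer_counts[value]
        total = sum(counts.values())
        for kmer in kmers(seq):
            # Add-one (Laplace) estimate of P(k-mer | class value).
            score += math.log((counts[kmer] + 1) / (total + len(VOCAB)))
        return score

    def predict(self, seq):
        """Walk down the taxonomy, scoring only the children of the chosen parent."""
        path, parent = [], ROOT
        while self.children.get(parent):
            siblings = sorted(self.children[parent])
            parent = max(siblings, key=lambda v: self._log_score(seq, v, siblings))
            path.append(parent)
        return path


# Invented training data: (kingdom, phylum) lineages, for illustration only.
train_seqs = ["aaaaaaaaat", "aaaataaaaa", "ccccccccgc", "gggggggggc"]
train_lineages = [("K1", "P1"), ("K1", "P1"), ("K2", "P2"), ("K2", "P3")]
model = HierarchicalNB().fit(train_seqs, train_lineages)
prediction = model.predict("gggggg")
```

At each level the sketch scores only the class values under the parent chosen one level up, so the work per level depends on the number of siblings rather than on every class value at that level. A mistake at a higher level cannot be corrected further down, which is one possible reason such a scheme may gain speed without gaining accuracy.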