
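Graduate student: Lu, Peng-Yu (盧鵬羽)
Thesis title: A Study of Applying Taxonomy to Classify Multiclass Gene Sequence Data (階層式概念應用於多類別基因序列資料分類之研究)
Advisor: Weng, Tzu-Tsung (翁慈宗)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Number of pages: 48
Keywords (translated from Chinese): gene sequence data, metagenomics, naïve Bayesian classifier, classification
Keywords (English): gene sequence data, metagenomics, naïve Bayesian classifier, taxonomy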
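  • In the past, biologists who wanted a complete picture of an ecological environment could only learn which species were present by culturing them in the laboratory. However, the species that can be cultured in a laboratory account for only about one percent of those on earth, and the remaining ninety-nine percent cannot be examined this way. Because of this ecological diversity, biologists have worked to make up for the shortcomings of current methods. Metagenomics makes it possible to take samples directly from an ecological environment, which not only allows a complete microbial database to be built for that environment but also helps reveal the interactions among its species. Building such a database requires sequencing, which produces sequence data composed of a, t, c, and g. Results reported in the literature show that classification accuracy is higher at levels with few class values, such as the kingdom level, and lower at levels with many class values. This study therefore proposes a way to improve the situation. First, a hierarchical structure of parent-child relationships is built for the class values in the original data set. The reason is that, in a traditional classification model, classifying at a given level requires considering every class value at that level, yet not all of them need to be considered. For example, a sequence assigned to the animal kingdom does not need to be compared against class values belonging to other kingdoms when it is classified at the phylum level. The original data are then split into training and test sets, and cross-validation is used to verify the classification results. When this concept is applied, classification becomes faster, but classification accuracy is not effectively improved.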

    The traditional way for biologists to explore the microbes in an ecological environment is to culture them in a laboratory. However, the microbes that can be cultured account for less than one percent of the microbes living on earth. For this reason, biologists are looking for a new technology that can make up for the shortcomings of the current method. Metagenomics allows biologists to obtain gene samples directly from an ecological environment. This technology can help us not only to establish a complete database of microbes, but also to understand the relationships among ecological species. Sequence data for metagenomics are composed of four basic symbols: a, t, c, and g. Many studies have shown that a level with few class values generally has higher classification accuracy than a level with many class values, and the computational complexity of the naïve Bayesian classifier is proportional to the number of class values. For example, the level ‘kingdom’, which has fewer class values, has higher prediction accuracy than the level ‘genus’, and the computational cost for the level ‘genus’ is higher. This research proposes a taxonomy, summarized from gene sequence data, for the class values at the various levels. The taxonomy is then embedded in the naïve Bayesian classifier to determine the class value of a test gene sequence instance. The Laplace estimator for the naïve Bayesian classifier is also modified to make the taxonomy applicable. The experimental results on two gene sequence data sets show that the taxonomy does improve the computational efficiency of the naïve Bayesian classifier, but it does not improve the prediction accuracy.
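    The idea of embedding a taxonomy in the classifier can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class name `HierarchicalNB`, the two-level kingdom/phylum hierarchy, the k-mer length `K = 2`, the toy sequences, and the labels `A`, `A1`, and so on are all assumptions made for illustration. It uses standard add-one (Laplace) smoothing, with the class prior normalised over the restricted candidate set; how the thesis actually modifies the Laplace estimator is not stated in this record.

```python
import math
from collections import Counter, defaultdict

K = 2  # k-mer (oligonucleotide) length; an assumed value for illustration


def kmers(seq, k=K):
    """Return the overlapping k-mers of a DNA sequence, lowercased."""
    seq = seq.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class HierarchicalNB:
    """Hypothetical naive Bayes over k-mer counts with a two-level taxonomy."""

    def fit(self, seqs, kingdoms, phyla):
        self.children = defaultdict(set)    # kingdom -> phyla seen under it
        self.counts = defaultdict(Counter)  # (level, label) -> k-mer counts
        self.n = Counter()                  # (level, label) -> training sequences
        for seq, kd, ph in zip(seqs, kingdoms, phyla):
            self.children[kd].add(ph)
            for key in (("kingdom", kd), ("phylum", ph)):
                self.counts[key].update(kmers(seq))
                self.n[key] += 1
        return self

    def _log_score(self, level, label, candidates, seq):
        # Add-one smoothed prior, normalised over the candidate set only
        # (an assumption, not the thesis's modified estimator).
        total = sum(self.n[(level, c)] for c in candidates)
        score = math.log((self.n[(level, label)] + 1) / (total + len(candidates)))
        c = self.counts[(level, label)]
        denom = sum(c.values()) + 4 ** K    # 4**K possible k-mers over a, t, c, g
        for km in kmers(seq):
            score += math.log((c[km] + 1) / denom)
        return score

    def _best(self, level, candidates, seq):
        return max(sorted(candidates),
                   key=lambda lab: self._log_score(level, lab, candidates, seq))

    def predict(self, seq):
        # Level 1: every kingdom is a candidate.
        kingdom = self._best("kingdom", self.children.keys(), seq)
        # Level 2: only phyla under the predicted kingdom are scored.
        phylum = self._best("phylum", self.children[kingdom], seq)
        return kingdom, phylum


# Toy data, invented for this sketch.
train_seqs = ["aaaaaaaatt", "atatatatat", "gggggggcc", "gcgcgcgcgc"]
train_kingdoms = ["A", "A", "B", "B"]
train_phyla = ["A1", "A2", "B1", "B2"]

model = HierarchicalNB().fit(train_seqs, train_kingdoms, train_phyla)
print(model.predict("aaaaat"))
```

    At the phylum level the sketch scores only the children of the predicted kingdom, which is where the efficiency gain described in the abstract comes from. It also shows why accuracy need not improve: a wrong decision at the kingdom level cannot be corrected further down.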

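    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
    1.1 Research motivation
    1.2 Research objectives
    1.3 Thesis organization
    Chapter 2 Literature review
    2.1 Metagenomics
    2.2 Genomic signatures
    2.2.1 GC content
    2.2.2 Oligonucleotide frequency
    2.3 Data mining tools applied to metagenomics
    2.5 Classification prediction with many class values
    Chapter 3 Research methods
    3.1 Feature extraction
    3.2 Naïve Bayesian classifier without the hierarchical structure
    3.3 Hierarchical structure
    3.4 Naïve Bayesian classifier with the hierarchical structure
    Chapter 4 Empirical study
    4.1 Data collection and preparation
    4.2 Data sets and classification settings
    4.3 Comparison of classification with and without the hierarchical structure
    Chapter 5 Conclusions and suggestions
    5.1 Conclusions
    5.2 Suggestions
    References in Chinese
    References in English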

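    蔡忠霖 (2010). An entropy-based feature selection method for classification problems in metagenomics (in Chinese). Master's thesis, Institute of Information Management, National Cheng Kung University.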

    Bentley S. D. and Parkhill J. (2004). Comparative genomic structure of prokaryotes. Annual Review of Genetics, 38, 771-792.

    Chan C. K., Hsu A. L., Halgamuge S. K. and Tang S. L. (2008). Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics, 9:215.

    Diaz N. N., Krause L., Goesmann A., Niehaus K., and Nattkemper T. W. (2009). TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics, 10:56.

    Dudoit S., and van der Laan M. J. (2003). Unified cross-validation methodology for selection among estimators: Finite sample results, asymptotic optimality, and applications. Division of Biostatistics, UC Berkeley, Technical Report #130.

    Garrity G. M., Bell J. A., and Lilburn T. G. (2004). Taxonomic outline of the prokaryotes. Bergey’s Manual of Systematic Bacteriology, 2nd Edition, Release 5.0, Springer-Verlag.

    Handelsman J. (2004). Metagenomics: application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews, 68, 669–685.

    Handelsman J., Rondon M. R., Brady S. F., Clardy J., and Goodman R. M. (1998). Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & Biology, 5:10, 245-249.

    Huson D. H., Auch A. F., Qi J., and Schuster S. C. (2007). MEGAN analysis of metagenomic data. Genome Research, 17:3, 377-386.

    Karlin S., Mrazek J., and Campbell A. M. (1997). Compositional biases of bacterial genomes and evolutionary implications. Journal of Bacteriology, 179:12, 3899-3913.

    McHardy A. C., Martin H. G., Tsirigos A., Hugenholtz P., and Rigoutsos I. (2007). Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4:1, 63-72.

    Nasser S., Breland A., Harris F. C., and Nicolescu M. (2008). A fuzzy classifier to taxonomically group DNA fragments within a metagenome. Fuzzy Information Processing Society, 26, 1-4.

    Peng Z., Lu T., Li L., Liu X., Gao Z., Hu T., et al. (2010). Genome-wide characterization of the biggest grass, bamboo, based on 10,608 putative full-length cDNA sequences. BMC Plant Biology, 10:116.

    Holt R. A. and Jones S. J. M. (2008). The new paradigm of flow cell sequencing. Genome Research, 18:6, 839-846.

    Sandberg R., Winberg G., Bränden C. I., Kaske A., Ernberg I., and Cöster J. (2001). Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Research, 11:8, 1404-1409.

    Teeling H., Meyerdierks A., Bauer M., Amann R., and Glöckner F. O. (2004). Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental Microbiology, 6:9, 938–947.

    Wang Q., Garrity G. M., Tiedje J. M., and Cole J. R. (2007). Naïve Bayesian Classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73:16, 5261–5267.

    Wheeler D. L., Chappey C., Lash A. E., Leipe D. D., Madden T. L., Schuler G. D., Tatusova T. A., and Rapp B. A. (2000). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 29:1, 11-16.

    On campus: publicly available from 2013-07-14
    Off campus: publicly available from 2013-07-14