
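Graduate student: Tsai, Jung-Lin (蔡忠霖)
Thesis title: An Entropy-based Feature Selection Method for Metagenomic Data (應用於多源基因體學分類問題之以熵值為基礎的特徵選取法)
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: College of Management, Institute of Information Management
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Number of pages: 40
Keywords (translated from Chinese): metagenomics, classification, feature selection, Bayesian classifier
Keywords (English): classification, feature selection, metagenomics, naïve Bayesian classifier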
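  • Metagenomics has been a popular topic in biology in recent years. To understand an ecological environment as a whole, biologists have traditionally had to culture organisms in the laboratory to learn which species live there. However, the species that can be cultured in a laboratory account for only about one percent of those on Earth; the remaining ninety-nine percent cannot be observed this way, which is why ecological diversity has long been a problem biologists want to solve. To better understand this diversity, metagenomic studies take samples directly from the environment and use sequencing and classification techniques to build a species database for that environment. Such studies make it possible not only to build a complete database of the microbial species in an environment but also to further understand the interactions among those species. After a sample is taken directly from the environment, building the database requires sequencing, which produces many sequences composed of A, T, C, and G. Genome signatures are then used to generate features from these sequences, and classification finally assigns each sequence to its corresponding class. The number of features generated in this way has traditionally been very large. This study develops an entropy-based feature selection method that reduces the number of features, with the aim of still obtaining good classification results while effectively reducing the computational cost of classification. The results show that, after feature selection, classification accuracy is higher than without feature selection when using 8-mer genome signatures at the class, order, and family levels and when using 9-mer genome signatures at the phylum, class, order, and family levels; in addition, feature selection improves classification efficiency.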
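    The abstract notes that genome signatures turn each A/T/C/G sequence into features and that the number of features is very large. For k-mer signatures there are 4^k possible k-mers, so 8-mers give 65,536 candidate features and 9-mers give 262,144. The following is a minimal Python sketch of k-mer counting. The function names, the use of overlapping windows, and the decision to skip windows containing ambiguous bases are assumptions made for illustration, not details taken from the thesis.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts

def kmer_vector(sequence: str, k: int) -> list:
    """Return counts for all 4**k possible k-mers in a fixed lexicographic order."""
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return [counts[kmer] for kmer in vocabulary]

# Toy example with k = 2 (hypothetical read, not from the thesis data set).
counts = kmer_counts("ATCGATCG", 2)
vector = kmer_vector("ATCGATCG", 2)
```

    A dense vector of length 4^k is convenient for small k, but for 8-mers and 9-mers most entries are zero for a short read, so the sparse `Counter` form is likely the more practical representation.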

    Metagenomics has been one of the major topics in biology in recent years. The traditional way for biologists to understand the microbes in an area is to culture them in a laboratory. Since fewer than one percent of all microbes can be cultured, biologists need a new technology to study microbes efficiently. When a sample is retrieved from a specific region, sequencing and classification techniques are employed to analyze the microorganisms in the sample and establish a database for them. This database can provide the data necessary to understand the interrelations among the species. A gene sequence instance is composed of the letters A, T, C, and G, and genome signature techniques are used to extract features for classification. Since the number of features is generally huge for a gene sequence set, we propose an entropy-based method for feature selection. The experimental results on a gene sequence set show that when either 8-mer or 9-mer genome signatures are employed to generate features, our feature selection method can improve the accuracy of naïve Bayesian classifiers at some taxonomic levels. Since not all features are used for classification, the computational efficiency of the naïve Bayesian classifier is also improved.
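    The abstract does not say how the entropy of a feature is computed or how features are chosen from it. As one plausible reading, the sketch below scores each k-mer by the Shannon entropy of its occurrence counts across classes and keeps the lowest-entropy k-mers, on the assumption that a k-mer concentrated in few classes is more discriminative. The scoring rule, the function names, and the toy data are all hypothetical and should not be taken as the thesis's actual method.

```python
import math
from collections import Counter, defaultdict

def class_entropy(class_counts) -> float:
    """Shannon entropy (in bits) of a feature's occurrence counts across classes."""
    total = sum(class_counts.values())
    entropy = 0.0
    for count in class_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

def select_features(samples, num_features):
    """Rank k-mers by ascending class entropy and keep the first num_features.

    samples: list of (Counter of k-mer counts, class label) pairs.
    Ties are broken alphabetically so the result is deterministic.
    """
    per_feature = defaultdict(Counter)
    for counts, label in samples:
        for kmer, n in counts.items():
            per_feature[kmer][label] += n
    ranked = sorted(per_feature, key=lambda f: (class_entropy(per_feature[f]), f))
    return ranked[:num_features]

# Toy 2-mer counts for three reads from two phyla (made-up numbers).
samples = [
    (Counter({"AT": 4, "CG": 1}), "Proteobacteria"),
    (Counter({"AT": 3, "CG": 1}), "Proteobacteria"),
    (Counter({"GC": 5, "CG": 2}), "Firmicutes"),
]
selected = select_features(samples, 2)
```

    In this toy data "AT" and "GC" each occur in a single class (entropy 0), while "CG" is split evenly between the two classes (entropy 1 bit), so "CG" is the feature dropped. A k-mer that occurs only once also has entropy 0 under this rule, so a practical version would probably need a minimum-count threshold as well.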

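    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Research Motivation
        1.2 Research Objectives
        1.3 Thesis Organization
    Chapter 2 Literature Review
        2.1 Metagenomics
        2.2 Genome Signatures in Metagenomics
            2.2.1 GC Content
            2.2.2 Oligonucleotide Frequency
        2.3 Data Mining Tools for Predicting Metagenomic Classes
        2.4 Feature Selection Methods
        2.5 Cross-Validation
    Chapter 3 Methodology
        3.1 Feature Extraction
        3.2 Feature Selection
        3.3 Classification
        3.4 Evaluation Procedure
    Chapter 4 Empirical Study
        4.1 Data Collection and Preparation
        4.2 Discussion and Analysis of Feature Selection
        4.3 Classification Accuracy after Feature Selection
        4.4 Summary
    Chapter 5 Conclusions and Suggestions
        5.1 Conclusions
        5.2 Suggestions
    References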


    Full-text availability: public on campus from 2011-07-02; public off campus from 2011-07-02