| Graduate student: | 康鵬濬 Kang, Peng-Chun |
|---|---|
| Thesis title: | 多源基因體學操作階層單位分析指標之評估 Evaluation Indices for Operational Taxonomy Units in Metagenomics |
| Advisor: | 翁慈宗 Wong, Tzu-Tsung |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2012 |
| Academic year of graduation: | 100 |
| Language: | Chinese |
| Number of pages: | 54 |
| Keywords (Chinese): | 分群方法、基因序列資料、多源基因體學、階層操作單位 |
| Keywords (English): | clustering method, gene sequence data, metagenomics, operational taxonomy unit |
Operational Taxonomy Unit (OTU) analysis, that is, the clustering of gene sequences, has long been an important part of metagenomics. Gene sequence data are acquired in a variety of ways, and different sequence alignment and clustering methods are then applied to identify the gene clusters through which microbes interact with their environment. Recent work further aims to base the analysis on short sequences from highly variable gene regions in order to reduce computational cost. Because gene sequence data sets are very large, earlier studies concentrated on clustering methods that find stable gene clusters with the best possible computational efficiency. Most clustering methods, however, first require an alignment step that brings raw sequences of differing lengths to a common length so that similarities can be computed, so the alignment method affects the clustering results as well as the clustering method does. Many algorithms have been widely applied to OTU analysis, yet their results are still compared mainly by computational efficiency and by the number of clusters produced, and no objective index exists for measuring and comparing the clustering results of different algorithms. This study first tests a clustering evaluation index proposed in the recent literature and finds that it assesses only the parameter settings of a clustering method against the expected clustering outcome, so it cannot effectively measure the quality of a clustering result. The study therefore designs two evaluation indices derived from two different concepts, one unsupervised and one supervised, and uses them to evaluate five common clustering methods. Experiments on two gene sequence data sets show that, at the same taxonomic level, different clustering methods require different thresholds. The unsupervised and supervised indices proposed in this study give inconsistent evaluation results, and the supervised index is the more reliable of the two.
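The contrast between a supervised and an unsupervised evaluation index can be illustrated with a minimal Python sketch. The abstract does not say which indices or clustering methods the thesis uses, so everything below is an assumption made for illustration: greedy seed-based clustering stands in for the five methods, cluster purity stands in for a supervised index, and mean within-cluster distance stands in for an unsupervised one. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Assign each sequence to the first OTU whose seed it matches at >= threshold.

    Illustrative stand-in only; not one of the methods evaluated in the thesis.
    """
    seeds, labels = [], []
    for s in seqs:
        for k, seed in enumerate(seeds):
            if identity(s, seed) >= threshold:
                labels.append(k)
                break
        else:
            # No existing seed is similar enough: this sequence starts a new OTU.
            seeds.append(s)
            labels.append(len(seeds) - 1)
    return labels


def purity(labels, taxa):
    """Supervised index (assumed): share of sequences in their OTU's majority taxon."""
    by_cluster = {}
    for lab, tax in zip(labels, taxa):
        by_cluster.setdefault(lab, []).append(tax)
    majority = sum(Counter(m).most_common(1)[0][1] for m in by_cluster.values())
    return majority / len(labels)


def mean_within_distance(seqs, labels):
    """Unsupervised index (assumed): mean (1 - identity) over same-OTU pairs."""
    dists = [
        1 - identity(seqs[i], seqs[j])
        for i in range(len(seqs))
        for j in range(i + 1, len(seqs))
        if labels[i] == labels[j]
    ]
    return sum(dists) / len(dists) if dists else 0.0


# Hypothetical aligned sequences with known taxon labels.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC", "CCCCCCCCCG"]
taxa = ["g1", "g1", "g2", "g3", "g3"]

for threshold in (0.9, 0.8):
    labels = greedy_otus(seqs, threshold)
    print(threshold, labels, purity(labels, taxa), mean_within_distance(seqs, labels))
```

The supervised index needs reference taxon labels, while the unsupervised one uses only the sequences and their cluster assignments. Because they measure different things, two such indices can rank the same set of clustering results differently.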