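Graduate student: 陳彥群
Chen, Yen-Chun
Thesis title: 探討次世代定序技術的定序偏差對全新基因體組裝的影響
Exploring the Bias Influences of Next-Generation-Sequencing for de novo Genome Assembly
Advisor: 黃吉川
Hwang, Chi-Chuan
Degree: Master
Department: College of Engineering - Department of Engineering Science
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: Chinese
Number of pages: 134
Keywords (translated from Chinese): next-generation sequencing, sequence assembly, sequencing bias
Keywords (English): Next-generation-sequencing, Short read assembly, Sequencing bias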
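     The high-throughput capability of next-generation sequencing (NGS) technology has brought DNA sequencing to the level of deep whole-genome sequencing and has greatly reduced the labor and cost required, but the resulting volume of sequence data has become a new computational challenge. In recent years, a variety of assemblers for next-generation reads have been developed and are widely used to process large amounts of short-read data. The sequencing experiment itself, however, is not perfect: in addition to sequencing errors, DNA fragments are not sampled uniformly at random from the genome. Instead, sampling depends on the number of guanine (G) and cytosine (C) bases in the sequence. This sequencing bias produces an uneven distribution of the data and directly affects the result of genome assembly.
     To investigate the effect of sequencing bias on whole-genome assembly, this thesis uses the reference sequences of five model species (Staphylococcus aureus, Escherichia coli, Mycobacterium tuberculosis, Arabidopsis thaliana chromosome 1, and rice chromosome 5) to simulate 100-nucleotide reads with different degrees of sequencing bias. The simulated data are then processed with the assemblers ALLPATHS-LG, ABySS, Edena, SOAPdenovo, SSAKE, Velvet, and Velvet-SC. By comparing the assemblies against the reference sequences, the influence of sequencing bias on each assembler is evaluated by two criteria: contiguity and accuracy.
     The results show that although most assemblers can handle a low degree of sequencing bias, the contiguity and accuracy of the assembly decline once the bias increases. Assemblers differ not only in how strongly they are affected by sequencing bias but also in the kinds of errors they tend to produce. The main effect of sequencing bias can be attributed to insufficient local read coverage, so an adequate sequencing depth plays an important role. In addition, the variability of local GC content within the genome is one of the main factors determining the impact of the bias.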
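The read simulation described above can be illustrated with a minimal sketch. This record does not describe the thesis's actual simulator, so the bias model here is an assumption: reads are drawn uniformly from the reference and kept with a probability that drops as their GC fraction moves away from a preferred value. The names `acceptance_probability`, `bias_strength`, and `preferred_gc` are hypothetical, and sequencing errors are not modeled.

```python
import random


def gc_fraction(seq):
    """Return the fraction of G and C bases in a sequence."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def acceptance_probability(gc, bias_strength, preferred_gc=0.5):
    """Hypothetical bias model (an assumption, not the thesis's model).

    Reads whose GC fraction is far from preferred_gc are less likely
    to be kept. bias_strength = 0 gives unbiased sampling.
    """
    return max(0.0, 1.0 - bias_strength * abs(gc - preferred_gc) / 0.5)


def simulate_reads(reference, n_reads, read_length=100,
                   bias_strength=0.0, seed=0):
    """Sample (start, read) pairs from reference by acceptance-rejection."""
    if len(reference) < read_length:
        raise ValueError("reference is shorter than the read length")
    if not 0.0 <= bias_strength < 1.0:
        # Keeping bias_strength below 1 guarantees a nonzero acceptance
        # probability, so the sampling loop always terminates.
        raise ValueError("bias_strength must be in [0, 1)")
    rng = random.Random(seed)
    last_start = len(reference) - read_length
    reads = []
    while len(reads) < n_reads:
        start = rng.randint(0, last_start)  # both endpoints inclusive
        read = reference[start:start + read_length]
        if rng.random() < acceptance_probability(gc_fraction(read),
                                                 bias_strength):
            reads.append((start, read))
    return reads


def coverage(reference_length, reads, read_length=100):
    """Per-base read depth for reads returned by simulate_reads."""
    depth = [0] * reference_length
    for start, _ in reads:
        for i in range(start, start + read_length):
            depth[i] += 1
    return depth
```

On a toy reference with an AT-rich half and a GC-balanced half, a call such as `simulate_reads(reference, 400, bias_strength=0.9)` followed by `coverage(len(reference), reads)` yields lower depth over the AT-rich half, which is the kind of local coverage shortfall the abstract identifies as the main effect of bias.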

     Next-generation sequencing technology is now an important approach to decoding genomes. Handling millions of short reads has become a significant computational issue. In recent years, a series of tools has been developed to assemble the huge number of fragments into more contiguous sequences. However, the inherent sequencing bias may reduce assembly performance, and the effects of this bias on assembly have not been systematically discussed.
     In this study, we simulate reads with specified degrees of sequencing bias and error-rate profiles for S. aureus, E. coli, M. tuberculosis, Arabidopsis thaliana chromosome 1, and Oryza sativa chromosome 5. We consider various bias scenarios for each assembler (ALLPATHS-LG, ABySS, Edena, SOAPdenovo, SSAKE, Velvet, and Velvet-SC) and use an assembly evaluation tool, GAGE, to assess the assemblies by both N50 length and accuracy.
     Biased data sets lead to fragmentation and errors in the assemblies. Regions with low read coverage either cannot be assembled or yield sequences containing SNPs, indels, or rearrangements. Although most assemblers can cope with a small degree of bias in bacterial data, bias has a much deeper impact on the more complex plant genomes. A reasonable number of reads plays an important role in mitigating the bias. This study provides a novel landscape of assembly with respect to the relationship between coverage and sequencing bias.
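The N50 length used as the contiguity measure can be computed with a short sketch. It follows the standard definition: the largest length L such that contigs of length at least L together cover at least half of the total assembly length. This is an illustration only, not GAGE's implementation, and the function name `n50` is chosen here for convenience.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum
        # to at least half of the total assembled length.
        if 2 * running >= total:
            return length
    return 0
```

For example, `n50([80, 70, 50, 40, 30, 20])` returns 70: the total is 290, and the two longest contigs (80 + 70 = 150) already cover half of it. A more fragmented assembly of the same total length gives a smaller N50, which is how reduced contiguity shows up in this measure.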

    Chinese Abstract
    Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
    1.1 Background of genome sequencing
    1.2 Research motivation and objectives
    1.3 Literature review
    1.3.1 Sequencing principles
    1.3.2 Sequencing bias in next-generation data
    1.3.3 Short-read assembly of next-generation data
    1.4 Organization of the thesis
    Chapter 2 Methods and Materials
    2.1 Study design and workflow
    2.2 Sequence data
    2.2.1 Online databases
    2.2.2 Sequence data formats and error rates
    2.2.3 Assessment of sequencing bias
    2.3 Read simulation
    2.4 Sequence assembly
    2.4.1 Sequence assembly algorithms
    2.4.2 Sequence assembly software
    2.5 Quality assessment of assembly results
    Chapter 3 Results and Discussion
    3.1 Degree of bias in sequencing data
    3.1.1 Data from the NCBI database
    3.1.2 Data simulated in this study
    3.2 Effect of sequencing bias on assembly contiguity
    3.3 Relationship between sequencing bias and sequencing depth
    3.4 Sequencing bias increases assembly errors
    3.5 Relationship between sequencing bias and species complexity
    3.6 Comparison of the effect of sequencing bias on different assemblers
    3.7 Effect of the GC-independent standard deviation of sequencing depth on assembly
    Chapter 4 Conclusions and Future Work
    4.1 Conclusions
    4.2 Future work
    References
    Appendix A Read simulation results
    Appendix B Assembly result tables


    Full text release: on campus 2013-01-01; off campus 2014-01-01