Graduate student: 蔡正宏 (Tsai, Cheng-Hung)
Thesis title: Improving De Novo Genome Assembly by Using Longer Sequences Constructed from Short Paired-end Reads
Thesis title (Chinese): 利用填補雙端定序間隙的方式將定序產生的序列延長以提高基因體組序之完成度
Advisors: 蔣榮先 (Chiang, Jung-Hsien); 劉宗霖 (Liu, Tsung-Lin)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: English
Number of pages: 56
Keywords (Chinese): 次世代定序, 雙端定序, 基因體定序, 基因體組序
Keywords (English): next generation sequencing, paired-end sequencing, genome sequencing, genome assembly
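  • Genome sequencing and assembly are an important first step toward understanding DNA, the genetic material of living organisms. More than thirty years ago, Sanger introduced a sequencing technique that opened the era of large-scale gene sequencing, and only in recent years have so-called next generation sequencing technologies, such as 454 and Illumina, appeared. Next generation sequencing produces far more data than traditional Sanger sequencing at a much lower cost, but its reads are much shorter than Sanger reads, and these new data characteristics pose new challenges for genome assembly. The purpose of this thesis is therefore to develop a new computational method that assembles next generation sequencing reads into more complete genomes.
    In this thesis, we propose a method, which we call paired-end gap closing, that lengthens the reads produced by Illumina. The method not only merges paired-end reads that already overlap into longer sequences, but also fills the gap between paired-end reads that do not overlap to obtain even longer sequences. This greatly increases the chance of improving genome assembly completeness with Illumina data alone, which is very practical because Illumina data are comparatively cheap. To address this problem, we developed a paired-end gap-closing program, PE-Closer.
    The experiments in this thesis test the performance and feasibility of PE-Closer on both simulated and real Illumina paired-end data, and analyze whether the longer sequences produced by PE-Closer improve the completeness and accuracy of genome assembly. The results show that PE-Closer closes more than 90% of the gaps between the original Illumina paired-end reads, extends the reads from 100 bp to an average length of about 500 bp, and reduces the 1% sequencing error present in the original reads to 0.01%. Using these longer sequences also improves the completeness and accuracy of genome assembly.
    In summary, the paired-end gap-closing software developed in this thesis effectively lengthens the original Illumina paired-end reads, and our experiments verify that using these lengthened sequences improves the completeness and accuracy of genome assembly.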

    Genome sequencing and assembly are fundamental steps toward understanding the secrets behind DNA. Sequencing techniques were pioneered by Sanger and coworkers more than 30 years ago. Only recently has a series of so-called next generation sequencing (NGS) techniques, such as 454 and Illumina, emerged, providing a much higher data throughput and thus a much lower data cost than Sanger sequencing. However, the sequences produced by NGS, often called reads, are shorter (~400 bp in 454) or much shorter (~125 bp in Illumina) than Sanger reads (800~1000 bp). NGS data therefore introduce new computational challenges to genome assembly. The purpose of this research is to develop a new computational method to achieve a better assembly with NGS data.
    In this research, we propose a method that increases the length of Illumina reads by closing the gaps between the two reads of Illumina paired-ends (PEs). This method not only merges the two reads of an overlapping PE into a longer read, but also constructs a longer read from the two reads of a PE that do not overlap. This significantly increases the possibility of a much better assembly with Illumina data alone, which are cheaper than 454 data. We developed a computational program, called PE-Closer (Paired-End Closer), for this task.
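    The simpler of the two cases described above, merging a PE whose two reads overlap, can be illustrated with a minimal sketch. This is not PE-Closer's implementation: the function names, the `min_overlap` and `max_mismatch_frac` parameters, and the longest-overlap-first search are assumptions chosen for illustration. Closing the gap of a non-overlapping PE needs additional sequence information (the table of contents indicates an initial assembly and a contig graph) and is not covered here.

```python
from typing import Optional

# Hypothetical helper names; none of this is taken from the PE-Closer source.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_overlapping_pair(
    read1: str,
    read2: str,
    min_overlap: int = 10,           # assumed threshold, for illustration only
    max_mismatch_frac: float = 0.1,  # assumed tolerance for sequencing errors
) -> Optional[str]:
    """Merge two reads of a paired-end into one longer read if they overlap.

    read2 comes from the opposite strand, so it is reverse-complemented first.
    Candidate overlaps are tried from longest to shortest; the first one whose
    mismatch fraction is within the tolerance is accepted.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap_len in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap_len:]
        head = mate[:overlap_len]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap_len:
            # Keep read1 as-is and append the part of the mate beyond the overlap.
            return read1 + mate[overlap_len:]
    return None  # no acceptable overlap: the pair has a gap to close instead


if __name__ == "__main__":
    # A 36 bp fragment sequenced as two 24 bp reads that overlap by 12 bp.
    fragment = "ACGTTGCAAGGCTTAACCGGTATCGATCGGATTACA"
    r1 = fragment[:24]
    r2 = reverse_complement(fragment[-24:])
    print(merge_overlapping_pair(r1, r2) == fragment)
```

    Accepting the longest tolerable overlap is only one possible design choice; a real tool would also need to weigh base quality scores and guard against spurious overlaps in repetitive regions.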
    We tested the performance of PE-Closer on simulated and real Illumina data of several bacterial species (Rhodobacter sphaeroides, Spirochaeta smaragdinae, Planctomyces brasiliensis, Cyclobacterium marinum, Streptomyces violaceusniger and Escherichia coli). PE-Closer closed >90% of the gaps of Illumina PEs in all cases and increased the read length from 100 bp to 500 bp on average. It also corrected errors in the original reads, reducing the error rate from 1% to 0.01%. Using the longer reads obtained by PE-Closer, we improved the de novo genome assembly in terms of both statistics and quality.
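    To put the reported read lengths and error rates side by side, the short calculation below estimates the expected number of errors per read and the probability that a read is error-free. It assumes that errors occur independently and uniformly along a read, which is a simplification made here for illustration and is not a claim from the thesis; the function names are likewise hypothetical.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of erroneous bases in a read (per-base rate x length)."""
    return read_length * error_rate


def prob_error_free(read_length: int, error_rate: float) -> float:
    """Probability that every base is correct, assuming independent errors."""
    return (1.0 - error_rate) ** read_length


if __name__ == "__main__":
    # Figures taken from the abstract: 100 bp at 1% versus ~500 bp at 0.01%.
    cases = [("original read", 100, 0.01), ("gap-closed read", 500, 0.0001)]
    for label, length, rate in cases:
        print(
            f"{label}: {expected_errors(length, rate):.2f} expected errors, "
            f"{prob_error_free(length, rate):.3f} probability of no error"
        )
```

    Under this assumption, an original read carries about one error on average, while a gap-closed read five times as long carries about 0.05.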
    To conclude, our program PE-Closer is efficient in increasing the length of Illumina reads. Our experiments indicated that using the longer reads obtained by PE-Closer resulted in better de novo genome assemblies.

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Sequencing Background
      1.2 Motivation and Objectives
      1.3 Methodology
      1.4 Organization of Thesis
    Chapter 2 Literature Review
      2.1 De Novo Assembly Strategies
        2.1.1 Greedy Approach
        2.1.2 Overlap-Layout-Consensus Approach
        2.1.3 De Bruijn Graph Approach
      2.2 Two Related Works
        2.2.1 SHERA
        2.2.2 FLASH
    Chapter 3 Materials and Methods
      3.1 Sequencing Read Simulations
      3.2 Overview of the PE-Closer Program
      3.3 Initial Assembly and Building the Contig Graph
      3.4 Realigning Paired-end Reads to the Contig Graph
      3.5 Extracting the Sequences as Gap-Closed Reads
      3.6 Optional Error Correction Procedure
      3.7 Post-Processing
        3.7.1 Result Statistics
        3.7.2 Sequence Assembly
    Chapter 4 Experimental Results and Discussions
      4.1 Genomic Sequences and Sequencing Data
      4.2 Experimental Results
        4.2.1 Results of PE-Closer, SHERA and FLASH
        4.2.2 Assembly Parameters
        4.2.3 Rhodobacter sphaeroides Assembly using PE-Closer and FLASH
        4.2.4 PE Gap Closure on Simulated Libraries of Median Insert Length
        4.2.5 PE Gap Closure on Real Libraries of Median Insert Length
        4.2.6 PE Gap Closure on a Real Library of Long Insert Length
      4.3 Summary of Experiments
    Chapter 5 Conclusions and Future Works
      5.1 Conclusions
      5.2 Future Works
    References
    Appendix A


    Full text: on campus, public from 2014-08-01
    off campus, public from 2014-08-01