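Author: 陳薇筑 (Chen, Wei-Chu)
Thesis title: 最佳化使用高通量定序之微藻基因體組序
Optimizing microalgae genome assembly of high throughput sequencing data
Advisor: 劉宗霖 (Liu, Tsunglin)
Degree: Master
Department: College of Bioscience and Biotechnology - Department of Life Sciences
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: Chinese
Number of pages: 70
Chinese keywords (translated): whole-genome assembly, microalgae, evaluation, de novo genome assembly
English keywords: Genome assembly, Hybrid, de novo, Microalgae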
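    De novo genome assembly is key to uncovering an organism's genetic code, and the first step toward obtaining a genome sequence is sequencing. The most common sequencing platform today is Illumina; its advantages are low cost and a low error rate (1%), and its drawback is short read length (150 bp). Another platform, PacBio, produces very long reads (>10 kb) but has a higher error rate (~15%). Many assemblers have been developed for the different sequencing platforms, but there is no consensus on which assembly approach is better, and there is no good method for evaluating a de novo genome assembly. Here we test different approaches to de novo genome assembly and develop a method for evaluating de novo assemblies. The genome we used comes from Dr. Yi-Min Chen (陳逸民) of National Cheng Kung University, whose group found in Taiwan a microalga that produces high levels of docosahexaenoic acid (DHA) and named it BL10. BL10 was sequenced on both the Illumina and PacBio platforms, and after testing various parameters we successfully assembled its genome. Beyond evaluating the length of the assembled sequences, we also examined the correctness of their content. We aligned the assemblies produced by different software against each other pairwise to see how similar they are. We also aligned RNA sequencing (RNA-seq) data back to the assembled genome sequences and used the quality of those alignments to judge the assemblies. With these measures we were able to evaluate the strengths, weaknesses, and characteristics of each assembler. This pipeline can make biological research more convenient and help researchers choose software more accurately.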

    De novo genome assembly reveals an organism's genetic code and facilitates many biological studies, and sequencing is the first step toward obtaining a genome sequence. Of the sequencing platforms currently available, Illumina and PacBio are the most popular. Various assemblers have been published for these two read types, but it remains uncertain which assembler or parameter settings are best, and there is no good way to judge the performance of a de novo genome assembly. Here, we test several assemblers on a new genome and design a system for comparing the performance of genome assemblies. The genome is that of a Taiwanese microalga found by Dr. Yi-Min Chen and named BL10. We sequenced BL10 with Illumina and PacBio. After assembling the BL10 genome, which proved difficult, we designed a system to judge de novo genome assemblies. This system compares not only quantitative measures of the assemblies, such as N50, but also their quality. To evaluate quality, we compare the sequences of the assemblies against each other. In addition, we align RNA transcripts to the assemblies to decide which assembly is best. Among the BL10 assemblies we evaluated, the one produced by the hybrid assembler MaSuRCA was the best. This pipeline will facilitate biological research and help researchers choose tools accurately.
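    The abstract names N50 as one of the quantitative measures used to compare assemblies. As a minimal sketch of the standard definition (the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length), the Python function below computes it from a list of contig lengths. The function name `n50` and the example lengths are illustrative assumptions, not the thesis's own code or data.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches at least half of
    the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (in bp) for two assemblies of equal total size.
assembly_a = [100, 80, 50, 30, 20, 20]   # total 300
assembly_b = [50, 50, 50, 50, 50, 50]    # total 300

print(n50(assembly_a))
print(n50(assembly_b))
```

    Two assemblies with the same total length can have very different N50 values, which may be one reason a contiguity measure like this is best read alongside content-based checks such as the pairwise and transcript alignments the abstract describes.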

    Chinese Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chinese-English Glossary
    1. Background
        1-1 Introduction to Aurantiochytrium BL10
        1-2 Introduction to sequencing platforms
        1-3 Genome assembly
            1-3-1 Basic introduction
            1-3-2 Challenges of de novo genome assembly
        1-4 Principles of genome assembly
            1-4-1 Overlap-layout-consensus (OLC)
            1-4-2 De Bruijn graph (DBG)
        1-5 Principles of hybrid assembly
            1-5-1 Overview of DBG2OLC
            1-5-2 Overview of MaSuRCA
        1-6 Current methods for evaluating de novo genome assemblies
        1-7 Motivation and objectives
    2. Materials and Methods
        2-1 Collection of sequencing data
            2-1-1 Whole-genome sequencing
            2-1-2 RNA sequencing
        2-2 Statistics of sequencing data
            2-2-1 Whole-genome sequencing
            2-2-2 RNA sequencing
        2-3 Processing of sequencing reads
            2-3-1 Quality assessment of sequencing data
            2-3-2 Adapter trimming
        2-4 De novo genome assembly methods
            2-4-1 ALLPATHS-LG
            2-4-2 MaSuRCA
            2-4-3 DBG2OLC
            2-4-4 Canu
            2-4-5 HGAP
        2-5 Methods for evaluating de novo genome assemblies
            2-5-1 Genome size and N50
            2-5-2 Alignment between assemblies
            2-5-3 Alignment of assemblies to transcripts
    3. Results
        3-1 Sequencing data quality
        3-2 Comparison before and after adapter trimming
        3-3 Comparison of assembly lengths
            3-3-1 Comparison of MaSuRCA parameter settings
            3-3-2 Comparison of DBG2OLC parameter settings
            3-3-3 Combined results
        3-4 Evaluation results for de novo genome assemblies
            3-4-1 Genome size and N50
            3-4-2 Alignment between assemblies
            3-4-3 Alignment of assemblies to transcripts
    4. Summary and Discussion
        4-1 Summary
        4-2 Discussion
    References
    Tables
    Figures
    Appendix


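    Full-text availability: on campus, public from 2021-09-01; off campus, not public.
    The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.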