| Graduate student: | 吳宥甫 Wu, Yu-Fu |
|---|---|
| Thesis title: | Improving the De Novo Assembly by Quality Assessment and Error Correction of Second-Generation Sequencing Data (藉由次世代測序資料的品質評估以及錯誤校正改善非模式物種的基因組裝) |
| Advisor: | 鄭順林 Jeng, Shuen-Lin |
| Degree: | Master |
| Department: | College of Management - Department of Statistics |
| Year of publication: | 2012 |
| Academic year of graduation: | 100 |
| Language: | English |
| Number of pages: | 156 |
| Keywords: | Second-Generation Sequencing, De Novo Assembly, Error Correction, Quality Assessment, Dynamic Trimming, Suffix Arrays, Hamming Graph, Coverage Control, Random Shuffle |
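To address problems in assembling second-generation sequencing data, we propose several data pre-processing methods that use quality assessment and error correction of the sequencing data, with the aim of obtaining better genome assemblies for non-model species. The treatments we consider are: (1) Treatment 0: using the raw sequencing data without any quality assessment or error correction; (2) Treatment 1: trimming the data with a quality assessment and trimming tool (SolexaQA); (3) Treatment 2: correcting the data with error correction tools (Hammer and HiTEC); (4) Treatment 3: randomly shuffling the data. The treated data are then assembled with three assemblers (Velvet, SOAPdenovo and ABySS). We verify the effect of the treatments using alignment results for PhiX174, and then apply the treatments to the genome assembly of Gemmifera and a non-model alga.
The PhiX174 study yields several suggestions: (1) trimming the data by quality score helps the assembly; (2) assembly quality measures such as the number of contigs, the maximum contig length, N50 and the total length of all contigs cannot fully describe or explain the assembly results; (3) if the coverage of the data is high enough, lowering the coverage by random sampling can effectively improve the Velvet assembly; (4) random shuffling affects the assemblies produced by Velvet and ABySS but not those produced by SOAPdenovo, and further study shows that random shuffling can improve the Velvet and ABySS assemblies, so random shuffling should be considered when assembling with Velvet or ABySS; (5) combining data trimming with random shuffling can improve the assembly results.
We apply these suggestions to the genome assembly of the main experimental species, Gemmifera, and of the non-model alga, and confirm the assemblies by aligning them against Thaliana and against six species closely related to the alga, respectively.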
To improve de novo assembly from second-generation sequencing data, we propose several data pre-processing treatments based on quality assessment and error correction of the sequencing reads. The treatments we consider are: (1) Treatment 0: using the original data without any quality assessment or error correction. (2) Treatment 1: using SolexaQA to assess and trim the reads. (3) Treatment 2: using the error correction tools HiTEC and Hammer to correct the reads. (4) Treatment 3: shuffling the reads randomly.
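As an illustration of Treatments 1 and 3, the following Python sketch trims a read to its longest contiguous stretch of high-quality bases and shuffles a list of reads. It is a simplified sketch in the spirit of dynamic trimming, not the SolexaQA implementation; the function names, the Phred threshold `min_q=20`, the quality offset of 33 and the fixed seed are all assumptions made for the example.

```python
import random


def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 is an assumption (Sanger encoding); older Illumina
    pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality]


def dynamic_trim(seq, quality, min_q=20, offset=33):
    """Return the longest contiguous segment whose scores are all >= min_q.

    Assumes min_q is non-negative, so the -1 sentinel always ends a run.
    """
    best_start, best_len = 0, 0
    run_start = 0
    # The trailing sentinel closes a high-quality run that reaches the read end.
    for i, q in enumerate(phred_scores(quality, offset) + [-1]):
        if q < min_q:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = i + 1
    end = best_start + best_len
    return seq[best_start:end], quality[best_start:end]


def shuffle_reads(reads, seed=0):
    """Return the reads in random order; the input list is left unchanged."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

For example, `dynamic_trim("ACGTACGT", "II##IIII")` keeps only the last four bases, because the two `#` characters decode to a score of 2. For paired-end data, each element passed to `shuffle_reads` would need to be a read pair so that mates stay together.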
All of the treated data are assembled with three tools: Velvet, SOAPdenovo and ABySS. We validate the treatment effects using the alignment results for PhiX174, whose genome is well studied, and then apply the treatments to the de novo assembly of Gemmifera and a specific non-model alga.
The alignment results for the PhiX174 data suggest: (1) Quality trimming helps the assembly. (2) Assembly quality measures such as the number of contigs, maximum contig length, N50 and total contig length may not fully describe or explain the quality of the assembly results. (3) If the short-read data have high coverage, reducing the coverage by random sampling may be an effective way to improve assembly with Velvet. (4) Velvet and ABySS may be affected by random shuffling of the input reads, but SOAPdenovo is not. In other words, reordering the short reads may improve assembly with Velvet and ABySS, so the reads should be shuffled randomly when using these two assemblers. (5) Combining quality trimming and random shuffling of the reads may improve the assembly result.
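The summary measures in point (2) and the coverage control in point (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: the function names are hypothetical, coverage is estimated as total read bases divided by genome size, and the genome size must be supplied by the caller.

```python
import random


def assembly_stats(contig_lengths):
    """Compute number of contigs, maximum length, N50 and total length.

    N50 is the length L of the shortest contig such that contigs of
    length >= L together cover at least half of the total length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "num_contigs": len(lengths),
        "max_length": lengths[0] if lengths else 0,
        "n50": n50,
        "total_length": total,
    }


def subsample_to_coverage(reads, genome_size, target_coverage, seed=0):
    """Randomly sample reads so the estimated coverage is near the target.

    Coverage is estimated as (total bases in reads) / genome_size. If the
    data are already at or below the target, all reads are returned.
    """
    reads = list(reads)
    current = sum(len(r) for r in reads) / genome_size
    if current <= target_coverage:
        return reads
    keep = round(len(reads) * target_coverage / current)
    return random.Random(seed).sample(reads, keep)
```

None of these four measures uses a reference genome, so two assemblies with identical statistics can still differ in how many contigs are misassembled, which is consistent with point (2).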
We apply these suggestions to assemble the main experimental species, Gemmifera, and a specific non-model alga. The assemblies of Gemmifera and the alga are checked by alignment against the genome sequence of Thaliana and against six species related to the alga, respectively.
Cock, Peter J. A., Fields, Christopher J., Goto, Naohisa, Heuer, Michael L., and Rice, Peter M., “Survey and summary: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants”, Nucleic Acids Research, 2010, Vol 38, pages 1767-1771.
Cox, Murray P., Peterson, Daniel A., and Biggs, Patrick J., “SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data”, BMC Bioinformatics, 2010, Vol 11, pages 485-490.
Ilie, Lucian, Fazayeli, Farideh, and Ilie, Silvana, “HiTEC: Accurate error correction in high-throughput sequencing data”, Bioinformatics, 2011, Vol 27, pages 295-302.
Kao, Wei-Chun, Chan, Andrew H., and Song, Yun S., “ECHO: A reference-free short-read error correction algorithm”, Genome Research, 2011, Vol 21, pages 1181-1192.
Kelley, David R., Schatz, Michael C., and Salzberg, Steven L., “Quake: Quality-aware detection and correction of sequencing errors”, Genome Biology, 2010, Vol 11, R116.
Lin, Yong, Li, Jian, Shen, Hui, Zhang, Lei, Papasian, Christopher J., and Deng, Hong-Wen, “Comparative studies of de novo assembly tools for next-generation sequencing technologies”, Bioinformatics, 2011, Vol 27, pages 2031-2037.
Medvedev, Paul, Scott, Eric, Kakaradov, Boyko, and Pevzner, Pavel, “Error correction of high-throughput sequencing datasets with non-uniform coverage”, Bioinformatics, 2011, Vol 27, pages i137-i141.
Narzisi, Giuseppe, and Mishra, Bud, “Comparing de novo genome assembly: The long and short of it”, PloS One, 2011, Vol 6, e19175.
Holt, Robert A., and Jones, Steven J. M., “The new paradigm of flow cell sequencing”, Genome Research, 2008, Vol 18, pages 839-846.
Salmela, Leena, and Schröder, Jan, “Correcting errors in short reads by multiple alignments”, Bioinformatics, 2011, Vol 27, pages 1455-1461.
Schröder, Jan, Bailey, James, Conway, Thomas, and Zobel, Justin, “Reference-free validation of short read data”, PLoS One, 2010, Vol 5, e12681.
Schröder, Jan, Schröder, Heiko, Puglisi, Simon J., Sinha, Ranjan, and Schmidt, Bertil, “SHREC: A short-read error correction method”, Bioinformatics, 2009, Vol 25, pages 2157-2163.
Simpson, Jared T., Wong, Kim, Jackman, Shaun D., Schein, Jacqueline E., Jones, Steven J. M., and Birol, İnanç, “ABySS: A parallel assembler for short read sequence data”, Genome Research, 2009, Vol 19, pages 1117-1123.
Yang, Xiao, Aluru, Srinivas, and Dorman, Karin S., “Repeat-aware modeling and correction of short read errors”, BMC Bioinformatics, 2011, Vol 12, S52.
Yang, Xiao, Dorman, Karin S., and Aluru, Srinivas, “Reptile: Representative tiling for short read error correction”, Bioinformatics, 2010, Vol 26, pages 2526-2533.
Zerbino, Daniel R., and Birney, Ewan, “Velvet: Algorithms for de novo short read assembly using de Bruijn graphs”, Genome Research, 2008, Vol 18, pages 821-829.