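| Graduate student: | 周東誼 Chou, Tung-Yi |
|---|---|
| Thesis title: | 探討定序讀數的索引結構降低參考偏差在變異辨認上的可行性以及潛力 Investigating the Feasibility of Indexing Read Sequences and its Potential to Reduce Reference Bias in Variant Calling |
| Advisor: | 賀保羅 Paul Horton |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Medical Informatics |
| Year of publication: | 2020 |
| Academic year of graduation: | 108 |
| Language: | English |
| Number of pages: | 43 |
| Keywords (Chinese): | 次世代定序 (next-generation sequencing), 參考偏差 (reference bias), 變異辨認 (variant calling), 單核苷酸多型變異 (single-nucleotide polymorphism), 插入刪除變異 (insertion/deletion variant) |
| Keywords (English): | NGS, reference bias, variant calling, SNP, indel |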
In recent years, advances in next-generation sequencing (NGS) technology have made it much faster to obtain human genome sequence data. Many studies analyze human genome variation, and variant calling has long been an important topic. In variant calling, reads are mapped to a reference sequence and the mapping result is analyzed to determine the position and type of each variant. However, the reference sequence usually does not contain an individual's variants. Reads that carry individual variants differ from the reference and may therefore fail to map to the correct position. The resulting bias in the mapping, which we call "reference bias", in turn reduces the accuracy of variant calling. This study examines the feasibility and potential of a system we propose to reduce reference bias.
The system is divided into three parts. The first part is data pre-processing: the reference sequence is indexed in the usual way, and the reads are mapped to it with BWA to generate a mapping result file (the standard mapping BAM). We then also build an index structure over the reads. The second part takes a predefined set of candidate variants stored in a Variant Call Format (VCF) file and, for each variant, gathers the reads that cover its position according to the standard read mapping; we call this set of reads the "pileup". The third part, for each variant, extracts a local segment of the reference sequence and edits it to contain the variant, producing what we call a "hypothetical sequence". We then use the hypothetical sequence to query the read index structure and compare the reads found with the pileup from the mapping result. A found read that does not appear in the pileup is a read that matches the variant but was not mapped to that position because of reference bias.
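The second and third parts can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the names `Variant`, `hypothetical_sequence`, `build_read_index`, and `reads_missing_from_pileup`, the flank size, and the toy sequences and read names are all hypothetical. An exact-match k-mer dictionary stands in for the read index structure, and reverse complements and sequencing errors are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    pos: int  # 0-based position of the first REF base on the reference
    ref: str  # reference allele, e.g. "AGG"
    alt: str  # alternate allele, e.g. "A"


def hypothetical_sequence(reference: str, v: Variant, flank: int = 5) -> str:
    """Cut a local window around the variant and edit it to carry ALT."""
    end = v.pos + len(v.ref)
    if reference[v.pos:end] != v.ref:
        raise ValueError("REF allele does not match the reference")
    left = reference[max(0, v.pos - flank):v.pos]
    right = reference[end:end + flank]
    return left + v.alt + right


def build_read_index(reads: dict[str, str], k: int) -> dict[str, set[str]]:
    """Toy read index (assumption): map each k-mer to the reads containing it."""
    index: dict[str, set[str]] = {}
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def reads_missing_from_pileup(
    query: str, index: dict[str, set[str]], pileup: set[str]
) -> set[str]:
    """Reads that contain the query exactly but are absent from the pileup."""
    return index.get(query, set()) - pileup


# Toy data: a 2-base deletion (AGG -> A) at position 10 of the reference.
reference = "TTGACCATGCAGGCTAAGCTTCGA"
variant = Variant(pos=10, ref="AGG", alt="A")
reads = {
    "r1": "GACCATGCAGGCTAAG",  # matches the reference allele
    "r2": "ACCATGCACTAAGCT",   # carries the deletion, mapped at the variant
    "r3": "CCATGCACTAAGCTT",   # carries the deletion, mapped elsewhere
}
pileup = {"r1", "r2"}  # read names covering the variant in the standard BAM

query = hypothetical_sequence(reference, variant)
index = build_read_index(reads, k=len(query))
novel = reads_missing_from_pileup(query, index, pileup)
print(query)  # CATGCACTAAG
print(novel)  # {'r3'}
```

Here `r3` supports the deletion but is missing from the pileup, which is the kind of read the comparison step is meant to surface. A practical read index would need to support queries of arbitrary length over both strands, which this fixed-length k-mer dictionary does not.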
In the experiment, we used sequencing data from the GIAB dataset and candidate variants from the ClinVar dataset. We divided the variants into SNPs and indels and ran the experiment on each group separately. Our results show that, for indel variants, the read index sometimes finds reads that match a variant but are not mapped to that genome position in the standard mapping BAM file. In most cases those reads are present in the standard mapping BAM file but are mapped to another (paralogous) region of the genome. Thus, our strategy of building an index on the read data can find useful evidence (matching reads) that potentially supports variants which the standard read mapping procedure misses. However, those variants are usually found in paralogous regions, so this new evidence must be interpreted with caution.