| Graduate student: | 陳威穎 Chen, Wei-Ying |
|---|---|
| Thesis title: | Evaluating a read index based approach to improve the sensitivity of Illumina sequencing based genome variant calling |
| Advisor: | 賀保羅 Horton, Paul |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Medical Informatics |
| Year of publication: | 2022 |
| Academic year of graduation: | 110 (ROC calendar) |
| Language: | English |
| Number of pages: | 28 |
| Keywords: | NGS (next-generation sequencing), variant calling, read-index |
Next-generation sequencing (NGS) is an indispensable technology for precision medicine, but it must be paired with a downstream data analysis pipeline before it can be applied in clinical or research settings. Variant calling in NGS data is a key step on which almost all downstream analysis and interpretation rely, yet the inherent uncertainty in the variant calling process means that accurate detection of genomic variants remains challenging. To address this, a candidate variant evaluation tool, Explicit Alternative Genome Likelihood Evaluator (EAGLE), was previously developed. For each putative variant site, EAGLE calculates a marginal posterior probability that measures how well the sequencing data match the alternative genome sequence implied by the putative variant. Previous studies have shown that EAGLE improves the accuracy of variant assessment and is well suited to post-processing the output of various variant detection tools.
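The idea of scoring a putative variant by a posterior probability can be illustrated with a toy two-hypothesis Bayes calculation. This is only a minimal sketch and not EAGLE's actual model: the function name `posterior_alt`, the per-read log-likelihood inputs, the independence assumption between reads, and the flat default prior are all assumptions made for illustration.

```python
import math


def posterior_alt(read_loglik_ref, read_loglik_alt, prior_alt=0.5):
    """Posterior probability of the alternative (variant) hypothesis.

    Hypothetical inputs: one log-likelihood per read under the reference
    sequence and under the alternative sequence implied by the variant.
    Reads are assumed independent, which is a simplification.
    """
    if not 0.0 < prior_alt < 1.0:
        raise ValueError("prior_alt must be strictly between 0 and 1")
    # Joint log-probability of the data and each hypothesis.
    log_ref = sum(read_loglik_ref) + math.log(1.0 - prior_alt)
    log_alt = sum(read_loglik_alt) + math.log(prior_alt)
    # Normalise with the log-sum-exp trick to avoid underflow.
    m = max(log_ref, log_alt)
    log_total = m + math.log(math.exp(log_ref - m) + math.exp(log_alt - m))
    return math.exp(log_alt - log_total)


# Two reads that each fit the alternative sequence far better than the reference.
p = posterior_alt([-10.0, -10.0], [-1.0, -1.0])
```

When both hypotheses explain the reads equally well, the posterior simply equals the prior; the more the reads favour the alternative sequence, the closer the value gets to 1.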
Because the human reference genome that reads are compared against is usually linear and contains only a single sequence, it cannot capture the diversity of the human genome. Reads, however, may carry individual variation, so during mapping a read that differs substantially from the reference may be mapped to the wrong position or fail to map at all, a phenomenon called reference bias. Incorrect read mapping can lead to false-negative or false-positive variant calls, which reduces the accuracy of genomic variant identification. To address this problem, in previous work we added hypothetical sequences and a read index structure to augment the reference genome so that it reflects genomic diversity, and the results indicated that this effectively reduces the bias introduced by the reference sequence.
Our lab at NCKU has previously explored extending EAGLE with a read index to reduce reference bias. Those studies focused on analyzing how the read index affects the so-called "pile-up" of reads mapping to the location of candidate variants, but did not quantitatively evaluate the impact of the new method on variant calling. In this study, we therefore evaluate its performance on real data from NA12878, using both an exome sequencing data set and a whole-genome sequencing data set, with commonly used variant detection tools supporting the experiments.
We evaluated the new version of EAGLE using precision-recall curves and found that its precision was lower at every recall level. However, when we extracted the "incorrect" variants to which the new EAGLE assigned a high marginal probability and manually examined their read mappings, we found that in most cases those putative variants appeared well supported by the data. Moreover, only a small fraction of those "incorrect" putative variants could be explained away by considering read multi-mapping.
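A precision-recall evaluation of this kind can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the thesis: the function name, the `(chrom, pos, ref, alt)` variant keys, and the example scores are hypothetical, and a call counts as correct only if its key appears in the benchmark set.

```python
def precision_recall_points(scored_calls, benchmark):
    """Return (threshold, precision, recall) tuples, highest threshold first.

    scored_calls: list of (variant_key, score) pairs, e.g. a posterior
                  probability per putative variant (hypothetical input).
    benchmark:    non-empty set of variant keys treated as the truth.
    """
    if not benchmark:
        raise ValueError("benchmark set must not be empty")
    ranked = sorted(scored_calls, key=lambda call: call[1], reverse=True)
    points = []
    true_positives = 0
    for n_called, (key, score) in enumerate(ranked, start=1):
        if key in benchmark:
            true_positives += 1
        # Emit one point per distinct score, after the last call tied at it.
        is_last_of_tie = n_called == len(ranked) or ranked[n_called][1] != score
        if is_last_of_tie:
            precision = true_positives / n_called
            recall = true_positives / len(benchmark)
            points.append((score, precision, recall))
    return points


calls = [
    (("chr1", 100, "A", "G"), 0.9),
    (("chr1", 200, "C", "T"), 0.8),  # absent from the benchmark: counted as "incorrect"
    (("chr2", 50, "G", "A"), 0.3),
]
benchmark = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 7, "T", "C")}
points = precision_recall_points(calls, benchmark)
```

Note that any high-scoring call absent from the benchmark lowers the measured precision, whether the call is a genuine error or a real variant that the benchmark lacks, which is why manual inspection of such calls is informative.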
In conclusion, the results of this study suggest that some variants may be missing from the benchmark data, and our follow-up checks largely rule out evaluation errors in the new version of EAGLE as the explanation. EAGLE therefore appears able to detect variants missing from the benchmark data. Because we followed up only on the small number of cases that most strongly affected the results, this conjecture needs to be verified and explained more rigorously in future work.