| Graduate student: | 孔德芳 Kung, Te-Fang |
|---|---|
| Thesis title: | 利用唯一分子標記共通序列準確辨認體細胞變異的統計架構 A statistical framework for accurate somatic variant calling using consensus sequences of unique molecular identifiers |
| Advisor: | 劉宗霖 Liu, Tsung-Lin |
| Degree: | Master |
| Department: | College of Bioscience and Biotechnology - Department of Biotechnology and Bioindustry Sciences |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 |
| Language: | Chinese |
| Number of pages: | 59 |
| Keywords (Chinese): | somatic variant detection (體細胞變異點偵測), molecular tags (分子標籤), PCR error (PCR 錯誤), error model (錯誤模型) |
| Keywords (English): | somatic variant calling, unique molecular identifiers, PCR error, error model |
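
Somatic variant detection is essential to precision medicine. Sequencing errors, however, make accurate analysis of somatic variants difficult. The error rate of next-generation sequencing is roughly 0.1-1%, so when the mutant allele frequency is also about 1% or lower, sequencing errors are easily mistaken for mutations. To distinguish sequencing errors from mutations, unique molecular identifiers (UMIs) can be used to tag the molecule of origin of each sequence before the sequences are amplified by polymerase chain reaction (PCR). When sequences carrying the same UMI are assembled into a consensus sequence, most sequencing errors are removed. PCR errors, however, still affect the calling of somatic point mutations. Existing variant callers can account for PCR errors in UMI sequences, but they estimate the PCR error rate with phenomenological models. This study proposes a statistical framework based on the PCR amplification process to estimate the error rate of consensus sequences, using Bayes' theorem to account for PCR errors and sequencing errors together. It also establishes a pipeline for simulating UMI data, which is used to validate the performance of the error model. On simulated UMI data, the statistical model achieves a higher true positive rate and a very low false positive rate compared with MuTect2 and MAGERI.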
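
The consensus step described above can be illustrated with a minimal Python sketch. It is not the thesis's pipeline: the function name `umi_consensus`, the `min_family_size` parameter, and the toy reads are all assumptions made for illustration, and the reads in each UMI family are assumed to be already aligned and of equal length. Each position is decided by a simple majority vote, which removes isolated sequencing errors but cannot remove a PCR error shared by most reads in a family.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_family_size=2):
    """Build one consensus sequence per UMI by column-wise majority vote.

    reads: iterable of (umi, sequence) pairs; sequences sharing a UMI are
    assumed to be aligned and of equal length (an illustrative assumption).
    Families with fewer than min_family_size reads are skipped.
    """
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)

    consensus = {}
    for umi, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few reads to outvote an error
        # zip(*seqs) yields one column of bases per position.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


# Toy data (hypothetical): the third AACG read carries a sequencing error.
reads = [
    ("AACG", "ACGTAC"),
    ("AACG", "ACGTAC"),
    ("AACG", "ACCTAC"),
    ("TTGA", "ACGAAC"),
    ("TTGA", "ACGAAC"),
    ("TTGA", "ACGAAC"),
    ("GGCT", "ACGTAC"),
]
result = umi_consensus(reads)
```

In this toy input, the single discordant read in family `AACG` is outvoted, while the `T>A` difference in family `TTGA` appears in every read and therefore survives into the consensus. Whether such a surviving base is a true variant or an early PCR error is the question the error model has to answer.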

Somatic variant calling is important in precision medicine. Accurate somatic variant calling, however, is difficult because of sequencing errors. Because the next-generation sequencing error rate and the variant allele frequency are often comparable (0.1-1%), sequencing errors can be misidentified as variants. Fortunately, most sequencing errors can be corrected via unique molecular identifiers (UMIs). In this approach, each DNA molecule is tagged with a UMI, and the tagged molecules are amplified via polymerase chain reaction (PCR) into multiple copies and then sequenced. By assembling sequences with the same UMI into a consensus, sequencing errors should be removed. However, PCR errors may still remain. Current somatic variant calling tools that accept UMI data do consider PCR error, but they all evaluate it using a phenomenological model. Such a model does not reflect the nature of PCR and cannot be connected to the fundamental parameters of PCR error rates. In this study, we develop a statistical framework that mirrors the PCR amplification process. Based on this framework, the error probability of consensus bases is estimated using Bayes' theorem while considering both PCR error and sequencing error. We also set up a pipeline for generating simulated UMI data and use it to test performance. Compared with MuTect2 and MAGERI, our approach shows a higher true positive rate and a comparable false positive rate.
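
To make the role of Bayes' theorem concrete, here is a toy Python calculation for a single UMI family. It is a sketch under stated assumptions and not the framework developed in the thesis: the function name, all parameter names, and all default values are hypothetical. It assumes that, if the original molecule is reference, an early-cycle PCR error occurs with probability `p_early_pcr` and is then carried by a fixed fraction `early_error_fraction` of the family, and that PCR errors are ignored when the original molecule truly carries the variant.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_true_variant(
    n_reads,
    n_alt,
    prior_variant=0.01,        # hypothetical prior that the molecule carries the variant
    seq_error=0.01,            # hypothetical per-base sequencing error rate
    p_early_pcr=1e-4,          # hypothetical chance of an early-cycle PCR error to this base
    early_error_fraction=0.5,  # assumed share of the family inheriting that PCR error
):
    """Posterior probability that the original molecule carries the alt base."""
    to_alt = seq_error / 3  # sequencing error into one specific other base

    # Original molecule is variant: a read shows alt unless mis-sequenced.
    like_variant = binom_pmf(n_alt, n_reads, 1 - seq_error)

    # Original molecule is reference: either only sequencing errors produce
    # alt reads, or an early PCR error seeded part of the family with alt.
    p_alt_with_pcr = (
        early_error_fraction * (1 - seq_error)
        + (1 - early_error_fraction) * to_alt
    )
    like_reference = (
        (1 - p_early_pcr) * binom_pmf(n_alt, n_reads, to_alt)
        + p_early_pcr * binom_pmf(n_alt, n_reads, p_alt_with_pcr)
    )

    numerator = prior_variant * like_variant
    return numerator / (numerator + (1 - prior_variant) * like_reference)


p_all_alt = posterior_true_variant(n_reads=10, n_alt=10)
p_half_alt = posterior_true_variant(n_reads=10, n_alt=5)
```

Under these assumed parameters, a family in which every read shows the alternate base is strongly supported as a true variant, whereas a family split evenly between reference and alternate reads is better explained by an early PCR error.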
Akdemir, K.C., Le, V.T., Kim, J.M., Killcoyne, S., King, D.A., Lin, Y.P., Tian, Y., Inoue, A., Amin, S.B., Robinson, F.S., Nimmakayalu, M., Herrera, R.E., Lynn, E.J., Chan, K., Seth, S., Klimczak, L.J., Gerstung, M., Gordenin, D.A., O'Brien, J., Li, L., Deribe, Y.L., Verhaak, R.G., Campbell, P.J., Fitzgerald, R., Morrison, A.J., Dixon, J.R., and Andrew Futreal, P. Somatic mutation distributions in cancer genomes vary with three-dimensional chromatin structure. Nature Genetics 52, 1178-1188, 2020.
Andrews, T.D., Jeelall, Y., Talaulikar, D., Goodnow, C.C., and Field, M.A. DeepSNVMiner: a sequence analysis tool to detect emergent, rare mutations in subsets of cell populations. PeerJ 4, e2074, 2016.
Benjamin, D., Sato, T., Cibulskis, K., Getz, G., Stewart, C., and Lichtenstein, L. Calling somatic SNVs and indels with Mutect2. bioRxiv, 861054, 2019.
Buermans, H.P., and den Dunnen, J.T. Next generation sequencing technology: Advances and applications. Biochimica et Biophysica Acta - Molecular Basis of Disease 1842, 1932-1941, 2014.
Chatterjee, N., and Walker, G.C. Mechanisms of DNA damage, repair, and mutagenesis. Environmental and Molecular Mutagenesis 58, 235-263, 2017.
Chen, S., Zhou, Y., Chen, Y., Huang, T., Liao, W., Xu, Y., Li, Z., and Gu, J. Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data. BMC Bioinformatics 20, 606, 2019.
Church, D.M., Schneider, V.A., Graves, T., Auger, K., Cunningham, F., Bouk, N., Chen, H.C., Agarwala, R., McLaren, W.M., Ritchie, G.R., Albracht, D., Kremitzki, M., Rock, S., Kotkiewicz, H., Kremitzki, C., Wollam, A., Trani, L., Fulton, L., Fulton, R., Matthews, L., Whitehead, S., Chow, W., Torrance, J., Dunn, M., Harden, G., Threadgold, G., Wood, J., Collins, J., Heath, P., Griffiths, G., Pelan, S., Grafham, D., Eichler, E.E., Weinstock, G., Mardis, E.R., Wilson, R.K., Howe, K., Flicek, P., and Hubbard, T. Modernizing reference genome assemblies. PLOS Biology 9, e1001091, 2011.
Glenn, T.C. Field guide to next‐generation DNA sequencers. Molecular Ecology Resources 11, 759-769, 2011.
Iengar, P. An analysis of substitution, deletion and insertion mutations in cancer genes. Nucleic Acids Research 40, 6401-6413, 2012.
Kou, R., Lam, H., Duan, H., Ye, L., Jongkam, N., Chen, W., Zhang, S., and Li, S. Benefits and Challenges with Applying Unique Molecular Identifiers in Next Generation Sequencing to Detect Low Frequency Mutations. PLoS One 11, e0146638, 2016.
Li, W., Shao, D., Li, L., Wu, M., Ma, S., Tan, X., Zhong, S., Guo, F., Wang, Z., and Ye, M. Germline and somatic mutations of multi-gene panel in Chinese patients with epithelial ovarian cancer: a prospective cohort study. Journal of Ovarian Research 12, 80, 2019.
McInerney, P., Adams, P., and Hadi, M.Z. Error Rate Comparison during Polymerase Chain Reaction by DNA Polymerase. Molecular Biology International 2014, 287430, 2014.
Metzker, M.L. Sequencing technologies - the next generation. Nature Reviews Genetics 11, 31-46, 2010.
Mullis, K.B. The unusual origin of the polymerase chain reaction. Scientific American 262, 56-61, 64-65, 1990.
Potapov, V., and Ong, J.L. Examining Sources of Error in PCR by Single-Molecule Sequencing. PLoS One 12, 2017.
Qin, D. Next-generation sequencing and its clinical application. Cancer Biology and Medicine 16, 4-10, 2019.
Rieber, N., Zapatka, M., Lasitschka, B., Jones, D., Northcott, P., Hutter, B., Jäger, N., Kool, M., Taylor, M., and Lichter, P. Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS One 8, e66621, 2013.
Rodríguez-Lázaro, D., and Hernández, M. Real-time PCR in food science: introduction. Current Issues in Molecular Biology 15, 25-38, 2013.
Scally, A. The mutation rate in human evolution and demographic inference. Current Opinion in Genetics and Development 41, 36-43, 2016.
Shugay, M., Zaretsky, A.R., Shagin, D.A., Shagina, I.A., Volchenkov, I.A., Shelenkov, A.A., Lebedin, M.Y., Bagaev, D.V., Lukyanov, S., and Chudakov, D.M. MAGERI: Computational pipeline for molecular-barcoded targeted resequencing. PLoS Computational Biology 13, 2017.
Smith, T., Heger, A., and Sudbery, I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research 27, 491-499, 2017.
Strom, S.P. Current practices and guidelines for clinical next-generation sequencing oncology testing. Cancer Biology and Medicine 13, 3-11, 2016.
Svec, D., Tichopad, A., Novosadova, V., Pfaffl, M.W., and Kubista, M. How good is a PCR efficiency estimate: Recommendations for precise and robust qPCR efficiency assessments. Biomolecular Detection and Quantification 3, 9-16, 2015.
Xu, C. A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Computational and Structural Biotechnology Journal 16, 15-24, 2018.
Xu, C., Gu, X., Padmanabhan, R., Wu, Z., Peng, Q., DiCarlo, J., and Wang, Y. smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers. Bioinformatics 35, 1299-1309, 2019.
Xu, C., Ranjbar, M.R.N., Wu, Z., DiCarlo, J., and Wang, Y. Detecting very low allele fraction variants using targeted DNA sequencing and a novel molecular barcode-aware variant caller. BMC Genomics 18, 1-11, 2017.
Zook, J.M., McDaniel, J., Olson, N.D., Wagner, J., Parikh, H., Heaton, H., Irvine, S.A., Trigg, L., Truty, R., McLean, C.Y., De La Vega, F.M., Xiao, C., Sherry, S., and Salit, M. An open resource for accurately benchmarking small variant and reference calls. Nature Biotechnology 37, 561-566, 2019.