
Graduate student: Haentze, Hartmut (韓海墨)
Thesis title: Effects of Spaced Seeds on Alignment-Free Genotyping with PanGenie
Advisor: Horton, Paul (賀保羅)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2023
Graduation academic year: 111 (ROC calendar)
Language: English
Number of pages: 42
Chinese keywords: alignment-free, 基因分析 (genetic analysis), k-mers, spaced seeds, NGS
English keywords: alignment-free, genotyping, k-mers, spaced seeds, NGS
Usage counts: views: 166, downloads: 18
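  • Alignment-free, k-mer-based genotyping methods such as PanGenie are a fast alternative to alignment-based methods and are especially suitable for genotyping larger cohorts. Using spaced seeds can increase the sensitivity of k-mer-based methods, for example in homology search, metagenomic classification, and phylogenetic reconstruction. However, the use of spaced seeds in k-mer-based genotyping has not yet been studied. This thesis adds spaced-seed functionality to the genotyper PanGenie and uses it to compute genotypes. This significantly improves PanGenie's sensitivity, F-score, and weighted genotype concordance (wGC) when genotyping SNPs, indels, and SVs at low (5x) and high (30x) coverage. The improvement is greater than what can be achieved merely by increasing the length of contiguous k-mers. The effect is particularly pronounced for indel genotyping, where wGC increases by up to 6.5% at 30x coverage. If applications can implement efficient algorithms for hashing spaced k-mers, spaced seeds could become a useful technique for k-mer-based genotyping.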

    Alignment-free, k-mer-based genotyping methods such as PanGenie are a fast alternative to alignment-based methods and are particularly well suited for genotyping larger cohorts. The sensitivity of algorithms that work with k-mers can be increased by using spaced seeds, for example in applications for homology search, metagenomic classification, and phylogenetic reconstruction. However, the application of spaced seeds in k-mer-based genotyping methods has not yet been researched. This thesis adds spaced-seed functionality to the genotyper PanGenie and uses it to calculate genotypes. This significantly improves the sensitivity, F-score, and weighted genotype concordance (wGC) of PanGenie when genotyping SNPs, indels, and SVs on reads with low (5x) and high (30x) coverage. The improvements are greater than what could be achieved by just increasing the length of contiguous k-mers. The effects are particularly large for the genotyping of indels, with up to 6.5% increased wGC at 30-fold coverage. If applications implement effective algorithms for hashing spaced k-mers, spaced seeds have the potential to become a useful technique in k-mer-based genotyping.
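    The difference between contiguous and spaced k-mers mentioned in the abstract can be illustrated with a minimal sketch. The mask "1101011", the toy sequences, and the function names below are hypothetical choices made for illustration only; they are not the seed, data, or code used in the thesis or in PanGenie. A "1" in the mask marks a position that is kept, a "0" a position that is ignored, so a single mismatch that falls on a "0" leaves the spaced k-mer unchanged.

```python
def spaced_kmers(seq, mask):
    """Extract one spaced k-mer per window, keeping only positions where mask has '1'.

    A mask consisting only of '1' characters yields ordinary contiguous k-mers.
    """
    span = len(mask)
    care = [j for j, c in enumerate(mask) if c == "1"]
    return ["".join(seq[i + j] for j in care) for i in range(len(seq) - span + 1)]


def windows_surviving_mismatch(a, b, mask, pos):
    """Count windows that cover position `pos` and still give identical k-mers in a and b."""
    span = len(mask)
    pairs = zip(spaced_kmers(a, mask), spaced_kmers(b, mask))
    return sum(1 for i, (x, y) in enumerate(pairs) if i <= pos < i + span and x == y)


# Toy sequences (assumed for illustration): they differ only at index 5 (C vs. T).
ref = "ACGTACGGTCA"
alt = "ACGTATGGTCA"

contiguous = "11111"   # weight 5, span 5
spaced = "1101011"     # hypothetical seed: weight 5, span 7

print(windows_surviving_mismatch(ref, alt, contiguous, 5))
print(windows_surviving_mismatch(ref, alt, spaced, 5))
```

    In this toy case, every contiguous 5-mer that overlaps the mismatch differs between the two sequences, while two of the spaced windows place the mismatch on an ignored position and therefore still agree. Both masks have the same weight (five kept positions), which is the sense in which spaced seeds are compared against contiguous k-mers.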

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Nomenclature
    1 Introduction
    2 Background & Related Work
        2.1 K-mers
            2.1.1 (Contiguous) k-mers
            2.1.2 Spaced k-mers
        2.2 K-mer Based Genotyping
            2.2.1 PanGenie
            2.2.2 Spaced seeds in alignment-free genotyping
        2.3 Spaced Seeds
            2.3.1 Definition
            2.3.2 Seed selection
            2.3.3 Structure of seeds
            2.3.4 Counting spaced k-mers
    3 Research Methodology
        3.1 Problem formalization
        3.2 MaskedPanGenie
        3.3 MaskJelly
    4 Data
        4.1 Pangenome reference
        4.2 Reads
    5 Replication of Original PanGenie Study
    6 Results
        6.1 Computing a good seed
        6.2 Statistics
        6.3 Effects on genotyping of SNV, Indel & SV
        6.4 Genotype sample NA12878
        6.5 Runtime
    7 Discussion & Future Works
    8 Conclusion
    A Supplementary Material
    Bibliography

    [1] Jana Ebler et al. “Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes”. Nature genetics 54.4 (2022), pp. 518–525.
    [2] Saul B Needleman and Christian D Wunsch. “A general method applicable to the search for similarities in the amino acid sequence of two proteins”. Journal of molecular biology 48.3 (1970), pp. 443–453.
    [3] Heng Li. “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM”. arXiv preprint arXiv:1303.3997 (2013).
    [4] Bin Ma, John Tromp, and Ming Li. “PatternHunter: faster and more sensitive homology search”. Bioinformatics 18.3 (2002), pp. 440–445.
    [5] Stephen F Altschul et al. “Basic local alignment search tool”. Journal of molecular biology 215.3 (1990), pp. 403–410.
    [6] Daniel R Zerbino and Ewan Birney. “Velvet: algorithms for de novo short read assembly using de Bruijn graphs”. Genome research 18.5 (2008), pp. 821–829.
    [7] Derrick E Wood and Steven L Salzberg. “Kraken: ultrafast metagenomic sequence classification using exact alignments”. Genome biology 15.3 (2014), pp. 1–12.
    [8] Ariya Shajii et al. “Fast genotyping of known SNPs through approximate k-mer matching”. Bioinformatics 32.17 (2016), pp. i538–i544.
    [9] Stefan Burkhardt and Juha Kärkkäinen. “Better filtering with gapped q-grams”. Fundamenta informaticae 56.1-2 (2003), pp. 51–70.
    [10] Chris-Andre Leimeister et al. “Fast alignment-free sequence comparison using spaced-word frequencies”. Bioinformatics 30.14 (2014), pp. 1991–1999.
    [11] Karel Břinda, Maciej Sykulski, and Gregory Kucherov. “Spaced seeds improve k-mer-based metagenomic classification”. Bioinformatics 31.22 (2015), pp. 3584–3592.
    [12] Derrick E Wood, Jennifer Lu, and Ben Langmead. “Improved metagenomic analysis with Kraken 2”. Genome biology 20.1 (2019), pp. 1–13.
    [13] Roger Byard and Jason Payne-James. Encyclopedia of forensic and legal medicine. Academic Press, 2016, pp. 290–296.
    [14] Aaron McKenna et al. “The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data”. Genome research 20.9 (2010), pp. 1297–1303.
    [15] Andrzej Zielezinski et al. “Alignment-free sequence comparison: benefits, applications, and tools”. Genome biology 18.1 (2017), pp. 1–17.
    [16] Susana Vinga. Alignment-free methods in computational biology. 2014.
    [17] Susana Vinga and Jonas Almeida. “Alignment-free sequence comparison—a review”. Bioinformatics 19.4 (2003), pp. 513–523.
    [18] Luca Denti et al. “MALVA: genotyping by Mapping-free ALlele detection of known VAriants”. Iscience 18 (2019), pp. 20–27.
    [19] Jonas Andreas Sibbesen, Lasse Maretty, and Anders Krogh. “Accurate genotyping across variant classes and lengths using variant graphs”. Nature genetics 50.7 (2018), pp. 1054–1059.
    [20] Na Li and Matthew Stephens. “Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data”. Genetics 165.4 (2003), pp. 2213–2233.
    [21] Nicholas Stoler and Anton Nekrutenko. “Sequencing error profiles of Illumina sequencing instruments”. NAR genomics and bioinformatics 3.1 (2021), lqab019.
    [22] Lucian Ilie and Silvana Ilie. “Multiple spaced seeds for homology search”. Bioinformatics 23.22 (2007), pp. 2969–2977.
    [23] Uri Keich et al. “On spaced seeds for similarity search”. Discrete applied mathematics 138.3 (2004), pp. 253–263.
    [24] Kwok Pui Choi and Louxin Zhang. “Sensitivity analysis and efficient method for identifying optimal spaced seeds”. Journal of Computer and System Sciences 68.1 (2004), pp. 22–40.
    [25] Bin Ma and Ming Li. “On the complexity of the spaced seeds”. Journal of Computer and System Sciences 73.7 (2007), pp. 1024–1034.
    [26] Lucian Ilie, Silvana Ilie, and Anahita Mansouri Bigvand. “SpEED: fast computation of sensitive spaced seeds”. Bioinformatics 27.17 (2011), pp. 2433–2434.
    [27] Hamid Mohamadi et al. “ntHash: recursive nucleotide hashing”. Bioinformatics 32.22 (2016), pp. 3492–3494.
    [28] Samuele Girotto, Matteo Comin, and Cinzia Pizzi. “Efficient computation of spaced seed hashing with block indexing”. BMC bioinformatics 19.15 (2018), pp. 29–38.
    [29] Enrico Petrucci et al. “Iterative spaced seed hashing: closing the gap between spaced seed hashing and k-mer hashing”. Journal of Computational Biology 27.2 (2020), pp. 223–233.
    [30] Peter Krusche et al. “Best practices for benchmarking germline small-variant calls in human genomes”. Nature biotechnology 37.5 (2019), pp. 555–560.
    [31] Guillaume Marçais and Carl Kingsford. “A fast, lock-free approach for efficient parallel counting of occurrences of k-mers”. Bioinformatics 27.6 (2011), pp. 764–770.
    [32] Peter Ebert et al. “Haplotype-resolved diverse human genomes and integrated analysis of structural variation”. Science 372.6537 (2021), eabf7117.
    [33] Mark JP Chaisson et al. “Multi-platform discovery of haplotype-resolved structural variation in human genomes”. Nature communications 10.1 (2019), pp. 1–16.
    [34] Broad Institute. Picard Tools. http://broadinstitute.github.io/picard/. Accessed: 2022-09-01; version 2.27.4-SNAPSHOT. 2022.
    [35] STR/VNTR regions. https://bitbucket.org/jana_ebler/genotyping-experiments/src/master/data/ucsc-repeats/. [Online; accessed 20-November-2022].
    [36] Donna Karolchik et al. “The UCSC Table Browser data retrieval tool”. Nucleic acids research 32.suppl_1 (2004), pp. D493–D496.
    [37] Illumina short reads for NA24385. https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/. [Online; accessed 20-November-2022].
    [38] Douglas G Altman et al. “Statistical guidelines for contributors to medical journals.” British medical journal (Clinical research ed.) 286.6376 (1983), p. 1489.
    [39] Roger Mundry and Julia Fischer. “Use of statistical programs for nonparametric tests of small samples often leads to incorrect P values: examples from animal behaviour”. Animal behaviour 56.1 (1998), pp. 256–259.
    [40] Frank Wilcoxon. “Individual comparisons by ranking methods”. Breakthroughs in statistics. Springer, 1992, pp. 196–202.
    [41] Hartmut Häntze. PanGenie nondeterminstic when executed with a Jellyfish database and multiple threads. https://github.com/eblerjana/pangenie/issues/13. [Online; accessed 20-November-2022]. 2022

    Publicly available from 2023-07-30