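Graduate student: Wu, Chi-Liang (吳其亮)
Thesis title: Improving the Speed and Accuracy of the MaskedPanGenie Method of Genotyping (優化MaskedPanGenie在基因分型上的速度與準確率)
Advisor: Horton, Paul (賀保羅)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 42
Chinese keywords: 基因分型 (genotyping), 全基因組 (pangenome), 間隔種子 (spaced seeds), 雜湊表 (hash table)
English keywords: genotyping, pangenome, spaced seeds, hash table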
  • With the development of next-generation sequencing (NGS), whole-genome studies have become increasingly important in biomedicine and genetics. They allow a comprehensive survey of the variation in an individual's genome, including single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and structural variants (SVs). These variants are associated with many complex traits and diseases, and understanding them helps uncover the genetic basis of disease.
    MaskedPanGenie, published by Häntze and Horton, uses spaced seeds and substantially improves genotyping results at 5x and 30x read coverage compared to the original PanGenie. However, it also increases execution time substantially. Cheng and Horton proposed improvements to address this bottleneck, and this thesis further improves on their method.
    First, we propose PalindromeRasbhari, a new method for generating palindromic seeds, which produces better-performing seeds than PalindromeSpEED. Second, we implement the hash table with a new method, which reduces the execution time of spaced k-mer counting.
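
For readers unfamiliar with the terms in the abstract, the following is a minimal Python sketch of spaced k-mer counting with a palindromic seed. It is not the thesis implementation: the function names, the example seed `1101011`, and the use of `collections.Counter` in place of a custom hash table are all assumptions made for illustration. It shows why palindromic seeds are convenient: when the seed reads the same in both directions, masking commutes with reverse complementation, so a masked k-mer can be put in canonical form after masking.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindromic(seed: str) -> bool:
    """A seed is palindromic if it equals its own reversal."""
    return seed == seed[::-1]


def apply_seed(window: str, seed: str) -> str:
    """Keep only the bases at positions where the seed has a '1'."""
    if len(window) != len(seed):
        raise ValueError("window and seed must have the same length")
    return "".join(base for base, bit in zip(window, seed) if bit == "1")


def count_spaced_kmers(read: str, seed: str) -> Counter:
    """Count canonical masked k-mers in one read (illustrative only).

    Canonicalizing after masking is strand-consistent only for
    palindromic seeds, so other seeds are rejected here.
    """
    if not is_palindromic(seed):
        raise ValueError("this sketch assumes a palindromic seed")
    counts: Counter = Counter()
    span = len(seed)
    for start in range(len(read) - span + 1):
        masked = apply_seed(read[start:start + span], seed)
        # Canonical form: lexicographically smaller of the two strands.
        counts[min(masked, reverse_complement(masked))] += 1
    return counts


if __name__ == "__main__":
    seed = "1101011"          # hypothetical palindromic seed
    read = "ACGTTGACCA"       # hypothetical read
    print(count_spaced_kmers(read, seed))
    print(count_spaced_kmers(reverse_complement(read), seed))
```

Under these assumptions, a read and its reverse complement produce the same counts, which is the property a canonical k-mer counter relies on. A non-palindromic seed would in general break this, because masking the reverse strand would select different positions.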

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Nomenclature
    1 Introduction
    2 Background & Related Work
      2.1 K-mer based Genotyping
        2.1.1 PanGenie
      2.2 Spaced seeds
        2.2.1 Definition
        2.2.2 Seed optimization
      2.3 Related tools
        2.3.1 SpEED
        2.3.2 Rasbhari
        2.3.3 Jellyfish
    3 Research Methodology
      3.1 Spaced seeds generation
        3.1.1 Canonical form
        3.1.2 Palindrome spaced seeds
        3.1.3 PalindromeRasbhari
      3.2 Fast spaced seed hashing
      3.3 MaskedPanGenie
      3.4 Dataset
        3.4.1 Pangenome reference
        3.4.2 Reads
    4 Results
      4.1 Spaced seed on Genotyping
        4.1.1 Sensitivity
        4.1.2 Genotyping results
      4.2 Performance of MaskedPanGenie
        4.2.1 Runtime
        4.2.2 Memory
    5 Discussion & Future Work
    6 Conclusion
    Bibliography

    [1] Hartmut Häntze and Paul Horton. “Effects of Spaced Seeds on Alignment-Free Genotyping with PanGenie”. Bioinformatics (2023).
    [2] Yi-Tsung Cheng and Paul Horton. “Accelerating Genotyping—Resolving the computational bottleneck of MaskedPanGenie” (2023).
    [3] Stephen F. Altschul et al. “Basic local alignment search tool”. Journal of Molecular Biology 215.3 (1990), pp. 403–410.
    [4] Lucian Ilie, Silvana Ilie, and Anahita Mansouri Bigvand. “SpEED: fast computation of sensitive spaced seeds”. Bioinformatics 27.17 (2011), pp. 2433–2434.
    [5] Lars Hahn et al. “rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison”. PLoS Computational Biology 12 (2016), e1005107.
    [6] Samuele Girotto et al. “FSH: fast spaced seed hashing exploiting adjacent hashes”. Algorithms for Molecular Biology 13.8 (2018).
    [7] Camille Moeckel et al. “A survey of k-mer methods and applications in bioinformatics”. Computational and Structural Biotechnology Journal (2024), pp. 2289–2303.
    [8] Susana Vinga. “Alignment-free methods in computational biology”. Briefings in Bioinformatics (2014), pp. 341–342.
    [9] Guillaume Marçais and Carl Kingsford. “A fast, lock-free approach for efficient parallel counting of occurrences of k-mers”. Bioinformatics 27.6 (2011), pp. 764–770.
    [10] Atif Rahman and Lior Pachter. “CGAL: computing genome assembly likelihoods”. Genome Biology (2013).
    [11] Jana Ebler et al. “Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes”. Nature Genetics 54.4 (2022), pp. 518–525.
    [12] Laurent Noé. “Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds”. Algorithms for Molecular Biology (2017).
    [13] Kwok Pui Choi and Louxin Zhang. “Sensitivity analysis and efficient method for identifying optimal spaced seeds”. Journal of Computer and System Sciences (2004), pp. 22–40.
    [14] Kwok Pui Choi, Fanfan Zeng, and Louxin Zhang. “Good spaced seeds for homology search”. Bioinformatics 20 (2004), pp. 1053–1059.
    [15] Uri Keich et al. “On Spaced Seeds for Similarity Search”. Discrete Appl. Math. 138.3 (2004), pp. 253–263.
    [16] Justin M. Zook et al. “Extensive sequencing of seven human genomes to characterize benchmark reference materials”. Scientific Data (2016).
    [17] Marta Byrska-Bishop et al. “High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios”. Cell (2021).
    [18] Peter Ebert et al. “Haplotype-resolved diverse human genomes and integrated analysis of structural variation”. Science 372 (2021).
    [19] Heng Li. “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM”. arXiv preprint (2013).
    [20] Enliang Li et al. “Improved GPU Implementations of the Pair-HMM Forward Algorithm for DNA Sequence Alignment”. IEEE (2021).
    [21] Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. “DSK: k-mer counting with very low memory usage”. Bioinformatics 29 (2013), pp. 652–653.
    [22] Mikhail Roytberg et al. “On Subset Seeds for Protein Alignment”. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6.3 (2009).

    Full-text access: on campus, available immediately; off campus, available immediately.