
Author: Ruan, Chi-Han (阮祈翰)
Thesis title: Improving Pangenome-based Genotyping Accuracy and Efficiency Using Subset Seeds and GPU Parallelization
Advisor: Horton, Paul (賀保羅)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2025
Academic year of graduation: 113 (ROC calendar, 2024-2025)
Language: English
Number of pages: 39
Keywords: Genotyping, Subset Seeds, GPU
  • In recent years, alignment-free genotyping has received increasing attention in the genomics community, leading to the development of tools such as PanGenie, which achieves high genotyping accuracy using contiguous k-mers. Building on this approach, Häntze and Horton introduced MaskedPanGenie, which incorporates spaced seeds to enhance sensitivity. Comparative analyses showed that MaskedPanGenie notably improves genotyping accuracy over the original PanGenie at both 5× and 30× read coverage. However, its substantially longer runtime is a major bottleneck in practical applications.
    In this study, we first replace spaced seeds with a new seed design, subset seeds, which further improves genotyping accuracy. In addition, we use GPU acceleration for the Hidden Markov Model (HMM)-based algorithm, significantly reducing the overall runtime.
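    The difference between the three seed types named in the abstract can be illustrated with a minimal sketch. It assumes the notation commonly used for subset seeds in the seed-design literature, which may differ from the formal definition given in the thesis: `#` requires an exact match, `@` tolerates a transition (A<->G or C<->T), and `-` ignores the position. The function names and the example seed patterns are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: seed symbols, patterns, and function names are
# assumptions for this example, not the seeds or code used in the thesis.
PURINES = {"A", "G"}


def apply_seed(kmer: str, seed: str) -> str:
    """Project a k-mer (uppercase A/C/G/T) through a seed pattern.

    '#' keeps the nucleotide (exact match required),
    '@' keeps only the purine/pyrimidine class (tolerates transitions),
    '-' discards the position (any substitution tolerated).
    """
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    projected = []
    for base, symbol in zip(kmer, seed):
        if symbol == "#":
            projected.append(base)
        elif symbol == "@":
            # R = purine (A/G), Y = pyrimidine (C/T)
            projected.append("R" if base in PURINES else "Y")
        elif symbol == "-":
            projected.append("-")
        else:
            raise ValueError(f"unknown seed symbol: {symbol!r}")
    return "".join(projected)


def seed_hit(kmer_a: str, kmer_b: str, seed: str) -> bool:
    """Two k-mers hit under a seed if their projections are identical."""
    return apply_seed(kmer_a, seed) == apply_seed(kmer_b, seed)


reference = "ACGTACG"
transition = "ACATACG"    # G -> A at the third position
transversion = "ACCTACG"  # G -> C at the third position

contiguous_seed = "#######"
spaced_seed = "##-####"
subset_seed = "##@####"
```

    With these example patterns, the contiguous seed misses both variant k-mers, the spaced seed hits both, and the subset seed hits the transition but rejects the transversion. A subset seed therefore sits between the other two in how much mismatch it tolerates at a given position.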

    Chinese Abstract (p. i)
    Abstract (p. ii)
    Acknowledgements (p. iv)
    Contents (p. v)
    List of Tables (p. vii)
    List of Figures (p. viii)
    Nomenclature (p. ix)
    1 Introduction (p. 1)
    2 Background & Related Work (p. 3)
        2.1 K-mers (p. 3)
            2.1.1 Contiguous k-mers (p. 3)
            2.1.2 Spaced seed k-mers (p. 4)
        2.2 K-mer Based Genotyping (p. 4)
            2.2.1 PanGenie (p. 5)
            2.2.2 MaskedPanGenie (p. 7)
        2.3 Related tools (p. 8)
            2.3.1 Jellyfish & MaskJelly (p. 8)
        2.4 Performance metrics (p. 8)
            2.4.1 Sensitivity (p. 8)
            2.4.2 Weight Genotype Concordance (wGC) (p. 9)
    3 Methods (p. 10)
        3.1 Subset Seed (p. 10)
            3.1.1 Formal definition (p. 10)
            3.1.2 Illustrative example (p. 11)
        3.2 Hidden Markov Model (HMM) (p. 12)
            3.2.1 Genotyping (p. 14)
            3.2.2 Implement GPU (p. 15)
    4 Data (p. 18)
        4.1 Pangenome reference (p. 18)
        4.2 Reads (p. 18)
    5 Results (p. 20)
        5.1 Different seeds on Genotyping (p. 20)
            5.1.1 Sensitivity (p. 20)
            5.1.2 Genotyping results (p. 20)
        5.2 Runtime for MaskedPanGenie (p. 23)
    6 Discussion & Future Work (p. 24)
    7 Conclusion (p. 26)
    Bibliography (p. 27)

    [1] Broad Institute, Picard Toolkit, https://broadinstitute.github.io/picard/, Version 2.27.4-SNAPSHOT. Accessed 2022-09-01, 2022.
    [2] J. Ebler, P. Ebert, W. E. Clarke, T. Rausch, P. A. Audano, T. Houwaart, Y. Mao, J. O. Korbel, E. E. Eichler, M. C. Zody, et al., “Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes,” Nature Genetics, vol. 54, no. 4, pp. 518–525, 2022.
    [3] H. Häntze and P. Horton, “Effects of spaced k-mers on alignment-free genotyping,” Bioinformatics, vol. 39, no. Supplement_1, pp. i213–i221, 2023.
    [4] L. Ilie, S. Ilie, and A. Mansouri Bigvand, “SpEED: Fast computation of sensitive spaced seeds,” Bioinformatics, vol. 27, no. 17, pp. 2433–2434, 2011.
    [5] P. Krusche, L. Trigg, P. C. Boutros, C. E. Mason, F. M. De La Vega, B. L. Moore, M. Gonzalez-Porta, M. A. Eberle, Z. Tezak, S. Lababidi, et al., “Best practices for benchmarking germline small-variant calls in human genomes,” Nature Biotechnology, vol. 37, no. 5, pp. 555–560, 2019.
    [6] H. Li, “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM,” arXiv preprint arXiv:1303.3997, 2013.
    [7] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and more sensitive homology search,” Bioinformatics, vol. 18, no. 3, pp. 440–445, 2002.
    [8] G. Marçais and C. Kingsford, “A fast, lock-free approach for efficient parallel counting of occurrences of k-mers,” Bioinformatics, vol. 27, no. 6, pp. 764–770, 2011.
    [9] A. Rahman and L. Pachter, “CGAL: Computing genome assembly likelihoods,” Genome Biology, vol. 14, pp. 1–10, 2013.

    Off-campus access: open immediately