| Graduate student: | Cheng, Yi-Tsung (鄭益宗) |
|---|---|
| Thesis title: | Accelerating Genotyping—Resolving the computational bottleneck of MaskedPanGenie |
| Advisor: | Horton, Paul (賀保羅) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Graduate Program of Artificial Intelligence |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 |
| Language: | English |
| Number of pages: | 45 |
| Keywords: | genotyping, palindrome spaced seeds, k-mer counting, sensitivity |
Pangenomic approaches are increasingly important for genotyping, inspiring the development of tools such as PanGenie. Recently, Häntze & Horton published MaskedPanGenie, which extends the PanGenie approach to support spaced seeds. Their results showed improved genotyping performance at both low (5x) and high (30x) read coverage, particularly for indel genotypes. Promisingly, the observed increase in whole-genome concordance reached up to 6.5% at high coverage (30x). However, their study also had limitations. In particular, their implementation of MaskedPanGenie is much slower than PanGenie. They also tried only a small number of seeds and did not provide a systematic method for generating seeds optimized for the genotyping task. The work in this report addresses those limitations.

First, for the speed-up, we propose MaskedJellyfish to enable spaced seed k-mer counting in the MaskedPanGenie pipeline. By incorporating the MaskedJellyfish library for spaced k-mer counting, we eliminate the additional preprocessing time that spaced k-mer counting otherwise requires and simplify the MaskedPanGenie pipeline. Second, for seed generation, we propose PalindromeSpEED, which generates better palindrome spaced seeds by iteratively optimizing a random seed through flipping and evaluating sensitivity.
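To make the idea of spaced seed k-mer counting concrete, the following is a minimal pure-Python sketch. It is not the MaskedJellyfish code, whose internals are not described in this abstract. The function names and the `'1'`/`'0'` mask notation are assumptions made for illustration. The sketch applies the mask first and then takes the canonical (strand-independent) form of the masked k-mer. Under that assumed order of operations, a palindromic mask guarantees that a window and its reverse complement map to the same key, which may be one reason palindromic seeds are convenient here.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def apply_mask(window: str, mask: str) -> str:
    """Keep the bases at '1' positions of the mask and drop the rest."""
    return "".join(base for base, m in zip(window, mask) if m == "1")


def count_spaced_kmers(seq: str, mask: str) -> Counter:
    """Count canonical spaced k-mers of seq under a palindromic mask.

    Assumption: the mask is applied before canonicalization, so the
    mask must read the same forwards and backwards for the counts to
    be independent of the strand.
    """
    if mask != mask[::-1]:
        raise ValueError("mask must be palindromic")
    counts = Counter()
    span = len(mask)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        if set(window) - set("ACGT"):
            continue  # skip windows containing N or other non-ACGT symbols
        masked = apply_mask(window, mask)
        # Canonical form: the lexicographically smaller of the two strands.
        counts[min(masked, reverse_complement(masked))] += 1
    return counts
```

For example, `count_spaced_kmers("ACGTACGGT", "11011")` counts five windows of span 5, each reduced to four informative bases, and gives the same table for the reverse complement of the input.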
We have two main results. The first is a substantial (roughly 2x to 3x) speed-up over the original MaskedPanGenie workflow. The second is a method for optimizing spaced seeds for this task, which we show leads to a modest improvement in performance.
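The seed optimization summarized above (start from a random palindromic seed, flip positions, and keep changes that improve sensitivity) could be sketched roughly as follows. This is an illustration under stated assumptions, not the PalindromeSpEED algorithm itself. Sensitivity here is the classic hit probability from the homology-search literature, estimated by Monte Carlo on random alignments. A "flip" is taken to be a weight-preserving swap of a `'1'` and a `'0'` in one half of the seed, mirrored onto the other half. The span and weight are restricted to even values for simplicity. The thesis may define all of these differently.

```python
import random


def random_alignments(n: int, length: int, p_match: float, rng: random.Random):
    """Sample n alignments as tuples of booleans (True = matching column)."""
    return [tuple(rng.random() < p_match for _ in range(length)) for _ in range(n)]


def hits(seed: str, alignment) -> bool:
    """True if all '1' positions of the seed fall on matches at some offset."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    return any(
        all(alignment[offset + i] for i in care)
        for offset in range(len(alignment) - len(seed) + 1)
    )


def sensitivity(seed: str, alignments) -> float:
    """Fraction of the sampled alignments that the seed hits."""
    return sum(hits(seed, a) for a in alignments) / len(alignments)


def mirror(half) -> str:
    """Build a palindromic seed from its left half."""
    return "".join(half) + "".join(reversed(half))


def optimize_seed(span: int, weight: int, alignments, rng: random.Random, steps: int = 100):
    """Hill-climb over palindromic seeds of fixed span and weight.

    Hypothetical simplification: span and weight must be even, and the
    outermost positions are always '1'.
    """
    if span % 2 or weight % 2:
        raise ValueError("this sketch only handles even span and weight")
    half_len, ones = span // 2, weight // 2
    if not 1 <= ones <= half_len:
        raise ValueError("weight does not fit the span")
    inner = ["1"] * (ones - 1) + ["0"] * (half_len - ones)
    rng.shuffle(inner)
    half = ["1"] + inner  # random starting seed (left half only)
    best = mirror(half)
    best_score = sensitivity(best, alignments)
    for _ in range(steps):
        one_pos = [i for i in range(1, half_len) if half[i] == "1"]
        zero_pos = [i for i in range(1, half_len) if half[i] == "0"]
        if not one_pos or not zero_pos:
            break  # no weight-preserving flip is possible
        i, j = rng.choice(one_pos), rng.choice(zero_pos)
        half[i], half[j] = half[j], half[i]
        candidate = mirror(half)
        score = sensitivity(candidate, alignments)
        if score > best_score:
            best, best_score = candidate, score
        else:
            half[i], half[j] = half[j], half[i]  # undo the flip
    return best, best_score
```

Scoring every candidate on the same sampled alignments keeps comparisons between seeds consistent, and passing a seeded `random.Random` instance makes a run reproducible.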