| Graduate student: | Wu, Chi-Liang (吳其亮) |
|---|---|
| Thesis title: | Improving the Speed and Accuracy of the MaskedPanGenie Method of Genotyping (優化MaskedPanGenie在基因分型上的速度與準確率) |
| Advisor: | Horton, Paul (賀保羅) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2024 |
| Academic year of graduation: | 112 (ROC calendar) |
| Language: | English |
| Number of pages: | 42 |
| Keywords (Chinese): | 基因分型, 全基因組, 間隔種子, 雜湊表 |
| Keywords (English): | genotyping, pangenome, spaced seeds, hash table |
With the development of next-generation sequencing (NGS), whole-genome studies have become increasingly important in biomedicine and genetics because they allow a comprehensive examination of the variation in an individual's genome, including single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and structural variants (SVs). These variants are associated with many complex traits and diseases, and understanding them helps uncover the genetic basis of disease.
MaskedPanGenie, published by Häntze and Horton, uses spaced seeds and substantially improves genotyping results at 5x and 30x read coverage compared with the original PanGenie. However, the execution time of MaskedPanGenie is substantially longer. Cheng and Horton proposed improvements to address this problem, and this thesis further improves on their method.
First, we propose PalindromeRasbhari, a new method for generating palindromic seeds; the seeds it produces perform better than those produced by PalindromeSpEED. Second, we implement the hash table with a new method, which reduces the execution time of spaced k-mer counting.
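To make the terms in the abstract concrete, the following is a minimal Python sketch of spaced k-mer counting with a palindromic seed. It is not the thesis implementation: the seed `1101011`, the function names, and the use of `collections.Counter` as the hash table are all assumptions chosen for illustration. It shows one property of a palindromic (symmetric) seed that can be checked directly: masking the reverse complement of a window gives the reverse complement of the masked window.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(seed: str) -> bool:
    """A seed mask is palindromic if it reads the same in both directions."""
    return seed == seed[::-1]


def apply_seed(window: str, seed: str) -> str:
    """Keep the bases at '1' (care) positions and drop those at '0' positions."""
    return "".join(base for base, bit in zip(window, seed) if bit == "1")


def count_spaced_kmers(sequence: str, seed: str) -> Counter:
    """Count canonical spaced k-mers; Counter acts as a simple hash table."""
    counts = Counter()
    span = len(seed)
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        if set(window) - set("ACGT"):  # skip windows with N or other symbols
            continue
        forward = apply_seed(window, seed)
        backward = apply_seed(reverse_complement(window), seed)
        # The canonical key is the lexicographically smaller of the two strands.
        counts[min(forward, backward)] += 1
    return counts


# Illustrative seed (an assumption, not a seed taken from the thesis).
SEED = "1101011"
print(apply_seed("ACGTACG", SEED))  # prints ACTCG
print(count_spaced_kmers("ACGTACGTTGCATTGA", SEED).most_common(3))
```

With a palindromic seed, the reverse-strand key can be derived from the forward-strand key alone, so an implementation does not need to mask the reverse-complemented window separately. A production counter would typically encode bases in two bits and use a purpose-built hash table rather than Python strings and `Counter`.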
[1] Hartmut Häntze and Paul Horton. “Effects of Spaced Seeds on Alignment-Free Genotyping with PanGenie”. Bioinformatics (2023).
[2] Yi-Tsung Cheng and Paul Horton. “Accelerating Genotyping—Resolving the computational bottleneck of MaskedPanGenie” (2023).
[3] Stephen F. Altschul et al. “Basic local alignment search tool”. Journal of Molecular Biology 215.3 (1990), pp. 403–410.
[4] Lucian Ilie, Silvana Ilie, and Anahita Mansouri Bigvand. “SpEED: fast computation of sensitive spaced seeds”. Bioinformatics 27.17 (2011), pp. 2433–2434.
[5] Lars Hahn et al. “rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison”. PLoS Computational Biology 12 (2016), e1005107.
[6] Samuele Girotto et al. “FSH: fast spaced seed hashing exploiting adjacent hashes”. Algorithms for Molecular Biology 13.8 (2018).
[7] Camille Moeckel et al. “A survey of k-mer methods and applications in bioinformatics”. Computational and Structural Biotechnology Journal (2024), pp. 2289–2303.
[8] Susana Vinga. “Alignment-free methods in computational biology”. Briefings in Bioinformatics (2014), pp. 341–342.
[9] Guillaume Marçais and Carl Kingsford. “A fast, lock-free approach for efficient parallel counting of occurrences of k-mers”. Bioinformatics 27.6 (2011), pp. 764–770.
[10] Atif Rahman and Lior Pachter. “CGAL: computing genome assembly likelihoods”. Genome Biology (2013).
[11] Jana Ebler et al. “Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes”. Nature Genetics 54.4 (2022), pp. 518–525.
[12] Laurent Noé. “Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds”. Algorithms for Molecular Biology (2017).
[13] Kwok Pui Choi and Louxin Zhang. “Sensitivity analysis and efficient method for identifying optimal spaced seeds”. Journal of Computer and System Sciences (2004), pp. 22–40.
[14] Kwok Pui Choi, Fanfan Zeng, and Louxin Zhang. “Good spaced seeds for homology search”. Bioinformatics 20 (2004), pp. 1053–1059.
[15] Uri Keich et al. “On Spaced Seeds for Similarity Search”. Discrete Applied Mathematics 138.3 (2004), pp. 253–263.
[16] Justin M. Zook et al. “Extensive sequencing of seven human genomes to characterize benchmark reference materials”. Scientific Data (2016).
[17] Marta Byrska-Bishop et al. “High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios”. Cell (2021).
[18] Peter Ebert et al. “Haplotype-resolved diverse human genomes and integrated analysis of structural variation”. Science 372 (2021).
[19] Heng Li. “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM”. arXiv preprint (2013).
[20] Enliang Li et al. “Improved GPU Implementations of the Pair-HMM Forward Algorithm for DNA Sequence Alignment”. IEEE (2021).
[21] Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. “DSK: k-mer counting with very low memory usage”. Bioinformatics 29 (2013), pp. 652–653.
[22] Mikhail Roytberg et al. “On Subset Seeds for Protein Alignment”. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6.3 (2009).