| Graduate student: | 施惇瀚 Shih, Dun-Han |
|---|---|
| Thesis title: | 應用FPGA加速深度學習模型將第三代基因測序數據轉換為全基因體圖像 Applying FPGA acceleration to deep learning models to convert third-generation gene sequencing data into whole-genome images |
| Advisor: | 黃吉川 Hwang, Chi-Chuan |
| Degree: | Master |
| Department: | College of Engineering - Department of Engineering Science |
| Year of publication: | 2024 |
| Academic year of graduation: | 112 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 81 |
| Keywords (Chinese): | 第三代基因定序、基因體研究、FPGA硬體加速、變異檢測 |
| Keywords (English): | Third-generation sequencing, Genomic research, FPGA hardware acceleration, Variant calling |
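Accurately locating genetic variants is essential to contemporary genomics research and medical applications. How efficiently and precisely variant sites can be identified and located in the genome directly determines our ability to explore genetic mechanisms, discover disease-causing genes, and deliver precision medicine. As third-generation sequencing technology advances, longer reads have become available. This opens new opportunities for detecting complex structural variants, but it also poses severe challenges for the performance and accuracy of variant-calling algorithms. Massive long-read datasets require large-scale, high-throughput analysis that traditional variant detection methods can no longer provide.
This study uses machine learning to improve the performance of variant calling, a key step in the analysis. DeepVariant can accurately identify, screen, and classify many kinds of variants in the genome, including single-nucleotide variants, short insertions and deletions, and even larger structural variants, covering an unprecedented range. Precise and comprehensive variant calling lays a solid foundation for revealing the links between genetic variation and disease and for individualized genetic risk assessment. Compared with traditional algorithms, another notable advantage of DeepVariant is its standardized, automated workflow: given only raw sequencing data and a reference genome sequence as input, it produces a variant report directly, without manual tuning of complex parameters, which greatly improves convenience and usability.
For accelerating machine-learning inference, FPGAs offer two main advantages. The first is efficient parallel computing: an FPGA has abundant programmable logic resources and can implement fine-grained task-level parallelism, using chip resources efficiently for large-scale data-parallel computation. This lets it accelerate compute-intensive deep-learning operations such as matrix multiplication and convolution. The second is high-bandwidth, low-latency memory access: FPGAs typically use on-chip memory, which provides very high bandwidth and very low access latency. This is especially important for data-intensive deep-learning applications because it significantly reduces the performance lost to memory bottlenecks. Compared with conventional GPUs, this development approach currently offers better edge-computing performance and relatively lower computing cost.
In this study, running the make_examples step on a CPU server and on the VCK5000 FPGA card showed a marked difference in efficiency and run time. Because the hardware design of the VCK5000 is particularly well suited to efficient parallel computing, the step that took 170 minutes on a server with a 24-thread CPU took 107 minutes on the VCK5000 accelerator card, a speedup of about 58% (170/107 ≈ 1.59x).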
Genetic variant detection is essential in contemporary genomic research. With advancements in third-generation sequencing technologies, vast amounts of sequencing data require efficient analysis methods. This study applies FPGA hardware acceleration to the DeepVariant variant detection tool to enhance computational efficiency in genetic variant analysis, significantly contributing to genomic medicine research and clinical diagnostics.
The DeepVariant pipeline consists of three steps: make_examples, call_variants, and postprocess_variants. The make_examples step, which prepares the data for the neural network, takes a reference genome file (FASTA) and a file of mapped reads (BAM) and generates candidate variant sites. It segments the sequencing data, extracts features, and passes them to a convolutional neural network model, which identifies potential variant sites.
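To make the idea of turning aligned reads into an image more concrete, the following is a minimal sketch in Python. It is not DeepVariant's actual make_examples code. It assumes gap-free alignments supplied as hypothetical `(start, sequence)` tuples, and the `BASE_VALUE` map, the two-channel layout, and the `encode_pileup` function are illustrative names introduced here; the real tool reads BAM and FASTA files and uses a richer set of channels.

```python
# Toy pileup encoder: an illustrative sketch, NOT DeepVariant's make_examples.
# Hypothetical intensity assigned to each base; real encodings differ.
BASE_VALUE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


def encode_pileup(reference, reads, center, width=5, max_reads=4):
    """Return a (max_reads + 1) x width x 2 nested list for one candidate site.

    reference: reference sequence as a string.
    reads: list of (start, sequence) tuples; 0-based start, no gaps (assumption).
    Channel 0 holds the base intensity; channel 1 is 1.0 where a read base
    differs from the reference. Row 0 is the reference, reads fill rows 1..max_reads.
    """
    left = center - width // 2
    image = [[[0.0, 0.0] for _ in range(width)] for _ in range(max_reads + 1)]

    # Row 0: the reference bases inside the window.
    for col in range(width):
        pos = left + col
        if 0 <= pos < len(reference):
            image[0][col][0] = BASE_VALUE.get(reference[pos], 0.0)

    row = 1
    for start, seq in reads:
        if row > max_reads:
            break  # window is full; extra reads are dropped
        end = start + len(seq)
        if end <= left or start >= left + width:
            continue  # read does not overlap the window
        for col in range(width):
            pos = left + col
            if start <= pos < end and 0 <= pos < len(reference):
                base = seq[pos - start]
                image[row][col][0] = BASE_VALUE.get(base, 0.0)
                image[row][col][1] = float(base != reference[pos])
        row += 1
    return image


reference = "ACGTACGTACGTACG"
# The first read carries a G>C mismatch at reference position 6.
reads = [(2, "GTACCTAC"), (5, "CGTAC")]
image = encode_pileup(reference, reads, center=6)
print(image[1][2])  # pixel of the first read at the candidate site: [0.5, 1.0]
```

Each candidate site becomes a small fixed-size array, so many sites can be encoded independently of one another; that independence is the kind of property a parallel hardware implementation could exploit.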
Porting the make_examples step to an FPGA hardware platform significantly increases the speed of generating candidate variant sites, which shortens the overall variant-calling workflow. Running the step on a server with a 24-thread CPU takes 170 minutes, while running it on the VCK5000 accelerator card takes 107 minutes, a speedup of approximately 58% (170/107 ≈ 1.59x).
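The reported timings can be read in more than one way, so a quick check of the arithmetic may help. The sketch below uses only the two figures given in the abstract (170 and 107 minutes); the `speedup_metrics` helper is a name introduced here for illustration.

```python
def speedup_metrics(baseline_minutes, accelerated_minutes):
    """Express one pair of run times as three common performance figures."""
    ratio = baseline_minutes / accelerated_minutes
    return {
        # How many times faster the accelerated run is.
        "speedup_factor": ratio,
        # Extra work completed per unit time, in percent.
        "throughput_gain_pct": (ratio - 1.0) * 100.0,
        # Reduction in wall-clock time, in percent.
        "time_saved_pct": (1.0 - accelerated_minutes / baseline_minutes) * 100.0,
    }


metrics = speedup_metrics(170, 107)
print(f"{metrics['speedup_factor']:.2f}x")        # 1.59x
print(f"{metrics['throughput_gain_pct']:.1f}%")   # 58.9%
print(f"{metrics['time_saved_pct']:.1f}%")        # 37.1%
```

Under this reading, the quoted figure of approximately 58% corresponds to the throughput gain, while the wall-clock time itself falls by about 37%.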
On campus: publicly available from 2029-07-19