| 研究生: |
鄭驊軒 Zheng, Hua-Xuan |
|---|---|
| 論文名稱: |
基於雙重平行 GPU 加速多權重矩陣 Motif Discovery 演算法 Doubly Parallel GPU-Accelerated Motif Discovery Using Multiple Weight Matrix Models |
| 指導教授: |
賀保羅
Horton, Paul |
| 學位類別: |
碩士 Master |
| 系所名稱: |
電機資訊學院 - 人工智慧科技碩士學位學程 Graduate Program of Artificial Intelligence |
| 論文出版年: | 2025 |
| 畢業學年度: | 113 |
| 語文別: | 英文 |
| 論文頁數: | 54 |
| 中文關鍵詞: | GPU 平行運算 、Motif discovery 、EM 演算法 |
| 外文關鍵詞: | GPU parallel computing, Motif discovery, EM algorithm |
| 相關次數: | 點閱:47 下載:3 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
在序列模式識別任務中,準確辨識具生物學意義的重複子序列(motif)是了解序列中潛在規律的重要方式。找出 DNA 中的轉錄因子結合位(TF binding sites)及蛋白質中的活性區位,是理解轉錄調控機制的關鍵任務。即使在已知轉錄因子結合模式的情況下,重新搜尋並比對這些 motif 仍具有實質意義,不僅有助於驗證序列資料的品質,也能評估資料的代表性與穩定性,特別是在實驗資料有限的情況下更加重要。
Motif 發現被歸類為計算困難的 NP-complete 問題。現有演算法可分為確定式與非確定式兩類:前者(如 statistical enumerative methods)雖能保證全域最優,但在大規模資料集上運算量呈指數增長、效率低下;後者(以 MEME 為代表)較易擴展至大型資料且對長序列更為高效,然而無法保證解的全域最優性,且對初始設定極為敏感。以 MEME 為例,它必須對所有可能的序列起點反覆計算機率得分並在每次 EM 迭代中更新模型參數,時間複雜度因序列條數、長度與候選模型數量而急遽攀升。若進一步分析吸引域(basin of attraction),還需對大量隨機初始化的 matrix model 重複進行訓練與比對,以觀察不同起始條件下的收斂分布,進一步放大整體運算負擔。
為了改善此一問題,本研究設計並實作一套基於 GPU 平行架構的 motif 發現系統,利用圖形處理器(GPU)的大量運算核心與共享記憶體特性,加速序列與多組 WMM 模型的匹配計算與訓練過程。本方法可同時處理多個初始模型,並於單次實驗中統計各模型的收斂結果,進一步分析其在參數空間中的分佈與吸引域結構。實驗結果顯示,與傳統處理方式相比,GPU 加速能有效縮短執行時間,提升系統效率,並具備良好的可擴展性,適用於後續更大規模的序列模式分析與穩定性研究。
In sequence pattern recognition tasks, accurately identifying biologically meaningful repetitive subsequences (motifs) is an important approach to understanding the underlying structure of sequences. Locating transcription factor binding sites (TFBSs) in DNA and active sites in proteins is a crucial step toward understanding transcriptional regulatory mechanisms. Even when the binding patterns of specific transcription factors are known, re-identifying and matching motifs remains valuable. Such efforts help validate the quality of sequence data and assess its representativeness and stability, especially when experimental data is limited.
However, motif discovery is computationally challenging and has been proven NP-complete. Existing algorithms fall into two broad categories: deterministic and nondeterministic. Deterministic approaches (e.g. statistical enumerative methods) guarantee the global optimum, but their running time grows prohibitively on large datasets; nondeterministic methods, exemplified by MEME, scale better to large inputs and longer motifs, yet they do not ensure optimality and are highly sensitive to their initial configuration.
In MEME, for instance, every possible position in every sequence must be scored, and the search-update cycle is repeated many times, so the computational complexity rises steeply with the number of sequences, sequence length, and the number of models. When analyzing basins of attraction, the primary task is to identify each attractor (a local optimum reached by EM) and to map the domain of starting points that the algorithm draws into that attractor--an undertaking that requires training and evaluating thousands of independently initialized models to capture the full distribution of convergence patterns, thereby inflating the computational burden far beyond that of a single motif-finding run.
To address this issue, this thesis presents a GPU-based parallel framework for motif discovery. By leveraging the massively parallel processing power and shared memory architecture of modern graphics processing units (GPUs), the proposed system accelerates the sequence-to-model matching and training process across multiple WMMs. The method supports simultaneous processing of multiple initial models and collects their convergence statistics in a single experiment, enabling further analysis of distribution patterns and basin structures in parameter space. Experimental results show that, compared to traditional CPU-based approaches, the GPU-accelerated system significantly reduces execution time, and offers good scalability for large-scale motif discovery and stability analysis.
[1] Muzaffer Can Altinigneli, Claudia Plant, and Christian Böhm. “Massively parallel expectation maximization using graphics processing units”. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Min-ing. KDD ’13. Chicago, Illinois, USA: Association for Computing Machinery, 2013, pp. 838–846. isbn: 9781450321747. doi: 10.1145/2487575.2487628. url: https://doi.org/10.1145/2487575.2487628.
[2] Faisal Bin Ashraf and Md Shafiur Raihan Shafi. “MFEA: An evolutionary approach for motif finding in DNA sequences”. Informatics in Medicine Unlocked 21 (2020), p. 100466. issn: 2352-9148. doi: https : / / doi . org / 10 . 1016 / j . imu . 2020 . 100466. url: https : / / www . sciencedirect . com / science / article / pii / S235291482030616X.
[3] Timothy L. Bailey and Charles Elkan. “Fitting a mixture model by expectation maximization to discover motifs in biopolymers”. Proceedings of the International Conference on Intelligent Systems for Molecular Biology 2 (1994), pp. 28–36.
[4] Jhoirene B. Clemente, Francis George C. Cabarle, and Henry N. Adorna. “PROJEC-TION Algorithm for Motif Finding on GPUs”. ArXiv abs/1605.06904 (2016).
[5] Nvidia Corp. “CUDA C++ Best Practices Guide” (2024).
[6] William N. Grundy, Timothy L. Bailey, and Charles P. Elkan. “ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool”. Computer Applications in the Biosciences 12.4 (1996), pp. 303–310.
[7] Paul Horton and Wataru Fujibuchi. “An upper bound on the hardness of exact matrix based motif discovery”. Journal of Discrete Algorithms 5.4 (2007). Selected papers from Combinatorial Pattern Matching 2005, pp. 706–713. issn: 1570-8667. doi: https : / / doi . org / 10 . 1016 / j . jda . 2006 . 10 . 006. url: https : / / www . sciencedirect.com/science/article/pii/S1570866706000797.
[8] C E Lawrence and A A Reilly. “An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences”. Proteins 7.1 (1990), pp. 41–51.
[9] Charles E. Lawrence et al. “Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment”. Science 262.5131 (1993), pp. 208–214. doi: 10.1126/ science.8211139.
[10] Michael Levine and Robert Tjian. “Transcription regulation and animal diversity”. Nature 424.6945 (2003), pp. 147–151. doi: 10 . 1038 / nature01763. url: https ://doi.org/10.1038/nature01763.
[11] Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell. “An Ultrafast Scalable ManyCore Motif Discovery Algorithm for Multiple GPUs”. 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 2011,pp. 428–434. doi: 10.1109/IPDPS.2011.183.
[12] Yongchao Liu et al. “CUDA–MEME: Accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units”. Pattern Recognition Letters 31.14 (2010), pp. 2170–2177.
[13] John Nickolls et al. “Scalable parallel programming with CUDA”. ACM SIGGRAPH 2008 Classes. SIGGRAPH ’08. Los Angeles, California: Association for Computing Machinery, 2008. isbn: 9781450378451.
[14] NVIDIA Corporation. CUDA C Programming Guide. https://docs.nvidia.com/ cuda/cuda-c-programming-guide/. NVIDIA. 2024.
[15] Subhransu Pal, Ping Xiao, and Sanguthevar Rajasekaran. “Efficient sequential and parallel algorithms for finding edit distance based motifs”. BMC Genomics 17.Suppl4 (2016), p. 465.
[16] Geir Kjetil Sandve et al. “Accelerating Motif Discovery: Motif Matching on Parallel Hardware”. Algorithms in Bioinformatics. Ed. by Philipp Bücher and Bernard M. E. Moret. Vol. 4175. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 197–206. isbn: 978-3-540-39584-3.
[17] Ngoc Tam L. Tran and Chun-Hsi Huang. “A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data”. Biology Direct 9.1 (2014), p. 4. issn: 1745-6150. doi: 10.1186/1745- 6150- 9- 4. url: https://doi.org/10.1186/1745-6150-9-4.
[18] Linbin Yu and Yun Xu. “A Parallel Gibbs Sampling Algorithm for Motif Finding on GPU”. 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications. 2009, pp. 555–558. doi: 10.1109/ISPA.2009.88.
[19] Yipu Zhang, Ping Wang, and Maode Yan. “An Entropy-Based Position Projection Algorithm for Motif Discovery”. BioMed Research International 2016 (2016), p. 9127474. doi: 10.1155/2016/9127474.