Graduate student: 洪維晨 Hong, Wei-Chen
Thesis title: 基於大語言模型與Runge-Kutta近似二階優化之基因啟動子預測 (Gene Promoter Prediction using Large Language Models and Runge-Kutta Approximate Second-Order Optimization)
Advisor: 黃吉川 Hwang, Chi-Chuan
Degree: Master
Department: Department of Engineering Science, College of Engineering (工學院 工程科學系)
Year of publication: 2025
Academic year of graduation: 113
Language: Chinese
Number of pages: 92
Keywords (Chinese): 大型語言模型、基因組學、深度學習、二階優化、Runge-Kutta方法、啟動子預測、持續預訓練
Keywords (English): Large Language Models, Genomics, Deep Learning, Second-Order Optimization, Runge-Kutta Method, Promoter Prediction, Continual Pre-training
With the rapid advances in genomics and deep learning, large language models (LLMs) have shown great potential in bioinformatics tasks such as interpreting DNA sequences. However, most existing models rely on first-order optimizers, which often suffer from slow convergence and performance bottlenecks when handling high-dimensional, complex genomic data. This study therefore aims to develop a more efficient optimization algorithm and to build a more capable genomic foundation model.
This study makes two core contributions. First, starting from the general-purpose large language model LLaMA3-8B, we performed continual pre-training on the human reference genome to build "DNA-LLaMA3", a foundation model designed specifically for genomics. Second, we designed and implemented "KT4", an approximate second-order optimizer. Drawing on the fourth-order Runge-Kutta numerical method, KT4 combines multiple first-order gradient evaluations to approximate the second-order curvature information of the loss function, with the goal of obtaining convergence behavior comparable to second-order methods at a controllable computational cost.
In fine-tuning experiments on the promoter prediction task (GUE dataset), we compared the KT4 optimizer comprehensively against the conventional AdamW optimizer. The results show that although KT4 requires more training time per epoch than AdamW, its efficient convergence substantially reduces the total training time and leads to a marked improvement in model performance. The DNA-LLaMA3 model trained with KT4 not only outperforms its AdamW-trained counterpart on every evaluation metric but also clearly surpasses the existing DNABERT-2 model.
This study confirms that continual pre-training is an effective strategy for adapting large language models to the genomic domain. Furthermore, the KT4 optimizer, designed on the basis of the Runge-Kutta method, improves overall training efficiency and significantly strengthens the model's predictive ability, offering a new path toward solving complex bioinformatics problems by combining advanced model architectures with efficient optimization algorithms.
The potential of Large Language Models (LLMs) in genomics is often hindered by the performance bottlenecks of first-order optimizers. This research addresses this limitation by developing a more powerful genomic foundation model and a highly efficient optimization algorithm.
We introduce two innovations: "DNA-LLaMA3," a foundation model created by continually pre-training LLaMA3-8B for genomics, and "KT4," a novel optimizer. Inspired by the Runge-Kutta method, KT4 approximates second-order curvature information to achieve superior convergence at a manageable computational cost.
On the promoter prediction task (GUE dataset), a comprehensive comparison between KT4 and the conventional AdamW optimizer revealed a clear trade-off: KT4 spends more time per training epoch, but its faster convergence reduces the overall training time and delivers significantly enhanced model performance. The resulting DNA-LLaMA3 model surpassed both its AdamW-trained counterpart and the existing DNABERT-2 model across all evaluation metrics.
This study validates continual pre-training as an effective strategy for adapting LLMs to genomics. The KT4 optimizer, despite its higher per-epoch cost, improves overall training efficiency and provides substantial gains in predictive power. This work highlights a promising path for bioinformatics, integrating advanced model architectures with more sophisticated optimization algorithms.
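To make the Runge-Kutta idea described above concrete, the following is a minimal, hypothetical PyTorch sketch of a classical fourth-order Runge-Kutta (RK4) step applied to the gradient-flow ODE dθ/dt = −∇L(θ); it combines four first-order gradient evaluations per update, which is the general mechanism the abstract attributes to KT4. The function names (rk4_step, closure) and all details are illustrative assumptions, not the thesis's actual KT4 implementation.

```python
import torch

def rk4_step(params, closure, lr):
    """One RK4-style update for the gradient-flow ODE dtheta/dt = -grad L(theta).

    Four first-order gradient evaluations per step stand in for explicit
    second-order (Hessian) information. Hypothetical sketch only; not the
    thesis's actual KT4 optimizer.
    """
    params = list(params)

    def grad_at_current():
        # Re-evaluate loss and gradients at the parameters' current values.
        for p in params:
            p.grad = None
        closure()
        return [p.grad.detach().clone() for p in params]

    def set_params(theta0, direction, scale):
        # Temporarily move parameters to theta0 + scale * direction.
        with torch.no_grad():
            for p, t0, d in zip(params, theta0, direction):
                p.copy_(t0 + scale * d)

    theta0 = [p.detach().clone() for p in params]

    k1 = grad_at_current()
    set_params(theta0, k1, -0.5 * lr)
    k2 = grad_at_current()
    set_params(theta0, k2, -0.5 * lr)
    k3 = grad_at_current()
    set_params(theta0, k3, -lr)
    k4 = grad_at_current()

    # Classical RK4 combination of the four gradient evaluations.
    with torch.no_grad():
        for p, t0, a, b, c, d in zip(params, theta0, k1, k2, k3, k4):
            p.copy_(t0 - (lr / 6.0) * (a + 2 * b + 2 * c + d))


# Toy usage: fit a small linear model with MSE loss on random data.
model = torch.nn.Linear(4, 1)
x, y = torch.randn(64, 4), torch.randn(64, 1)

def closure():
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

for _ in range(50):
    rk4_step(model.parameters(), closure, lr=0.1)
```

Each such step costs roughly four gradient evaluations where AdamW uses one, which matches the per-epoch versus total-training-time trade-off discussed in the abstracts.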
[1] E. S. Lander et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, pp. 860-921, 2001.
[2] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson, "Big data: astronomical or genomical?," PLoS Biology, vol. 13, no. 7, p. e1002195, 2015.
[3] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, and M. M. Hoffman, "Opportunities and obstacles for deep learning in biology and medicine," Journal of the royal society interface, vol. 15, no. 141, p. 20170387, 2018.
[4] Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri, "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome," Bioinformatics, vol. 37, no. 15, pp. 2112-2120, 2021.
[5] B. Lenhard, A. Sandelin, and P. Carninci, "Metazoan promoters: emerging characteristics and insights into transcriptional regulation," Nature Reviews Genetics, vol. 13, no. 4, pp. 233-245, 2012.
[6] R. K. Umarov and V. V. Solovyev, "Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks," PloS one, vol. 12, no. 2, p. e0171410, 2017.
[7] I. Loshchilov and F. Hutter, "Decoupled Weight Decay Regularization," in International Conference on Learning Representations, 2019.
[8] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, "Sharpness-aware Minimization for Efficiently Improving Generalization," in International Conference on Learning Representations, 2021.
[9] Z. Zhou, Y. Ji, W. Li, P. Dutta, R. V. Davuluri, and H. Liu, "DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes," in The Twelfth International Conference on Learning Representations, 2024.
[10] O. Press, N. Smith, and M. Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation," in International Conference on Learning Representations, 2022.
[11] M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, and P. Rajpurkar, "Foundation models for generalist medical artificial intelligence," Nature, vol. 616, no. 7956, pp. 259-265, 2023.
[12] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv preprint arXiv:1609.04747, 2016.
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[14] Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. Mahoney, "Adahessian: An adaptive second order optimizer for machine learning," in proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 12, pp. 10665-10673, 2021.
[15] M. L. Metzker, "Sequencing technologies—the next generation," Nature Reviews genetics, vol. 11, no. 1, pp. 31-46, 2010.
[16] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of molecular biology, vol. 215, no. 3, pp. 403-410, 1990.
[17] M. W. Libbrecht and W. S. Noble, "Machine learning applications in genetics and genomics," Nature Reviews Genetics, vol. 16, no. 6, pp. 321-332, 2015.
[18] J. Dekker and L. Mirny, "The 3D genome as moderator of chromosomal communication," Cell, vol. 164, no. 6, pp. 1110-1121, 2016.
[19] J. D. Watson, Molecular biology of the gene. Pearson Education India, 2004.
[20] G. Rizk, D. Lavenier, and R. Chikhi, "DSK: k-mer counting with very low memory usage," Bioinformatics, vol. 29, no. 5, pp. 652-653, 2013.
[21] S. Sarkar, K. Mridha, A. Ghosh, and R. N. Shaw, "Machine learning in bioinformatics: new technique for DNA sequencing classification," in Advanced Computing and Intelligent Technologies: Proceedings of ICACIT 2022: Springer, pp. 335-355, 2022.
[22] R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715-1725.
[23] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects," Science, vol. 349, no. 6245, pp. 255-260, 2015.
[24] T. M. Mitchell, Machine Learning. McGraw-Hill, New York, 1997.
[25] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2009.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[27] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[31] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning: PMLR, pp. 448-456, 2015.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition: IEEE, pp. 248-255, 2009.
[34] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, and S. Anadkat, "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[35] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171-4186, 2019.
[36] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, and F. Azhar, "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[38] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, "Roformer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024.
[39] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, "Gqa: Training generalized multi-query transformer models from multi-head checkpoints," arXiv preprint arXiv:2305.13245, 2023.
[40] N. Shazeer, "Fast transformer decoding: One write-head is all you need," arXiv preprint arXiv:1911.02150, 2019.
[41] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, "On layer normalization in the transformer architecture," in International Conference on Machine Learning: PMLR, pp. 10524-10533, 2020.
[42] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[43] B. Zhang and R. Sennrich, "Root mean square layer normalization," Advances in Neural Information Processing Systems, vol. 32, 2019.
[44] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for Activation Functions," International Conference on Learning Representations, 2018.
[45] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in International Conference on Machine Learning: PMLR, pp. 933-941, 2017.
[46] N. Shazeer, "Glu variants improve transformer," arXiv preprint arXiv:2002.05202, 2020.
[47] G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis, "Deep learning: new computational modelling techniques for genomics," Nature Reviews Genetics, vol. 20, no. 7, pp. 389-403, 2019.
[48] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831-838, 2015.
[49] J. Zhou and O. G. Troyanskaya, "Predicting effects of noncoding variants with deep learning–based sequence model," Nature Methods, vol. 12, no. 10, pp. 931-934, 2015.
[50] D. Quang and X. Xie, "DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences," Nucleic Acids Research, vol. 44, no. 11, pp. e107-e107, 2016.
[51] X.-Q. Liu, B.-X. Li, G.-R. Zeng, Q.-Y. Liu, and D.-M. Ai, "Prediction of long non-coding RNAs based on deep learning," Genes, vol. 10, no. 4, p. 273, 2019.
[52] J. Zhou, C. L. Theesfeld, K. Yao, K. M. Chen, A. K. Wong, and O. G. Troyanskaya, "Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk," Nature Genetics, vol. 50, no. 8, pp. 1171-1179, 2018.
[53] J. Vamathevan, D. Clark, P. Czodrowski, I. Dunham, E. Ferran, G. Lee, B. Li, A. Madabhushi, P. Shah, and M. Spitzer, "Applications of machine learning in drug discovery and development," Nature reviews Drug discovery, vol. 18, no. 6, pp. 463-477, 2019.
[54] G. Chuai, H. Ma, J. Yan, M. Chen, N. Hong, D. Xue, C. Zhou, C. Zhu, K. Chen, and B. Duan, "DeepCRISPR: optimized CRISPR guide RNA design by deep learning," Genome biology, vol. 19, pp. 1-18, 2018.
[55] N. Killoran, L. J. Lee, A. Delong, D. Duvenaud, and B. J. Frey, "Generating and designing DNA with deep generative models," arXiv preprint arXiv:1712.06148, 2017.
[56] Z. Cui, T. Xu, J. Wang, Y. Liao, and Y. Wang, "Geneformer: Learned gene compression using transformer-based context modeling," in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): IEEE, pp. 8035-8039, 2024.
[57] X.-M. Zhang, L. Liang, L. Liu, and M.-J. Tang, "Graph neural networks and their current applications in bioinformatics," Frontiers in genetics, vol. 12, p. 690049, 2021.
[58] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[59] S. Baack, "A critical analysis of the largest source for generative ai training data: Common crawl," in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 2199-2208, 2024.
[60] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in Proceedings of the IEEE international conference on computer vision, pp. 19-27, 2015.
[61] H. Dohrn and D. Riehle, "Design and implementation of the sweble wikitext parser: unlocking the structured data of wikipedia," in Proceedings of the 7th International Symposium on Wikis and Open Collaboration, pp. 72-81, 2011.
[62] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," Science China technological sciences, vol. 63, no. 10, pp. 1872-1897, 2020.
[63] V. Lialin, V. Deshpande, and A. Rumshisky, "Scaling down to scale up: A guide to parameter-efficient fine-tuning," arXiv preprint arXiv:2303.15647, 2023.
[64] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022.
[65] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, "Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Association for Computational Linguistics, 2020.
[66] Y. Cui, Z. Yang, and X. Yao, "Efficient and effective text encoding for chinese llama and alpaca," arXiv preprint arXiv:2304.08177, 2023.
[67] W. Yue, J. Zhang, K. Hu, Y. Xia, J. Luo, and Z. Wang, "Surgicalsam: Efficient class promptable surgical instrument segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, pp. 6890-6898, 2024.
[68] Z. Zhou, J.-X. Shi, P.-X. Song, X. Yang, Y.-X. Jin, L.-Z. Guo, and Y.-F. Li, "LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model," CoRR, 2024.
[69] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, and Z. Dong, "A survey of large language models," arXiv preprint arXiv:2303.18223, 2023.
[70] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, and S. Bhosale, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[71] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, and A. Fan, "The Llama 3 Herd of Models," CoRR, 2024.
[72] 唐国梁Tommy, "大模型Llama架构 从理论到实战 [The Llama Large-Model Architecture: From Theory to Practice]," 2024. [Online]. Available: https://www.53ai.com/news/OpenSourceLLM/2024121330689.html.
[73] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, "PyTorch: an imperative style, high-performance deep learning library," in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 8026-8037, 2019.
[74] M. Abadi, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," CoRR, 2016.
[75] T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," arXiv preprint arXiv:1604.06174, 2016.
[76] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in International Conference on Machine Learning: PMLR, pp. 1139-1147, 2013.
[77] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[78] J. C. Butcher, "A history of Runge-Kutta methods," Applied Numerical Mathematics, vol. 20, no. 3, pp. 247-260, 1996.
[79] F. B. Hildebrand, Introduction to Numerical Analysis. Courier Corporation, 1987.
[80] T. Liu and D. I. Ketcheson, "Explicit Runge-Kutta Methods for Quadratic Optimization with Optimal Rates," 2023.
[81] U.S. Department of Energy, "Human Genome Project," 2006. [Online]. Available: http://www.ornl.gov/hgmis.
[82] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta (BBA)-Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.
[83] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.