Graduate student: | 洪翰霆 Hong, Han-Ting |
---|---|
Thesis title: | 探討深度學習模型應用於核糖核酸與蛋白質交互作用 Explore the application of deep learning models to ribonucleic acid-protein interaction |
Advisor: | 吳謂勝 Wu, Wei-Sheng |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of publication: | 2024 |
Academic year of graduation: | 112 (ROC calendar) |
Language: | Chinese |
Number of pages: | 151 |
Keywords: | RNA-protein interactions, RNA, protein, receiver operating characteristic curve, precision-recall curve |
RNA-protein interactions (RPIs) play a crucial role in regulating cellular functions and activities such as gene expression regulation, translation control, and protein synthesis. Gene expression regulation governs the transcription of DNA into RNA, which guides protein synthesis; translation control modulates the efficiency of protein synthesis, which is essential for cellular function; and protein synthesis converts genetic information into functional proteins, which are critical for many cellular processes and for homeostasis. Traditional methods for identifying RPIs, however, are time-consuming and labor-intensive, so efficient RPI screening tools are needed. Existing tools such as RPISeq, IPMiner, and lncPro exhibit moderate overfitting and poor generalization when predicting RPIs.
To address this issue, we designed a deep learning model that takes both RNA and protein sequences as input to predict RPIs. Experimentally validated RPI data were downloaded from the POSTAR3 database, and on this dataset we built an end-to-end deep learning model based on the multi-head attention mechanism. RNA sequences were one-hot encoded, and protein sequences were encoded in three different ways (one-hot encoding, ProteinBERT encoding, and 3D spatial embedding encoding) to observe how the choice of protein encoding affects RPI prediction.
In five-fold cross-validation on proteins seen by the model, the three protein encodings achieved, in the order listed above, average validation areas under the receiver operating characteristic curve (AUROC) of 90.5%, 90.7%, and 90.2%, and average validation areas under the precision-recall curve (AUPRC) of 90.5%, 90.8%, and 90.3%. On proteins unseen by the model, the AUROC dropped to 68.0%, 71.3%, and 63.0%, and the AUPRC to 71.8%, 72.6%, and 64.0%. This indicates that, with the number of proteins used in this study, changing the protein encoding method does not enable the model to generalize to other proteins. This study therefore finds that, at the current stage, the model has difficulty learning the features of proteins that were not used in training, and that a gap remains between prediction performance on such proteins and on proteins used in training.
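The gap between seen and unseen proteins depends on how the validation folds are built, so a small sketch may help make the setup concrete. The Python below is only an illustration under stated assumptions, not the thesis code: the function names (`one_hot_rna`, `unseen_protein_folds`, `auroc`, `average_precision`) are hypothetical, the record does not say how the folds were constructed, and average precision is just one common estimate of the area under the precision-recall curve.

```python
import random

RNA_ALPHABET = "ACGU"


def one_hot_rna(seq):
    """One-hot encode an RNA sequence; bases outside ACGU become all zeros."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return [[1 if base == letter else 0 for letter in RNA_ALPHABET]
            for base in seq]


def unseen_protein_folds(proteins, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which no protein is on both sides.

    `proteins` holds one protein identifier per RNA-protein sample.
    """
    unique = sorted(set(proteins))
    random.Random(seed).shuffle(unique)
    fold_of = {p: i % n_folds for i, p in enumerate(unique)}
    for fold in range(n_folds):
        train = [i for i, p in enumerate(proteins) if fold_of[p] != fold]
        val = [i for i, p in enumerate(proteins) if fold_of[p] == fold]
        yield train, val


def auroc(labels, scores):
    """AUROC as the chance that a positive outscores a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits
```

Splitting by sample index instead of by protein would place every protein in both the training and validation sets, which corresponds to the "seen" setting. Splitting by protein, as in `unseen_protein_folds`, corresponds to the "unseen" setting in which the reported scores drop.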