
Graduate student: Hong, Han-Ting (洪翰霆)
Thesis title: Explore the application of deep learning models to ribonucleic acid-protein interaction
Advisor: Wu, Wei-Sheng (吳謂勝)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2024
Graduation academic year: 112 (ROC calendar)
Language: Chinese
Number of pages: 151
Keywords: RNA-protein interactions, RNA, protein, receiver operating characteristic curve, precision-recall curve
    RNA-protein interactions (RPIs) play a crucial role in regulating cellular functions and activities such as gene expression regulation, translation control, and protein synthesis. Gene expression regulation governs the transcription of DNA into RNA, which guides protein synthesis; translation control modulates the efficiency of protein synthesis, which is essential for cellular function; and protein synthesis converts genetic information into functional proteins, which are critical for many cellular processes and for homeostasis. However, traditional methods for identifying RPIs are time-consuming and labor-intensive, so efficient RPI screening tools are needed. Existing tools such as RPISeq, IPMiner, and lncPro show moderate overfitting and poor generalization when predicting RPIs. To address this, we designed a deep learning model that takes both RNA and protein sequences as input to predict RPIs. Experimentally validated RPI data were downloaded from the POSTAR3 database. Using this dataset, we built an end-to-end deep learning model based on the multi-head attention mechanism. RNA sequences were one-hot encoded, and protein sequences were encoded in three different ways (one-hot encoding, ProteinBERT encoding, and 3D spatial embedding model encoding) to observe how the choice of protein encoding affects RPI prediction.

    In five-fold cross-validation on proteins seen by the model, the three encodings achieved average validation areas under the receiver operating characteristic (ROC) curve of 90.5%, 90.7%, and 90.2%, and areas under the precision-recall curve (PRC) of 90.5%, 90.8%, and 90.3%, respectively. On proteins unseen by the model, the area under the ROC curve dropped to 68.0%, 71.3%, and 63.0%, and the area under the PRC dropped to 71.8%, 72.6%, and 64.0%, respectively. This indicates that, with the number of proteins used in this study, changing the protein encoding method does not enable the model to generalize to other proteins. The study therefore finds that, at the current stage, the model has difficulty learning protein features that transfer to proteins not used in training, and that a clear gap in prediction performance remains between proteins used in training and proteins that were not.
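The abstract describes one-hot encoding of RNA sequences and an evaluation that separates proteins seen during training from proteins that were not. The following is a minimal sketch of those two ideas, not the thesis's actual pipeline. The function names `one_hot_rna` and `split_by_protein`, the four-letter alphabet order, the handling of unknown symbols, the held-out fraction, and the toy protein identifiers are all assumptions made for illustration.

```python
import random

# Assumed alphabet order; the thesis does not state which order it uses.
RNA_ALPHABET = "ACGU"


def one_hot_rna(seq):
    """Encode an RNA sequence as one 4-element row per nucleotide.

    DNA-style 'T' is mapped to 'U'. Symbols outside the alphabet
    (for example 'N') become an all-zero row (an assumption).
    """
    seq = seq.upper().replace("T", "U")
    return [[1 if base == letter else 0 for letter in RNA_ALPHABET] for base in seq]


def split_by_protein(pairs, held_out_fraction=0.2, seed=0):
    """Split (rna, protein, label) pairs so held-out proteins never occur in training.

    Returns (train_pairs, unseen_pairs, unseen_proteins).
    """
    proteins = sorted({protein for _, protein, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_held_out = max(1, round(len(proteins) * held_out_fraction))
    unseen = set(proteins[:n_held_out])
    train = [pair for pair in pairs if pair[1] not in unseen]
    unseen_pairs = [pair for pair in pairs if pair[1] in unseen]
    return train, unseen_pairs, unseen


# Hypothetical toy data: five made-up proteins, two RNA fragments each.
toy_pairs = [
    (rna, protein, label)
    for protein in ["RBP_A", "RBP_B", "RBP_C", "RBP_D", "RBP_E"]
    for rna, label in [("ACGUAC", 1), ("GGCUUA", 0)]
]
train_pairs, unseen_pairs, unseen_proteins = split_by_protein(toy_pairs)
print(len(train_pairs), len(unseen_pairs), sorted(unseen_proteins))
```

Splitting by protein rather than by interaction pair is what makes the "unseen protein" score meaningful: a random split over pairs would let the same protein appear on both sides and would tend to report the higher, seen-protein performance.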

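    Table of contents

    Abstract (Chinese)
    Summary
    Acknowledgements
    Table of contents
    List of figures
    List of tables
    Chapter 1 Research background and motivation
        1.1 Research motivation
        1.2 Research background
        1.3 Research objectives
    Chapter 2 Literature review
        2.1 Review of the literature on RNA-RBP interactions
        2.2 Introduction to RBPs
        2.3 Biological mechanisms and effects of RBP-RNA binding
        2.4 Tools for predicting RBP-RNA interactions
            2.4.1 RPIFSE
            2.4.2 IDeepC
            2.4.3 RPI-SAN
            2.4.4 IPMiner
            2.4.5 MultiRBP
            2.4.6 DeepCLIP
            2.4.7 Pysster
            2.4.8 GraphProt
            2.4.9 AlphaFold
        2.5 Deep learning algorithms
            2.5.1 Convolutional neural networks
            2.5.2 Multi-head attention
        2.6 ProteinBERT
            2.6.1 Sequence and GO annotation encoding
            2.6.2 Model architecture
        2.7 Research overview
    Chapter 3 Research methods
        3.1 Datasets
            3.1.1 CLIPdb
            3.1.2 UCSC Genome Browser human hg38 sequence dataset
            3.1.3 UniProt
            3.1.4 AlphaFold Protein Structure Database
            3.1.5 Criteria for determining positional overlap
            3.1.6 CD-HIT
        3.2 Data preprocessing
            3.2.1 Data processing
            3.2.2 Choosing among CD-HIT thresholds in this study
            3.2.3 Data splitting
            3.2.4 Sequence encoding
        3.3 Introduction to the RPI model
    Chapter 4 Results
        4.1 RPI learning curves
        4.2 RPI experimental results
            4.2.1 One-hot encoding model: results on the validation set, the Sampling_RBP tuning set, and the New_RBP tuning set
            4.2.2 ProteinBERT model: validation set, Sampling_RBP tuning set, and New_RBP tuning set
            4.2.3 3D spatial embedding model: validation set, Sampling_RBP tuning set, and New_RBP tuning set
        4.3 Comparison with existing prediction tools
            4.3.1 Comparison of the three models with single-input prediction tools on the Sampling_RBP tuning set
            4.3.2 Comparison of the three models with dual-input prediction tools on the Sampling_RBP tuning set
            4.3.3 Comparison of the three models with single-input prediction tools on the New_RBP tuning set
            4.3.4 Comparison of the three models with dual-input prediction tools on the New_RBP tuning set
        4.4 Discussion of the AlphaFold results
    Chapter 5 Exploring PPI
        5.1 PPI research motivation
        5.2 PPI data sources
        5.3 PPI data preprocessing
        5.4 PPI protein encoding methods
            5.4.1 Protein one-hot encoding
            5.4.2 Protein ProteinBERT encoding
            5.4.3 Protein 3D spatial embedding model encoding
        5.5 Results
            5.5.1 PPI learning curves
        5.6 PPI experimental results
            5.6.1 One-hot encoding model: results on the validation set, the Sampling_protein tuning set, and the New_protein tuning set
            5.6.2 ProteinBERT model: validation set, Sampling_protein tuning set, and New_protein tuning set
            5.6.3 3D spatial embedding model: validation set, Sampling_protein tuning set, and New_protein tuning set
        5.7 PPI conclusions
    Chapter 6 Conclusions, contributions, and future work
        6.1 Conclusions
        6.2 Contributions
        6.3 Future work
    Chapter 7 References
    Appendix
    Research acknowledgement
    Research CRediT contributions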

    [1] L. Wang, X. Yan, M.-L. Liu, K.-J. Song, X.-F. Sun, and W.-W. Pan, “Prediction of RNA-protein interactions by combining deep convolutional neural network with feature selection ensemble method,” Journal of Theoretical Biology, vol. 461, pp. 230–238, Jan. 2019.
    [2] M. Kumar, M. M. Gromiha, and G. P. S. Raghava, “Prediction of RNA binding sites in a protein using SVM and PSSM profile,” Proteins, vol. 71, no. 1, pp. 189–194, Apr. 2008.
    [3] Z.-P. Liu, L.-Y. Wu, Y. Wang, X.-S. Zhang, and L. Chen, “Prediction of protein–RNA binding sites by a random forest method with combined features,” Bioinformatics, vol. 26, no. 13, pp. 1616–1622, Jul. 2010.
    [4] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nat Biotechnol, vol. 33, no. 8, pp. 831–838, Aug. 2015.
    [5] X. Pan and H.-B. Shen, “RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach,” BMC Bioinformatics, vol. 18, no. 1, p. 136, Dec. 2017.
    [6] W. Li and A. Godzik, “Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, Jul. 2006.
    [7] M. Corley, M. C. Burns, and G. W. Yeo, “How RNA-Binding Proteins Interact with RNA: Molecules and Mechanisms,” Molecular Cell, vol. 78, no. 1, pp. 9–29, Apr. 2020.
    [8] N. Brandes, D. Ofer, Y. Peleg, N. Rappoport, and M. Linial, “ProteinBERT: a universal deep-learning model of protein sequence and function,” Bioinformatics, vol. 38, no. 8, pp. 2102–2110, Apr. 2022.
    [9] H. Wu, X. Pan, Y. Yang, and H.-B. Shen, “Recognizing binding sites of poorly characterized RNA-binding proteins on circular RNAs using attention Siamese network,” Briefings in Bioinformatics, vol. 22, no. 6, p. bbab279, Nov. 2021.
    [10] Y. Ju, L. Yuan, Y. Yang, and H. Zhao, “CircSLNN: Identifying RBP-Binding Sites on circRNAs via Sequence Labeling Neural Networks,” Front. Genet., vol. 10, p. 1184, Nov. 2019.
    [11] H.-C. Yi, Z.-H. You, D.-S. Huang, X. Li, T.-H. Jiang, and L.-P. Li, “A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein Interactions Using Evolutionary Information,” Molecular Therapy - Nucleic Acids, vol. 11, pp. 337–344, Jun. 2018.
    [12] L. Zhou, Z. Wang, X. Tian, and L. Peng, “LPI-deepGBDT: a multiple-layer deep framework based on gradient boosting decision trees for lncRNA–protein interaction identification,” BMC Bioinformatics, vol. 22, no. 1, p. 479, Oct. 2021.
    [13] X. Pan, Y.-X. Fan, J. Yan, and H.-B. Shen, “IPMiner: hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction,” BMC Genomics, vol. 17, no. 1, p. 582, Dec. 2016.
    [14] Y.-C. T. Yang et al., “CLIPdb: a CLIP-seq database for protein-RNA interactions,” BMC Genomics, vol. 16, no. 1, p. 51, Dec. 2015.
    [15] “UCSC Genome Browser Gateway,” http://genome.ucsc.edu, Oct. 2022 (accessed Aug. 2023).
    [16] G. W. Yeo, N. G. Coufal, T. Y. Liang, G. E. Peng, X.-D. Fu, and F. H. Gage, “An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells,” Nat Struct Mol Biol, vol. 16, no. 2, pp. 130–137, Feb. 2009.
    [17] C. Danan, S. Manickavel, and M. Hafner, “PAR-CLIP: A Method for Transcriptome-Wide Identification of RNA Binding Protein Interaction Sites,” in Post-Transcriptional Gene Regulation, E. Dassi, Ed., in Methods in Molecular Biology, vol. 1358. New York, NY: Springer New York, 2016, pp. 153–173.
    [18] The UniProt Consortium et al., “UniProt: the Universal Protein Knowledgebase in 2023,” Nucleic Acids Research, vol. 51, no. D1, pp. D523–D531, Jan. 2023.
    [19] F. Pei, Q. Shi, H. Zhang, and I. Bahar, “Predicting Protein–Protein Interactions Using Symmetric Logistic Matrix Factorization,” J. Chem. Inf. Model., vol. 61, no. 4, pp. 1670–1682, Apr. 2021, doi: 10.1021/acs.jcim.1c00173.
    [20] J. Karin, H. Michel, and Y. Orenstein, “MultiRBP: multi-task neural network for protein-RNA binding prediction,” in Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Gainesville Florida: ACM, Aug. 2021, pp. 1–9. doi: 10.1145/3459930.3469525.
    [21] A. G. B. Grønning et al., “DeepCLIP: predicting the effect of mutations on protein–RNA binding with deep learning,” Nucleic Acids Research, p. gkaa530, Jun. 2020, doi: 10.1093/nar/gkaa530.
    [22] S. Budach and A. Marsico, “pysster: classification of biological sequences by learning sequence and structure motifs with convolutional neural networks,” Bioinformatics, vol. 34, no. 17, pp. 3035–3037, Sep. 2018, doi: 10.1093/bioinformatics/bty222.
    [23] D. Maticzka, S. J. Lange, F. Costa, and R. Backofen, “GraphProt: modeling binding preferences of RNA-binding proteins,” Genome Biol, vol. 15, no. 1, p. R17, 2014, doi: 10.1186/gb-2014-15-1-r17.
    [24] J. Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, Aug. 2021, doi: 10.1038/s41586-021-03819-2.

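    Full-text availability: on campus, public from 2029-08-25; off campus, public from 2029-08-25.
    The electronic thesis has not yet been authorized for public release; for the print copy, consult the library catalog.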