
Graduate student: 林攸蓁
Lin, You-Chen
Thesis title: 使用基於卷積神經網路的特徵提取預測抗血管生成肽
Anti-angiogenic Peptides Prediction Using CNN Based Feature Extraction
Advisor: 張天豪
Chang, Tien-Hao
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2023
Academic year of graduation: 111
Language: Chinese
Number of pages: 53
Keywords (Chinese): 卷積神經網路, 特徵提取, 抗血管生成肽
Keywords (English): CNN, Feature Extraction, Anti-angiogenic Peptide
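  • An anti-angiogenic peptide (AAP) is a short chain of amino acids that inhibits angiogenesis by binding to vascular endothelial growth factor, thereby suppressing tumor growth and spread. AAPs have been shown to be a promising cancer treatment and can be combined with other therapies to improve efficacy. Developing AAPs requires screening candidate peptides, and computational methods can screen large numbers of candidates quickly.
    With the rapid growth of machine learning across many fields in recent years, many papers have used machine learning to predict AAPs. This study divides all features used in those papers into two categories: sequential and non-sequential. Sequential features retain positional information, meaning their order cannot be interchanged. Non-sequential features, by contrast, lack positional information and carry mostly aggregate information. In those papers, some sequential features encoded as two-dimensional vectors must first be stacked into one-dimensional vectors before classification. Stacking, however, causes the vector to lose information along the sequence direction, so how to handle sequential features correctly is the main focus of this study.
    This study proposes convolutional neural network (CNN)-based feature extraction for predicting AAPs. The model replaces stacking with a CNN, which makes better use of sequence properties, to extract features. Compared with other AAP prediction studies, the model achieves the best accuracy, sensitivity, specificity, Matthews correlation coefficient, and area under the curve reported to date.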
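The distinction between sequential and non-sequential features, and the effect of stacking, can be illustrated with a minimal NumPy sketch. It is not the thesis's model: the peptides, the single hand-made filter, and the helper names `one_hot`, `aac`, and `conv1d_gap` are all assumptions chosen for illustration. It shows one way to read the argument: a flattened 2D encoding ties every value to an absolute position, whereas a filter slid along the sequence axis followed by global average pooling responds to a local pattern wherever it occurs.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(peptide):
    """Sequential feature: one row per position, shape (length, 20)."""
    mat = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        mat[pos, AA_INDEX[aa]] = 1.0
    return mat


def aac(peptide):
    """Non-sequential feature: amino acid composition, shape (20,)."""
    return one_hot(peptide).mean(axis=0)


def conv1d_gap(x, kernels):
    """Slide each kernel along the sequence axis, apply ReLU, then
    average over positions (global average pooling).

    x: (length, channels); kernels: (n_filters, width, channels).
    Returns one value per filter, shape (n_filters,).
    """
    n_filters, width, _ = kernels.shape
    steps = x.shape[0] - width + 1
    out = np.empty((steps, n_filters))
    for t in range(steps):
        window = x[t:t + width]  # (width, channels)
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0).mean(axis=0)


# Hypothetical toy peptides: the motif "ACD" shifted by one position.
a, b = "GACD", "ACDG"

# Composition ignores order: a peptide and its reverse look identical.
same_composition = np.allclose(aac("ACDKA"), aac("AKDCA"))

# Stacking (flattening) the 2D encoding: the shifted motif shares no active entry.
flat_overlap = float(one_hot(a).reshape(-1) @ one_hot(b).reshape(-1))

# A single width-3 filter shaped like the motif (an assumption, not a trained
# weight) gives the same pooled feature for both peptides.
motif_kernel = one_hot("ACD")[np.newaxis]  # (1, 3, 20)
feat_a = conv1d_gap(one_hot(a), motif_kernel)
feat_b = conv1d_gap(one_hot(b), motif_kernel)

print(same_composition, flat_overlap, feat_a, feat_b)
```

In this toy case the flattened vectors of the two peptides have zero overlap even though they contain the same motif, while the pooled convolutional feature is identical for both. A trained CNN would learn its filters from data rather than use a hand-made one.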

    An anti-angiogenic peptide (AAP) is a short chain of amino acids that binds to vascular endothelial growth factor to inhibit angiogenesis, which in turn inhibits tumor growth and spread. Such peptides have been shown to be a promising treatment for cancer and can be used in combination with other therapies to improve therapeutic efficacy. Developing AAPs requires screening candidates, and computational methods can screen large numbers of them quickly.
    With the recent rapid growth of machine learning in various fields, many papers have used machine learning to predict AAPs. We classify all features used in these papers into two categories: sequential and non-sequential. Sequential features retain positional information, meaning that the order of the features is not interchangeable. Conversely, non-sequential features lack positional information and carry mostly aggregate characteristics. In these papers, some sequential features encoded as 2D vectors must be concatenated into 1D vectors before classification. However, concatenation loses the sequential information of these features. Therefore, how to handle sequential features properly is the main focus of this study.
    In this study, we propose Convolutional Neural Network (CNN)-based feature extraction to predict AAPs. Instead of concatenating sequential features, the model uses a CNN, which is better able to exploit their sequential properties. Compared to other AAP prediction studies, this study achieves the best accuracy, sensitivity, specificity, Matthews correlation coefficient, and area under the curve.
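The five evaluation metrics named above have standard definitions for binary classification. The sketch below computes them in plain Python. The confusion-matrix counts and scores in the example are hypothetical, not results from the thesis, and the function names `binary_metrics` and `auc` are assumptions for illustration.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and Matthews correlation
    coefficient (MCC) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": accuracy, "SN": sensitivity, "SP": specificity, "MCC": mcc}


def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, for illustration only.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
area = auc([0.9, 0.8, 0.3], [0.7, 0.2, 0.1])
print(metrics, area)
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for small peptide datasets. A rank-based formulation or a library routine would be preferable for larger ones.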

    List of Figures
    List of Tables
    Chapter 1 Introduction
    Chapter 2 Related Work
      2.1 Anti-angiogenic Peptides (AAP)
      2.2 Studies on Anti-angiogenic Peptide Prediction
        2.2.1 Studies Based on Traditional Machine Learning
          2.2.1.1 AntiAngioPred
          2.2.1.2 TargetAntiAngio
        2.2.2 Studies Based on Deep Learning
          2.2.2.1 AAPred-CNN
          2.2.2.2 Pep-CNN
      2.3 Tree-based Models
        2.3.1 Random Forest (RF)
        2.3.2 Gradient Boosting Decision Tree (GBDT)
        2.3.3 eXtreme Gradient Boosting (XGBoost)
      2.4 Neural Network (NN)
        2.4.1 Convolutional Neural Network (CNN)
          2.4.1.1 Convolutional Layer
          2.4.1.2 Batch Normalization (BN)
          2.4.1.3 Global Average Pooling Layer
          2.4.1.4 Skip Connection
    Chapter 3 Materials and Methods
      3.1 Feature Representation
        3.1.1 Sequential Features
          3.1.1.1 Atomic Composition (AC)
          3.1.1.2 Amino Acid Substitution Matrix (BLOSUM62)
          3.1.1.3 Encoding Based on Grouped Weight (EBGW)
          3.1.1.4 Enhanced Grouped Amino Acid Composition (EGAAC)
          3.1.1.5 One-hot Encoding (One-hot)
          3.1.1.6 Physicochemical Properties (PCP)
        3.1.2 Non-sequential Features
          3.1.2.1 Amino Acid Composition (AAC)
          3.1.2.2 Dipeptide Composition (DC)
          3.1.2.3 Tripeptide Composition (TC)
          3.1.2.4 Pseudo Amino Acid Composition (PseAAC)
          3.1.2.5 Amphiphilic Pseudo Amino Acid Composition (Am-PseAAC)
      3.2 Feature Selection
        3.2.1 Sequential Backward Selection (SBS)
        3.2.2 Model Architecture
        3.2.3 Validation Set
      3.3 Two-stage Hybrid Model
        3.3.1 First Stage
        3.3.2 Second Stage
        3.3.3 Ensemble Learning
      3.4 Model Interpretability
        3.4.1 Shapley Value
        3.4.2 Class Activation Mapping (CAM)
        3.4.3 Visualization
    Chapter 4 Results
      4.1 Datasets
        4.1.1 Benchmark Dataset
        4.1.2 NT15 Dataset
        4.1.3 Self-built Validation Set
      4.2 Performance Metrics
      4.3 Performance Evaluation Methods
      4.4 Feature Selection (Experiments)
      4.5 Performance Comparison with Other Prediction Models
      4.6 Experiments and Analysis
        4.6.1 Comparison with TargetAntiAngio
        4.6.2 Ablation Tests
          4.6.2.1 Feature Extractor
          4.6.2.2 Ensemble Learning
          4.6.2.3 Extended Discussion of Sequential Features
      4.7 Visualization
        4.7.1 Visualization of Training Data
        4.7.2 Anti-angiogenic Peptide Patterns
    Chapter 5 Conclusion
    Chapter 6 References

