| Graduate student: | Yang, Huan-Kai (楊喚凱) |
|---|---|
| Thesis title: | A Convolutional Neural Network to Predict Histone Modification from DNA Sequence and Methylation Data (以卷積神經網路透過基因序列及甲基化數據預測組蛋白修飾) |
| Advisor: | Paul Horton (賀保羅) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 |
| Language: | English |
| Number of pages: | 43 |
| Keywords: | Convolutional Neural Network, Data Imputation, Histone Modification, DNA Methylation, Specificity of Cell Line |
Many advanced measurement methods have been developed to determine the exact locations of histone modifications. However, measuring every histone modification in every cell line experimentally is time-consuming and expensive. To overcome this problem, many recent studies have proposed methods that predict histone modifications from DNA sequence. DNA sequence alone, however, carries no cell line-specific information, which creates a major bottleneck when predicting histone modifications across different cell lines. This study therefore combines DNA sequence with methylation data and uses a convolutional neural network to improve the prediction of histone modifications in different cell lines.
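To make the idea of combining DNA sequence with methylation data concrete, here is a minimal sketch of one possible input encoding. The abstract does not state the encoding actually used, so the layout below (four one-hot base rows plus one methylation row) and the names `encode_window` and `BASES` are assumptions made purely for illustration.

```python
from typing import List, Sequence

BASES = "ACGT"  # hypothetical channel order, chosen for this illustration


def encode_window(seq: str, methylation: Sequence[float]) -> List[List[float]]:
    """Return a 5 x L matrix: four one-hot base rows plus one methylation row."""
    if len(seq) != len(methylation):
        raise ValueError("sequence and methylation track must have equal length")
    rows = [[0.0] * len(seq) for _ in range(5)]
    for pos, (base, level) in enumerate(zip(seq.upper(), methylation)):
        if base in BASES:  # unknown bases such as N keep an all-zero column
            rows[BASES.index(base)][pos] = 1.0
        rows[4][pos] = float(level)  # assumed: methylation fraction in [0, 1]
    return rows


# Example window: the methylation values are made up for illustration.
encoded = encode_window("ACGTN", [0.0, 0.9, 0.0, 0.1, 0.0])
```

In this layout the methylation row is the only part of the input that can differ between cell lines for the same genomic window, which is the motivation the abstract gives for adding it.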
In this research, an "inception module" is added to the convolutional neural network to capture sequence features of different lengths simultaneously. Next, we use a "stratified mini-batch" mechanism to train the model effectively on the imbalanced dataset. Finally, we train the model on the combined DNA sequence and methylation data using these methods, and extract feature vectors carrying cell line-specific information to predict histone modifications.
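The core idea of an inception-style block, running filters of several widths in parallel over the same input and stacking their outputs, can be sketched in plain Python. This is a simplified illustration and not the thesis architecture: the kernels here are fixed rather than learned, the input has a single channel, and the helper names `conv1d_same` and `inception_block` are hypothetical.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Cross-correlate with zero padding so the output length equals the input length."""
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch assumes odd kernel widths")
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def inception_block(signal: Sequence[float],
                    kernels: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply every kernel to the same input, apply ReLU, and stack results as channels."""
    return [[max(0.0, v) for v in conv1d_same(signal, k)] for k in kernels]


signal = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
kernels = [[1.0], [1.0, 1.0, 1.0], [1.0] * 5]  # widths 1, 3 and 5 in parallel
features = inception_block(signal, kernels)
```

Because every branch keeps the input length, the outputs can be stacked position by position, so later layers see short-range and longer-range patterns side by side.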
In our experiments, the model we designed outperforms the baseline model, with the most pronounced improvement on the most imbalanced class. We also analyze how different input data affect performance. The results show that a model trained on both DNA sequence and methylation data performs significantly better on every histone modification than a model trained on DNA sequence alone or on methylation data alone. Finally, visualization tools show that the model does learn cell line-specific information.
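Since the imbalanced classes are where the improvement is largest, it may help to see what stratified mini-batching does in the simplest case. The sketch below assumes a single binary label per sample and deals each class round-robin across batches; the thesis setting, with several histone modifications, may well use a more elaborate multi-label scheme, and the function name `stratified_batches` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_batches(labels: Sequence[int], n_batches: int,
                       seed: int = 0) -> List[List[int]]:
    """Split sample indices into batches whose class mix mirrors the full dataset."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    batches: List[List[int]] = [[] for _ in range(n_batches)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every batch receives its share.
        for k, idx in enumerate(indices):
            batches[k % n_batches].append(idx)
    for batch in batches:
        rng.shuffle(batch)  # avoid a fixed class order inside a batch
    return batches


# Toy dataset: 4 positives among 40 samples (10% positive rate).
labels = [1] * 4 + [0] * 36
batches = stratified_batches(labels, n_batches=4)
```

With plain random batching, some batches of this toy dataset would contain no positive sample at all; here each of the four batches contains exactly one, so every gradient step sees the minority class.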
Based on these results and analyses, we confirm that an improved convolutional neural network combined with DNA sequence and methylation data helps predict histone modifications in different cell lines.