Graduate student: | 林晉昇 Lin, Chin-Sheng |
---|---|
Thesis title: | 以降維分析 DNA 甲基化與肺癌的關聯性 Extracting and Analyzing Latent Space from Lung Cancer DNA Methylation Data with Dimensionality Reduction |
Advisor: | 賀保羅 Paul Horton |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2020 |
Academic year of graduation: | 108 |
Language: | English |
Number of pages: | 40 |
Keywords (Chinese): | 肺癌、腫瘤、DNA 甲基化、降維、分類 |
Keywords (English): | Lung Cancer, Tumor, DNA Methylation, Dimensionality Reduction, Classification |
Lung cancer is the leading cause of cancer death in Taiwan and occurs mostly in people over 55 years of age. Based on its biological characteristics and clinical presentation, it is divided into two categories: small-cell lung carcinoma and non-small-cell lung carcinoma. Lung adenocarcinoma belongs to the latter and accounts for nearly 70% of lung cancers. In recent decades, many studies have found abnormal gene methylation in lung cancer patients, and some methylation changes have been associated with poor prognosis or tumor recurrence. In this study, we explore the relationship between DNA methylation and lung cancer, analyze the differences between tumor and normal samples, and combine dimensionality reduction with classification on DNA methylation data to predict tumor status.
We analyze the DNA methylation data of lung adenocarcinoma patients from The Cancer Genome Atlas (TCGA), compare tumor and normal samples, and identify relationships between DNA methylation and lung cancer. The research has two parts. The first part uses DNA methylation data to predict a patient's tumor status: we first apply statistical feature selection to choose the methylation sites that best distinguish tumor from normal samples, then use dimensionality reduction to project the methylation data into a lower-dimensional space, and finally train a classifier to predict the tumor status of each sample. The second part analyzes the differences between tumor and normal samples, including the distribution of DNA methylation in each group.
In the experiments, we use the DNA methylation data of 507 patients from the public TCGA lung adenocarcinoma dataset, comprising 475 tumor and 32 normal samples. We use statistical methods to select methylation sites that distinguish tumor from normal samples, and confirm that some of these sites are indeed associated with lung cancer or with other diseases. Applying dimensionality reduction and machine-learning classification to the DNA methylation data, we predict tumor status with high accuracy. Further analysis of the methylation differences between tumor and normal samples confirms that DNA methylation is strongly associated with lung cancer.
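The first part of the study (statistical feature selection, then dimensionality reduction, then a classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual implementation: the abstract does not name the specific methods, so Welch's t-statistic, PCA, and logistic regression are assumed stand-ins, and the data are synthetic values that only mimic the 475 tumor / 32 normal split. All function names and parameters (`make_synthetic_beta_values`, `select_sites`, `run_pipeline`, `k`, `n_components`) are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def make_synthetic_beta_values(n_tumor=475, n_normal=32, n_sites=2000,
                               n_shifted=50, seed=0):
    """Synthetic stand-in for methylation values in [0, 1] (not TCGA data)."""
    rng = np.random.default_rng(seed)
    X = rng.beta(2.0, 2.0, size=(n_tumor + n_normal, n_sites))
    y = np.array([1] * n_tumor + [0] * n_normal)  # 1 = tumor, 0 = normal
    # Assumption: the first n_shifted sites are more methylated in tumors.
    X[:n_tumor, :n_shifted] = np.clip(X[:n_tumor, :n_shifted] + 0.3, 0.0, 1.0)
    return X, y


def welch_t(a, b):
    """Per-site Welch t-statistic (unequal variances) between two groups."""
    var_a = a.var(axis=0, ddof=1) / a.shape[0]
    var_b = b.var(axis=0, ddof=1) / b.shape[0]
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(var_a + var_b)


def select_sites(X, y, k=100):
    """Indices of the k sites with the largest absolute Welch t-statistic."""
    t = welch_t(X[y == 1], X[y == 0])
    return np.argsort(-np.abs(t))[:k]


def run_pipeline(X, y, k=100, n_components=10, seed=0):
    """Select sites, reduce with PCA, classify; returns (sites, score)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # Feature selection and PCA are fitted on the training split only,
    # so the held-out samples do not leak into the selected sites.
    sites = select_sites(X_tr, y_tr, k)
    pca = PCA(n_components=n_components, random_state=seed)
    Z_tr = pca.fit_transform(X_tr[:, sites])
    Z_te = pca.transform(X_te[:, sites])
    # class_weight="balanced" compensates for the few normal samples.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(Z_tr, y_tr)
    return sites, balanced_accuracy_score(y_te, clf.predict(Z_te))


if __name__ == "__main__":
    X, y = make_synthetic_beta_values()
    sites, score = run_pipeline(X, y)
    print(f"selected {len(sites)} sites, balanced accuracy = {score:.3f}")
```

Because normal samples make up only about 6% of the data, plain accuracy would look high even for a classifier that always predicts "tumor"; the sketch therefore reports balanced accuracy and stratifies the split, which is a design choice of this example rather than something stated in the abstract.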