
Graduate student: Fu, Hsin-Yi (傅馨儀)
Thesis title: Building Interaction Models for High Dimensional Data with Applications to a SNP Genome-Wide Association Study and a Raman Spectrum Study
Advisor: Jeng, Shuen-Lin (鄭順林)
Degree: Master
Department: College of Management - Department of Statistics
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: English
Number of pages: 137
Keywords: high dimensional data, interaction effect, multi-factor dimensionality reduction, logic regression, multivariate adaptive regression splines, the least absolute shrinkage and selection operator, genome-wide association study, single nucleotide polymorphism, Raman spectrum
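  • In this study, we propose new ideas to improve the two-stage procedure proposed by Ho (2009): interaction searching followed by model building. We use the improved two-stage procedure to build a model with interaction terms for high dimensional data. Two types of data are considered: single nucleotide polymorphism (SNP) data, in which all variables are discrete, and Raman spectrum data, in which all variables are continuous.
    Multi-factor dimensionality reduction (MDR) and logic regression are two popular methods for detecting whether interactions between SNPs are related to the occurrence of a disease. However, when Ho (2009) applied these two methods in a genome-wide association study, only interactions between neighboring positions could be considered. We propose two new ideas to remedy this shortcoming. The first idea is to split the factor combinations required by MDR into smaller sets, which lets MDR handle more variables at once. The second idea is to use sliding windows and merge small blocks, which lets us find interactions between positions that are farther apart.
    In the analysis of the Raman spectrum data, we use multivariate adaptive regression splines (MARS) to search for important interactions. After the interaction search, we apply the least absolute shrinkage and selection operator (LASSO) to each of the two data sets to build a model from the interactions found.
    In the SNP data, we found the following genes that previous studies have also considered to be related to rheumatoid arthritis: HLA-DQB1, TAP2, MICA, NFKBIL1, HLA-DQA2, LST1, ABCA7, and PTPN22. We also found some genes not reported in previous studies, offered here for researchers' reference, for example KIAA0319, PSORS1C1, KIAA0329, and MYH6. In the Raman spectrum data, MARS already provided good predictive power at the first stage. At the second stage, LASSO further removed many interaction terms while maintaining the predictive power of the model.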

    In this study, we propose new ideas to improve the two-stage procedure suggested by Ho (2009). We then use the improved two-stage procedure to construct models with interaction terms for high dimensional data. We consider two types of predictors: discrete predictors for the single nucleotide polymorphism (SNP) data and continuous predictors for the Raman spectrum (RS) data.
    Multi-factor dimensionality reduction (MDR) and logic regression are popular methods for identifying SNP-SNP interactions associated with a disease. However, in a genome-wide association study, Ho (2009) applied these two methods with a sliding window, so only interactions between SNPs at neighboring physical positions were considered. We propose two ideas to overcome this limitation. The first idea is to split the combination terms into smaller sets, which allows MDR to search for interactions over a larger predictor region. The second idea is to use sliding windows together with combined blocks, which finds interactions between SNPs whose physical positions are farther apart.
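    The abstract does not spell out how the blocks are combined or how the combination terms are split, so the following is only a minimal sketch of one possible reading of the two ideas. The function names, the block size, the `max_gap` limit, and the set size are all hypothetical choices made for illustration. SNP indices are partitioned into consecutive blocks, cross-block SNP pairs are generated for blocks that lie within a given distance of each other, and the resulting candidate pairs are split into smaller sets that could each be passed to an MDR run.

```python
from itertools import combinations, islice


def make_blocks(n_snps, block_size):
    """Partition SNP indices 0..n_snps-1 into consecutive blocks."""
    return [list(range(start, min(start + block_size, n_snps)))
            for start in range(0, n_snps, block_size)]


def combined_block_pairs(blocks, max_gap):
    """Yield SNP pairs taken from two different blocks that are at most
    max_gap blocks apart (an assumed way of 'combining' blocks)."""
    for i, j in combinations(range(len(blocks)), 2):
        if j - i <= max_gap:
            for a in blocks[i]:
                for b in blocks[j]:
                    yield (a, b)


def split_into_sets(candidates, set_size):
    """Split a stream of candidate combinations into smaller lists."""
    it = iter(candidates)
    while True:
        chunk = list(islice(it, set_size))
        if not chunk:
            return
        yield chunk


# Toy example: 12 SNPs in blocks of 3, combining blocks up to 2 apart.
blocks = make_blocks(n_snps=12, block_size=3)
pairs = list(combined_block_pairs(blocks, max_gap=2))
sets = list(split_into_sets(pairs, set_size=10))
```

    In this toy setting the pair (0, 8) is a candidate because its blocks are two apart, while (0, 9) is not. Pairs inside a single block are left out here on the assumption that an ordinary sliding window already covers them.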
    In the RS analysis, we apply multivariate adaptive regression splines (MARS) to search for important interaction effects. After this first stage of model building for the SNP and RS data, we use the least absolute shrinkage and selection operator (LASSO) in the second stage to select the important terms of the model.
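    As a reminder of what a MARS interaction term looks like, the sketch below builds one from hinge functions in the sense of Friedman (1991). It only evaluates a given term; it does not perform the forward and backward search that MARS uses to choose variables and knots. The helper names, the knots, and the two intensity values are made up for illustration and are not taken from the thesis.

```python
def hinge(x, knot, direction=+1):
    """Hinge basis function: max(0, x - knot) for direction=+1,
    max(0, knot - x) for direction=-1."""
    return max(0.0, direction * (x - knot))


def interaction_term(row, spec):
    """Product of hinge functions, one per (variable index, knot, direction)."""
    value = 1.0
    for var, knot, direction in spec:
        value *= hinge(row[var], knot, direction)
    return value


# Hypothetical intensities at two Raman shifts (made-up numbers).
spectrum = [0.8, 0.3]
# Term: max(0, x0 - 0.5) * max(0, 0.4 - x1)
spec = [(0, 0.5, +1), (1, 0.4, -1)]
term = interaction_term(spectrum, spec)
```

    The term is nonzero only when every hinge in the product is active, which is how a product of hinges expresses an interaction that is local to one region of the predictor space.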
    In the SNP data, our study confirms HLA-DQB1, TAP2, MICA, NFKBIL1, HLA-DQA2, LST1, ABCA7, and PTPN22, which previous research has shown to be associated with rheumatoid arthritis. We also identify other novel genes, e.g., KIAA0319, PSORS1C1, KIAA0329, and MYH6, that could be critical candidate genes for biologists. In the RS data, MARS is applied at the first stage and the resulting model has good predictive power. LASSO is then applied at the second stage; it reduces the number of features while maintaining the predictive power.
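    The second-stage selection can be illustrated with a small coordinate-descent LASSO. This is a sketch under simplifying assumptions, not the thesis's implementation: it uses squared-error loss rather than the logistic loss a case-control problem would call for, and the data are a made-up two-factor design in which the response depends mostly on the interaction `x1*x2`. The function names and the penalty value `lam=0.5` are illustrative.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values in [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - Xb for the starting point b = 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the partial residual (b_j removed).
            rho = sum(c * (r + c * b[j]) for c, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam) / norm
            resid = [r - c * (new_bj - b[j]) for c, r in zip(col, resid)]
            b[j] = new_bj
    return b


# Toy candidate terms from a first stage: x1, x2 and their interaction.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
X = [[a, b, a * b] for a, b in zip(x1, x2)]
y = [2.0 * a * b + 0.1 * a for a, b in zip(x1, x2)]

names = ["x1", "x2", "x1*x2"]
coef = lasso_cd(X, y, lam=0.5)
selected = [name for name, c in zip(names, coef) if c != 0.0]
```

    Because the three columns are orthogonal in this design, the LASSO solution is the soft-thresholded least-squares fit: the weak main effect of `x1` is shrunk to zero and only the interaction term remains in `selected`, with its coefficient shrunk from 2 to 1.5.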

    1. INTRODUCTION
      1.1 Background and Motivation
      1.2 Literature Review
        1.2.1 Multi-factor Dimensionality Reduction
        1.2.2 Logic Regression
        1.2.3 Multivariate Adaptive Regression Splines
        1.2.4 Logistic Group Lasso Regression
        1.2.5 Raman Spectrum
      1.3 Research Procedures
        1.3.1 Discrete Type Data: SNP data
        1.3.2 Continuous Type Data: RS data
      1.4 Thesis Framework
    2. METHODOLOGY
      2.1 Multi-factor Dimensionality Reduction
      2.2 Logic Regression
      2.3 Multivariate Adaptive Regression Splines
      2.4 Logistic Group Lasso Regression
      2.5 BIMBAM
    3. RESEARCH METHODS
      3.1 Discrete Type
        3.1.1 SNP Data Description
        3.1.2 Data Preprocessing
        3.1.3 Interaction Searching and Model Building
        3.1.4 New Ideas
      3.2 Continuous Type
        3.2.1 RS Data Description
        3.2.2 Analysis Method
        3.2.3 Interaction Searching and Model Building
        3.2.4 Other Issues
    4. RESULTS OF DISCRETE TYPE DATA
      4.1 Multi-factor Dimensionality Reduction: Analysis of Idea-I
        4.1.1 Analysis Setting
        4.1.2 Analysis Results
        4.1.3 Identifying Selected Genes
      4.2 Multi-factor Dimensionality Reduction: Analysis of Idea-I plus Idea-II
        4.2.1 Analysis Setting
        4.2.2 Analysis Results
        4.2.3 Identifying Selected Genes
      4.3 Logic Regression
        4.3.1 Analysis Setting
        4.3.2 Analysis Results
        4.3.3 Identifying Selected Genes
      4.4 Summary for Results of Discrete Type Data
      4.5 Computation Time
        4.5.1 MDR
        4.5.2 Logic Regression
    5. RESULTS OF CONTINUOUS TYPE DATA
      5.1 Conversion of Variable Type
      5.2 The Classification of All Tissue Types
        5.2.1 Analysis Setting
        5.2.2 Analysis Result
        5.2.3 Comparison with Decision Tree
      5.3 The Classification of Normal and Abnormal Tissue Types
        5.3.1 Analysis Setting
        5.3.2 Analysis Result
        5.3.3 Comparison with Decision Tree
        5.3.4 The Classification of Abnormal Tissue Types
      5.4 Summary for Results of Continuous Type Data
    6. CONCLUSIONS AND FUTURE STUDIES
      6.1 Conclusions
      6.2 Future Studies
        6.2.1 SNP data
        6.2.2 RS data
    Appendix
    Appendix A. RA Database
    Appendix B. Results by Using Adjusted MDR Method
    Appendix C. Results by Using logicFS Method
    Appendix D. Results of the Classification of All Tissue Types
      D.1 Original Data
      D.2 Normalized Data
    Appendix E. Results of the Classification of Normal and Abnormal Tissue Types
      E.1 Original Data
      E.2 Normalized Data
    Appendix F. Results of the Classification of Abnormal Tissue Types
      F.1 Original Data
      F.2 Normalized Data

    Amos, C. I., Chen, W. V., Seldin, M. F., Remmers, E. F., Taylor, K. E., Criswell, L. A., Lee, A. T., Plenge, R. M., Kastner, D. L., and Gregersen, P. K. (2009), “Data for Genetic Analysis Workshop 16 Problem 1, Association Analysis of Rheumatoid Arthritis Data,” BMC Proceedings, 3(Suppl 7):S2.
    Andrew, A. S., Nelson, H. H., Kelsey, K. T., Moore, J. H., Meng, A. C., Casella, D. P., Tosteson, T. D., Schned, A. R., and Karagas, M. R. (2006), “Concordance of Multiple Analytical Approaches Demonstrates a Complex Relationship between DNA Repair Gene SNPs, Smoking and Bladder Cancer Susceptibility,” Carcinogenesis, 27(5), 1030-1037.
    BIMBAM. http://stephenslab.uchicago.edu/software.html
    Bozdogan, H. (2003), Statistical Data Mining and Knowledge Discovery, CRC Press.
    Carniel, E., Taylor, M. R., Sinagra, G., Di, L. A., Ku, L., Fain, P. R., Boucek, M. M., Cavanaugh, J., Miocic, S., Slavov, D., Graw, S. L., Feiger, J., Zhu, X. Z., Dao, D., Ferguson, D. A., Bristow, M. R., and Mestroni, L. (2005), “Alpha-myosin Heavy Chain: A Sarcomeric Gene Associated With Dilated and Hypertrophic Phenotypes of Cardiomyopathy,” Orphanet Journal of Rare Diseases, 112, 54-59.
    Cho, S., Kim, H., Oh, S., Kim, K., and Park, T. (2009), “Elastic-Net Regularization Approaches for Genome-Wide Association Studies of Rheumatoid Arthritis,” BMC Proceedings, 3(Suppl 7):S25.
    Cho, Y. M., Ritchie, M. D., Moore, J. H., Park, J. Y., Lee, K. U., Shin, H. D., Lee, H. K., and Park, K. S. (2004), “Multifactor-Dimensionality Reduction Shows a Two-locus Interaction Associated with Type 2 Diabetes Mellitus,” Diabetologia, 47, 549-554.
    Chung, Y., Lee, S. Y., Elston, R. C., and Park, T. (2007), “Odds Ratio Based Multifactor-Dimensionality Reduction Method for Detecting Gene-Gene Interactions,” Bioinformatics, 23, 71-76.
    Cook, N. R., Zee, R. Y., and Ridker, P. M. (2004), “Tree and Spline Based Association Analysis of Gene-gene Interaction Models for Ischemic Stroke,” Statistics in Medicine, 23, 1439-1453.
    D’Angelo, G. M., Rao, D., and Gu, C. C. (2009), “Combining Least Absolute Shrinkage and Selection Operator (LASSO) and Principal-Components Analysis for Detection of Gene-Gene Interactions in Genome-Wide Association Studies,” BMC Proceedings, 3(Suppl 7):S62.
    Friedman, J. H. (1991), “Multivariate Adaptive Regression Splines,” The Annals of Statistics, 19, 1-141.
    Ge, D. L., Zhu, H. D., Huang, Y., Treiber, F. A., Harshfield, G. A., Snieder, H., and Dong, Y. (2007), “Multilocus Analyses of Renin-Angiotensin-Aldosterone System Gene Variants on Blood Pressure at Rest and During Behavioral Stress in Young Normotensive Subjects,” Hypertension, 49, 107-112.
    Haploview. http://www.broadinstitute.org/haploview/haploview
    Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning, Springer-Verlag New York, LLC.
    Ho, K. J. (2009), “Selections of Models with Interaction Effects for High Dimensional Data – Application to a SNP Genome-Wide Association Study,” Master Thesis, National Cheng Kung University.
    Jemal, A., Tiwari, R. C., Murray, T., Ghafoor, A., Samuels, A., Ward, E., Feuer, E. J., and Thun, M. J. (2004), “Cancer statistics, 2004,” CA: A Cancer Journal for Clinicians, 54, 8-29.
    Kanter, E. M., Majumder, S., Vargis, E., Robichaux-Viehoever, A., Kanter, G. J., Shappell, H., Jones, H. W., and Mahadevan-Jansen, A. (2009), “Multiclass Discrimination of Cervical Precancers Using Raman Spectroscopy,” Journal of Raman Spectroscopy, 40, 205-211.
    Liang, X., Gao, Y., Lam, T. K., Li, Q., Falk, C., Yang, X. R., Goldstein, A. M., and Goldin, L. R. (2009), “Identifying Rheumatoid Arthritis Susceptibility Genes Using High-Dimensional Methods,” BMC Proceedings, 3(Suppl 7):S79.
    Lieber, C. A., Majumder, S. K., Billheimer, D., Ellis, D. L., and Mahadevan-Jansen, A. (2008), “Raman microspectroscopy for skin cancer detection in vitro,” Journal of Biomedical Optics, 13(2), 024013.
    Lin, H. Y., Hall, M. C., Clark, P. E., Phillips, J. J., and Hu, J. J. (2006), “Gene-Gene Interactions of DNA-Repair nsSNPs in Prostate Cancer Recurrence,” The 97th Annual Meeting of American Association for Cancer Research, Washington, DC.
    Lin, H. Y., Wang, W., Liu, Y. H., Soong, S. J., York, T. P., Myers, L., and Hu, J. J. (2008), “Comparison of Multivariate Adaptive Regression Splines and Logistic Regression in Detecting SNP-SNP Interactions and their Application in Prostate Cancer,” Journal of Human Genetics, 53, 802-811.
    Lou, X. Y., Chen, G. B., Yan, L., Ma, J. Z., Zhu, J., Elston, R. C., and Li, M. D. (2007), “A Generalized Combinatorial Approach for Detecting Gene-by-Gene and Gene-by-Environment Interactions with Application to Nicotine Dependence,” American Journal of Human Genetics, 80, 1125-1137.
    MacCluer, J. W., Amos, C. I., Gregersen, P. K., Heard-Costa, N., Lee, M., Kraja, A. T., Borecki, I. B., Cupples, L. A., and Almasy, L. (2009), “Genetic Analysis Workshop 16: Introduction to Workshop Summaries,” Genetic Epidemiology, 33(Suppl 1):S1-S7.
    Meier, L., van de Geer, S., and Bühlmann, P. (2008), “The Group Lasso for Logistic Regression,” Journal of the Royal Statistical Society. Series B, 70(1), 53-71.
    Mukherjee, O., Sanapala, K. R., Anbazhagana, P., and Ghosh, S. (2009), “Evaluating Epistatic Interaction Signals in Complex Traits Using Quantitative Traits,” BMC Proceedings, 3(Suppl 7):S82.
    Qiao, B., Huang, C. H., Cong, L., Xie, J., Lo, S. H., and Zheng, T. (2009), “Genome-Wide Gene-Based Analysis of Rheumatoid Arthritis Associated Interaction with PTPN22 and HLA-DRB1,” BMC Proceedings, 3(Suppl 7):S132.
    Qin, S., Zhao, X., Pan, Y., Liu, J., Feng, G., Fu, J., Bao, J., Zhang, Z., and He, L. (2005), “An Association Study of the N-Methyl-D-Aspartate Receptor NR1 Subunit Gene (GRIN1) and NR2B Subunit Gene (GRIN2B) in Schizophrenia with Universal DNA Microarray,” European Journal of Human Genetics, 13, 807-814.
    Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F., and Moore, J. H. (2001), “Multifactor Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer,” The American Journal of Human Genetics, 69, 138-147.
    Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), “Logic Regression,” Journal of Computational and Graphical Statistics, 12(3), 475-511.
    Scheet, P. and Stephens, M. (2006), “A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase,” American Journal of Human Genetics, 78, 629-644.
    Schwender, H. and Ickstadt, K. (2008), “Identification of SNP Interactions Using Logic Regression,” Biostatistics, 9(1), 187-198.
    So, H. C., Fong, P. Y., Chen, R. Y. L., Hui, T. C. K., Ng, M. Y. M., Cherny, S. S., Mak, W. W. M., Cheung, E. F. C., Chan, R. C. K., Chen, E. Y. H., Li, T., and Sham, P. C. (2010), “Identification of Neuroglycan C and Interacting Partners as Potential Susceptibility Genes for Schizophrenia in a Southern Chinese Population,” American Journal of Medical Genetics, Part B, 153, 103-113.
    Soares, M. L., Coelho, T., Sousa, A., Batalov, S., Conceicao, L., Sales-Luis, M. L., Ritchie, M. D., Williams, S. M., Nievergelt, C. M., Schork, N. J., Saraiva, M. J., and Buxbaum, J. N. (2005), “Susceptibility and Modifier Genes in Portuguese Transthyretin V30M Amyloid Polyneuropathy: Complexity in a Single-Gene Disease,” Human Molecular Genetics, 14, 543-553.
    Stone, C. J., Bose, S. and Kooperberg, C. (1997), “Polychotomous Regression,” Journal of the American Statistical Association, 92, 117-127.
    Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267-288.
    Tsai, C. T., Lai, L. P., Lin, J. L., Chiang, F. T., Hwang, J. J., Ritchie, M. D., Moore, J. H., Hsu, K. L., Tseng, C. D., Liau, C. S., and Tseng, Y. Z. (2004), “Renin-Angiotensin System Gene Polymorphisms and Atrial Fibrillation,” Circulation, 109, 1640-1646.
    Van, E. B. O., Hu, J. J., Levine, E. A., Mosley, L. J., Case, L. D., Lin, H. Y., Knight, S. N., Perrier, N. D., Rubin, P., Sherrill, G. B., Shaw, C. S., Carey, L. A., Sawyer, L. R., Allen, G. O., Milikowski, C., Willingham, M. C., and Miller, M. S. (2008), “Polymorphisms in Drug Metabolism Genes, Smoking, and p53 Mutations in Breast Cancer,” Molecular Carcinogenesis, 47, 88-99.
    Wikipedia: The free encyclopedia. (2010). FL: Wikimedia Foundation, Inc. Retrieved June 29, 2010, from http://www.wikipedia.org/
    Wu, Z., Aporntewan, C., Ballard, D. H., Lee, J. Y., Lee, J. S., and Zhao, H. (2009), “Two-Stage Joint Selection Method to Identify Candidate Markers from Genome-Wide Association Studies,” BMC Proceedings, 3(Suppl 7):S62.
    Xu, J., Lowery, J., Wiklund, F., Sun, J., Lindmark, F., Hsu, F. C., Dimitrov, L., Chang, B., Tumer, A. R., Adami, H. O., Suh, E., Moore, J. H., Zheng, S. L., Isaacs, W. B., Trent, J. M., and Gronberg, H. (2005), “The Interaction of Four Inflammatory Genes Significantly Predicts Prostate Cancer Risk,” Cancer Epidemiology Biomarkers and Prevention, 14, 2563-2568.
    York, T. P., Eaves, L. J., and van den Oord, E. J. (2006), “Multivariate Adaptive Regression Splines: a Powerful Method for Detecting Disease-Risk Relationship Differences among Subgroups,” Statistics in Medicine, 25, 1355-1367.
    Yuan, M. and Lin, Y. (2006), “Model Selection and Estimation in Regression with Grouped Variables,” Journal of the Royal Statistical Society. Series B (Methodological), 68(1), 49-67.
    Zabaleta, J., Lin, H. Y., Sierra, R. A., Hall, M. C., Clark, P. E., Sartor, O. A., Hu, J. J., and Ochoa, A. C. (2008), “Interactions of Cytokine Gene Polymorphisms in Prostate Cancer Risk,” Carcinogenesis, 29, 573-578.

    Full-text availability: on campus, public from 2013-07-30; off campus, public from 2013-07-30.