| Graduate student: | Ho, Kuan-Ju (何冠儒) |
|---|---|
| Thesis title: | Selections of Models with Interaction Effects for High Dimensional Data -- Application to a SNP Genome-Wide Association Study |
| Advisor: | Jeng, Shuen-Lin (鄭順林) |
| Degree: | Master |
| Department: | College of Management - Department of Statistics |
| Year of publication: | 2009 |
| Academic year of graduation: | 97 (ROC calendar) |
| Language: | English |
| Number of pages: | 132 |
| Keywords: | high dimensional data, interaction effects, genome-wide association study, single nucleotide polymorphism |
High dimensional data are now everywhere: text, sound, images, genomic data, and so on. One characteristic of such data is a small number of observations (n) but a large number of predictors (p). Fitting and selecting models on such data pose serious challenges to statisticians, and exploring interaction effects in high dimensional data is even harder.
In recent years, the amount of genomic data generated has increased dramatically. A common and important task in genetic association studies is to identify single nucleotide polymorphisms (SNPs) and SNP interactions associated with a disease.
In this study, we propose a two-stage procedure, interaction searching followed by model building, to construct a model with interaction effects for high dimensional data. In the interaction searching stage, we apply two methods, multifactor dimensionality reduction and logic regression, to filter features that potentially contribute to disease. In the second stage, we build models on the selected features using logistic group lasso regression.
The data set in this study was obtained, with permission, from a research institute. To protect subject privacy, the data we obtained had already been randomly permuted. The random permutation did not destroy the information needed to find significant main and interaction effects of SNPs; however, the SNP names attached to those effects are not the true SNP names.
One major finding is that important interaction features may also come from SNPs that are far apart in physical position. Furthermore, we demonstrate a procedure for mapping the identified SNPs to their genes and checking whether those genes belong to the Breast Cancer Database (BCD). When the data set is not randomly permuted, the identified important genes that do not belong to the BCD could provide candidate genes for further study by biologists.
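The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The screening stage uses a simplified multifactor-dimensionality-reduction-style score without cross-validation (so the scores are optimistic), and logic regression is omitted. The second stage fits a logistic group lasso by plain proximal gradient descent rather than a dedicated block coordinate solver. All function names, the one-hot coding of genotype pairs, the square-root group-size penalty weight, and the synthetic data are assumptions made for illustration only.

```python
import itertools
import numpy as np

def mdr_pair_score(x1, x2, y):
    """Balanced accuracy of an MDR-style high/low-risk rule for one SNP pair."""
    cell = 3 * x1 + x2  # index of the 3 x 3 genotype combination (0..8)
    cases = np.bincount(cell[y == 1], minlength=9)
    controls = np.bincount(cell[y == 0], minlength=9)
    ratio = (y == 1).sum() / (y == 0).sum()  # overall case:control ratio
    high_risk = cases > ratio * controls  # cells enriched for cases
    pred = high_risk[cell]
    sensitivity = pred[y == 1].mean()
    specificity = (~pred[y == 0]).mean()
    return 0.5 * (sensitivity + specificity)

def screen_pairs(X, y, top_k):
    """Stage 1 (assumed form): rank all SNP pairs and keep the top_k."""
    scores = {(i, j): mdr_pair_score(X[:, i], X[:, j], y)
              for i, j in itertools.combinations(range(X.shape[1]), 2)}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def interaction_design(X, pairs):
    """One group of 9 centered indicator columns per selected SNP pair."""
    blocks, groups, start = [], [], 0
    for i, j in pairs:
        onehot = np.eye(9)[3 * X[:, i] + X[:, j]]
        blocks.append(onehot - onehot.mean(axis=0))
        groups.append(np.arange(start, start + 9))
        start += 9
    return np.hstack(blocks), groups

def group_lasso_logistic(Z, y, groups, lam, n_iter=500):
    """Stage 2 (assumed solver): proximal gradient for logistic group lasso."""
    n, p = Z.shape
    A = np.column_stack([np.ones(n), Z])
    step = 4.0 * n / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant
    b0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        resid = 1.0 / (1.0 + np.exp(-(b0 + Z @ beta))) - y
        b0 -= step * resid.mean()  # intercept is not penalized
        beta = beta - step * (Z.T @ resid) / n
        for g in groups:  # group-wise soft thresholding
            norm = np.linalg.norm(beta[g])
            weight = lam * np.sqrt(len(g))
            beta[g] *= max(0.0, 1.0 - step * weight / norm) if norm > 0 else 0.0
    return b0, beta

# Synthetic data (hypothetical): disease risk depends only on an XOR-type
# interaction between carrier status at SNP 0 and SNP 1.
rng = np.random.default_rng(0)
n, p = 600, 8
X = rng.binomial(2, 0.3, size=(n, p))
carrier_xor = (X[:, 0] > 0) ^ (X[:, 1] > 0)
y = (rng.random(n) < np.where(carrier_xor, 0.9, 0.1)).astype(int)

pairs = screen_pairs(X, y, top_k=3)
Z, groups = interaction_design(X, pairs)
b0, beta = group_lasso_logistic(Z, y, groups, lam=0.02)
for pair, g in zip(pairs, groups):
    print(pair, round(float(np.linalg.norm(beta[g])), 3))
```

Because the penalty acts on whole groups, each selected SNP pair either enters the model with all of its genotype-combination coefficients or is dropped entirely, which is what makes the group lasso a natural fit for categorical interaction features.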