| Graduate student: | 莊鈞幃 Chuang, Chen-Wei |
|---|---|
| Thesis title: | 以支援向量機之二階多項式核函數檢測高維度空間變數之交互作用 High-Dimensional Interaction Detection by Support Vector Machine with Polynomial-2 Kernel |
| Advisor: | 張升懋 Chang, Sheng-Mao |
| Degree: | Master |
| Department: | College of Management, Institute of Data Science |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 (ROC calendar) |
| Language: | English |
| Number of pages: | 50 |
| Keywords (translated from Chinese): | Support vector machine, polynomial-2 kernel, variable selection, least absolute shrinkage and selection operator, interaction effects |
| Keywords (English): | Support vector machine (SVM), Least absolute shrinkage and selection operator (Lasso), Model selection, Interaction effects |
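Variable selection has long been an important topic in high-dimensional data analysis, and it is especially challenging when the dimensionality is high. When the number of variables exceeds the number of samples, traditional regression methods cannot fit all variables at once. The problem becomes harder still when interaction terms must also be screened. In this thesis, we use a support vector machine with a second-order polynomial kernel to select variables and their interaction terms in two stages. In the first stage, the support vector machine with the second-order polynomial kernel is used to compute estimates for the variables and for their interaction terms, and the two sets of estimates are ranked separately; from the ranked results, a number of variables smaller than the sample size is chosen as candidates. In the second stage, the least absolute shrinkage and selection operator (Lasso) is applied to the candidates selected in the first stage to build the final model and complete the selection of important variables. We also apply the method to an acute lymphocytic leukemia gene expression dataset with 128 samples and 12,625 dimensions; after variable selection, classifying the test set data produced no misclassifications. In addition, we compare our method with existing selection methods in a variety of settings and obtain relatively good and stable performance.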
Variable selection is an important issue in high-dimensional data analysis. When the dimensionality is high, identifying important variables is challenging. Traditional regression models fail because the number of covariates can exceed the number of observations. The situation becomes even worse when finding important interactions is the main interest. In this thesis, we propose a two-stage procedure to identify important main effects and two-way multiplicative interaction effects. In the first stage, a support vector machine with a polynomial-2 kernel is applied to select both the main effects and the interaction effects, and the top-ranked effects are chosen as candidate effects. In the second stage, logistic regression with a Lasso penalty is applied to quantify the influential effects among the candidates chosen in the first stage. We demonstrate the usefulness of the proposed method by analyzing a gene expression dataset that consists of 128 subjects, each with 12,625 gene expression variables. After selecting variables and building the final model, we classified the data perfectly on the training set. We also compared the proposed method with some existing methods in various cases and obtained relatively stable performance.
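The first stage can be illustrated with a small sketch. It rests on a standard identity: the polynomial-2 kernel K(x, z) = (1 + <x, z>)^2 expands into an intercept, main effects, squared terms, and two-way interactions, so a fitted decision function can be rewritten as one coefficient per effect. Everything specific here is an assumption made for illustration and not taken from the thesis: the kernel constant of 1, the coordinate-ascent solver, the synthetic data, the value of C, and the helper names `fit_dual_svm`, `effect_coefficients`, and `top_effects`. The second stage (Lasso-penalized logistic regression on the candidates) is not shown.

```python
import random
from itertools import combinations


def poly2_kernel(x, z):
    """Polynomial-2 kernel K(x, z) = (1 + <x, z>)^2 (constant 1 is an assumption)."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2


def fit_dual_svm(X, y, C=1.0, sweeps=50):
    """Soft-margin SVM dual solved by coordinate ascent, with no separate bias
    term: the constant 1 inside the kernel plays that role."""
    n = len(X)
    K = [[poly2_kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            f_i = sum(alpha[j] * y[j] * K[i][j] for j in range(n))
            # Exact maximiser along coordinate i, clipped to the box [0, C].
            alpha[i] = min(C, max(0.0, alpha[i] + (1.0 - y[i] * f_i) / K[i][i]))
    return alpha


def effect_coefficients(X, y, alpha):
    """Expand f(x) = sum_i alpha_i y_i (1 + <x_i, x>)^2 into one coefficient
    per monomial: intercept, x_k, x_k^2, and x_k * x_l for k < l."""
    p = len(X[0])
    w = [a * t for a, t in zip(alpha, y)]  # alpha_i * y_i
    intercept = sum(w)
    main = {k: 2.0 * sum(wi * xi[k] for wi, xi in zip(w, X)) for k in range(p)}
    square = {k: sum(wi * xi[k] ** 2 for wi, xi in zip(w, X)) for k in range(p)}
    inter = {(k, l): 2.0 * sum(wi * xi[k] * xi[l] for wi, xi in zip(w, X))
             for k, l in combinations(range(p), 2)}
    return intercept, main, square, inter


def top_effects(coefs, m):
    """Keys of the m coefficients with the largest absolute value."""
    return sorted(coefs, key=lambda k: abs(coefs[k]), reverse=True)[:m]


# Synthetic data (illustrative only): the label depends on the main effect
# x0 and on the interaction x1 * x2.
rng = random.Random(0)
n, p = 60, 6
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [1 if 2.0 * x[0] + 3.0 * x[1] * x[2] > 0 else -1 for x in X]

alpha = fit_dual_svm(X, y)
intercept, main, square, inter = effect_coefficients(X, y, alpha)

# Main effects and interactions are ranked separately; the total number of
# candidates kept (2 + 3 here) stays below the sample size n.
candidate_main = top_effects(main, 2)
candidate_inter = top_effects(inter, 3)
print("candidate main effects:", candidate_main)
print("candidate interactions:", candidate_inter)
```

Ranking by absolute coefficient is one plausible reading of "top ranked effects". With 12,625 variables there are roughly 80 million pairwise interactions, so a practical implementation would compute these sums with vectorized matrix operations instead of Python loops.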