| Graduate student: | 洪啓豪 Hong, Chi-Hao |
|---|---|
| Thesis title: | 使用線性一致估計於連續性狀基因組關聯研究 Linear Consistent Estimator for Continuous Trait Genome-wide Association Studies |
| Advisor: | 張升懋 Chang, Sheng-Mao |
| Degree: | Master |
| Department: | College of Management - Department of Statistics |
| Year of publication: | 2011 |
| Graduation academic year: | 99 |
| Language: | Chinese |
| Number of pages: | 45 |
| Keywords: | Linear Consistent Estimator, Adaptive Lasso, Local False Discovery Rate, Generalized cross validation |
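This study aims to develop a procedure for identifying genes that may affect a disease. Linear regression is commonly used to explain the relationship between a response variable and explanatory variables. According to the literature, simple linear regression is prone to false positives caused by correlation with other variables, whereas multiple regression is less susceptible to this problem. However, when the number of explanatory variables is large, the limited sample size may make it impossible to obtain a best linear unbiased estimator (BLUE). We therefore seek to obtain the parameter estimates of the explanatory variables through an approximate relationship between simple linear regression and multiple regression. One component of this approximation is the inverse of the correlation matrix among the explanatory variables. When the sample size exceeds the number of explanatory variables, the correlation matrix is an invertible positive definite matrix; when the sample size is smaller than the number of explanatory variables, the matrix is singular, which limits the use of this transformation. Besides the sample size, another common problem in multiple regression is variable selection: although many criteria have been developed to judge whether a selection of variables is reasonable, their results can change substantially with variations in the data.
The overall procedure consists of two parts. The first part proposes a linear consistent estimator to resolve the non-invertibility of the correlation matrix: Adaptive Lasso is used to estimate a sparse correlation matrix among genes that is also invertible. The second part uses the estimated multiple regression coefficients to find, under a specified Local False Discovery Rate, the genes that may affect the disease. Two unknown parameters arise in the process, the tuning parameter λ of the Adaptive Lasso and the threshold q of the Local False Discovery Rate, and Generalized cross validation (GCV) is used to choose their most suitable values. The procedure is evaluated by simulation, examining the effect of the sample size, the effect of the R-square of the multiple regression, and the effect of the positions of the truly significant variables.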
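As a rough illustration of how GCV can choose between candidate settings of (λ, q), the sketch below scores each candidate with the standard criterion GCV = (RSS / n) / (1 - df / n)^2 and keeps the smallest. The function names, the candidate grid, and the RSS and degrees-of-freedom values are all hypothetical; how the thesis computes the residual sum of squares and the effective degrees of freedom for its estimator is not stated in the abstract.

```python
def gcv_score(rss, df, n):
    """Generalized cross validation score: (RSS / n) / (1 - df / n) ** 2."""
    if df >= n:
        raise ValueError("effective degrees of freedom must be smaller than n")
    return (rss / n) / (1.0 - df / n) ** 2


def pick_by_gcv(candidates, n):
    """Return the setting with the smallest GCV score.

    candidates: list of (setting, rss, df) tuples, one per fitted model.
    """
    return min(candidates, key=lambda c: gcv_score(c[1], c[2], n))[0]


# Hypothetical fits: ((lambda, q), residual sum of squares, degrees of freedom).
n = 100
candidates = [
    ((0.01, 0.05), 40.0, 30),  # weak penalty: small RSS, many selected genes
    ((0.10, 0.05), 45.0, 12),
    ((0.50, 0.05), 70.0, 3),   # strong penalty: large RSS, few selected genes
]
best_setting = pick_by_gcv(candidates, n)
```

The criterion trades goodness of fit against model size: the first candidate fits best but pays for its 30 degrees of freedom, so the middle candidate wins in this made-up grid.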
In this thesis, a novel procedure is proposed to identify disease-causing genes. Simple linear regression is widely used to examine the relationship between the dependent variable and each independent variable. It is a good way to find correlation but not causality when the underlying (linear) model consists of several independent variables; multiple regression can avoid this problem. We use the population-level relationship between the coefficients of the simple linear regressions and the coefficients of the corresponding multiple regression to estimate the parameters by matching moments. The inverse of the independent variables' sample correlation matrix plays the key role in this moment estimator. A problem arises when the sample size is smaller than the number of independent variables: the sample correlation matrix is then no longer invertible. Another technical problem is variable selection. Although many variable selection schemes have been developed from various points of view, variable selection is treated as a multiple testing problem in this work.
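The moment relationship can be made concrete for standardized predictors: if R is their correlation matrix and β the multiple regression coefficients, the simple-regression slopes satisfy b = Rβ, so β = R⁻¹b whenever R is invertible. The sketch below is a minimal pure-Python illustration of that identity and of the singular case; the function names are illustrative, and the adaptive Lasso correlation estimate that the thesis substitutes for R is not shown.

```python
import math


def correlation_matrix(columns):
    """Sample correlation matrix of a list of equal-length data columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        z.append([(v - mean) / sd for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def solve(R, b, tol=1e-10):
    """Solve R beta = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    A = [list(row) + [bi] for row, bi in zip(R, b)]  # augmented matrix
    for k in range(p):
        piv = max(range(k, p), key=lambda i: abs(A[i][k]))
        if abs(A[piv][k]) < tol:
            raise ValueError("correlation matrix is singular")
        A[k], A[piv] = A[piv], A[k]
        for i in range(k + 1, p):
            f = A[i][k] / A[k][k]
            for j in range(k, p + 1):
                A[i][j] -= f * A[k][j]
    beta = [0.0] * p
    for k in reversed(range(p)):
        s = sum(A[k][j] * beta[j] for j in range(k + 1, p))
        beta[k] = (A[k][p] - s) / A[k][k]
    return beta


# Population-level example: only the first predictor has an effect,
# but the two predictors are correlated at 0.8.
R = [[1.0, 0.8], [0.8, 1.0]]
beta_true = [1.0, 0.0]
b_simple = [sum(r * bt for r, bt in zip(row, beta_true)) for row in R]
beta_hat = solve(R, b_simple)  # recovers beta_true from the simple slopes

# Fewer observations (2) than predictors (3): the sample correlation
# matrix has rank 1, so the inverse does not exist.
R_small = correlation_matrix([[0.0, 1.0], [1.0, 3.0], [5.0, 2.0]])
try:
    solve(R_small, [0.0, 0.0, 0.0])
    singular_detected = False
except ValueError:
    singular_detected = True
```

In the first example the simple slope for the second predictor is 0.8 even though its true coefficient is zero, which is the kind of false positive that the multiple regression coefficients remove.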
The proposed procedure consists of two parts. First, a linear consistent estimator of the regression coefficients is provided: the singular sample correlation matrix among thousands of genes is replaced by the adaptive Lasso correlation estimate, which is sparse and nonsingular. Second, under a pre-specified local false discovery rate, the disease-causing genes are identified via multiple regression. Generalized cross validation is applied to choose two unknown quantities: the tuning parameter of the adaptive Lasso, λ, and the threshold of the local false discovery rate, q. Finally, the proposed procedure is examined by simulation. Factors under consideration include the sample size, the noise level of the regression as measured by the coefficient of determination, and the location of the affecting genes.
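The selection step can be sketched with the two-groups form of the local false discovery rate, fdr(z) = π0 f0(z) / f(z), where f0 is the null density and f the mixture density of the test statistics; genes with fdr(z) ≤ q are selected. The example below assumes a standard normal null and a known normal alternative purely for illustration. The mixture parameters, function names, and z-scores are hypothetical, and in practice f and π0 would have to be estimated from the data.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, pi0, alt_mean, alt_sd):
    """Local fdr under a two-groups model: pi0 * f0(z) / f(z)."""
    f0 = normal_pdf(z)                           # null density, N(0, 1)
    f1 = normal_pdf(z, alt_mean, alt_sd)         # assumed alternative density
    f = pi0 * f0 + (1.0 - pi0) * f1              # mixture density
    return pi0 * f0 / f


def select_genes(z_scores, q, pi0=0.95, alt_mean=3.0, alt_sd=1.0):
    """Indices of genes whose local fdr does not exceed the threshold q."""
    return [j for j, z in enumerate(z_scores)
            if local_fdr(z, pi0, alt_mean, alt_sd) <= q]


# Hypothetical standardized coefficient estimates for six genes.
z_scores = [0.3, -1.1, 4.2, 0.8, 3.6, -0.2]
selected = select_genes(z_scores, q=0.2)
```

With these illustrative settings only the genes at indices 2 and 4 fall below the threshold. The assumed alternative is one-sided (mean 3), so large negative z-scores would not be selected in this sketch.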
On-campus access: available from 2014-07-18