| Graduate student: | 吳哲維 Wu, Che-Wei |
|---|---|
| Thesis title: | 次世代基因定序資料遺失值插補與建立模型之研究 The Study on Missing Value Imputation for Modeling the Data of Next Generation Sequence |
| Advisor: | 馬瀰嘉 Ma, Mi-Chia |
| Degree: | Master |
| Department: | College of Management - Department of Statistics |
| Year of publication: | 2015 |
| Academic year of graduation: | 103 |
| Language: | Chinese |
| Number of pages: | 105 |
| Chinese keywords: | 遺失值 (missing value), 基因序列比對 (gene sequence alignment), 廣義估計方程式 (generalized estimating equation) |
| English keywords: | missing value, generalized estimating equation, gene sequence alignment |
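As technology advances, DNA sequencing techniques and sequencing platforms are continually renewed, and every so often a new type of platform becomes available to sequencing staff. When funding is sufficient, staff purchase the new platform, but the old platform is not necessarily retired right away. Each sequencing dataset costs tens of thousands to hundreds of thousands of yuan (元), so there is a need to retain as much data as possible. As a result, data analysts must simultaneously analyze read count data of aligned gene sequence fragments that come from different platforms, and the platform effect can easily influence the analysis results. In addition, gene chips are prone to missing values caused by insufficient instrument resolution, image corruption, and similar problems, so existing statistical methods cannot be applied directly.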
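The data in this thesis were provided by Professor H. Sunny Sun (孫孝芳) of the Institute of Molecular Medicine and the Center for Genomic Medicine, National Cheng Kung University. They are read counts of aligned gene sequence fragments from colorectal cancer patients. The specimens come from the normal and tumor cells of 12 colorectal cancer patients and were sequenced on one or two different platforms to obtain the read counts. Because some specimens have sequencing data from only one platform, the dataset contains entire columns of missing values, and the existing weighted K-nearest neighbor imputation method cannot be used. This thesis therefore proposes a regression imputation method and a modified weighted K-nearest neighbor method to handle this situation, and uses generalized estimating equations (GEE) to build a model for the real data.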
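The following is a minimal sketch of one way a regression imputation could fill a column that is missing entirely. It is not the thesis's actual procedure, which the abstract does not spell out. Neighbor-based imputation for a gene typically borrows values observed in the same column from similar genes, and no such values exist when the whole column is empty, so the sketch instead borrows information across platforms. The function names, the log(1 + count) transform, and the simple linear form are all assumptions made for illustration.

```python
import math


def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def impute_missing_platform(paired, observed_a):
    """Impute a fully missing platform-B column (hypothetical helper).

    paired:     (count_A, count_B) pairs taken from specimens that were
                sequenced on both platforms.
    observed_a: platform-A counts of the specimen whose platform-B
                column is entirely missing.
    """
    # Assumption: fit the regression on the log(1 + count) scale.
    x = [math.log1p(a) for a, _ in paired]
    y = [math.log1p(b) for _, b in paired]
    intercept, slope = fit_simple_regression(x, y)
    # Predict on the log scale, then transform back to the count scale.
    return [math.expm1(intercept + slope * math.log1p(c)) for c in observed_a]


# Toy data, invented purely for illustration.
paired_counts = [(12, 20), (150, 240), (900, 1500), (40, 70)]
imputed = impute_missing_platform(paired_counts, [30, 500])
```

The fitted line is pooled over all paired observations, so the only information used about the specimen with the missing column is its platform-A counts.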
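Besides examining how well the parameters of the generalized estimating equation model are estimated under different imputation methods, we also want to know whether, when the data contain a large number of missing values, the parameter estimates of a GEE model remain robust to the choice of working correlation matrix. Statistical simulation is therefore used to compare the imputation methods in terms of model parameter estimation and, for a fixed imputation method, to compare the differences in parameter estimates that result from choosing different working correlation matrices.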
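As a rough illustration of what "choosing a working correlation matrix" means, the sketch below builds three structures that are commonly used with GEE: independence, exchangeable, and first-order autoregressive. The abstract does not state which structures the thesis compares, so these three, the function names, and the value of `alpha` are assumptions.

```python
def independence(n):
    """Working correlation assuming no within-cluster correlation."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def exchangeable(n, alpha):
    """Every pair of observations in a cluster shares correlation alpha."""
    return [[1.0 if i == j else alpha for j in range(n)] for i in range(n)]


def ar1(n, alpha):
    """Correlation decays as alpha ** |i - j| with distance in the cluster."""
    return [[alpha ** abs(i - j) for j in range(n)] for i in range(n)]


# Example: a cluster of 3 correlated measurements with alpha = 0.5.
for name, matrix in [
    ("independence", independence(3)),
    ("exchangeable", exchangeable(3, 0.5)),
    ("AR(1)", ar1(3, 0.5)),
]:
    print(name)
    for row in matrix:
        print("  ", row)
```

In the GEE framework of Liang and Zeger (1986), the regression estimates remain consistent even when the working correlation is misspecified, provided the mean model is correct. That result is the background to asking whether the estimates stay robust once a large share of the data has been imputed.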
As technology advances, DNA sequencing techniques and platforms are continually updated, and new platforms regularly succeed the ones that sequencing staff already use. Staff usually purchase the new platform, but the old platform is not necessarily retired right away. In this situation, data analysts must analyze read count data of gene sequences from different platforms at the same time, and the platform effect is likely to affect the analysis results. In addition, gene chips may produce missing values because of insufficient machine resolution, image corruption, and other causes, so that existing statistical methods cannot be applied.
In this study, the gene alignment read count data of colorectal cancer patients were provided by Professor H. Sunny Sun of the Institute of Molecular Medicine, College of Medicine, and the Center for Genomic Medicine, National Cheng Kung University. The data were obtained from the normal and tumor cells of 12 patients with colorectal cancer, sequenced on two different platforms. Because the data contain missing values, this thesis proposes imputation methods, fits a generalized estimating equation model, and uses statistical simulation to compare the behavior of the imputation methods under several different parameter settings.
1. Klambauer, G., Schwarzbauer, K., Mayr, A., Clevert, D., Mitterecker, A., Bodenhofer, U. and Hochreiter, S. (2012). cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Research 40(9), 1-14.
2. Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.
3. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5, 621-628.
4. Nguyen, D. V., Wang, N. and Carroll, R. J. (2004). Evaluation of missing value estimation for microarray data. Journal of Data Science 2, 347-370.
5. Rubin, D. (1976). Inference and missing data. Biometrika 63, 581-592.
6. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520-525.
7. Toedling, J., Servant, N., Ciaudo, C., Farinelli, L., Voinnet, O., Heard, E. and Barillot, E. (2012). Deep-sequencing protocols influence the results obtained in small-RNA sequencing. PLoS ONE 7(2), 1-11.
8. Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61, 439-447.