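Graduate student: 曾俊樺
Zeng, Jun-Hua
Thesis title: 資料探勘技術於全基因組關聯研究:以漢族雙極症疾患為例
Data Mining for Genome-Wide Association Studies: An Empirical Study of Han Chinese Patients with Bipolar Disorder
Advisor: 李家岩
Lee, Chia-Yen
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Manufacturing Information and Systems
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar, 2015-2016)
Language: Chinese
Number of pages: 126
Keywords (translated from Chinese): data mining, bipolar disorder, genome-wide association studies, gene identification, systems biology
Keywords (English): data mining, bipolar disorder, GWAS, gene identification, systems biology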
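  • Single nucleotide polymorphisms (SNPs) are widely and densely distributed across the human genome, with more than 3 million loci. Complex diseases such as bipolar disorder (BP) usually arise from multiple genetic variants acting together with environmental factors, so their causes are complicated. Studying the association between SNPs and disease therefore gives researchers one direction for finding disease-causing genes; genome-wide association studies (GWAS), for example, use SNPs in the human genome to identify susceptibility genes associated with a disease. Bipolar disorder is a severely disabling psychiatric illness. The clinical features of its subtypes bipolar I disorder (BP-I) and bipolar II disorder (BP-II), and the research on them, remain much debated, so in clinical practice bipolar disorder carries a certain rate of misdiagnosis and inappropriate treatment, and a diagnosis cannot rest on a single method. With major advances in science and technology, SNPs can also be used to build a diagnostic technique that serves as an objective complement to the existing diagnostic approach. This study applies data mining methods to genome-wide data in the hope of finding a subset of predictive SNPs that can distinguish the two subtypes, and uses receiver operating characteristic (ROC) curve analysis to evaluate the diagnostic power of different SNP combinations, as a basic reference for future clinical diagnosis and to improve the quality of care. It also uses Gene Ontology (GO) and pathway analysis to examine functional genes and identify possible pathogenic mechanisms, which increases the credibility of the results. The analysis results show that the proposed procedure can successfully identify pathogenic genes for bipolar disorder and reduces the time and other costs of searching for susceptibility genes; the identified SNPs may also include previously unknown discriminating genes.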

    Genome-wide association studies (GWAS) use single nucleotide polymorphisms (SNPs) to identify susceptibility genes associated with diseases. Bipolar disorder (BP) is a recurrent and chronic psychiatric illness. Clinically, BP-I and BP-II are the most severe subtypes. Because their pathological symptoms are unclear, it is difficult to provide an appropriate diagnosis and treatment promptly for clinical patients. In this study, we apply data mining techniques to whole-genome data from BP patients (including BP-I and BP-II) to identify genes that can distinguish the two subtypes, with the aim of supporting better treatment. A clinical dataset of Han Chinese patients was used to validate the proposed data mining model. The results show that the approach identifies BASP1 (brain abundant membrane-attached signal protein 1), a finding supported by previous studies. Moreover, the other SNPs we identified may be used to distinguish the two subtypes of BP by building a statistical classifier. These results need further investigation through functional validation in humans, however, because complex diseases may be caused by gene-gene or gene-environment interactions. Consequently, the proposed model can address the time-consuming nature of gene identification and can also quantify the effect of gene-gene interactions.
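
    The abstract describes using ROC curve analysis to compare how well different SNP combinations separate the two subtypes. The following is a minimal sketch of that kind of evaluation, not the thesis's own code: the label coding (1 for BP-I, 0 for BP-II), the toy scores, and the function names `auc` and `youden_cutoff` are all assumptions made for illustration. It computes the area under the ROC curve as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and picks a cut-off by maximizing the Youden index (TPR minus FPR).

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (FPR, TPR, threshold) for every distinct score, highest threshold first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score is at least the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos, thr))
    return points


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Threshold that maximizes the Youden index J = TPR - FPR."""
    fpr, tpr, thr = max(roc_points(labels, scores), key=lambda p: p[1] - p[0])
    return thr


# Toy data (hypothetical): 1 = BP-I, 0 = BP-II; scores are classifier outputs.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.3, 0.2]

print("AUC:", auc(labels, scores))
print("Youden cut-off:", youden_cutoff(labels, scores))
```

    In practice the scores would come from a classifier fitted on the selected SNPs, and the AUC would be compared across candidate SNP subsets on held-out data.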

    Abstract (Chinese)
    Extended Abstract
    Acknowledgements
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives
      1.3 Research Process and Thesis Structure
    Chapter 2 Literature Review
      2.1 Genetic Epidemiology Studies of Bipolar Disorder
      2.2 Strategies for Identifying Susceptibility Genes
        2.2.1 Genome-Wide Association Studies (GWAS)
        2.2.2 Functional Cloning Approach
        2.2.3 Candidate Gene Approach
        2.2.4 Linkage Analysis
      2.3 The PLINK Software
      2.4 SNP Quality Control (SNP QC)
      2.5 Linkage Disequilibrium (LD)
      2.6 Review of Data Mining Strategies
        2.6.1 Cross Industry Standard Process for Data Mining (CRISP-DM)
        2.6.2 Least Absolute Shrinkage and Selection Operator (LASSO)
        2.6.3 AUC-RF (the Area Under the ROC Curve with RF)
        2.6.4 AdaBoost
      2.7 Summary
    Chapter 3 Research Methods
      3.1 Data Preprocessing
      3.2 Univariate Statistical Tests
      3.3 Variable Selection by Data Mining
        3.3.1 Imputation of Missing Genotype Data
        3.3.2 Variable Selection with AUC-RF, AdaBoost, and LASSO-LR
        3.3.3 Cross-Validation
        3.3.4 Voting
        3.3.5 Multicollinearity
          3.3.5.1 Multicollinearity Diagnosis
          3.3.5.2 Methods for Resolving Multicollinearity
      3.4 Evaluation and Comparison
        3.4.1 ROC Curve Analysis
          3.4.1.1 ROC Curve Performance Measures
          3.4.1.2 Constructing the ROC Curve
          3.4.1.3 ROC Space
          3.4.1.4 ROC Curve Cut-off Value
        3.4.2 Gene Ontology (GO)
        3.4.3 Pathway Analysis
    Chapter 4 Case Study: Candidate Susceptibility Loci
      4.1 Data Preprocessing
        4.1.1 Overview of the Raw Data: tfam and tped Files
        4.1.2 Data Merging, Conversion, and Encoding
        4.1.3 SNP Quality Control
      4.2 Univariate Statistical Tests
      4.3 Selection of Important SNPs
        4.3.1 PLINK-LR
        4.3.2 Variable Selection by Data Mining
          4.3.2.1 Imputation of Missing Genotype Data
          4.3.2.2 AUC-RF
          4.3.2.3 AdaBoost
          4.3.2.4 LASSO-LR
        4.3.3 Voting
        4.3.4 Multicollinearity Diagnosis
      4.4 Summary
        4.4.1 Manhattan Plot
        4.4.2 Discussion of the Literature
    Chapter 5 Case Study: Evaluation and Comparison
      5.1 Mapping SNPs to Genes
        5.1.1 Literature Review: Mapping SNPs to Genes
        5.1.2 Sensitivity Analysis of the Mapping Distance Threshold
      5.2 ROC Curve Analysis
        5.2.1 Results on the Training Data
        5.2.2 Results on the Test Data
        5.2.3 Summary
      5.3 Gene Ontology and Pathway Analysis
        5.3.1 Description of the Analysis Tools
        5.3.2 Description of the Analysis Data
        5.3.3 Parameter Settings
        5.3.4 Analysis Results
          5.3.4.1 Gene Set 1 (±20 kb)
          5.3.4.2 Gene Set 2 (±200 kb)
          5.3.4.3 Gene Set 3 (±500 kb)
        5.3.5 Summary of the Analysis Results
    Chapter 6 Conclusions, Future Research, and Recommendations
      6.1 Contributions
      6.2 Conclusions
      6.3 Future Research and Recommendations
    References
    Autobiography
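
    The contents list pairs three variable selectors (AUC-RF, AdaBoost, and LASSO-LR, Section 3.3.2) with a voting step (Section 3.3.4). This record does not state the thesis's actual voting rule, so the sketch below assumes a simple one: keep a SNP if at least `min_votes` of the selectors chose it. The function name `vote`, the default of two votes, and the SNP identifiers are hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, List


def vote(selections: Dict[str, List[str]], min_votes: int = 2) -> List[str]:
    """Return the SNPs chosen by at least `min_votes` selectors, sorted by name."""
    # set() guards against a selector listing the same SNP twice.
    counts = Counter(snp for chosen in selections.values() for snp in set(chosen))
    return sorted(snp for snp, n in counts.items() if n >= min_votes)


# Hypothetical selector outputs; the rs identifiers are made up for illustration.
selections = {
    "AUC-RF": ["rs0001", "rs0002", "rs0003"],
    "AdaBoost": ["rs0002", "rs0003", "rs0004"],
    "LASSO-LR": ["rs0003", "rs0005"],
}

consensus = vote(selections)
print(consensus)
```

    Raising `min_votes` to the number of selectors keeps only the SNPs on which all methods agree, which trades sensitivity for a shorter candidate list.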

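    Full-text availability: on campus, open from 2021-06-20
    Off campus: not open
    The electronic thesis has not been authorized for public release; for the print copy, consult the library catalog.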