
Graduate student: Huang, Wei-Xuan (黃唯軒)
Thesis title: Construction of a non-communicable disease risk prediction model using data mining methods (基於資料探勘技術之非傳染性疾病風險預測模型建立)
Advisor: Wang, Jeen-Shing (王振興)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2019
Graduation academic year: 107 (ROC calendar)
Language: Chinese
Number of pages: 68
Keywords (Chinese): 資料探勘, 非傳染性疾病, 非酒精性脂肪肝病, 疾病風險預測, 影響因素分析
Keywords (English): Data mining, Non-communicable disease, Non-alcoholic fatty liver, Disease risk prediction, Influencing factor analysis
    This thesis constructs risk prediction and analysis models for non-communicable diseases from two data sources: a physical and mental health self-assessment questionnaire, and clinically collected physiological data. The questionnaire data cover physiological symptoms and lifestyle habits, while the clinical data include subject demographics, biochemical markers, and X-ray and other examination report results. Two types of prediction models were developed. The first uses only the questionnaire responses filled out by the subjects to assess their risk of non-communicable diseases. The second predicts that risk from the clinical data. A total of 2,361 subjects' laboratory data and 2,270 questionnaire records were collected retrospectively at the Health Management Center of National Cheng Kung University Hospital. After records with missing data were removed, the Boruta feature selection algorithm was applied to each of the two data sets to select the input features that best discriminate each disease. With the selected features, five prediction models were trained and compared: decision trees, random forests, support vector machines (SVM), backpropagation neural networks (BPNN), and light gradient boosting machines (LightGBM).

    For the questionnaire-based models, LightGBM gave the best results in predicting non-alcoholic fatty liver disease (NAFLD), with an average accuracy, sensitivity, specificity, and area under the curve (AUC) of 73.3%, 73.52%, 72.86%, and 0.7319, respectively, with random validation. The analysis also found that sleep conditions and coffee drinking affect the risk of NAFLD. LightGBM also outperformed the other models in predicting hypertension, hyperglycemia, and hyperlipidemia, with average AUCs of 0.7384, 0.7137, and 0.6181, respectively. Its performance on hyperlipidemia was clearly weaker. After discussion with medical experts, the cause was attributed to the questionnaire content, which does not adequately capture the symptoms of hyperlipidemia. Improving this prediction will require adding targeted questions, such as a hyperlipidemia-specific questionnaire.

    For the clinical-data models, LightGBM again gave the best results in predicting NAFLD, with an average accuracy, sensitivity, specificity, and AUC of 80.9%, 81.25%, 80.3%, and 0.8077, respectively, with random validation. In summary, LightGBM achieved the best results, reaching an average accuracy of 80.9% for NAFLD prediction while also identifying valuable influencing factors for the disease. These results support the feasibility of the proposed approach. It is hoped that the model can become a convenient and fast tool for recommending which health examination items to select and for analyzing daily lifestyle habits, so that people can seek examination or prevention for the diseases for which their risk is higher.
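The abstract reports accuracy, sensitivity, specificity, and AUC but this record does not show how they were computed. The following is a minimal sketch of the standard definitions of those four metrics, assuming binary labels (1 = disease present) and a 0.5 decision threshold. The function names, the threshold, and the toy data are illustrative assumptions and are not taken from the thesis.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = disease present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all subjects classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of diseased subjects flagged as high risk (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of healthy subjects flagged as low risk (true negative rate)."""
    return tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not from the thesis).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("accuracy   :", accuracy(tp, fp, tn, fn))
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("AUC        :", auc(y_true, scores))
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the abstract can compare diseases by AUC alone.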

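    Chinese Abstract
    English Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Research Motivation and Background
        1.2 Literature Review
            1.2.1 Overview of Non-Communicable Diseases
            1.2.2 Current State of Big Data Analytics Research on Non-Communicable Diseases
        1.3 Research Objectives
        1.4 Thesis Organization
    Chapter 2 Experimental Setup, Organization of Collected Data, and Introduction to Data Mining
        2.1 Experimental Setup
        2.2 Organization of Collected Data
            2.2.1 Clinical Data
            2.2.2 Questionnaire Data
        2.3 Data Mining System
            2.3.1 Disease Risk Prediction
            2.3.2 Influencing Factor Analysis
    Chapter 3 Disease Prediction Model Based on Data Mining
        3.1 Data Preprocessing
            3.1.1 Data Selection and Missing Data Handling
            3.1.2 Unstructured Data Conversion
        3.2 Feature Normalization
        3.3 Feature Selection
        3.4 Classifiers
            3.4.1 Backpropagation Neural Network (BPNN)
                A. Genetic Algorithm
                B. Method for Optimizing the BPNN Architecture with the Genetic Algorithm
            3.4.2 Light Gradient Boosting Machine (LightGBM)
        3.5 Parameter Optimization
        3.6 Validation Methods
    Chapter 4 Experimental Results and Discussion
        4.1 Evaluation Metrics
        4.2 Experimental Results and Discussion
            4.2.1 Validation Method Results
            4.2.2 Feature Selection Results
                A. Algorithmic Feature Selection Results
                B. Comparison of Algorithmic and Expert Feature Selection Results
            4.2.3 Architecture Optimization Results
            4.2.4 Comparison of Prediction Results of the BPNN, CNN, and Logistic Regression Models
            4.2.5 Validation Results with Additional Data
                A. Results with an Increased Population Sample Size
                B. Model Validation Results
        4.3 Influencing Factor Analysis Results
    Chapter 5
        5.1 Conclusions
        5.2 Future Work
    References
    Appendix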

    Download availability (on campus): public from 2024-08-24
    Download availability (off campus): public from 2024-08-24