| Graduate student: | 劉姿蘭 Liu, Tzu-Lan |
|---|---|
| Thesis title: | 應用文字探勘技術於疾病分類自動編碼之研究 Using text mining for discharge summary classification |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Department of Industrial and Information Management (on the job class) |
| Year of publication: | 2009 |
| Academic year of graduation: | 97 (ROC calendar) |
| Language: | Chinese |
| Pages: | 56 |
| Keywords: | TF-IDF, Naïve Bayes, SVM, VSM, disease classification, text mining |
Taiwan entered the era of National Health Insurance in March 1995. Under this system, every medical institution must code each patient's diagnoses into International Classification of Diseases (ICD) codes before filing reimbursement claims with the Bureau of National Health Insurance. Medical centers, regional hospitals, and district hospitals in Taiwan all employ professional disease classification coders. Under the current manual process, coders must read through each patient's medical records one by one to assign the appropriate codes. This is time-consuming and labor-intensive, and heavy caseloads or gaps in expertise or experience can lead to coding errors or omissions.
This study therefore investigates automatic disease classification coding using text mining techniques. Discharge summaries from six major National Health Insurance departments serve as the data. Three classification methods are used: the Naïve Bayes classifier, the support vector machine (SVM), and the vector space model (VSM). Two further variables are introduced: term weighting by TF or TF-IDF, and synonym substitution based on UMLS. The study aims to determine whether automatic coding is feasible, whether the best classification method differs across departments, and whether term weighting and synonym substitution improve prediction accuracy.
Overall, the experiments favor SVM without term weighting. Across the six departments, prediction accuracy ranged from 73.74% to 90.09%, with an average of 79.37%. When the best method is chosen for each department separately, only one department favors VSM combined with TF-IDF and a threshold of 0.1; the other five still favor SVM without term weighting. The classification methods therefore do not differ much across departments. UMLS synonym substitution did not substantially improve prediction accuracy in this study, and of the two weighting schemes, TF-IDF performed better than TF. On the whole, we recommend SVM as the method for automatic disease classification coding.
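As a rough illustration of the VSM with TF-IDF and a threshold of 0.1 mentioned in the abstract, the following Python sketch weights terms by TF-IDF, builds one vector per code from the training summaries, and returns every code whose cosine similarity to a new summary reaches the threshold. The abstract does not describe the thesis implementation, so several points here are assumptions: that the threshold applies to cosine similarity, that each code is represented by the sum of its training vectors, the whitespace tokenizer, and all function names. The training sentences are invented toy data, not taken from the study.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for the
    # thesis preprocessing, which the abstract does not describe.
    return text.lower().split()


def build_idf(docs):
    # idf(t) = log(N / df(t)), where df(t) is the number of documents containing t.
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Raw term frequency times idf; terms unseen in training are dropped.
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labeled_docs):
    # Assumption: each code is represented by the sum of its TF-IDF vectors.
    # Cosine similarity ignores vector length, so sum and mean are equivalent.
    docs = [tokenize(text) for text, _ in labeled_docs]
    idf = build_idf(docs)
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, (_, label) in zip(docs, labeled_docs):
        for term, weight in tfidf_vector(tokens, idf).items():
            sums[label][term] += weight
    return idf, {label: dict(vec) for label, vec in sums.items()}


def predict(text, idf, centroids, threshold=0.1):
    # Assumption: the 0.1 threshold is a minimum cosine similarity.
    query = tfidf_vector(tokenize(text), idf)
    scores = {label: cosine(query, vec) for label, vec in centroids.items()}
    return sorted(label for label, score in scores.items() if score >= threshold)


# Invented toy summaries with ICD-9-CM style labels, for illustration only.
TRAIN = [
    ("fever cough pneumonia infiltrate chest", "486"),
    ("cough sputum pneumonia antibiotics", "486"),
    ("diabetes insulin glucose hyperglycemia", "250.00"),
    ("diabetes glucose metformin diet", "250.00"),
]

idf, centroids = train_centroids(TRAIN)
print(predict("cough and pneumonia", idf, centroids))
print(predict("fracture femur", idf, centroids))
```

On this toy data the first query shares terms only with the summaries labeled 486, so only that code passes the threshold. The second query contains no training vocabulary and returns an empty list. Returning every code above the threshold, rather than a single best match, is a design choice made here because a discharge summary can carry more than one diagnosis code.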