| Graduate student: | 李昕純 Lee, Hsin-Chun |
|---|---|
| Thesis title: | 基於強化條件隨機場域之生醫文獻疾病命名實體辨識與正規化系統 An enhanced CRF-based Method for Disease Named Entity Recognition and Normalization in Biomedical Literature |
| Advisor: | 高宏宇 Kao, Hung-Yu |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Medical Informatics |
| Year of publication: | 2016 |
| Academic year of graduation: | 104 |
| Language: | English |
| Number of pages: | 42 |
| Chinese keywords: | 疾病命名實體辨識與正規化、條件隨機場域、生醫文獻文字探勘 |
| English keywords: | Disease named entity recognition and normalization, Conditional random fields, Biomedical text mining |
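Diseases play an important role in biomedical research. Researchers frequently use the biomedical literature search engine PubMed to find relevant articles. Behind it, biocurators annotate named entities, which speeds up literature search and helps biological databases establish associations and interactions between entities. However, manual annotation cannot keep pace with the explosive growth of the biomedical literature over the years, and earlier annotation-assistance systems have mostly targeted genes and drugs, with very little work on diseases. Extracting disease named entities from large volumes of biomedical literature, then normalizing and classifying them, so that biocurators have an annotation-assistance system that lets them work quickly and effectively, is therefore an important problem in biomedical literature mining. In our research, we developed a system, AuDis, to address the disease named entity annotation problem. It builds a disease name recognition module with a conditional random field probabilistic model and then refines the recognition results through post-processing, which includes improving the recognition inconsistencies caused by insufficient training on scarce disease corpora, correcting abbreviation recognition, reorganizing named entities, and filtering stopwords. For normalization, we collected a large number of medical dictionaries, extended them, and use dictionary lookup to assign each disease named entity the best-matching classification code. In the 2015 BioCreative V CDR Task, this work achieved an F-score of 86.46%, which was not only the best performance among all participating teams but also exceeded the existing strong tool DNorm by 6%. This shows that AuDis is a highly accurate, state-of-the-art disease named entity recognition and normalization system. After the official evaluation, AuDis has since improved to an F-score of 87.26%.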
Diseases play central roles in many areas of biomedical research and healthcare. Consequently, aggregating disease knowledge and treatment research reports has become a critical issue, especially in rapidly growing knowledge bases (e.g., PubMed). A framework for disease named entity recognition and normalization has therefore become increasingly important for biomedical text mining. In this work, we not only define five categories of diversity in disease names but also develop a system, AuDis, for disease mention recognition and normalization in biomedical texts. AuDis uses a second-order conditional random fields (CRFs) model for recognition and refines the results with several customized post-processing steps: abbreviation resolution, consistency improvement, stopword filtering, and adjective reorganization. For normalization, we use a dictionary-lookup approach that relies on collecting and extending stable medical lexicons. In the official evaluation of the CDR task in BioCreative V, AuDis obtained the best performance (an F-score of 86.46%) among 40 runs (16 unique teams) on disease normalization in the DNER subtask. Since the official evaluation, AuDis has reached an F-score of 87.26%. These results suggest that AuDis is a high-performance, state-of-the-art system for disease recognition and normalization in biomedical literature.
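To make the post-processing and normalization steps named in the abstract more concrete, the following is a minimal Python sketch of stopword filtering, abbreviation resolution, and dictionary lookup. It is not the AuDis implementation: the CRF recognizer is left out entirely (mentions are passed in as if already recognized), and the tiny lexicon, the stopword list, the initials-based abbreviation heuristic, and all function names are assumptions made for illustration.

```python
import re

# Illustrative mini-lexicon (assumption): concept ID -> known names.
# A real system would load a full, extended medical lexicon instead.
LEXICON = {
    "D006973": ["hypertension", "high blood pressure"],
    "D003920": ["diabetes mellitus"],
}

# Illustrative stopword list (assumption) for dropping overly generic mentions.
STOPWORDS = {"disease", "syndrome", "disorder"}


def normalize_text(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def build_index(lexicon):
    """Map each normalized lexicon name to its concept ID."""
    index = {}
    for concept_id, names in lexicon.items():
        for name in names:
            index[normalize_text(name)] = concept_id
    return index


def find_abbreviations(text):
    """Collect 'long form (ABBR)' pairs whose word initials spell ABBR."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        words = text[: match.start()].split()[-len(abbr):]
        if len(words) == len(abbr) and all(
            w[0].upper() == c for w, c in zip(words, abbr)
        ):
            pairs[abbr] = " ".join(words)
    return pairs


def normalize_mentions(mentions, text, index):
    """Filter stopword mentions, expand abbreviations, then look up concept IDs."""
    abbreviations = find_abbreviations(text)
    results = []
    for mention in mentions:
        if normalize_text(mention) in STOPWORDS:
            continue  # stopword filtering
        expanded = abbreviations.get(mention, mention)  # abbreviation resolution
        # Dictionary lookup; None means no lexicon entry matched.
        results.append((mention, index.get(normalize_text(expanded))))
    return results
```

Calling `normalize_mentions(["DM", "disease"], "diabetes mellitus (DM) is a disease", build_index(LEXICON))` drops the generic mention and maps `DM` to the concept of its long form. An exact-match index keeps the sketch short; a production system would need fuzzier matching and a far larger lexicon.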
On-campus access: available immediately