| Graduate student: | 楊家融 Yang, Chia-Jung |
|---|---|
| Thesis title: | 生物醫學領域中概念辨識的研究 以基因本體學為例 A Study on Concept Recognition in Biomedical Field Using Gene Ontology as an Example |
| Advisor: | 蔣榮先 Chiang, Jung-Hsien |
| Degree: | Doctor |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 |
| Language: | English |
| Number of pages: | 56 |
| Keywords (Chinese): | 自然語言處理、基因本體學、機器學習 |
| Keywords (English): | natural language processing, gene ontology, machine learning |
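In recent years, natural language processing has run into obstacles in specialized biomedical fields, where language use differs greatly from that of general domains. Highly specialized fields such as gene ontology often lack large training datasets, which makes it hard to apply powerful deep learning techniques to the small datasets available.
The Colorado Richly Annotated Full-Text corpus used in our study contains 67 full-text documents annotated with gene ontology data by biologists. In this study we locate where the difficulty of gene ontology concept recognition lies and use "named concepts" as a knife to split the problem in two, handling the two parts with dictionary lookup and machine learning respectively. In the first step, we use named concepts to restructure the gene ontology concept data; in the second step, we use the restructured gene ontology to perform concept recognition.
Our system improves on the previous state-of-the-art system by about 20% in F1-measure, reaching a precision of 0.804 and a recall of 0.715. We also show that the idea of named concepts is effective and may generalize to other specialized languages.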
In recent years, natural language processing has faced obstacles in mining professional text in biomedical fields, because professional language is used very differently from general language. Highly specific areas of biological research such as gene ontology lack large training datasets, so powerful deep learning techniques are hard to apply to the small datasets that are available.
The Colorado Richly Annotated Full-Text corpus used in this study contains 67 full-text documents annotated by biologists. In this research, we identified the key difficulty of the gene ontology concept recognition task and used named concepts to divide the problem into two parts, handled by dictionary matching and machine learning respectively. In the first step, we reconstructed the gene ontology concepts after mining the named concepts. In the second step, we used this reconstructed data in the proposed hybrid method to perform concept recognition.
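To make the dictionary-matching half of the hybrid method more concrete, the following is a minimal sketch, not the recognizer described in the thesis. It assumes a toy dictionary of two named concepts and a simple greedy longest-match over whitespace tokens; the names `NAMED_CONCEPTS` and `match_concepts` are hypothetical, and the machine-learning half is not shown.

```python
from typing import Dict, List, Tuple

# Toy dictionary mapping concept names to Gene Ontology IDs.
# Illustrative sample only; it is not the dictionary built in the thesis.
NAMED_CONCEPTS: Dict[str, str] = {
    "apoptotic process": "GO:0006915",
    "cell cycle": "GO:0007049",
}


def match_concepts(text: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (phrase, concept ID) pairs found by greedy longest-match lookup."""
    # Crude normalization: lowercase and treat commas and periods as spaces.
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    # Longest dictionary entry, measured in tokens (0 if the dictionary is empty).
    max_len = max((len(name.split()) for name in dictionary), default=0)
    hits: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                hits.append((phrase, dictionary[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return hits


hits = match_concepts(
    "The apoptotic process is linked to the cell cycle.", NAMED_CONCEPTS
)
print(hits)
```

Exact lookup of this kind only finds mentions that appear verbatim in the dictionary; in a hybrid design, a learned model would be expected to handle the remaining, more varied mentions.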
The proposed concept recognizer improved F1-measure by approximately 20% over the previous state-of-the-art system, reaching 0.804 precision and 0.715 recall. These results suggest that the named-concept approach is effective and may be applicable to concept recognition in other professional languages.
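For reference, the reported precision and recall can be combined into a single score. The sketch below assumes the standard balanced F1-measure, i.e. the harmonic mean of precision and recall; the abstract does not state the F1 value itself, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.804, 0.715)
print(round(f1, 3))  # 0.757
```

Under that assumption, the reported figures correspond to an F1-measure of roughly 0.757.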