Author: Yang, Chia-Jung (楊家融)
Thesis title: A Study on Concept Recognition in Biomedical Field Using Gene Ontology as an Example (生物醫學領域中概念辨識的研究 以基因本體學為例)
Advisor: Chiang, Jung-Hsien (蔣榮先)
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: English
Number of pages: 56
Keywords: natural language processing, gene ontology, machine learning
  • In recent years, natural language processing has run into obstacles when mining professional text in biomedical fields, because specialized language is used very differently from general language. Highly specialized areas such as the Gene Ontology (GO) often lack large training datasets, which makes powerful deep learning techniques difficult to apply to the small datasets that are available.
    The Colorado Richly Annotated Full-Text (CRAFT) corpus used in this study contains 67 full-text documents annotated with GO concepts by biologists. In this research, we identified the key difficulty of the GO concept recognition task and used named concepts (NCs) to split the problem into two parts, handled by dictionary matching and machine learning respectively. In the first step, we mined the NCs and used them to reconstruct the GO concepts. In the second step, we used the reconstructed GO to perform concept recognition with the proposed hybrid method.
    The proposed concept recognizer improved the F1-measure by approximately 20% over the previous state-of-the-art system, reaching 0.804 precision and 0.715 recall. The results show that the named-concept approach is effective and may be applicable to concept recognition in other professional languages.
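The abstract reports precision and recall but not the F1 value itself. As a minimal sketch, assuming the standard balanced F1 definition (the harmonic mean of precision and recall), the corresponding score can be worked out as follows. The function name `f1_score` is illustrative and is not taken from the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.804, 0.715)
print(f"{reported_f1:.3f}")  # prints 0.757
```

Under this definition the reported figures correspond to an F1 of roughly 0.757. The abstract does not say whether the "approximately 20%" improvement is relative or absolute, so the baseline score is not derived here.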

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Table of Abbreviations
    Table of Symbols
    Chapter 1. Introduction
      1.1 Motivation
      1.2 Purpose and Specific Aims
      1.3 Terminology
      1.4 Organization of the Dissertation
    Chapter 2. Literature Review
      2.1 Dictionary-Matching Approaches
      2.2 Rule-Based Approaches
      2.3 Hybrid and Other Approaches
    Chapter 3. Organization of GO
      3.1 The structure of GO
      3.2 Mining the NCs from GO
      3.3 Representation of GO Concepts by NCs
        3.3.1 Aggregation of the NCs
        3.3.2 Simplifying the GO statements
      3.4 Summary
    Chapter 4. Gene Ontology Concept Recognition System
      4.1 Introduction of the CRAFT Corpus
      4.2 Dictionary-Matching Component
        4.2.1 Preprocessing: sentence segmentation
        4.2.2 Dictionary matching
      4.3 Machine Learning Component
        4.3.1 Candidate Generation
        4.3.2 Feature Extraction
        4.3.3 Creating the Labels of the Candidates
        4.3.4 The Choices of Machine-Learning Models
      4.4 SN Boosting
      4.5 Evaluation
      4.6 Summary
    Chapter 5. Experimental Results
      5.1 Results of the Representation of GO with NCs
      5.2 Results of the Concept Recognition Systems
      5.3 Analysis of the System Components
      5.4 Evaluation of the Machine Learning Classifiers
    Chapter 6. Discussion
      6.1 Principle Findings
      6.2 Generalization of the Concept Recognition System
      6.3 Limitations
    Chapter 7. Conclusion and Future Studies
    References

    Full-text access: on campus, publicly available immediately; off campus, publicly available immediately.