
Author: Hsiao, Jui-Chen (蕭瑞辰)
Thesis title: Multi-level focus species detection for Gene name Normalization (多階層核心物種預測之基因名稱正規化系統)
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Medical Informatics
Year of publication: 2012
Graduation academic year: 100 (ROC calendar)
Language: English
Pages: 53
Keywords: Gene name normalization, Species assignation, Species disambiguation, Species recognition, Multi-level
    Gene name normalization (GN) is the task of mapping gene mentions in the literature to Entrez Gene identifiers. One challenge in GN, called "species disambiguation", is assigning the correct species identifier to each gene mention. This matters because the same gene name may belong to different species, and genes from different species always have different Entrez Gene identifiers. A second source of ambiguity is species synonymy: different species names may refer to the same taxonomy identifier. In this thesis, we propose a multi-level focus species detection model to solve the species assignation problem. We extract species cue words from the articles in the corpus and use them as our processing entities. The proposed EFAISF model computes a confidence value for each entity with respect to each species, representing how important the entity is to that species. Using these values, we predict the focus species of a gene mention at each of the four context levels in which the mention occurs: noun phrase, sentence, paragraph, and whole article. The final species identifier is then decided by the Relation Guide Factor method, which uses the relations between levels to select the most probable answer for the gene mention. Our method not only performs well but also handles cases in which the sentence, or even the whole article, contains no species name from the species dictionary. On the DECA corpus, our method attains a precision of 88.22% and outperforms other species assignation methods.
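
    The abstract describes two steps: weighting cue words per species, and predicting a focus species at four context levels. This record does not give the EFAISF formula, so the sketch below is only an illustration under stated assumptions. It assumes a TF-IDF-style weight (an augmented cue-word frequency multiplied by an inverse species frequency, guessed from the name "Entities Frequency-Augmented Invert Species Frequency" in the table of contents) and a simple narrowest-level-first fallback. The function names `cue_weights` and `focus_species`, the data layout, and the example taxonomy identifiers are all hypothetical, and the Relation Guide Factor step is not modeled.

```python
import math
from collections import Counter, defaultdict

# Context levels named in the abstract, ordered from narrowest to widest.
LEVELS = ("noun_phrase", "sentence", "paragraph", "article")


def cue_weights(species_cues):
    """Assumed TF-IDF-style weighting; NOT the thesis's actual EFAISF formula.

    species_cues maps a species id to the list of cue words seen with it.
    Returns weights[word][species] = augmented frequency * inverse species frequency.
    """
    counts = {sp: Counter(words) for sp, words in species_cues.items()}
    n_species = len(counts)
    # For each cue word, the number of species it co-occurs with.
    species_freq = Counter(w for c in counts.values() for w in c)
    weights = defaultdict(dict)
    for sp, c in counts.items():
        if not c:
            continue  # a species with no cue words contributes nothing
        max_f = max(c.values())
        for w, f in c.items():
            augmented = 0.5 + 0.5 * f / max_f  # damped within-species frequency
            inverse = math.log(n_species / species_freq[w])  # 0 if shared by all
            weights[w][sp] = augmented * inverse
    return weights


def focus_species(context, weights):
    """Return (species, level) from the narrowest level that has positive evidence.

    context maps a level name to the cue words found at that level around
    one gene mention. Returns (None, None) if no level has evidence.
    """
    for level in LEVELS:
        scores = Counter()
        for w in context.get(level, []):
            for sp, wt in weights.get(w, {}).items():
                scores[sp] += wt
        if scores and max(scores.values()) > 0:
            return max(scores, key=scores.get), level
    return None, None


if __name__ == "__main__":
    # Toy data: NCBI taxonomy ids 9606 (human) and 10090 (mouse).
    w = cue_weights({
        "9606": ["human", "patient", "human", "cell"],
        "10090": ["mouse", "murine", "cell"],
    })
    mention = {"sentence": ["cell"], "paragraph": ["murine", "cell"], "article": ["human"]}
    print(focus_species(mention, w))
```

    In this toy run the sentence contains only "cell", which occurs with every species and therefore carries zero weight, so the prediction falls through to the paragraph level. This mirrors the case the abstract highlights, where a sentence offers no usable species mention.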

    1. INTRODUCTION
        1.1 Background
        1.2 Motivation
        1.3 Our approach
        1.4 Paper structure
    2. RELATED WORK
        2.1 Previous GN work
        2.2 Recent GN work
            2.2.1 Gene Normalization by Species Inference
            2.2.2 Species Assignation
        2.3 SNER
        2.4 GNER
    3. METHOD
        3.1 Establishment of entities dataset
            3.1.1 Noun phrase tagging
            3.1.2 Acquisition cue words from noun phrases
            3.1.3 Entities filtering
        3.2 Entities Frequency-Augmented Invert Species Frequency
            3.2.1 EFAISF - Priority of four levels
            3.2.2 EFAISF - Combination of four levels
        3.3 Relational Guide Factor of the gene mentions pairs
            3.3.1 Deciding the answer from four levels
    4. EXPERIMENT
        4.1 Dataset preprocess
        4.2 Baseline methods
        4.3 Method experiment
            4.3.1 Comparison with Priority of four levels and Combination of four levels
            4.3.2 Experiment - the each level focus species detection
        4.4 Relational Guide Factor and comparison with other works
            4.4.1 Comparison with other methods
            4.4.2 Comparing with Dictionary based Multi-level focus species model
        4.5 Using Multi-level focus species detection on BioCreative III training data
            4.5.1 Dataset preprocessing
            4.5.2 Result
        4.6 Other experiment
            4.6.1 EFAISF without 'AISF'
            4.6.2 Without Filtering
            4.6.3 RGF experiment for observation
    5. CONCLUSION AND FUTURE WORK
    6. REFERENCES

    Full-text availability: on campus from 2013-09-14; off campus from 2017-09-14.