Graduate student: | 黃奕欽 Huang, I-Chin |
---|---|
Thesis title: | 利用推論網路加權與反模糊比對之生物醫學文獻詞彙正規化 Normalizing Biomedical Name Entities by Inference Network Weighting and De-ambiguity Matching |
Advisor: | 高宏宇 Kao, Hung-Yu |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2008 |
Academic year of graduation: | 96 |
Language: | English |
Number of pages: | 45 |
Keywords (Chinese): | Term normalization, de-ambiguity matching, string matching, bioinformatics, text mining |
Keywords (English): | Inference Network, Text mining, Bio-informatics, Name Entity Normalization |
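In recent years, the rapid growth of biomedical literature has left researchers unable to manage and mine the knowledge it contains effectively, so much of that knowledge is buried under the sheer volume of publications. To build an intelligent biomedical knowledge management system, researchers have proposed many techniques for extracting information from the literature over the past few decades. Before any information can be extracted, however, a system must identify the terms in a document and map them to the relevant concepts. Because terms in the literature follow no unified, rigorous writing convention, this mapping runs into many problems that prevent a system from linking terms to concepts by direct matching. Term variation is one example; in addition, the concept behind some terms can be determined only by understanding the whole document. The goal of this study is therefore to identify, automatically and accurately, which terms in a document refer to the relevant concepts that researchers mention.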
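Drawing on information retrieval techniques, this study proposes a weighted term-similarity method based on an inference network, which computes the similarity between terms in biomedical literature and concepts in order to handle term variation, and uses de-ambiguity matching to increase the confidence of the concepts found in a document. Unlike earlier work, this study makes full use of the information in both concepts and terms to improve the system's accuracy and make the identified terms more meaningful. The results show that the concepts obtained by combining inference network weighting and de-ambiguity matching with simple lexical rules yield a significant improvement in term normalization for biomedical literature. In sum, the proposed method is intended to normalize the terms in the literature so that higher-level research built on them can perform better.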
In recent years the volume of biomedical literature has increased dramatically, and experts can no longer manage it or extract knowledge from it efficiently, so much useful information is lost. To construct an intelligent biomedical knowledge management system, researchers have proposed many relation extraction methods over the last several decades. Before those methods can be applied, however, the system has to recognize the named entities in a document and map each entity to its corresponding concept. Because the literature follows no consistent, rigorous naming convention, the mapping process suffers from problems such as term variation and term ambiguity, and direct matching therefore links named entities to the wrong concepts. The purpose of this study is thus to identify, automatically and accurately, the concepts mentioned in the literature.
In this study, an inference network weighting strategy is applied to weight the similarity score between an entity and a concept and thereby handle term variation. The proposed de-ambiguity strategy is used to increase the confidence of the concepts found in a document. Unlike previous studies, this study makes good use of the information in both the entity and the concept to increase the precision of the system and to make the identified entities more meaningful. Experimental results show that the system using the proposed strategies outperforms both simple strategies and previously proposed methods in biomedical entity normalization. More generally, by normalizing named entities, this study aims to support the next steps of text mining research, e.g. protein-protein interaction (PPI) extraction and co-occurrence analysis.
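The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of scoring an entity mention against the names of each concept with token weights. It swaps in an IDF-weighted Jaccard overlap as a stand-in and is not the thesis's inference network. The toy dictionary `CONCEPTS`, the concept ids, the function names, and the `threshold` value are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy dictionary: concept id -> known names (illustrative only).
CONCEPTS = {
    "C1": ["tumor necrosis factor alpha", "TNF alpha"],
    "C2": ["tumor protein p53", "p53"],
    "C3": ["interleukin 2 receptor alpha", "IL2R alpha"],
}


def tokenize(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def token_weights(concepts):
    """Give each token an IDF-style weight: rarer across concepts = heavier."""
    n = len(concepts)
    df = Counter()
    for names in concepts.values():
        df.update({t for name in names for t in tokenize(name)})
    return {t: math.log(1 + n / count) for t, count in df.items()}


def similarity(mention, name, weights):
    """Weighted Jaccard: weight of shared tokens over weight of all tokens."""
    a, b = set(tokenize(mention)), set(tokenize(name))
    default = max(weights.values())  # assumption: unseen tokens count as rare

    def weight(token):
        return weights.get(token, default)

    total = sum(weight(t) for t in a | b)
    return sum(weight(t) for t in a & b) / total if total else 0.0


def normalize(mention, concepts, threshold=0.5):
    """Return (best concept id, score), or (None, score) below the threshold."""
    weights = token_weights(concepts)
    best_id, best = None, 0.0
    for cid, names in concepts.items():
        score = max(similarity(mention, n, weights) for n in names)
        if score > best:
            best_id, best = cid, score
    return (best_id, best) if best >= threshold else (None, best)
```

With this toy dictionary, `normalize("TNF-alpha", CONCEPTS)` selects `C1`, because the hyphenated variant tokenizes to the same tokens as the listed name "TNF alpha", while the shared but common token "alpha" alone is not enough to match `C3`. The sketch covers only the variation-handling step and does not attempt the de-ambiguity matching.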