| Graduate student: | 張智凱 Chang, Chi-kai |
|---|---|
| Thesis title: | G-Norm: 自動化PubMed助理以兩階段基因名稱正規化方法 G-Norm: Automated pop-up PubMed assistant based on two-phase gene normalization approach |
| Advisor: | 蔣榮先 Chiang, Jung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 (ROC calendar; academic year 2006-2007) |
| Language: | Chinese |
| Number of pages: | 44 |
| Keywords (Chinese): | 正規化 (normalization) |
| Keywords (English): | normalization |
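An important problem for information extraction in the biomedical domain is determining the identity behind each gene and protein name, that is, gene and protein name normalization. This differs from gene name recognition, which only locates the gene and protein names in an article and does not determine which entity each name refers to. More importantly, gene and protein names are ambiguous: different genes may share the same name, which degrades information extraction. This thesis proposes a gene name normalization procedure that assigns the correct identity to each gene and protein name in an article, with the aim of accelerating the development of information extraction applications in biology. Because humans have long been the main research focus of biologists, we normalize human gene and protein names. The system builds a dictionary of gene and protein names for identification and uses a classifier to resolve names that map to more than one identity. The system achieves 81% precision and 75% recall.
This thesis also implements an online PubMed assistant tool. When browsing PubMed documents, users can see which genes and proteins appear in an article and what their identities are, and the tool provides information such as gene name annotations to help users read articles quickly.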
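To make the reported precision and recall concrete, the following is a minimal sketch of how such figures are commonly computed for normalization output. It assumes that predictions and gold annotations are sets of (document, gene ID) pairs; the abstract does not state the evaluation protocol, and the function name and toy identifiers are hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two sets of (document, gene_id) pairs.

    Assumption: a prediction counts as correct only if the same
    (document, gene_id) pair appears in the gold standard.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data with made-up identifiers, for illustration only.
gold = {("doc1", "G1"), ("doc1", "G2"), ("doc2", "G3"),
        ("doc2", "G4"), ("doc3", "G5")}
predicted = {("doc1", "G1"), ("doc1", "G2"), ("doc2", "G3"), ("doc2", "G9")}

print(precision_recall(predicted, gold))  # (0.75, 0.6)
```

In this toy case three of the four predicted pairs are correct (precision 0.75) and three of the five gold pairs are found (recall 0.6).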
The complexity and ambiguity of gene nomenclature pose a critical problem for developing text mining applications in the biosciences.
In this thesis we propose a human gene name normalization approach that identifies the gene and protein names in biomedical literature. First, we establish a gene and protein dictionary that provides the corresponding gene ID, and we use various string transformations to match gene and protein names in the literature. The dictionary is composed of the Entrez Gene dictionary and the BioThesaurus dictionary. Second, we use a Maximum Entropy model to disambiguate ambiguous gene names. Our system achieves 81% precision and 75% recall.
We provide a pop-up PubMed assistant tool to users. It shows all gene and protein names in a PubMed article together with their identifiers. We also provide GO function annotations and related genes to users.
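The first phase can be illustrated with a minimal sketch of dictionary lookup after string transformation. The abstract does not list the transformations actually used, so the ones below (lowercasing, treating hyphens, underscores, and slashes as spaces, and splitting letters from digits) are assumptions, and the dictionary entries and identifiers are invented placeholders rather than real Entrez Gene or BioThesaurus records.

```python
import re

# Hypothetical toy dictionary: gene ID -> known names (not real records).
TOY_DICTIONARY = {
    "ID:1": ["ABC-1", "alpha beta carrier 1"],
    "ID:2": ["XYZ2", "abc 1"],
}


def normalize(name):
    """Apply simple, assumed string transformations to a gene name."""
    s = name.lower()
    s = re.sub(r"[-_/]", " ", s)            # punctuation variants -> space
    s = re.sub(r"([a-z])(\d)", r"\1 \2", s)  # "abc1" -> "abc 1"
    s = re.sub(r"(\d)([a-z])", r"\1 \2", s)  # "1b"   -> "1 b"
    return " ".join(s.split())               # collapse whitespace


def build_index(dictionary):
    """Map each normalized name to the set of gene IDs that use it."""
    index = {}
    for gene_id, names in dictionary.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(gene_id)
    return index


def lookup(mention, index):
    """Return candidate gene IDs for a mention (several IDs = ambiguous)."""
    return sorted(index.get(normalize(mention), set()))


index = build_index(TOY_DICTIONARY)
print(lookup("xyz-2", index))    # ['ID:2']
print(lookup("Abc1", index))     # ['ID:1', 'ID:2']
print(lookup("unknown", index))  # []
```

A mention that returns more than one candidate, such as `Abc1` here, is the kind of ambiguous name that would be passed on to the second phase.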
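For the second phase named in the abstract, the sketch below shows one way a Maximum Entropy model (equivalent to multinomial logistic regression) could choose among candidate gene IDs for an ambiguous name. The features (context words paired with a candidate ID), the training loop, the hyperparameters, and the toy data are all assumptions made for illustration; the abstract does not describe the thesis's actual features or training procedure.

```python
import math
from collections import defaultdict


def features(context_words, label):
    """Joint (word, label) indicator features; an assumed feature set."""
    return [(w.lower(), label) for w in context_words]


def probabilities(weights, context_words, labels):
    """Softmax over candidate labels given the context words."""
    scores = [sum(weights[f] for f in features(context_words, y))
              for y in labels]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def train(examples, labels, epochs=200, learning_rate=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in examples:
            probs = probabilities(weights, words, labels)
            for y, p in zip(labels, probs):
                target = 1.0 if y == gold else 0.0
                for f in features(words, y):
                    weights[f] += learning_rate * (target - p)
    return weights


def disambiguate(weights, context_words, labels):
    """Pick the candidate ID with the highest model probability."""
    probs = probabilities(weights, context_words, labels)
    return max(zip(probs, labels))[1]


# Toy training data: context words around an ambiguous name, with the
# correct (made-up) gene ID for each occurrence.
LABELS = ["ID:1", "ID:2"]
EXAMPLES = [
    (["membrane", "transport"], "ID:1"),
    (["lipid", "transport"], "ID:1"),
    (["kinase", "signaling"], "ID:2"),
    (["kinase", "phosphorylation"], "ID:2"),
]

model = train(EXAMPLES, LABELS)
print(disambiguate(model, ["membrane", "lipid"], LABELS))
print(disambiguate(model, ["kinase"], LABELS))
```

With this toy data the first call should select `ID:1` and the second `ID:2`, since the context words were only seen with those IDs during training.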