| Graduate student: | 甘昇玄 Gan, Sheng-Xuan |
|---|---|
| Thesis title: | 生物醫學文件中基因關係摘要系統 A Summarization System for Gene Relations in Biomedical Literatures |
| Advisor: | 蔣榮先 Chiang, Jung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2004 |
| Academic year of graduation: | 92 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 42 |
| Chinese keywords: | 生物資訊、資訊淬取、自動機 |
| English keywords: | bioinformatics, information extraction, automata |
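Now that scientists have a vast biological database of three billion DNA sequences, biologists can no longer interpret these data with traditional pen-and-paper methods; only the high-speed computing power of computers makes it possible to organize and analyze data on this scale. This thesis combines data mining, natural language processing, and finite state machine techniques to summarize information related to a query gene from a large collection of medical documents. The processing pipeline has three parts: 1. Using the writing conventions with which medical documents describe gene-related information, together with data mining methods, to make an initial selection from the sentences of genes, functions, and diseases that may be related to the query gene. 2. Simplifying the sentence structure and converting it into the input format required by the next module. 3. Tracing the selected candidates and the query back to the original text and using a finite state machine to determine whether the pattern of the original sentence describes a relation with the query gene, then extracting the related genes, functions, and diseases that match the sentence-pattern rules, together with their original sentences. Finally, we organize the summary information into 1. information on related genes, 2. information on related functions, and 3. information on related diseases, to provide medical researchers with information on biological pathways, protein interaction prediction, human disease-causing genes, and protein functions, and so speed up the medical research process.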
After biologists created a database of three billion DNA sequences, high-speed processing became necessary to analyze such a vast amount of data. This thesis proposes a summarization system that summarizes information related to a query gene in the biomedical literature. The system combines three techniques (data mining, natural language processing, and finite state machines) and consists of three processing steps. First, it uses authors' writing habits in the biomedical literature, together with a data mining method, to extract candidate related genes, functions, and diseases. Second, it tags the part of speech of each token in a sentence and simplifies the sentence pattern. Third, it uses a finite state machine to extract summary sentences that describe the relation between the query gene and other genes, functions, or diseases.
Finally, the summary information is organized into three groups: 1. information about related genes; 2. information about related functions; 3. information about related diseases. The system provides information to support research on biological pathways, protein-protein interactions, human cancer genes, and gene functions, and to speed up the process of medical research.
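The third step, matching a simplified sentence against a relation pattern with a finite state machine, can be illustrated with a small Python sketch. The abstract does not give the thesis's actual sentence-pattern rules, so everything here is an assumption made for illustration: the single pattern "query gene, relation verb, candidate entity", the verb list `RELATION_VERBS`, the state names, and the functions `classify` and `match_relation` are all hypothetical. Skipping unclassified tokens stands in, very roughly, for the sentence simplification of step two.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical transition table: state -> {token class -> next state}.
# It encodes one illustrative pattern: QUERY ... VERB ... ENTITY.
TRANSITIONS = {
    "START": {"QUERY": "SAW_QUERY"},
    "SAW_QUERY": {"VERB": "SAW_VERB"},
    "SAW_VERB": {"ENTITY": "ACCEPT"},
}

# Illustrative relation verbs; not taken from the thesis.
RELATION_VERBS = {"activates", "inhibits", "regulates", "binds"}


def classify(token: str, query: str, candidates: Set[str]) -> str:
    """Map a token to a coarse class used by the state machine."""
    t = token.lower()
    if t == query.lower():
        return "QUERY"
    if t in candidates:
        return "ENTITY"
    if t in RELATION_VERBS:
        return "VERB"
    return "OTHER"


def match_relation(
    sentence: str, query: str, candidates: Set[str]
) -> Optional[Tuple[str, str, str]]:
    """Return (query, verb, entity) if the sentence matches the pattern."""
    lowered = {c.lower() for c in candidates}
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    state = "START"
    seen_query = seen_verb = None
    for token in tokens:
        cls = classify(token, query, lowered)
        next_state = TRANSITIONS.get(state, {}).get(cls)
        if next_state is None:
            continue  # no transition for this token: ignore it, keep state
        if cls == "QUERY":
            seen_query = token
        elif cls == "VERB":
            seen_verb = token
        state = next_state
        if state == "ACCEPT":
            return (seen_query, seen_verb, token)
    return None  # input ended before reaching the accepting state


print(match_relation("BRCA1 inhibits the expression of TP53.", "BRCA1", {"TP53"}))
# → ('BRCA1', 'inhibits', 'TP53')
```

A sentence that merely mentions both names without a relation verb between them, such as "TP53 was studied in BRCA1 carriers.", never reaches the accepting state and yields `None`. A fuller system would presumably need many patterns (passive voice, entity-first order) and part-of-speech tags rather than a fixed verb list.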