| Graduate student: | 桂卓慶 Kooi, Tock-kheng |
|---|---|
| Thesis title: | 利用文字探勘技術萃取轉錄因子與目標基因調控資訊 Using Text Mining Techniques to Extract Regulation Between Transcription Factor and Target Gene |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 43 |
| Chinese keywords: | 目標基因與轉錄因子關係 (relationship between target genes and transcription factors), 資訊萃取 (information extraction) |
| English keywords: | information extraction, regulation information between TF and TGene |
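With the complete decoding of the human genome sequence, many biological experiments continue to be carried out. The human body functions because many healthy proteins continuously interact with one another (protein-protein interaction; PPI). Proteins are the final products of genes, produced along the pathway DNA -> mRNA -> protein. Transcriptional regulation is the first and most important step in the process by which a gene gives rise to a protein. This study is concerned with identifying transcription factors (TFs) and target genes (TGenes) and analyzing the regulatory relationships between them; such information is ultimately published in the literature.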
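As the biological literature grows rapidly year by year, it is difficult for biological researchers to read all of it and extract information on transcription factors and target genes. If information technology could process large volumes of biological literature efficiently, filter it, and help readers extract the relationships between target genes and transcription factors, it would greatly help biological researchers choose their experimental targets.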
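Most researchers currently devote their efforts to extracting protein-protein interaction information, whereas this study focuses on the regulatory relationship between transcription factors and target genes. The difficulties of this study are that (1) name recognition requires two different biological dictionaries, and (2) the extraction of regulatory relationships must be strictly defined: for example, a transcription factor regulates a target gene, but a target gene cannot regulate a transcription factor. In addition, most researchers rarely handle hedging expressions in sentences, such as “Previous Studies …”, which refers to earlier research and therefore does not show that the paper itself presents experimental evidence.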
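Accordingly, this study designs a method that combines negative (incorrect) patterns with positive (correct) patterns to analyze the literature returned by PubMed queries and to predict the relationships between TFs and TGenes, providing a reference for biological researchers' experiments. The experimental results show that the F-measure obtained by combining negative and positive patterns is higher than the F-measure obtained with positive patterns alone.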
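For readers unfamiliar with the evaluation metric named above, the following is the standard definition of the balanced F-measure. This record does not state which variant the thesis uses, so the balanced form (equal weight on precision and recall) is an assumption; $TP$, $FP$, and $FN$ denote correctly extracted, wrongly extracted, and missed TF-TGene relations.

$$
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}
$$

Under this definition, a negative pattern that discards wrongly extracted relations lowers $FP$ and therefore raises $P$; $F$ rises as long as $R$ does not fall by a comparable amount.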
The human genome sequence has been completely decoded. These data are helpful for gene identification and for the study of gene regulation. Gene regulation research includes regulation information between transcription factors (TFs) and target genes (TGenes), which can help biologists learn which TGenes are regulated by a given TF. At present, regulation information is mostly recorded in the biological literature.
Owing to the rapid growth of the biological literature, biologists can hardly find the time to read through all related publications and extract regulation information between TFs and TGenes. Information technology that filters the literature and extracts the relationships between TFs and TGenes could therefore improve reading efficiency.
Most researchers currently concentrate on protein-protein interaction research, whereas this thesis specializes in extracting regulation between TFs and TGenes. The difficulties are that (1) named entity recognition needs two domain dictionaries and (2) relation recognition must be strictly defined: for example, a TF can regulate the expression of a TGene, but a TGene cannot regulate a TF. In addition, most researchers focus on extracting important information but pay less attention to modality information; a phrase such as “Previous Studies…” refers to earlier work and does not indicate that the paper itself provides experimental evidence.
Therefore, this thesis uses text-mining techniques to analyze the literature returned by TF queries to PubMed, and uses negative and positive patterns to predict the relationships between TFs and target genes, which may give biologists valuable insight.
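The ideas in the abstract (two dictionaries, a strictly directional relation, and negative patterns that veto hedged sentences) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the entity names `TF1`, `TF2`, `GENEA`, `GENEB`, the verb list, the single negative cue, and the function `extract_regulations` are all made-up placeholders chosen for illustration.

```python
import re

# Hypothetical toy dictionaries. The thesis uses two biological dictionaries,
# but their contents are not given in this record.
TF_NAMES = {"TF1", "TF2"}
TGENE_NAMES = {"GENEA", "GENEB"}

# Hypothetical regulation verbs for the positive pattern.
REGULATION_VERBS = r"(?:regulates|activates|represses|induces)"

# Hypothetical negative pattern: a cue that the sentence reports earlier work
# rather than evidence presented in the paper itself.
NEGATIVE_CUES = [re.compile(r"\bprevious\s+studies\b", re.IGNORECASE)]


def _alternation(names):
    """Build a regex alternation, longest names first to avoid partial matches."""
    return "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))


# Positive pattern: "<TF> <verb> [the] <TGene>". The TF slot only accepts names
# from the TF dictionary and the TGene slot only names from the TGene
# dictionary, so the TF -> TGene direction is enforced by construction.
POSITIVE = re.compile(
    rf"(?<![\w-])(?P<tf>{_alternation(TF_NAMES)})(?![\w-])"
    rf"\s+{REGULATION_VERBS}\s+(?:the\s+)?"
    rf"(?P<tgene>{_alternation(TGENE_NAMES)})(?![\w-])"
)


def extract_regulations(sentence):
    """Return (TF, TGene) pairs asserted in a sentence.

    Returns an empty list when any negative pattern fires, even if a
    positive pattern would also match.
    """
    if any(cue.search(sentence) for cue in NEGATIVE_CUES):
        return []
    return [(m.group("tf"), m.group("tgene")) for m in POSITIVE.finditer(sentence)]
```

Calling `extract_regulations("TF1 activates GENEA in these cells.")` yields one pair, while a sentence that opens with "Previous studies showed that" yields none, and a sentence with the roles reversed ("GENEA activates TF1.") yields none because `GENEA` is not in the TF dictionary. Checking negative patterns first keeps the veto independent of how many positive patterns are later added.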