| Graduate student: | 陳怡秀 Chen, Yi-hsiu |
|---|---|
| Thesis title: | 醫學文獻樣板辨識與擴張學習方法 A Pattern Recognition and Extended Learning Method for Medical Literatures |
| Advisor: | 王惠嘉 Wang, Hei-chia |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2009 |
| Graduation academic year: | 97 |
| Language: | Chinese |
| Number of pages: | 101 |
| Keywords (Chinese): | 目標基因, 轉錄因子, Bootstrapping, 文字探勘 |
| Keywords (English): | Text mining, Target gene, Transcription factor, Bootstrapping |
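Human physiological function depends mainly on sound protein interactions. In the production of proteins, transcriptional regulation is a critical initial step that determines gene expression and subsequent protein formation, so identifying the regulatory relationships between transcription factors (TF) and target genes (TGene) is an important problem. Much of this information is recorded in the literature, but as the literature grows rapidly, biologists find it increasingly difficult to obtain what they need and must spend considerable time and labor screening and filtering data. Using information technology to help biologists find useful information in this large body of literature is therefore especially important.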
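To extract the needed information from the biomedical literature, many researchers have proposed methods that improve on ineffective document search, but these methods still have potential drawbacks. Manually generated patterns, for example, yield high precision but low recall, whereas counting keywords common in biomedicine yields low precision but high recall. In addition, current pattern learning methods learn only once, so some useful information is never found.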
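The precision/recall trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, not taken from the thesis: the function name `precision_recall` and all (TF, TGene) pairs are hypothetical and chosen only to illustrate how a narrow set of manual patterns and a broad keyword match might score against the same gold standard.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted pairs against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard (TF, TGene) pairs, for illustration only.
gold = {("HIF-1", "VEGF"), ("p53", "MDM2"), ("NF-kB", "IL6"), ("MYC", "CCND2")}

# Assumed output of a few hand-written patterns: few pairs, all correct.
manual_patterns = {("HIF-1", "VEGF"), ("p53", "MDM2")}

# Assumed output of keyword co-occurrence: more pairs, many of them wrong.
keyword_matches = {
    ("HIF-1", "VEGF"), ("p53", "MDM2"), ("NF-kB", "IL6"),
    ("HIF-1", "hypoxia"), ("p53", "cancer"), ("MYC", "cells"),
}

manual_scores = precision_recall(manual_patterns, gold)    # (1.0, 0.5)
keyword_scores = precision_recall(keyword_matches, gold)   # (0.5, 0.75)
```

In this toy setting the manual patterns reach higher precision and the keyword match reaches higher recall, which is the tension the bootstrapping approach is meant to ease.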
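To learn more useful patterns from documents, this study uses the Bootstrapping technique to generate patterns automatically, removing the inconvenience of generating them manually, and seeks to obtain more effective patterns over repeated runs. The initial training phase generates correct (positive) and incorrect (negative) patterns and identifies the TF and TGene in each sentence to form a Tuple(TF,TGene). The bootstrapping phase then uses the obtained Tuples to find relevant sentences for pattern training, producing more patterns, and uses the new patterns to find new Tuples; this phase is repeated until a termination condition is met. Finally, pattern matching extracts the regulatory information between transcription factors and target genes. We hope this work lets biologists obtain the information they need quickly and accurately, without spending extra time and effort browsing and filtering large volumes of literature. The results show that the method can efficiently find regulatory information between transcription factors and target genes and achieves good results with minimal labor and human involvement.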
Human physiological function relies on sound protein-protein interaction (PPI). In protein synthesis, transcriptional regulation is a significant initial step that determines gene expression and subsequent protein formation. Identifying the regulatory relationship between a transcription factor (TF) and its target gene (TGene) is therefore an important issue. The relevant information is recorded in the medical literature, which is growing quickly, so using information technology to help biologists find useful information has become especially important.
To extract useful information from the medical literature, many researchers have proposed methods that improve on conventional document search, but some of these methods have potential problems. In particular, current pattern learning methods learn only once, so useful information cannot be found completely.
The purpose of our research is to remove the inconvenience of generating patterns manually. We generate patterns automatically with the Bootstrapping technique and learn more useful patterns from documents over many iterations. In the initial training phase, we generate positive and negative patterns and look for the TF and TGene in each sentence to form a Tuple(TF,TGene). In the bootstrapping phase, we use the Tuples to find relevant sentences, train on them to obtain more patterns, and then use the new patterns to find new Tuples. The bootstrapping phase is executed repeatedly until a termination criterion is met. Finally, we use pattern matching to extract information about transcriptional regulation. We hope our research lets biologists obtain useful information without spending additional time and energy browsing and filtering the literature. The experiments show that our method, using the bootstrapping technique, can efficiently find transcriptional regulation information for transcription factors and target genes with minimal labor and human participation.
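The bootstrapping phase described above alternates between learning patterns from known Tuples and finding new Tuples with the learned patterns. The following is a minimal sketch of that loop under strong simplifying assumptions, not the implementation used in the thesis: a pattern is taken to be the literal text between a TF and a TGene, entity recognition is replaced by hypothetical name lists (`tf_names`, `gene_names`), negative patterns and pattern scoring are omitted, and the termination criterion is assumed to be "nothing new was learned, or `max_iterations` was reached". All sentences and function names are illustrative.

```python
import re


def learn_patterns(sentences, tuples):
    """Collect the text found between a known TF and its target gene."""
    patterns = set()
    for tf, tgene in tuples:
        regex = re.compile(re.escape(tf) + r"\s+(.+?)\s+" + re.escape(tgene))
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def match_patterns(sentences, patterns, tf_names, gene_names):
    """Apply every pattern to every sentence and return the (TF, TGene) pairs found."""
    found = set()
    for pattern in patterns:
        regex = re.compile(r"(\S+)\s+" + re.escape(pattern) + r"\s+(\S+)")
        for sentence in sentences:
            for tf, tgene in regex.findall(sentence):
                tgene = tgene.rstrip(".,;")
                # Assumed stand-in for entity recognition: a name-list lookup.
                if tf in tf_names and tgene in gene_names:
                    found.add((tf, tgene))
    return found


def bootstrap(sentences, seed_tuples, tf_names, gene_names, max_iterations=5):
    """Alternate pattern learning and tuple extraction until nothing new appears."""
    tuples, patterns = set(seed_tuples), set()
    for _ in range(max_iterations):
        new_patterns = learn_patterns(sentences, tuples) - patterns
        patterns |= new_patterns
        new_tuples = match_patterns(sentences, patterns, tf_names, gene_names) - tuples
        if not new_patterns and not new_tuples:
            break  # assumed termination criterion
        tuples |= new_tuples
    return tuples, patterns


# Hypothetical toy corpus and name lists, for illustration only.
sentences = [
    "HIF-1 activates VEGF under hypoxia.",
    "p53 activates MDM2 in response to stress.",
    "p53 directly binds the promoter of MDM2 in these cells.",
    "MYC directly binds the promoter of CCND2 in these cells.",
]
tf_names = {"HIF-1", "p53", "MYC"}
gene_names = {"VEGF", "MDM2", "CCND2"}

tuples, patterns = bootstrap(sentences, {("HIF-1", "VEGF")}, tf_names, gene_names)
```

Starting from the single seed pair, the first pass learns the pattern "activates" and finds (p53, MDM2); the second pass uses that new Tuple to learn "directly binds the promoter of", which in turn yields (MYC, CCND2). A single-pass learner would stop after the first pattern, which is the limitation the abstract attributes to learning only once.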