
Graduate student: 陳怡秀
Chen, Yi-hsiu
Thesis title: 醫學文獻樣板辨識與擴張學習方法
A Pattern Recognition and Extended Learning Method for Medical Literatures
Advisor: 王惠嘉
Wang, Hei-chia
Degree: Master's
Master
Department: College of Management - Institute of Information Management
Institute of Information Management
Year of publication: 2009
Academic year of graduation: 97 (ROC calendar, i.e. 2008-2009)
Language: Chinese
Number of pages: 101
Chinese keywords: 目標基因 (target gene), 轉錄因子 (transcription factor), Bootstrapping, 文字探勘 (text mining)
English keywords: Text mining, Target gene, Transcription factor, Bootstrapping
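  • Human physiological functions depend largely on sound protein interactions. In protein synthesis, transcriptional regulation is a crucial initial step that determines gene expression and the subsequent formation of proteins, so identifying the regulatory relationships between transcription factors (Transcription Factor, TF) and target genes (Target Gene, TGene) is an important problem. Much of this information is recorded in the literature, but as the literature grows rapidly, biologists find it increasingly hard to obtain what they need and must spend considerable time and effort screening and filtering data. Using information technology to help biologists find useful information in this vast literature is therefore especially important.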
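    To extract the required information from biomedical literature, many researchers have proposed methods that improve on ineffective document search, but these methods still have potential drawbacks. Manually generated patterns (Manual Generate Pattern) give high precision but low recall, whereas counting keywords that are common in biomedicine gives low precision but high recall. In addition, current pattern learning methods learn only once, so some useful information is never found.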
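    To learn more useful patterns from documents, this study uses the Bootstrapping technique to generate patterns automatically (Automatic Generate Pattern), which removes the inconvenience of generating patterns by hand, and seeks to obtain more effective patterns over multiple iterations. The initial training phase produces positive (correct) and negative (incorrect) patterns and identifies the TF and TGene in each sentence to form a Tuple(TF,TGene). The Bootstrapping phase then uses the obtained Tuples to find related sentences for pattern training, which produces more patterns; the new patterns are used to find new Tuples, and the Bootstrapping phase repeats until a termination condition is met. Finally, pattern matching (Pattern Matching) extracts the regulatory information between transcription factors and target genes. We hope this research lets biologists obtain the information they need quickly and accurately, without spending extra time and effort browsing and filtering large volumes of literature. The results show that the proposed method can efficiently find regulatory information between transcription factors and target genes, and that it achieves good results with minimal manual labor and human involvement.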

    Human physiological functions rely on sound protein-protein interaction (PPI). In protein synthesis, transcriptional regulation is a significant initial step that determines gene expression and subsequent protein formation. Identifying the regulatory relationships between transcription factors (TF) and target genes (TGene) is therefore an important issue. Much of the relevant information is recorded in the medical literature, which is growing quickly, so using information technology to help biologists find useful information is especially important.
    To extract useful information from the medical literature, many researchers have proposed methods that improve on document search, but these methods have potential problems. In particular, current pattern learning methods learn only once, so useful information cannot be found completely.
    The purpose of our research is to remove the inconvenience of generating patterns manually. We generate patterns automatically with the bootstrapping technique so that more useful patterns can be learned from documents over many iterations. In the initial training phase, we generate positive and negative patterns and identify the TF and TGene in each sentence to form a Tuple(TF,TGene). In the bootstrapping phase, we use the Tuples to find relevant sentences, train on them to obtain more patterns, and then use the new patterns to find new Tuples. The bootstrapping phase repeats until a termination criterion is met. Finally, we use pattern matching to extract information about transcriptional regulation. We hope this research lets biologists obtain useful information without spending additional time and energy browsing and filtering the literature. The experiments show that our method, using bootstrapping, can efficiently find transcriptional regulation information for transcription factors and target genes with minimal labor and human participation.
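
    The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the pattern representation (the text between two entity mentions, in the style of DIPRE and Snowball), the name dictionaries `tf_names` and `gene_names`, all function names, and the toy sentences are assumptions made here for illustration only.

```python
import re
from itertools import product


def middle_context(sentence, first, second):
    """Return the lower-cased text between `first` and `second`, or None."""
    regex = (r"(?<!\w)" + re.escape(first) + r"\s+(.+?)\s+"
             + re.escape(second) + r"(?!\w)")
    match = re.search(regex, sentence)
    return match.group(1).lower() if match else None


def learn_patterns(sentences, tuples):
    """Derive (order, context) patterns from sentences that mention a known tuple."""
    patterns = set()
    for sentence in sentences:
        for tf, tgene in tuples:
            ctx = middle_context(sentence, tf, tgene)
            if ctx:
                patterns.add(("TF-TGene", ctx))
            ctx = middle_context(sentence, tgene, tf)
            if ctx:
                patterns.add(("TGene-TF", ctx))
    return patterns


def match_patterns(sentences, patterns, tf_names, gene_names):
    """Extract (TF, TGene) tuples from sentences whose context matches a pattern."""
    found = set()
    for sentence in sentences:
        for tf, gene in product(tf_names, gene_names):
            if ("TF-TGene", middle_context(sentence, tf, gene)) in patterns:
                found.add((tf, gene))
            if ("TGene-TF", middle_context(sentence, gene, tf)) in patterns:
                found.add((tf, gene))
    return found


def bootstrap(sentences, seeds, tf_names, gene_names,
              negative=frozenset(), max_iter=5):
    """Alternate pattern learning and tuple extraction until nothing new appears."""
    tuples, patterns = set(seeds), set()
    for _ in range(max_iter):
        # Learn patterns from the current tuples; drop known-bad (negative) ones.
        patterns |= learn_patterns(sentences, tuples) - set(negative)
        new = match_patterns(sentences, patterns, tf_names, gene_names) - tuples
        if not new:  # termination condition assumed here: no new tuples
            break
        tuples |= new
    return tuples, patterns


# Toy sentences written for this sketch, not taken from the thesis corpus.
sentences = [
    "HIF-1 activates VEGF in hypoxic cells.",
    "p53 activates CDKN1A after DNA damage.",
    "CDKN1A is induced by p53 in fibroblasts.",
    "MDM2 is induced by p53 as well.",
]
seeds = {("HIF-1", "VEGF")}
tf_names = {"HIF-1", "p53"}
gene_names = {"VEGF", "CDKN1A", "MDM2"}

tuples, patterns = bootstrap(sentences, seeds, tf_names, gene_names)
print(sorted(tuples))
print(sorted(patterns))
```

    In this toy run a single seed tuple yields the pattern "activates", which finds a second tuple; that tuple in turn exposes the reverse-order pattern "is induced by", which finds a third. A single learning pass would have stopped after the first pattern, which is the limitation of one-shot pattern learning that the abstract points out.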

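    1. Introduction
        1.1. Research Background
        1.2. Research Motivation and Objectives
        1.3. Research Scope and Limitations
        1.4. Research Process
        1.5. Thesis Outline
    2. Literature Review
        2.1. Bioinformatics Resources
            2.1.1. Transcription Factors and Target Genes
            2.1.2. Journal Literature Retrieval Systems
            2.1.3. Unified Medical Language System
            2.1.4. Sequence Retrieval System
            2.1.5. HUGO Gene Nomenclature Committee
        2.2. Natural Language Processing
            2.2.1. Part-of-Speech Tagging
            2.2.2. Stemming
        2.3. Machine Learning
        2.4. Bootstrapping
            2.4.1. DIPRE
            2.4.2. Snowball
            2.4.3. KnowItAll
            2.4.4. TextRunner
        2.5. Related Work
        2.6. Prior Work
        2.7. Summary
    3. Research Method
        3.1. Research Framework
        3.2. Comparison of Research Methods
        3.3. Literature Retrieval and Preprocessing Phase
            3.3.1. PubMed Literature Query
            3.3.2. Labeling of Transcription Factor and Target Gene Names
            3.3.3. Sentence Constraints
            3.3.4. Preprocessing
        3.4. Initial Training Phase
            3.4.1. Positive Pattern Training Module
                3.4.1.1. Sentence Structure Analysis
                3.4.1.2. Positive Pattern Generation
                3.4.1.3. Tuple Extraction
            3.4.2. Negative Pattern Training Module
                3.4.2.1. Negative Sentence Filtering
                3.4.2.2. Negative Pattern Generation
        3.5. Bootstrapping Phase
            3.5.1. Sentence Filtering
            3.5.2. Finding New Tuples
            3.5.3. Bootstrapping
        3.6. Pattern Matching Phase
        3.7. Pseudocode
            3.7.1. Pseudocode for Literature Retrieval and Preprocessing
            3.7.2. Pseudocode for the Initial Training Phase
            3.7.3. Pseudocode for the Bootstrapping Phase
            3.7.4. Pseudocode for Pattern Matching
            3.7.5. Pseudocode for Positive Pattern Training
    4. System Implementation and Validation
        4.1. System Implementation
            4.1.1. Implementation Environment
            4.1.2. Packages and Modules Used
            4.1.3. System Processing Flow
        4.2. Experimental Method
            4.2.1. Data Sources
            4.2.2. Comparison Baselines
            4.2.3. Evaluation Metrics
        4.3. Experimental Results and Analysis
            4.3.1. Experiment 1: Manually labeled initial dataset
            4.3.2. Experiment 2: Initial dataset manually checked for correct sentences, without labeling
            4.3.3. Experiment 3: Initial dataset containing TF-TGene/TGene-TF sentences, without labeling
            4.3.4. Experiment 4: Training negative patterns at Iteration 1
            4.3.5. Experiment 5: Manually labeled initial dataset, filtered with negative patterns
            4.3.6. Experiment 6: Initial dataset manually checked for correct sentences, without labeling, then filtered with negative patterns
            4.3.7. Experiment 7: Initial dataset containing TF-TGene/TGene-TF sentences, without labeling, then filtered with negative patterns
        4.4. Sample System Screens
    5. Conclusions and Future Research Directions
        5.1. Research Results
        5.2. Future Research Directions
    References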

    On-campus access: public from 2014-08-06
    Off-campus access: public from 2014-08-06