
Graduate student: Li, Chien-Ching (李建慶)
Thesis title: Clustering Genes by Gene Ontology and Literatures with Fuzzy Measure-based Similarity (利用Gene Ontology 與文獻透過模糊相似度分析進行基因的分群)
Advisor: Wang, Hei-Chia (王惠嘉)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2008
Academic year of graduation: 96
Language: Chinese
Number of pages: 70
Keywords: Fuzzy similarity measure, Gene clustering, Gene Ontology
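  • Similarity analysis of proteins or genes has traditionally relied on sequence
    similarity, which does not take functional similarity into account. In recent years,
    a growing number of studies have used GO, the literature, and similar resources for
    functional similarity analysis. Gene Ontology (GO) is a dynamic controlled vocabulary
    that the Gene Ontology Consortium built from annotation data on genes and proteins
    to describe the roles that eukaryotic genes play in the cell and the related
    biomedical knowledge; the consortium has organized this vocabulary into the Gene
    Ontology database.
    Every eukaryotic gene or protein can be converted under the GO system into a set of
    GO annotations (GO terms). Traditionally, however, GO-based clustering of gene
    products has relied on ontology techniques such as Information Content or Edge
    Counting. This thesis instead considers how GO terms come to be annotated and the
    influence of the literature associated with them, and uses the fuzzy binary relation
    from fuzzy theory to find the relations among documents. The goal of this research
    is to analyze the similarity of gene products with a new and different similarity
    analysis method, so that biologists can view the relations among gene products from
    another angle. Because the number of published protein functions has grown rapidly
    in recent years, clustering proteins can improve the accuracy of biotechnological
    analysis results. The system framework of this thesis consists of three modules, in
    order: the document retrieving module, the similarity calculation module, and the
    clustering module.
    The document retrieving module takes the annotations that GO assigns to a protein
    and, for this set of annotations, finds the corresponding related papers in PubMed.
    The similarity calculation module uses the Fuzzy Similarity Measure to score each
    pair of documents and assign weights, and from these scores computes the similarity
    between sets. The clustering module then applies hierarchical agglomerative
    clustering to the gene products, using the similarity score matrix produced by the
    similarity calculation module, and identifies the protein that represents each
    cluster, to support biologists in subsequent protein or gene analysis. When the
    results of the similarity formula defined in this research are compared with those
    of other traditional similarity measures, they not only identify the correct
    clusters but also reveal the similarity relations among the members within a
    cluster; the similarity calculation in this research can therefore correctly
    recover known clusters.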
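
    The similarity and clustering modules described above can be illustrated with a
    small sketch. The abstract does not state the thesis's similarity formula, so this
    example substitutes a generic fuzzy Jaccard score (sum of minimum memberships over
    sum of maximum memberships) for the document-to-document step and a best-match
    average for the set-to-set step; both are assumptions, not the author's method.
    The gene names, term weights, and the 0.5 threshold are hypothetical.

```python
from itertools import combinations


def fuzzy_jaccard(a, b):
    """Fuzzy-set similarity of two documents given as term -> membership dicts.

    Assumption: stand-in for the thesis's fuzzy similarity measure.
    """
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def set_similarity(docs_a, docs_b):
    """Average of best-match document scores, taken in both directions."""
    if not docs_a or not docs_b:
        return 0.0
    best_a = [max(fuzzy_jaccard(d, e) for e in docs_b) for d in docs_a]
    best_b = [max(fuzzy_jaccard(d, e) for d in docs_a) for e in docs_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


def hac(names, sim, threshold):
    """Average-linkage agglomerative clustering on a pairwise similarity table.

    Merging stops when the best available merge scores below `threshold`.
    """
    clusters = [[n] for n in names]

    def link(c1, c2):
        total = sum(sim[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical input: each gene product maps to the documents retrieved for it,
# and each document is a dict of term memberships in [0, 1].
literature = {
    "geneA": [{"kinase": 0.9, "signal": 0.6}],
    "geneB": [{"kinase": 0.8, "signal": 0.7}, {"kinase": 0.5}],
    "geneC": [{"membrane": 0.9, "transport": 0.8}],
    "geneD": [{"membrane": 0.7, "transport": 0.9}],
}
names = sorted(literature)
sim = {frozenset((x, y)): set_similarity(literature[x], literature[y])
       for x, y in combinations(names, 2)}
clusters = hac(names, sim, threshold=0.5)
print(clusters)
```

    With this toy input the two kinase-related products end up in one cluster and the
    two transport-related products in another. Stopping at a threshold rather than at a
    fixed number of clusters is a design choice of the sketch only.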

    Traditionally, the similarity of genes and proteins has been analyzed by sequence,
    without considering functional similarity. In recent years, a growing number of
    studies have analyzed functional similarity using GO and the literature. As more and
    more annotation data for genes were generated, an organization named the Gene
    Ontology Consortium built a dynamic controlled vocabulary to describe the roles that
    genes and proteins play in the cell, together with the related biomedical knowledge
    of eukaryotes.
    From the GO point of view, a gene or a protein can be annotated in three domains:
    biological process, molecular function, and cellular component. The GO researchers
    collect genes and proteins of different eukaryotic species from sources such as SGD,
    MGI, and FlyBase, and annotate and classify them.
    The genes and proteins of all eukaryotes can thus be converted into GO annotations
    by the GO system. Traditionally, ontology techniques such as Information Content or
    Edge Counting are applied to cluster gene products. Recently, the number of protein
    and gene sequences has increased rapidly. The objective of our research is to use a
    new and different similarity analysis method that takes more concepts of gene
    function into account. We expect to raise clustering precision by analyzing GO terms
    and the related PubMed literature in parallel, and we hope that this kind of
    similarity measure can help biologists find the relations among genes from a
    different point of view. In this work, a fuzzy similarity measure is adopted to
    calculate a score for each pair of documents, from which we compute the similarity
    between each pair of document sets and then cluster the gene products to find the
    representative gene of each cluster. The method also achieves good evaluation
    results in comparison with Information Content, Edge Counting, and Blastclust, a
    sequence similarity tool from NCBI.
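
    For readers unfamiliar with the two ontology baselines named above, the following
    sketch shows one common formulation of each: Resnik-style information content, where
    the similarity of two terms is the information content of their most informative
    common ancestor, and edge counting, where distance is the shortest path between two
    terms. The toy ontology, its term names, and the annotation counts are hypothetical
    and are not taken from GO or from the thesis.

```python
import math
from collections import deque

# Hypothetical toy ontology: each term maps to its parent terms (is_a edges).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "kinase": ["catalysis"],
}
# Hypothetical number of gene products annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "binding": 2, "catalysis": 1,
                 "dna_binding": 4, "rna_binding": 3, "kinase": 6}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def information_content(term):
    """IC(t) = -log p(t); p counts annotations to t or to any descendant of t."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / total)


def resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


def edge_distance(t1, t2):
    """Shortest path length between two terms, treating edges as undirected."""
    neighbours = {t: set(ps) for t, ps in PARENTS.items()}
    for child, ps in PARENTS.items():
        for p in ps:
            neighbours[p].add(child)
    queue, dist = deque([t1]), {t1: 0}
    while queue:
        t = queue.popleft()
        if t == t2:
            return dist[t]
        for n in neighbours[t]:
            if n not in dist:
                dist[n] = dist[t] + 1
                queue.append(n)
    return None  # the two terms are not connected


print(resnik("dna_binding", "rna_binding"), edge_distance("dna_binding", "rna_binding"))
print(resnik("dna_binding", "kinase"), edge_distance("dna_binding", "kinase"))
```

    In this toy ontology the two binding terms share the ancestor "binding" and sit two
    edges apart, while "dna_binding" and "kinase" share only the root, so their
    information-content similarity is zero and their edge distance is larger.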

    1. Introduction
    1.1. Research Background
    1.2. Research Motivation and Objectives
    1.3. Research Process
    1.4. Research Scope and Limitations
    1.5. Thesis Organization
    1.6. Summary
    2. Literature Review
    2.1. NCBI
    2.1.1. PubMed
    2.1.2. MeSH
    2.2. Gene Ontology
    2.3. Similarity Measures
    2.3.1. Pair-based Similarity Measure
        Node-based (Information Content) Approach
        Edge-based (Distance) Approach
        Pairwise Similarity with Average Function
    2.3.2. Set-based Similarity Measure
    2.3.3. Graph Similarity Techniques
    2.3.4. Fuzzy Measure-based Similarity
        Fuzzy Sugeno Measure Similarity
        Fuzzy Binary Relations
    2.4. Clustering Algorithms
    2.4.1. Graph-based Clustering Algorithms
    2.4.2. Partitioning Clustering Algorithms
    2.4.3. Hierarchical Clustering Algorithms
    2.4.4. Model-based Clustering Algorithms
    2.4.5. Density-based and Grid-based Clustering Algorithms
    2.5. Summary
    3. Research Method
    3.1. Research Framework
    3.2. Document Retrieving Module
    3.2.1. Retrieving GO Terms from Gene Ontology
    3.2.2. Retrieving Documents from PubMed
    3.3. Fuzzy Measure-based Similarity
    3.4. HAC Clustering Measure
    3.5. Summary
    4. Implementation and Validation
    4.1. System Implementation Design
    4.1.1. Preprocessing
    4.1.2. Document Retrieving Module
    4.1.3. Fuzzy Measure-based Similarity Module
    4.1.4. HAC Clustering Measure Module
    4.2. Experimental Method
    4.2.1. Comparison Targets and Data Sources
    4.2.2. Evaluation Metrics
    4.2.3. Experimental Design
    4.3. Experimental Results and Analysis
    4.3.1. Experimental Results of the Proposed Method
    4.3.2. Experimental Results of Blastclust
    4.3.3. Experimental Results of Information Content and Edge Count
    4.3.4. Analysis of Experiments
    5. Conclusions and Future Work
    5.1. Research Results and Contributions
    5.2. Future Research Directions
    References


    Full text publicly available on campus: 2011-07-08
    Off campus: 2011-07-08