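Graduate student: Chen, Sian-che (陳賢徹)
Thesis title: Individual gene ranking methods based on misclassification cost (考量分類錯誤成本的個別基因排序法)
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2009
Academic year of graduation: 97 (ROC calendar)
Language: Chinese
Number of pages: 54
Keywords (Chinese, translated): gene selection, gene microarray data, individual gene ranking method, misclassification cost
Keywords (English): gene selection, individual gene ranking, microarray data, misclassification cost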
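  • (Translated from the Chinese abstract.) With advances in information technology, gene microarray data have become a key resource for studying the relationship between specific genes and diseases such as cancer. Microarray data, however, are high-dimensional and have small sample sizes, so a classifier is often given many uninformative genes that not only fail to help classification but can even worsen the results. Gene selection before the data are analyzed therefore plays an important role. In cancer classification, the costs of different misclassification errors differ greatly, yet previous cost-sensitive studies mainly considered cost in the classification stage, and none made cost the primary consideration when selecting genes. This thesis therefore focuses on individual gene ranking and proposes three gene ranking methods that take cost into account. Using, respectively, the counts of a gene's different class values, the probability of a specific class, and the relative positions of class values, the methods rank genes by the cost loss they may incur. The required genes are then selected and passed to a logistic regression classifier that accounts for misclassification cost, with the aim of reducing misclassification cost while maintaining a reasonable level of classification accuracy. Tests on eight datasets show that all three cost-based individual gene ranking methods work as intended: they raise the classification accuracy of the class with the higher misclassification cost and lower the misclassification cost. This indicates that selecting genes with a cost-based individual gene ranking method is a feasible way to reduce the overall misclassification cost.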

    Microarray data have two special characteristics: a huge number of genes and a small number of available instances. Most of the genes are not only uninformative for classification but may even lead to worse results. Gene selection therefore plays an important role in processing microarray data for classification. In particular, the misclassification costs for different class values are generally not equal. Previous studies of cost-sensitive analysis consider misclassification cost only in the classification stage, not in the feature selection stage. This research proposes three individual gene ranking methods based on misclassification cost. Each method computes a gene's cost from a different quantity: the counts, the occurrence probabilities, or the relative positions of gene expression values and class values. Genes are then ranked by their costs, and the top-ranked genes are chosen for classification. The chosen gene subset is used in a cost-sensitive logistic regression classifier to evaluate its performance. The approach is tested on eight microarray datasets. The experimental results show that the proposed individual gene ranking methods can effectively increase the prediction accuracy of the class with the relatively high misclassification cost and decrease the average misclassification cost.
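The pipeline described in the abstract (score each gene on its own by the misclassification cost it may incur, rank the genes, keep the top ones) can be illustrated with a minimal sketch. The abstract does not give the formulas of the three proposed methods, so the scoring rule below is a stand-in chosen for illustration, not the thesis's algorithm: each gene is scored by the lowest total cost that a single threshold on its expression value can achieve. The function names, the two-class encoding (1 for the high-cost class), the cost values, and the toy data are all assumptions.

```python
from typing import List, Sequence


def min_threshold_cost(values: Sequence[float], labels: Sequence[int],
                       cost_fn: float, cost_fp: float) -> float:
    """Lowest total misclassification cost of any one-threshold rule on one gene.

    labels: 1 = high-cost class, 0 = other class (assumed encoding).
    cost_fn: cost of predicting 0 for a true 1.
    cost_fp: cost of predicting 1 for a true 0.
    """
    pairs = sorted(zip(values, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Trivial rules: predict everything as 0, or everything as 1.
    best = min(n_pos * cost_fn, n_neg * cost_fp)
    pos_below = neg_below = 0
    for i, (value, label) in enumerate(pairs):
        if label == 1:
            pos_below += 1
        else:
            neg_below += 1
        # A cut is only meaningful between two distinct expression values.
        if i + 1 < len(pairs) and pairs[i + 1][0] == value:
            continue
        # Rule A: predict 1 above the cut, 0 at or below it.
        cost_a = pos_below * cost_fn + (n_neg - neg_below) * cost_fp
        # Rule B: predict 1 at or below the cut, 0 above it.
        cost_b = (n_pos - pos_below) * cost_fn + neg_below * cost_fp
        best = min(best, cost_a, cost_b)
    return best


def rank_genes(expr: Sequence[Sequence[float]], labels: Sequence[int],
               cost_fn: float, cost_fp: float, k: int) -> List[int]:
    """Return the indices of the k genes with the lowest single-gene cost.

    expr is samples x genes; ties are broken by gene index.
    """
    n_genes = len(expr[0])
    costs = [min_threshold_cost([row[g] for row in expr], labels, cost_fn, cost_fp)
             for g in range(n_genes)]
    order = sorted(range(n_genes), key=lambda g: (costs[g], g))
    return order[:k]


# Toy data (made up): 6 samples, 3 genes; the first three samples are class 1.
labels = [1, 1, 1, 0, 0, 0]
expr = [
    [5.1, 0.1, 3.0],
    [4.8, 2.0, 3.2],
    [5.5, 4.0, 1.0],
    [1.0, 1.0, 1.1],
    [1.2, 2.5, 0.8],
    [0.9, 3.0, 2.9],
]
gene_costs = [min_threshold_cost([row[g] for row in expr], labels, 5, 1)
              for g in range(3)]
top_genes = rank_genes(expr, labels, cost_fn=5, cost_fp=1, k=2)
print(gene_costs, top_genes)
```

With a false negative assumed to cost five times as much as a false positive, the gene that separates the classes perfectly ranks first, and a gene whose best rule only produces cheap false positives ranks ahead of one that cannot avoid them. The selected gene indices would then be handed to a cost-sensitive classifier, as the abstract describes.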

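    Abstract (Chinese) I
    Abstract (English) II
    Acknowledgements III
    Table of Contents IV
    List of Tables VI
    List of Figures VII
    Chapter 1 Introduction 1
      1.1 Research motivation 1
      1.2 Research objectives 2
      1.3 Research framework 3
    Chapter 2 Literature Review 4
      2.1 Cost analysis 4
        2.1.1 Cost matrix 5
        2.1.2 Cost-sensitive analysis methods 6
      2.2 Gene selection methods 7
        2.2.1 Individual gene ranking methods 8
        2.2.2 Gene subset ranking methods 8
      2.3 Evaluation measures 9
        2.3.1 Receiver operating characteristic (ROC) curve 10
        2.3.2 Cost curve 12
      2.4 Cross-validation 14
    Chapter 3 Research Methods 16
      3.1 Gene selection framework and procedure 16
      3.2 Cost-based individual gene ranking methods 17
        3.2.1 Minimum-cost ranking method 18
        3.2.2 Probability ranking method 20
        3.2.3 Cost-based nonparametric scoring algorithm 21
        3.2.4 Special properties 23
      3.3 Cost-based classification algorithm 24
        3.3.1 Cost-sensitive logistic regression analysis 24
      3.4 Evaluation procedure 25
    Chapter 4 Empirical Study 27
      4.1 Data collection and preparation 27
      4.2 Parameter settings 29
      4.3 Empirical results 29
        4.3.1 Comparison of classification accuracy 29
        4.3.2 Comparison of misclassification cost 35
      4.4 Gene similarity 44
      4.5 Summary 47
    Chapter 5 Conclusions and Suggestions 49
      5.1 Conclusions 49
      5.2 Suggestions 50
    References 51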


    Full-text availability: on campus, public from 2010-07-20; off campus, public from 2010-07-20