
Graduate student: 許瑞洋
Hsu, Jui-Yang
Thesis title: 同時考量分類錯誤成本及正確率之二階段個別基因選取法
Two-stage individual gene selection methods based on misclassification cost and accuracy
Advisor: 翁慈宗
Wong, Tzu-Tsung
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2010
Academic year of graduation: 98
Language: Chinese
Number of pages: 55
Chinese keywords: 基因選取 (gene selection), 基因微陣列 (gene microarray), 分類錯誤成本 (misclassification cost)
English keywords: feature selection, gene microarray, misclassification cost
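  • With continual innovation in biological and medical technology, experts have developed gene microarray technology to find associations between specific genes and diseases. Microarray data, however, are high-dimensional with small samples and contain many genes that are irrelevant to the disease, so the data are usually preprocessed to reduce dimensionality; this process is called gene selection. In real cancer problems, misclassification costs differ across classes, so earlier cost-sensitive studies assigned different costs to different classes at the classification stage. When misclassification cost is considered at the gene selection stage, giving cost absolute priority causes the classification accuracy on the low-cost class to drop sharply, which sometimes makes misclassification cost rise rather than fall. For these reasons, this study proposes two-stage gene selection methods that combine accuracy-oriented and cost-oriented gene selection methods. The combinations yield two selection frameworks, high-accuracy-oriented and low-cost-oriented, for a total of 18 two-stage methods that select individual genes. The aim is to reduce misclassification cost relative to cost-oriented gene selection without excessively sacrificing accuracy on the low-cost class. The results show that some of the proposed combined methods are competitive with cost-oriented gene selection; among them, the probability ranking / t-value method performs best in terms of area under the accuracy curve and area under the cost curve.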
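    The tension between misclassification cost and accuracy on the low-cost class can be illustrated with a small sketch. The labels, predictions, and cost values below (5 for a missed tumor, 1 for a false alarm) are hypothetical and are not taken from the thesis.

```python
def total_cost(y_true, y_pred, cost):
    """Sum cost[(true, predicted)] over all instances; correct predictions cost 0."""
    return sum(cost.get((t, p), 0.0) for t, p in zip(y_true, y_pred))

def class_accuracy(y_true, y_pred, label):
    """Fraction of instances whose true class is `label` that are predicted correctly."""
    idx = [i for i, t in enumerate(y_true) if t == label]
    return sum(y_pred[i] == label for i in idx) / len(idx)

# Hypothetical costs: class 1 (tumor) predicted as 0 costs 5, class 0 predicted as 1 costs 1.
# Class 0 is therefore the low-cost class.
COST = {(1, 0): 5.0, (0, 1): 1.0}

y_true = [1, 1, 0, 0, 0, 0]
pred_a = [1, 0, 0, 0, 0, 0]  # one missed tumor, no false alarms
pred_b = [1, 1, 1, 1, 1, 0]  # no missed tumor, three false alarms

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, total_cost(y_true, pred, COST), class_accuracy(y_true, pred, 0))
```

    In this toy case predictor B has the lower total cost, but its accuracy on the low-cost class falls from 1.0 to 0.25, which is the kind of trade-off the abstract describes.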

    With the continuous improvement and innovation of bioscience and medical technologies, scientists have developed gene microarray technologies to find the relations between diseases and genes. The number of instances in a microarray data set is far smaller than the number of genes in an instance, and many genes are irrelevant to a specific disease. Gene selection is therefore essential to reduce the dimensionality of a microarray data set. The misclassification costs of different classes are generally different. Previous studies performed cost-sensitive gene selection in a way that greatly reduced the classification accuracy on a microarray data set. To compensate for this deficiency, this study considers both misclassification cost and prediction accuracy and proposes 18 two-stage individual gene selection methods for microarray data. The experimental results on eight microarray data sets show that the method adopting probability ranking for misclassification cost in the first stage and t-value ranking for prediction accuracy in the second stage has the best performance, as evaluated by area under the cost curve and area under the accuracy curve.
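    A two-stage individual gene selection of the kind described above might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the probability ranking method, so the first stage here uses a stand-in cost score (the lowest total cost of any single-threshold rule on one gene), and the second stage uses an absolute Welch t-statistic. The function names, the cost values, and the toy data are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_score(values, labels):
    """Absolute Welch t-statistic between class 1 and class 0 for one gene."""
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 0]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0

def min_threshold_cost(values, labels, cost_fn, cost_fp):
    """Stand-in cost score (an assumption, not the thesis's method):
    lowest total misclassification cost of any one-threshold rule on this gene."""
    best = float("inf")
    for thr in sorted(set(values)):
        for high_is_pos in (True, False):
            cost = 0.0
            for v, y in zip(values, labels):
                pred = int((v >= thr) == high_is_pos)
                if y == 1 and pred == 0:
                    cost += cost_fn   # missed positive
                elif y == 0 and pred == 1:
                    cost += cost_fp   # false alarm
            best = min(best, cost)
    return best

def two_stage_select(X, labels, k1, k2, cost_fn=5.0, cost_fp=1.0):
    """X[g] is the expression vector of gene g across samples.
    Stage 1 keeps the k1 genes with the lowest cost score;
    stage 2 keeps the k2 of those with the highest t-score."""
    stage1 = sorted(range(len(X)),
                    key=lambda g: min_threshold_cost(X[g], labels, cost_fn, cost_fp))[:k1]
    return sorted(stage1, key=lambda g: t_score(X[g], labels), reverse=True)[:k2]

# Toy data: 4 genes, 6 samples (three of class 1, three of class 0).
labels = [1, 1, 1, 0, 0, 0]
X = [
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # separates the classes
    [5.0, 5.5, 6.5, 1.5, 2.0, 2.5],  # separates the classes, tighter spread
    [1.0, 5.0, 3.0, 2.0, 6.0, 4.0],  # mostly noise
    [4.0, 4.0, 4.0, 4.0, 4.0, 4.0],  # constant, uninformative
]
print(two_stage_select(X, labels, k1=3, k2=2))
```

    Swapping the two key functions gives the opposite ordering, with the accuracy-oriented filter first and the cost-oriented filter second; the number of genes kept in the first stage (k1 here) is a tuning choice that the thesis treats separately.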

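    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
    1.1 Research motivation
    1.2 Research objectives
    1.3 Research structure
    Chapter 2 Literature Review
    2.1 Cost prediction perspective
    2.2 Cost analysis
    2.2.1 Cost matrix
    2.2.2 Cost-sensitive classification methods
    2.3 Gene selection methods
    2.3.1 Non-cost-oriented gene selection methods
    2.3.2 Cost-oriented gene selection methods
    2.4 Evaluation measures
    2.4.1 Numerical measures
    2.4.2 Graphical measures
    2.5 Cross-validation
    Chapter 3 Research Methods
    3.1 Research process
    3.2 Cost-oriented gene selection methods
    3.2.1 Minimum cost ranking
    3.2.2 Probability ranking
    3.2.3 Cost nonparametric scoring algorithm
    3.3 Accuracy-oriented gene selection methods
    3.3.1 t-value method
    3.3.2 BW ratio
    3.3.3 Gene score ranking
    3.4 Method for choosing the number of genes selected in the first stage
    3.5 Binary logistic regression analysis
    Chapter 4 Empirical Study
    4.1 Data collection and preprocessing
    4.2 Parameter settings
    4.3 Empirical results
    4.3.1 Number of genes selected in the first stage
    4.3.2 Evaluation methods and experimental data
    4.3.4 Summary
    Chapter 5 Conclusions and Suggestions
    5.1 Conclusions
    5.2 Suggestions
    References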


    On-campus access: open from 2011-06-22
    Off-campus access: open from 2011-06-22