
Graduate student: 賴永耀 (Lai, Yong-Yao)
Thesis title: Utilization of virtual sample generation to facilitate cancer identification for gene expression data in early stages (利用虛擬基因表現資料提升研究初期癌症辨識率)
Advisor: 利德江 (Li, De-Jiang)
Degree: Master
Department: College of Management, Institute of Information Management
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 48
Chinese keywords: virtual sample (虛擬樣本), gene expression data (基因表現資料), gene microarray (基因微陣列)
English keywords: Virtual sample generation, cDNA MicroArray, Gene selection, Virtual Sample, DNA microarray, Gene Expression Data
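  • In recent years, the rapid development of biochip (BioChip) technology has allowed biologists to analyze tens of thousands of genes at the same time, which offers an excellent approach for clinical research into how cancer arises and progresses. Among these technologies, the cDNA microarray (cDNA MicroArray), which can measure gene expression quickly, has attracted the most attention. Although microarrays carry a wealth of information for study, their high dimensionality and high noise make data processing harder for researchers. In addition, samples are hard to obtain and the number of variables (genes) far exceeds the number of observations (tissue samples), which leads to the problem of learning error under small samples.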

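    The problems arising from these characteristics of microarray data have been widely discussed, and researchers have proposed a variety of solutions, most of which are refinements of gene selection (Gene Selection). Removing the large amount of noise (unimportant genes) from a microarray can reduce dimensionality and raise prediction accuracy at the same time, but only if there are enough samples. When a new disease emerges and there are not yet enough patients from whom tissue samples can be collected, or when cost or other considerations prevent collecting enough samples, even the best gene selection method is unlikely to perform well.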

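    This study therefore brings a new element to microarray research by introducing the concept of virtual samples (Virtual Samples) to handle the cancer classification problem. In the first stage, a gene ranking method identifies the discriminative genes. In the second stage, clusterized kernel density estimation (Clusterized KDE) generates virtual samples that add meaningful information to help the classifier predict correctly. The added virtual samples are expected to enlarge the training sample space and thereby raise the cancer identification rate.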
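    The first stage, ranking genes by discriminative power, can be sketched as follows. This record does not state which ranking criterion the thesis uses; the table of contents lists a t-statistic test among the reviewed gene selection methods, so a Welch t-statistic is assumed here. The function names `welch_t` and `rank_genes_by_t` and the toy data are hypothetical, chosen only for illustration.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch t-statistic for two lists of expression values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / max(se, 1e-12)

def rank_genes_by_t(X, y, top_k):
    """Return the indices of the top_k genes with the largest |t|.

    X: list of samples, each a list of gene expression values.
    y: list of binary class labels (0 or 1), one per sample.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(a, b)))
    # Sort gene indices by score, strongest first, and keep top_k of them.
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:top_k]

# Toy data (an assumption, not the thesis data): 20 samples x 200 noise genes,
# with genes 5 and 17 shifted in class 1 so that they become informative.
random.seed(0)
y = [0] * 10 + [1] * 10
X = [[random.gauss(0.0, 1.0) for _ in range(200)] for _ in y]
for row, label in zip(X, y):
    if label == 1:
        row[5] += 8.0
        row[17] -= 8.0

selected = rank_genes_by_t(X, y, top_k=2)
print(sorted(selected))
```

    With far more genes than samples, as in microarray data, some noise genes will score highly by chance, which is one reason ranking alone may not be enough when samples are scarce.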

    DNA microarrays now play an important role in cancer classification. Microarray technology allows us to measure the expression levels of thousands of genes simultaneously in clinical experiments. Clinicians are able to obtain the gene expression profiles of tissue samples rapidly and make decisions correctly.

    DNA microarray data are characterized by a small number of samples and high dimensionality (the so-called small-sample problem), together with many noisy or highly correlated genes. Researchers have recently applied gene selection mechanisms to find the genes most relevant to a specific classification task. Gene selection can improve learning accuracy and reduce computation cost, but it cannot overcome the inherent lack of training samples. In the early stages, for example during the outbreak of a new disease, only limited data can be obtained, so the derived model is too unstable to deal with the new disease effectively, and performance does not improve significantly even when a gene selection mechanism is used.

    In this paper, we propose a virtual sample technique named CKDE (Clusterized Kernel Density Estimation). We not only apply a gene selection mechanism but also analyze the characteristics of the reduced data. Virtual samples are then generated to add meaningful information, and the proposed model improves learning accuracy significantly.
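    The idea of generating virtual samples from a clustered kernel density estimate can be illustrated with a minimal sketch. It is not the thesis's CKDE procedure, whose details are not given in this record. It assumes Gaussian kernels, Silverman's rule-of-thumb bandwidth computed per gene within each cluster, and cluster labels supplied by a separate clustering step that is not shown. The names `silverman_bandwidth` and `generate_virtual_samples` and the toy values are hypothetical.

```python
import random
import statistics

def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth for a 1-D Gaussian kernel: 1.06 * sd * n^(-1/5)."""
    n = len(values)
    if n < 2:
        return 0.0  # a singleton cluster has no spread to estimate
    return 1.06 * statistics.stdev(values) * n ** (-1 / 5)

def generate_virtual_samples(samples, cluster_ids, n_new, rng):
    """Draw n_new virtual samples from per-cluster Gaussian KDEs.

    samples: real samples of one class, each a list of selected-gene values.
    cluster_ids: assumed cluster label for each sample (clustering not shown).
    rng: a random.Random instance, passed in for reproducibility.
    """
    n_genes = len(samples[0])
    # Group the real samples by cluster and estimate one bandwidth per gene.
    clusters = {}
    for row, cid in zip(samples, cluster_ids):
        clusters.setdefault(cid, []).append(row)
    bandwidths = {
        cid: [silverman_bandwidth([r[g] for r in rows]) for g in range(n_genes)]
        for cid, rows in clusters.items()
    }
    virtual = []
    for _ in range(n_new):
        # Picking a real sample uniformly selects a cluster in proportion to
        # its size and a kernel centre inside that cluster.
        i = rng.randrange(len(samples))
        h = bandwidths[cluster_ids[i]]
        # Sampling from a Gaussian KDE: kernel centre plus N(0, h^2) noise.
        virtual.append([rng.gauss(x, hg) for x, hg in zip(samples[i], h)])
    return virtual

# Toy class with two visible groups of samples (illustrative values only).
rng = random.Random(0)
real = [[0.1, 1.0], [0.3, 1.2], [0.2, 0.9], [5.0, 6.1], [5.2, 5.8], [4.9, 6.0]]
cluster_ids = [0, 0, 0, 1, 1, 1]
virtual = generate_virtual_samples(real, cluster_ids, n_new=10, rng=rng)
training_set = real + virtual  # real and virtual samples are trained on together
```

    Estimating the bandwidth within each cluster keeps the virtual samples close to the group they were drawn from; a single bandwidth over the whole class would be inflated by the gap between groups and could place samples in the empty region between them.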

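    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Objectives
    Chapter 2 Literature Review
      2.1 Gene Selection Methods
        2.1.1 t-Statistic Test
        2.1.2 Hybrid Gene Selection
        2.1.3 Feature-Space Gene Selection
      2.2 Virtual Samples
        2.2.1 Intervalized Kernel Density Estimation
        2.2.2 Mega-Trend Diffusion
      2.3 Classification Methods
        2.3.1 K-Nearest Neighbors
        2.3.2 Probabilistic Neural Network
        2.3.3 Support Vector Machine
    Chapter 3 Research Method
      3.1 System Architecture
      3.2 Gene Selection
      3.3 Virtual Sample Generation
        3.3.1 Clusterized Kernel Density Estimation
        3.3.2 Generating Variates
        3.3.3 Generation Procedure
      3.4 Classification Methods
    Chapter 4 Empirical Study
      4.1 Datasets
      4.2 Parameter Settings
        4.2.1 Data Clustering
        4.2.2 Sampling Rules
      4.3 Experimental Results
        4.3.1 Colon Cancer Dataset
        4.3.2 Lymphoma Dataset
        4.3.3 Glioma Dataset
    Chapter 5 Conclusions and Suggestions
      5.1 Conclusions
      5.2 Future Research Directions
    References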


    Full-text availability: on campus, available immediately; off campus, available from 2007-07-11