| Graduate student: | 林修弘 Lin, Xiu-Hong |
|---|---|
| Thesis title: | 用多項式簡易貝氏分類器分類基因序列資料時以遺傳密碼進行特徵萃取之研究 Applying genetic code for feature extraction in classifying gene sequence data by multinomial naive Bayesian classifiers |
| Advisor: | 翁慈宗 Wong, Tzu-Tsung |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2016 |
| Academic year of graduation: | 104 |
| Language: | Chinese |
| Number of pages: | 42 |
| Keywords (Chinese): | genetic code, metagenomics, naïve Bayesian classifier, feature extraction, gene sequence |
| Keywords (English): | feature extraction, gene sequence, genetic code, multinomial naïve Bayesian classifier |
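Traditionally, researchers studying environmental microorganisms first collected samples from the environment and then cultured them in the laboratory. In recent years, however, scientists have found that laboratory conditions can culture only about one percent of the microorganisms found in natural environments, which limits the scope of research. Metagenomics, which sequences genes directly from environmental samples, is therefore better suited to studying microbial populations. For metagenomic sequence data, the naïve Bayesian classifier is widely adopted because of its good classification performance and linear computational cost. Although it has achieved good results, the large number of class values, the high attribute dimensionality, and the sparse distribution of metagenomic sequence data limit further improvement in its classification performance. Many researchers have studied this problem in depth and proposed solutions such as attribute selection, hierarchical processing, and prior distribution optimization. To address the same problem, this study introduces the genetic code from biology to process sequence data, improves the feature extraction step, and proposes a way of using combined features, with the aim of further improving the accuracy of the naïve Bayesian classifier on metagenomic sequence data. The experimental results show that the proposed method not only slightly improves accuracy but also significantly speeds up computation.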
Microorganism samples are often collected from the environment and cultivated in laboratories, but most of them cannot be cultivated well. Collecting gene sequences directly from their cells is therefore a better way to study environmental microbial populations. Multinomial naïve Bayesian classifiers are often used for analyzing gene sequence data because of their computational efficiency and ease of implementation. In a gene sequence set, however, the dimensionality is high and the number of class values is large. Many studies have proposed approaches to improve the accuracy of the multinomial naïve Bayesian classifier, such as feature selection and prior setting methods. This study introduces the concept of the genetic code to transform gene sequences and proposes several ways to aggregate the features for classification. The experimental results on a gene sequence set show that the proposed approach can significantly accelerate the computation of the multinomial naïve Bayesian classifier while also improving its accuracy.
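The abstract does not spell out the thesis's feature extraction procedure, so the following is only a minimal sketch of the general idea under stated assumptions: DNA is translated with the standard genetic code, overlapping amino acid words of length `k` serve as features, and a multinomial naïve Bayesian classifier with Laplace smoothing assigns the class. The reading frame, the word length, the smoothing constant `alpha`, and all function and class names here are illustrative choices, not the author's.

```python
from collections import Counter
from itertools import product
import math

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate DNA into amino acid letters ('*' = stop, 'X' = unknown codon)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def word_counts(seq, k):
    """Count overlapping words of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class MultinomialNB:
    """Multinomial naive Bayes over word counts (illustrative, not the thesis code)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, samples, labels):
        self.vocab = set()
        self.class_counts = Counter(labels)
        self.counts = {c: Counter() for c in self.class_counts}
        for sample, label in zip(samples, labels):
            self.counts[label].update(sample)
            self.vocab.update(sample)
        return self

    def predict(self, sample):
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for c, counts in self.counts.items():
            total = sum(counts.values())
            score = math.log(self.class_counts[c] / n)
            for word, freq in sample.items():
                if word in self.vocab:  # ignore words never seen in training
                    p = (counts[word] + self.alpha) / (total + self.alpha * v)
                    score += freq * math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best


# Toy data, invented purely for illustration.
train = ["ATGGCCGCCGCC", "ATGGCTGCAGCG", "ATGAAAAAAAAA", "ATGAAGAAAAAG"]
labels = ["A", "A", "B", "B"]
features = [word_counts(translate(s), 2) for s in train]
model = MultinomialNB().fit(features, labels)
prediction = model.predict(word_counts(translate("GCCGCTGCA"), 2))
```

One plausible reason such a transformation could reduce computation is the smaller feature space: a nucleotide word of length 6 has 4^6 = 4096 possible values, whereas the corresponding two-codon amino acid word has at most 21^2 = 441 (20 amino acids plus the stop symbol). Whether this is the mechanism behind the speedup reported in the abstract is not stated in this record.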
On campus: publicly available from 2021-07-01