| Graduate student: | 朱柏勳 Chu, Po-Hsun |
|---|---|
| Thesis title: | 基於機器學習方法之微型核糖核酸目標基因預測 Machine Learning Based MicroRNA Target Prediction |
| Advisor: | 張天豪 Chang, Tien-Hao |
| Degree: | Master |
| Department: | 電機資訊學院 - 電機工程學系 Department of Electrical Engineering |
| Year of publication: | 2017 |
| Academic year of graduation: | 105 |
| Language: | Chinese |
| Number of pages: | 36 |
| Keywords (Chinese): | 機器學習 (machine learning), 微型核糖核酸 (microRNA), 信息核糖核酸 (mRNA) |
| Keywords (English): | machine learning, microRNA, mRNA |
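Identifying the target genes bound by microRNAs is the foundation for studying gene repression. Many predictors already exist, and machine learning-based predictors have substantially improved prediction performance. For machine learning-based predictors, however, the negative dataset remains a difficult issue: because no system is dedicated to identifying non-target genes, most current machine learning-based predictors are trained on negative datasets that they generate themselves, and different generation methods affect the machine learning algorithms differently. Another key aspect of machine learning is the choice of features. In this thesis we use features known to be useful from empirical rules, such as the complementary seed region (seed matching region) and the thermodynamic stability of the duplex, and we add several de novo features, such as the dinucleotide (bigram) sequence-pattern feature and the trinucleotide (trigram) feature proposed in this study. Because the models built by machine learning algorithms are usually complex, humans usually cannot directly interpret what a model has learned; we therefore use a rule extraction algorithm to extract from the predictor rules based on the empirical features and the de novo features.
In this study we compared our predictor with several existing predictors and obtained a very high ROC AUC score. We analyzed the effect of negative datasets produced by different methods and, for different situations, propose a method for preparing the negative dataset. In the machine learning architecture, to make the determination of binding more stringent, we combined several machine learning algorithms with different properties and averaged the results of all algorithms with the harmonic mean, to obtain a more robust prediction.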
Identifying microRNA (miRNA) binding targets is important for studying gene regulation. Many target prediction tools exist, and predictors based on machine learning algorithms have improved performance considerably. One issue for machine learning-based predictors is the negative dataset. Because there is no systematic method for collecting negative data (non-binding miRNA-mRNA pairs), each work produces its own negative dataset, and different generation methods affect the machine learning algorithms differently. Another important aspect of machine learning is feature engineering. This work uses empirical features from previous works, such as seed matching type, thermodynamic stability of the duplex, accessibility, site location, and multiplicity of binding sites, together with the de novo features (unigram, bigram, trigram) that this work proposes. The last issue is that machine learning models are too complicated for humans to interpret what knowledge the machine has learned. Thus, we applied a rule induction algorithm to extract rules based on the de novo and empirical features from our model.
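As an illustration of the de novo sequence features named above, the following is a minimal sketch of unigram, bigram, and trigram extraction from an RNA sequence. The abstract does not say how these features are encoded, so the normalized-frequency encoding, the function names `kmer_features` and `ngram_vector`, and the example sequence are assumptions made here for illustration only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # RNA nucleotides; T is mapped to U below


def kmer_features(seq, k):
    """Normalized frequencies of all 4**k k-mers, in lexicographic order.

    Assumption: frequencies are normalized by the number of valid windows.
    Windows containing symbols outside ACGU are ignored.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def ngram_vector(seq):
    """Concatenate unigram (4), bigram (16), and trigram (64) features."""
    return kmer_features(seq, 1) + kmer_features(seq, 2) + kmer_features(seq, 3)


# Hypothetical example sequence, not taken from the thesis.
vector = ngram_vector("AACGUUGCA")
print(len(vector))  # 4 + 16 + 64 feature values
```

A fixed-length vector of this kind can be concatenated with the empirical features before training; whether the thesis computes the n-grams on the miRNA, the target site, or both is not stated in the abstract.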
In this work, we propose the harmonic model, which achieves a higher ROC AUC than other tools. To make the determination of miRNA-mRNA binding more stringent, the harmonic model aggregates three algorithms with the harmonic mean. Based on many experiments, we provide a guideline on how to prepare the negative dataset in different situations.
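The stringency argument can be made concrete with a short sketch. The harmonic mean is dominated by its smallest input, so a candidate pair scores high only when every classifier agrees. The code below assumes that each of the three algorithms outputs a binding probability in [0, 1]; the function name `harmonic_score` and the example scores are hypothetical, and the abstract does not specify which three algorithms are combined.

```python
def harmonic_score(scores):
    """Harmonic mean of per-classifier binding probabilities.

    Assumption: each score lies in [0, 1]. A zero from any classifier
    yields an aggregate of zero, which is the strictest possible outcome.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    if any(s == 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical outputs of three classifiers for one miRNA-mRNA pair.
agree = [0.9, 0.9, 0.9]
disagree = [0.9, 0.9, 0.1]

print(harmonic_score(agree))             # close to 0.9
print(harmonic_score(disagree))          # pulled toward the lowest score
print(sum(disagree) / len(disagree))     # arithmetic mean, for comparison
```

With the disagreeing scores, the harmonic mean is 27/110 (about 0.245), while the arithmetic mean is about 0.633, so a single dissenting classifier is enough to push the pair below a typical decision threshold.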