Graduate student: | 林巧玲 Lin, Chiao-Ling |
---|---|
Thesis title: | 利用EM演算法優化線性區別分析分類器下的主動學習演算法 Combine Expectation-Maximization Algorithm with Active Learning for Linear Discriminant Analysis Classifier |
Advisor: | 陳瑞彬 Chen, Ray-Bing |
Degree: | Master |
Department: | College of Management - Department of Statistics |
Year of publication: | 2019 |
Graduation academic year: | 107 |
Language: | English |
Number of pages: | 32 |
Chinese keywords: | 主動學習、線性區別分析、最大期望演算法 |
English keywords: | Active Learning, LDA classifier, EM algorithm |
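Active learning addresses the setting in which unlabeled samples are plentiful and cheap to obtain, while labeled samples are scarce and manual labeling is expensive. To avoid wasted cost, we use selection criteria to choose informative unlabeled samples for manual labeling. Our study shows that the prediction accuracy of the linear discriminant analysis (LDA) classifier can be improved by adding a large number of unlabeled samples. The criteria adopted in this thesis are mainly the empirical AUC and the influence function of the empirical AUC, which bring more information to the classifier and thereby improve it. The procedure combines the expectation-maximization (EM) algorithm with the LDA classifier to evaluate unlabeled samples and determine whether they help improve the classifier. Within the EM algorithm, too many unlabeled samples can harm classification accuracy, so a weight factor is applied to the unlabeled samples to improve accuracy. In addition, classification accuracy levels off once the training set grows to a certain size, so a stopping condition is set to avoid further waste.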
Our study shows that the accuracy of the linear discriminant analysis (LDA) classifier can be improved by augmenting the labeled training data with a pool of unlabeled data. We introduce several criteria, for example the empirical AUC and the influence function for the empirical AUC, to select unlabeled points for labeling. The goal is to sequentially identify the unlabeled points that improve classification accuracy. In addition, the EM algorithm is used to incorporate the unlabeled points into classifier learning. For a very large unlabeled data set, an augmented EM algorithm is used, in which a weight factor adjusts the amount of information taken from the unlabeled points. Furthermore, we propose a possible stopping criterion for the proposed active learning algorithm.
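The abstract does not give the details of the augmented EM algorithm, so the following is only a minimal sketch of one way a weight factor could down-weight unlabeled points when fitting a two-class LDA model by EM, similar in spirit to the weighted EM described by Nigam et al. (2000). The function names (`m_step`, `posterior`, `weighted_em_lda`), the parameter `weight`, and the convergence rule based on the shift in class means are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np


def m_step(X, R):
    """Weighted maximum-likelihood estimates for LDA (shared covariance).

    R[i, k] is the (possibly down-weighted) responsibility of class k for row i.
    """
    Nk = R.sum(axis=0)                       # effective sample size per class
    priors = Nk / Nk.sum()
    means = (R.T @ X) / Nk[:, None]
    d = X.shape[1]
    cov = np.zeros((d, d))
    for k in range(R.shape[1]):
        diff = X - means[k]
        cov += (R[:, k, None] * diff).T @ diff
    cov = cov / Nk.sum() + 1e-8 * np.eye(d)  # small ridge for numerical stability
    return priors, means, cov


def posterior(X, priors, means, cov):
    """Class posterior probabilities under Gaussians with a common covariance."""
    prec = np.linalg.inv(cov)
    log_p = np.empty((X.shape[0], len(priors)))
    for k in range(len(priors)):
        diff = X - means[k]
        maha = np.einsum("ij,jk,ik->i", diff, prec, diff)
        log_p[:, k] = np.log(priors[k]) - 0.5 * maha
    log_p -= log_p.max(axis=1, keepdims=True)  # avoid overflow in exp
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)


def weighted_em_lda(X_lab, y_lab, X_unl, weight=0.1, n_iter=100, tol=1e-6):
    """EM for two-class LDA using labeled data plus down-weighted unlabeled data.

    `weight` (assumed name) scales each unlabeled point's contribution:
    0 ignores the unlabeled pool, 1 counts it as much as labeled data.
    Labels in y_lab must be 0 or 1, and both classes must be present.
    """
    X_lab = np.asarray(X_lab, dtype=float)
    X_unl = np.asarray(X_unl, dtype=float)
    R_lab = np.eye(2)[np.asarray(y_lab, dtype=int)]  # hard 0/1 responsibilities
    X_all = np.vstack([X_lab, X_unl])

    params = m_step(X_lab, R_lab)                    # initialise from labeled data only
    for _ in range(n_iter):
        R_unl = posterior(X_unl, *params)            # E-step on the unlabeled pool
        new = m_step(X_all, np.vstack([R_lab, weight * R_unl]))  # weighted M-step
        shift = np.max(np.abs(new[1] - params[1]))   # change in class means
        params = new
        if shift < tol:
            break
    return params
```

Calling `weighted_em_lda(X_lab, y_lab, X_unl, weight=0.1)` returns the estimated priors, class means, and shared covariance; `posterior(X, priors, means, cov).argmax(axis=1)` then gives predicted labels. Setting `weight=0` reduces the fit to ordinary LDA on the labeled points, which makes it easy to compare against the semi-supervised fit.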
Chang, Y.-c. I. and Chen, R.-B. (2019). Active learning with simultaneous subject and variable selections. Neurocomputing, 329:495–505.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
Deng, X., Joseph, V. R., Sudjianto, A., and Wu, C. J. (2009). Active learning through sequential design, with applications to detection of money laundering. Journal of the American Statistical Association, 104(487):969–981.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393.
Ke, B.-S., Chiang, A. J., and Chang, Y.-c. I. (2018). Influence analysis for the area under the receiver operating characteristic curve. Journal of Biopharmaceutical Statistics, 28(4):722–734.
Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134.
Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. Medicine.
Smith, J. W., Everhart, J., Dickson, W., Knowler, W., and Johannes, R. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 261. American Medical Informatics Association.