
Graduate Student: Wan, Yen-Ning (萬晏寧)
Thesis Title: Unsupervised Concept Drift Detection using Dynamic Crucial Feature Distribution Test in Data Streams
Advisor: Huang, Jen-Wei (黃仁暐)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Publication Year: 2021
Graduating Academic Year: 109
Language: English
Number of Pages: 42
Keywords: Data Stream, Concept Drift Detection, Unsupervised Method
  • With advances in science and technology, we now live in an era of massive and diverse streaming information. Data are generated continuously over time, such as financial market transactions, air quality monitoring readings, purchase records, and news from around the world. How to process data streams and extract the hidden information they carry has therefore become an actively explored topic, and concept drift detection in streaming data is regarded as a direction worth in-depth study. As time passes, the information carried by a data stream changes unpredictably; the accuracy of previously built models degrades, rendering the old models impractical. This problem is called "concept drift". In this thesis, we propose a novel concept drift detection method that addresses the concept drift detection problem in a practical setting through an unsupervised crucial feature distribution test. We evaluate the method on both synthetic and real-world datasets, and the experimental results show that the proposed method is more effective than existing approaches.

    Data evolves rapidly over time, leading to unpredictable changes in the implicit information
    behind data streams. The accuracy of conventional models declines as time goes by, and old models are rendered impractical. This phenomenon is referred to as "Concept Drift".
    This study proposes a novel approach that solves the concept drift detection problem in a more realistic way by using an unsupervised method focused on a dynamic crucial feature distribution test. Experimental results demonstrate the efficacy of the proposed model on both synthetic and real-world datasets.
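    The thesis itself details the full DCFDT procedure. As an illustrative sketch only, the core idea of testing whether a crucial feature's distribution has changed between a reference window and the current window can be shown with a two-sample Kolmogorov-Smirnov test; the window size, the fixed critical coefficient, and the use of a single already-reduced feature here are assumptions for the sketch, not the thesis's exact method:

    ```python
    import math

    def ks_statistic(sample_a, sample_b):
        """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
        the two empirical CDFs (ties handled approximately, error <= 1/n)."""
        a, b = sorted(sample_a), sorted(sample_b)
        i = j = 0
        d = 0.0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                i += 1
            else:
                j += 1
            d = max(d, abs(i / len(a) - j / len(b)))
        return d

    def drift_detected(reference, current, coeff=1.358):
        """Flag drift when the KS statistic exceeds the asymptotic critical
        value; 1.358 is the standard coefficient for significance level 0.05."""
        n, m = len(reference), len(current)
        threshold = coeff * math.sqrt((n + m) / (n * m))
        return ks_statistic(reference, current) > threshold

    # Toy stream of one crucial feature: stable at first, then shifted upward.
    stream = [0.01 * (i % 100) for i in range(500)] + \
             [0.01 * (i % 100) + 5.0 for i in range(500)]

    window = 100
    reference = stream[:window]
    for start in range(window, len(stream), window):
        current = stream[start:start + window]
        if drift_detected(reference, current):
            print(f"drift detected in window starting at {start}")
            reference = current  # rebuild the reference after a drift
    ```

    Running the sketch reports a drift only at the window starting at index 500, where the feature's distribution actually shifts; stable windows stay below the significance threshold, which is the role the thesis's significance test plays in suppressing false alarms.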

    Table of Contents
    Chinese Abstract
    Abstract
    Acknowledgment
    Table of Contents
    List of Tables
    List of Figures
    1 Introduction
      1.1 Background
      1.2 Our Approach
    2 Related Work
      2.1 Definition of Concept Drift
        2.1.1 Types of Concept Drift
        2.1.2 Patterns of Concept Drift
      2.2 Previous Work
        2.2.1 Sequential Analysis based Methods
        2.2.2 Error Rate based Methods
        2.2.3 Windows based Methods
        2.2.4 Feature based Methods
    3 Methodology
      3.1 Problem Definition and System Architecture
      3.2 Dealing with Data Stream
      3.3 Dynamic Feature Reduction
        3.3.1 The Mechanism of Feature Reduction
        3.3.2 The Strategy for Automatically Choosing Dimensions
        3.3.3 Dynamic Feature Selection
      3.4 Crucial Feature Distribution Test
        3.4.1 Test of Change in Distribution
        3.4.2 Crucial Feature Distribution Test
      3.5 Significance Test
      3.6 Windows Update
      3.7 Summary of DCFDT
    4 Experiments
      4.1 Datasets
        4.1.1 Synthetic Datasets
        4.1.2 Real-World Datasets
      4.2 Evaluation Metrics and Experimental Settings
      4.3 Experimental Results
        4.3.1 Results of DCFDT and Previous Works
        4.3.2 Performance on Different Dimensions Settings
        4.3.3 w/o Significance Test
        4.3.4 No Drift and Always Drift
    5 Conclusions and Future Works
    References


    Off-campus access: not available.
    Neither the electronic nor the print version of this thesis has been authorized for public release.