
Author: Chiu, Hsien-Shun (邱顯舜)
Thesis title: A Deep Generative Model with Representation Learning of Nested Embeddings for Anomaly Detection (具巢狀嵌入表徵學習之深度生成模型於異常偵測研究)
Advisor: Li, Sheng-Tun (李昇暾)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar, 2023-2024)
Language: English
Number of pages: 56
Keywords: Anomaly Detection, Generative Adversarial Network (GAN), Data Imbalance Problem, Nested Embeddings, Tabular Data
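  • In the field of deep learning, tabular data is among the most frequently discussed topics and is crucial in many domains. In recent years, deep learning has achieved great success on unstructured data such as text and images. In tabular data, numerical features tend to be densely distributed while categorical features may be sparsely distributed; in particular, when categorical features exhibit nested hierarchical relationships, the relationships among features become more complex. These characteristics of tabular data pose considerable challenges for deep learning.
      In addition, anomaly detection is an important topic in modern data analysis, particularly in areas such as financial fraud detection, network security, and industrial equipment failure prediction. Data imbalance in tabular data is closely tied to anomaly detection: the imbalance often introduces significant bias during model training, so that information about the anomalous classes, which matter most, is ignored.
    Generative adversarial network (GAN) techniques have shown great potential for data generation, and many studies have applied them to tabular data and to anomaly detection. However, most previous work focuses on numerical data, and only a few studies address categorical data; none of them targets tabular data whose categorical features have a nested structure. This study therefore combines a conditional generative adversarial network (Conditional GAN) with embeddings to propose a novel deep generative model, Nested-CWGAN-GP, for tabular data with both nested categorical features and numerical features. The model applies a novel nested embedding method to capture the hierarchical relationships among nested categorical features, providing better representations for training, and oversamples the dataset to address the data imbalance problem in anomaly detection.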

    In deep learning, tabular data is one of the most frequently discussed topics and is crucial in many fields. Within tabular data, numerical features are often densely distributed, while categorical features may be sparse. Complexity increases further when categorical features have nested hierarchical relationships, which make the inter-feature relationships more intricate. These characteristics of tabular data pose significant challenges for deep learning. Furthermore, anomaly detection is a critical issue in modern data analytics, particularly in fields such as financial fraud detection, cybersecurity, and industrial equipment failure prediction. Data imbalance in tabular datasets is closely related to anomaly detection: the imbalance often biases model training, causing information about the critically important anomalous classes to be overlooked.
    Generative Adversarial Networks (GANs) have shown immense potential in data generation, and many studies have applied them to tabular data and anomaly detection. However, previous research has primarily focused on numerical data, with few studies addressing categorical data. Among these studies, none has specifically targeted tabular data with nested categorical features. Therefore, this study proposes a novel deep generative model, Nested-CWGAN-GP, which integrates Conditional Generative Adversarial Networks (CGAN) with embeddings to handle tabular data with both nested categorical and numerical features. The model employs a novel nested embedding method to capture the hierarchical relationships among nested categorical features, providing better representations for training, and generates synthetic data for oversampling to address the data imbalance issue in anomaly detection.
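
    The abstract does not spell out how the nested embeddings are constructed. As a rough illustration only, and not the thesis's actual method, one simple way to expose a two-level hierarchy to a model is to concatenate a child category's vector with the vector of its parent, so that sibling categories share part of their representation. The category names, the table sizes, and the helper `nested_embed` below are all hypothetical, and the lookup tables are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-level hierarchy: every child category has exactly one parent.
parents = ["electronics", "grocery"]
children = ["phone", "laptop", "fruit", "dairy"]
parent_of_child = np.array([0, 0, 1, 1])  # index into `parents` for each child

PARENT_DIM, CHILD_DIM = 2, 3  # assumed embedding sizes, chosen arbitrarily

# Randomly initialised lookup tables; in a real model these would be trainable weights.
parent_table = rng.normal(size=(len(parents), PARENT_DIM))
child_table = rng.normal(size=(len(children), CHILD_DIM))


def nested_embed(child_ids):
    """Return [parent vector | child vector] for each child category id."""
    child_ids = np.asarray(child_ids)
    parent_ids = parent_of_child[child_ids]  # resolve each child's parent
    return np.concatenate(
        [parent_table[parent_ids], child_table[child_ids]], axis=1
    )


batch = nested_embed([0, 1, 2])  # phone, laptop, fruit
print(batch.shape)
```

    In this sketch, "phone" and "laptop" share their first `PARENT_DIM` components because they have the same parent, while "fruit" does not. A trained model could combine the two levels differently, for example by summation or a learned projection.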

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    1. Introduction
        1.1 Research Background
        1.2 Research Objectives
        1.3 Research Architecture
    2. Related Works
        2.1 Tabular Data
            2.1.1 Overview of standard Tabular Data
            2.1.2 Tabular Data with Nested Relationships
        2.2 Data Imbalance Problems in Anomaly Detection
        2.3 Oversampling Techniques for Addressing Data Imbalance
        2.4 Entity Embeddings
        2.5 Generative Adversarial Networks (GAN) and Conditional GAN
        2.6 GANs for Tabular Data
        2.7 Summary
    3. Methodology
        3.1 Problem Definition
        3.2 System Architecture
        3.3 Modeling Categorical and Numerical Features
        3.4 Nested Embeddings
        3.5 Generative Model
            3.5.1 Generative Model Objective
            3.5.2 Architecture of Nested-CWGAN-GP
    4. Experiments
        4.1 Dataset and Experimental Process
        4.2 Experimental Setup
            4.2.1 Hyperparameter Settings
            4.2.2 Competing Methods and Classifier Implementation
        4.3 Evaluation Metrics
            4.3.1 Confusion Matrix
            4.3.2 Evaluation Metrics for Classification Performance
            4.3.3 Overall Performance Evaluation Metrics
        4.4 Experimental Results
        4.5 Summary
    5. Conclusions
        5.1 Conclusions and Contributions
        5.2 Limitation and Future Works
    References

