| Graduate student: | 邱顯舜 Chiu, Hsien-Shun |
|---|---|
| Thesis title: | 具巢狀嵌入表徵學習之深度生成模型於異常偵測研究 A Deep Generative Model with Representation Learning of Nested Embeddings for Anomaly Detection |
| Advisor: | 李昇暾 Li, Sheng-Tun |
| Degree: | Master |
| Department: | College of Management, Institute of Information Management |
| Year of publication: | 2024 |
| Academic year of graduation: | 112 |
| Language: | English |
| Number of pages: | 56 |
| Chinese keywords: | 異常偵測, 生成對抗網路, 資料不平衡, 巢狀嵌入, 表格式資料 |
| English keywords: | Anomaly Detection, Generative Adversarial Network (GAN), Data Imbalance Problem, Nested Embeddings, Tabular Data |
In deep learning, tabular data is one of the most frequently discussed topics and is crucial in many fields. In recent years, deep learning has achieved great success on unstructured data such as text and images. In tabular data, however, numerical features are often densely distributed while categorical features may be sparse, and when categorical features form nested hierarchical relationships, the relationships among features become even more complex. These characteristics pose significant challenges for deep learning.
Furthermore, anomaly detection is an important problem in modern data analytics, particularly in financial fraud detection, cybersecurity, and industrial equipment failure prediction. Data imbalance in tabular datasets is closely related to anomaly detection: the imbalance often introduces significant bias during model training, so that information about the critically important anomalous classes is overlooked.
Generative Adversarial Networks (GANs) have shown great potential in data generation, and many studies have applied them to tabular data and anomaly detection. However, previous research has focused primarily on numerical data, with few studies addressing categorical data, and none has specifically targeted tabular data with nested categorical features. This study therefore proposes a novel deep generative model, Nested-CWGAN-GP, which integrates Conditional Generative Adversarial Networks (CGAN) with embeddings to handle tabular data containing both nested categorical and numerical features. The model applies a novel nested-embedding method to capture the hierarchical relationships among nested categorical features, providing better representations for training, and it generates synthetic data to oversample the dataset and thereby address the data imbalance problem in anomaly detection.
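The idea of embedding nested categorical features can be illustrated with a minimal sketch. This is not the thesis's Nested-CWGAN-GP implementation, whose details the abstract does not give; it only shows one plausible way to encode a parent-child hierarchy, by concatenating a parent-level vector with a child-level vector. The hierarchy `PARENT_OF`, the helper names, and the embedding sizes are all hypothetical, and the vectors are random placeholders for what would be trainable parameters in a real model.

```python
import random

# Hypothetical two-level hierarchy: each child category has exactly one parent.
PARENT_OF = {
    "Taipei": "Taiwan",
    "Tainan": "Taiwan",
    "Osaka": "Japan",
    "Tokyo": "Japan",
}


def init_table(keys, dim, rng):
    """Create a randomly initialised embedding table (one vector per key)."""
    # Keys are sorted so the result is reproducible for a given seed.
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in sorted(keys)}


def nested_embedding(child, parent_table, child_table):
    """Concatenate the parent's vector with the child's own vector.

    Children that share a parent share the first part of their
    representation, which exposes the hierarchy to a downstream model.
    """
    parent = PARENT_OF[child]
    return parent_table[parent] + child_table[child]


rng = random.Random(0)
parent_table = init_table(set(PARENT_OF.values()), dim=2, rng=rng)
child_table = init_table(PARENT_OF.keys(), dim=3, rng=rng)

taipei = nested_embedding("Taipei", parent_table, child_table)
tainan = nested_embedding("Tainan", parent_table, child_table)
tokyo = nested_embedding("Tokyo", parent_table, child_table)
```

In this sketch, `taipei` and `tainan` share their first two components because both belong to the parent `Taiwan`, while `tokyo` starts with the vector for `Japan`. In a generative model such vectors would typically be learned jointly with the generator and critic rather than fixed at random.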