
Author: Chiang, Chia-Wen (蔣佳紋)
Title: Effective Email Security Level Classification of Imbalanced Data Using Artificial Neural Network
(Chinese title: 於不平衡資料環境中應用類神經網路之有效電子郵件機密程度分類模型)
Advisor: Huang, Jen-Wei (黃仁暐)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2017
Graduation Academic Year: 106 (2017-2018)
Language: English
Number of Pages: 51
Keywords: E-mail, Classifier, Text Mining, Artificial Neural Network
Chinese Abstract (translated):
Sending messages by email is far more convenient than traditional correspondence, but that convenience also makes it easy for confidential information to leak, especially within corporate email. Text analysis based on data mining can effectively classify emails by security level. In this research, we use an artificial neural network to extract the information in each email and represent it as a multidimensional document vector, which serves as the email's feature. Unlike earlier work trained on single characters, we segment and preprocess the text with character bi-grams and train the document vectors on them. We also down-sample the imbalanced data and then use a neural network to learn the document vectors corresponding to each security-level label. Working with an actual company, we tested our method on its real emails; the experimental results show that our feature-extraction method outperforms the alternatives and is better suited as a feature for email security-level classification.
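To make the bi-gram segmentation and document-vector steps above concrete, here is a minimal sketch assuming gensim's Doc2Vec, which implements the PV-DM/PV-DBOW paragraph-vector models the thesis compares. The helper `bigrams`, the toy corpus, and every parameter are illustrative assumptions, not the thesis's actual implementation.

```python
# Sketch: character bi-gram tokenization followed by paragraph-vector
# (Doc2Vec) training. gensim's Doc2Vec implements PV-DM/PV-DBOW; the toy
# corpus and all parameters below are illustrative assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def bigrams(text):
    # Drop whitespace, then emit overlapping character pairs, mirroring
    # bi-gram segmentation of raw (e.g. Chinese) email text.
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

emails = ["quarterly revenue draft, do not forward",   # toy corpus
          "lunch menu for Friday"]
tagged = [TaggedDocument(words=bigrams(text), tags=[i])
          for i, text in enumerate(emails)]

# dm=1 selects the PV-DM variant; dm=0 would give PV-DBOW instead.
model = Doc2Vec(tagged, vector_size=100, dm=1, window=5,
                min_count=1, epochs=40)
vector = model.infer_vector(bigrams("revenue draft"))  # 100-dim feature
```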

English Abstract:
Email is far more convenient than traditional mail for delivering messages; however, it is also susceptible to information leakage. This problem can be alleviated by classifying emails into different security levels using text-mining and machine-learning technology. In this research, we developed a scheme in which a neural network extracts information from each email and transforms it into a multidimensional document vector. The email text is segmented using bi-grams to train the document vectors, and the training data then undergoes under-sampling to deal with class imbalance. Finally, a deep neural network classifies each email's security label. The proposed system was evaluated in an actual corporate setting. The results show that the proposed feature-extraction approach achieves higher true-positive rates and F1-scores than existing email-representation methods.
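The under-sampling and classification steps can be sketched similarly. The snippet below uses scikit-learn's KMeans to keep one representative majority-class sample per cluster (a common cluster-based under-sampling variant, not necessarily the thesis's exact procedure), then trains a small Keras feed-forward network over the document vectors. The synthetic data, binary labels (the thesis has multiple security levels), layer widths, and training settings are all assumptions.

```python
# Hedged sketch: K-means under-sampling of the majority class, then a
# feed-forward network over document vectors. Cluster counts, layer
# widths, and the two-class setup are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from tensorflow import keras

def undersample(X_major, n_keep):
    # Cluster the majority class into n_keep groups and keep the sample
    # closest to each centroid.
    km = KMeans(n_clusters=n_keep, n_init=10).fit(X_major)
    keep = [int(np.argmin(np.linalg.norm(X_major - c, axis=1)))
            for c in km.cluster_centers_]
    return X_major[keep]

rng = np.random.default_rng(0)
X_major = rng.normal(size=(500, 100))           # abundant class (toy data)
X_minor = rng.normal(1.0, 1.0, size=(50, 100))  # rare class (toy data)
X = np.vstack([undersample(X_major, 50), X_minor])
y = np.array([0] * 50 + [1] * 50)   # binary here; the thesis uses more levels

model = keras.Sequential([
    keras.layers.Input(shape=(100,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=16, verbose=0)
```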

Table of Contents

Chinese Abstract
Abstract
Acknowledgment
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Related Works
  2.1 Document Representation
    2.1.1 Bag of Words
    2.1.2 LDA
    2.1.3 Paragraph Vector
  2.2 Artificial Neural Network
    2.2.1 Deep Back-propagation Neural Network
    2.2.2 Activation Function
3 Email Security Classification System
    3.0.1 Extracting Textual Content of Emails
    3.0.2 Preprocessing Data
  3.1 Feature Extraction
    3.1.1 Feature Extraction based on Paragraph Vector
  3.2 Classification
    3.2.1 Under-sampling using K-means Clustering
    3.2.2 Training Email Classifiers
4 Experiments
  4.1 Experiment Setup
    4.1.1 Data Collection
    4.1.2 Network Settings
    4.1.3 Evaluation Methods
  4.2 Classification Results Using Different Email Representations
    4.2.1 Bi-BoW versus Uni-BoW
    4.2.2 Bi-LDA versus Uni-LDA
    4.2.3 Bi-PVDM versus Uni-PVDM
    4.2.4 Bi-PVDBOW versus Uni-PVDBOW
    4.2.5 Total Comparison of Classification Accuracy
  4.3 Under-Sampling
    4.3.1 Validation with All Security Level III Emails
    4.3.2 Validation without Under-sampling
  4.4 Performance Validation using a Combination of Data Segmentation Methods
  4.5 Performance Comparison: Varying the Number of Layers
    4.5.1 One versus All Classification Method Comparison
  4.6 Classification Accuracy Validation by Corporation
  4.7 Average Time of Predicting an Email
5 Conclusions
References


Availability:
On campus: available from 2022-10-01
Off campus: not available
The electronic thesis has not been authorized for public release; please consult the library catalog for the print copy.