
Graduate student: 徐文彥
Hsu, Wen-Yen
Thesis title: 二階段式分類技術處理不平衡情感資料
A Novel Classification Method Based on a Two-Phase Technique for Learning Imbalanced Text Data
Advisor: 利德江
Li, Der-Chiang
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Number of pages: 63
Keywords: Imbalanced Data, Sentiment Analysis, Support Vector Machine (SVM)
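  • With the rapid development of network technology and computing, efficient operation and higher accuracy have become the goals of current technical development, and data processing is the foundation of all of it. In the era of big data, data are no longer as simple as before; they carry intricate, intertwined information that noticeably degrades the accuracy and efficiency of machine learning and deep learning analysis. Combined with the inherent complexity of text data and the growing problem of imbalanced data, this often leads a classifier to misclassify minority class data as majority class data during learning, which biases the results significantly. Minority class data are usually few in number but carry important meaning, so analyzing text datasets with highly skewed class ratios is a practical challenge. Random sampling is one way of handling imbalanced datasets and comes in two forms, oversampling and undersampling. Oversampling randomly replicates minority class samples to increase their number, whereas undersampling randomly removes majority class samples; both aim to lower the imbalance ratio. In sentiment analysis, however, these methods can cause overfitting and information loss during training. To address this problem, this study proposes a two-phase balanced classification method whose goal is to build the learning model on balanced data and thereby find the ideal classification. In the first phase, a cost-sensitive support vector machine (CS-SVM) is used to obtain a dataset with a low imbalance ratio. In the second phase, a support vector machine (SVM) classifies the balanced dataset produced by the first phase, and, based on the classification results, a genetic algorithm (GA) tunes the SVM misclassification penalty cost C and the kernel function parameter γ.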
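The two random sampling strategies described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy review strings, and the class labels `"pos"` and `"neg"` are all hypothetical, and the sketch assumes a binary problem in which the named minority class really is the smaller one.

```python
import random
from collections import Counter


def random_oversample(samples, labels, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw (with replacement) as many extra minority samples as are missing.
    missing = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(missing)]
    keep = majority_idx + minority_idx + extra
    return [samples[i] for i in keep], [labels[i] for i in keep]


def random_undersample(samples, labels, minority, seed=0):
    """Randomly discard majority samples until both classes are the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw (without replacement) as many majority samples as there are minority samples.
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    keep = kept_majority + minority_idx
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 8 positive reviews and 2 negative reviews.
reviews = [f"good review {i}" for i in range(8)] + ["bad review 0", "bad review 1"]
labels = ["pos"] * 8 + ["neg"] * 2

over_x, over_y = random_oversample(reviews, labels, minority="neg")
under_x, under_y = random_undersample(reviews, labels, minority="neg")

print(Counter(over_y))   # both classes now have 8 samples
print(Counter(under_y))  # both classes now have 2 samples
```

The sketch makes the two drawbacks named in the abstract visible: the oversampled set contains exact duplicates of the two negative reviews, which invites overfitting, and the undersampled set throws away six of the eight positive reviews, which loses information.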

    Imbalanced data has a heavy impact on model performance. In imbalanced text datasets, minority class data are often assigned to the majority class, which loses minority information and lowers accuracy. Handling datasets with a highly imbalanced class distribution is therefore a serious challenge. This study carries out a two-phase classification aimed at a text learning model free of distribution skewness, in which the model is adjusted to its optimal condition. The proposed method has two core stages: stage one creates a balanced dataset, and stage two classifies the balanced dataset using a symmetric cost-sensitive support vector machine. The learning parameters in both stages are tuned with a genetic algorithm in order to create the optimal model. The Yelp review datasets are used to validate the effectiveness of the proposed method. Four criteria are used to evaluate and compare the performance of the proposed method and other well-known algorithms: Accuracy, F-measure, Adjusted G-mean, and AUC. The experimental results indicate that the new method can significantly improve learning performance.
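To show why the abstract looks beyond accuracy, the following sketch computes accuracy, F-measure, the plain geometric mean (G-mean) of sensitivity and specificity, and a rank-based AUC on toy data. It is an illustration under assumptions, not the thesis's evaluation code: the function names and labels are hypothetical, the minority class is treated as the positive class, and the adjusted variant of the G-mean named in the abstract is not reproduced here.

```python
import math


def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, tn, fn) with `positive` as the class of interest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive):
    """Accuracy, F-measure, and plain G-mean; undefined ratios fall back to 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "f_measure": f_measure, "g_mean": g_mean}


def auc(y_true, scores, positive):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels: 2 minority ("neg") and 8 majority ("pos") reviews.
y_true = ["neg"] * 2 + ["pos"] * 8

# A classifier that always predicts the majority class.
always_majority = ["pos"] * 10
# A classifier that finds one of the two minority reviews, at the cost of one false alarm.
mixed = ["neg", "pos"] + ["pos"] * 7 + ["neg"]

trivial_metrics = evaluate(y_true, always_majority, positive="neg")
mixed_metrics = evaluate(y_true, mixed, positive="neg")
toy_auc = auc(["neg", "neg", "pos", "pos"], [0.9, 0.4, 0.5, 0.1], positive="neg")

print(trivial_metrics)
print(mixed_metrics)
print(toy_auc)
```

Both toy classifiers reach the same accuracy, yet the always-majority classifier scores zero on F-measure and G-mean because it never detects a minority review. That gap is the reason imbalance-aware criteria are reported alongside accuracy.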

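    Table of Contents
    Abstract  II
    List of Tables  XV
    List of Figures  XVI
    Chapter 1 Introduction  1
      1.1 Research Background  1
      1.2 Research Motivation  4
      1.3 Research Objectives  6
      1.4 Research Structure  7
    Chapter 2 Literature Review  9
      2.1 Review of the Imbalanced Data Learning Problem  9
        2.1.1 Data Imbalance  9
        2.1.2 Imbalanced Sentiment Analysis  11
      2.2 Review of Random Sampling Methods  13
        2.2.1 Oversampling  13
        2.2.2 Undersampling  16
      2.3 Review of Word Embedding and Feature Extraction  17
        2.3.1 Word Embedding  18
        2.3.2 Feature Extraction  20
      2.4 Review of the SVM Algorithm  22
      2.5 Summary  25
    Chapter 3 Research Method  27
      3.1 Text Data Preprocessing  27
      3.2 Vectorization of Feature Terms  28
      3.3 Two-Phase Model Classification Method  30
        3.3.1 Phase One  31
        3.3.2 Phase Two  33
        3.3.3 Phase Parameter Tuning  34
      3.4 Research Method Workflow  36
    Chapter 4 Experimental Validation  38
      4.1 Experimental Environment  38
        4.1.1 Experimental Procedure  38
        4.1.2 Evaluation Metrics  39
        4.1.3 Experimental Data  41
        4.1.4 Software Used  42
      4.2 Experimental Results  44
        4.2.1 Results Tables  45
        4.2.2 Yelp_α Dataset and Results  47
        4.2.3 Yelp_β Dataset and Results  49
        4.2.4 Yelp_γ Dataset and Results  52
    Chapter 5 Conclusions and Suggestions  56
      5.1 Conclusions  56
      5.2 Future Suggestions  57
    References  58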


    Full text access on campus: public from 2025-05-01
    Full text access off campus: public from 2025-05-01