| Field | Value |
|---|---|
| Graduate student | 柯宗逸 Ke, Tsung-Yi |
| Thesis title | 預訓練模型對多模態情緒辨識的影響 (Effect of Pretrained Models on Multimodal Emotion Recognition) |
| Advisor | 藍崑展 Lan, Kun-Chan |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication | 2024 |
| Graduation academic year | 113 |
| Language | English |
| Pages | 172 |
| Keywords (Chinese) | 遷移學習、預訓練模型、模態競爭、多模態情緒辨識 |
| Keywords (English) | Transfer Learning, Pretrained Model, Modality Competition, Multimodal Emotion Recognition |
Transfer learning has been widely applied in recent years. With the advancement of pretrained models, we can obtain strong model performance with less data and less computation. However, transfer learning does not improve performance in every scenario: previous studies have shown that domain shift and catastrophic forgetting can undermine its effectiveness, while its impact on multimodal models remains underexplored.
During the joint training of multimodal models, we observed a pervasive issue: modality competition. Modality competition occurs when the modalities compete for resources during joint training, so that only a subset of them is learned well. Starting from random initialization, one modality may dominate the optimization because it learns faster, leaving the information in the other modalities insufficiently explored. As a result, the overall performance of a multimodal model may fall short of that of a unimodal model.
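To make the mechanism concrete, below is a minimal PyTorch-style sketch of joint training with late fusion; the two-encoder architecture, layer sizes, and toy data are illustrative assumptions rather than the model used in this thesis. Because both encoders are optimized through a single fused loss, a faster-learning encoder can capture most of the gradient signal.

```python
import torch
import torch.nn as nn

# Hypothetical two-modality late-fusion classifier (all sizes are illustrative).
class LateFusion(nn.Module):
    def __init__(self, audio_dim=40, text_dim=300, hidden=64, n_classes=4):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)  # fused features -> emotion logits

    def forward(self, audio, text):
        fused = torch.cat([self.audio_enc(audio), self.text_enc(text)], dim=-1)
        return self.head(fused)

model = LateFusion()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy random batch standing in for audio/text features and emotion labels.
audio = torch.randn(32, 40)
text = torch.randn(32, 300)
labels = torch.randint(0, 4, (32,))

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(audio, text), labels)  # one shared loss for both encoders
    loss.backward()
    # Both encoders are updated only through this shared loss; if one modality
    # drives the loss down faster, the gradients flowing to the other encoder
    # shrink and it stays under-trained -- the modality competition described above.
    optimizer.step()
```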
Although previous studies have attempted to mitigate modality competition through techniques such as gradient modulation and learning-rate adjustment, these approaches only partially reduce the competition and do not resolve it at its root. As a result, modality competition remains a fundamental challenge for jointly trained multimodal models.
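As an illustration of the learning-rate-adjustment family of mitigations mentioned above (not the specific method evaluated in this thesis), one common recipe is to give each modality encoder its own learning rate via optimizer parameter groups, damping the dominant modality:

```python
# Continuing the sketch above: per-modality learning rates slow down the
# dominant encoder relative to the weaker one. The values are illustrative.
optimizer = torch.optim.Adam([
    {"params": model.audio_enc.parameters(), "lr": 1e-3},  # weaker modality: normal rate
    {"params": model.text_enc.parameters(),  "lr": 1e-4},  # dominant modality: damped rate
    {"params": model.head.parameters(),      "lr": 1e-3},
])
# Rebalancing of this kind reduces, but does not remove, the competition:
# all encoders still share a single fused loss.
```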
In this thesis, we use multimodal emotion recognition as a case study to investigate how modality competition affects transfer learning and to analyze under which competition conditions transfer learning effectively improves model performance. Furthermore, we integrate all publicly available emotion datasets that contain audio, video, and text with emotion labels into a large-scale database to support the subsequent experiments and analyses.
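For the database construction, one practical step is mapping each dataset's label scheme onto a shared emotion set. The snippet below is a purely illustrative sketch of such a mapping; the dataset names, label abbreviations, and unified label set are assumptions, not the mapping tables used in this thesis.

```python
# Illustrative label harmonization when merging heterogeneous emotion datasets.
UNIFIED_LABELS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

# Hypothetical per-dataset mappings onto the unified label set.
LABEL_MAP = {
    "iemocap": {"ang": "anger", "hap": "happiness", "sad": "sadness", "neu": "neutral"},
    "meld": {"joy": "happiness", "anger": "anger", "disgust": "disgust", "fear": "fear",
             "sadness": "sadness", "surprise": "surprise", "neutral": "neutral"},
}

def to_unified(dataset: str, label: str):
    """Map a dataset-specific label to the unified set; returns None for unmapped labels."""
    return LABEL_MAP.get(dataset, {}).get(label.lower())
```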