Graduate student: Hsu, Jia-Hao (徐嘉昊)
Thesis title: Attentively-Coupled Long Short-Term Memory for Audio-Visual Emotion Recognition (應用注意力機制的耦合長短記憶模型於影音情緒辨識)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2019
Academic year of graduation: 107
Language: English
Number of pages: 61
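Keywords (Chinese): 影音情緒辨識 (audio-visual emotion recognition), 模組階段混合 (model-level fusion), 分段注意力 (segment-based attention), 長短記憶模型 (long short-term memory model)
Keywords (English): Audio-visual emotion recognition, model-level fusion, segment-based attention mechanism, Long Short-Term Memory, sequence-to-sequence model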
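  • As human-computer interaction products continue to evolve, many smart products, such as smart speakers, home robots, and self-driving cars, assist with our daily needs. Adding recognition of the user's emotion to the interaction with these products would make them more human-like and extend the range of interaction. Research on emotion recognition is growing. Among existing audio-visual emotion recognition systems, however, only a few recognize emotion expression segment by segment and use the segment-level emotions to find the finer rises and falls of emotional expression.
    This thesis takes segments of an emotion expression as the recognition unit. It captures the speaker's facial expressions and audio signals, processes and analyzes the different facial and audio features, and considers the dependencies between preceding and following segments. It also finds the important segments that most influence the emotional expression of the whole utterance and gives them higher attention in the overall recognition, improving the recognition accuracy of each segment.
    Unlike single-modal emotion recognition, a multi-modal emotion recognition architecture must consider how data from different modalities are fused; this thesis focuses on improving the fusion method to raise the performance of segment-based emotion recognition. It uses a coupled long short-term memory model to fuse the data and adds an attention mechanism at each time step of the fusion unit's computation. When a unit is updated, the coupled unit considers the mutual influence of the signal features of the two modalities, incorporates the degree of attention of each segment in the sequence, and learns the long-term dependencies of the signal.
    The final experiments show that the proposed audio-visual emotion recognition system reaches an accuracy of 70.1%, standing out among the compared architectures, including other existing traditional audio-visual emotion recognition systems. This demonstrates that the proposed attentively-coupled long short-term memory model works well both for multi-modal signal fusion and for emotion recognition with a segment-based attention mechanism.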

    With the continuous evolution of human-computer interaction products, many smart products, such as smart speakers, home robots, and self-driving cars, now support our daily needs. Adding user emotion recognition to the interaction with these products would make them more human-like and make the interaction more flexible. A growing number of studies address emotion recognition. Among existing audio-visual emotion recognition systems, however, only a few perform segment-based recognition of emotion expression, in contrast to utterance-based recognition. Segment-based recognition can reveal finer-grained fluctuations in emotional expression.
    This thesis uses segments as the recognition unit. It captures the facial expressions and audio signals of the speaker, extracts and analyzes the different features of the facial and audio signals, and models the dependencies between preceding and following segments. Segments that strongly influence the emotional expression of the whole utterance are identified first and given higher attention in the overall recognition, which improves the recognition accuracy of each segment.
    Unlike single-modal emotion recognition, a multi-modal emotion recognition architecture must consider how data from different modalities are fused. This thesis focuses on improving the fusion mechanism, and thereby the performance of segment-based emotion recognition, by using an attentively-coupled long short-term memory (LSTM) model. In each fusion operation, the coupling unit considers the mutual influence of the features of the two modalities when it updates, and the attention mechanism adds the degree of attention of each sequential segment. The LSTM controls the flow of information to learn the long- and short-term dependencies of the signal. The model outputs a sequence of emotion predictions, one per segment, and is intended to recognize emotion from both the facial and the vocal expressions of the speaker.
    In the experiments, the proposed audio-visual emotion recognition system achieved an accuracy of 70.1%, outperforming other existing traditional audio-visual emotion recognition systems. The results indicate that the proposed attentively-coupled long short-term memory model performs well both in multi-modal fusion and in emotion recognition with segment-based attention.
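The fusion idea described in the abstract can be illustrated with a small sketch. This record does not give the thesis's equations, so everything below is an assumption made for illustration: the two modalities are coupled by letting each LSTM cell's gates read the other modality's previous hidden state, segment attention weights come from a softmax over hypothetical per-segment scores, and the weight scales each segment's cell update. The names `cell_step`, `run_sequence`, and all parameter shapes are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_cell(n_in, n_hid, rng):
    """Random gate parameters; each gate reads [own input, own h, other h]."""
    n_cat = n_in + 2 * n_hid
    params = {}
    for name in ("i", "f", "o", "g"):
        W = [[rng.uniform(-0.1, 0.1) for _ in range(n_cat)] for _ in range(n_hid)]
        params[name] = (W, [0.0] * n_hid)
    return params


def cell_step(params, x, h_own, h_other, c_prev, alpha):
    """One LSTM step for one modality, coupled to the other via h_other.

    alpha is the segment's attention weight (assumption: it scales the
    new information written into the cell state).
    """
    z = list(x) + list(h_own) + list(h_other)

    def gate(name, fn):
        W, b = params[name]
        return [fn(v) for v in affine(W, z, b)]

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c = [fk * ck + alpha * ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_sequence(audio, visual, scores, n_hid=4, n_classes=3, seed=0):
    """Return (attention weights, per-segment class probabilities)."""
    rng = random.Random(seed)
    p_a = init_cell(len(audio[0]), n_hid, rng)
    p_v = init_cell(len(visual[0]), n_hid, rng)
    W_out = [[rng.uniform(-0.1, 0.1) for _ in range(2 * n_hid)]
             for _ in range(n_classes)]
    b_out = [0.0] * n_classes
    alphas = softmax(scores)
    h_a, c_a = [0.0] * n_hid, [0.0] * n_hid
    h_v, c_v = [0.0] * n_hid, [0.0] * n_hid
    preds = []
    for x_a, x_v, alpha in zip(audio, visual, alphas):
        # Both cells read the previous hidden state of the other modality.
        new_h_a, c_a = cell_step(p_a, x_a, h_a, h_v, c_a, alpha)
        new_h_v, c_v = cell_step(p_v, x_v, h_v, h_a, c_v, alpha)
        h_a, h_v = new_h_a, new_h_v
        preds.append(softmax(affine(W_out, h_a + h_v, b_out)))
    return alphas, preds


if __name__ == "__main__":
    audio = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]                 # 3 segments
    visual = [[0.2, 0.1, 0.4], [0.1, 0.1, 0.1], [0.3, 0.2, 0.0]]
    alphas, preds = run_sequence(audio, visual, scores=[0.5, 2.0, 1.0])
    print(alphas)
    print(preds)
```

The sketch returns one probability vector per segment, matching the abstract's description of a per-segment emotion prediction sequence. How the thesis actually computes segment weights and couples the cells may differ from these choices.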

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Background
        1.2 Motivation
        1.3 Literature Review
            1.3.1 Speech Emotion Recognition Systems
            1.3.2 Facial Expression Recognition Systems
            1.3.3 Fusion Methods of Multi-modal Signals
        1.4 Problems
        1.5 Description of Proposed Method
    Chapter 2 Database Design and Collection
        2.1 Original Annotation of BAUM-1
            2.1.1 BAUM-1 Database
            2.1.2 Recording Setup
        2.2 Classification of Emotion and Sound
            2.2.1 Classification of Emotion
            2.2.2 Classification of Sound
        2.3 Segmentation and Re-Annotation of BAUM-1
            2.3.1 Segmentation
            2.3.2 Re-Annotation
            2.3.3 Statistical Information of Re-Annotation Corpus
    Chapter 3 Proposed Methods
        3.1 Pre-processing
            3.1.1 Pre-Processing of Audio Data
            3.1.2 Pre-Processing of Visual Data
        3.2 Feature Extraction
            3.2.1 Feature Extraction of Audio Data
            3.2.2 Feature Extraction of Visual Data
            3.2.3 Calculation of Segment Weights
        3.3 Emotion Recognition Model
            3.3.1 Coupled LSTM model
            3.3.2 Attentively-Coupled LSTM model
    Chapter 4 Experimental Results and Discussion
        4.1 Evaluation of Audio Feature Extraction Model
            4.1.1 Audio Emotion Feature Extraction Model
            4.1.2 Audio Sound Type Feature Extraction Model
        4.2 Evaluation of Visual Feature Extraction Model
        4.3 Evaluation of NTN Model
        4.4 Evaluation of Audio-Visual Emotion Recognition Model
        4.5 Comparison with Other Methods
            4.5.1 Evaluation of Modalities
            4.5.2 Evaluation of Different Methods
    Chapter 5 Conclusion and Future Work
    Reference

    Full-text availability: on campus, public from 2021-08-31
    off campus, public from 2021-08-31