Graduate student: | 陳毅軒 Chen, Yi-Hsuan |
---|---|
Thesis title: | 使用深度神經網路考量口語與非口語之韻律短語語音情緒辨識 Prosodic Phrase-Based Speech Emotion Recognition Using Deep Neural Network Considering Verbal and Non-verbal Speech Signals |
Advisor: | 吳宗憲 Wu, Chung-Hsien |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2018 |
Academic year of graduation: | 106 |
Language: | English |
Number of pages: | 46 |
Chinese keywords: | 語音情緒辨識、韻律短語、非口語音段、卷積神經網路、長短期記憶模型、序列對序列模型 |
English keywords: | Speech emotion recognition, Prosodic phrase, Non-verbal segment, Convolutional neural network, Long short-term memory, Sequence-to-sequence model |
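With the spread of intelligent services such as chatbots, assistance in diagnosing mental illness, sales, care, and entertainment, speech emotion recognition has become increasingly important. In communication between humans and machines, emotion recognition and sentiment analysis can enhance human-machine interaction. When the human brain identifies another person's emotion, it perceives verbal and non-verbal vocal expressions together, which makes the identification clearer. To the best of our knowledge, no existing speech emotion recognition mechanism focuses on natural emotional expressions such as non-verbal laughter and crying.
This thesis mainly considers the emotions that arise naturally in human-to-human spoken dialogue and how emotion changes during the dialogue. Drawing on how the human brain identifies others' emotions, it aims to aid recognition by analyzing the features of the verbal and non-verbal segments of speech. It therefore uses the NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus (NNIME), built by National Tsing Hua University and National Taiwan University of Arts. NNIME is a spontaneous emotion corpus with specified scenarios but no scripted utterances, and it contains many non-verbal sound segments typical of natural emotional dialogue, such as laughter, crying, and breathy sounds.
This thesis re-segments the 101 dialogue sessions of the NNIME corpus, obtaining 4766 single-speaker dialogue turns. A support vector machine and an automatic prosodic phrase tagger convert the audio of each turn into a sequence of segments made up of non-verbal segments, prosodic phrases, and silence segments. Each sequence is fed to trained convolutional neural networks, which extract the emotion features and sound features of each segment. The features of each segment are represented as vectors and fed to an LSTM-based sequence-to-sequence model with an attention mechanism, which performs segment-level emotion recognition. For an input turn, the output is a sequence of emotion tags whose length equals the number of segments.
The experimental results show that, for naturally expressed emotion, considering non-verbal features and sound-type features improves segment-level emotion recognition. We hope that stronger speech emotion recognition can make machines more human-like.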
Speech emotion recognition is increasingly important for many applications, such as chatbots, assistance in diagnosing mental illness, smart health care, sales and advertising, smart entertainment, and other smart services. In human-machine communication, emotion recognition and sentiment analysis can enhance the interaction between people and devices. When people recognize others' emotions, the brain processes vocal representation and emotionality independently, and then uses the vocal cues to distinguish the emotion more clearly. To the best of our knowledge, no existing emotion recognition system considers laughter, crying, or other emotional interjections, which occur naturally in everyday speech when we express emotion.
This thesis examines spontaneous emotion expression and emotion change within a single turn of everyday dialogue. Considering how the human brain discriminates others' emotions, we extract features from both the verbal and non-verbal parts of speech for emotion recognition. For these purposes, the thesis uses a spontaneous speech emotion corpus, NNIME (NTHU-NTUA Chinese Interactive Multimodal Emotion Corpus), which contains various emotional non-verbal sounds in speech, such as laughter, sobbing, and sighs.
In total, 4766 single-speaker dialogue turns were produced by re-segmenting the audio data of the 101 NNIME sessions. To reconstruct each turn as a sequence of silence intervals, prosodic phrases, and non-verbal sounds, an SVM-based verbal/non-verbal discriminator is developed and a prosodic phrase (PPh) auto-tagger is used. These segments are then used as training data for emotion/sound feature extraction based on convolutional neural networks (CNNs). Every turn is then represented as a sequence of emotion/sound feature vectors, which becomes the input to an attentive LSTM-based sequence-to-sequence model. For a given turn, the model outputs an emotion tag sequence, with one tag per segment, as the recognition result.
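The data flow described above (a turn split into segments, one feature vector per segment, attention over the segment sequence, one emotion tag per segment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the `Segment` class, the two-dimensional feature vectors, the dot-product attention scoring, and the nearest-prototype tagger that stands in for the trained LSTM decoder are all hypothetical, and the abstract does not specify the attention scoring function.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    kind: str              # "silence", "prosodic_phrase", or "nonverbal"
    start: float           # segment start time in seconds
    end: float             # segment end time in seconds
    features: List[float]  # stand-in for a CNN emotion/sound feature vector


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_context(query: List[float],
                      keys: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Dot-product attention (an assumed scoring function): weight every
    segment vector by its similarity to the query, then average."""
    weights = softmax([dot(query, key) for key in keys])
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(dim)]
    return weights, context


def tag_turn(segments: List[Segment],
             prototypes: Dict[str, List[float]]) -> List[str]:
    """Return one emotion tag per segment.

    `prototypes` maps an emotion label to a vector; it is a hypothetical
    stand-in for the trained decoder, used only to keep the sketch runnable.
    """
    keys = [seg.features for seg in segments]
    tags = []
    for seg in segments:
        _, context = attention_context(seg.features, keys)
        tags.append(max(prototypes, key=lambda label: dot(context, prototypes[label])))
    return tags


# Toy turn: a prosodic phrase, a pause, then a laugh-like non-verbal sound.
turn = [
    Segment("prosodic_phrase", 0.0, 1.2, [1.0, 0.0]),
    Segment("silence", 1.2, 1.5, [0.0, 0.0]),
    Segment("nonverbal", 1.5, 2.1, [0.0, 1.0]),
]
prototypes = {"neutral": [1.0, 0.0], "happy": [0.0, 1.0]}
tags = tag_turn(turn, prototypes)
print(list(zip([seg.kind for seg in turn], tags)))
```

The point of the sketch is the shape of the output: `tag_turn` always returns as many tags as the turn has segments, so silence and non-verbal segments receive their own labels instead of being discarded before recognition.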
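The experimental results show that, for spontaneous emotional speech, considering non-verbal features and sound-type features improves segment-level emotion recognition. Better emotion recognition for spontaneous speech can in turn benefit human-machine interaction.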