| Author: | 魏禹宏 Wei, Yu-Hung |
|---|---|
| Thesis Title: | 基於分解的語音之多目標學習於語音情緒辨識 Multi-task Learning for Speech Emotion Recognition Based on Decomposed Speech |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | 碩士 Master |
| Department: | 工學院 College of Engineering - 智慧製造國際碩士學位學程 International Master Program on Intelligent Manufacturing |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 |
| Language: | English |
| Number of Pages: | 52 |
| Keywords (Chinese): | 語音情緒辨識、語音分解、多目標學習 |
| Keywords (English): | Speech emotion recognition, speech decomposition, multi-task learning |
| Hits: | Views: 112, Downloads: 14 |
With the rapid progress of modern technology, more and more human-computer interaction products surround our daily lives, such as intelligent voice assistants, smart watches, and self-driving cars. If these products can take the user's emotion into account as an additional source of information while interacting with the user, they become more human-like and make the interaction more comfortable. There is now a great deal of research on emotion recognition, yet most current speech emotion recognition systems rely on a powerful model to extract features and neglect the prosodic features that are crucial to speech emotion recognition.
This thesis uses an autoencoder to accurately extract the main components of speech as more intuitive, traditional prosodic features and combines them with the features extracted by a powerful speech model. A multi-task training mechanism is further adopted so that the features extracted by the speech model also carry textual information that is never given as input, and so that the speech decomposition model compresses the speech into prosodic features that contain richer emotion information.
Unlike the features extracted by the speech model, speech decomposition requires precise control of each encoder so that its output contains the intended information. This thesis therefore focuses on correctly decomposing speech into its four components, allowing the speech emotion recognition system to obtain more carefully decomposed prosodic features.
The final experiments show that, compared with a single-task emotion recognition system that uses only the speech model, the proposed speech emotion recognition system reaches an accuracy of 77.5%, stands out among the compared architectures, and additionally obtains automatic speech recognition as a by-product. This demonstrates that the proposed multi-task emotion recognition model, which mixes speech-model features with prosodic features, performs well both in fusing the two kinds of features and in multi-task-assisted emotion recognition.
With the rapid advance of modern technology, our daily lives are surrounded by human-computer interaction products such as virtual assistants, smart watches, and self-driving cars. When users interact with these products, incorporating the user's emotional state allows the products to respond more naturally and makes the interaction more comfortable. Although there is now a great deal of research on emotion recognition, most speech emotion recognition systems use a powerful model to extract features but neglect prosodic features, a very important cue for emotion recognition.
This thesis uses an autoencoder to correctly extract the components of speech and treats them as more straightforward, traditional prosodic features, which are combined with the features extracted by a powerful speech model. We also employ multi-task learning to enrich both representations: the speech-model features are encouraged to carry textual information from the speech, and the speech decomposition model is encouraged to produce emotion-rich prosodic features.
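The fused, multi-task setup described above can be illustrated with a minimal PyTorch sketch. All names and sizes here are illustrative assumptions rather than the thesis's actual configuration: frame-level wav2vec 2.0 features and prosodic features from the decomposition model are concatenated, passed through a small shared encoder, and fed to an utterance-level emotion head plus a frame-level CTC head for the auxiliary ASR task.

```python
# Hypothetical sketch of the fusion + multi-task idea; dimensions, module
# names, and loss weights are assumptions, not the thesis's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSER(nn.Module):
    def __init__(self, w2v_dim=768, prosody_dim=128, hidden=256,
                 n_emotions=4, vocab_size=32):
        super().__init__()
        self.fuse = nn.Linear(w2v_dim + prosody_dim, hidden)
        self.shared = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)  # utterance level
        self.ctc_head = nn.Linear(2 * hidden, vocab_size)      # frame level

    def forward(self, w2v_feats, prosody_feats):
        # w2v_feats: (B, T, w2v_dim); prosody_feats: (B, T, prosody_dim),
        # assumed to be aligned to the same frame rate.
        x = torch.tanh(self.fuse(torch.cat([w2v_feats, prosody_feats], dim=-1)))
        x, _ = self.shared(x)
        emotion_logits = self.emotion_head(x.mean(dim=1))        # (B, n_emotions)
        ctc_log_probs = F.log_softmax(self.ctc_head(x), dim=-1)  # (B, T, vocab)
        return emotion_logits, ctc_log_probs

# A joint objective would weight the main and auxiliary losses, e.g.
#   loss = F.cross_entropy(emotion_logits, emo_labels) \
#        + 0.1 * F.ctc_loss(ctc_log_probs.transpose(0, 1), tokens,
#                           input_lengths, token_lengths)
# where the 0.1 weighting is a hypothetical hyperparameter.
```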
Unlike the features extracted by the speech model, speech decomposition requires precise control over each encoder so that its output contains only the intended information. This thesis therefore focuses on correctly decomposing the components of speech, allowing the speech emotion recognition system to acquire carefully decomposed prosodic features.
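As a rough illustration of this decomposition step, the sketch below (again an assumption-laden simplification, not the thesis's architecture) uses four deliberately narrow encoders for content, rhythm, pitch, and timbre, and a shared decoder that reconstructs the mel-spectrogram from the concatenated codes; the narrow bottlenecks are what push each branch to keep only its own component. Bottleneck widths, layer sizes, and the purely frame-level treatment of every code are illustrative choices; information-bottleneck decomposition models of this kind typically add further constraints such as downsampling or random resampling.

```python
# Hypothetical sketch of decomposing speech into four components with
# bottlenecked encoders; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

def bottleneck_encoder(in_dim: int, code_dim: int) -> nn.Module:
    # A deliberately small code_dim forces the branch to discard everything
    # the other branches can already explain.
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, code_dim))

class SpeechDecomposer(nn.Module):
    def __init__(self, n_mels=80, f0_dim=1):
        super().__init__()
        self.content_enc = bottleneck_encoder(n_mels, 16)  # what is said
        self.rhythm_enc = bottleneck_encoder(n_mels, 2)    # timing / duration
        self.pitch_enc = bottleneck_encoder(f0_dim, 4)     # intonation (F0)
        self.timbre_enc = bottleneck_encoder(n_mels, 8)    # speaker identity
        self.decoder = nn.Sequential(nn.Linear(16 + 2 + 4 + 8, 256), nn.ReLU(),
                                     nn.Linear(256, n_mels))

    def forward(self, mel, f0):
        # mel: (B, T, n_mels); f0: (B, T, 1), e.g. a normalized pitch contour.
        codes = [self.content_enc(mel), self.rhythm_enc(mel),
                 self.pitch_enc(f0), self.timbre_enc(mel)]
        recon = self.decoder(torch.cat(codes, dim=-1))     # (B, T, n_mels)
        return recon, codes

# Training would minimize a reconstruction loss such as
# nn.functional.mse_loss(recon, mel); the prosody-related codes (rhythm and
# pitch) are the features handed to the emotion recognizer downstream.
```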
The experimental results show that, compared with a single-task emotion recognition system that uses only a speech model for feature extraction, the proposed speech emotion recognition system achieves an accuracy of 77.5% and yields automatic speech recognition output as a by-product. This demonstrates that the proposed multi-task speech emotion recognition model performs well both in combining speech-model features with prosodic features and in multi-task-assisted emotion recognition.