| Graduate Student: | 黃俊修 Huang, Jyun-Siou |
|---|---|
| Thesis Title: | 基於隱藏式馬可夫模型之語音情緒辨識的初始模型探討 Initial Model Study of Speech Emotion Recognition Using Hidden Markov Model Based System |
| Advisor: | 邱瀝毅 Chiou, Lih-Yih |
| Co-advisor: | 雷曉方 Lei, Sheau-Fang |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2019 |
| Graduation Academic Year: | 107 (2018–2019) |
| Language: | English |
| Pages: | 96 |
| Keywords (Chinese): | 情緒辨識、梅爾倒頻譜、高斯混和模型、隱藏式馬可夫模型 |
| Keywords (English): | Emotion Recognition, Mel-frequency Cepstral Coefficient (MFCC), Gaussian Mixture Model (GMM), Hidden Markov Model (HMM) |
Speech emotion recognition is one of the most important topics in human-machine interaction, with applications such as chatbots, psychological screening, and safety alerts. In recent years, a great variety of features and classifiers have been tried: features such as pitch, formants, and Mel-frequency cepstral coefficients, and classifiers such as the support vector machine, the Gaussian mixture model, the hidden Markov model, and artificial neural networks.
Emotional speech can be viewed as a flow of prosodic states. Building on this view, this work presents a complete comparison of the recognition rates obtained with the Gaussian mixture model, the discrete hidden Markov model, and the continuous hidden Markov model as classifiers, all using Mel-frequency cepstral coefficients as features. It finds that hidden Markov models, which exploit state-transition probabilities, outperform Gaussian mixture models, which classify from statistical information alone, and that continuous hidden Markov models, which use multivariate probability density functions, outperform discrete hidden Markov models, which use discrete probabilities.
The underflow and singularity problems that arise in these systems are also discussed in this thesis, and solutions are proposed. In addition, the choice of the initial model assumed for the hidden Markov model is examined.
Finally, the highest recognition rates obtained with the Gaussian mixture model, the discrete hidden Markov model, and the continuous hidden Markov model as classifiers are 58.07%, 65.67%, and 89.20% respectively, and the highest average recognition rates are 51.12%, 53.20%, and 70.15% respectively. These results show that the continuous hidden Markov model is the best classifier of the three.
Emotion recognition from the speech signal is one of the most important topics in human-machine interaction; it can be used for chatbots, mental examination, safety warnings, etc. Over the past few years, many speech features and classifiers have been tried, e.g., pitch, formant, and Mel-frequency Cepstral Coefficient (MFCC) features, and Support Vector Machine (SVM), Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), and Artificial Neural Network (ANN) classifiers.
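For concreteness, MFCC extraction can be sketched as below. This is a minimal illustration, not the author's code: the librosa toolkit, the file name, and the 25 ms / 10 ms frame settings are assumptions of this sketch, not taken from the thesis.

```python
# A minimal sketch of MFCC feature extraction, assuming librosa.
import librosa

# "utterance.wav" is a hypothetical placeholder file name.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop at 16 kHz
mfcc = mfcc.T  # shape (frames, 13): one 13-dimensional feature vector per frame
```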
Emotional speech can be regarded as a flow of prosodic states. In this work, the recognition accuracies of a GMM, a discrete HMM, and a continuous HMM, all using MFCCs as speech features, are compared. The underflow and singularity problems that arise in these systems are discussed and overcome, and the initial model hypothesis used to start HMM training is also examined in this thesis.
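The comparison pipeline can be sketched as follows. This is a minimal sketch assuming hmmlearn and scikit-learn rather than the author's own implementation, and the model sizes (5 states, 8 mixture components) are placeholders; a discrete HMM would additionally require vector-quantising the MFCC frames first. The comments mark where the underflow fix (log-domain scoring), the singularity fix (covariance flooring), and the initial model hypothesis (k-means initialisation of the state means) enter such a pipeline.

```python
# Minimal sketch: train one model per emotion on MFCC sequences and
# classify an utterance by the highest log-likelihood.
import numpy as np
from hmmlearn import hmm                      # continuous HMM
from sklearn.mixture import GaussianMixture   # plain GMM baseline

def train_models(train_data):
    """train_data: dict mapping emotion label -> list of (T_i, 13) MFCC arrays."""
    models = {}
    for emotion, seqs in train_data.items():
        X = np.vstack(seqs)               # stack all frames of this emotion
        lengths = [len(s) for s in seqs]  # per-utterance frame counts
        # hmmlearn initialises the state means with k-means (one common
        # initial model hypothesis) and scores in the log domain, which
        # avoids underflow; min_covar floors the covariances, which
        # avoids the singularity problem.
        m = hmm.GaussianHMM(n_components=5, covariance_type="diag",
                            min_covar=1e-3, n_iter=50, random_state=0)
        m.fit(X, lengths)
        # The GMM ignores temporal order: frames are pooled and modelled
        # by a single mixture density (statistical information only).
        g = GaussianMixture(n_components=8, covariance_type="diag",
                            reg_covar=1e-6, random_state=0)
        g.fit(X)
        models[emotion] = (m, g)
    return models

def classify(models, mfcc, use_hmm=True):
    """Return the emotion whose model scores the utterance highest."""
    idx = 0 if use_hmm else 1
    return max(models, key=lambda e: models[e][idx].score(mfcc))
```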
Finally, the highest recognition accuracies are 58.07%, 65.67%, and 89.20% for the GMM, discrete HMM, and continuous HMM classifiers respectively, and the highest average accuracies are 51.12%, 53.20%, and 70.15% respectively. The results show that the continuous HMM is the best classifier of the three. They support the claims that an HMM, which models the state flow, outperforms a GMM, which uses only statistical information, and that a continuous HMM, which uses multivariate probability densities, outperforms a discrete HMM, which uses discrete probabilities.
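In standard HMM notation (after Rabiner's tutorial), the distinction behind this last result can be written out: the discrete HMM emits vector-quantised codebook symbols with a probability mass function per state, while the continuous HMM emits the real-valued MFCC vector from a Gaussian mixture density per state.

```latex
% Discrete HMM: state j emits one of K codebook symbols v_k
b_j(k) = P(o_t = v_k \mid q_t = j), \qquad \sum_{k=1}^{K} b_j(k) = 1
% Continuous HMM: state j emits the MFCC vector o_t from a mixture density
b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\,
    \mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} c_{jm} = 1
```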