
Graduate Student: Lin, Jia-Kuan (林家寬)
Thesis Title: Affective Structure Modeling of Speech for Emotion Recognition Using Probabilistic Context Free Grammar (應用機率式句法結構於情緒辨識中語音情感結構之模型化)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: Institute of Medical Informatics, College of Electrical Engineering and Computer Science
Year of Publication: 2014
Academic Year of Graduation: 102 (2013-2014)
Language: English
Number of Pages: 62
Keywords (Chinese): speech emotion recognition, probabilistic context-free grammar, hierarchical affective model
Keywords (English): emotional speech recognition, probabilistic context free grammar, hierarchical affective model
    Speech is the most natural and emotionally rich form of human communication, and speech emotion recognition is an indispensable key technology in affective computing research. Conventional utterance-level and segment-level approaches to speech emotion recognition, however, rarely consider the structural relations behind the rise and fall of emotion across an utterance.
    This thesis proposes a speech emotion recognition method based on a hierarchical affective structure. The Canny edge detection algorithm is applied to the spectral similarity of the speech signal, as observed on its spectrogram, to detect hypothesized segment boundaries. An SVM-based emotion model then produces an emotion profile vector for each hypothesized segment, and the point of largest difference in emotional expression is selected as the split boundary, so that the speech signal is represented as a binary hierarchical structure. Because every node in such a binary hierarchy would otherwise be unique, deriving rules directly from the hierarchy would generate too many rules and suffer from data sparseness; vector quantization is therefore used to encode the nodes, and the connections between codewords describe the hierarchical structure of the speech signal. Finally, the concept of a probabilistic context-free grammar is introduced to model the hierarchical structure built from the codeword relations, providing the basis for emotion recognition of the speech signal.
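    To make the boundary-hypothesis step concrete, the following is a minimal Python sketch, not the thesis implementation: it assumes the hypothesized boundaries are taken from column-wise Canny edge strength on the log-spectrogram of a mono recording, and the window and peak-spacing parameters are illustrative only.

        # Hypothesized segment boundary detection from spectrogram edges (illustrative sketch).
        import numpy as np
        from scipy.io import wavfile
        from scipy.signal import spectrogram, find_peaks
        from skimage.feature import canny

        fs, wave = wavfile.read("utterance.wav")                    # hypothetical mono 16 kHz file
        freqs, times, sxx = spectrogram(wave.astype(float), fs=fs,
                                        nperseg=400, noverlap=240)  # 25 ms window, 10 ms shift at 16 kHz
        log_spec = np.log(sxx + 1e-10)

        # Normalize to [0, 1] and run Canny edge detection on the spectrogram image.
        norm = (log_spec - log_spec.min()) / (log_spec.max() - log_spec.min())
        edges = canny(norm, sigma=2.0)

        # Frames where many frequency bins change abruptly are taken as hypothesized boundaries.
        edge_strength = edges.sum(axis=0).astype(float)
        peaks, _ = find_peaks(edge_strength, distance=20)           # keep hypotheses roughly 200 ms apart
        boundary_times = times[peaks]
        print(boundary_times)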
    For the experiments, this thesis uses the German emotional speech database EMO-DB, which covers seven emotion categories, and increases the number of long utterances by concatenating audio files, yielding 1495 utterances in total; recognition is evaluated with speaker-independent (leave-one-speaker-out) cross validation. The results show that the proposed method performs better on long utterances, effectively modeling the emotional fluctuation within an utterance, and reaches a recognition rate of 87.22%, outperforming utterance-level recognition. Future work will collect and analyze real-world corpora to build hierarchical structure models that better reflect spontaneous human emotional expression.
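    The speaker-independent evaluation protocol can be sketched with scikit-learn's LeaveOneGroupOut, treating speaker IDs as groups; the feature matrix, labels, and speaker IDs below are random placeholders, and the SVM hyperparameters are illustrative rather than those used in the thesis.

        # Leave-one-speaker-out (speaker-independent) cross validation, sketched.
        import numpy as np
        from sklearn.model_selection import LeaveOneGroupOut
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 384))           # placeholder utterance-level acoustic features
        y = rng.integers(0, 7, size=200)          # 7 emotion categories, as in EMO-DB
        speakers = rng.integers(0, 10, size=200)  # EMO-DB contains 10 speakers

        accuracies = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
            clf = SVC(kernel="rbf", probability=True)   # SVM emotion model
            clf.fit(X[train_idx], y[train_idx])
            accuracies.append(clf.score(X[test_idx], y[test_idx]))
        print("mean leave-one-speaker-out accuracy:", np.mean(accuracies))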

    Speech is the most natural mode of communication and carries rich emotional information, so recognizing emotions in speech plays an important role in affective computing. Prior work on utterance-level and segment-level processing pays little attention to the underlying structure of emotional speech. This thesis proposes a hierarchical approach to modeling affective structure based on a probabilistic context-free grammar (PCFG) for emotion recognition. The Canny edge detection algorithm is employed to detect hypothesized segment boundaries of the speech signal according to spectral similarity. Emotion profiles generated by an SVM-based classification model are used to find the boundary of maximum emotional change between segments, and a binary tree is then constructed to derive a hierarchical structure of multi-layer speech segments. Vector quantization is further used to build an emotion-profile codebook and a codeword-based hierarchical representation of the speech segments, and the PCFG models the hierarchical relations between codewords for affective structure modeling. For evaluation, the Berlin emotional speech database (EMO-DB), covering 7 emotion categories and expanded to 1495 utterances, was used with a leave-one-speaker-out cross-validation scheme; to investigate the effect of utterance length, two or more utterances from the database were concatenated. The experimental results show that the proposed method achieved an emotion recognition accuracy of 87.22% on long utterances and outperformed the conventional SVM-based method. Further work on collecting realistic corpora is needed for the analysis and recognition of emotions in spontaneous speech.
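    The codeword-based hierarchy and grammar induction can be illustrated with the following hedged sketch: emotion profiles are placeholder probability vectors, KMeans stands in for the vector quantizer, each node is labelled with the codeword of its averaged profile, and nltk's PCFG induction plays the role of the affective structure model. In the actual method one grammar would be trained per emotion, and recognition would select the grammar that assigns the unseen utterance the highest parse probability.

        # Binary hierarchical segmentation over emotion profiles, VQ codeword labels,
        # and PCFG induction from the derived tree (illustrative sketch only).
        import numpy as np
        from sklearn.cluster import KMeans
        from nltk.tree import Tree
        from nltk.grammar import Nonterminal, induce_pcfg

        def mean_profile(profiles, lo, hi):
            """Average emotion profile of segments lo..hi (hi exclusive)."""
            return profiles[lo:hi].mean(axis=0)

        def split_point(profiles, lo, hi):
            """Boundary where adjacent emotion profiles differ the most."""
            diffs = np.linalg.norm(np.diff(profiles[lo:hi], axis=0), axis=1)
            return lo + 1 + int(np.argmax(diffs))

        def build_tree(profiles, codebook, lo, hi, min_len=2):
            """Recursive binary segmentation; each node is labelled by its VQ codeword."""
            code = int(codebook.predict(mean_profile(profiles, lo, hi)[None, :])[0])
            if hi - lo <= min_len:
                return Tree(f"C{code}", [f"c{code}"])        # leaf codeword as terminal symbol
            mid = split_point(profiles, lo, hi)
            return Tree(f"C{code}", [build_tree(profiles, codebook, lo, mid),
                                     build_tree(profiles, codebook, mid, hi)])

        # Toy data: 20 segment-level emotion profiles over 7 classes (placeholders).
        rng = np.random.default_rng(1)
        profiles = rng.dirichlet(np.ones(7), size=20)
        codebook = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)

        tree = build_tree(profiles, codebook, 0, len(profiles))
        grammar = induce_pcfg(Nonterminal(tree.label()), tree.productions())
        print(grammar)                                       # rule probabilities estimated from counts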

    Abstract (Chinese) I
    Abstract (English) II
    Acknowledgements IV
    List of Tables VII
    List of Figures VIII
    Chapter 1 Introduction 1
      1.1 Motivation 1
      1.2 Problems 3
      1.3 Background and Literature Review 5
        1.3.1 Affective Expression 5
        1.3.2 Affective Computing and Emotion Databases 7
        1.3.3 Recognition Units and Acoustic Features 9
        1.3.4 Emotion Classification 12
      1.4 Research Goals 15
      1.5 Research Framework 16
    Chapter 2 Proposed Method 17
      2.1 Hierarchical Segmentation 18
        2.1.1 Segment Boundary Detection 18
        2.1.2 SVM-Based Emotion Model 20
        2.1.3 Hierarchical Segmentation of Speech 23
      2.2 Codebook Construction and Transformation 26
      2.3 Modeling and Inference by Probabilistic Context Free Grammar 28
        2.3.1 Affective Structure Model Training 30
        2.3.2 Inference of Emotions 31
    Chapter 3 Experimental Results and Discussion 36
      3.1 Evaluations on EMO-DB 36
        3.1.1 Descriptive Statistics of the Corpus 36
        3.1.2 Determination of the Codeword Number 41
        3.1.3 Analysis of the Derived Rules 42
        3.1.4 Performance of SVM-based Method 45
        3.1.5 Performance of the Proposed Method 47
      3.2 Performance Evaluations Using Concatenated EMO-DB 51
        3.2.1 Corpus Expansion from EMO-DB 51
        3.2.2 Recognition of Long Utterances 52
      3.3 Performance Comparison and Discussion 56
    Chapter 4 Conclusion and Future Work 58
    References 59


    Full-text availability: on campus, open from 2024-12-31; off campus, not available.
    The electronic thesis has not been authorized for public release; please consult the library catalog for the printed copy.