| Graduate Student | Lee, Wei-Yu (李瑋育) |
|---|---|
| Thesis Title | Removal of Speaking Effect on Facial Expression Recognition Based on Feature Conversion (應用特徵轉換於臉部表情辨識中說話效應之移除) |
| Advisor | Wu, Chung-Hsien (吳宗憲) |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2011 |
| Graduation Academic Year | 99 (ROC calendar, 2010–2011) |
| Language | Chinese |
| Pages | 60 |
| Keywords (Chinese) | 人臉情緒、說話效應、轉換函式、決策樹、迴歸函數 |
| Keywords (English) | facial expression, speaking effect, conversion function, decision tree, regression function |
| Access Count | 157 views, 3 downloads |
In recent years, applications of intelligent robots have attracted widespread attention. To give computers a more human-like ability to interact, one important task is enabling them to understand human emotion. Emotion can be expressed in many ways, and the most obvious and direct of these is through changes in facial expression. Most previous research on facial emotion recognition has targeted pure facial expressions, that is, recognizing the emotional state shown on a user's face when the user is not speaking. In realistic situations, however, facial movement is usually driven by both speech and emotion: people typically express emotion while talking. Under such conditions, the speaking factor degrades the accuracy of facial emotion recognition.
To reduce the influence of the speaking effect on facial emotion recognition, this thesis proposes a new facial emotion recognition framework. First, with the aid of speech articulatory attribute class detection, the lip-shape category of each image is identified. To further reduce the time complexity of feature conversion, key frame selection picks out the important frames, and only the selected frames are converted. Because the preceding and following speech also affects the current facial feature points, a decision tree that considers the neighboring lip categories is used to choose the most suitable conversion function. For the conversion itself, principal component analysis (PCA) filters out speaker-specific characteristics while preserving the facial-expression characteristics of each articulatory attribute; in the resulting eigenspace, the facial feature points are converted according to the emotion quadrant and lip category so that the speaking effect is removed, and a template model verifies the converted result. Finally, the verification results of the four emotion quadrants determine the quadrant of the input facial features, and a regression function estimates the arousal and valence values on Thayer's two-dimensional emotion plane.
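As a rough illustration of the conversion step, the sketch below projects speaking-affected facial feature vectors onto a PCA basis, maps them in the eigenspace with a learned mapping, and reconstructs speech-removed feature points. The linear form of the mapping, the assumption of paired speaking/pure-expression frames, the scikit-learn API, and the names `fit_conversion` and `convert` are all assumptions made for illustration; this is not the implementation described in the thesis.

```python
# Minimal sketch (not the thesis implementation) of eigenspace-based feature
# conversion: one conversion function per (emotion quadrant, lip category).
# The linear mapping and paired training frames are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression


def fit_conversion(speaking_feats, pure_feats, n_components=10):
    """Learn a conversion for one (emotion quadrant, lip category) pair.

    speaking_feats, pure_feats: paired arrays of shape (n_frames, n_points),
    facial feature points with and without the speaking effect.
    """
    pca = PCA(n_components=n_components)
    z_speaking = pca.fit_transform(speaking_feats)    # project to the eigenspace
    z_pure = pca.transform(pure_feats)                # same basis for the targets
    conv = LinearRegression().fit(z_speaking, z_pure) # mapping in the eigenspace
    return pca, conv


def convert(pca, conv, feat):
    """Remove the speaking effect from a single facial feature vector."""
    z = pca.transform(np.asarray(feat).reshape(1, -1))
    z_converted = conv.predict(z)
    return pca.inverse_transform(z_converted).ravel() # back to feature-point space
```

A per-key-frame call such as `convert(pca, conv, landmarks)` would then replace the speaking-affected feature points before the template-model verification.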
In the experiments, the database consists of a parallel corpus of emotional speech and the corresponding facial expression videos, 2,880 utterances in total from six subjects, of which 1,920 are used for training and 240 for testing. The results show that, compared with existing approaches that handle the influence of the speaking effect on facial expression recognition, the proposed method achieves a recognition accuracy of 88.3%, confirming that it outperforms related work and that removing the speaking effect is indeed beneficial for facial emotion recognition.
In recent years, with the development of computer technology, intelligent robots have attracted considerable attention in many fields of application. Creating an intelligent human-computer interface that enables harmonious interaction between robots and humans has therefore become an important issue, and emotion recognition is a critical topic for giving computers a better ability to interact with people. Among the many ways emotions are expressed, facial expression is one of the cues most directly related to human emotion. Most previous facial expression recognition studies use pure-expression databases, recognizing the user's emotional state without considering the speaking effect. In most real-world scenarios, however, facial expression is influenced by both the affective state and the speech content, since affective expression usually accompanies speech during communication. As a result, facial expression recognition performance degrades under the influence of the speaking effect.
To overcome this problem, this thesis investigates the impact of the speaking effect on facial expression and proposes a novel facial expression recognition framework. First, a speech articulatory attribute class detector is employed to identify the lip category. Key frame selection is then applied to reduce the time complexity of feature conversion. Considering the preceding and current speech content, a decision tree selects the most suitable conversion function. For the conversion itself, principal component analysis removes the speaking style of each speaker while preserving the facial expression characteristics of each articulatory attribute class. In the eigenspace, the speaking effect is removed by applying the facial feature conversion function corresponding to each emotion quadrant and lip category. A template model then verifies the recognition result based on the converted features. Finally, the regression function of the verified emotion quadrant estimates the arousal and valence values on Thayer's 2D emotion plane.
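The selection of the conversion function can be pictured with the hypothetical sketch below: a decision tree maps the label-encoded previous, current, and next lip categories of a key frame to the index of a conversion function. The use of scikit-learn's `DecisionTreeClassifier`, the toy training data, and the name `select_conversion` are assumptions for illustration, not the decision tree actually trained in the thesis.

```python
# Hypothetical sketch of conversion-function selection from lip-class context.
from sklearn.tree import DecisionTreeClassifier

# Toy training data (illustrative values only): the lip-class context of each
# key frame and the index of the conversion function that suited that context.
contexts = [
    # (previous lip class, current lip class, next lip class)
    (0, 2, 1),
    (2, 2, 3),
    (1, 0, 0),
    (3, 1, 2),
]
best_conversion_ids = [4, 4, 1, 2]

# Train a shallow tree that picks a conversion function from the lip context.
selector = DecisionTreeClassifier(max_depth=5, random_state=0)
selector.fit(contexts, best_conversion_ids)


def select_conversion(prev_lip, cur_lip, next_lip):
    """Return the index of the conversion function to apply to this key frame."""
    return int(selector.predict([(prev_lip, cur_lip, next_lip)])[0])


print(select_conversion(0, 2, 1))  # one of the learned conversion indices
```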
In the experiments, 2,880 emotional utterances were recorded from six subjects, with 1,920 utterances used for training and 240 for testing. The experimental results show that the proposed method, which achieves a recognition accuracy of 88.3%, outperforms current expression recognition approaches and demonstrates that removing the speaking effect is useful for facial expression recognition.
[1] A. Mehrabian, "Communication without words," Psychology Today, vol. 2, pp. 53–56, 1968.
[2] A. Mehrabian and S. R. Ferris, "Inference of attitude from nonverbal communication in two channels," Journal of Counseling Psychology, vol. 31, no. 3, pp. 248–252, 1967.
[3] N. Ambady and R. Rosenthal, "Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis," Psychological Bulletin, vol. 111, pp. 256–274, 1992.
[4] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," in Proc. Ninth ACM Int'l Conf. Multimodal Interfaces (ICMI '07), pp. 126–133, 2007.
[5] A. Azcarate, F. Hageloh, K. van de Sande, and R. Valenti, "Automatic facial emotion recognition," Universiteit van Amsterdam, June 2005.
[6] S. Ioannou, A. Raouzaiou, V. Tzouvaras, T. Mailis, K. Karpouzis, and S. Kollias, "Emotion recognition through facial expression analysis based on a neurofuzzy network," Neural Networks (Special Issue on Emotion: Understanding and Recognition), vol. 18, no. 4, pp. 423–435, 2005.
[7] I. Cohen, A. Garg, and T. S. Huang, "Emotion recognition from facial expressions using multilevel HMM," in Neural Information Processing Systems, 2000.
[8] P. Ekman and W. Friesen, Facial Action Coding System (FACS): Manual. Palo Alto: Consulting Psychologists Press, 1978.
[9] Y.-L. Tian, T. Kanade, and J. F. Cohn, "Robust lip tracking by combining shape, color and motion," in Proc. Asian Conf. Computer Vision, pp. 1040–1045, 2000.
[10] Y.-L. Tian, T. Kanade, and J. F. Cohn, "Recognizing action units for facial expression analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 97–115, 2001.
[11] M. Pantic and L. J. M. Rothkrantz, "Automatic analysis of facial expressions: The state of the art," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1424–1445, Dec. 2000.
[12] M. Pantic and L. J. M. Rothkrantz, "An expert system for recognition of facial actions and their intensity," in Proc. Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, 2000.
[13] C. Zhang, "3D face structure extraction from images at arbitrary poses and under arbitrary illumination conditions," Ph.D. dissertation, Oct. 2006.
[14] T. Okada, T. Takiguchi, and Y. Ariki, "Pose robust and person independent facial expressions recognition using AAM selection," in Consumer Electronics, 2009.
[15] T. Shan, B. C. Lovell, and S. Chen, "Face recognition robust to head pose from one sample image," in Proc. Int'l Conf. Pattern Recognition (ICPR), 2006.
[16] S. Kumano, K. Otsuka, J. Yamato, E. Maeda, and Y. Sato, "Pose-invariant facial expression recognition using variable-intensity templates," International Journal of Computer Vision, 2009.
[17] Z. Zeng, J. Tu, M. Liu, T. Zhang, N. Rizzolo, Z. Zhang, T. Huang, D. Roth, and S. Levinson, "Bimodal HCI-related affect recognition," in Proc. 6th Int'l Conf. Multimodal Interfaces (ICMI '04), pp. 137–143, 2004.
[18] Y. Zhu, Y. Chen, and Y. Zhan, "Expression recognition method of image sequence in audio-video," in Proc. Third Int'l Symposium on Intelligent Information Technology Application, 2009.
[19] D. Datcu and L. J. M. Rothkrantz, "Semantic audio-visual data fusion for automatic emotion recognition," in Euromedia 2008, April 2008.
[20] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681–685, June 2001.
[21] R. E. Thayer, The Biopsychology of Mood and Arousal. New York: Oxford University Press, 1989.
[22] I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, no. 2, pp. 135–164, 2004.
[23] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," 2009. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[24] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 185–208.
[25] N. Ström, "Phoneme probability estimation with dynamic sparsely connected artificial neural networks," The Free Speech Journal, vol. 5, 1997.
[26] M. B. Stegmann, "The AAM-API: An open source Active Appearance Model implementation."