| Graduate student: | 蔡定男 Tsai, Ting-Nan |
|---|---|
| Thesis title: | 基於深度神經網路之情緒辨識系統及其於人形機器人之應用 Deep Neural Network Based Emotion Recognition System for Humanoid Robot |
| Advisor: | 李祖聖 Li, Tzuu-Hseng S. |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2018 |
| Academic year of graduation: | 106 (ROC academic year) |
| Language: | English |
| Pages: | 66 |
| Keywords (Chinese): | 卷積神經網路, 情緒辨識, 長短期記憶神經網路, 遷移學習 |
| Keywords (English): | Convolutional Neural Network, Emotion Recognition, Long Short-Term Memory, Transfer Learning |
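Recognizing the emotions of the other party is a very important problem when a robot interacts with people. This thesis proposes an emotion recognition system for a humanoid robot: the robot captures images of the person it is interacting with through a webcam, recognizes that person's emotion, and gives an appropriate response. The proposed system is based on deep neural networks and learns to recognize six basic emotions: happiness, sadness, surprise, fear, disgust, and anger. The system architecture is divided into four parts. First, a convolutional neural network is trained as a feature extractor on a large database of single-image facial expressions and serves as the means of extracting visual features. Next, a long short-term memory network is trained on a database of image sequences to learn how the dynamic change of the images over time relates to the six emotions, that is, to extract temporal features from the sequences. The proposed network architecture then combines the strengths of both, considering the visual features of facial expressions together with their variation over time, in order to recognize facial emotions in image sequences. Furthermore, transfer learning significantly improves the performance of the emotion recognition system. Finally, leave-one-out cross-validation is used to compare the recognition rates of different methods, and real-time emotion recognition is implemented on a humanoid robot.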
Recognizing human emotions is crucial for robots during human-robot interaction. This thesis therefore proposes an emotion recognition system for a humanoid robot. The robot is equipped with a camera that captures images of the user's face, and the goal is for the robot to respond appropriately to the user's emotion as recognized by the system. The emotion recognition system, based on a deep neural network, learns six basic emotions: happiness, anger, disgust, fear, sadness, and surprise. The system is built in four steps. The first step uses a convolutional neural network (CNN) to extract visual features by training on a large number of static images. The second step uses a long short-term memory (LSTM) recurrent neural network to learn the relationship between changes in facial expression across image sequences and the six basic emotions. The third step combines the advantages of the CNN and the LSTM by integrating both into a single model. The fourth step improves the performance of the emotion recognition system through transfer learning, a method that transfers knowledge between related but different problems. Finally, the performance of the proposed system is verified by leave-one-out cross-validation and compared with other models, and the system is applied to human-robot interaction to demonstrate its practicality.
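The leave-one-out cross-validation mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say whether the held-out unit is a single sample or a subject, so this version holds out one sample at a time, and a toy nearest-centroid classifier on made-up two-dimensional features stands in for the thesis's CNN-LSTM model. The names `leave_one_out_splits`, `fit_centroids`, `predict`, `loocv_accuracy`, and `TOY_SAMPLES` are hypothetical.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# (feature vector, emotion label); hypothetical stand-in for extracted features
Sample = Tuple[List[float], str]


def leave_one_out_splits(n: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training indices, held-out index) for each of the n samples."""
    for held_out in range(n):
        yield [j for j in range(n) if j != held_out], held_out


def fit_centroids(train: Sequence[Sample]) -> Dict[str, List[float]]:
    """Toy stand-in model: the mean feature vector of each emotion label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids: Dict[str, List[float]], features: List[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((c - f) ** 2
                              for c, f in zip(centroids[label], features)),
    )


def loocv_accuracy(samples: Sequence[Sample]) -> float:
    """Train on all samples but one, test on the one left out, repeat for each."""
    correct = 0
    for train_idx, test_idx in leave_one_out_splits(len(samples)):
        model = fit_centroids([samples[j] for j in train_idx])
        features, label = samples[test_idx]
        if predict(model, features) == label:
            correct += 1
    return correct / len(samples)


# Made-up, clearly separable data for demonstration only (not thesis data).
TOY_SAMPLES: List[Sample] = [
    ([0.9, 0.1], "happiness"),
    ([0.8, 0.2], "happiness"),
    ([1.0, 0.0], "happiness"),
    ([0.1, 0.9], "sadness"),
    ([0.2, 0.8], "sadness"),
    ([0.0, 1.0], "sadness"),
]

if __name__ == "__main__":
    print(f"LOOCV accuracy on toy data: {loocv_accuracy(TOY_SAMPLES):.2f}")
```

Because every sample is used for testing exactly once, the resulting recognition rate uses all of the available data, which is one common reason to choose this scheme when a labelled image-sequence dataset is small. Comparing models, as the abstract describes, would amount to running the same loop with a different fit and predict pair.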