
Author: Chuang, Shu-Kai (莊書愷)
Title: Speaker Matching System for Multi-Person Visual and Audio Integration Based on Deep Learning (基於深度學習之多人視聽覺整合之發言者匹配系統)
Advisor: Wang, Jhing-Fa (王駿發)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Academic Year of Graduation: 108 (2019-2020)
Language: English
Number of Pages: 49
Keywords: Visual and audio integration, Speaker detection, Speaker matching, Multi-person speaking system
Abstract: Artificial intelligence has developed rapidly in recent years, and dialogue systems have become one of the most active research topics. In a dialogue environment such as a public place or a home, conversations are not necessarily one-to-one. When multiple people speak at the same time, the system cannot determine who the current user is, and may pass the wrong utterance to the dialogue system, preventing it from responding correctly. To address this problem, this thesis proposes a deep-learning-based speaker matching system with multi-person visual and audio integration that works effectively with multiple users. The system comprises a speech processing module, an image processing module, and a speaker matching integration module. The speech processing module uses the Google Speech API to perform noise reduction and speech recognition on the input audio, and records the recognition result in a text register. The image processing module uses RGB face images for face detection, identity recognition, mouth-shape state recognition, and attention analysis. In the speaker matching integration module, speaker detection processes the output of the image processing module to obtain the speaker's identity; speaker matching then takes the speaker identity, the state register, and the text register as input and performs the matching. The state and text registers allow the modules to communicate and operate efficiently. With three users, dialogue-recording accuracy exceeds 85% and dialogue-matching accuracy exceeds 80%. In practice, when a single person speaks to the system, the identity and content of the utterance are matched correctly; when multiple people speak at once, the system guides users to speak in turn, preventing the dialogue system from receiving incomplete or meaningless sentences.
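To make the module interaction concrete, here is a minimal sketch in Python of the speaker matching integration logic described above. It is an illustration under assumptions, not the thesis's actual implementation: all class and function names (FaceObservation, Registers, detect_speakers, match_speaker) are hypothetical, and the per-frame identity, mouth-state, and attention results are assumed to arrive from the image processing module, with recognized utterances coming from the speech processing module.

    from dataclasses import dataclass, field

    @dataclass
    class FaceObservation:
        # Hypothetical per-frame output of the image processing module.
        identity: str      # face recognition result
        mouth_open: bool   # mouth-shape state recognition
        attentive: bool    # attention analysis via head pose estimation

    @dataclass
    class Registers:
        # State and text registers through which the modules communicate.
        state: dict = field(default_factory=dict)  # identity -> speaking?
        text: list = field(default_factory=list)   # (identity, utterance) log

    def detect_speakers(faces):
        # Speaker detection: a user counts as speaking when the mouth is
        # open and the head pose shows attention toward the system.
        return [f.identity for f in faces if f.mouth_open and f.attentive]

    def match_speaker(regs, faces, utterance):
        # Speaker matching: pair a recognized utterance with an identity.
        # With exactly one active speaker the match succeeds; with several,
        # the system guides users to take turns instead of guessing.
        speakers = detect_speakers(faces)
        for f in faces:
            regs.state[f.identity] = f.identity in speakers
        if len(speakers) == 1:
            regs.text.append((speakers[0], utterance))
            return speakers[0]
        if len(speakers) > 1:
            print("Multiple speakers detected; please speak one at a time.")
        return None

    # Example: two users in view, only one speaking while facing the system.
    regs = Registers()
    frame = [FaceObservation("alice", mouth_open=True, attentive=True),
             FaceObservation("bob", mouth_open=False, attentive=True)]
    print(match_speaker(regs, frame, "turn on the light"))  # -> alice

The single-speaker rule mirrors the behavior reported in the abstract: one active speaker is matched to the utterance and logged in the text register, while simultaneous speakers are asked to speak in order so the dialogue system never receives a mixed, incomplete sentence.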

Table of Contents:
Chinese Abstract I
Abstract IV
Acknowledgements VI
Content VII
Table List IX
Figure List X
Chapter 1 Introduction 1
  1.1 Background 1
  1.2 Motivation 2
  1.3 Objectives 2
  1.4 Organization 3
Chapter 2 Related Work 4
  2.1 The Survey of Face Recognition Application 4
  2.2 The Survey of Attention Analysis 4
  2.3 The Survey of Mouth Detector 6
  2.4 The Survey of Audiovisual Integration 7
Chapter 3 Speaker Matching System for Multi-Person Visual and Audio Integration 8
  3.1 System Overview 8
    3.1.1 Speech Processing Module 8
    3.1.2 Image Processing Module 8
    3.1.3 Speaker Matching Integration Module 8
  3.2 Speech Processing Module 10
    3.2.1 Frame Overview 10
    3.2.2 Automatic Speech Recognition 11
    3.2.3 Analysis and Storage 11
  3.3 Image Processing Module 11
    3.3.1 Frame Overview 11
    3.3.2 Multi-Face Detection Based on MobileNet SSD 12
    3.3.3 Multi-Person Face Recognition Based on Face Image 18
    3.3.4 Mouth Detector and Mouth Shape State Recognition 22
    3.3.5 Attention Analysis Based on Head Pose Estimation 26
  3.4 Speaker Matching Integration Module 30
    3.4.1 Frame Overview 30
    3.4.2 Introduction of Registers 31
    3.4.3 Speaker Detection 32
    3.4.4 Speaker Matching 33
Chapter 4 Experimental Results 35
  4.1 Experimental Environment 35
  4.2 Experimental Results for Face Recognition 35
  4.3 Experimental Results for Mouth Shape Status Recognition 36
  4.4 Experimental Results for Attention Analysis 39
  4.5 Experimental Results for Multi-Speaker Matching 41
    4.5.1 Evaluation Methods 42
    4.5.2 Experiment Results 42
  4.6 MOS Experiment for the Proposed System 43
Chapter 5 Conclusions and Future Works 45
  5.1 Conclusions 45
  5.2 Future Works 46
References 47


Availability: On campus: open access from 2025-08-31; off campus: not open to the public. The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.