
Author: Chen, Yi-Hung
Title: Applied Voice Driven Technique to Remote Multimedia Interaction System
Advisor: Wang, Jhing-Fa
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2010
Graduation Academic Year: 98 (ROC calendar)
Language: English
Pages: 51
Keywords: voice driven talking face, dynamic time warping (DTW), support vector machine (SVM)
Abstract (translated from Chinese):

    Communication is one of the most important parts of people's social lives, and it is not limited to any particular time or place. With the rapid development and application of multimedia communication technology, video telephony can present the caller's image on the communication device, so that communication is no longer merely a voice conversation but also includes visual interaction, conveying the emotion of a face-to-face conversation more realistically.
    When a 3G wireless network is used outdoors for real-time audio-visual communication with an indoor terminal, bandwidth limitations cause the indoor display to be choppy. This thesis evaluates current multimedia communication methods and their quality, and proposes a solution to the above problem: Voice Driven Talking Face (VDTF) technology, which displays a virtual real-time facial image. The system only needs to transmit the speech signal to display an emotionally expressive image.
    VDTF combines real-time vowel phoneme recognition with keyword spotting and consists of four main steps: (1) a support vector machine (SVM) classifier recognizes the feature sequence of vowel phonemes; (2) dynamic time warping (DTW) matches the feature sequence against the system's preset keywords; (3) the matched vowel phonemes and keyword sequences are mapped to the corresponding display sequence of images; (4) alpha blending combines the resulting sequence of lip-shape and emotion images, which is then played back synchronously with the audio.
    Experimental results show that after the Mandarin vowels are classified by the SVM and then clustered, the vowel recognition error rate is reduced from 19.22% to 9.37%, and the mean opinion score (MOS) for single-syllable recognition accuracy and for the naturalness and realism of the animated images averages 3.32.
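The keyword-spotting step (2) rests on a classic DTW comparison between the recognized vowel sequence and each preset keyword template. A minimal pure-Python sketch, assuming string vowel labels and a unit substitution cost (both illustrative; the thesis matches feature sequences, not plain labels):

```python
# Hypothetical sketch of DTW-based keyword spotting over vowel labels.
# The vowel alphabet, keywords, and cost function are assumptions for
# illustration only.

def dtw_distance(seq, template, cost=lambda a, b: 0 if a == b else 1):
    """Dynamic-time-warping distance between two label sequences."""
    n, m = len(seq), len(template)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq[i - 1], template[j - 1])
            d[i][j] = c + min(d[i - 1][j],      # insertion
                              d[i][j - 1],      # deletion
                              d[i - 1][j - 1])  # match / substitution
    return d[n][m]

# A keyword is spotted when its template yields the smallest distance.
recognized = ["a", "i", "u", "a"]
keywords = {"hello": ["a", "i", "u", "a"], "bye": ["o", "e"]}
best = min(keywords, key=lambda k: dtw_distance(recognized, keywords[k]))
# best == "hello"
```

Because DTW tolerates insertions and deletions, a keyword can still be matched when the recognizer drops or repeats a vowel.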

Abstract (English):

    Communication is one of the most important parts of human social life. As communication technologies continue to advance, a video conversation system can show the caller's video and voice on the receiver's device; moreover, interacting through images conveys the emotion of a conversation better than voice alone.
    3G is one of the popular wireless technologies used for video conversation between indoor and outdoor terminals. In general, the displayed video is not smooth because of bandwidth limitations. For this reason, this thesis proposes Voice Driven Talking Face (VDTF) technology, which presents an emotional face image generated from the corresponding speech signal to overcome the above problem.
    VDTF is based on vowel recognition and keyword spotting, and includes four steps. (1) In vowel recognition, the vowel sequence is identified by a Support Vector Machine (SVM) classifier. (2) In keyword spotting, Dynamic Time Warping (DTW) identifies the correct keyword when the sequence contains the vowels of the first syllable of a keyword, and updates the vowel sequence accordingly. (3) The corresponding emotion images and lip-shape images are arranged. (4) Finally, the lip-sync animation and emotion images are combined into an image sequence at the appropriate transparency level using alpha blending, and the adjusted image sequence is displayed synchronously with the speech.
    In the experimental results, the average vowel error rates of the SVM-based classifier are 19.22% without clustering and 9.37% with clustering. The MOS for single-word recognition averages 3.32 points, and the images displayed by the proposed system are both vivid and natural.
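The alpha blending in step (4) can be illustrated per pixel: the output colour is a weighted mix of the lip-shape/emotion frame and the base face. A minimal sketch with plain RGB tuples (the pixel values and function name are illustrative assumptions; the actual system composites full image buffers):

```python
# Hypothetical per-pixel sketch of alpha blending for the talking face.
# out = alpha * foreground + (1 - alpha) * background, per channel.

def alpha_blend(fg, bg, alpha):
    """Blend a foreground pixel over a background pixel."""
    return tuple(round(alpha * f + (1 - alpha) * b) for f, b in zip(fg, bg))

lip_pixel  = (200, 80, 80)    # foreground: lip-shape / emotion frame
face_pixel = (120, 120, 120)  # background: base face image
blended = alpha_blend(lip_pixel, face_pixel, 0.5)  # -> (160, 100, 100)
```

Varying `alpha` over successive frames is what lets one mouth shape fade smoothly into the next during playback.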

    Chapter 1. Introduction
      1.1 Motivation
      1.2 Thesis Objective
      1.3 Thesis Organization
    Chapter 2. Background and Related Work
      2.1 Communication Networks
        2.1.1 Cable Transmission Network
        2.1.2 Wireless Network
      2.2 Voice Driven
        2.2.1 Elements of a Facial Animation System
        2.2.2 Text-Driven Facial Animations
        2.2.3 Speech-Driven Facial Animations
        2.2.4 Face Model
      2.3 Audio-Visual Articulatory Model
    Chapter 3. Proposed System Overview
      3.1 Communication Subsystem
        3.1.1 Voice Communication Services
        3.1.2 Internet Communication Services
      3.2 Implementation of Voice over IP (VoIP)
        3.2.1 An Introduction to the Socket API
    Chapter 4. Voice Driven Talking Face
      4.1 Voice Driven Talking Face Architecture
      4.2 Mandarin Vowel Classification Based on Single Vowels
      4.3 Noise Reduction
      4.4 Feature Extraction
        4.4.1 Frame Blocking
        4.4.2 Log-Energy
        4.4.3 Pre-Emphasis
        4.4.4 Hamming Window
        4.4.5 FFT
        4.4.6 Triangular Bandpass Filters
        4.4.7 DCT
      4.5 Vowel Recognition Based on SVM
        4.5.1 Linear Classifier
        4.5.2 Non-Separable Case
        4.5.3 Kernel Function
        4.5.4 Multi-class SVMs
      4.6 Keyword Spotting Based on Dynamic Time Warping
      4.7 Alpha Blending
    Chapter 5. Experiment and Comparison
      5.1 Experimental Setup
      5.2 Experimental Results and Comparison
    Chapter 6. Conclusion and Future Work
    References


    Full text availability: on campus, available from 2020-12-30; off campus, not available. The electronic thesis has not been authorized for public release; for the print copy, consult the library catalog.