| Graduate Student: | 陳宗佑 Chen, Zong-You |
|---|---|
| Thesis Title: | 基於聲韻辨識之互動式即時語音驅動人臉系統 Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition |
| Advisor: | 王駿發 Wang, Jhing-Fa |
| Degree: | Master's |
| Department: | 電機工程學系 Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2009 |
| Academic Year: | 97 (2008-2009) |
| Language: | English |
| Pages: | 51 |
| Chinese Keywords: | 聲韻辨識 (phonetic recognition)、語音驅動人臉 (voice-driven talking face) |
| English Keywords: | Voice-Driven Human Talking Face, Phonetic Recognition |
Technology always comes from human nature. Interactive multimedia applications are increasingly common in daily life, yet there is still great room for improvement. Improving this technology to bring people more convenience is the goal we continuously pursue.
In this thesis, we propose a real-time voice-driven talking-face technique for digital home communication systems. Each incoming speech segment is first pre-emphasized and windowed with a Hamming window, and twelve linear predictive cepstral coefficients (LPCCs) are then extracted as its feature vector. Chinese phonetic recognition is performed with support vector machines (SVMs).
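The front end above can be sketched in numpy as follows. This is a minimal illustration, not the thesis implementation: the 8 kHz sampling rate, 20 ms frame, pre-emphasis coefficient 0.97, and synthetic test signal are all assumptions not stated in the abstract.

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def lpc(frame, order):
    """Levinson-Durbin recursion on the autocorrelation sequence.
    Returns predictor coefficients a[1..order] (x[n] ~ sum_k a_k x[n-k])."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update polynomial coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:]  # flip sign to get the predictor convention

def lpcc(a, n_ceps):
    """LPC-to-cepstrum recursion (Rabiner & Juang): c_m = a_m + sum (k/m) c_k a_{m-k}."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        c[m] = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:
                c[m] += (k / m) * c[k] * a[m - k - 1]
    return c[1:]

# Example: one 20 ms frame of a synthetic vowel-like signal at 8 kHz
fs = 8000
t = np.arange(int(0.02 * fs)) / fs
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
         + 0.01 * rng.standard_normal(t.size))          # small noise for stability
frame = preemphasis(frame) * np.hamming(len(frame))     # pre-emphasis + Hamming window
features = lpcc(lpc(frame, 12), 12)                     # 12 LPCCs, as in the thesis
print(features.shape)  # prints (12,)
```

In a full system, one such 12-dimensional vector per frame would be fed to the SVM classifier.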
Because every speaker has his or her own accent and speaking habits, we use the sum of absolute differences (SAD) as a shape-difference measure to cluster the mouth-shape pictures of the 16 Chinese single vowels: pictures whose shapes differ little are merged into one group. Since these groups best fit each user's personal speech characteristics, both the recognition rate and the overall performance are improved.
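A small sketch of SAD-based grouping is shown below. The greedy one-pass strategy, the threshold value, and the toy 4x4 images are illustrative assumptions; the thesis does not describe its clustering procedure in this abstract.

```python
import numpy as np

def sad(img_a, img_b):
    """Sum of absolute pixel differences between two equal-sized images."""
    return int(np.abs(img_a.astype(np.int64) - img_b.astype(np.int64)).sum())

def cluster_by_sad(images, threshold):
    """Greedy grouping: each image joins the first cluster whose
    representative (its first member) is within `threshold` SAD of it;
    otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `images`
    for idx, img in enumerate(images):
        for cluster in clusters:
            if sad(images[cluster[0]], img) <= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

# Toy 4x4 grayscale "mouth images": the first two are nearly identical,
# the third (an inverted copy) is very different.
base = np.arange(16, dtype=np.int64).reshape(4, 4) * 17   # values 0..255
images = [base, base + 1, 255 - base]
print(cluster_by_sad(images, threshold=100))  # [[0, 1], [2]]
```

The number of resulting groups depends on the threshold; per-user tuning of that threshold is one way such clustering could adapt to individual speaking styles.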
Finally, alpha blending is used to smooth the transition between two successive pictures: the pixels of the source and destination pictures are mixed by adjusting their transparency, producing a real-time animation effect during picture changes. Experimental results show a phoneme error rate (PER) of 19.22% for Chinese single vowels, reduced to 8.78% after phoneme clustering, and a word error rate (WER) of 27.65%. The mean opinion score (MOS) for single-phoneme recognition and for the naturalness and smoothness of the animation averages 3.43.
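The alpha-blending step can be illustrated with a short numpy sketch. The 2x2 grayscale images and the five-frame transition are toy assumptions; the blend itself is the standard out = alpha * src + (1 - alpha) * dst.

```python
import numpy as np

def alpha_blend(src, dst, alpha):
    """Per-pixel blend: out = alpha * src + (1 - alpha) * dst, alpha in [0, 1]."""
    out = alpha * src.astype(np.float64) + (1.0 - alpha) * dst.astype(np.float64)
    return out.astype(np.uint8)

def transition_frames(src, dst, n_frames):
    """Intermediate frames for a smooth transition from src to dst:
    alpha sweeps linearly from 1 (pure src) down to 0 (pure dst)."""
    return [alpha_blend(src, dst, 1.0 - i / (n_frames - 1)) for i in range(n_frames)]

# Toy 2x2 grayscale "mouth images": fade from all-black to all-white in 5 frames
src = np.zeros((2, 2), dtype=np.uint8)
dst = np.full((2, 2), 255, dtype=np.uint8)
frames = transition_frames(src, dst, 5)
print([int(f[0, 0]) for f in frames])  # [0, 63, 127, 191, 255]
```

Inserting a few such blended frames between two mouth-shape pictures is what turns discrete shape changes into a continuous-looking animation.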