| Author: | Kao, Hung-Tzu (高宏慈) |
|---|---|
| Title: | Voice Driven Multimedia Interactive System with Ubiquitous Sound Recognition for Digital Home Application |
| Advisor: | Wang, Jhing-Fa (王駿發) |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2008 |
| Academic year: | 96 (ROC calendar) |
| Language: | English |
| Pages: | 83 |
| Keywords: | multimedia interactive system; ubiquitous sound recognition |
Every sound in our living environment has its own distinctive character. We can often recognize a sound from its characteristics and thereby judge what is happening around us, for example crying, dog barking, glass breaking, or screaming. If we can classify and recognize this acoustic information and integrate it with communications into a surveillance system, it will greatly help in monitoring home safety.

In this thesis, we propose a multimedia interactive system for digital home applications, built from a ubiquitous sound recognition subsystem combined with a communication subsystem. For sound recognition, we construct the recording interface with multiple microphones and a mixer, and apply signal-subspace noise reduction. The classifier is a support vector machine (SVM), and the audio features fall into two groups: the first is perceptual features, including total spectrum power, subband powers, brightness, bandwidth, and pitch; the second is MFCC and delta MFCC. On the communication side, the communication subsystem controls a GSM modem through a serial port to send SMS messages and place voice calls, and combines a streaming server with a voice driven talking face (VDTF) module (an interactive interface that animates a human face in sync with the speaker's voice to render a talking expression) to realize video surveillance and visual interaction. Once the sound recognizer identifies a sound and determines the event type, the system decides how to communicate the abnormal event according to a pre-defined scenario mode. With the VDTF interface, visual interaction can be achieved by transmitting only voice rather than video, which is especially suitable for low-bandwidth network environments such as mobile phones. In addition, a remote PDA phone can connect to the streaming server to view the surveillance video on the system side; thus, when a child cries at home, interaction with the outside world provides remote home care. In surveillance mode, the system can send an SMS message to report abnormal events at home. Experiments verify that the ubiquitous sound recognizer can effectively identify sound events in the environment, with an average recognition rate of 90.97%, which compensates for the shortcomings of video motion detection. Through the communication subsystem, the goals of surveillance and home care are also achieved.
In our living environment, every kind of sound has its own distinctive character, and surrounding events can be recognized from it, for example human crying, screaming, dog barking, and glass breaking. By identifying and classifying these sounds and integrating the results with a communication system, we can build a surveillance system capable of monitoring surrounding events.
In this thesis, we present a multimedia interactive system for digital home applications. The system consists of two subsystems: a ubiquitous sound recognition subsystem and a communication subsystem. A set of microphones and a mixer form the sound recording interface, and a signal subspace algorithm is adopted for noise reduction to provide ubiquitous computing capability. The sound recognizer is built on a support vector machine (SVM).
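The perceptual feature group named in the abstract (total spectrum power, subband powers, brightness, and bandwidth) can be sketched for a single audio frame as below. This is a minimal NumPy illustration, not the thesis's implementation: the frame length, the equal-width subband split, and the function name are illustrative assumptions, and pitch extraction is omitted.

```python
import numpy as np

def perceptual_features(frame, sr, n_subbands=4):
    """Sketch of the perceptual feature group: total spectrum power,
    subband powers, brightness (spectral centroid), and bandwidth."""
    spec = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)  # bin frequencies (Hz)
    total_power = spec.sum()
    # Subband powers: split the spectrum into equal-width bands.
    subband_powers = [band.sum() for band in np.array_split(spec, n_subbands)]
    # Brightness: power-weighted mean frequency (spectral centroid).
    brightness = (freqs * spec).sum() / total_power
    # Bandwidth: power-weighted spread around the centroid.
    bandwidth = np.sqrt(((freqs - brightness) ** 2 * spec).sum() / total_power)
    return np.array([total_power, *subband_powers, brightness, bandwidth])

# Example: a 1 kHz sine frame sampled at 8 kHz.
sr = 8000
t = np.arange(1024) / sr
feat = perceptual_features(np.sin(2 * np.pi * 1000 * t), sr)
```

Frame-level vectors like this, concatenated with MFCC and delta MFCC coefficients, would then be fed to the SVM classifier.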
In the communication phase, a GSM modem is controlled through a serial port to send SMS messages and make voice calls. Moreover, a voice driven talking face (VDTF) module is combined with a video streaming server to support user interaction. The VDTF is an interactive interface that renders the user's facial expression through face animation. After a sound has been recognized as a particular event, the system makes a communication decision according to the pre-defined scenario. With the VDTF presentation, the system does not need to transmit bandwidth-hungry video; only the voice signal is transmitted. This enables interactive behavior even in low-bandwidth environments such as mobile phones. Moreover, a remote PDA phone can connect to the streaming server to monitor the home status. For example, a baby crying event can be detected by the proposed system, thereby enabling remote home care. In surveillance mode, the system notifies the owner of what happened by sending an SMS message.
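Sending an SMS over a serial-connected GSM modem follows the standard text-mode AT command exchange (`AT+CMGF=1` to select text mode, `AT+CMGS` to submit a message terminated by Ctrl-Z). The sketch below only builds the command sequence; the helper name and the reply-matching scheme are illustrative assumptions, and the actual serial I/O that the thesis performs is omitted.

```python
def sms_at_sequence(number, message):
    """Build the AT command sequence a GSM modem expects for one SMS in
    text mode. Each entry is (bytes_to_write, reply_to_wait_for); the
    trailing Ctrl-Z byte (0x1A) submits the message body."""
    return [
        (b"AT\r", b"OK"),                          # check the modem responds
        (b"AT+CMGF=1\r", b"OK"),                   # select SMS text mode
        (f'AT+CMGS="{number}"\r'.encode(), b">"),  # start message, wait for prompt
        (message.encode() + b"\x1a", b"+CMGS"),    # body + Ctrl-Z submits it
    ]

seq = sms_at_sequence("+886912345678", "Abnormal sound event: glass breaking")
```

A driver would write each command to the serial port and wait for the paired reply before issuing the next one.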
Experimental results demonstrate that the proposed ubiquitous sound recognizer can effectively detect sound events, with an average accuracy of 90.97%. It can therefore supply event information that video motion detection alone misses. With the proposed communication subsystem, users can likewise achieve the objectives of remote monitoring and home care.
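The signal-subspace noise reduction adopted by the recognition subsystem can be sketched as follows: eigendecompose the empirical covariance of noisy frames, treat eigen-directions whose energy is near the noise floor as noise-only, and attenuate each direction with a Wiener-like gain. This is a minimal sketch in the spirit of the signal subspace approach, assuming known white-noise variance and stationarity across frames, not the thesis's exact estimator.

```python
import numpy as np

def subspace_denoise(frames, noise_var):
    """Sketch of signal-subspace noise reduction: project noisy frames onto
    the signal subspace with a per-direction Wiener-like gain.
    frames: (n_frames, frame_len) array; noise_var: white-noise variance."""
    cov = frames.T @ frames / len(frames)    # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-directions of the frames
    # Estimated clean-signal energy per direction; noise-dominated ones clip to 0.
    sig = np.maximum(eigvals - noise_var, 0.0)
    # Gain is ~1 in strong signal directions, 0 in the noise subspace.
    gain = sig / (sig + noise_var)
    return frames @ eigvecs @ np.diag(gain) @ eigvecs.T

# Example: identical sinusoidal frames buried in white noise.
rng = np.random.default_rng(0)
n, L = 200, 32
clean = np.sin(2 * np.pi * 0.1 * np.arange(L)) * np.ones((n, 1))
noisy = clean + rng.normal(0.0, 0.5, (n, L))
denoised = subspace_denoise(noisy, noise_var=0.25)
```

Because the clean frames here span a one-dimensional subspace, most eigen-directions carry only noise and are suppressed, which lowers the mean squared error relative to the noisy input.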