| Graduate student: | 李智宏 Li, Chih-Hung |
|---|---|
| Thesis title: | 基於新穎的音節切割及音節相關性之方法應用於笑聲偵測系統研究與實現 Research and Implementation of the Laughter Detection System Based on a Novel Syllable Segmentation and Correlation Methods |
| Advisor: | 王駿發 Wang, Jhing-Fa |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 電機工程學系 Department of Electrical Engineering |
| Year of publication: | 2012 |
| Academic year of graduation: | 100 (ROC academic year) |
| Language: | English |
| Number of pages: | 45 |
| Keywords (Chinese): | 自相關函數、聲道轉換偵測器、梅爾倒頻譜係數、動態時間校正、音高分析 |
| Keywords (English): | Laughter detection, Autocorrelation function, Vocal tract transfer detector, Mel-scale frequency cepstral coefficient, Dynamic time warping, Pitch analysis |
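In the high-tech era, people are under ever greater stress, and cases of depression are common. This usually happens because people lack an adequate understanding of their own stress, which can lead to irreparable regret. A laughter detection system can record the length and number of a person's laughs to help people understand themselves. In addition, laughter often lowers the recognition rate of speech recognition systems, so a laughter detection system can serve as signal pre-processing to improve their performance.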
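Past research on laughter detection and recognition has mainly extracted acoustic features of laughter and applied effective classifiers. Commonly used acoustic features include Mel-scale frequency cepstral coefficients and perceptual linear prediction coefficients, and commonly used classifiers include hidden Markov models and support vector machines. To reach high accuracy, these methods must train and build complete acoustic models, but because laughter comes in so many types with such large variability, building a complete model is very difficult. To avoid this problem, we exploit the autocorrelation property of laughter and propose a laughter detection system that is speaker-independent, computationally light, and training-free. The system uses a modified autocorrelation function and a vocal tract transfer detector to implement a syllable segmentation algorithm that separates all the syllables in a stretch of audio. It then takes Mel-scale frequency cepstral coefficients as features and uses dynamic time warping together with pitch-trend analysis to compute a syllable correlation score. Finally, the system judges three consecutive highly correlated syllables to be a laughter segment.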
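The autocorrelation property that the system relies on can be illustrated with a plain normalized autocorrelation. This is only a sketch under stated assumptions: it is not the thesis's modified autocorrelation function, and the function names, the synthetic sine frame, and the lag range are all hypothetical choices made for illustration.

```python
import math

def normalized_autocorrelation(frame, max_lag):
    """Return r[0..max_lag] with r[k] = sum(x[n] * x[n+k]) / sum(x[n]^2)."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        # A silent frame has no periodic structure to report.
        return [0.0] * (max_lag + 1)
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]

def strongest_lag(frame, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    r = normalized_autocorrelation(frame, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda k: r[k])

# Hypothetical test signal: a sine wave that repeats every 20 samples.
frame = [math.sin(2 * math.pi * i / 20) for i in range(200)]
period = strongest_lag(frame, min_lag=5, max_lag=50)
```

A signal that repeats itself produces a clear autocorrelation peak at its repetition lag, which is the kind of regularity a repetitive sound such as laughter would be expected to show.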
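Tested on 150 sentences from ten subjects, the proposed system reaches 94% accuracy. For processing speed, tests on an HTC One X show that it processes an average of 2.63 seconds of signal per second. These results indicate that the proposed system is efficient and runs in real time.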
In the high-tech age, people face increasing stress, and depression often arises because people do not understand their own stress levels. Eventually, stress that goes unmanaged leads to regret. Laughter is an indicator that can help people gauge their stress: high stress often occurs in people who seldom laugh. A laughter detection system can record the duration and number of laughs to help people understand themselves. In speech recognition, a laughter detection system can also serve as a pre-processing stage that improves performance, because laughter in the speech signal lowers the recognition rate.
Previous work on laughter classification focused on acoustic feature extraction and model building. Most studies used Mel-scale frequency cepstral coefficients (MFCCs) and perceptual linear predictive coefficients (PLPs) as audio features, with hidden Markov models (HMMs) and support vector machines (SVMs) as popular classifiers. Achieving high recognition rates generally requires a large database and a thorough training process, yet the variance in acoustic features between different kinds of laughter remains a serious problem. We propose a laughter detection system based on the correlation characteristics of the signal. The system is speaker-independent, computationally light, and training-free. To this end, a modified autocorrelation function (MACF) is combined with a new approach called the vocal tract transfer detector (VTTD) to segment an input signal into a syllable stream. Next, using each syllable's MFCCs, the correlation between two consecutive syllables is measured with the dynamic time warping (DTW) algorithm and pitch analysis. Three consecutive syllables with high correlation are considered a laughter segment.
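The last two steps, comparing consecutive syllables with DTW and flagging runs of three similar syllables, can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the pitch analysis is omitted, the feature vectors are toy two-dimensional stand-ins for MFCC frames, and the function names, the length normalization, and the threshold of 0.5 are assumptions.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Length-normalized DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def laughter_segments(syllables, threshold):
    """Return inclusive (first, last) syllable index pairs for every run of
    three or more consecutive syllables whose neighbours are all similar."""
    similar = [dtw_distance(a, b) < threshold
               for a, b in zip(syllables, syllables[1:])]
    segments, run_start = [], None
    for i, ok in enumerate(similar + [False]):  # sentinel closes a trailing run
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start >= 2:  # two similar adjacent pairs span three syllables
                segments.append((run_start, i))
            run_start = None
    return segments

# Hypothetical toy "syllables": each is a short sequence of 2-D feature frames.
ha = [(1.0, 0.0), (0.8, 0.2)]
ha_longer = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2)]
speech_a = [(0.0, 5.0), (0.0, 6.0)]
speech_b = [(7.0, 7.0)]
syllables = [speech_a, ha, ha_longer, ha, speech_b]
segments = laughter_segments(syllables, threshold=0.5)
```

DTW is used here because repeated syllables rarely have the same duration; the warping path lets `ha` and `ha_longer` align despite their different lengths.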
In our experiments, the proposed system achieves an accuracy of 94% on ten subjects with 150 sentences in total. For computation time, we chose a smartphone, the HTC One X, as the experimental platform; on average, the system processes 2.63 seconds of signal per second of computation. These results indicate that the proposed method detects laughter effectively and runs in real time.
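The real-time claim can be restated as a real-time factor, the ratio of processing time to audio duration, where a value below 1 means the system keeps up with incoming audio. The short sketch below only reworks the reported figure; the function name is hypothetical and the thesis may define its speed measure differently.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# Reported figure: 2.63 seconds of signal handled per second of computation.
rtf = real_time_factor(audio_seconds=2.63, processing_seconds=1.0)
is_real_time = rtf < 1.0
```

With the reported throughput, the factor comes out at roughly 0.38, so each second of audio needs about 0.38 seconds of computation on the test device.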