| Author: | 吳健綺 Wu, Jian-Qi |
|---|---|
| Thesis title: | 應用轉換函式之歸群與選取於情緒語音合成之研究 (A Study on Emotional Speech Synthesis via Conversion Function Clustering and Selection) |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering |
| Year of publication: | 2006 |
| Graduation academic year: | 94 |
| Language: | Chinese |
| Pages: | 55 |
| Keywords (Chinese): | 語音合成、聲音轉換、轉換函式 |
| Keywords (English): | speech synthesis, emotion conversion, conversion function |
"Speech will be the mainstream of the future world." For computer technology to blend into everyday life, speech is the key enabling technology. Although computer speech synthesis has advanced to the stage of emotional speech synthesis, the need for large speech corpora limits the practical application of speech technology. Voice conversion techniques can reduce this corpus requirement; however, describing each conversion model with only a single conversion function is insufficient. The main goal of this study is therefore to achieve better converted voice quality through multiple conversion functions.

In this thesis, the problem of applying multiple conversion functions to computer emotional speech conversion is divided into four research points: 1) designing small, balanced corpus scripts for each emotion and recording parallel corpora; 2) proposing acoustic and linguistic similarity measures for conversion functions; 3) applying the K-means algorithm to cluster the conversion functions; and 4) integrating a neutral-voice text-to-speech system to synthesize emotional speech.

In the experiments, the importance of each speech parameter to emotional characteristics is evaluated first. The performance of different multi-function conversion models is then compared, using prediction error for objective evaluation and MOS for subjective testing, with statistical hypothesis testing for verification. Under the constraint of a small corpus, the proposed method achieves better performance in both objective and subjective evaluations.
Speech technology is key to the next generation of computing. A text-to-speech (TTS) synthesis system that can express emotion can be an effective communication tool for users. However, the requirement for a large speech database obstructs the development and application of such a system.
In this thesis, a conversion function clustering and selection method is proposed for emotional text-to-speech synthesis. More specifically, this study focuses on: 1) designing a small, balanced emotional parallel speech database; 2) proposing a similarity measure between conversion functions based on both acoustic and linguistic features; 3) adopting the K-means algorithm to cluster the functions; and 4) integrating the emotional speech conversion system as a post-processor for emotional speech synthesis.
Several experiments with statistical hypothesis testing were conducted to evaluate the quality of the converted speech as perceived by human subjects. Compared with previous methods, the proposed method shows encouraging potential for expressive speech synthesis.
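As a rough illustration of the clustering step mentioned in the abstract, the sketch below groups toy "conversion functions", each summarized as a small parameter vector, with a plain K-means loop. Everything here is hypothetical: the names, the 2-D vectors, and the farthest-point initialization are illustrative choices, not the thesis's actual feature set or similarity measure (which combines acoustic and linguistic features).

```python
import math

def kmeans(vectors, k, iters=20):
    """Cluster parameter vectors with plain K-means."""
    # Deterministic farthest-point initialization keeps this toy demo stable;
    # the thesis itself does not specify this initialization.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(far))
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each function goes to its nearest centroid.
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

# Hypothetical data: five conversion functions, each reduced to a 2-D
# parameter vector (e.g., a mean F0 shift and a mean spectral shift).
funcs = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [5.0, 5.1], [5.2, 4.9]]
labels, cents = kmeans(funcs, k=2)
```

At synthesis time, one would then select, for each target unit, a representative function from the most appropriate cluster; that selection logic is thesis-specific and omitted here.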