| Author: | 李冠德 Li, Kuan-Te |
|---|---|
| Thesis title: | 結合發音與聽覺參數之音框選取於多語言語音合成系統 / Unit-selection-based frame selection using articulatory and auditory features for polyglot TTS system |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2011 |
| Graduation academic year: | 99 (ROC calendar, 2010-2011) |
| Language: | Chinese |
| Number of pages: | 54 |
| Keywords (Chinese): | 跨語語音合成, 發音屬性, 聽覺參數, 音框選取 |
| Keywords (English): | Cross-lingual speech synthesis, articulatory attribute, auditory feature, frame selection |
Speech is the most important medium of human communication. Daily conversation often mixes several languages, so a conventional monolingual TTS system no longer meets users' needs; this study therefore aims to build a mixed-language (polyglot) speech synthesis system.

Cross-lingual speech synthesis has long been limited by the cost of corpus collection: it is very difficult to obtain corpora of different languages recorded by the same speaker. This thesis therefore proposes a frame selection method that combines articulatory attributes with auditory features to solve the problem of voice conversion over cross-lingual, non-parallel corpora. Articulatory attributes relate to the place and manner of articulation of a phone and are robust features that are independent of language and speaker, while auditory features capture the characteristics of human hearing better than conventional spectral features. The main work of this study is as follows. In the first phase, corpora of different languages are collected and classified according to their articulatory attributes. In the second phase, articulatory attribute detectors are built and applied to each corpus to obtain articulatory feature parameters, while spectral and auditory features are extracted at the same time. In the third phase, frames whose spectral and auditory features are similar to those of the target corpus are found within each cluster; the articulatory attributes of the preceding, current, and following frames of each selected candidate are considered during concatenation, and dynamic programming is used to obtain the frame sequence of the parallel corpus. Experimental results show that, compared with other methods, the proposed method effectively improves synthesis quality and speaker similarity.
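The first two phases amount to grouping candidate frames by their detected articulatory-attribute labels so that the later selection step only searches the cluster that matches each target frame. A minimal sketch of this grouping, assuming frames are already represented as NumPy feature vectors and the articulatory detector emits one label per frame (function and variable names are illustrative, not taken from the thesis):

```python
from collections import defaultdict
import numpy as np

def build_clusters(frame_features, articulatory_labels):
    """frame_features:      (N, D) array of spectral+auditory feature vectors.
    articulatory_labels: length-N sequence of detected articulatory-attribute
                         labels (e.g. place/manner classes), one per frame.
    Returns {label: (original frame indices, feature sub-matrix)} so that
    frame selection can search only the cluster matching a target frame."""
    frame_features = np.asarray(frame_features, dtype=float)
    grouped = defaultdict(list)
    for idx, label in enumerate(articulatory_labels):
        grouped[label].append(idx)
    return {
        label: (np.asarray(indices), frame_features[indices])
        for label, indices in grouped.items()
    }
```

A cluster built this way can then be queried with the articulatory attribute predicted for each target frame before any distance computation is done.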
Speech is the most intuitive means of human communication. In daily life, code-switching occurs frequently in conversation, especially among Chinese, English, and Taiwanese. Conventional monolingual TTS systems do not meet users' requirements; thus, in this study, a cross-lingual TTS system is proposed.

Cross-lingual speech synthesis is constrained by the high cost of corpus preparation, since it is hard to collect a polyglot corpus recorded by a single speaker. We propose a unit-selection-based frame selection method that integrates articulatory attributes and auditory features to solve the cross-lingual, non-parallel problem in voice conversion. Articulatory attributes are robust features that are both language independent and speaker independent, while auditory features represent human auditory perception better than conventional spectral features. The proposed system consists of three major phases. First, speech databases of each language are collected and frame segments are clustered according to their articulatory attributes. Then, a feature vector consisting of articulatory, spectral, and auditory features is extracted and used for frame-level alignment. In the third phase, each candidate frame for the target corpus is selected from the matching cluster by computing the Euclidean distance over the spectral and auditory features, and the articulatory attribute of each candidate frame is considered for concatenation. Finally, a frame sequence is generated using dynamic programming. Experimental results show that the proposed method yields better quality and speaker similarity of the synthesized speech, and that it is preferred by subjects over conventional methods.
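The third phase can be read as a lattice search: each target frame has a target cost (Euclidean distance over spectral and auditory features) against candidate frames, a concatenation cost penalizes articulatory-attribute mismatches between consecutive selected frames, and dynamic programming finds the minimum-cost frame sequence. A minimal Viterbi-style sketch under those assumptions (the names and the simple fixed mismatch penalty are illustrative, not the thesis implementation):

```python
import numpy as np

def select_frames(target_feats, cand_feats, cand_artic, concat_penalty=1.0):
    """target_feats: (T, D) spectral+auditory features of the target frames.
    cand_feats:   (N, D) features of the candidate frames in the matched cluster.
    cand_artic:   length-N articulatory-attribute label of each candidate frame.
    Returns the index of the selected candidate for each target frame."""
    target_feats = np.asarray(target_feats, dtype=float)
    cand_feats = np.asarray(cand_feats, dtype=float)
    cand_artic = np.asarray(cand_artic)
    T, N = len(target_feats), len(cand_feats)

    # Target cost: Euclidean distance from every target frame to every candidate.
    target_cost = np.linalg.norm(
        target_feats[:, None, :] - cand_feats[None, :, :], axis=-1)   # (T, N)

    # Concatenation cost: zero if consecutive candidates share the same
    # articulatory attribute, otherwise a fixed penalty (a stand-in for the
    # thesis's attribute-based concatenation term).
    concat_cost = concat_penalty * (cand_artic[:, None] != cand_artic[None, :])

    # Dynamic programming (Viterbi) over the target-frame / candidate lattice.
    acc = target_cost[0].copy()            # best accumulated cost per candidate
    back = np.zeros((T, N), dtype=int)     # back-pointers for path recovery
    for t in range(1, T):
        total = acc[:, None] + concat_cost          # (prev cand, current cand)
        back[t] = np.argmin(total, axis=0)
        acc = total[back[t], np.arange(N)] + target_cost[t]

    # Trace back the minimum-cost frame sequence.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(acc))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

In the thesis the candidate set is restricted to the cluster whose articulatory attribute matches the target frame, and the attributes of the preceding and following frames are also weighed during concatenation; the fixed penalty above merely stands in for that term.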
Campus access: publicly available from 2014-08-31.