
Graduate student: Li, Kuan-Te (李冠德)
Thesis title: Unit-selection-based frame selection using articulatory and auditory features for a polyglot TTS system (結合發音與聽覺參數之音框選取於多語言語音合成系統)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master's
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2011
Graduation academic year: 99 (2010-11)
Language: Chinese
Number of pages: 54
Chinese keywords: 跨語語音合成, 發音屬性, 聽覺參數, 音框選取
Keywords: cross-lingual speech synthesis, articulatory attribute, auditory feature, frame selection
Chinese abstract:
    Speech is the most important medium of human communication, and everyday conversation frequently mixes several languages. Conventional monolingual TTS no longer meets users' needs, so this study aims to build a code-mixing, multilingual speech synthesis system.
    Cross-lingual speech synthesis has long been limited by the cost of corpus collection: it is very difficult to obtain corpora of different languages recorded by the same speaker. This thesis therefore proposes a frame-selection method that combines articulatory attributes and auditory features to address voice conversion across languages with non-parallel corpora. Articulatory attributes relate to the place and manner of articulation of a phone and are robust features unaffected by language and speaker, while auditory features capture the characteristics of human hearing better than conventional spectral features. The work proceeds in three stages. In the first stage, corpora of different languages are collected and classified according to articulatory attributes. In the second stage, articulatory attribute detectors are built and applied to each corpus to obtain articulatory feature parameters, while spectral and auditory features are extracted at the same time. In the third stage, frames whose spectral and auditory features are similar to those of the target corpus are chosen from the clusters; during concatenation, the articulatory attributes of the preceding, current, and following frames of each candidate are considered, and dynamic programming is used to obtain the frame sequence of the parallel corpus. Experimental results show that, compared with other methods, the proposed approach effectively improves synthesized speech quality and speaker similarity.
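    As a rough illustration of the frame-matching step described above, the following minimal Python sketch combines the Euclidean distances of spectral and auditory features into one substitution cost. The function name, the feature dimensions in the example call, and the weight w_aud are assumptions for illustration only; the thesis evaluates the auditory-feature weighting experimentally (Section 4.3.1).

    import numpy as np

    def substitution_cost(spec_target, spec_candidate,
                          aud_target, aud_candidate, w_aud=0.5):
        # Euclidean distance between spectral (cepstral) feature vectors
        d_spec = np.linalg.norm(np.asarray(spec_target) - np.asarray(spec_candidate))
        # Euclidean distance between auditory feature vectors
        d_aud = np.linalg.norm(np.asarray(aud_target) - np.asarray(aud_candidate))
        # w_aud is an assumed, illustrative weight between the two distances
        return (1.0 - w_aud) * d_spec + w_aud * d_aud

    # Example with hypothetical 13-dim cepstral and 40-dim auditory vectors
    cost = substitution_cost(np.zeros(13), np.ones(13),
                             np.zeros(40), np.ones(40), w_aud=0.3)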

Abstract:
    Speech is the most intuitive way of human communication. In daily life, code-switching occurs frequently in conversation, especially among Chinese, English, and Taiwanese. Conventional monolingual TTS systems do not meet users' requirements. Thus, in this study, a cross-lingual TTS system is proposed.
    Cross-lingual speech synthesis is constrained by the high cost of corpus preparation; it is hard to collect a polyglot corpus recorded by a single speaker. We propose a unit-selection-based frame selection method that integrates articulatory attributes and auditory features to solve the cross-lingual, non-parallel problem in voice conversion. Articulatory attributes are robust features that are language- and speaker-independent, and auditory features represent human auditory perception better than conventional spectral features. The proposed system consists of three major phases. First, we collect speech databases for each language and cluster frame segments according to their articulatory attributes. Then, a feature vector including articulatory, spectral, and auditory features is extracted and used for frame-level alignment. In the third phase, we select each candidate frame for the target corpus from the clusters by computing the Euclidean distance of spectral and auditory features; the articulatory attributes of each candidate frame are considered for concatenation. Finally, a frame sequence is generated using dynamic programming. Experimental results show that the proposed method produces synthesized speech with better quality and speaker similarity, and that subjects prefer it over conventional methods.
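    The frame-selection phase described above amounts to a Viterbi-style dynamic programming search over candidate frames. The sketch below is only an illustration under assumed interfaces: the frame encoding (a dict with an "artic" attribute vector), the attribute-mismatch concatenation cost, and all function names are hypothetical, not the thesis's actual implementation.

    import numpy as np

    def concatenation_cost(prev_frame, cand_frame):
        # Illustrative cost: count mismatched articulatory attributes between
        # consecutive selected frames (hypothetical discrete-label encoding).
        return float(np.sum(prev_frame["artic"] != cand_frame["artic"]))

    def select_frame_sequence(targets, candidates, sub_cost, concat_cost):
        # targets     : list of target-frame feature vectors
        # candidates  : candidates[t] is the list of candidate frames (drawn
        #               from the articulatory-attribute cluster) for target t
        # sub_cost    : f(target_t, candidate) -> substitution cost
        # concat_cost : f(previous_candidate, candidate) -> concatenation cost
        T = len(targets)
        cost = [np.full(len(candidates[t]), np.inf) for t in range(T)]
        back = [np.zeros(len(candidates[t]), dtype=int) for t in range(T)]

        # initialize the first frame with substitution cost only
        for j, cand in enumerate(candidates[0]):
            cost[0][j] = sub_cost(targets[0], cand)

        # forward pass: accumulate substitution + concatenation costs
        for t in range(1, T):
            for j, cand in enumerate(candidates[t]):
                trans = [cost[t - 1][i] + concat_cost(prev, cand)
                         for i, prev in enumerate(candidates[t - 1])]
                best_i = int(np.argmin(trans))
                cost[t][j] = trans[best_i] + sub_cost(targets[t], cand)
                back[t][j] = best_i

        # trace back the minimum-cost path
        path = [int(np.argmin(cost[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        path.reverse()
        return [candidates[t][j] for t, j in enumerate(path)]

    The selected frame sequence would still need pitch (F0) conversion toward the target speaker, as listed in Section 3.4.5 of the table of contents; that step is not covered by this sketch.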

Table of contents:
    Chinese abstract
    List of figures
    List of tables
    Chapter 1  Introduction
      1.1 Research background
      1.2 Motivation and objectives
      1.3 Overview of the proposed method
      1.4 Thesis organization
    Chapter 2  Literature review
      2.1 The HTS speech synthesis system
        2.1.1 HMM model adaptation
        2.1.2 State mapping
      2.2 Concatenative synthesis systems
        2.2.1 Voice conversion
      2.3 Frame alignment for non-parallel corpora
    Chapter 3  Proposed method
      3.1 Corpus labeling
        3.1.1 Mandarin phone labeling
        3.1.2 English phone labeling
      3.2 Phone clustering
      3.3 Feature extraction
        3.3.1 Cepstral features
        3.3.2 Articulatory attribute extraction
        3.3.3 Auditory features
        3.3.4 Removal of speaker characteristics
      3.4 Frame selection
        3.4.1 Parameter definitions
        3.4.2 Substitution cost
        3.4.3 Concatenation cost
        3.4.4 Dynamic programming
        3.4.5 Pitch conversion
    Chapter 4  Experimental results and analysis
      4.1 Experimental corpora
      4.2 Experimental setup
      4.3 Experiments and evaluation
        4.3.1 Evaluation of auditory feature weighting
        4.3.2 Comparison with other methods
        4.3.3 Synthesis quality for unseen phones
    Chapter 5  Conclusion and future work
      5.1 Conclusion
      5.2 Future work
    References


    Full-text access: on campus, available from 2014-08-31; off campus, not available.
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.