| Author: | Lin, Shih-Lun (林士倫) |
|---|---|
| Title: | A Hybrid Approach to Natural Mandarin Speech Synthesis Based on AF-based Candidate Expansion and Prosodic Word Level Verification (應用發音屬性單元擴增與韻律詞階層驗證之混合式中文自然語音合成) |
| Advisor: | Wu, Chung-Hsien (吳宗憲) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2013 |
| Graduating academic year: | 101 (2012–13) |
| Language: | English |
| Pages: | 67 |
| Keywords: | Hybrid speech synthesis, candidate expansion, residual compensation, prosodic word level verification |
Speech is the most intuitive and immediate way for people to communicate. To enable everyday electronic devices to interact with us as naturally as a human would, natural speech synthesis is a critical technology.
In previous speech synthesis approaches, both model-based and unit-selection-based, the quality of the synthesized speech depends heavily on the size of the corpus used for training. However, collecting such a large corpus is time-consuming and labor-intensive. To alleviate the need for a large corpus, we propose an articulatory-feature-based (AF-based) candidate expansion method, which retrieves units with similar articulatory features so that a small corpus can be fully exploited and comparable synthesis quality can still be achieved. In addition, to reduce the computation spent on cost estimation, a residual-compensation-based filter is employed to discard unsuitable units in advance.
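The two-stage idea above can be illustrated with a minimal sketch: expand the candidate set by articulatory-feature similarity, then prune it with a residual-based filter before the expensive concatenation-cost search. All names, feature vectors, and thresholds here are hypothetical placeholders, not the thesis's actual features or cost functions.

```python
# Hypothetical sketch of AF-based candidate expansion with a residual-based
# pre-filter; the real system's AF representation and costs are not shown.

def af_distance(af_a, af_b):
    """Euclidean distance between two articulatory-feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(af_a, af_b)) ** 0.5

def expand_candidates(target_af, corpus, af_threshold=0.5, residual_threshold=1.0):
    """Retrieve corpus units whose articulatory features are close to the
    target, then drop units whose residual error is too large, so the later
    (expensive) concatenation-cost search considers fewer units."""
    expanded = [u for u in corpus if af_distance(target_af, u["af"]) <= af_threshold]
    return [u for u in expanded if u["residual"] <= residual_threshold]

# Toy corpus: each unit carries an AF vector and a precomputed residual error.
corpus = [
    {"id": "u1", "af": (0.1, 0.9), "residual": 0.2},
    {"id": "u2", "af": (0.2, 0.8), "residual": 2.5},  # pruned: residual too large
    {"id": "u3", "af": (0.9, 0.1), "residual": 0.1},  # pruned: AF too dissimilar
]
candidates = expand_candidates((0.0, 1.0), corpus)
print([u["id"] for u in candidates])  # -> ['u1']
```

Only units passing both tests reach the selection search, which is how the filter keeps the expanded candidate set from inflating the computation.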
Another problem of unit-selection-based approaches is that the quality and stability of the synthesized speech vary considerably compared with natural speech; users may therefore prefer model-based synthesizers, whose output is more consistent. To address this variability, we propose a prosodic word level verification method, in which the typical behavior of prosodic words in the corpus is used to verify the concatenated speech. Segments that are not smooth enough are located and replaced with model-based synthesized units optimized for the surrounding concatenation, guaranteeing that the quality of the resulting speech is no worse than that produced by the model-based synthesizer alone.
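The verification step can be sketched as a per-prosodic-word score check against the corpus norm, with a model-based fallback for outliers. The smoothness scores, the Gaussian-style threshold, and every name below are illustrative assumptions, not the thesis's actual measure.

```python
# Hypothetical sketch of prosodic-word-level verification: any prosodic word
# whose smoothness score falls well below the corpus average is replaced by
# its model-based counterpart.

def verify_and_repair(concat_words, model_words, scores, corpus_mean, corpus_std, k=2.0):
    """Keep each concatenated prosodic word unless its score is more than
    k standard deviations below the corpus mean; otherwise fall back to
    the model-based synthesized unit for that word."""
    floor = corpus_mean - k * corpus_std
    return [
        word if score >= floor else model_word
        for word, model_word, score in zip(concat_words, model_words, scores)
    ]

# Toy example: the second prosodic word scores far below the corpus norm.
result = verify_and_repair(
    concat_words=["concat_w1", "concat_w2", "concat_w3"],
    model_words=["model_w1", "model_w2", "model_w3"],
    scores=[0.9, 0.1, 0.8],
    corpus_mean=0.8,
    corpus_std=0.1,
)
print(result)  # -> ['concat_w1', 'model_w2', 'concat_w3']
```

Because every word either passes verification or is swapped for the model-based unit, the final output's quality is bounded below by the model-based synthesizer, which is the guarantee the method aims for.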
Experimental results show that candidate expansion retrieves more high-quality candidate units for concatenation, while the filters keep the total number of units small enough that little computation is wasted on units unlikely to improve the synthesized speech. The verification step successfully locates and corrects discontinuous parts of the speech, so the overall quality is never lower than that of the model-based synthesizer.