| Graduate Student: | Chen, Yan-You (陳彥佑) |
|---|---|
| Thesis Title: | Personalized Natural-Sounding Speech Synthesis Based on a Small-Sized Corpus (基於少量語料之個人化自然語音合成) |
| Advisor: | Wang, Jhing-Fa (王駿發) |
| Degree: | Doctoral |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2016 |
| Graduation Academic Year: | 104 (2015–2016) |
| Language: | English |
| Pages: | 96 |
| Keywords: | articulatory features, candidate expansion, prosody adjustment, spontaneous speech synthesis, speech parameter overlapping and smoothing algorithm |
Research on speech synthesis often faces two conflicting issues: fast and inexpensive system construction that yields low speech quality from insufficient data, or time-consuming and labor-intensive preparation of a large database for decent speech quality. The main goal of this dissertation is to develop a speech synthesis system that generates personalized, natural-sounding speech from a small-sized corpus, striking a compromise between data preparation effort and speech quality.
First, to meet the demand for a precisely segmented corpus in high-quality speech synthesis, this study proposes a speech segmentation algorithm that segments the speech corpus automatically. In this method, articulatory features are first adopted to find candidate segmentation points. A minimum description length (MDL)-based segmentation algorithm then decides the optimal phone boundaries. Finally, the detected boundaries are used to refine the segmentation results obtained from Viterbi-based forced alignment, yielding more precise segmentation, especially for spontaneous speech. Experimental results show that the proposed algorithm improves on the results of the Viterbi-based approach.
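To make the refinement step concrete, the sketch below snaps each forced-alignment boundary to a nearby articulatory-feature candidate point when an MDL-style split score favors it. This is a minimal reading of the approach, not the dissertation's implementation: the diagonal-Gaussian description length, the frame tolerance `tol`, and all function names are assumptions introduced here for illustration.

```python
import numpy as np

def mdl_gain(frames, t):
    """Description-length gain of splitting `frames` at frame index t,
    modelling each side with one diagonal Gaussian (BIC/MDL-style score)."""
    def dl(x):
        var = np.var(x, axis=0) + 1e-8          # per-dimension variance
        return 0.5 * len(x) * np.sum(np.log(var))
    n, d = frames.shape
    penalty = 0.5 * (2 * d) * np.log(n)          # cost of the extra model
    return dl(frames) - (dl(frames[:t]) + dl(frames[t:])) - penalty

def refine_boundary(frames, aligned_t, candidates, tol=5):
    """Snap a Viterbi-aligned boundary to the nearby articulatory-feature
    candidate point that maximizes the MDL gain; keep it otherwise."""
    nearby = [c for c in candidates
              if abs(c - aligned_t) <= tol and 1 < c < len(frames) - 1]
    if not nearby:
        return aligned_t                         # no candidate close enough
    scores = [mdl_gain(frames, c) for c in nearby]
    best = int(np.argmax(scores))
    return nearby[best] if scores[best] > 0 else aligned_t
```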
On the basis of the small corpus, we then propose a hybrid speech synthesis technique comprising candidate expansion, two-level unit selection, and prosodic-word-level prosody adjustment. Candidate expansion retrieves potential units that are unlikely to be retrieved using linguistic features alone. The two-level unit selection mechanism selects the optimal unit sequence from the expanded candidates by considering both the phone and prosodic-word levels. Prosodic-word-level prosody adjustment verifies the prosodic parameters of each syllable in a prosodic word against the statistics of the speech corpus, and adjusts the prosody of any syllable that fails verification based on the output of statistical parametric speech synthesis. Experimental results show that the proposed method generates high-quality, natural synthesized speech from a small corpus.
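The selection step can be pictured as a Viterbi search over candidate units, as in the sketch below. It collapses the phone and prosodic-word levels into a single pass and leaves `target_cost`, `concat_cost`, and the candidate-expansion step as placeholders, so it illustrates only the search structure, not the cost functions or two-level constraints actually used in the thesis.

```python
import numpy as np

def select_units(targets, candidates, target_cost, concat_cost, w=1.0):
    """Viterbi search: pick one candidate unit per target position,
    minimizing target costs plus weighted concatenation costs.
    targets: list of T target specifications.
    candidates: list of T lists of candidate units (after expansion)."""
    T = len(targets)
    cost = [np.array([target_cost(targets[0], u) for u in candidates[0]])]
    back = [None]
    for t in range(1, T):
        tc = np.array([target_cost(targets[t], u) for u in candidates[t]])
        cc = np.array([[concat_cost(p, u) for u in candidates[t]]
                       for p in candidates[t - 1]])
        total = cost[-1][:, None] + w * cc + tc[None, :]   # (prev, cur)
        back.append(total.argmin(axis=0))    # best predecessor per unit
        cost.append(total.min(axis=0))
    # Trace back the lowest-cost path through the candidate lattice.
    path = [int(cost[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [candidates[t][i] for t, i in enumerate(reversed(path))]
```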
For listener perception, personalization and spontaneity are as important as naturalness. Therefore, an approach to generating personalized spontaneous speech is further proposed. In this method, a target speaker's voice model is first obtained by adapting an average voice model trained in advance. Modulation spectrum-based postfiltering then further improves the personalization while alleviating the over-smoothing problem of synthesized speech. Finally, to generate fluent speech, an algorithm that overlaps and smooths two consecutive speech segments is proposed to improve the spontaneity of the generated speech. Experimental results show that the proposed method can effectively model the target speaker's fluent-transition parameters, including the overlap-length ratio and the duration of spontaneous speech, and use these parameters to generate fluent speech.
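The overlapping-and-smoothing idea can be sketched as a parameter-domain crossfade between consecutive segments, with the overlap-length ratio controlling the blend region. The linear fade and the fixed `overlap_ratio` default below are illustrative assumptions; the dissertation estimates speaker-specific transition parameters rather than using a constant.

```python
import numpy as np

def overlap_and_smooth(seg_a, seg_b, overlap_ratio=0.2):
    """Crossfade the tail of seg_a into the head of seg_b.
    seg_a, seg_b: (frames, dims) speech parameter trajectories,
    e.g. mel-cepstral sequences for two consecutive prosodic words."""
    n = int(min(len(seg_a), len(seg_b)) * overlap_ratio)
    if n == 0:
        return np.vstack([seg_a, seg_b])       # too short to overlap
    fade = np.linspace(0.0, 1.0, n)[:, None]   # linear crossfade weights
    blended = (1.0 - fade) * seg_a[-n:] + fade * seg_b[:n]
    return np.vstack([seg_a[:-n], blended, seg_b[n:]])
```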
Full text available on campus from 2021-07-10.