| Graduate student: | 吳仲耘 Wu, Jung-yun |
|---|---|
| Thesis title: | Pitch Prediction Using Prosody Hierarchy and Dynamic Features for HMM-based Mandarin Speech Synthesis (應用韻律階層及動態參數之音高預測在基於HMM之中文語音合成器) |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 76 |
| Keywords (Chinese): | 語音合成、音高、動態參數、韻律階層 |
| Keywords (English): | Pitch, Prosody Hierarchy, Dynamic Feature, Speech Synthesis |
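Prosody is an important factor in the naturalness of speech, and pitch in particular carries rich prosodic information. Speech synthesizers based on hidden Markov models (HMMs) have in recent years become able to produce fluent, intelligible speech, and their portability and adaptability are further advantages, but the naturalness of the speech still needs improvement. This study therefore takes a “hierarchical prosodic structure” as the basis for pitch prediction, with the prosodic units at each layer taking “dynamic feature characteristics” into account. The aim is to overcome the weakness of conventional prosody models, which synthesize from small units, and to use dynamic features to preserve temporal correlation so that units connect more naturally, thereby improving the naturalness of HMM-based synthesized speech.
In this thesis, the pitch prediction model that applies the hierarchical prosodic structure and dynamic features is divided into four research points: (1) prediction and generation of the hierarchical prosodic structure; (2) introduction of the dynamic-feature parameter generation algorithm at each prosodic layer; (3) construction of the prosody model for each layer using classification and regression trees (CART) and hidden Markov models; (4) parameter extraction using STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram).
In the experiments, the prosodic break prediction model is first evaluated for prediction accuracy, and the pitch prediction model is then evaluated both subjectively and objectively. The results indicate that the proposed method improves the naturalness of the synthesized speech.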
Prosody is a major determinant of the naturalness of speech, and pitch is a key carrier of prosodic information. In recent years, speech synthesis based on hidden Markov models (HMMs) has been developed; it can synthesize smooth speech and has the advantages of flexibility and a small footprint. Nevertheless, there is still room for improvement in the naturalness of the synthesized speech. In our research, we take the “prosody hierarchy structure” as the basis of the pitch prediction model and apply “dynamic features” to the units of each hierarchical layer. The former describes prosodic units as supra-segmental units that occur in a hierarchical structure and reflect how the brain processes speech; the latter preserve the temporal correlation between adjacent units and yield more natural connections at each junction point. Applying this framework to an HMM-based speech synthesis system produces more natural-sounding speech.
The purpose of this thesis is to develop a pitch prediction model using the prosody hierarchy structure and dynamic features, and to investigate the resulting improvement in the naturalness of synthesized speech. More specifically, this research addresses: (1) prediction and generation of the prosody hierarchy structure; (2) dynamic features for each hierarchical layer; (3) building the pitch prediction model for each layer, using CART at the prosodic word and syllable levels and HMMs at the frame level; (4) feature analysis using STRAIGHT (Speech Transformation and Representation based on Adaptive Interpolation of weiGHTed spectrogram).
Experimental results from both subjective and objective tests, comparing the proposed approach with other systems, show that our scheme performs better than the comparison systems and generates more natural-sounding speech.
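The abstract states that dynamic features preserve the temporal correlation between adjacent units but does not spell out the computation. The following is a minimal sketch of maximum-likelihood trajectory generation with one static and one delta stream, a technique commonly used in HMM-based synthesis; it is not taken from the thesis. The function names, the delta window `0.5 * (c[t+1] - c[t-1])`, the edge clamping, and the diagonal variances are all assumptions made for illustration.

```python
def delta_matrix(T):
    """Rows compute delta_t = 0.5 * (c[t+1] - c[t-1]).
    Indices are clamped at the edges (an assumed boundary rule)."""
    D = [[0.0] * T for _ in range(T)]
    for t in range(T):
        D[t][max(t - 1, 0)] -= 0.5
        D[t][min(t + 1, T - 1)] += 0.5
    return D


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def generate_trajectory(mean_static, var_static, mean_delta, var_delta):
    """Return the static sequence c that minimizes
    sum_t (c[t] - mean_static[t])^2 / var_static[t]
    + sum_t (delta_t(c) - mean_delta[t])^2 / var_delta[t]."""
    T = len(mean_static)
    D = delta_matrix(T)
    # Normal equations: (diag(1/vs) + D^T diag(1/vd) D) c = ms/vs + D^T (md/vd)
    A = [[0.0] * T for _ in range(T)]
    b = [0.0] * T
    for t in range(T):
        A[t][t] += 1.0 / var_static[t]
        b[t] += mean_static[t] / var_static[t]
    for t in range(T):
        w = 1.0 / var_delta[t]
        for i in range(T):
            if D[t][i] == 0.0:
                continue
            b[i] += D[t][i] * w * mean_delta[t]
            for j in range(T):
                A[i][j] += D[t][i] * w * D[t][j]
    return solve(A, b)


# Hypothetical per-frame F0 statistics (Hz): a step between two units.
means = [100.0] * 3 + [200.0] * 3
smooth = generate_trajectory(means, [100.0] * 6, [0.0] * 6, [1.0] * 6)
print([round(v, 1) for v in smooth])
```

With a tight delta variance, the generated contour moves gradually across the boundary instead of jumping between the two unit means, which is the sense in which dynamic features make the junction more natural. A large delta variance leaves the static means essentially unchanged.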