| Graduate Student: | 郭峻成 Guo, Jun-Cheng |
|---|---|
| Thesis Title: | 應用狀態語音轉換函式於自發性語音合成中發音變異之產生 (Pronunciation Variation Generation for Spontaneous Speech Synthesis Using State-Based Voice Transformation) |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of Publication: | 2009 |
| Graduation Academic Year: | 97 (ROC calendar; 2008-2009) |
| Language: | Chinese |
| Number of Pages: | 58 |
| Chinese Keywords: | 發音變異 (pronunciation variation), 語音合成 (speech synthesis), 發音參數 (articulatory parameters), 線性轉換 (linear transformation), 隱藏式馬可夫模型 (hidden Markov model), 分類回歸樹 (classification and regression tree) |
| Foreign Keywords: | transformation, HMM, CART, articulatory feature, speech synthesis, pronunciation variation |
Pronunciation variation in natural speech is a key factor affecting the naturalness of synthesized speech. Speech synthesizers based on hidden Markov models (HMMs) have in recent years become capable of producing fluent and intelligible speech, and the portability and adaptability of such systems are major advantages; however, the naturalness of the synthesized speech remains insufficient and needs to be improved. This study therefore adopts transformation functions as the means of transforming and synthesizing pronunciation variation, and uses articulatory feature parameters to predict variation phenomena. New phone models are generated through the transformation functions, to overcome the limitation of conventional synthesis methods that rely on only a fixed set of phone models, and pronunciation variations are clustered by their acoustic characteristics using articulatory feature parameters to compensate for the shortage of training data. By generating pronunciation variation, the naturalness of HMM-based synthesized speech is improved.
This thesis addresses the application of transformation functions and pronunciation-variation prediction models through two research foci: 1) introducing transformation functions into hidden Markov models to build pronunciation variation models; 2) using classification and regression trees to predict the type of pronunciation variation.
Subjective and objective evaluations of the pronunciation variation models show that the proposed method achieves a considerable improvement in the naturalness of the synthesized speech.
Pronunciation variation plays an important role in spontaneous speech. In recent years, speech synthesis based on hidden Markov models (HMMs) has advanced to the point where it can produce smooth speech with clear pronunciation, and HMM-based systems offer the further advantages of flexibility and portability. However, traditional HMM-based synthesized speech still lacks the naturalness of spontaneous speech. In our approach, a transformation function is adopted to model pronunciation variation, and articulatory features are considered in predicting which variation occurs. The transformation function is used to generate variant phone models, addressing the shortage of phone models in previous work, while the lack of training data is alleviated by clustering acoustic variations according to articulatory features. In this way, we work toward synthesizing spontaneous speech by generating pronunciation variations.
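The abstract does not spell out the exact form of the state-based transformation function. One common realization of a state-level linear transform (as in MLLR-style adaptation) is an affine map of HMM state mean vectors estimated from paired canonical and variant statistics; the NumPy sketch below illustrates that idea under this assumption only. The function names (`estimate_affine_transform`, `transform_state_means`) and the toy data are hypothetical, not taken from the thesis.

```python
import numpy as np

def estimate_affine_transform(canonical_means, variant_means):
    """Least-squares fit of an affine transform y ~= A x + b between paired
    (N, D) arrays of canonical and variant HMM state mean vectors
    (e.g. mel-cepstral coefficients)."""
    X = np.hstack([canonical_means, np.ones((canonical_means.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, variant_means, rcond=None)  # W stacks [A^T; b^T]
    return W[:-1].T, W[-1]  # A: (D, D), b: (D,)

def transform_state_means(phone_model_means, A, b):
    """Apply the learned transform to every state mean of a canonical phone
    model, producing the means of a new 'variant' phone model."""
    return phone_model_means @ A.T + b

# Toy usage: random vectors stand in for trained HMM state means.
rng = np.random.default_rng(0)
canonical = rng.normal(size=(50, 25))          # 50 states, 25-dim features
variant = canonical + 0.1 * (canonical @ rng.normal(size=(25, 25))) + 0.05
A, b = estimate_affine_transform(canonical, variant)
variant_model_means = transform_state_means(canonical, A, b)
```

In practice the variances could be transformed analogously, and the transform would likely be estimated per cluster of states rather than globally; the sketch only shows the mean-vector case.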
This research has two aims: 1) incorporating a transformation function into the HMM framework to model pronunciation variations, and 2) predicting pronunciation variations using a classification and regression tree (CART).
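For the second aim, the abstract names CART as the predictor but does not give the feature set or toolkit. As a rough illustration only, the sketch below trains scikit-learn's DecisionTreeClassifier on invented articulatory context features (manner, place, voicing, syllable position) to predict a variation class; every feature, label, and value here is hypothetical.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: the articulatory context of a phone and the
# variation class observed for it in a spontaneous-speech corpus.
contexts = [
    # (manner, place, voicing, position in syllable)
    ("stop",      "alveolar", "voiceless", "onset"),
    ("nasal",     "bilabial", "voiced",    "coda"),
    ("fricative", "velar",    "voiceless", "onset"),
    ("stop",      "alveolar", "voiceless", "coda"),
]
labels = ["canonical", "reduced", "canonical", "contracted"]

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(contexts)

cart = DecisionTreeClassifier(max_depth=4, random_state=0)
cart.fit(X, labels)

# Predict the variation class for an unseen articulatory context.
query = [("nasal", "alveolar", "voiced", "coda")]
print(cart.predict(encoder.transform(query)))
```

A real system would train on corpus-derived contexts and would typically attach the predicted variation class to each phone before selecting or transforming its HMM; this snippet only shows the shape of such a predictor.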
Objective and subjective tests were used to evaluate the performance of our approach. The experimental results showed that the proposed method achieved a significant improvement in the spontaneity of the synthesized speech.