| Graduate student: | 夏啟峻 Hsia, Chi-chun | 
|---|---|
| Thesis title: | 語音合成中合成單元選取及語音轉換之研究 A Study on Synthesis Unit Selection and Voice Conversion for Text-to-Speech Synthesis | 
| Advisor: | 吳宗憲 Wu, Chung-hsien | 
| Degree: | Doctor | 
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering | 
| Year of publication: | 2008 | 
| Academic year of graduation: | 96 (ROC calendar, i.e. 2007-2008) | 
| Language: | English | 
| Number of pages: | 100 | 
| Keywords (Chinese): | 情緒語音合成, 單元挑選, 語音轉換 | 
| Keywords (English): | expressive speech synthesis, unit selection, voice conversion | 
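Speech conveys textual messages and at the same time expresses the speaker's emotional state and communicative intent; it is the most basic form of human communication. Speech will be a mainstream of future technology, and the development of speech technology is key to the next generation of human-machine interfaces. A text-to-speech system that can express emotional variation in speech can further improve the communication of information to users. Using unit concatenation, many corpus-based speech synthesis systems can generate emotional computer speech; however, a large speech corpus has to be collected for every target emotion, and this cost limits the development and application of the technology. Voice conversion is a technique that converts neutral speech into speech of various emotions by transforming spectral and prosodic parameters, and it requires only a small emotional parallel corpus. Such a technique can serve as a post-processing module for a neutral-styled speech synthesis system to achieve emotionally rich speech synthesis.
	The purpose of this study is to improve the naturalness of synthesized speech and the accuracy with which emotion and communicative intent are conveyed in human-machine interfaces. To this end, the two main topics of this dissertation are the development of a neutral-styled speech synthesis system and of an emotional voice conversion module. The theoretical foundations include linguistics and phonetics, pattern recognition, language modeling, signal processing, and multivariate analysis. The specific objectives are: 1) to design a small phonetically balanced emotional parallel corpus for developing the emotional voice conversion module, and a large neutral-styled corpus as material for studying synthesis unit selection; 2) to develop a variable-length unit selection method and the estimation of syntactic structure distance; 3) to develop a conversion function clustering algorithm and the Gaussian mixture bi-gram model (GMBM) to improve spectral conversion; 4) to develop a hierarchical prosody conversion framework and regression-based clustering to reduce the error of prosodic parameter conversion.
	Experiments with statistical testing evaluate the performance of the system that combines emotional voice conversion with neutral-styled speech synthesis. They cover the evaluation of unit selection methods and distance estimation functions, and their effect on prosodic parameters; an analysis, through subjective and objective tests, of the effect of multiple conversion functions, with a comparison of the Gaussian mixture model and the Gaussian mixture bi-gram model; and an evaluation of hierarchical prosody conversion and the regression-based clustering algorithm in terms of conversion error. The results show that both the naturalness and the emotional expressiveness of the synthesized speech improved, and that the proposed strategies effectively reduce the parameter error of emotional voice conversion.
	Future work could analyze how syntactic or prosodic structure influences spectral and prosodic parameters, especially how that influence differs across emotions. The analyses and results of this study can serve as a reference for linguists, paralinguists, and computer scientists researching human-machine interface technology.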
Speech is the most fundamental means of human communication and conveys a speaker's linguistic and paralinguistic information simultaneously. Speech technology is key to the development of next-generation human-machine interaction. A text-to-speech (TTS) synthesis system that can express emotion can be an effective communication tool for users. Several corpus-based TTS systems have been proposed for emotional speech synthesis using synthesis unit selection and concatenation. However, the need for a large speech database hinders the development and application of such systems. Voice conversion transforms the spectral and prosodic features of neutral speech into those of expressive speech using a small parallel speech database, and it can be adopted as a post-processing module of a neutral-styled TTS system for expressive speech synthesis.
	The purpose of this study is to improve the naturalness of synthetic speech and the accuracy with which emotion and communicative intent are conveyed in human-machine interaction interfaces. To achieve this goal, the dissertation focuses on two issues: neutral-styled TTS synthesis and emotional voice conversion. Theories in linguistics and phonetics, pattern recognition, language modeling, signal processing, and multivariate analysis provide the essential principles for this study. More specifically, the research aimed to: 1) develop a set of small, phonetically balanced parallel speech databases for spectral and prosodic conversion, and a large neutral-styled speech database for concatenative TTS synthesis; 2) develop a neutral-styled concatenative TTS system using variable-length unit selection and a structural syntactic cost; 3) develop a spectral conversion method based on multiple conversion function clustering and the Gaussian mixture bi-gram model (GMBM); and 4) develop hierarchical prosody conversion using regression-based clustering.
	Experiments with statistical hypothesis testing were conducted to evaluate the proposed approach, in which a neutral-styled TTS system is followed by spectral and prosody conversion modules for emotional speech synthesis. For neutral-styled TTS, experiments were conducted on unit selection schemes and on cost functions for decoding the best unit sequence. The effects on prosodic features were also investigated. For spectral conversion using multiple functions, objective and subjective tests were used to measure the improvement from conversion function clustering. The Gaussian mixture bi-gram model was also compared with the Gaussian mixture model. For prosody conversion, experiments tested the performance of hierarchical prosody modeling and regression-based clustering. Experimental results show that the proposed approach gives an encouraging improvement in both the naturalness of synthetic speech and the emotion identification rate.
	Future work should further investigate the relation between prosodic structure and the prosodic/spectral features of emotional speech. For real applications involving spontaneous speech, further effort is needed to design and collect speech databases from natural conversation. The outcomes are expected to provide helpful information to linguists, paralinguists, and computer scientists for the development of more effective and livelier human-machine interaction.
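
As a rough illustration of the unit selection step the abstract refers to (choosing the best unit sequence under cost functions), the following is a minimal sketch of the generic target-cost plus concatenation-cost search commonly used in concatenative synthesis. It is not the dissertation's implementation: the function name `select_units`, the cost callbacks, and the toy pitch values are all hypothetical, and the variable-length units and structural syntactic cost described above are not modeled.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")  # a candidate unit; its representation is an assumption


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    concat_cost: Callable[[U, U], float],
) -> List[U]:
    """Pick one candidate per position so that the summed target and
    concatenation costs are minimal (dynamic-programming / Viterbi search)."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # best[j]: lowest cost of any path ending in candidate j of the current position
    best = [target_cost(0, u) for u in candidates[0]]
    back: List[List[int]] = []  # back[t-1][j]: best predecessor index for position t

    for t in range(1, len(candidates)):
        prev_units = candidates[t - 1]
        new_best: List[float] = []
        pointers: List[int] = []
        for u in candidates[t]:
            costs = [best[i] + concat_cost(p, u) for i, p in enumerate(prev_units)]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(t, u))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate to the first position.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]


if __name__ == "__main__":
    # Toy example: units are pitch values (hypothetical data).
    targets = [100, 120, 110]
    cands = [[90, 105], [118, 150], [100, 111]]
    chosen = select_units(
        cands,
        target_cost=lambda t, u: abs(targets[t] - u),   # distance to desired pitch
        concat_cost=lambda a, b: 0.1 * abs(a - b),       # penalty for pitch jumps
    )
    print(chosen)
```

In this sketch the two cost functions are passed in as callbacks, so a different target or join measure can be substituted without changing the search itself.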