| Graduate Student: | 陳俊甫 Chen, Jiun-Fu |
|---|---|
| Thesis Title: | 應用機率式句法結構與隱含式語意索引於情緒語音合成之單元選取 (Unit Selection for Corpus-Based Emotional Speech Synthesis Using PCFG and LSI) |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of Publication: | 2004 |
| Graduation Academic Year: | 92 |
| Language: | Chinese |
| Number of Pages: | 59 |
| Chinese Keywords: | 情緒 (emotion), 語音合成 (speech synthesis) |
| English Keywords: | Emotional Speech Synthesis, TTS |
Speech plays an important role in human-computer interfaces. With the maturity of computer, telecommunication, networking, and information technologies, speech technologies can now be integrated into many aspects of daily life. However, conventional computer-synthesized speech lacks emotional characteristics, which severely limits human-computer interaction. The main goal of this research is therefore to enable the computer to synthesize speech with different emotional characteristics.
This thesis addresses the main problems of corpus-based emotional speech synthesis in four parts: 1) a balanced corpus is designed for the different emotions, and basic synthesis units are generated with automatic unit segmentation; 2) a modified variable-length unit mechanism is proposed, incorporating a probabilistic context-free grammar (PCFG) to determine unit length and unit suitability; 3) in addition to the usual acoustic unit cost, the semantic distortion of units is measured with latent semantic indexing (LSI); 4) finally, units are selected and emotional speech is synthesized using dynamic programming and automatic break prediction.
In the experiments, the accuracy of Chinese break prediction is evaluated first; the parameter differences between synthesized and natural speech are then examined. Subjective evaluations, consisting of a naturalness MOS test, an emotion identification test, and an intelligibility test, show that the proposed method performs well in both naturalness and emotional expressiveness.
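The dynamic-programming unit selection described above can be sketched in a few lines. The cost functions and candidate inventory below are hypothetical placeholders for illustration, not the thesis's actual acoustic/semantic costs:

```python
# Minimal Viterbi-style unit selection sketch. The toy numeric "units" and
# cost functions are invented; a real system would use acoustic and
# semantic costs over speech units.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets: list of target specs; candidates[i]: units for target i.
    Returns the candidate sequence minimizing total target + concatenation cost."""
    n = len(targets)
    # best[i][j] = (cumulative cost, backpointer) for picking candidates[i][j]
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # backtrack from the cheapest final state
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy example: pick numbers close to the targets with smooth ~1.0 steps.
targets = [1.0, 2.0, 3.0]
candidates = [[0.8, 1.5], [1.9, 2.6], [2.7, 3.4]]
tc = lambda t, u: abs(t - u)
cc = lambda a, b: abs(b - a - 1.0)
print(select_units(targets, candidates, tc, cc))  # → [0.8, 1.9, 2.7]
```

The same lattice search applies unchanged when the "units" are speech segments and the costs combine acoustic and LSI-based semantic distortion.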
Speech plays an important role in human-computer interaction. With advances in telecommunications, the Internet, and computer science, speech can be integrated with other technologies in daily life to make communication more convenient and natural. However, the lack of emotional expression in traditional synthesized speech makes human-computer interaction less natural and engaging. Therefore, a text-to-speech (TTS) system with emotion is developed in this research.
In this thesis, a variable-length unit selection method using a probabilistic context-free grammar (PCFG) and latent semantic indexing (LSI) is proposed for text-to-emotional-speech synthesis. More specifically, this study focuses on: 1) designing an emotionally balanced corpus and generating basic synthesis units with automatic speech segmentation and verification; 2) proposing a variable-length unit selection method based on the PCFG to decide the length and suitability of synthesis units; 3) in addition to the cost in acoustic features, adapting LSI to estimate a syntactic cost that measures the similarity of units across different syntactic structures; 4) finally, applying the techniques above to build a text-to-emotional-speech system.
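The LSI cost in point 3 rests on projecting unit contexts into a low-rank latent space and comparing them there. The following sketch illustrates the idea with an invented term-by-context matrix; the actual features and data in the thesis differ:

```python
import numpy as np

# Hypothetical term-by-context matrix: rows = terms, columns = the textual
# contexts of candidate units (made-up counts, for illustration only).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

# LSI: a truncated SVD projects each context into a k-dimensional latent space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # latent dimensions kept
ctx = (np.diag(s[:k]) @ Vt[:k]).T    # each row: one context in latent space

def lsi_similarity(i, j):
    """Cosine similarity of contexts i and j in the latent space; a semantic
    distortion/cost can then be defined as 1 - similarity."""
    a, b = ctx[i], ctx[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(lsi_similarity(0, 1))
```

A candidate unit whose original context is close (in this latent space) to the target context incurs a low semantic cost during selection.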
To evaluate the proposed approach, 4 emotions × 10 sentences are used as experimental speech samples. Subjective tests comparing the proposed approach with a baseline system show that the proposed unit selection scheme achieves a better MOS score and higher accuracy in the emotion identification and intelligibility tests.
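The PCFG-based suitability scoring in point 2 amounts to scoring the syntactic subtree a candidate unit spans: the probability of a derivation is the product of the probabilities of the rules it uses. The grammar and tree below are toy examples, not the thesis's grammar:

```python
# Rule probabilities P(rhs | lhs), invented for illustration.
PCFG = {
    ("S",  ("NP", "VP")): 0.9,
    ("NP", ("N",)):       0.6,
    ("VP", ("V", "NP")):  0.5,
}

def tree_prob(tree):
    """tree: (label, child1, child2, ...); a bare (label,) tuple is a leaf.
    Returns the product of the probabilities of all rules in the derivation."""
    label, *children = tree
    if not children:
        return 1.0                      # terminal: no rule applied
    rhs = tuple(c[0] for c in children)
    p = PCFG[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# S -> NP VP, NP -> N, VP -> V NP, NP -> N: 0.9 * 0.6 * 0.5 * 0.6
t = ("S", ("NP", ("N",)), ("VP", ("V",), ("NP", ("N",))))
print(tree_prob(t))
```

A longer candidate unit whose span forms a high-probability, well-formed subtree would be preferred over one that cuts across constituent boundaries.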