
Author: Chen, Jiun-Fu (陳俊甫)
Title: Unit Selection for Corpus-Based Emotional Speech Synthesis Using PCFG and LSI (應用機率式句法結構與隱含式語意索引於情緒語音合成之單元選取)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2004
Graduation Academic Year: 92 (2003–04)
Language: Chinese
Number of Pages: 59
Keywords (Chinese): Emotional Speech Synthesis (情緒語音合成)
Keywords (English): Emotional Speech Synthesis, TTS
Speech plays an important role in human-computer interfaces. With the maturity of computing, telecommunication, networking, and information technologies, speech technology can now be integrated into and applied to many aspects of daily life. However, traditional computer speech lacks emotional characteristics, which severely degrades human-computer interaction. The main goal of this research is therefore to enable the computer to synthesize speech that carries different emotional characteristics.
This thesis addresses the main problems of corpus-based emotional speech synthesis through four research points: 1) designing a balanced corpus for the different emotions and generating the basic synthesis units with automatic unit segmentation; 2) proposing a modified variable-length unit mechanism that introduces a probabilistic context-free grammar (PCFG) model to decide unit length and unit suitability; 3) in contrast to the usual acoustic distortion measures, applying latent semantic indexing (LSI) to measure the semantic distortion of a unit; and 4) selecting units with dynamic programming and automatic break prediction to synthesize the emotional speech.
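To make the grammar component of point 2 concrete: the PCFG machinery rests on the standard inside probability, i.e., the probability that a nonterminal derives a given word span. Below is a minimal sketch of the CYK-style inside recursion for a grammar in Chomsky normal form; the rule-table encoding is an illustrative assumption, not the grammar actually used in the thesis.

```python
from collections import defaultdict

def inside_probabilities(words, lexical_rules, binary_rules):
    """CYK-style inside probabilities for a PCFG in Chomsky normal form.

    words:         the observed word sequence w_0 .. w_{n-1}
    lexical_rules: {(A, word): P(A -> word)}
    binary_rules:  {(A, B, C): P(A -> B C)}
    Returns beta, where beta[(i, j)][A] = P(A derives words[i..j]).
    """
    n = len(words)
    beta = defaultdict(dict)
    # Base case: spans of length one are covered by lexical rules.
    for i, w in enumerate(words):
        for (A, word), p in lexical_rules.items():
            if word == w:
                beta[(i, i)][A] = beta[(i, i)].get(A, 0.0) + p
    # Recursion: a span is split into two adjacent sub-spans B and C.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for (A, B, C), p in binary_rules.items():
                    pb = beta[(i, k)].get(B, 0.0)
                    pc = beta[(k + 1, j)].get(C, 0.0)
                    if pb and pc:
                        beta[(i, j)][A] = beta[(i, j)].get(A, 0.0) + p * pb * pc
    return beta
```

A span probability of this kind, beta[(i, j)][A], indicates how self-contained the text of a candidate unit is under the grammar, which is one plausible way to realize the unit-suitability score the abstract describes.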
In the experiments, the accuracy of Chinese break prediction is evaluated first; the synthesized speech is then compared with natural speech in terms of acoustic parameters. Subjective evaluations, comprising a naturalness MOS test, an emotion identification test, and an intelligibility test, show that the proposed method performs well in both naturalness and emotional expression.

Speech plays an important role in human-computer interaction. With advances in telecommunications, the Internet, and computer science, speech can be integrated with other technologies in daily life to make communication more convenient. However, the lack of emotional expression in traditional synthesized speech makes human-computer interaction less natural and engaging. A text-to-speech (TTS) system with emotion is therefore developed in this research.
In this thesis, a variable-length unit selection method using a probabilistic context-free grammar (PCFG) and latent semantic indexing (LSI) is proposed for text-to-emotional-speech synthesis. Specifically, this study focuses on: 1) designing an emotionally balanced corpus and generating basic synthesis units with automatic speech segmentation and verification; 2) proposing a PCFG-based variable-length unit selection method that decides the length and suitability of synthesis units; 3) in addition to the acoustic cost, adapting LSI to estimate a syntactic cost that measures the similarity of units across different syntactic structures; and 4) applying these techniques to build a text-to-emotional-speech system.
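As a concrete illustration of point 3, latent semantic indexing reduces a co-occurrence matrix to a low-rank latent space and compares items by cosine similarity there. The sketch below is a minimal numpy version; the matrix construction and the rank k are illustrative assumptions, and the thesis applies the same idea to vectorized syntactic structures rather than plain documents.

```python
import numpy as np

def lsi_cosine(term_doc: np.ndarray, k: int, i: int, j: int) -> float:
    """Compare documents i and j in a k-dimensional latent semantic space.

    term_doc: terms x documents matrix (e.g. raw counts or tf-idf).
    """
    # Truncated SVD: term_doc ~ U_k S_k V_k^T.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Coordinates of each document in the latent space (documents x k).
    docs_k = (np.diag(s[:k]) @ Vt[:k]).T
    a, b = docs_k[i], docs_k[j]
    # Cosine similarity, with a small epsilon guarding against zero vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```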
To evaluate the proposed approach, speech samples covering 4 emotions × 10 sentences are used. Subjective tests on the proposed approach and a baseline system show that the proposed unit selection scheme obtains a higher MOS score and better accuracy in the emotion identification and intelligibility tests.
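The selection step shared by both abstracts (dynamic programming over candidate units) amounts to a shortest-path search through a unit lattice. The following Viterbi-style sketch uses generic target_cost and concat_cost callbacks as placeholders; in the thesis these would combine the acoustic distortion with the PCFG and LSI scores above.

```python
def select_units(candidates, target_cost, concat_cost):
    """Minimum-cost path through a lattice of candidate synthesis units.

    candidates[t]     : list of candidate units for target position t
    target_cost(t, u) : how well unit u fits target position t
    concat_cost(v, u) : how smoothly unit v joins onto unit u
    """
    T = len(candidates)
    best = [[target_cost(0, u) for u in candidates[0]]]  # cumulative costs
    back = [[None] * len(candidates[0])]                 # backpointers
    for t in range(1, T):
        row, ptr = [], []
        for u in candidates[t]:
            # Best predecessor among all candidates at position t - 1.
            costs = [best[t - 1][i] + concat_cost(v, u)
                     for i, v in enumerate(candidates[t - 1])]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[i_min] + target_cost(t, u))
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)
    # Trace the optimal path back from the cheapest final candidate.
    i = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [candidates[-1][i]]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(candidates[t - 1][i])
    return path[::-1]
```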

    Table of Contents
    Chinese Abstract
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Preface
        1.1.1 Research Motivation and Goals
        1.1.2 Research Background
      1.2 Literature Review
      1.3 Research Method
        1.3.1 System Architecture
        1.3.2 Balanced Corpus and Synthesis Unit Generation
        1.3.3 Synthesis Unit Selection and Speech Synthesis Mechanism
      1.4 Chapter Overview
    Chapter 2  Balanced Corpus and Synthesis Unit Generation
      2.1 Design and Construction of the Balanced Corpus
        2.1.1 Emotional Sentence Revision
        2.1.2 Sentence Scoring and Selection Procedure
        2.1.3 Textual Properties of the Balanced Corpus
        2.1.4 Recording of the Speech Corpus
      2.2 Automatic Unit Segmentation and Adjustment
        2.2.1 Automatic Unit Segmentation
        2.2.2 Unit Verification
    Chapter 3  LSI-Based Syntactic Structure Distance and the Variable-Length Unit Synthesis Mechanism
      3.1 Variable-Length Unit Synthesis Mechanism
      3.2 Probabilistic Grammar Model for Chinese
        3.2.1 Inside Probability
        3.2.2 Outside Probability
        3.2.3 Unit Joint Inside Probability
      3.3 Syntactic Structure Distance
        3.3.1 Vectorization of Parse Trees
        3.3.2 Chinese Syntactic Structure Distance
      3.4 Emotional Speech Synthesis System
        3.4.1 Acoustic Distortion
        3.4.2 Best-Path Search over the Sentence-Level Unit Lattice
        3.4.3 Chinese Break Prediction
    Chapter 4  Experimental Results and Discussion
      4.1 Chinese Break Prediction
      4.2 Comparison of Synthesized Speech Parameters
      4.3 Subjective Evaluation and Listening Tests
    Chapter 5  Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
    References


    Full text released to the public (on campus and off campus): 2005-07-07