
Graduate Student: 黃喻豐 (Huang, Yu-Fong)
Thesis Title: 利用多維尺度空間轉換之情緒控制向量產生方法之表達性語音合成 (Expressive Speech Synthesis Based on Emotion Control Vector Generation Using MDS-based Space Transformation)
Advisor: 吳宗憲 (Wu, Chung-Hsien)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2016
Graduation Academic Year: 104 (2015–2016)
Language: English
Number of Pages: 48
Keywords (Chinese): speech synthesis, control vector generation, emotion control vector
Keywords (English): speech synthesis, emotion control, control vector generation
中文摘要 (Chinese Abstract)
    Speech is an important part of the human-machine interaction interface. Speech synthesizers based on hidden Markov models (HMMs) can already produce stable and fluent speech, and recent research has focused on increasing the variability and expressivity of synthesized speech.
    In control-vector-based expressive speech synthesis, when users want to synthesize speech with an arbitrary emotion of their own choosing, it is very difficult for them to precisely define the emotion/style control vector in the categorical emotion space. This thesis applies the psychological arousal-valence (AV) space to an expressive speech synthesis framework based on multiple-regression hidden semi-Markov models (MRHSMMs).
    In this study, the user describes a specific emotion by specifying arousal and valence values in the AV space. Multidimensional scaling (MDS) is adopted to project the categorical emotion space and the AV emotion space onto their corresponding orthogonal coordinate systems, and a transformation approach is proposed to map the arousal and valence values to the emotion control vector in the categorical emotion space for MRHSMM-based expressive speech synthesis. In the synthesis phase, given the input text and the desired emotion, the emotion is converted into an emotion control vector, and the MRHSMMs are used to generate speech with the desired emotion.
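    The abstract does not give implementation details for the projection step. Purely as a rough illustration, the Python sketch below places a set of emotion categories on a two-dimensional orthogonal coordinate system from a pairwise dissimilarity matrix using classical MDS (double centering followed by eigendecomposition). The emotion labels and dissimilarity values are hypothetical placeholders, not data from the thesis.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Project items with pairwise dissimilarities D (n x n) onto a
    low-dimensional orthogonal coordinate system via classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]      # keep the largest components
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))  # guard against tiny negatives
    return eigvecs[:, order] * scale              # n x dims coordinates

# Hypothetical dissimilarities between four emotion categories; real values
# would come from, e.g., distances between emotion-dependent acoustic models.
emotions = ["neutral", "happy", "angry", "sad"]
D = np.array([[0.0, 1.0, 1.2, 0.9],
              [1.0, 0.0, 0.8, 1.5],
              [1.2, 0.8, 0.0, 1.4],
              [0.9, 1.5, 1.4, 0.0]])

coords = classical_mds(D, dims=2)                 # 2-D orthogonal coordinates
for name, (x, y) in zip(emotions, coords):
    print(f"{name:8s} -> ({x:+.3f}, {y:+.3f})")
```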
    In the experiments, subjective evaluations of the emotion parameter specification and of the synthesized speech were conducted with human participants. The results show that the proposed method helps users determine the emotion parameters needed for expressive speech synthesis easily and accurately.

Abstract
    In human-machine interaction, speech-related techniques provide users a convenient way to interact with computers. Speech synthesis based on hidden Markov models (HMMs) has been well developed and can synthesize stable and smooth speech, and in recent years the demand for synthetic speech with more variability and expressivity has been increasing.
    In control-vector-based expressive speech synthesis, the emotion/style control vector defined in the categorical (CAT) emotion space is difficult for the user to specify precisely when synthesizing speech with a desired emotion/style.
    This thesis applies the arousal-valence (AV) space to a multiple-regression hidden semi-Markov model (MRHSMM)-based synthesis framework for expressive speech synthesis. In this study, the user can designate a specific emotion by defining arousal and valence values in the AV space. The multidimensional scaling (MDS) method is adopted to project the AV emotion space and the categorical (CAT) emotion space onto their corresponding orthogonal coordinate systems. A transformation approach is then proposed to transform the AV values into the emotion control vector in the CAT emotion space for MRHSMM-based expressive speech synthesis. In the synthesis phase, given the input text and the desired emotion, speech with that emotion is generated from the MRHSMMs using the transformed emotion control vector.
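    The transformation itself is not spelled out in the abstract. As a loose sketch under assumed conventions, the fragment below shows one simple way an (arousal, valence) point could be turned into per-category weights and then used as the control vector that drives a multiple-regression (HS)MM state mean, where the mean is an affine function of the augmented control vector, i.e. mu = H [1, v^T]^T as in multiple-regression HMM/HSMM formulations. The AV anchor coordinates, the softmax weighting, and the regression matrix H are all hypothetical; the thesis instead derives its mapping through the MDS coordinates of the two spaces.

```python
import numpy as np

def av_to_control_vector(av_point, anchors, temperature=0.5):
    """Map a user-specified (arousal, valence) point to per-category weights.
    The softmax over negative distances is only an illustrative stand-in for
    the thesis's MDS-based transformation."""
    d = np.linalg.norm(anchors - av_point, axis=1)    # distance to each category
    w = np.exp(-d / temperature)
    return w / w.sum()                                # weights sum to 1

def mrhsmm_state_mean(H, control_vector):
    """Multiple-regression (HS)MM state mean: mu = H @ [1, v^T]^T,
    a bias column plus a linear combination driven by the control vector."""
    xi = np.concatenate(([1.0], control_vector))      # augmented control vector
    return H @ xi

# Hypothetical AV anchors for (neutral, happy, angry, sad), in [-1, 1]^2.
anchors = np.array([[ 0.0,  0.0],    # neutral: low arousal, neutral valence
                    [ 0.6,  0.7],    # happy:   high arousal, positive valence
                    [ 0.8, -0.6],    # angry:   high arousal, negative valence
                    [-0.5, -0.6]])   # sad:     low arousal, negative valence

v = av_to_control_vector(np.array([0.5, 0.6]), anchors)  # a fairly happy target
H = np.random.randn(3, 1 + len(anchors))                 # toy 3-dim regression matrix
print("control vector:", np.round(v, 3))
print("state mean    :", np.round(mrhsmm_state_mean(H, v), 3))
```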
    In the experiments, participants were invited to perform subjective tests on the emotion parameters and the quality of the synthesized speech. The experimental results show that the proposed method helps the user easily and precisely determine the desired emotion for expressive speech synthesis.

    Table of Contents
    Chinese Abstract (中文摘要)
    Abstract
    Acknowledgements (誌謝)
    Chapter 1. Introduction
        1.1 Background
        1.2 Motivation
        1.3 Proposed Ideas
        1.4 Organization
    Chapter 2. Related Work
        2.1 HMM-based Speech Synthesis System
            2.1.1 HMM Speech Synthesis Technique
            2.1.2 Mandarin Phoneme Label
            2.1.3 Context Preprocessor
            2.1.4 The Question Set of Decision Tree
            2.1.5 STRAIGHT Parameter Extraction
        2.2 Expressive Speech Synthesis
            2.2.1 Style Modeling
            2.2.2 Style Adaptation
            2.2.3 Style Interpolation
            2.2.4 Style Control
        2.3 Multidimensional Scaling
        2.4 Emotion Space
    Chapter 3. Expressive Speech Synthesis Using Emotion Control Vector Generation
        3.1 Overview of the Proposed System
        3.2 Multiple Regression HMMs
        3.3 Control Vector Generation Based on MDS
            3.3.1 Emotion Space Defined by the Psychologist
            3.3.2 Class-Based Emotional Space
            3.3.3 Emotional Control Vector Generation
    Chapter 4. Experimental Results
        4.1 Corpus
        4.2 System Evaluation and Comparisons
            4.2.1 System Friendliness Test
            4.2.2 Emotion Perception Test
            4.2.3 Reproduction Test
            4.2.4 Emotion Intensity Test
            4.2.5 Multiple Parameters Test
    Chapter 5. Conclusion and Future Work
        5.1 Conclusion
        5.2 Future Work
    References

    Full-text availability: on campus from 2018-07-31; off campus from 2018-07-31.