
Graduate student: Chang, Chia-Hao (張家豪)
Thesis title: A Vivid Storytelling System with Nonverbal Emotional Cues Generation and Story Role Conversion (非語言情緒線索產生與故事角色轉換之生動說故事系統)
Advisor: Wang, Jhing-Fa (王駿發)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2014
Graduation academic year: 102
Language: English
Number of pages: 46
Chinese keywords: support vector machine; speech synthesis; formant-warping-based speaker conversion
Foreign-language keywords: Support vector machine (SVM), HMM-based speech synthesis, formant-warping-based voice conversion
  • In storytelling, nonverbal emotional cues such as affective prosody and nonverbal vocalizations (laughter, shouting, weeping, etc.) play a very important role in conveying emotion. In addition, role-playing elements such as speaker style can heighten the dramatic effect of a story. This thesis therefore proposes a vivid storytelling system with nonverbal emotional cue generation and story role conversion. The system consists of two parts: 1) conveying emotion through nonverbal emotional cues, and 2) converting the storyteller's voice into the voice of a specific story role. To generate the nonverbal emotional cues, we synthesize expressive speech with an HMM-based text-to-speech system and generate nonverbal emotional vocalizations with a module comprising positive/negative emotion classification and a nonverbal emotional vocalization prediction and insertion model. To produce the voices of different story roles, we use formant-warping-based speaker conversion, which avoids the need to record large amounts of expressive speech from many different speakers.
    In the experiments, the positive/negative emotion classification accuracy reaches 76.2%. For nonverbal emotional vocalization insertion, the accuracies for laughter, shouting, and weeping reach 87.5%, 92.5%, and 86.6%, respectively. In the expressive text-to-speech evaluation, the synthesis results are rated by MOS for voice quality, emotional expressiveness, and intelligibility, with scores of 3.2, 3.4, and 3.1, respectively. In the story role conversion evaluation, speaker similarity and voice quality MOS scores are 2.48 and 3.17, respectively. Finally, the overall storytelling system achieves a MOS of 4.43, demonstrating that the proposed vivid storytelling system produces more natural synthesis results and attracts more of the audience's attention.

    In this thesis, a vivid storytelling system is proposed to immerse the audience more deeply in the story and to enhance the interest of storytelling. To convey emotion more vividly, nonverbal paralinguistic cues such as affective prosody and nonverbal vocalizations (laughter, shouting, weeping, etc.) play an important role in storytelling. In addition, role-playing elements such as speaker style can improve the dramatic impression of the story. For this purpose, the storytelling system employs two techniques: 1) conveying emotions through nonverbal emotional cues, and 2) converting the voice of the storyteller into that of a specific story role.
    In order to generate the nonverbal emotional cues, we synthesize expressive speech with an HMM-based TTS system and insert nonverbal emotional vocalizations (NEVs) with a generation module consisting of SVM-based emotion classification and a template-based NEV prediction model. In addition, to generate the voices of different story roles, formant-warping-based speaker conversion is used to avoid the need to record a large amount of expressive speech corpora from multiple speakers (an illustrative sketch of this warping step appears after the abstract).
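    As a minimal illustration of the classification step just described, the sketch below trains a linear SVM on bag-of-words features to label sentences as positive or negative. The toy English sentences and the scikit-learn TF-IDF features are assumptions made only for this example; the thesis itself classifies story sentences using HowNet-based semantic features.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy labelled story sentences (hypothetical data, not the thesis corpus).
    sentences = [
        "The princess laughed and danced with joy.",
        "The children cheered when the dragon flew away.",
        "The old king wept alone in the dark castle.",
        "The wolf growled and the villagers ran in fear.",
    ]
    labels = ["positive", "positive", "negative", "negative"]

    # Bag-of-words features + linear SVM, standing in for the SVM classifier
    # that decides which kind of NEV template should be inserted.
    classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
    classifier.fit(sentences, labels)

    # A sentence predicted "negative" would trigger, e.g., a weeping NEV template.
    print(classifier.predict(["The boy cried when he lost his dog."]))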
    The experimental results show that our storytelling system achieves a MOS of 4.43, indicating that the proposed storytelling is close to a real storyteller apart from slight unnaturalness. Furthermore, storytelling with emotional expression and role-playing sounds more natural to the audience and captures more of the audience's attention.
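    The story role conversion referred to above rests on warping the formant structure of the spectral envelope. The following minimal Python sketch shows one way a piecewise-linear frequency warp could move source formant anchors onto a target role's anchors; the anchor frequencies, sampling rate, and dummy envelope are placeholders for illustration, not the warping function used in the thesis.

    # Illustrative sketch (not the thesis implementation) of formant-warping-based
    # spectral conversion: resample a source spectral envelope along a
    # piecewise-linear frequency warp defined by formant anchor pairs.
    import numpy as np

    def warp_spectral_envelope(env, src_anchors_hz, tgt_anchors_hz, fs=16000):
        """Return the envelope warped so source formant anchors map to target anchors."""
        n_bins = len(env)
        freqs = np.linspace(0.0, fs / 2.0, n_bins)  # bin centre frequencies (Hz)
        # Piecewise-linear warp endpoints: keep 0 Hz and the Nyquist frequency fixed.
        src = np.concatenate(([0.0], np.asarray(src_anchors_hz, float), [fs / 2.0]))
        tgt = np.concatenate(([0.0], np.asarray(tgt_anchors_hz, float), [fs / 2.0]))
        # For each output (target-frequency) bin, find the source frequency it
        # should be read from, then resample the source envelope there.
        src_freq_per_bin = np.interp(freqs, tgt, src)
        return np.interp(src_freq_per_bin, freqs, env)

    # Toy usage: shift hypothetical formant anchors upward, e.g. adult -> child-like role.
    envelope = np.exp(-np.linspace(0.0, 4.0, 257))          # dummy 257-bin envelope
    converted = warp_spectral_envelope(envelope,
                                       src_anchors_hz=[500.0, 1500.0, 2500.0],
                                       tgt_anchors_hz=[600.0, 1800.0, 3000.0])
    print(converted.shape)  # (257,)

    In a complete system such a warp would be applied frame by frame to the synthesized speech's spectral envelope, together with a pitch modification toward the target story role.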

    中文摘要 (Chinese Abstract)
    Abstract
    誌謝 (Acknowledgements)
    Contents
    Table List
    Figure List
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Previous Works and Problems
      1.4 Objectives
      1.5 Organization
    Chapter 2 System Framework and Related Toolsets
      2.1 System Overview
      2.2 HowNet Knowledge System
      2.3 HMM-based Speech Synthesis System (HTS)
        2.3.1 Feature Extraction
        2.3.2 Training of HMMs
        2.3.3 Parameter Generation from HMM
        2.3.4 MLSA Vocoder
    Chapter 3 Proposed Vivid Storytelling System
      3.1 Story Annotation
      3.2 Nonverbal Emotional Vocalization Generation
        3.2.1 Onomatopoeia Replacement
        3.2.2 Positive and Negative Emotion Classification
        3.2.3 Nonverbal Emotional Vocalization Prediction and Insertion
      3.3 The HMM-based Expressive TTS
      3.4 Story Role Conversion
        3.4.1 Story Role Spectral Conversion based on Formant Warping
        3.4.2 Story Role Pitch Modification
    Chapter 4 Experiments
      4.1 Experiment for Emotion Classification
      4.2 Experiment for NEV Prediction and Insertion
      4.3 Experiment for Expressive TTS
        4.3.1 Experiment Environment of HMMs
        4.3.2 Expressive TTS Performance Evaluation
      4.4 Experiment for Story Role Conversion
      4.5 Experiment for Proposed Vivid Storytelling System
    Chapter 5 Conclusion and Future Work
      5.1 Conclusions
      5.2 Future Works
    References

