
Author: Weng, Sz-Ting (翁思婷)
Thesis title: Hierarchical Pitch Pattern Selection Based on Prosodic Structure and Fujisaki Model for Natural Speech Synthesis
Original title: 應用韻律結構和Fujisaki模型之階層式音高樣式選取於自然語音合成
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Medical Informatics
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar, 2011-2012)
Language: Chinese
Number of pages: 65
Chinese keywords: 韻律結構, Fujisaki模型, 階層式音高, 自然語音合成
English keywords: Prosodic Structure, Fujisaki Model, Hierarchical Pitch, Natural Speech Synthesis
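  • Natural prosody is an important factor in speech synthesis, carrying both communicative meaning and an individual's speaking style. Speech synthesizers based on hidden Markov models (HMMs) can now produce stable and fluent speech, and their adaptability is a further advantage, but the over-smoothing that limits naturalness still needs improvement.
    Earlier prosody generation clustered prosodic features by similar linguistic features, so identical linguistic features could not express the distinctive prosodic patterns of a particular speaker. This thesis therefore models the pitch contour variation of a sentence with the "hierarchical prosodic units" of the Fujisaki model and uses information retrieval as the prosodic unit selection method: the selection criteria include not only linguistic features but also the shape of the pitch contour, which serves as the basis for retrieving natural pitch contours from the training corpus. The hierarchical prosodic units are adopted to approximate the local and global pitch contour variation of sentences in the corpus; the pitch contours are clustered into codewords and preserved as Fujisaki commands. Codeword sequences are constructed for sentences in order to find the correspondence between the pitch patterns of the real training corpus and those of the synthesized corpus.
    In the experiments, subjective and objective evaluations show that the proposed method improves on conventional methods in the naturalness of the synthesized speech.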

    One of the goals of speech synthesis is to generate natural-sounding speech. Prosody plays an important role in conveying both communicative meaning and specific speaking styles. In recent years, speech synthesis based on the hidden Markov model (HMM) has been developed; it can synthesize stable and fluent speech and has the advantages of flexibility and a small footprint. Nevertheless, there is still room to improve on the over-smoothing problem, which reduces the naturalness of the synthesized speech.
    Previous studies of prosodic feature generation mainly focused on clustering similar linguistic information and used the clustered prosodic features to train prosody models. Such models may be unable to convey speaker-specific prosody patterns, because the linguistic cues are the same for different speakers. In this research, we adopt the idea of information retrieval and propose a hierarchical prosodic unit-selection method that combines traditional linguistic cues and the shape of the pitch contour into a query for finding natural pitch contours in the training corpus. The hierarchical prosodic units, which aim to model the local pitch contour variation and global intonation of utterances in the corpus, are clustered into codewords, and their pitch patterns are stored as modified Fujisaki commands. Codeword sequences are constructed for the utterances in the training and synthesized corpora and used to map the relation between real speech and synthesized speech.
    Subjective and objective tests comparing the proposed approach with conventional systems show that the proposed method achieves better naturalness.
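
The abstract refers to pitch patterns stored as Fujisaki commands. As a rough illustration only, the sketch below evaluates the standard command-response (Fujisaki) model, in which the logarithm of F0 is a baseline plus the responses to phrase commands (impulses) and accent or tone commands (steps). The thesis uses "modified" Fujisaki commands whose exact form is not given in this record, so the function names, the constants alpha, beta and gamma, and all command values here are assumptions chosen for illustration.

```python
import math

def phrase_response(t, alpha=3.0):
    """Phrase control impulse response: Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent control step response: Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0, else 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    fb          -- baseline frequency in Hz
    phrase_cmds -- list of (onset_time, amplitude) impulses
    accent_cmds -- list of (onset_time, offset_time, amplitude) steps
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)

# Hypothetical commands for a two-second utterance (values are illustrative only).
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4), (0.9, 1.2, -0.2)]  # a negative amplitude lowers F0

# Sample the contour every 10 ms.
contour = [fujisaki_f0(i * 0.01, 120.0, phrase_cmds, accent_cmds) for i in range(200)]
```

Because a contour is fully described by a short list of command timings and amplitudes, a representation of this kind is compact to store and can be re-rendered at any frame rate.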

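    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Motivation and Objectives
        1.3 Overview of the Method
        1.4 Chapter Outline
    Chapter 2 Literature Review
        2.1 Mandarin Speech Synthesizers
            2.1.1 HTS Speech Synthesis System [1]
                2.1.1.1 Mandarin Phoneme Labeling
            2.1.2 Concatenative Synthesis Systems
        2.2 Prosodic Hierarchy
            2.2.1 Mandarin Prosodic Structure
            2.2.2 Generation of Prosodic Structure
                2.2.2.1 Prosodic Structure Prediction Model
                2.2.2.2 Building the Prediction Model
    Chapter 3 Construction of Hierarchical Pitch Pattern Selection
        3.1 Construction of Hierarchical Pitch Patterns
            3.1.1 Hierarchical Pitch Structure
            3.1.2 Pitch Contour Quantization at Each Level
                3.1.2.1 Quantization Model
                3.1.2.2 Prosodic Phrase Level
                3.1.2.3 Prosodic Word Level
            3.1.3 Pitch Pattern Clustering
            3.1.4 Pitch Pattern Storage
        3.2 Hierarchical Pitch Pattern Mapping and Statistics
            3.2.1 Statistics of Real Data
            3.2.2 Vector Construction for Synthesized Data Statistics
    Chapter 4 Speech Synthesis with Hierarchical Pitch Pattern Selection
        4.1 Vector Representation of Input Text
        4.2 Pitch Pattern Selection at Each Level
        4.3 Pitch Pattern Concatenation at Each Level
            4.3.1 Language Cost
            4.3.2 Continuity Cost
        4.4 Building the Natural Speech Synthesis System
    Chapter 5 Experimental Results and Analysis
        5.1 Experimental Corpus
            5.1.1 Corpus Overview
            5.1.2 Corpus Setup
            5.1.3 Experimental Environment Setup
        5.2 Experiments and Evaluation
            5.2.1 Evaluation Methods
        5.3 Experimental Results
            5.3.1 Objective Evaluation
            5.3.2 Subjective Evaluation
    Chapter 6 Conclusions and Future Work
        6.1 Conclusions
        6.2 Future Work
    References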

    [1] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1315-1318, 2000.
    [2] T. Kobayashi, S. Imai, and T. Fukuda, “Mel-Generalized Log Spectral Approximation Filter,” IEICE Transactions, Vol. J68-A, No. 6, pp. 610-611, 1985.
    [3] A. Hunt and A. Black, “Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 373-376, 1996.
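    [4] Editorial Committee for Mandarin Phonetics Teaching Materials, National Taiwan Normal University (ed.), 國音學 (Mandarin Phonetics), Taipei County: 正中 (Cheng Chung).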
    [5] C. Huang, Y. Shi, J. Zhou, M. Chu, T. Wang, and E. Chang, “Segmental Tonal Modeling for Phone Set Design in Mandarin LVCSR,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 901-904, 2004.
    [6] A. Black, H. Zen, and K. Tokuda, “Statistical Parametric Speech Synthesis,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1229-1232, 2007.
    [7] H. Segi, T. Takagi and T. Ito, “A Concatenative Speech Synthesis Method Using Context Dependent Phoneme Sequences with Variable Length as Search Units,” in Proceedings of ISCA Speech Synthesis Workshop (ISCA SSW-5), 2004.
    [8] M. Begum, R. N. Ainon, R. Zainuddin, Z. M. Don, and G. Knowles, “Prosody Generation by Integrating Rule and Template-Based Approaches for Emotional Malay Speech Synthesis,” in Proceedings of IEEE Region 10 Conference (TENCON), pp. 1-6, 2008.
    [9] L.L. Syaheerah, N.A. Raja, M. Salimah, and M.D. Zuraidah, “Template-Driven Emotions Generation in Malay Text-to-Speech: A Preliminary Experiment,” in Proceedings of International Conference of Information Technology in Asia (CITA), pp. 144-149, 2005.
    [10] H. Segi, R. Takou, N. Seiyama, T. Takagi, H. Saito, and S. Ozawa, “Template-Based Methods for Sentence Generation and Speech Synthesis,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1757-1760, 2011.
    [11] X. Sun, “The Determination, Analysis and Synthesis of Fundamental Frequency,” Ph.D. Dissertation, Northwestern University, 2002.
    [12] J. Y. Wu, “Pitch Prediction Using Prosody Hierarchy and Dynamic Features for HMM-Based Mandarin Speech Synthesis,” M.S. Thesis, National Cheng-Kung University, Tainan, Taiwan, 2008.
    [13] M. Chu and Y. Qian, “Locating Boundaries for Prosodic Constituents in Unrestricted Mandarin Texts,” International Journal of Computational Linguistics & Chinese Language Processing, Vol. 6, No. 1, pp. 61-82, 2001.
    [14] C. Wang, H. Fujisaki, S. Ohno, and T. Kodama, “Analysis and Synthesis of the Four Tones in Connected Speech of the Standard Chinese Based on a Command-Response Model,” in Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), pp. 1655-1658, 1999.
    [15] R. A. J. Clark, “Using Prosodic Structure to Improve Pitch Range Variation in Text to Speech Synthesis,” in Proceedings of International Congress of Phonetic Sciences, Vol. 1, pp. 69-72, 1999.
    [16] C. Y. Tseng, S. H. Pin, Y. L. Lee, H. M. Wang, and Y. C. Chen, “Fluent Speech Prosody: Framework and Modeling,” Speech Communication, Vol. 46, No. 3-4, pp. 284-309, 2005.
    [17] S. Sakai, “F0 Modeling with Multi-Layer Additive Modeling Based on a Statistical Learning Technique,” in Proceedings of ISCA Speech Synthesis Workshop (ISCA SSW-5), 2004.
    [18] S. H. Chen and Y. R. Wang, “Vector Quantization of Pitch Information in Mandarin Speech,” IEEE Trans. on Communications, Vol. 38, No. 9, pp. 1317-1320, 1990.
    [19] F. Liu, H. Jia and J. Tao, “A Maximum Entropy Based Hierarchical Model for Automatic Prosodic Boundary Labeling in Mandarin,” in Proceedings of International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1-4, 2008.
    [20] J. M. Ponte and W. B. Croft, “A Language Modeling Approach to Information Retrieval,” in Proceedings of International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 275-281, 1998.
    [21] M. W. Berry, Z. Drmac, and E. R. Jessup, “Matrices, Vector Spaces, and Information Retrieval,” Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999.
    [22] Z. J. Chuang, “A Study on Speech and Sign Language Processing for Speech/Hearing Impaired,” Ph.D. Dissertation, National Cheng-Kung University, Tainan, Taiwan, 2006.
    [23] L. H. Cai, D. D. Cui, and R. Cai, “TH-CoSS, a Mandarin Speech Corpus for TTS,” Journal of Chinese Information Processing, Vol. 21, No. 2, pp. 94-99, 2007.

    Full text: on campus, public from 2022-12-31; off campus, public from 2022-12-31