| Graduate student: | 翁思婷 Weng, Sz-Ting |
|---|---|
| Thesis title: | 應用韻律結構和Fujisaki模型之階層式音高樣式選取於自然語音合成 Hierarchical Pitch Pattern Selection Based on Prosodic Structure and Fujisaki Model for Natural Speech Synthesis |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Medical Informatics |
| Year of publication: | 2012 |
| Academic year of graduation: | 100 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 65 |
| Keywords: | Prosodic Structure, Fujisaki Model, Hierarchical Pitch, Natural Speech Synthesis |
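Natural prosody is an important factor in speech synthesis, since it carries communicative meaning and conveys a speaker's personal speaking style. In recent years, speech synthesizers based on the hidden Markov model (HMM) have become able to produce stable and fluent speech, and their adaptability is a further advantage, but the over-smoothing that limits naturalness still needs improvement.
Previous prosody generation methods clustered prosodic features by similar linguistic features, so identical linguistic features across different speakers could not express a particular speaker's distinctive prosodic patterns. This thesis therefore uses the "hierarchical prosodic units" of the Fujisaki model to account for the variation of a sentence's pitch contour, and uses information retrieval as the prosodic unit selection method. The selection criteria include not only linguistic features but also the shape of the pitch contour, which serves as the basis for retrieving natural pitch contours from the training corpus. The purpose of adopting hierarchical prosodic units is to approximate the local and global pitch contour variation of the sentences in the corpus; the pitch contours are converted, clustered into codewords, and preserved as Fujisaki commands. The codeword sequence of each sentence is constructed in order to find the correspondence between the pitch patterns of the real training corpus and those of the synthesized corpus.
In the experiments, subjective and objective evaluations show that the proposed method performs well and improves the naturalness of the synthesized speech compared with conventional methods.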
One of the goals of speech synthesis is to generate natural-sounding speech. Prosody plays an important role in conveying both communicative meaning and specific speaking styles. In recent years, speech synthesis based on the Hidden Markov Model (HMM) has been developed; it can synthesize stable and fluent speech and has the advantages of flexibility and a small footprint. Nevertheless, there is still room to reduce over-smoothing, which lowers the naturalness of the synthesized speech.
Previous studies of prosodic feature generation mainly clustered similar linguistic information and used the clustered prosodic features to train prosody models. Such models can fail to convey speaker-specific prosody patterns, because the linguistic cues are the same for different speakers. In this research, we adopt the idea of information retrieval and propose a hierarchical prosodic unit-selection method, which combines traditional linguistic cues with the shape of the pitch contour as a query to find natural pitch contours in the training corpus. The hierarchical prosodic units, which aim to model local pitch contour variation and the global intonation of utterances in the corpus, are clustered into codewords, and their pitch patterns are stored as modified Fujisaki commands. Codeword sequences of utterances in the training and synthesized corpora are constructed and used to map the relation between real speech and synthesized speech.
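To make the notion of "Fujisaki commands" concrete, the following is a minimal sketch of the standard Fujisaki command-response model, in which log F0 is the sum of a baseline, phrase command responses, and accent (tone) command responses. It is not the modified formulation used in the thesis, which the abstract does not detail. The function names, the command tuple layouts, and the default constants `alpha`, `beta`, and `gamma` are illustrative assumptions.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t).

    alpha is an assumed, typical value; it is not taken from the thesis.
    """
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t), capped at gamma.

    beta and gamma are assumed, typical values.
    """
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_f0, phrase_cmds, accent_cmds,
                alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds) under the standard model.

    phrase_cmds: list of (onset_time, magnitude)            -- hypothetical layout
    accent_cmds: list of (onset_time, offset_time, amplitude) -- hypothetical layout
    """
    log_f0 = math.log(base_f0)
    # Each phrase command adds a slowly decaying global component.
    for onset, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - onset, alpha)
    # Each accent command adds a local component between its onset and offset.
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (accent_response(t - onset, beta, gamma)
                               - accent_response(t - offset, beta, gamma))
    return math.exp(log_f0)


def synthesize_contour(duration, step, base_f0, phrase_cmds, accent_cmds):
    """Sample the F0 contour from 0 to duration (inclusive) every step seconds."""
    n = int(round(duration / step)) + 1
    return [fujisaki_f0(i * step, base_f0, phrase_cmds, accent_cmds)
            for i in range(n)]
```

In this sketch a pitch pattern is fully described by a baseline frequency and two short command lists, which is what makes a command-based representation compact to store per codeword. For example, `synthesize_contour(1.0, 0.01, 120.0, [(0.0, 0.5)], [(0.2, 0.4, 0.3)])` samples a one-second contour with one phrase command and one accent command.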
Subjective and objective tests comparing the proposed approach with conventional systems show that the proposed method achieves better naturalness.