| Graduate student: | 趙郁婷 Chao, Yu-Ting |
|---|---|
| Thesis title: | 使用語音音框校準及調適式條件隨機域於個人化頻譜及韻律之轉換 (Frame-Based Alignment and Adaptive CRF for Personalized Spectral and Prosody Conversion) |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2010 |
| Graduation academic year: | 98 (ROC calendar) |
| Language: | Chinese |
| Pages: | 52 |
| Keywords (Chinese): | 音框校準, 調適式條件隨機域, 個人化, 轉換 |
| Keywords (English): | alignment, adaptive CRF, personalized, conversion |
Personalized speech synthesis has been a major focus of recent speech synthesis research, and it involves two main factors: acoustic parameters and prosody. On the acoustic side, a common approach is to achieve personalization through voice conversion, but conversion quality depends heavily on how well the training data are aligned. When aligning a parallel corpus, if the two speakers' spectral features differ too much, dynamic time warping based on spectral features alone cannot find a proper correspondence. This study therefore applies principal component analysis to the spectral features, clusters and encodes the results, and computes the occurrence probability of each codeword; these statistics are used, together with the spectral features, to correct the original DTW result and obtain a more accurate alignment. On the prosody side, prosodic-break prediction is also personalized: a small parallel corpus is used to adapt a general conditional random field (CRF) prediction model toward the target speaker.

This thesis makes two main contributions: 1) improving frame alignment and voice conversion with the help of codeword occurrence statistics; 2) personalizing prosodic-boundary prediction by adapting a CRF model.

Experimental evaluation shows that the proposed alignment method helps personalized voice conversion achieve better results, and that adapting the prosodic-boundary prediction model likewise improves personalization.
Research on personalized speech synthesis has been a popular topic in recent years. Personalized speech generally involves two major factors: acoustic and prosodic features. Traditionally, personalized acoustic features are obtained by transforming spectral features with voice conversion methods. Frame-based voice conversion suffers when frame-pair alignment relies on spectral distance alone, since inaccurate alignment leads to improper conversion results. In this study, the feature vectors of a parallel corpus are transformed into codewords in an eigen-space, and the occurrence distribution of these codewords is incorporated into the distance measure for DTW. By considering both the spectral distance and the eigen-codeword distribution, a more precise alignment can be obtained. Prosody is another important part of personalized speech synthesis: the prosodic boundaries of the same sentence differ when it is uttered by different speakers. To generate personalized prosodic boundaries, boundary prediction is personalized through CRF model adaptation.
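The combined alignment cost described above can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the mixing weight `alpha`, the fixed-width window used to count codewords, and total-variation distance between codeword histograms are all illustrative assumptions, and the codeword indices are assumed to come from some prior clustering of the eigen-space projections.

```python
# Sketch: DTW whose frame-pair cost mixes Euclidean spectral distance with a
# distance between local eigen-codeword occurrence distributions (assumed setup).
import numpy as np

def codeword_histogram(codes, n_codes, center, width=2):
    """Occurrence distribution of codewords in a window around one frame."""
    lo, hi = max(0, center - width), min(len(codes), center + width + 1)
    hist = np.bincount(codes[lo:hi], minlength=n_codes).astype(float)
    return hist / hist.sum()

def frame_cost(src_feat, tgt_feat, src_hist, tgt_hist, alpha=0.5):
    """Weighted sum of spectral distance and codeword-distribution distance."""
    spectral = np.linalg.norm(src_feat - tgt_feat)
    histogram = 0.5 * np.abs(src_hist - tgt_hist).sum()  # total variation
    return (1 - alpha) * spectral + alpha * histogram

def dtw_align(src, tgt, src_codes, tgt_codes, n_codes, alpha=0.5):
    """Return a frame-pair alignment path between source and target utterances."""
    n, m = len(src), len(tgt)
    src_h = [codeword_histogram(src_codes, n_codes, i) for i in range(n)]
    tgt_h = [codeword_histogram(tgt_codes, n_codes, j) for j in range(m)]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = frame_cost(src[i - 1], tgt[j - 1], src_h[i - 1], tgt_h[j - 1], alpha)
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end of both utterances to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

With `alpha = 0` this reduces to ordinary spectral-distance DTW; raising `alpha` lets the codeword statistics override spurious spectral matches between dissimilar speakers.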
The purpose of this study is to develop a personalized speech synthesis system based on voice conversion with a small parallel corpus. It contains two major parts: (1) improving personalized spectral and prosody conversion through parallel-corpus alignment that considers both spectral distance and the eigen-codeword distribution; (2) personalized prosodic-boundary prediction using CRF model adaptation.
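The adaptation idea in part (2) can be sketched as regularizing a speaker-specific model toward the general model. As a simplification, a binary logistic-regression boundary classifier stands in here for the CRF (a linear-chain CRF generalizes such local log-linear decisions); the penalty `lam * ||w - w_general||^2` pulls the personalized weights toward the general ones when the personal parallel corpus is small. All function names and hyperparameters are illustrative assumptions, not the thesis's setup.

```python
# Sketch: prior-regularized (MAP-style) adaptation of a general boundary model
# to a small speaker-specific corpus, as a stand-in for CRF model adaptation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, w0=None, lam=0.0, lr=0.1, steps=500):
    """Fit a boundary classifier by gradient descent.

    If w0 is given, the loss includes lam * ||w - w0||^2, so the fitted
    weights are regularized toward the general model's weights.
    """
    w = np.zeros(X.shape[1]) if w0 is None else w0.copy()
    prior = np.zeros_like(w) if w0 is None else w0
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y) + 2.0 * lam * (w - prior)
        w -= lr * grad
    return w

def adapt(w_general, X_personal, y_personal, lam=2.0):
    """Adapt the general boundary model to a speaker's small parallel corpus."""
    return train(X_personal, y_personal, w0=w_general, lam=lam)
```

A large `lam` keeps the adapted model close to the general one (safe when personal data is scarce); a small `lam` lets the few personal examples dominate.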
Objective and subjective tests were performed to evaluate the performance of the proposed approach. The experimental results demonstrate that the proposed method improves the quality of personalized voice conversion.