
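Graduate student: 趙郁婷
Chao, Yu-Ting
Thesis title: 使用語音音框校準及調適式條件隨機域於個人化頻譜及韻律之轉換
Frame-Based Alignment and Adaptive CRF for Personalized Spectral and Prosody Conversion
Advisor: 吳宗憲
Wu, Chung-Hsien
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar, 2009-2010)
Language: Chinese
Number of pages: 52
Chinese keywords: 音框校準 (frame alignment), 調適式條件隨機域 (adaptive CRF), 個人化 (personalized), 轉換 (conversion)
English keywords: alignment, adaptive CRF, personalized, conversion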
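  • Personalized speech synthesis has been a major focus of speech synthesis research in recent years, and it depends on two main factors: acoustic parameters and prosody. For acoustic parameters, a common approach is to achieve personalization through voice conversion, but the conversion result is usually affected by the quality of the data alignment. When aligning a parallel corpus, dynamic time warping on spectral parameters alone cannot find a correct and appropriate correspondence if the spectral parameters of the two speakers differ too much. This study therefore applies principal component analysis to the spectral parameters, clusters and encodes the result, and collects occurrence-count probabilities of the resulting feature codewords to help correct the original dynamic time warping result. Combined with the spectral parameters, this yields more accurate data alignment. For prosody, the prediction of prosodic pauses is also personalized: a small amount of parallel corpus is used to adapt a general conditional random field prediction model, thereby achieving personalization.
    This thesis has two main points: (1) improving speech frame alignment and voice conversion with the help of occurrence-count probability statistics of the feature parameters; (2) achieving personalization by adapting the conditional random field model used to predict prosodic boundaries.
    Experimental evaluation shows that the proposed data alignment method helps personalized voice conversion obtain better results. Adapting the prosodic boundary prediction model also improves its personalization.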

    Personalized speech synthesis has been a popular research topic in recent years. Personalized speech generally depends on two major factors: acoustic features and prosodic features. Traditionally, personalized acoustic features are obtained through spectral feature transformation using voice conversion methods. Frame-based voice conversion, however, suffers from inaccurate phone-pair alignment when only spectral distance is used, which leads to improper conversion results. In this study, the feature vectors of the parallel corpus are transformed into codewords in an eigen-space, and the occurrence distribution of the codewords is used in the distance measure of dynamic time warping (DTW). Considering both spectral distance and the eigen-codeword distribution yields a more precise alignment. Prosodic features are also an important part of personalized speech synthesis. The prosodic boundaries of the same sentence differ when it is uttered by different speakers. To generate personalized prosodic boundaries, the prosodic boundary prediction model is personalized through conditional random field (CRF) model adaptation.
    The purpose of this study is to develop a personalized speech synthesis system based on voice conversion with a small parallel corpus. It contains two major parts: (1) improving personalized spectral and prosody conversion through parallel corpus alignment that considers both spectral distance and the eigen-codeword distribution; (2) personalized prosodic boundary prediction using CRF model adaptation.
    Objective and subjective tests were performed to evaluate the proposed approach. The experimental results show that the proposed method improves the quality of personalized voice conversion.
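
    The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not give the exact cost formula, so the weighted sum of normalized spectral distance and negative log codeword co-occurrence probability, the pooled codebook, the add-one smoothing, and all names and default values (`align`, `weight`, `n_codewords`, and so on) are assumptions made for illustration only.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def dtw_path(cost):
    """Minimum-cost monotonic path through a frame-by-frame cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the last frame pair to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]

def pca_project(frames, n_components):
    """Project frames onto their leading principal components (eigen-space)."""
    centered = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def quantize(points, n_codewords, n_iter=20, seed=0):
    """Assign each point a codeword index using a small k-means (assumed VQ)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), n_codewords, replace=False)]
    for _ in range(n_iter):
        labels = pairwise_dist(points, centers).argmin(axis=1)
        for k in range(n_codewords):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return pairwise_dist(points, centers).argmin(axis=1)

def align(src, tgt, n_components=2, n_codewords=4, weight=0.5):
    """Two-pass DTW: spectral distance first, then spectral plus codeword cost."""
    spec = pairwise_dist(src, tgt)
    spec = spec / spec.max()
    first_path = dtw_path(spec)
    # Codewords come from one codebook pooled over both speakers (assumption).
    codes = quantize(pca_project(np.vstack([src, tgt]), n_components), n_codewords)
    src_codes, tgt_codes = codes[:len(src)], codes[len(src):]
    # Co-occurrence counts along the first path, with add-one smoothing.
    counts = np.ones((n_codewords, n_codewords))
    for i, j in first_path:
        counts[src_codes[i], tgt_codes[j]] += 1
    prob = counts / counts.sum(axis=1, keepdims=True)
    code_cost = -np.log(prob[src_codes[:, None], tgt_codes[None, :]])
    code_cost = code_cost / code_cost.max()
    return dtw_path(weight * spec + (1 - weight) * code_cost)
```

    Here `src` and `tgt` are arrays of shape (frames, feature dimension) for the same sentence spoken by two speakers, and `align(src, tgt)` returns a list of aligned (source frame, target frame) index pairs. The `weight` parameter trades off the two cost terms. The table of contents lists a weight evaluation, but the weight used in the thesis is not stated in this record.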

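    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Preface
        1.1.1 Research Background
        1.1.2 Research Motivation and Objectives
        1.1.3 Literature Review
      1.2 Overview of the Research Method
        1.2.1 System Architecture
      1.3 Chapter Overview
    Chapter 2 HMM-based Mandarin Speech Synthesizer
      2.1 HMM-based Speech Synthesis System
      2.2 Construction of Mandarin HMM Models
        2.2.1 Mandarin Phone Models
        2.2.2 Text Analysis Preprocessor
        2.2.3 Question Set for the State Merging and Splitting Tree (Decision Tree)
      2.3 Parameter Extraction: STRAIGHT
    Chapter 3 Construction of the Personalized Prosodic Boundary Prediction Model
      3.1 Introduction to Mandarin Prosodic Structure
      3.2 Generation of Prosodic Structure
        3.2.1 Prosodic Structure Prediction Model
        3.2.2 Adaptation of the Personalized Prosodic Boundary Model
    Chapter 4 Construction of the Personalized Conversion Function Model
      4.1 Establishing Data Correspondence
        4.1.1 Data Parameter Processing and Vector Quantization
        4.1.2 Data Statistics and Alignment
      4.2 Construction of the Conversion Function Model
        4.2.1 Linear Conversion Function
        4.2.1 Conversion Prediction Model
          4.2.1.1 Construction of the Spectral Conversion Prediction Model
          4.2.1.2 Construction of the Duration and Pitch Conversion Prediction Models
    Chapter 5 Experimental Results and Analysis
      5.1 Experimental Corpus
        5.1.1 Experimental Corpus Settings
      5.2 Experimental Environment Settings
      5.3 Experiments and Evaluation
        5.3.1 Evaluation of the Prosodic Boundary Prediction Model
        5.3.2 Evaluation of the Feature Codeword Probability Statistics
        5.3.3 Evaluation of Weight Values
        5.3.4 Evaluation of Voice Conversion Results
      5.4 Analysis and Discussion
    Chapter 6 Conclusions and Future Work
      6.1 Conclusions
      6.2 Future Work
    References
    Appendix
    About the Author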

    Download: on campus, publicly available from 2013-08-13
    Off campus, publicly available from 2013-08-13