
Student: Bai, Yu-Wei (白育瑋)
Title: A GMM-based Voice Conversion System using Linguistic and Acoustic Information for Customizable Text-To-Speech
Advisor: Wang, Jhing-Fa (王駿發)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2013
Graduation Academic Year: 101 (ROC calendar; 2012-2013)
Language: English
Pages: 71
Keywords (Chinese, translated): mel-cepstral coefficients; linguistic information; acoustic features; probability-density normalized weighted sum
Keywords (English): Classification and Regression Tree (CART); HMM-based Speech Synthesis System (HTS)
In this thesis, a customizable voice conversion system is implemented, based on HTS with spectral-coefficient and pitch conversion driven by a linguistic-information CART. HTS models three major speech features: mel-cepstral coefficients, fundamental frequency, and state duration. To synthesize the target speaker's speech, two of these features are converted by the methods proposed in this thesis. In the training phase, a parallel corpus is required for CART training; to keep corpus collection efficient and phonetically balanced, a pre-designed phonetically balanced text corpus is built and a phonetically balanced sentence selection algorithm is proposed. The spectral coefficients and pitch are clustered through two separate mechanisms, by the linguistic-information CART and by acoustic features respectively. In the synthesis phase, the best conversion functions are first selected from the linguistic CART and the acoustic clusters according to the contextual labels produced by the text analyzer. The spectral coefficients of each frame are generated by the parameter generation process and then converted on the linguistic side and the acoustic side separately. The spectral parameters converted by the two mechanisms are combined by a probability-density normalized weighted sum to achieve a complementary effect, and the speech is finally synthesized through the MLSA filter.
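The phonetically balanced sentence selection step can be illustrated with a minimal greedy sketch. This is an assumption-laden reconstruction, not the algorithm from the thesis: it assumes each candidate sentence comes with its phone sequence, and it scores candidates by how much they add to under-represented phones (the scoring criterion is illustrative only).

```python
from collections import Counter

def select_balanced(candidates, num_sentences):
    """Greedily pick sentences that improve phonetic balance.

    candidates: list of (sentence_text, phone_list) pairs.
    Returns the selected sentence texts. Hypothetical criterion:
    a phone seen fewer times so far contributes more to the score.
    """
    selected = []
    coverage = Counter()  # phone -> occurrence count in the selected set
    pool = list(candidates)
    while pool and len(selected) < num_sentences:
        def gain(item):
            _, phones = item
            # Diminishing reward for phones already well covered.
            return sum(1.0 / (1 + coverage[p]) for p in phones)
        best = max(pool, key=gain)
        pool.remove(best)
        selected.append(best[0])
        coverage.update(best[1])
    return selected
```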
In the experiments, objective and subjective evaluations are designed for the speaker conversion results. The objective evaluation examines the converted mel-cepstral coefficients. The subjective evaluation uses MOS scores for fluency, intelligibility, and voice quality. In summary, the proposed system improves the speaker conversion results.

In this thesis, a customizable speaker conversion system is implemented using linguistic classification-and-regression-tree (CART) based spectrum and pitch conversion on top of the HMM-based speech synthesis system (HTS). HTS models three major acoustic features in the synthesis phase: spectrum, pitch, and duration. Two of these features are transformed by the proposed methods to synthesize the target speaker's speech. In the training phase, parallel corpora are required for CART training; to keep corpus collection efficient and phonetically balanced, a pre-designed phonetically balanced text corpus is established and a phonetically balanced sentence selection algorithm is proposed. The linguistic CART and the acoustic clusters of spectrum and pitch are then constructed through the proposed mechanisms. In the synthesis phase, the conversion functions of spectrum and pitch are selected from the linguistic CART and the acoustic clusters according to the label sequence generated by the text analyzer. Next, the frame-based spectrum and pitch features are generated by the parameter generation process and converted by the linguistic and acoustic conversion functions. Using both the linguistic and the acoustic conversion yields a complementary effect. Finally, the target speaker's speech is synthesized by the MLSA vocoder from the converted features.
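The frame-level mapping and the probability-density normalized weighted sum can be made concrete with a short sketch. The code below is an illustrative reconstruction, not the thesis's implementation: it uses the standard GMM regression form for each conversion function and weights the linguistic and acoustic candidates by their densities under the selected models (all parameter names are placeholders).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Standard GMM-based mapping for one source frame x.

    For each mixture m the local regression is
        y_m = mu_y[m] + cov_yx[m] @ inv(cov_xx[m]) @ (x - mu_x[m]),
    and the output is the sum of y_m weighted by the posteriors P(m | x).
    Returns the converted frame and the total density p(x) under this GMM.
    """
    densities = np.array([
        w * multivariate_normal.pdf(x, mean=mx, cov=cxx)
        for w, mx, cxx in zip(weights, mu_x, cov_xx)
    ])
    posteriors = densities / densities.sum()  # P(m | x)
    y = np.zeros_like(x)
    for m, p in enumerate(posteriors):
        y = y + p * (mu_y[m] + cov_yx[m] @ np.linalg.solve(cov_xx[m], x - mu_x[m]))
    return y, densities.sum()

def combine(y_lin, p_lin, y_ac, p_ac):
    """Density-normalized weighted sum of the two converted candidates."""
    w_lin = p_lin / (p_lin + p_ac)
    return w_lin * y_lin + (1.0 - w_lin) * y_ac
```

Normalizing by the two densities means the mechanism that explains the current frame better dominates the output, which is one plausible reading of the complementary effect described above.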
In the experiments, objective and subjective evaluations are designed to compare the speaker conversion results. An objective evaluation of the converted spectrum is carried out. In the subjective evaluation, three MOS tests are used: fluency, intelligibility, and voice quality, with scores of 4.03, 4.12, and 4.09 respectively. In summary, the proposed speaker conversion system improves conversion performance.
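The abstract does not name the objective measure for the spectrum; mel-cepstral distortion (MCD) is the customary choice for converted mel-cepstra, so the sketch below assumes it (time-aligned frames, with the 0th energy coefficient excluded by convention).

```python
import numpy as np

def mel_cepstral_distortion(mc_target, mc_converted):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    mc_target, mc_converted: arrays of shape (frames, order + 1).
    Per frame: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_t[d] - c_c[d])^2),
    summed over d >= 1 (the 0th coefficient is excluded).
    """
    diff = mc_target[:, 1:] - mc_converted[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()
```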

Abstract (Chinese)
Abstract
Acknowledgements
Contents
Table List
Figure List
Chapter 1 Introduction
    1.1 Background
    1.2 Motivation
    1.3 Previous Works and Problems
    1.4 Objectives
    1.5 Organization
Chapter 2 Related Works
    2.1 HMM-based Speech Synthesis System (HTS)
        2.1.1 Feature Extraction
        2.1.2 Training of HMMs
        2.1.3 Parameter Generation from HMM
        2.1.4 MLSA Vocoder
    2.2 Model-based Speaker Conversion Method
        2.2.1 AMCC & SMAP
        2.2.2 MLLR & SMAPLR
        2.2.3 CMLLR & CSMAPLR
    2.3 Frame-based Speaker Conversion Method
        2.3.1 Vector Quantization (VQ)
        2.3.2 Linear Multivariate Regression (LMR)
        2.3.3 Gaussian Mixture Model (GMM)
        2.3.4 Hidden Markov Model (HMM)
        2.3.5 CART-based Voice Conversion
Chapter 3 Proposed Voice Conversion System
    3.1 System Overview
        3.1.1 Parallel Corpus Generation
        3.1.2 Conversion Models Estimation
        3.1.3 Voice Conversion
    3.2 Parallel Corpus Generation
        3.2.1 Sentence Selection Algorithm
        3.2.2 Phonetic Balanced Text Corpus Generation
    3.3 Conversion Models Estimation
        3.3.1 Time Alignment Between Source and Target Speeches
        3.3.2 Linguistic Classification and Conversion Models Estimation
        3.3.3 Acoustic Classification and Conversion Models Estimation
        3.3.4 Pitch Conversion Models Estimation
    3.4 Voice Conversion
        3.4.1 Linguistic-based Models Selection and Voice Conversion
        3.4.2 Acoustic-based Models Selection and Voice Conversion
        3.4.3 Posterior Probability Normalization Weighting Sum
        3.4.4 Pitch Voice Conversion
Chapter 4 Experiments
    4.1 Experiment Environment
    4.2 Proposed System Comparison
        4.2.1 Objective Evaluation
        4.2.2 Subjective Evaluation
        4.2.3 Comparison in Different Training Speech Number of Parallel Corpus
Chapter 5 Conclusion and Future Work
    5.1 Conclusion and Discussion
    5.2 Future Works
References
About the Author
Appendix
    Appendix A
    Appendix B

