| Graduate Student: | Li, Yun-Chang (李昀璋) |
|---|---|
| Thesis Title: | Speaker Conversion System Based on HMM-Based Speech Synthesis System and Regression-Tree-Based MGC and F0 Conversion with Backtracking Mechanism |
| Advisor: | Wang, Jhing-Fa (王駿發) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Publication Year: | 2012 |
| Academic Year of Graduation: | 100 (ROC calendar) |
| Language: | English |
| Pages: | 57 |
| Chinese Keywords: | speech synthesis, speaker conversion |
| English Keywords: | HTS, speech synthesis, speaker adaptation, speaker conversion |
| Access Count: | Views: 144; Downloads: 2 |
This thesis implements a speaker conversion system based on HTS with regression-tree-based MGC and F0 conversion and a backtracking mechanism. In HTS there are three major acoustic features: mel-cepstral coefficients (MGC), fundamental frequency (F0), and state duration. To synthesize the target speaker's speech, these three features are converted by the methods proposed in this thesis. In the training phase, parallel corpora are required to train the decision trees; owing to the proposed system architecture, the target speaker's corpus can be chosen arbitrarily. The decision tree for state duration and the regression trees for MGC and F0 are built through different mechanisms. In the synthesis phase, based on the contextual labels produced by the text analyzer, the target speaker's state-duration sequence is first predicted from the decision tree, and the optimal conversion functions are selected from the regression trees; the frame-level parameters produced by the parameter generation process are then converted into the target speaker's parameters by these functions, and finally the speech is synthesized with the MLSA filter.
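The abstract does not spell out how the backtracking mechanism interacts with the regression trees. One plausible reading, sketched below purely as a hypothetical illustration (the `Node` structure, the `min_samples` threshold, and the fallback rule are all assumptions, not the thesis's actual design): descend the tree by context questions, then back off to an ancestor whenever the reached leaf holds too few training frames to support a reliable conversion function.

```python
# Hypothetical sketch of regression-tree lookup with backtracking:
# descend by context questions, then back off to an ancestor node
# whenever the reached leaf has too few training samples.

class Node:
    def __init__(self, n_samples, func=None, question=None,
                 yes=None, no=None, parent=None):
        self.n_samples = n_samples   # training frames that reached this node
        self.func = func             # conversion function stored at this node
        self.question = question     # predicate on the context label; None at leaves
        self.yes, self.no = yes, no
        self.parent = parent

def find_function(root, context, min_samples=50):
    """Descend to a leaf, then backtrack until enough data supports the node."""
    node = root
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    while node.parent is not None and node.n_samples < min_samples:
        node = node.parent           # back off to a better-trained ancestor
    return node.func
```

In this reading, backtracking trades context specificity for statistical reliability, much like the tying used elsewhere in HMM-based synthesis.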
In the experiments, objective and subjective evaluations are designed for the speaker conversion results. The objective evaluation covers MGC, F0, and state duration; since each frame's F0 is either voiced or unvoiced, the F0 evaluation is split into two tests. The subjective evaluation uses MOS scores for quality and similarity. In summary, the proposed speaker conversion system improves the conversion results, especially for MGC and state duration.
In this thesis, a new speaker conversion system is implemented using regression-tree-based MGC and F0 conversion on top of the HMM-based speech synthesis system (HTS). In HTS, there are three major acoustic features in the synthesis phase: MGC, F0, and duration. To synthesize the target speaker's speech, these three features are transformed by the proposed methods. In the training phase, parallel corpora are required for decision-tree training, and, owing to the proposed architecture, the target speaker's corpus can be chosen arbitrarily. The decision tree for duration and the regression trees for MGC and F0 are then constructed through the proposed mechanisms. In the synthesis phase, according to the label sequence generated by the text analyzer, the target speaker's duration sequence is first predicted from the duration decision tree, and the conversion functions for MGC and F0 are determined from the respective regression trees. Next, the frame-level MGC and F0 features are generated by the parameter generation process and converted by those conversion functions. Finally, the target speaker's speech is synthesized by the MLSA vocoder from the converted features.
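The conversion step in the synthesis phase can be illustrated with two common function forms, assumed here for concreteness rather than taken from the thesis: a linear transform y = Ax + b for an MGC frame, and a mean/variance mapping in the log-F0 domain that leaves unvoiced frames untouched.

```python
import math

# Hedged sketch of the synthesis-phase conversion step: once the regression
# trees have supplied per-class conversion functions, each generated frame is
# mapped to the target speaker. The linear MGC transform and the mean/variance
# log-F0 mapping below are common choices, assumed here rather than taken
# from the thesis.

def convert_mgc(frame, A, b):
    """Linear conversion y = A x + b for one MGC frame (lists of floats)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, frame)) + b_i
            for row, b_i in zip(A, b)]

def convert_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Mean/variance mapping in the log-F0 domain; 0.0 marks unvoiced frames."""
    if f0 <= 0.0:                      # unvoiced: nothing to convert
        return 0.0
    z = (math.log(f0) - src_mean) / src_std
    return math.exp(tgt_mean + tgt_std * z)
```

In a full pipeline these functions would be applied frame by frame to the output of the parameter generation process before the MLSA vocoder resynthesizes the waveform.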
In the experiments, objective and subjective evaluation tests are designed to compare the speaker conversion results. In the objective evaluation, three types of tests are carried out, for MGC, F0, and duration. Since each frame's F0 is either voiced or unvoiced, two tests are designed for F0. In the subjective evaluation, two types of MOS are used to assess the conversion results: quality and similarity. In summary, the proposed speaker conversion system improves the conversion performance, especially for MGC and duration.
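For concreteness, objective tests of this kind are often realized as mel-cepstral distortion for MGC, RMSE of F0 over frames voiced in both utterances, and the voiced/unvoiced decision error rate. The formulas below are these standard definitions, assumed rather than quoted from the thesis.

```python
import math

# Illustrative objective measures (standard definitions, assumed rather than
# taken verbatim from the thesis): mel-cepstral distortion for MGC, RMSE of
# F0 over frames voiced in both tracks, and the voiced/unvoiced error rate.

def mel_cepstral_distortion(c1, c2):
    """MCD in dB between two MGC frames, skipping the 0th (energy) term."""
    d2 = sum((a - b) ** 2 for a, b in zip(c1[1:], c2[1:]))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * d2)

def f0_rmse_and_vuv_error(f0_a, f0_b):
    """RMSE over frames voiced in both tracks, plus fraction of V/UV mismatches."""
    both = [(a, b) for a, b in zip(f0_a, f0_b) if a > 0 and b > 0]
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in both) / len(both)) if both else 0.0
    mismatches = sum((a > 0) != (b > 0) for a, b in zip(f0_a, f0_b))
    return rmse, mismatches / len(f0_a)
```

Splitting the F0 evaluation into a voiced-frame RMSE and a V/UV error rate matches the abstract's observation that each frame's F0 is either voiced or unvoiced.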
[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” Proc. EUROSPEECH-99, pp. 2347-2350, Sep. 1999
[2] K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, A. Black, and T. Nose, The HMM-Based Speech Synthesis System (HTS) Version 2.2
http://hts.sp.nitech.ac.jp/
[3] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Y. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The Hidden Markov Model Toolkit (HTK) Version 3.4.1
http://htk.eng.cam.ac.uk/
[4] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, “Multi-space probability distribution HMM,” IEICE Trans. Inf. Syst., vol. E85-D, no. 3, pp. 455-464, Mar. 2002
[5] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “A hidden semi-Markov model-based speech synthesis system,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 825-834, May 2007
[6] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” Proc. ICASSP, pp. 1315-1318, 2000
[7] S. Imai, K. Sumita, and C. Furuichi, “Mel-Log Spectrum Approximation (MLSA) Filter for Speech Synthesis,” Trans. IECE, vol. J66-A, pp. 122-129, Feb. 1983
[8] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An Adaptive Algorithm for Mel-cepstral Analysis of Speech,” Proc. ICASSP, 1992
[9] K. Shinoda and T. Watanabe, “Speaker adaptation with autonomous model complexity control by MDL principle,” Proc. ICASSP, pp. 717-720, May 1996
[10] K. Shinoda and C. Lee, “A structural Bayes approach to speaker adaptation,” IEEE Trans. Speech Audio Process., vol. 9, pp. 276-287, Mar. 2001
[11] C. Leggetter and P. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Comput. Speech Lang., vol. 9, no. 2, pp. 171-185, 1995
[12] O. Siohan, T. Myrvoll, and C. Lee, “Structural maximum a posteriori linear regression for fast HMM adaptation,” Comput. Speech Lang., vol. 16, no. 3, pp. 5-24, 2002
[13] V. Digalakis, D. Rtischev, and L. Neumeyer, “Speaker adaptation using constrained reestimation of Gaussian mixtures,” IEEE Trans. Speech Audio Process., vol. 3, no. 5, pp. 357-366, Sep. 1995
[14] M. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Comput. Speech Lang., vol. 12, no. 2, pp. 75-98, 1998
[15] Y. Nakano, M. Tachibana, J. Yamagishi, and T. Kobayashi, “Constrained Structural Maximum A Posteriori Linear Regression for Average-Voice-Based Speech Synthesis,” Proc. INTERSPEECH, pp. 2286-2289, 2006
[16] J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai, “Analysis of Speaker Adaptation Algorithms for HMM-based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm,” IEEE Trans. Audio, Speech, and Language Processing, vol. 17, pp. 66-83, Jan. 2009
[17] Yu-Ting Chao and Chung-Hsien Wu, “Frame-Based Alignment and Adaptive CRF for Personalized Spectral and Prosody Conversion,” Institute of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, July 2010
[18] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282-289, June 28-July 01, 2001
[19] “Dijkstra’s algorithm,” Wikipedia
http://en.wikipedia.org/wiki/Dijkstra's_algorithm