
Graduate student: 李崇漢
Lee, Chung-Han
Thesis title: 自發性語音合成中個人化習語和發音變異產生之研究
A Study on the Generation of Personalized Speaking Style and Pronunciation Variation for Spontaneous Speech Synthesis
Advisor: 吳宗憲
Wu, Chung-Hsien
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: English
Number of pages: 84
Chinese keywords: 自發性語音合成, 個人化習語, 發音變異, 文字轉語音
English keywords: Spontaneous Speech Synthesis, Idiolect, Pronunciation Variation, Text-to-Speech
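    Research on spontaneous speech synthesis has received growing attention in recent years, with the naturalness, fluency, and personalized character of synthesized speech as its main goals. A person's speaking style comprises voice, wording, and body movements; it not only reinforces the content of a message but also allows the speaker's identity to be recognized. Combining speaking styles at different levels makes a message more lively and vivid, and keeps listeners from finding it dull. To address the idiolect characteristics of personal style, this dissertation first examines personalized style at the level of text processing. The main goal is to convert an original text into a text carrying a particular speaker's speaking style through the extraction and generation of personal idiolects, thereby building a personalized idiolect model.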

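    This dissertation uses a statistical method to automatically extract a speaker's habitual idiolects from the training corpus and to classify them. For idiolects in the superfluous (filler) category, an insertion model is built and the idiolects are inserted at appropriate positions in the text. For non-superfluous idiolects, a synonym mapping table is designed and built, and synonym substitution and insertion are used to rewrite the original input text into text with the target speaker's personal speaking style. Both the synonym mapping table and the insertion model are obtained automatically from the transcripts. Experiments and statistical tests confirm that the original text can indeed be converted into text carrying the target speaker's speaking style.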

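    Beyond handling personalized style in text, this dissertation also proposes a method for generating pronunciation variation to improve the naturalness of synthesized spontaneous speech. Pronunciation variation in natural speech is an important factor affecting naturalness. Speech synthesizers based on hidden Markov models (HMMs) can now produce fluent and intelligible speech, and their portability and adaptability are advantages, but the naturalness of the synthesized speech remains insufficient and needs improvement. This dissertation therefore uses voice transformation functions to convert and synthesize pronunciation variation, and uses articulatory feature parameters to predict it. New phone models are generated through the transformation functions, in the hope of overcoming a limitation of conventional synthesis, which relies on a fixed number of phone models. Articulatory features are also used to classify pronunciation variations by their acoustic characteristics, which compensates for insufficient training data. Generating pronunciation variation in this way is intended to improve the naturalness of HMM-based synthesized speech.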

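    Finally, this dissertation applies the transformation functions and the pronunciation variation prediction model: transformation functions are incorporated into the hidden Markov models to build pronunciation variation models, and classification and regression trees (CART) are used to predict the type of pronunciation variation to be generated. Subjective and objective evaluations of the pronunciation variation models show that the proposed method considerably improves the naturalness of synthesized speech.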

    A person’s speaking style, comprising attributes such as voice, choice of vocabulary, and physical gestures, not only expresses the speaker’s identity but also emphasizes the content of an utterance. Speech that combines these aspects of speaking style is more vivid and expressive to listeners. Recent research on speaking style modeling has concentrated on speech signal processing. This study instead focuses on text processing, extracting and generating idiolects to model a specific person’s speaking style for text-to-speech (TTS) conversion. The first stage adopts a statistical method to automatically detect candidate idiolects in a personalized, transcribed speech corpus. The detected candidates are then categorized: superfluous idiolects are extracted using a fluency measure, and the remaining candidates are regarded as non-superfluous idiolects. In idiolect generation, the input text is converted into a target text with a particular speaker’s speaking style through the insertion of superfluous idiolects or synonym substitution with non-superfluous idiolects. To evaluate the proposed methods, experiments were conducted on a Chinese corpus collected and transcribed from the speech files of three Taiwanese politicians. The results show that the proposed method can effectively convert a source text into a target text with a personalized speaking style.
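    The idiolect generation step described in this abstract (inserting superfluous idiolects and substituting synonyms) can be pictured with a minimal sketch. It is not the dissertation's implementation: the function name `apply_idiolect`, the English toy data, and the rule of inserting a filler before a fixed trigger word are assumptions made purely for illustration, whereas the dissertation builds a statistical insertion model and a synonym table from transcripts.

```python
def apply_idiolect(tokens, synonym_table, fillers):
    """Rewrite a token list toward one speaker's style.

    synonym_table: hypothetical map from a neutral word to the
        speaker's preferred wording (non-superfluous idiolect).
    fillers: hypothetical map from a trigger word to a filler phrase
        inserted before it (a simple stand-in for an insertion model).
    """
    styled = []
    for token in tokens:
        # Insert a superfluous idiolect (filler) before a trigger word.
        if token in fillers:
            styled.append(fillers[token])
        # Substitute the speaker's preferred synonym when one exists.
        styled.append(synonym_table.get(token, token))
    return styled


source = ["we", "must", "improve", "the", "economy"]
styled = apply_idiolect(
    source,
    synonym_table={"must": "have to"},
    fillers={"improve": "you know"},
)
print(styled)
# ['we', 'have to', 'you know', 'improve', 'the', 'economy']
```

    A lookup table keeps the example short; a position-dependent model would replace the `fillers` dictionary in any realistic setting.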
    Pronunciation normally varies in spontaneous speech, and this variation is an integral aspect of spontaneous expression. This study describes a voice transformation-based approach to generating pronunciation variations for hidden Markov model (HMM)-based spontaneous speech synthesis. In this approach, context-dependent linear transformation functions, constructed from a small parallel corpus using a statistical HMM, model the relationship between read speech and the corresponding spontaneous speech with pronunciation variations. The state-based context-dependent transformation function is then used to determine the feature parameters for generating pronunciation variations. Because training data are scarce, the transformation functions are clustered using a decision tree based on linguistic and articulatory features. Consequently, an unseen pronunciation variation can be generated with the transformation function retrieved from the decision tree using those features. For evaluation, three small parallel speech databases were collected, each recorded by one of three native speakers. Objective and subjective tests were performed to evaluate the proposed approach.
    Finally, the experimental results demonstrate that the proposed transformation function substantially improves the apparent spontaneity of the synthesized speech.
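    As a rough illustration of what a context-dependent linear transformation function does, the sketch below applies y = A x + b to the mean vector of one HMM state. Everything specific here is an assumption: the names `apply_linear_transform` and `select_transform`, the two-dimensional toy features, the context keys, and the dictionary lookup with a fallback, which merely stands in for retrieving a transformation from the decision tree described in the abstract.

```python
def apply_linear_transform(mean, A, b):
    """Return y = A x + b for one state's mean vector x (plain lists)."""
    return [
        sum(a * x for a, x in zip(row, mean)) + offset
        for row, offset in zip(A, b)
    ]


def select_transform(context, transforms, default):
    """Pick the transform for a context, falling back to a default.

    The abstract describes retrieval through a decision tree over
    linguistic and articulatory features; a dictionary lookup is used
    here only to keep the sketch short.
    """
    return transforms.get(context, default)


# Hypothetical toy values, not taken from the dissertation.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
transforms = {("a", "n"): ([[1.0, 0.5], [0.0, 2.0]], [0.5, -1.0])}

read_speech_mean = [1.0, 2.0]
A, b = select_transform(("a", "n"), transforms, identity)
print(apply_linear_transform(read_speech_mean, A, b))  # [2.5, 3.0]
```

    An unseen context falls back to the identity transform in this sketch, which leaves the read-speech parameters unchanged.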

    TABLE OF CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1. INTRODUCTION
        1.1. Motivation
        1.2. The Approach of this Dissertation
        1.3. The Organization of this Dissertation
    CHAPTER 2. FRAMEWORK OVERVIEW
    CHAPTER 3. IDIOLECT CANDIDATE DETECTION AND IDIOLECT INSERTION
        3.1. Co-Occurring Word Extraction
            Strength Measure
            Spread Measure
            Extraction of Idiolects with Non-Consecutive Phrases
        3.2. Idiolect Classification
        3.3. Idiolect Insertion and Generation
            Superfluous Idiolect Insertion
            Non-Superfluous Idiolect Generation
        3.4. Experimental Results
            Corpus Analysis
            Evaluation of Idiolect Candidate Detection
            Experiments on Word Frequency Constraint
            Superfluous Idiolect Extraction Evaluation
            Sentence Similarity
            Evaluation of Superfluous Idiolect Insertion
            Evaluation of Synonym Extraction and Substitution
            Overall Evaluation
        3.5. Summarization of this Chapter
    CHAPTER 4. PRONUNCIATION VARIATION GENERATION VIA STATE-BASED CONTEXT-DEPENDENT VOICE TRANSFORMATION
        4.1. Transformation function modeling
        4.2. Transformation function Clustering and Cross Validation
            Transformation function Clustering
            Cross Validation
        4.3. Experimental Results
            Design and Collection of Speech Database
            Evaluation of Modeling of Pronunciation Variation
            Comparisons with Other Variation Generation Methods
            Comparisons of Linear Transformation Functions
            Evaluation via Formal Listening
        4.4. Summarization of this Chapter
    CHAPTER 5. CONCLUSION AND FUTURE WORKS
    REFERENCE

    Full text available on campus from 2012-08-02
    Full text available off campus from 2013-08-02