
Graduate Student: 陳柏均 (Chen, Po-Chun)
Thesis Title: 基於目標語者少量訓練語料之跨語言多語者語音合成系統
Cross-Lingual Multi-Speaker Speech Synthesis System Based on a Small-Sized Training Corpus of Target Speakers
Advisor: 楊中平 (Yang, Chung-Ping)
Co-Advisor: 盧文祥 (Lu, Wen-Hsiang)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar; 2022–2023)
Language: English
Number of Pages: 45
Chinese Keywords: 跨語言合成 (cross-lingual synthesis), 多語者合成 (multi-speaker synthesis), 少量語料 (small-sized corpus)
Foreign-Language Keywords: Cross-lingual synthesis, Multi-speaker synthesis, Small-sized corpus
  • Cross-lingual multi-speaker speech synthesis based on a small amount of training data is a practical and challenging research area for both language teaching and commercial use. Although there is already much research on speech synthesis, Taiwanese and Hakka are often forgotten; we believe that Taiwanese and Hakka still play an important role in linguistics, and we hope to contribute to language education in Taiwan. Speech synthesis systems have become increasingly natural and fluent and are widely used in applications such as voice assistants, e-book narration, and accessibility technology. As the technology keeps evolving, the future of speech synthesis remains full of potential and challenges. To cover different scenarios, our cross-lingual system includes Taiwanese, Mandarin, Hakka, and English, and incorporates both male and female voices, realizing a cross-gender, cross-lingual, multi-speaker speech synthesis system. Experimentally, with only about 30 utterances (3 minutes) of target-speaker data containing no foreign-language content, fine-tuning the pre-trained base model for a short time achieves cross-lingual synthesis that preserves the target speaker's acoustic characteristics with highly accurate pronunciation.

    Cross-lingual multi-speaker speech synthesis using a small-sized training corpus is a practical and challenging research area for both language teaching and commercial applications. Although there has been much research on speech synthesis, Taiwanese and Hakka are often overlooked. We believe that Taiwanese and Hakka still play an important role in linguistics, and we aspire to contribute to language education in Taiwan. Speech synthesis systems have become increasingly natural and fluent and are widely used in applications such as voice assistants, e-book narration, and accessibility technology. As the technology continues to evolve, the future development of speech synthesis still holds great potential and challenges. To cater to different scenarios, our cross-lingual speech synthesis covers Taiwanese, Mandarin, Hakka, and English, and includes both male and female voices, achieving a cross-gender, cross-lingual, multi-speaker speech synthesis system. Our experimental results show that we require only about 30 utterances (3 minutes) of each target speaker's corpus, with no foreign-language content. Starting from the pre-trained base model, a short period of fine-tuning is sufficient to achieve cross-lingual speech synthesis that preserves the acoustic characteristics of the target speakers with highly accurate pronunciation.
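    The record does not include the author's code. As a rough, hypothetical illustration of the recipe the abstract describes (adapting a pre-trained multi-speaker acoustic model to a new voice with roughly 30 utterances), the PyTorch sketch below uses a toy stand-in model and dummy data; every class, file, and parameter name here is an assumption for illustration, not taken from the thesis.

```python
# Minimal sketch (not the thesis implementation): fine-tune a pre-trained
# multi-speaker acoustic model on ~30 utterances of a new target speaker.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Hypothetical stand-in for an acoustic model with speaker embeddings."""
    def __init__(self, n_phones=100, n_speakers=10, d_model=128, n_mels=80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, phone_ids, speaker_id):
        x = self.phone_emb(phone_ids) + self.speaker_emb(speaker_id).unsqueeze(1)
        h, _ = self.encoder(x)
        return self.mel_out(h)

model = ToyAcousticModel()
# In practice, load the multi-speaker checkpoint trained on the full corpora,
# e.g. model.load_state_dict(torch.load("pretrained_multispeaker.pt")).

# Freeze everything except the speaker embedding and the output projection,
# so a few minutes of data can adapt the voice without hurting pronunciation.
for p in model.parameters():
    p.requires_grad = False
model.speaker_emb.weight.requires_grad = True
model.mel_out.weight.requires_grad = True
model.mel_out.bias.requires_grad = True

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
criterion = nn.L1Loss()

# Dummy stand-in for the ~30-utterance target-speaker corpus:
# (phone-id sequence, mel-spectrogram) pairs; the target speaker gets id 0.
corpus = [(torch.randint(0, 100, (1, 50)), torch.randn(1, 50, 80))
          for _ in range(30)]
target_speaker = torch.tensor([0])

for step in range(100):  # a short fine-tuning run
    for phones, mel in corpus:
        optimizer.zero_grad()
        loss = criterion(model(phones, target_speaker), mel)
        loss.backward()
        optimizer.step()
```

    Freezing all but the speaker-related parameters is only one possible adaptation configuration; the thesis compares its actual fine-tuning configurations in Section 4.6 of the table of contents below.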

    摘要 (Abstract in Chinese)
    ABSTRACT
    致謝 (Acknowledgements)
    TABLE OF CONTENT
    LIST OF TABLES
    LIST OF FIGURES
    Chapter 1. INTRODUCTION
      1.1 Background
      1.2 Motivation
      1.3 Goal
      1.4 Method
      1.5 Contribution
    Chapter 2. RELATED WORK
      2.1 Text-to-Speech
      2.2 Multi-Speaker Text-to-Speech
      2.3 Cross-Lingual Text-to-Speech
    Chapter 3. METHODOLOGY
      3.1 Overview
      3.2 System Architecture
      3.3 Data Collection
        3.3.1 Dictionary
        3.3.2 Taiwanese Corpus
        3.3.3 Hakka Corpus
        3.3.4 Mandarin Corpus
        3.3.5 English Corpus
      3.4 Text-to-Phonetic Unit
        3.4.1 Text Normalization
        3.4.2 Word Segmentation and Part-of-Speech Tagging
        3.4.3 Pronunciation Selection
        3.4.4 Taiwanese CTL Conversion
        3.4.5 Tone Sandhi
        3.4.6 Phonetic Unit Segmentation
      3.5 Synthesis System
        3.5.1 Acoustic Model
        3.5.2 Vocoder Model
    Chapter 4. EXPERIMENT
      4.1 Data
      4.2 Phonetic Unit Alignment
      4.3 Extract Features from Audio
      4.4 Pre-Trained Model
      4.5 Fine-Tuned Model
      4.6 Comparison of Various Configurations in Fine-Tuned Model
      4.7 Acoustic Characteristics Analysis
      4.8 Pronunciation and Acoustic Characteristics Evaluation
        4.8.1 Taiwanese Evaluation
        4.8.2 Mandarin Evaluation
        4.8.3 English Evaluation
    Chapter 5. CONCLUSION
    REFERENCE

    Available on campus: 2028-07-31
    Available off campus: 2028-07-31
    The electronic thesis has not yet been authorized for public release; please consult the library catalog for the print copy.