| Author: | 楊崇文 Yang, Chung-Wen |
|---|---|
| Thesis Title: | 一個調整文字轉語音模型所產生之語音語速之系統 A System for Modifying the Duration of Synthesized Speech from Text-To-Speech Models |
| Advisor: | 賴槿峰 Lai, Chin-Feng |
| Degree: | 碩士 Master |
| Department: | 工學院 - 工程科學系 Department of Engineering Science |
| Year of Publication: | 2023 |
| Academic Year of Graduation: | 111 |
| Language: | English |
| Pages: | 61 |
| Keywords (Chinese): | 文字轉語音、音訊時長控制、蒙特婁文字對齊器 |
| Keywords (English): | Text-To-Speech, Audio Time-Scale Modification, Montreal Forced Aligner |
In the past few years, text-to-speech (TTS) has attracted considerable research attention because of its wide range of applications. As TTS technology has developed, synthesized speech is expected not only to be correct in content but also to sound highly natural, and one of the key factors affecting naturalness is the speaking rate. Most early TTS models are autoregressive: each speech frame is generated conditioned on the previously generated frame. The major drawback of this autoregressive generation is its lack of control over the speaking rate of the synthesized speech. To gain such control, later TTS models adopted non-autoregressive architectures instead. However, non-autoregressive TTS models require a very large amount of training data, which in turn places heavy demands on hardware and lengthens training time, making these models difficult to train. This thesis therefore proposes a system for modifying the speaking rate of speech synthesized by a TTS model. The system consists of a forced aligner, a duration modifier, a duration-modification network, and a vocoder. The forced aligner locates the word boundaries in the speech; the duration modifier converts the speech to the frequency domain and then adjusts the duration of the speech corresponding to each word according to these boundaries, inserting blank frames to lengthen a segment and deleting frames to shorten it. The modified spectrogram is fed into the duration-modification network, which fills the blank frames with appropriate speech content and smooths the frame-to-frame discontinuities caused by insertion and deletion. Finally, the vocoder converts the modified spectrogram back to the time domain and outputs the speech waveform. Experimental results show that the quality of speech whose duration is modified by the proposed system is comparable to that of speech generated by non-autoregressive TTS models and close to that of speech recorded by a real human.
In the past decades, synthesizing speech from text, also known as Text-to-Speech (TTS), has drawn great attention from researchers because it supports a wide variety of applications. One of the factors that affect the prosody of synthesized speech is the speed at which it is spoken. Most early TTS models are based on an autoregressive mechanism that generates speech frame by frame. However, these autoregressive TTS models have a major drawback: they lack the ability to control the duration of the synthesized speech. To give TTS models this control, many non-autoregressive TTS models have been proposed that explicitly model the duration of the synthesized speech. However, compared with training an autoregressive TTS model, training a non-autoregressive TTS model requires a huge amount of data and computing power. Therefore, in this thesis, a system for modifying the duration of speech synthesized by a TTS model is proposed. The proposed system consists of a forced aligner, a duration modifier, a neural network named ATM-Net, and a vocoder. The forced aligner in the proposed system is the Montreal Forced Aligner with a pretrained Mandarin model. Once the boundary of each word or phoneme is determined, the duration modifier lengthens or shortens speech segments by inserting dummy frames into, or removing frames from, a mel spectrogram. The modified mel spectrogram is then fed into ATM-Net, which fills in the audio content of the dummy frames and smooths the discontinuities between frames. Finally, a vocoder synthesizes the audio signal from the output mel spectrogram. Experiments show that the proposed system can modify the duration of speech and synthesize speech with natural prosody close to that of a real human.
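To make the duration-modification step concrete, the following is a minimal sketch in Python/NumPy of the general idea described above, not the thesis's actual implementation: the function name `modify_segment_duration`, the use of zero-valued dummy frames, the toy spectrogram, and the hard-coded boundaries are all assumptions for illustration. In the proposed system the boundaries would come from the Montreal Forced Aligner and the inserted dummy frames would be filled in by ATM-Net rather than left empty.

```python
# Illustrative sketch only: lengthen a word's mel-spectrogram segment by
# inserting dummy (zero) frames, or shorten it by dropping evenly spaced
# frames. Segment boundaries are assumed to come from a forced aligner.
import numpy as np

def modify_segment_duration(mel: np.ndarray, start: int, end: int,
                            rate: float) -> np.ndarray:
    """Rescale frames [start, end) of a (n_mels, T) mel spectrogram.

    rate > 1.0 lengthens the segment (dummy frames are inserted and would
    later be filled in by a refinement network such as ATM-Net);
    rate < 1.0 shortens it by removing frames.
    """
    segment = mel[:, start:end]
    n_frames = segment.shape[1]
    target = max(1, int(round(n_frames * rate)))

    if target >= n_frames:
        # Lengthen: spread dummy (zero) frames evenly across the segment.
        positions = np.linspace(0, n_frames, target - n_frames,
                                endpoint=False).astype(int)
        new_segment = np.insert(segment, positions, 0.0, axis=1)
    else:
        # Shorten: keep `target` evenly spaced frames, drop the rest.
        keep = np.linspace(0, n_frames - 1, target).astype(int)
        new_segment = segment[:, keep]

    return np.concatenate([mel[:, :start], new_segment, mel[:, end:]], axis=1)

# Hypothetical usage: stretch the word spanning frames 40-80 to 1.5x length.
mel = np.random.randn(80, 200)           # (n_mels, T) toy spectrogram
stretched = modify_segment_duration(mel, start=40, end=80, rate=1.5)
print(mel.shape, "->", stretched.shape)  # (80, 200) -> (80, 220)
```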