| Graduate Student: | 李韋宗 Lee, Wei-Tsung |
|---|---|
| Thesis Title: | 基於語音解耦的非監督式口音轉換 Unsupervised Accent Conversion Based on Speech Decomposition |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of Publication: | 2023 |
| Graduation Academic Year: | 111 |
| Language: | English |
| Pages: | 71 |
| Chinese Keywords: | 非監督式學習、語音解耦、口音轉換、語音轉換 |
| Foreign Keywords: | unsupervised learning, speech decomposition, accent conversion, voice conversion |
Accent conversion aims to transform non-native (L2) speech into native (L1) speech while preserving the original speaker's voice timbre and linguistic content. Unlike previous studies, the model proposed in this thesis does not require annotated data for training. Multiple encoders extract various fine-grained speech attributes, and accent-adversarial training separates the accent from the other attributes via speech decomposition; a multi-level decoder architecture then sequentially restores the speech attributes, ensuring stable synthesis. Experimental results confirm the limitations of previous accent conversion systems and show that, under unsupervised learning, the proposed model achieves performance comparable to the baseline. The proposed framework is robust and generalizes to completely unseen datasets, and human evaluations further validate its effectiveness. As an additional contribution, the framework can also perform conventional voice conversion, surpassing earlier voice conversion models and achieving performance comparable to the current state-of-the-art.
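The accent-adversarial training described above is commonly realized with a gradient reversal layer: an auxiliary classifier tries to predict the accent from an intermediate representation, while the reversed gradient pushes the encoder to strip accent cues from that representation. The PyTorch sketch below illustrates this general pattern only; the thesis does not publish code here, so `GradReverse`, `AccentClassifier`, and the training-step names are illustrative assumptions, not the author's actual implementation.

```python
# Minimal sketch of accent-adversarial disentanglement via gradient
# reversal (a common realization; assumed here, not taken from the thesis).
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the encoder; no grad for lambd.
        return -ctx.lambd * grad_output, None


def grl(x, lambd=1.0):
    return GradReverse.apply(x, lambd)


class AccentClassifier(nn.Module):
    """Hypothetical adversarial branch: predicts the accent label from
    a (batch, time, dim) encoder representation."""

    def __init__(self, dim, n_accents):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, n_accents),
        )

    def forward(self, h):
        # Mean-pool over time, then reverse gradients so the classifier
        # learns to detect the accent while the upstream encoder learns
        # an accent-invariant representation.
        return self.net(grl(h.mean(dim=1)))
```

A hypothetical training step would add the classifier's cross-entropy loss to the reconstruction loss, e.g. `loss = recon_loss + ce(accent_clf(content), accent_labels)`: the classifier is optimized to identify the accent, while the reversed gradient simultaneously drives the content encoder to remove accent information, which is the disentanglement effect the abstract refers to.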