
Graduate Student: Chiu, Hua-Ling (邱華苓)
Thesis Title: Speaker-Consistent Zero-shot Speech Separation with Plug-and-Play Diffusion Sampling (結合即插即用擴散模型採樣之語者一致性的零樣本語音分離方法)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Academic Year of Graduation: 113 (ROC calendar)
Language: English
Number of Pages: 62
Chinese Keywords: 語音分離 (Speech Separation), 擴散模型 (Diffusion Model), 後驗採樣 (Posterior Sampling), 無分類器引導 (Classifier-Free Guidance)
English Keywords: Speech Separation, Diffusion Model, Posterior Sampling, Classifier-Free Guidance

    In recent years, diffusion models have gradually become the mainstream approach for speech-related tasks, achieving remarkable progress. Speech separation refers to the task of isolating individual speaker signals from a mixture of multiple speech sources.
    However, most existing speech separation methods are constrained by the number of speakers mixed during training and fail to generalize, at inference time, to mixtures with more speakers than the training setup allows. Although a few zero-shot methods have attempted to address this limitation, the quality of their separated speech remains limited.
    In this work, we propose an optimized posterior sampling method that treats posterior inference as a set of independent subproblems, enabling better decoupling of the likelihood and the prior distribution, so that the separated samples more closely follow the true prior distribution of single-speaker signals. To further reduce speaker confusion, we design a loss function that encourages the feature vectors of different speakers to become orthogonal, minimizing inter-speaker similarity during separation. Additionally, we train a speaker conditioning module within our model to enhance speaker consistency in the separation outputs.
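    As an illustration of the inter-speaker orthogonality idea, the sketch below penalizes the pairwise cosine similarity between the speaker feature vectors of the separated sources. The PyTorch formulation, tensor shapes, and squared-similarity reduction are assumptions made for illustration, not the thesis's exact loss definition.

        # Hypothetical sketch of an inter-speaker orthogonality loss (not the
        # thesis's exact definition): drive the speaker feature vectors of the
        # separated sources toward mutual orthogonality.
        import torch
        import torch.nn.functional as F

        def inter_speaker_orthogonality_loss(speaker_embs: torch.Tensor) -> torch.Tensor:
            """speaker_embs: (S, D) tensor, one feature vector per separated source (S >= 2)."""
            embs = F.normalize(speaker_embs, dim=-1)                # unit-normalize each vector
            gram = embs @ embs.T                                    # (S, S) cosine-similarity matrix
            eye = torch.eye(gram.size(0), device=gram.device, dtype=gram.dtype)
            off_diag = gram - eye                                   # zero out self-similarities
            # Squared off-diagonal similarities vanish when the vectors are orthogonal.
            return (off_diag ** 2).sum() / (gram.size(0) * (gram.size(0) - 1))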
    We evaluate our method on the LibriTTS-R, VCTK, and LibriSpeech datasets, using SI-SNR as the evaluation metric. Compared with existing zero-shot baselines, our method achieves significant improvements in separation quality, with SI-SNR gains of 12.83 dB, 6.89 dB, and 11.23 dB on the three datasets, respectively, in the 2-Mix scenario; in the 3-Mix and 4-Mix scenarios, our system also shows clear improvements over the baseline. Moreover, our method performs comparably to supervised approaches in terms of separation quality. Unlike most existing methods, it does not require retraining when applied to mixtures with different numbers of speakers, offering greater flexibility and practicality for real-world applications.
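    For reference, SI-SNR can be computed as in the following minimal sketch, which follows the standard scale-invariant signal-to-noise ratio definition (project the estimate onto the target, then compare the target component against the residual). It is an illustrative reference implementation, not code from the thesis.

        # Standard SI-SNR computation in NumPy (illustrative reference implementation).
        import numpy as np

        def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
            """Scale-invariant SNR in dB between a separated estimate and its reference signal."""
            estimate = estimate - estimate.mean()                   # remove DC offset
            target = target - target.mean()
            # Projecting the estimate onto the target gives the scaled target component.
            s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
            e_noise = estimate - s_target                           # residual not explained by the target
            return float(10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps)))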

    Abstract (in Chinese) I
    Abstract III
    Acknowledgements V
    Contents VI
    List of Tables IX
    List of Figures X
    Chapter 1 Introduction 1
      1.1 Background 1
      1.2 Motivation 2
      1.3 Literature Review 3
        1.3.1 Discriminative Learning for Source Separation 3
        1.3.2 Score-based Diffusion Models 5
        1.3.3 Zero-shot Speech Separation with Diffusion Models 7
        1.3.4 Split Gibbs Sampling 8
        1.3.5 Classifier-Free Diffusion Guidance 9
      1.4 Problems 10
      1.5 Brief Description of Research Methods 12
    Chapter 2 Proposed Method 13
      2.1 Unconditional Speech Generator 14
        2.1.1 DDPM-IP 14
      2.2 PnP-SS 16
        2.2.1 Prior Step 16
        2.2.2 Likelihood Step 18
        2.2.3 Putting It All Together 20
        2.2.4 Comparison of Sampling Geometries with UnDiff 21
      2.3 Enhancing Speaker Consistency 22
        2.3.1 Inter-Speaker Orthogonality Loss 23
        2.3.2 Speaker Condition Module 27
    Chapter 3 Dataset 29
      3.1.1 LibriSpeech 29
      3.1.2 LibriTTS 30
      3.1.3 LibriTTS-R 32
      3.1.4 VCTK 33
    Chapter 4 Experimental Setup and Results 35
      4.1 Evaluation Metrics 35
        4.1.1 SI-SNR 35
        4.1.2 STOI 36
      4.2 Experiment Settings 36
        4.2.1 Data Preprocessing 36
        4.2.2 Unconditional Speech Generator 36
        4.2.3 Speaker Condition Module 38
        4.2.4 Diffusion Process 39
        4.2.5 Baseline 39
      4.3 Results 40
        4.3.1 Multiple Sampling per Input Sample 40
        4.3.2 Comparison with Baseline 41
        4.3.3 Ablation Study 45
    Chapter 5 Conclusion and Future Work 46
    References 47

