| Graduate Student: | 朱士銓 Chu, Shih-Chuan |
|---|---|
| Thesis Title: | 邁向通用零樣本語音增強:從域內動態蒸餾到生成式多模型聯合約束 Towards Generalized Zero-Shot Speech Enhancement: From In-Domain Dynamic Distillation to Generative Multi-Model Joint Constraints |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Doctor |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of Publication: | 2026 |
| Academic Year of Graduation: | 114 |
| Language: | English |
| Number of Pages: | 110 |
| Chinese Keywords: | 語音強化、強化學習、知識蒸餾、品質導向的多步驟培訓、跨數據集、零樣本 |
| English Keywords: | Speech enhancement, reinforcement learning, knowledge distillation, quality-oriented multi-step training, cross-domain, zero-shot |
In variable and unpredictable real-world environments, speech signals are often corrupted by many kinds of noise, degrading speech quality and hurting the performance of downstream speech recognition and speech communication systems. Conventional speech enhancement methods mostly rely on fixed noise assumptions or large amounts of identically distributed training data, and therefore lack sufficient generalization ability in unseen environments. To overcome this limitation, this dissertation proposes three core techniques in sequence: a learning strategy built on an in-domain dynamic knowledge distillation mechanism, a cross-domain speech enhancement method based on a reinforcement-strategy codebook, and a zero-shot speech enhancement framework with scalable noise modeling. Starting from methods that require labeled, paired corpora, this work first improves in-domain speech quality, then improves cross-dataset enhancement, and finally removes the dependence on labeled paired corpora, moving toward a general zero-shot enhancement method that remains flexible and stable in unknown acoustic environments.
We begin with speech enhancement within a single domain. In view of the steadily growing computational demands of related methods, this work proposes a reinforcement-learning-oriented dynamic knowledge distillation strategy that adaptively adjusts the ratio between soft and hard targets, so that the student model obtains the most suitable learning target under different samples and noise conditions. This strategy effectively mitigates the optimal-teacher-selection problem of conventional knowledge distillation, and the proposed method maintains excellent enhancement quality even when the number of parameters is greatly reduced.
However, as the model is compressed further it loses a corresponding amount of generalization ability, and we recognized that training and testing within the same domain do not reflect real application scenarios. We therefore propose a cross-domain speech enhancement method based on a reinforcement-strategy codebook: multi-step training builds a frame-level codebook of the model's enhancement strategies, a sequence of codes then guides the model's enhancement strategy on cross-domain data offline, and an offline refinement loop further strengthens the result.
Finally, to address the performance degradation observed across datasets and in unknown noise environments, this work proposes a zero-shot speech enhancement framework with scalable noise modeling (EN-AZS). Built on the UnDiff architecture, it adopts the concept of multi-model joint constraints and reformulates it as scalable noise modeling, so that EN-AZS requires no paired data or additional training; thanks to its plug-and-play nature, it achieves stable enhancement under entirely new noise types, speaker characteristics, and signal-to-noise-ratio conditions. Experiments across multiple datasets show that the proposed method not only significantly improves speech-quality metrics but also exhibits strong generalization and zero-shot adaptability.
In summary, from dynamic knowledge distillation to zero-shot enhancement without paired corpora, this dissertation establishes a complete technical roadmap toward generalized speech enhancement, providing a more flexible, robust, and practical solution for deploying speech processing models in the real world.
In dynamic and unpredictable real-world environments, speech signals are frequently exposed to various types of noise, resulting in degraded speech quality and reduced performance in downstream speech recognition and communication systems. Traditional Speech Enhancement (SE) methods often rely on fixed noise assumptions or large amounts of identically distributed training data, which limits their generalization capability in unseen environments. This work addresses these limitations through three core techniques: a dynamic knowledge distillation strategy for in-domain learning, a cross-domain SE method based on an enhancement-strategy codebook, and a zero-shot enhancement framework with scalable noise modeling. Together, these techniques establish a progressive research path that begins with paired and labeled in-domain training, advances toward cross-dataset generalization, and ultimately removes the dependence on paired corpora to achieve a general zero-shot enhancement framework that remains flexible and stable in unseen acoustic conditions.
We first investigate SE within a single domain. Given the growing computational demands of existing approaches, we propose a reinforcement-learning-oriented dynamic knowledge distillation strategy called DL-KD. By adaptively adjusting the balance between soft and hard objectives, the student model obtains an optimal learning target under different sample and noise conditions. This approach effectively addresses the optimal teacher-selection issue commonly encountered in conventional knowledge distillation methods while maintaining high enhancement quality with significantly fewer parameters.
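To make the soft/hard balancing concrete, the sketch below shows a generic distillation loss in PyTorch that mixes a teacher-matching term with a clean-reference term through a weight `alpha`. In DL-KD this weight is chosen adaptively by a reinforcement-learning policy per sample and noise condition; here it is simply passed in, and the function name and tensor shapes are illustrative assumptions rather than the thesis implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, clean_target, alpha):
    """Blend a soft (teacher) target with a hard (clean-speech) target."""
    soft_loss = F.mse_loss(student_out, teacher_out)    # imitate the teacher's enhanced output
    hard_loss = F.mse_loss(student_out, clean_target)   # match the clean reference speech
    # alpha in [0, 1]; DL-KD would set it adaptively per sample/noise condition
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# toy usage on a batch of magnitude spectrograms [batch, freq, time]
student_out = torch.randn(4, 257, 100, requires_grad=True)
teacher_out = torch.randn(4, 257, 100)
clean_target = torch.randn(4, 257, 100)
loss = distillation_loss(student_out, teacher_out, clean_target, alpha=0.7)
loss.backward()
```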
However, aggressive model compression can weaken generalization capability. Moreover, training and testing within the same domain do not accurately reflect real-world deployment scenarios. To address this gap, we propose a cross-domain enhancement method based on a reinforcement strategy codebook. Through multi-stage training, the model constructs a codebook of frame-level enhancement strategies. These codes guide the offline enhancement process for cross-domain data, and the enhancement results can be further refined through additional offline iterations.
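As a rough illustration of the frame-level strategy codebook, the following sketch performs a VQ-style nearest-neighbour lookup that maps each noisy-frame embedding to a discrete code; the class name, dimensions, and the assumption that the retrieved code conditions the enhancer are hypothetical simplifications of the multi-stage procedure described above.

```python
import torch

class StrategyCodebook(torch.nn.Module):
    """Frame-level codebook: map each frame embedding to its nearest code."""
    def __init__(self, num_codes=64, dim=128):
        super().__init__()
        self.codes = torch.nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, frame_emb):                   # frame_emb: [frames, dim]
        dist = torch.cdist(frame_emb, self.codes)   # distance to every code
        idx = dist.argmin(dim=-1)                   # nearest code per frame
        return self.codes[idx], idx                 # quantized embeddings and code indices

codebook = StrategyCodebook()
frames = torch.randn(100, 128)       # embeddings of 100 noisy frames
quantized, codes = codebook(frames)  # `codes` would steer the per-frame enhancement strategy
```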
Finally, to mitigate performance degradation across datasets and in unseen noise environments, we introduce a zero-shot enhancement framework with scalable noise modeling called EN-AZS. Building on the UnDiff architecture, the proposed framework incorporates multi-model joint constraints and reformulates them into a scalable noise modeling mechanism. This design enables EN-AZS to maintain stable enhancement performance under new noise types, speaker characteristics, and signal-to-noise-ratio conditions without additional training, providing a plug-and-play solution. Extensive cross-dataset experiments demonstrate that the proposed method not only improves speech quality metrics but also exhibits strong generalization capability and zero-shot adaptability.
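The plug-and-play behaviour can be pictured as constraint-guided sampling from an unconditional diffusion model: at every reverse step, the gradient of a weighted sum of differentiable constraints (for instance, consistency of the current clean-speech estimate with the observed mixture under an assumed noise model) steers the sample. The update rule, the toy denoiser, and the single constraint below are hypothetical, heavily simplified stand-ins for EN-AZS, shown only to convey the mechanism.

```python
import torch

def guided_reverse_step(x_t, t, denoiser, constraints, weights, step=0.05, sigma=0.01):
    """One reverse-diffusion step steered by a joint, differentiable constraint."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                      # current estimate of the clean signal
    penalty = sum(w * c(x0_hat) for w, c in zip(weights, constraints))
    grad = torch.autograd.grad(penalty, x_t)[0]    # gradient of the joint constraint
    # pull the denoised estimate toward the constraints, then re-inject a little noise
    return (x0_hat - step * grad + sigma * torch.randn_like(x_t)).detach()

# toy usage on a 1-second, 16 kHz waveform
noisy = torch.randn(1, 16000)
denoiser = lambda x, t: 0.9 * x                        # stand-in for a trained unconditional model
fit_mixture = lambda x0: ((x0 - noisy) ** 2).mean()    # data-consistency term under an assumed noise model
x = torch.randn_like(noisy)
for t in reversed(range(10)):
    x = guided_reverse_step(x, t, denoiser, [fit_mixture], [1.0])
```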
In summary, this work presents a comprehensive technical roadmap for generalized speech enhancement, spanning in-domain dynamic knowledge distillation, codebook-guided cross-domain enhancement, and zero-shot enhancement without paired data. The proposed framework offers a flexible, robust, and practical solution for real-world speech processing applications.