
Author: Su, Tsai-Wei (蘇才維)
Title: Speech Enhancement using Reinforcement Learning-based Knowledge Distillation
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year: 109
Language: English
Pages: 81
Keywords: Speech enhancement, reinforcement learning, knowledge distillation, speech quality, deep neural network
Views: 137 / Downloads: 11

    Speech has been one of humanity's most important means of communication since ancient times. Moreover, as electronic products have proliferated in recent years, operating devices by voice has gradually gained public acceptance. The real world, however, is full of background noise, which not only disturbs people but also degrades machine performance, so reducing noise and improving audio quality through speech enhancement in noisy environments is very important. The best-performing speech enhancement methods today are deep neural network models, whose architectures usually incur high computational and memory costs at inference time.
    The main contribution of this thesis is therefore a speech enhancement training system that applies reinforcement learning-based knowledge distillation: through flexible knowledge distillation, the quality of speech enhanced by a down-sized model is improved. The system comprises three Wave-U-Net-based speech enhancement models and a policy gradient model. The overall process is as follows. First, the policy gradient model estimates the knowledge distillation learning ratio, and the student model performs knowledge distillation according to the current state. A reference model then helps the policy network compute the reward, and the policy network is trained according to that reward. In this way, the policy network learns throughout the speech enhancement training process, and this loop allows the student model to achieve better audio quality.
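    As a rough illustration of this loop, the sketch below shrinks each of the models to a single learnable gain on a noisy signal and uses a discrete softmax policy over candidate distillation ratios, updated with REINFORCE. The candidate ratio set, the L1 losses, and the reward shape (student loss improvement relative to the reference model) are our assumptions for illustration, not the thesis's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a clean signal, its noisy version, and a strong "teacher" output.
clean = rng.standard_normal(256)
noisy = clean + 0.5 * rng.standard_normal(256)
teacher_out = clean + 0.05 * rng.standard_normal(256)

# Each "model" is reduced to a single gain applied to the noisy signal.
student_g, reference_g = 0.0, 0.0

alphas = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # candidate KD learning ratios
logits = np.zeros_like(alphas)                   # softmax policy parameters

def loss(gain, target):
    return np.mean(np.abs(gain * noisy - target))

for step in range(200):
    # 1) The policy samples a KD learning ratio for the current state.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    a = rng.choice(len(alphas), p=p)
    alpha = alphas[a]

    # 2) The student takes a distillation step toward an alpha-blended target.
    target = alpha * teacher_out + (1 - alpha) * clean
    student_g -= 0.05 * np.mean(np.sign(student_g * noisy - target) * noisy)

    # 3) The reference model trains on the clean target only.
    reference_g -= 0.05 * np.mean(np.sign(reference_g * noisy - clean) * noisy)

    # 4) Reward: did distillation leave the student ahead of the reference?
    reward = loss(reference_g, clean) - loss(student_g, clean)

    # 5) REINFORCE update: raise the probability of rewarded ratios.
    logits += 0.1 * reward * (np.eye(len(alphas))[a] - p)
```

    In the full system the gains become Wave-U-Net parameters and the reward is computed from enhanced speech, but the control flow (sample ratio, distill, compare with reference, update policy) is the same.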
    In addition, this thesis adopts a different way of transferring knowledge during distillation. The traditional knowledge distillation learning ratio adjusts the proportion between the soft label and the hard label, which we call quantitative transfer. In our proposed method, the soft label and the hard label are instead assigned the roles of a main label and a guiding label: the guiding label assists the main label and guides the direction of correction, and the knowledge distillation learning ratio determines the weight of the guiding label. We call this directional transfer. The directional transfer method makes training more stable and yields better performance.
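    The contrast between the two schemes can be written out as losses. The sketch below is one plausible reading of the abstract, not the thesis's exact equations: in quantitative transfer the ratio alpha interpolates between the soft-label (teacher) and hard-label (clean) losses, while in directional transfer the hard label remains the main target and alpha only weights the guiding term. The function names and the choice of L1 distance are assumptions.

```python
import numpy as np

def l1(a, b):
    # Mean absolute error between two waveforms.
    return np.mean(np.abs(a - b))

def quantitative_kd_loss(student, teacher, clean, alpha):
    # Quantitative transfer: alpha directly trades off the soft-label
    # (teacher output) loss against the hard-label (clean speech) loss.
    return alpha * l1(student, teacher) + (1 - alpha) * l1(student, clean)

def directional_kd_loss(student, teacher, clean, alpha):
    # Directional transfer (assumed form): the hard label stays the main
    # target at full weight; the teacher acts only as a guiding label
    # whose influence alpha nudges the correction direction.
    return l1(student, clean) + alpha * l1(student, teacher)
```

    Under this reading, setting alpha to zero leaves the directional loss identical to plain supervised training, which may explain the reported stability.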
    In the experiments, we used TIMIT as the speech data and NOISEX-92 as the noise data, mixing them at signal-to-noise ratios (SNRs) of -7.5, -2.5, 2.5, and 7.5 dB. Combining quantitative knowledge transfer with reinforcement learning-based knowledge distillation improved PESQ by 10.41%, STOI by 13.51%, and SI-SDR by 5.3% over the model trained without knowledge distillation, and improved PESQ by 1.92%, STOI by 15.06%, and SI-SDR by 8.84% over the model trained with a fixed knowledge distillation ratio; every evaluation metric improved markedly. With the proposed method, the down-sized (25%) student model performs on par with, and even outperforms, the teacher model.
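    Mixing a speech signal with noise at a prescribed SNR, as in the experimental setup above, amounts to scaling the noise so the power ratio matches the target. A minimal sketch (the function name and the power-based SNR definition are our assumptions; the thesis does not give its mixing code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the target SNR in dB."""
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve SNR = 10*log10(P_speech / (gain**2 * P_noise)) for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

    Calling `mix_at_snr(speech, noise, -7.5)` and so on for each SNR in the list reproduces the four mixing conditions.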

    Contents:
    Abstract (Chinese); Abstract (English); Acknowledgements; Contents; List of Tables; List of Figures
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Literature Review
        1.3.1 Methods of Speech Enhancement
        1.3.2 Methods of Knowledge Distillation
        1.3.3 Methods of Reinforcement Learning
        1.3.4 Deep Neural Network
        1.3.5 Convolutional Neural Network
      1.4 Problems
      1.5 Proposed Method
    Chapter 2 System Framework
      2.1 Speech Enhancement Model
        2.1.1 Speech Enhancement Model based on Wave-U-Net
        2.1.2 Introduction to Each Block of the Speech Enhancement Model
        2.1.3 Pre-convolution and Post-convolution
        2.1.4 Down-sampling Convolution
        2.1.5 Up-sampling Convolution
        2.1.6 Bottleneck Convolution
        2.1.7 Deep Residual Network
        2.1.8 ReLU Activation Function
      2.2 Knowledge Distillation Learning Ratio Evaluation Model
        2.2.1 Policy Network Architecture
      2.3 Speech Enhancement System using Reinforcement Learning-based Knowledge Distillation
        2.3.1 The First Stage of System Training
        2.3.2 The Second Stage of System Training
        2.3.3 System Testing Phase
    Chapter 3 Experimental Results and Discussion
      3.1 Evaluation Metrics
        3.1.1 SI-SDR
        3.1.2 PESQ
        3.1.3 STOI
      3.2 Dataset
      3.3 Experimental Results and Discussion
        3.3.1 Result Analysis of the Size of the Speech Enhancement Model
        3.3.2 Result Analysis of Fixed Knowledge Distillation Learning Ratio
        3.3.3 Result Analysis of the Speech Enhancement System
        3.3.4 Analysis of Different Loss Functions in the Policy Network
        3.3.5 Result Analysis of Reinforcement Learning-based Knowledge Distillation
        3.3.6 Spectrograms of Reinforcement Learning-based Knowledge Distillation
    Chapter 4 Conclusion and Future Work
    References


    Full-text availability: on campus 2022-08-30; off campus 2022-08-30.