| Graduate Student: | 劉益呈 (Liu, Yi-Cheng) |
|---|---|
| Thesis Title: | 運用神經網路於老式錄音的語音增強研究 (Speech Enhancement by Neural Network for Old Voice Recording) |
| Advisor: | 郭致宏 (Kuo, Chih-Hung) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2020 |
| Graduation Academic Year: | 109 (ROC calendar) |
| Language: | Chinese |
| Pages: | 59 |
| Chinese Keywords (translated): | speech noise reduction, old voice recordings, generative adversarial network, click noise |
| English Keywords: | Speech enhancement, generative adversarial network, old voice recording |
Most neural networks used for speech enhancement focus only on noise reduction and ignore the distortion introduced when noise is removed, where distortion means that speech information has been destroyed. This thesis therefore proposes a preprocessing step and a repair-denoise network (RDN), applied to old voice recordings. The preprocessing step silences click noise and the parts of the signal that are severely corrupted by noise. The RDN consists of two networks, a repair network and a denoise network. The repair network uses a time-domain generative adversarial network (TGAN) to restore the speech gaps created by the preprocessing step; the denoise network uses an advanced two-stage network (aTSN) together with a frequency-domain generative adversarial network (FGAN) to remove background noise, where the strong generative capability of the FGAN repairs the distortion the aTSN introduces while denoising. Most denoising architectures remove noise well but cause considerable distortion in return, and for old recordings the distortion is often aggravated because no training dataset is available in the same language as the recording. By exploiting the generative capability of GANs, this distortion can be repaired to obtain results closer to the human voice. Experimental results show that the proposed method retains high speech quality with less distortion and outperforms other denoising methods in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
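To make the preprocessing idea concrete, the following is a minimal sketch, assuming a simple frame-energy threshold detector for clicks; the frame length, threshold factor, and function name are illustrative assumptions rather than values taken from the thesis.

```python
import numpy as np

def silence_clicks(wave: np.ndarray, frame_len: int = 64, k: float = 6.0) -> np.ndarray:
    """Hypothetical declick sketch: zero out frames whose short-time energy
    jumps far above the median energy, leaving gaps for the repair network.
    frame_len and k are assumed values, not parameters from the thesis."""
    n = len(wave) // frame_len * frame_len
    frames = wave[:n].copy().reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1)
    clicked = energy > k * np.median(energy)   # frames dominated by a click
    frames[clicked] = 0.0                      # silence them for later repair
    out = wave.copy()
    out[:n] = frames.reshape(-1)
    return out
```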
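The repair-then-denoise ordering of the RDN can likewise be sketched as a chain of modules. The three sub-networks below are placeholders; the thesis's actual TGAN, aTSN, and FGAN architectures are not reproduced here, and the STFT/iSTFT plumbing around the frequency-domain stage is omitted.

```python
import torch
import torch.nn as nn

class RDN(nn.Module):
    """Minimal sketch of the repair-denoise ordering described in the abstract.
    The sub-networks are injected as placeholders, not the thesis's designs."""
    def __init__(self, repair: nn.Module, denoise: nn.Module, refine: nn.Module):
        super().__init__()
        self.repair = repair    # TGAN generator: fills the silenced gaps (time domain)
        self.denoise = denoise  # aTSN: removes background noise
        self.refine = refine    # FGAN generator: repairs aTSN-induced distortion
                                # (applied in the frequency domain in the thesis)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Preprocessed waveform in, enhanced waveform out.
        return self.refine(self.denoise(self.repair(x)))
```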
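Finally, the reported STOI and PESQ scores can be computed with common open-source packages. This sketch assumes the third-party `soundfile`, `pystoi`, and `pesq` packages and hypothetical file names; these implementations are not necessarily the ones used in the thesis.

```python
# pip install soundfile pystoi pesq
import soundfile as sf
from pystoi import stoi
from pesq import pesq

clean, fs = sf.read("clean.wav")        # reference signal (hypothetical file)
enhanced, _ = sf.read("enhanced.wav")   # output of the enhancement pipeline

# PESQ accepts only 8 kHz ('nb') or 16 kHz ('nb'/'wb') sampling rates.
print("STOI:", stoi(clean, enhanced, fs, extended=False))  # 0..1, higher is better
print("PESQ:", pesq(fs, clean, enhanced, "wb"))            # MOS-LQO, higher is better
```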