| Graduate student: | 阮青萍 Nguyen, Thanh Binh | 
|---|---|
| Thesis title: | 應用注意力機制增強具空間歧異性之多通道語音分離 Attention Mechanism to Improve Spatial Ambiguity in Multi-Channel Speech Separation | 
| Advisor: | 吳宗憲 Wu, Chung-Hsien | 
| Degree: | Master | 
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering | 
| Year of publication: | 2021 | 
| Academic year of graduation: | 109 | 
| Language: | English | 
| Number of pages: | 54 | 
| Keywords (Chinese): | 語者獨立的語音分離 、時域結構 、通道注意力 、時序卷積網絡 、多通道語音分離 、far-field condition 、空間歧義問題 | 
| Keywords (English): | speaker-independent speech separation, time-domain structures, channel attention, Temporal Convolution Network, multi-channel speech separation, far-field condition, spatial ambiguity problem | 
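Given the potential of time-domain speech separation for low latency, low computational cost, and high accuracy, this thesis studies speaker-independent time-domain speech separation. The time-domain approach offers an end-to-end framework with a waveform-in, waveform-out structure, which makes it applicable to small devices. Despite the success of single-channel methods, considerable work is still needed before they can be deployed reliably in complex real-world environments (also known as the far-field condition). A common solution to the far-field condition is to capture multi-channel signals with a structured microphone array. For speech applications such as speech recognition or speaker identification, multi-channel methods can improve the quality of the captured speech. By exploiting the differences between channels, features such as speaker location (called spatial features) can be used to improve the quality of a speech separation system.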
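Spatial features have been widely used in recent speech separation research, but they are limited by the precision of the location information, a limitation known as spatial ambiguity. With the aim of a balanced speech separation system, one of the latest multi-channel approaches [1] sets out to address spatial ambiguity and involves two location-related features: the spatial feature and the directional feature. For further analysis, we re-implement the work in [1] to reproduce the problem; the results of the re-implementation serve as the baseline for further evaluation.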
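To address the spatial ambiguity problem, this study proposes two attention mechanisms for the Temporal-Spatial Neural Filter (TSNF) system. First, an attention mechanism uses the source directions to identify cases in which the speakers are located close together and selects the corresponding features. Next, a channel attention mechanism is proposed for the merged features and the convolutional feature blocks in the temporal convolutional network. The method is evaluated on a multi-channel reverberant dataset built from the WSJ0-2mix dataset, with a Room Impulse Response generator used to simulate a real environment.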
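In the experiments, the re-implemented TSNF achieved a best result of 14.14 dB in scale-invariant signal-to-noise ratio (Si-SNR). Evaluated on the TSNF system with temporal, spectral, spatial, and angle features, the proposed method yielded an improvement of about 1.3 dB when the speakers are close together and 0.13 dB on average, with a decrease of 0.1 dB in other cases. The results thus show that, when the speakers are not close together, the proposed method performs slightly worse than the original method. This may be an issue that future work can improve.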
Time-domain speech separation has shown its potential for low latency, low computational cost, and high accuracy in speaker-independent speech separation. The time-domain approach provides an end-to-end framework with a waveform-in, waveform-out structure, which is suitable for small devices. Despite the success of the single-channel approach, considerable work is still required before it can be deployed in complex real-world environments, also known as the far-field condition. The common solution to the far-field condition is to use a multi-channel signal captured by a structured microphone array. The multi-channel approach can increase the captured speech quality for speech applications such as speech recognition and speaker identification. Differences between channels carry clues about speaker location; features derived from them, known as spatial features, can be used to enhance the quality of a speech separation system.
Spatial features have been widely used in recent speech separation research, but they become insufficient when the location information is ambiguous. This is known as the spatial ambiguity problem, which has to be solved to produce a balanced speech separation system. The problem still occurs in one of the latest multi-channel approaches [1], which involves two location-related features: the spatial feature and the directional feature. For further analysis, we re-implement the work in [1] to demonstrate this issue. The result of the re-implementation is used as the baseline for further evaluation.
To deal with the spatial ambiguity problem, this study proposes two attention mechanisms for the Temporal-Spatial Neural Filter (TSNF) system. First, an attention mechanism uses the source directions to identify the case in which the speakers are located close together and selects the features involved. Next, channel attention is proposed for the merged features and for the feature map of the conv1D block in the Temporal Convolutional Network. The proposed method is evaluated on a multi-channel reverberant dataset built from the popular WSJ0-2mix dataset. The dataset simulates a real room environment by using a Room Impulse Response generator.
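The idea of channel attention on a conv1D feature map can be illustrated with a minimal sketch. This is not the thesis's implementation: it is a simplified squeeze-and-excitation-style gate that uses average pooling only, whereas CBAM [41] also uses max pooling. The function name `channel_attention` and the weight matrices `w1` and `w2` are hypothetical; in a real model the weights would be learned.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features, w1, w2):
    """Re-weight the channels of a feature map (illustrative only).

    features: list of C channels, each a list of T values.
    w1: H x C weights of the bottleneck layer (hypothetical, normally learned).
    w2: C x H weights of the output layer (hypothetical, normally learned).
    Returns the re-weighted feature map and the per-channel gates.
    """
    # Squeeze: summarize each channel by its average over time.
    pooled = [sum(channel) / len(channel) for channel in features]
    # Excitation: small bottleneck MLP with a ReLU in the middle.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: multiply every time step of a channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, features)]
    return scaled, gates


# Toy example: 3 channels, 2 time steps, bottleneck of size 1.
features = [[1.0, 3.0], [2.0, 2.0], [0.0, 0.0]]
w1 = [[1.0, 0.0, 0.0]]
w2 = [[1.0], [0.0], [-1.0]]
scaled, gates = channel_attention(features, w1, w2)
```

Each gate lies strictly between 0 and 1, so the mechanism can only attenuate or preserve a channel relative to the others; it never changes the time resolution of the feature map.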
In the experimental results, the re-implemented TSNF achieved its highest result of 14.14 dB in the scale-invariant signal-to-noise ratio (Si-SNR) metric. The proposed methods produced an improvement of about 1.3 dB in the close-speaker case and 0.13 dB on average, with a small decrease of 0.1 dB in other cases. The evaluation was performed on the TSNF system with four features: temporal, spectral, spatial, and angle features. The results reveal a drawback of the attention mechanisms: a small decline in performance when the speakers' locations are unambiguous. This limitation should be considered in future work.
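For readers unfamiliar with the metric, the following is a minimal sketch of how Si-SNR is commonly computed for a single estimated signal against its reference, following the usual definition (zero-mean both signals, project the estimate onto the target, and compare the projected energy with the residual energy). The function name `si_snr` and the stabilizing constant `eps` are assumptions for illustration and are not taken from the thesis.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so that a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to obtain the scaled target part.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]
    # Whatever remains after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    signal_power = sum(s * s for s in s_target)
    noise_power = sum(v * v for v in e_noise) + eps
    return 10.0 * math.log10(signal_power / noise_power + eps)


# Toy example: rescaling the estimate leaves the score essentially unchanged.
target = [1.0, -1.0, 0.5, -0.5]
estimate = [1.1, -1.0, 0.4, -0.5]
score = si_snr(estimate, target)
score_rescaled = si_snr([2.0 * x for x in estimate], target)
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant gain does not change the score, which is the property that distinguishes Si-SNR from a plain SNR.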
References
[1]	R. Gu and Y. Zou, "Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation," arXiv preprint arXiv:2001.00391, 2020.
[2]	Y.-m. Qian, C. Weng, X.-k. Chang, S. Wang, and D. Yu, "Past review, current progress, and challenges ahead on the cocktail party problem," Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 40-63, 2018.
[3]	S. Haykin and Z. Chen, "The cocktail party problem," Neural computation, vol. 17, no. 9, pp. 1875-1902, 2005.
[4]	J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The fifth 'CHiME' speech separation and recognition challenge: dataset, task and baselines," arXiv preprint arXiv:1803.10609, 2018.
[5]	A. Glaser. "Google’s ability to understand language is nearly equivalent to humans." https://www.vox.com/2017/5/31/15720118/google-understand-language-speech-equivalent-humans-code-conference-mary-meeker (accessed 2021).
[6]	J. P. Bajorek. "Voice Recognition Still Has Significant Race and Gender Biases." https://hbr.org/2019/05/voice-recognition-still-has-significant-race-and-gender-biases (accessed 2020).
[7]	M. Cooke, J. R. Hershey, and S. J. Rennie, "Monaural speech separation and recognition challenge," Computer Speech & Language, vol. 24, no. 1, pp. 1-15, 2010.
[8]	C.-M. Yuan, X.-M. Sun, and H. Zhao, "Speech Separation Using Convolutional Neural Network and Attention Mechanism," Discrete Dynamics in Nature and Society, vol. 2020, 2020.
[9]	Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-Channel Multi-Speaker Separation using Deep Clustering," arXiv preprint arXiv:1607.02173, 2016. [Online]. Available: https://ui.adsabs.harvard.edu/abs/2016arXiv160702173I
[10]	L. Chen, M. Yu, D. Su, and D. Yu, "Multi-band pit and model integration for improved multi-channel speech separation," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: IEEE, pp. 705-709.
[11]	D. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech separation by humans and machines: Springer, 2005, pp. 181-197.
[12]	M. Wu, D. Wang, and G. J. Brown, "A multipitch tracking algorithm for noisy speech," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 229-241, 2003.
[13]	Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM transactions on audio, speech, and language processing, vol. 22, no. 12, pp. 1849-1858, 2014.
[14]	H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Deep recurrent networks for separation and recognition of single-channel speech in nonstationary background audio," in New Era for Robust Speech Recognition: Springer, 2017, pp. 165-186.
[15]	A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013: IEEE, pp. 7092-7096.
[16]	M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901-1913, 2017.
[17]	H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015: IEEE, pp. 708-712.
[18]	D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: IEEE, pp. 241-245.
[19]	S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, "Recurrent neural network based language modeling in meeting recognition," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[20]	Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017: IEEE, pp. 246-250.
[21]	N. Zeghidour and D. Grangier, "Wavesplit: End-to-end speech separation by speaker clustering," arXiv preprint arXiv:2002.08933, 2020.
[22]	Y. Liu and D. Wang, "Divide and conquer: A deep casa approach to talker-independent monaural speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2092-2102, 2019.
[23]	Y. Luo, E. Ceolini, C. Han, S.-C. Liu, and N. Mesgarani, "FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing," arXiv preprint arXiv:1909.13387, 2019. [Online]. Available: https://ui.adsabs.harvard.edu/abs/2019arXiv190913387L
[24]	Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, "End-to-end speech separation with unfolded iterative phase reconstruction," arXiv preprint arXiv:1804.10204, 2018.
[25]	D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236-243, 1984.
[26]	S. Choi, A. Cichocki, H.-M. Park, and S.-Y. Lee, "Blind source separation and independent component analysis: A review," Neural Information Processing-Letters and Reviews, vol. 6, no. 1, pp. 1-57, 2005.
[27]	K. Yoshii, R. Tomioka, D. Mochihashi, and M. Goto, "Beyond NMF: Time-Domain Audio Source Separation without Phase Reconstruction," in ISMIR, 2013, pp. 369-374.
[28]	Y. Luo and N. Mesgarani, "Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM transactions on audio, speech, and language processing, vol. 27, no. 8, pp. 1256-1266, 2019.
[29]	S. Venkataramani, J. Casebeer, and P. Smaragdis, "End-to-end source separation with adaptive front-ends," in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018: IEEE, pp. 684-688.
[30]	Y. Luo and N. Mesgarani, "Tasnet: time-domain audio separation network for real-time, single-channel speech separation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: IEEE, pp. 696-700.
[31]	C. Lea, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks: A unified approach to action segmentation," in European Conference on Computer Vision, 2016: Springer, pp. 47-54.
[32]	Z.-Q. Wang and D. Wang, "Combining spectral and spatial features for deep learning based blind speaker separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 457-468, 2018.
[33]	Z.-Q. Wang, J. Le Roux, and J. R. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: IEEE, pp. 1-5.
[34]	Z.-Q. Wang and D. Wang, "Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation," in Interspeech, 2018, pp. 2718-2722.
[35]	Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, "Multi-channel overlapped speech recognition with location guided speech extraction network," in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018: IEEE, pp. 558-565.
[36]	C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, "Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features," arXiv preprint arXiv:2002.01626, 2020.
[37]	B. Tolooshams, R. Giri, A. H. Song, U. Isik, and A. Krishnaswamy, "Channel-Attention Dense U-Net for Multichannel Speech Enhancement," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020: IEEE, pp. 836-840.
[38]	Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation," arXiv preprint arXiv:1809.07454, 2018. [Online]. Available: https://ui.adsabs.harvard.edu/abs/2018arXiv180907454L
[39]	G. Wichern and J. Le Roux, "Phase reconstruction with learned time-frequency representations for single-channel speech separation," in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), 2018: IEEE, pp. 396-400.
[40]	R. Gu et al., "End-to-end multi-channel speech separation," arXiv preprint arXiv:1905.06286, 2019.
[41]	S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3-19.
[42]	J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016: IEEE, pp. 31-35.
[43]	L. D. Consortium. "CSR-I (WSJ0) Complete." https://catalog.ldc.upenn.edu/LDC93S6A (accessed 2020).
[44]	Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020: IEEE, pp. 46-50.
[45]	J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small‐room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, 1979.
[46]	ehabets. "RIR-Generator." https://github.com/ehabets/RIR-Generator (accessed 2020).
[47]	F. Bahmaninezhad, S.-X. Zhang, Y. Xu, M. Yu, J. H. Hansen, and D. Yu, "A Unified Framework for Speech Separation," arXiv preprint arXiv:1912.07814, 2019.
[48]	J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR–half-baked or well done?," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: IEEE, pp. 626-630.