
Graduate Student: 林明賜 (Lin, Ming-Tsz)
Thesis Title: 利用時域與頻域分離注意機制於卷積神經網絡以優化環境音分類
(Optimizing Environmental Sound Classification with Temporal and Frequency Separable Attention in Convolutional Neural Networks Architectures)
Advisor: 蔡家齊 (Tsai, Chia-Chi)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 112 (ROC calendar)
Language: English
Number of Pages: 90
Chinese Keywords: 環境聲音分類、頻譜特徵提取、注意力機制、機器學習
English Keywords: Environmental Sound Classification, Spectrogram Feature Extraction, Attention Mechanisms, Machine Learning
    Environmental Sound Classification (ESC) is rapidly becoming a key research direction in audio processing and machine listening. The main objective of this study is to accurately identify and classify sounds from different environments by leveraging the strong capabilities of deep learning models traditionally applied to visual data. Successful deployment of ESC models holds great promise for real-world applications such as urban planning, wildlife monitoring, and smart home systems, in which the interpretation of environmental acoustics plays a key role. Although deep learning models have achieved remarkable success in image recognition tasks, adapting these vision-oriented architectures to audio processing poses unique challenges. Such models are usually trained on visual data, which differ fundamentally in structure and content from audio signals, so there is a pressing need for specialized modules that can effectively transfer the strengths of image-based neural networks to the auditory context. The introduction of time-frequency blocks is a key enhancement to the feature extraction process; they play a decisive role in capturing the complex temporal and spectral characteristics of environmental sounds. The strategic integration of attention mechanisms produces a more focused and discriminative feature representation, allowing the model to concentrate on the most critical elements of the soundscape and significantly improving the accuracy of sound event detection. In summary, the attention-based spectrogram feature extraction method proposed in this study provides a solid framework for the environmental sound classification problem. The method is not only academically novel but also shows great potential and value across a variety of practical application scenarios. Through carefully designed time and frequency blocks, together with the effective fusion of attention mechanisms, we can capture and classify environmental sounds more precisely, enabling deeper understanding and interaction. In addition, an in-depth ablation study of model configurations establishes the model architecture best suited to the ESC task, laying a solid foundation for further research and development.
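    The abstract centers on spectrogram feature extraction as the model's input representation. The thesis body is not reproduced in this record, so the actual preprocessing settings are unknown; the sketch below only illustrates, under assumed parameter values (sample rate, FFT size, and number of mel bins chosen purely for illustration), how a log-mel spectrogram is commonly prepared as CNN input for ESC using librosa.

```python
import numpy as np
import librosa


def log_mel_spectrogram(path, sr=22050, n_fft=1024, hop_length=512, n_mels=128):
    """Convert an audio clip into a log-scaled mel spectrogram.

    All parameter values here are illustrative defaults, not the
    settings used in the thesis.
    """
    y, _ = librosa.load(path, sr=sr, mono=True)                # resampled mono waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )                                                          # (n_mels, frames) power spectrogram
    return librosa.power_to_db(mel, ref=np.max)                # dB-scaled 2-D "image" for the CNN
```

    A 2-D representation of this kind is what allows image-style convolutional layers, and the attention blocks described below, to operate on audio.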

    Environmental Sound Classification (ESC) is rapidly emerging as a key area of research within the fields of audio processing and computational audition. The primary goal of this study is to accurately identify and categorize sounds from various environments by harnessing the powerful capabilities of deep learning models traditionally applied to visual data. The successful deployment of ESC models promises significant potential for real-world applications such as urban planning, wildlife monitoring, and intelligent home systems, where the interpretation of environmental acoustics plays an essential role. Despite the remarkable successes of deep learning models in image recognition tasks, adapting these visually-oriented architectures for audio processing presents unique challenges. Typically, models are trained on visual data, which inherently differ in structure and content from audio signals. Thus, there is an urgent need to develop specialized modules that can effectively translate the strengths of image-based neural networks to the auditory context. The implementation of time-frequency blocks is a critical enhancement to the feature extraction process. These blocks have been instrumental in capturing the complex temporal and spectral characteristics of environmental sounds, offering a comprehensive view of the acoustic landscape. The strategic integration of attention mechanisms has yielded a more focused and discerning feature representation. This innovation has enabled our model to concentrate on the most pertinent elements of the soundscape, significantly elevating the accuracy of sound event detection. In summary, the attention spectrogram feature extraction method proposed in this research provides a robust framework for addressing the challenges of environmental sound classification. This method is not only innovative academically but also demonstrates vast potential across various practical applications. Through the carefully designed time and frequency blocks, along with the effective integration of attention mechanisms, we have been able to more precisely capture and categorize sounds in the environment, thereby facilitating a deeper understanding and interaction. Moreover, the thorough ablation study for model configuration has been instrumental in establishing the most suitable model structure for ESC tasks, laying a solid foundation for further research and development.
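    The title and abstract name temporal and frequency separable attention as the core mechanism, but the record does not include the layer definitions themselves. The following PyTorch sketch is therefore only a hypothetical illustration of the general idea, attending over the time axis and the frequency axis of a spectrogram feature map separately; the module name, the pooling used to collapse each axis, and the element-wise combination of the two attention maps are assumptions for illustration, not the author's implementation.

```python
import torch
import torch.nn as nn


class TemporalFrequencyAttention(nn.Module):
    """Illustrative separable attention over a spectrogram feature map.

    Input x has shape (batch, channels, freq_bins, time_frames).
    Temporal attention re-weights each time frame; frequency attention
    re-weights each frequency bin. Combining the two by element-wise
    multiplication is an assumption, not the thesis design.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Attention over time: the frequency axis is averaged out first.
        self.temporal = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Attention over frequency: the time axis is averaged out first.
        self.frequency = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, T)
        t_weights = self.temporal(x.mean(dim=2))   # (B, C, T): pooled over frequency
        f_weights = self.frequency(x.mean(dim=3))  # (B, C, F): pooled over time
        x = x * t_weights.unsqueeze(2)             # re-weight every time frame
        x = x * f_weights.unsqueeze(3)             # re-weight every frequency bin
        return x


if __name__ == "__main__":
    feat = torch.randn(4, 64, 128, 250)            # (batch, channels, mel bins, frames)
    out = TemporalFrequencyAttention(channels=64)(feat)
    print(out.shape)                               # torch.Size([4, 64, 128, 250])
```

    In a full model, a block of this kind would typically sit between convolutional stages so that later layers receive a feature map re-weighted along both the temporal and the frequency axis.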

    摘要 (Abstract in Chinese)
    Abstract
    誌謝 (Acknowledgements)
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
        1.1 Motivation
        1.2 Thesis Organization
        1.3 Thesis Contribution
    Chapter 2. Related Work and Background
        2.1 Introduction to Environmental Sound Classification
        2.2 Audio Input Representation
            2.2.1 Raw Audio
            2.2.2 Spectrogram
            2.2.3 Advanced Spectrogram Feature Extraction
        2.3 Machine Learning Based Classifier
            2.3.1 Support Vector Machine
            2.3.2 1D-CNN
            2.3.3 2D-CNN
        2.4 Attention Mechanism of Environmental Sound Classification
            2.4.1 Introduction to Attention
            2.4.2 Attention on Spectrogram
    Chapter 3. Problems Analysis and Design Methodology
        3.1 Analyzing the CNN Models in ESC
            3.1.1 Domain Discrepancy
            3.1.2 Data Scarcity in ESC through Transfer Learning
        3.2 Temporal and Frequency Separable Attention
            3.2.1 Temporal-wise and Frequency-wise Feature Block
            3.2.2 Attention Block
            3.2.3 Model
    Chapter 4. Experimental Results Comparison and Evaluation
        4.1 Experiment Setup
            4.1.1 Datasets
            4.1.2 Evaluation Metrics
        4.2 Experimental Results
            4.2.1 Ablation Study
            4.2.2 Comparison with Other Papers
    Chapter 5. Conclusion and Future Work
        5.1 Conclusion
        5.2 Future Work
    References

    Campus access: available to the public from 2029-01-29
    Off-campus access: available to the public from 2029-01-29
    The electronic thesis has not yet been authorized for public release; please consult the library catalog for the print copy.