| Graduate Student: | Huang, Po-Chun (黃柏鈞) |
|---|---|
| Thesis Title: | Transcription of Violin Solo Recordings with CNN-BiLSTM (使用 CNN-BiLSTM 的小提琴獨奏錄音採譜) |
| Advisor: | Su, Wen-Yu (蘇文鈺) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2024 |
| Academic Year: | 112 (ROC calendar) |
| Language: | Chinese |
| Pages: | 68 |
| Keywords: | Automatic Music Transcription, violin, machine learning, CNN, BiLSTM, Mel spectrogram |
Automatic Music Transcription (AMT) converts raw audio signals into musical notation and is typically divided into two subtasks: onset/offset detection and note detection. Past research has focused primarily on AMT systems for solo piano performances or multi-instrument pieces, with few studies dedicated to violin music. This gap stems not only from the scarcity of violin music datasets but also from the unique characteristics of the violin as an instrument. Unlike many other instruments, a violin does not reach its peak energy immediately upon sound production; instead, it exhibits a distinctive gradual energy curve. It also involves complex playing characteristics such as polyphony and vibrato. Consequently, onset/offset detection accuracy for violin music remains unsatisfactory in current research.
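The gradual energy rise at a violin onset can be inspected directly from a recording. Below is a minimal sketch using the librosa library; the filename, sample rate, and analysis parameters are illustrative placeholders, not values taken from the thesis:

```python
import librosa
import numpy as np

# Load a violin recording (the filename is a placeholder).
y, sr = librosa.load("violin_solo.wav", sr=44100)

# Mel spectrogram: the input representation used by the model.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Onset strength envelope: for a bowed violin note this typically rises
# gradually rather than jumping to its peak, which is what makes onset
# detection harder than for percussive instruments.
onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512)
print(log_mel.shape, onset_env.shape)
```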
This thesis addresses the problem of music transcription for violin music. We propose an approach that combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory (BiLSTM) network for the transcription of violin solos. Our model takes Mel spectrograms as input, with the CNN-BiLSTM serving as the first stage of the full transcription process.
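The abstract does not spell out the exact architecture, so the following is only a hedged sketch of what a CNN-BiLSTM front end over Mel-spectrogram frames typically looks like, written in PyTorch with arbitrary layer sizes and pitch range; it is not the thesis's actual model:

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """Sketch of a CNN-BiLSTM acoustic model: a small CNN summarizes local
    time-frequency patterns in the Mel spectrogram, and a BiLSTM models
    temporal context in both directions. All sizes are illustrative."""

    def __init__(self, n_mels=128, n_pitches=88, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),              # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.rnn = nn.LSTM(64 * (n_mels // 4), hidden,
                           batch_first=True, bidirectional=True)
        # Per-frame, per-pitch activations (e.g., onset/frame probabilities).
        self.head = nn.Linear(2 * hidden, n_pitches)

    def forward(self, x):                        # x: (batch, time, n_mels)
        x = x.unsqueeze(1)                       # (batch, 1, time, n_mels)
        x = self.cnn(x)                          # (batch, 64, time, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)     # (batch, time, features)
        x, _ = self.rnn(x)
        return torch.sigmoid(self.head(x))       # (batch, time, n_pitches)

model = CNNBiLSTM()
frames = torch.randn(2, 400, 128)   # two 400-frame Mel-spectrogram clips
probs = model(frames)               # (2, 400, 88)
```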
To further enhance accuracy, we implement a post-processing technique that applies thresholding to the network's output. This step reduces the likelihood of misclassifications, thereby improving the overall reliability of the transcriptions. Our experimental results demonstrate a marked improvement in onset/offset detection accuracy.
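A hedged sketch of such thresholding post-processing follows; the threshold value and the minimum-duration rule are illustrative assumptions rather than the thesis's exact procedure:

```python
import numpy as np

def threshold_notes(probs, threshold=0.5, min_frames=3):
    """Binarize frame-level pitch activations and extract note segments.

    probs: (time, n_pitches) array of per-frame activation probabilities.
    Returns a list of (pitch_index, onset_frame, offset_frame) tuples,
    discarding segments shorter than `min_frames` to suppress spurious
    single-frame detections. Threshold and duration are assumed values.
    """
    active = probs >= threshold
    notes = []
    for pitch in range(active.shape[1]):
        col = active[:, pitch]
        # Rising (+1) and falling (-1) edges of the active mask.
        edges = np.diff(col.astype(int), prepend=0, append=0)
        onsets = np.flatnonzero(edges == 1)
        offsets = np.flatnonzero(edges == -1)
        for on, off in zip(onsets, offsets):
            if off - on >= min_frames:
                notes.append((pitch, on, off))
    return notes

# Example: random activations stand in for real model output.
notes = threshold_notes(np.random.rand(400, 88))
```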
Beyond these technical contributions, this thesis also aims to establish a comprehensive, dedicated dataset for violin music transcription, intended to serve as a valuable resource for future Music Information Retrieval (MIR) research.