Graduate student: | 邱晴瑜 Chiu, Ching-Yu |
---|---|
Thesis title: | Human-Inspired Methods for Music Information Retrieval (MIR): Taking Music Source Separation and Beat Tracking as Examples |
Advisor: | 蘇文鈺 Su, Alvin Wen-Yu |
Co-advisor: | 楊奕軒 Yang, Yi-Hsuan |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Multimedia System and Intelligent Computing Ph.D. Degree Program |
Year of publication: | 2023 |
Academic year of graduation: | 111 |
Language: | English |
Number of pages: | 97 |
Keywords (translated from Chinese): | music information retrieval, sound source separation, beat tracking, deep learning, data augmentation |
Keywords (English): | music information retrieval, blind music source separation, beat tracking, deep learning, data augmentation |
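Music information retrieval (MIR) is a discipline that covers all tasks aimed at extracting meaningful information from music; its broader goal is to help people search for, use, and create music. Although all of these activities involve human experience (for example, the perception of music, subjective judgments, and the creative techniques adopted), and although the nature of "music" and the characteristics of its data depend heavily on humans, human-related factors are rarely considered directly in the development of MIR techniques. As highly data-driven techniques such as deep learning become mainstream, rethinking how human-related factors affect the distribution of data characteristics in music, and how they can inspire the design and evaluation of retrieval models, naturally becomes the goal of this thesis. We take two of the most fundamental human music perception abilities, sound source separation and beat tracking, as examples, and explore how human-related factors can help improve existing MIR techniques in data augmentation (Chapter 2, 3), model development (Chapter 4, 5), and model analysis and evaluation (Chapter 6).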
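This thesis brings together five related representative publications. Chapter 2 [20] examines how the mixing-related procedures commonly used in music production affect the distribution of music data, and how they can be applied as data augmentation for the piano/violin source separation task. Chapter 3 [19] studies, for deep learning-based beat tracking models, how the composition of drum and non-drum sources in the training data affects the performance of the trained model on test data with different source compositions. Observing that humans tapping to a beat can freely shift their attention between drum and non-drum sounds as the music unfolds, we propose in Chapter 4 [23] an architecture that integrates a source separation module with beat tracking models, and that effectively improves performance on test data with various source compositions. In Chapter 5 [22], we look in detail at why existing beat tracking models perform poorly on expressive classical piano music. Observing that humans generally focus on current musical events and can therefore adapt to local tempo changes, we propose an approach that differs from today's mainstream methods, which rely on a global tempo assumption: it lets the model focus on the present through local periodicity and thus adapt better to tempo changes. In Chapter 6 [21], considering that metric-level switching (MLS) is a very common behavior when humans tap along to music, and to address both the inability of existing beat tracking evaluation metrics to reveal a model's MLS behavior and the analysis difficulties caused by the lack of MLS-related information, we propose a new method that analyzes more comprehensively how beat tracking models behave on music rich in tempo changes.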
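Taken together, the results of these five works show that human-related factors have a considerable influence on MIR techniques and can indeed help in their development. They are beneficial, and bring new insights, in data augmentation, in the development of deep learning models, and in the analysis of model behavior. We hope to keep exploring in this direction, and to broaden and deepen the connections and possibilities between humans and music.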
Music information retrieval (MIR) is a research field that gathers all tasks that aim at extracting meaningful information from music, with the hope of helping humans search, make use of, and create music more effectively. Although all of the above-mentioned activities are related to human experience (e.g., our perception or subjective definition of music, and the way we create music), and although the essence of music and the characteristic distribution of music data are highly dependent on humans, these human-related factors are rarely considered directly in the development of MIR methods. As highly data-driven techniques (i.e., deep learning) gradually become mainstream, it is natural to rethink how human-related factors affect the distribution of data characteristics in music, and how these factors can inspire the design, analysis, and evaluation of related retrieval methods. To pursue this goal, we take as examples two of the most fundamental human music perception abilities, music source separation and beat tracking, and investigate how human-related factors can help improve current MIR methods in data augmentation (Chapter 2, 3), model development (Chapter 4, 5), and model analysis and evaluation (Chapter 6).
In this thesis, we gather five of our related representative publications. Chapter 2 describes our investigation of how the mixing-related techniques commonly adopted in the music production process influence the characteristic distribution of music data, and how the findings can be used as data augmentation techniques to improve the violin/piano sound source separation task. In Chapter 3, we investigate, for deep learning-based beat trackers, how the composition of drum/non-drum sound sources in the training data influences the performance of the trained models on testing data with different sound source compositions. Observing that humans can freely and adaptively switch their attention between drum and non-drum sound sources, we propose in Chapter 4 an ensemble architecture that incorporates a sound source separation module and beat trackers to effectively improve model performance on testing data with different sound source compositions. In Chapter 5, we investigate the underlying reasons why existing beat trackers fail to track expressive classical piano music. Observing that humans generally focus more on local, current musical events and are thus able to adapt to local tempo changes, we propose a local periodicity-based method that endows beat trackers with human-like local temporal expectations. Unlike the mainstream methods, which are based on global tempo assumptions, our local periodicity-based methods can focus on local events and adapt well to local tempo changes. In Chapter 6, considering that metric-level switching (MLS) behaviors are common when humans tap along to music, and that existing evaluation metrics cannot reveal any information about beat trackers' MLS behaviors, we propose an analysis method that comprehensively reveals the MLS-related behaviors of models and facilitates the related model analysis and evaluation.
Combining the results of these five works, we can see that human-related factors have a non-negligible influence on music information retrieval technology, and can indeed effectively help the development of related technologies. They are helpful and bring new insights in data augmentation, deep learning model development, and model behavior analysis. In the future, we will continue to explore in this direction, and to expand and deepen the connections and possibilities between humans and music.
[1] The Mazurka Project, 2010.
[2] J.P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M.B. Sandler. A tutorial on onset detection in music signals. IEEE Trans. Audio, Speech, and Language Process., 13(5):1035–1047, 2005.
[3] E. Benetos et al. Automatic music transcription: An overview. IEEE Signal Processing Magazine, 36(1):20–30, 2019.
[4] Rachel Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. Proc. International Society for Music Information Retrieval Conference, pages 155–160, 2014.
[5] Rachel M Bittner, Eric Humphrey, and Juan P Bello. Pysox: Leveraging the audio signal processing power of Sox in Python. Proc. International Society for Music Information Retrieval Conference, pages 4–6, 2016.
[6] Rachel M Bittner, Julia Wilkins, Hanna Yip, and Juan P Bello. MedleyDB 2.0: New data and a system for sustainable data collection. Proc. International Society for Music Information Retrieval Conference, pages 2–4, 2016.
[7] Sebastian Böck and Matthew E. P. Davies. Deconstruct, analyse, reconstruct: How to improve tempo, beat, and downbeat estimation. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 574–582, 2020.
[8] Sebastian Böck, Matthew E. P. Davies, and Peter Knees. Multi-task learning of tempo and beat: Learning one to improve the other. Proc. Int. Soc. Music Inf. Retr. Conf., pages 486–493, 2019.
[9] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. Madmom: A new Python audio and music signal processing library. In Proc. ACM Multimed. Conf., pages 1174–1178, 2016.
[10] Sebastian Böck, Florian Krebs, and Gerhard Widmer. A multi-model approach to beat tracking considering heterogeneous music styles. Proc. Int. Soc. Music Inf. Retr. Conf., pages 603–608, 2014.
[11] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Joint beat and downbeat tracking with recurrent neural networks. Proc. Int. Soc. Music Inf. Retr. Conf., pages 255–261, 2016.
[12] Fleur L. Bouwer, Henkjan Honing, and Heleen A. Slagter. Beat-based and memory-based temporal expectations in rhythm: Similar perceptual effects, different underlying mechanisms. J. Cognitive Neuroscience, 32(7):1221–1241, 2020.
[13] G. Burloiu. An online tempo tracker for automatic accompaniment based on audio-to-audio alignment and beat tracking. In Proc. Sound and Music Computing Conf., pages 93–98, 2016.
[14] E. Cano, F. M. Angel, G. A. L. Gil, J. R. Zapata, A. Escamilla, J. F. A. Londoño, and M. B. Pelaez. Sesquialtera in the Colombian Bambuco: Perception and estimation of beat and meter. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 409–415, 2020.
[15] Estefanía Cano, Fernando Mora-Ángel, Lopez Gil, Jose Zapata, Antonio Escamilla, Juan Alzate, and Moisés Betancur. Sesquialtera in the Colombian bambuco: Perception and estimation of beat and meter–extended version. Trans. of the Int. Soc. Music Inf. Retr., 4:248, 2021.
[16] Estefanía Cano, Fernando Ángel, Lopez Gil, Jose Zapata, Antonio Escamilla, Juan Alzate, and Moisés Betancur. Sesquialtera in the Colombian bambuco: Perception and estimation of beat and meter. In Proc. Int. Soc. Music Inf. Retr. Conf., 2020.
[17] Tsung-Ping Chen and Li Su. Toward postprocessing-free neural networks for joint beat and downbeat estimation. In Proc. Int. Soc. Music Inf. Retr. Conf., 2022.
[18] E. Colin Cherry. Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25:975–979, 1953.
[19] Ching-Yu Chiu, Joann Ching, Wen-Yi Hsiao, Yu-Hua Chen, Alvin Wen-Yu Su, and Yi-Hsuan Yang. Source separation-based data augmentation for improved joint beat and downbeat tracking. In Proc. Eur. Signal Process. Conf., pages 391–395, 2021.
[20] Ching-Yu Chiu, Wen-Yi Hsiao, Yin-Cheng Yeh, Yi-Hsuan Yang, and Alvin W. Y. Su. Mixing-specific data augmentation techniques for improved blind violin/piano source separation. In Proc. IEEE Int. Workshop on Multimedia Signal Processing, 2020.
[21] Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, and Yi-Hsuan Yang. An analysis method for metric-level switching in beat tracking. IEEE Signal Processing Letters, 29:2153–2157, 2022.
[22] Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, and Yi-Hsuan Yang. Local periodicity-based beat tracking for expressive classical piano music. IEEE/ACM Trans. Audio, Speech, and Language Process., 2022. Under review.
[23] Ching-Yu Chiu, Alvin Wen-Yu Su, and Yi-Hsuan Yang. Drum-aware ensemble architecture for improved joint musical beat and downbeat tracking. IEEE Signal Processing Letters, 2021.
[24] Andy Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204, 2013.
[25] Matthew E. P. Davies and Sebastian Böck. Temporal convolutional networks for musical audio beat tracking. In Proc. Eur. Signal Process. Conf., pages 1–5, 2019.
[26] Matthew E. P. Davies and Sebastian Böck. Evaluating the evaluation measures for beat tracking. In Proc. Int. Soc. Music Inf. Retr. Conf., 2014.
[27] Matthew E. P. Davies, Sebastian Böck, and Magdalena Fuentes. Tempo, beat and downbeat estimation. Proc. Int. Soc. Music Inf. Retr. Conf., 2021.
[28] Matthew E. P. Davies, Norberto Degara, and Mark D. Plumbley. Measuring the performance of beat tracking algorithms using a beat error histogram. IEEE Signal Processing Letters, 18(3):157–160, 2011.
[29] Matthew E. P. Davies and Mark Plumbley. Context-dependent beat tracking of musical audio. IEEE Trans. Audio, Speech, and Language Process., 15:1009–1020, 2007.
[30] Matthew E.P. Davies, Norberto Degara, and Mark D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. In Queen Mary University of London, Centre for Digital Music, Tech. Rep. C4DM-TR-09-06, 2009.
[31] T. de Clercq and D. Temperley. A corpus analysis of rock harmony. Popular Music, 30(1):47–70, 2011.
[32] Peter Desain and Henkjan Honing. Tempo curves considered harmful. Contemporary Music Review, 7:123–138, 1993.
[33] S Dixon and E Cambouropoulos. Beat tracking with musical knowledge. In Proc. Eur. Conf. on Artificial Intelligence, pages 626–630, 2000.
[34] Simon Dixon. An empirical comparison of tempo trackers. In Proc. of Brazilian Symposium on Computer Music, pages 832–840, 2001.
[35] J. Stephen Downie. Music information retrieval. Annu. Rev. Inf. Sci. Technol., 37(1):295–340, 2003.
[36] S. Durand, J. P. Bello, B. David, and G. Richard. Robust downbeat tracking using an ensemble of convolutional networks. IEEE/ACM Trans. Audio, Speech, and Language Processing, 25(1):76–89, 2017.
[37] Simon Durand, Juan P. Bello, Bertrand David, and Gael Richard. Downbeat tracking with multiple features and deep neural networks. Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 409–413, 2015.
[38] Simon Durand, Bertrand David, and Gael Richard. Enhancing downbeat detection when facing different music styles. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 3132–3136, 2014.
[39] Simon Durand and Slim Essid. Downbeat detection with conditional random fields and deep learned features. In Proc. Int. Soc. Music Inf. Retr. Conf., 2016.
[40] Daniel P.W. Ellis. Beat tracking by dynamic programming. J. New Music Res., 36(1):51–60, 2007.
[41] Hakan Erdogan and Takuya Yoshioka. Investigations on data augmentation and loss functions for deep learning based speech-background separation. Proc. INTERSPEECH, pages 3499–3503, 2018.
[42] F. Foscarin, A. McLeod, P. Rigaux, F. Jacquemard, and M. Sakai. ASAP: a dataset of aligned scores and performances for piano transcription. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 534–541, 2020.
[43] M. Fuentes, B. McFee, H. C. Crayencour, S. Essid, and J. P. Bello. Analysis of common design choices in deep learning systems for downbeat tracking. Proc. Int. Soc. Music Inf. Retr. Conf., pages 106–112, 2018.
[44] Magdalena Fuentes, Brian McFee, Hélène C. Crayencour, Slim Essid, and Juan Pablo Bello. A music structure informed downbeat tracking system using skip-chain conditional random fields and deep learning. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 481–485, 2019.
[45] B. D. Giorgi, M. Mauch, and M. Levy. Downbeat tracking with tempo-invariant convolutional neural networks. Proc. Int. Soc. Music Inf. Retr. Conf., pages 216–222, 2020.
[46] B. D. Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proc. Int. Workshop on Multidimensional Systems, pages 1–6, 2013.
[47] A. Gkiokas, V. Katsouros, G. Carayannis, and T. Stajylakis. Music tempo estimation and beat tracking by applying source separation and metrical relations. Proc. IEEE Int. Conf. Acoust. Speech Signal Process, pages 421–424, 2012.
[48] M. Goto. An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research, 30(2):159–171, 2001.
[49] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Popular, Classical, and Jazz music databases. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 287–288, 2002.
[50] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Music genre database and musical instrument sound database. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 229–230, 2003.
[51] Masataka Goto. An audio-based real-time beat tracking system for music with or without drum-sounds. J. New Music Research, 30(2):159–171, 2001.
[52] Masataka Goto and Yoichi Muraoka. A beat tracking system for acoustic signals of music. In Proc. ACM Int. Conf. Multimedia, pages 365–372, 1994.
[53] Masataka Goto and Yoichi Muraoka. Music understanding at the beat level real-time beat tracking for audio signals. Proc. Int. Joint Conf. Artificial Intelligence Workshop on Computational Auditory Scene Analysis, 1995.
[54] Masataka Goto and Yoichi Muraoka. Real-time beat tracking for drumless audio signals: Chord change detection for musical decisions. Speech Communication, 27(3):311–335, 1999.
[55] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms. IEEE Trans. Audio, Speech, and Language Process., 14(5):1832–1844, 2006.
[56] F. Gouyon, G. Widmer, X. Serra, and A. Flexer. Acoustic cues to beat induction: A machine learning perspective. Music Perception, 24(2):177–188, 2006.
[57] Fabien Gouyon. A computational approach to rhythm description - Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing. PhD thesis, Universitat Pompeu Fabra, 2006.
[58] Fabien Gouyon and Perfecto Herrera. Determination of the meter of musical audio signals: Seeking recurrences in beat segment descriptors. Audio Engineering Society Convention, 2003.
[59] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[60] H. Grohganz, M. Clausen, and M. Müller. Estimating musical time information from performed MIDI files. Proc. Int. Soc. Music Inf. Retr. Conf., pages 35–40, 2014.
[61] Peter Grosche and Meinard Müller. Extracting predominant local pulse information from music recordings. IEEE Trans. Audio, Speech Lang. Process., 19(6):1688–1701, 2011.
[62] Peter Grosche, Meinard Müller, and Frank Kurth. Cyclic tempogram–a mid-level tempo representation for music signals. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 5522–5525, 2010.
[63] Peter Grosche, Meinard Müller, and Craig Stuart Sapp. What makes beat tracking difficult? A case study on Chopin Mazurkas. Proc. Int. Soc. Music Inf. Retr. Conf., pages 649–654, 2010.
[64] Siddharth Gururani and Alexander Lerch. Mixing Secrets: A multi-track dataset for instrument recognition in polyphonic music. In International Society for Music Information Retrieval Conference, Late-breaking paper, 2017.
[65] Curtis Hawthorne et al. Enabling factorized piano music modeling and generation with the MAESTRO dataset. Proc. International Conference on Learning Representations, pages 1–12, 2019.
[66] Simon Haykin and Zhe Chen. The cocktail party problem. Neural Comput., 17(9):1875–1902, 2005.
[67] Mojtaba Heydari, Frank Cwitkowitz, and Zhiyao Duan. BeatNet: CRNN and particle filtering for online joint beat downbeat and meter tracking. In Proc. Int. Soc. Music Inf. Retr. Conf., 2021.
[68] Mojtaba Heydari and Zhiyao Duan. Don't look back: An online beat tracking method using RNN and enhanced particle filtering. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2021.
[69] Jason Hockman, Matthew E. P. Davies, and Ichiro Fujinaga. One in the jungle: Downbeat detection in hardcore, jungle, and drum and bass. Proc. Int. Soc. Music Inf. Retr. Conf., pages 169–174, 2012.
[70] André Holzapfel, Matthew E. P. Davies, José R. Zapata, João Lobato Oliveira, and Fabien Gouyon. Selective sampling for beat tracking evaluation. IEEE Trans. Audio, Speech, and Language Process., 20(9):2539–2548, 2012.
[71] Tsung-Han Hsieh, Kai-Hsiang Cheng, Zhe-Cheng Fan, Yu-Ching Yang, and Yi-Hsuan Yang. Addressing the confounds of accompaniments in singer identification. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
[72] Yu-Siang Huang and Yi-Hsuan Yang. Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions. In Proc. ACM Int. Conf. Multimedia, pages 1180–1188, 2020.
[73] Yun-Ning Hung, Ju-Chiang Wang, Xuchen Song, Wei-Tsung Lu, and Minz Won. Modeling beats and downbeats with a time-frequency Transformer. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 401–405, 2022.
[74] Matej Istvanek, Zdenek Smekal, Lubomir Spurny, and Jiri Mekyska. Enhancement of conventional beat tracking system using Teager-Kaiser energy operator. Appl. Sci., 10(1):1–20, 2020.
[75] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In Proc. International Society for Music Information Retrieval Conference, 2017.
[76] Ping-Keng Jao, Li Su, Yi-Hsuan Yang, and Brendt Wohlberg. Monaural music source separation using convolutional sparse coding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[77] Bijue Jia, Jiancheng Lv, and Dayiheng Liu. Deep learning-based automatic downbeat tracking: a brief review. Multimedia Systems, 25(6):617–638, 2019.
[78] Diederik P. Kingma and Jimmy Lei Ba. Spleeter: A fast and state-of-the art music source separation tool with pre-trained models. In International Society for Music Information Retrieval Conference, Late-breaking paper, 2019.
[79] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. IEEE Trans. Audio, Speech, and Language Process., 6:3089–3092, 1999.
[80] Anssi Klapuri, Antti Eronen, and J. Astola. Analysis of the meter of acoustic musical signals. IEEE Trans. Audio, Speech, and Language Process., 14:342–355, 2006.
[81] F. Krebs, S. Böck, M. Dorfer, and G. Widmer. Downbeat tracking using beat-synchronous features and recurrent neural networks. Proc. Int. Soc. Music Inf. Retr. Conf., pages 129–135, 2016.
[82] F. Krebs, S. Böck, and G. Widmer. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In Proc. Int. Soc. Music Inf. Retr. Conf., 2013.
[83] Florian Krebs, Sebastian Böck, and Gerhard Widmer. An efficient state-space model for joint tempo and meter tracking. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 72–78, 2015.
[84] Jie Hwan Lee, Hyeong-Seok Choi, and Kyogu Lee. Audio query-based music source separation. In Proc. International Society for Music Information Retrieval Conference, pages 878–885, 2019.
[85] Jen-Yu Liu and Yi-Hsuan Yang. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. In Proc. IEEE Int. Conf. Machine Learning and Applications, 2018.
[86] Jen-Yu Liu and Yi-Hsuan Yang. Dilated convolution with dilated GRU for music source separation. In Proc. Int. Joint Conf. Artificial Intelligence, pages 4718–4724, 2019.
[87] Lele Liu, Veronica Morfi, and Emmanouil Benetos. ACPAS dataset: Aligned classical piano audio and score. In Demos and Late Breaking News of the Int. Soc. Music Inf. Retr. Conf., 2021.
[88] Y. Luo, Z. Chen, J. R. Hershey, J. L. Roux, and N. Mesgarani. Deep clustering and conventional networks for music separation: Stronger together. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 61–65, 2017.
[89] Yi Luo, Zhuo Chen, and Takuya Yoshioka. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2020.
[90] M. Macleod and S. Hainsworth. Particle filtering applied to musical tempo tracking. EURASIP Journal on Advances in Signal Processing, (927847), 2004.
[91] Brecht De Man. Towards a better understanding of mix engineering. PhD thesis, Queen Mary University of London, 2017.
[92] Ethan Manilow, Prem Seetharaman, and Bryan Pardo. Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 771–775, 2020.
[93] U. Marchand and G. Peeters. Swing ratio estimation. In Proc. Digital Audio Effects, Trondheim, Norway, 2015.
[94] Brian McFee and Daniel P.W. Ellis. Better beat tracking through robust onset aggregation. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 2154–2158, 2014.
[95] Brian McFee et al. Librosa: Audio and music signal analysis in Python. Proc. Python in Science Conference, pages 18–24, 2015.
[96] Martin McKinney and Dirk Moelants. Ambiguity in tempo perception: What draws listeners to different metrical levels? Music Perception, 24:155–166, 2006.
[97] Martin McKinney, Dirk Moelants, Matthew E. P. Davies, and Anssi Klapuri. Evaluation of audio beat tracking and music tempo extraction algorithms. J. New Music Research, 36:1–16, 2007.
[98] Peter Meier, Gerhard Krump, and Meinard Müller. A real-time beat tracking system based on predominant local pulse information. In Demos and Late Breaking News of the Int. Soc. Music Inf. Retr. Conf., 2021.
[99] Gabriel Meseguer-Brocal and Geoffroy Peeters. Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations. In Proc. International Society for Music Information Retrieval Conference, pages 159–165, 2019.
[100] Martin A Miguel and Diego Fernandez Slezak. Modeling beat uncertainty as a 2D distribution of period and phase: A MIR task proposal. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 452–459, 2021.
[101] Meinard Müller. Beat tracking by dynamic programming.
[102] Meinard Müller. Predominant local pulse (PLP).
[103] Meinard Müller and Vlora Arifi-Müller. Tempo and beat.
[104] Meinard Müller and Frank Zalkow. libfmp: A Python package for fundamentals of music processing. J. Open Source Software, 6(63):3326, 2021.
[105] Meinard Müller. Fundamentals of Music Processing – Using Python and Jupyter Notebooks. Springer Verlag, 2nd edition, 2021.
[106] Anna Nobre and Freek Ede. Anticipated moments: Temporal structure in attention. Nature Reviews Neuroscience, 19, 2017.
[107] Jonas Obleser, Molly Henry, and Peter Lakatos. What do we talk about when we talk about rhythm? PLOS Biology, 15, 2017.
[108] Takehisa Oyama, Ryoto Ishizuka, and Kazuyoshi Yoshii. Phase-aware joint beat and downbeat estimation based on periodicity of metrical structure. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 493–499, 2021.
[109] F. Pedersoli and M. Goto. Dance beat tracking from visual information alone. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 400–408, 2020.
[110] António S. Pinto, Sebastian Böck, Jaime S. Cardoso, and Matthew E. P. Davies. User-driven fine-tuning for beat tracking. Electronics, 10(13), 2021.
[111] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 367–372, 2014.
[112] Zafar Rafii, Antoine Liutkus, Fabian Robert Stoter, Stylianos Ioannis Mimilakis, Derry Fitzgerald, and Bryan Pardo. An overview of lead and accompaniment separation in music. IEEE/ACM Trans. Audio Speech Lang. Process., 26(8):1307–1335, 2018.
[113] Andrew Robertson, Adam Stark, and Matthew E. P. Davies. Percussive beat tracking using real-time median filtering. Int. Workshop on Machine Learning and Music, 2013.
[114] Craig Sapp. Hybrid numeric/rank similarity metrics for musical performance analysis. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 501–506, 2008.
[115] Markus Schedl, Emilia Gómez, and Julián Urbano. Music information retrieval: Recent developments and applications. Found. Trends Inf. Retr., 8(2-3):127–261, 2014.
[116] Jan Schlüter and Thomas Grill. Exploring data augmentation for improved singing voice detection with neural networks. Proc. International Society for Music Information Retrieval Conference, pages 121–126, 2015.
[117] Hendrik Schreiber and Meinard Müller. A single-step approach to musical tempo estimation using a convolutional neural network. Proc. Int. Soc. Music Inf. Retr. Conf., pages 98–105, 2018.
[118] Hendrik Schreiber, Julián Urbano, and Meinard Müller. Music tempo estimation: Are we done yet? Transactions of the International Society for Music Information Retrieval, 3(1):111–125, 2020.
[119] Hendrik Schreiber, Frank Zalkow, and Meinard Müller. Modeling and estimating local tempo: A case study on Chopin’s Mazurkas. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 773–779, 2020.
[120] Kentaro Shibata, Eita Nakamura, and Kazuyoshi Yoshii. Non-local musical statistics as guides for audio-to-score piano transcription. Information Sciences, 566:262–280, 2021.
[121] George Sioros, Marius Miron, Matthew E. P. Davies, Fabien Gouyon, and Guy Madison. Syncopation creates the sensation of groove in synthesized music examples. Frontiers in Psychology, 5:1036, 2014.
[122] A. Srinivasamurthy and X. Serra. A supervised approach to hierarchical metrical cycle tracking from audio music recordings. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 5217–5221, 2014.
[123] Adam Stark, Matthew E. P. Davies, and Mark Plumbley. Real-time beat-synchronous analysis of musical audio. In Proc. Digital Audio Effects, 2009.
[124] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Adversarial semi-supervised audio source separation applied to singing voice extraction. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2391–2395, 2018.
[125] Fabian-Robert Stöter, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji. Open-Unmix - A reference implementation for music source separation. Journal of Open Source Software, 4(41):1667, 2019.
[126] Hideyuki Tachibana, Yu Mizuno, Nobutaka Ono, and Shigeki Sagayama. A real-time audio-to-audio karaoke generation system for monaural recordings based on singing voice suppression and key conversion techniques. Journal of Information Processing, 24(3):470–482, 2016.
[127] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. Speech and Audio Processing, 10(5):293–302, 2002.
[128] Stefan Uhlich, Marcello Porcu, Franck Giron, Michael Enenkl, Thomas Kemp, Naoya Takahashi, and Yuki Mitsufuji. Improving music source separation based on deep neural networks through data augmentation and network blending. Proc. IEEE Int. Conf. Acoust. Speech Signal Process, pages 261–265, 2017.
[129] Ashish Vaswani et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[130] Len Vande Veire and Tijl De Bie. From raw audio to a seamless mix: Creating an automated DJ system for Drum and Bass. EURASIP Journal on Audio, Speech, and Music Processing, (13), 2018.
[131] E. Vincent, R. Gribonval, and C. Fevotte. Performance measurement in blind audio source separation. IEEE Trans. Audio, Speech, Language Processing, 14(4):1462–1469, July 2006.
[132] Pauli Virtanen et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.
[133] Nick Whiteley, A. Taylan Cemgil, and Simon Godsill. Sequential inference of rhythmic structure in musical audio. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., volume 4, pages 1321–1324, 2007.
[134] Nick Whiteley, Ali Cemgil, and S.J. Godsill. Bayesian modelling of temporal structure in musical audio. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 29–34, 2006.
[135] Yueh-Kao Wu, Ching-Yu Chiu, and Yi-Hsuan Yang. Jukedrummer: Conditional beat-aware audio-domain drum accompaniment generation via transformer VQ-VAE. In Proc. Int. Soc. Music Inf. Retr. Conf., 2022.
[136] Kazuhiko Yamamoto. Human-in-the-loop adaptation for interactive musical beat tracking. In Proc. Int. Soc. Music Inf. Retr. Conf., pages 794–801, 2021.
[137] Yi-Hsuan Yang. Low-rank representation of both singing voice and music accompaniment via learned dictionaries. In Proc. International Society for Music Information Retrieval Conference, 2013.
[138] Jose R. Zapata and Emilia Gomez. Using voice suppression algorithms to improve beat tracking in the presence of highly predominant vocals. Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pages 51–55, 2013.
[139] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. In Proc. Advances in Neural Information Processing Systems, pages 9597–9608, 2019.
[140] Ruohua Zhou, Marco Mattavelli, and Giorgio Zoia. Music onset detection based on resonator time frequency image. IEEE Trans. Audio, Speech, and Language Process., 16(8):1685–1695, 2008.