| Author: | 譚至斌 Tan, Chih-Pin |
|---|---|
| Thesis title: | 使用Transformer類別深度學習模型於結構資訊相關的樂譜填空生成之應用 (Structure-Aware Music Score Infilling via Transformer-based Models) |
| Advisor: | 蘇文鈺 Su, Wen-Yu |
| Co-advisor: | 楊奕軒 Yang, Yi-Hsuan |
| Degree: | Master |
| Department: | 電機資訊學院 - 資訊工程學系 (College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering) |
| Year of publication: | 2022 |
| Academic year of graduation: | 110 (ROC calendar) |
| Language: | English |
| Pages: | 35 |
| Keywords (Chinese): | Transformer、樂譜生成、樂譜填寫、基於提示條件的生成 |
| Keywords (English): | Transformer, Music Generation, Music Infilling, Prompt-based Conditional Generation |
| Access statistics: | 141 views, 16 downloads |
Abstract (translated from Chinese): This thesis investigates the use of musical-section information in Transformer-derived models for music score infilling. Score infilling is a conditional generation problem: given past and future contexts, the model generates a musical sequence that fills the gap between them. Prior work on music infilling has systematically addressed the local smoothness of the generated result. However, prompt-based conditioning cannot guarantee the structural coherence and similarity of the output, i.e., whether the generated result fits the overall musical form and whether it resembles the corresponding musical section. We therefore propose a structure-conditioned approach: during training the model is explicitly required to refer to the provided section information, and a novel attention-selecting module is added to a Transformer-derived model so that generation can exploit the user-supplied section prompts more effectively, addressing the missing structural coherence and similarity in prompt-conditioned music generation. In our experiments we compare against prior work using similarity measures over melody, rhythm, and tonality, as well as a listening test with volunteers, to show that the proposed architecture makes more effective use of musical-section information.
Abstract (English): The purpose of this thesis is to apply music structure information to Transformer-based models for automatic music score generation. Among the many music score generation applications, we focus on music score infilling, i.e., generating a music sequence to fill the gap between given past and future contexts. Prior research has demonstrated that prompt-based conditioning approaches can achieve strong local smoothness among the past context, the future context, and the generated sequence. However, they cannot guarantee repetition and similarity corresponding to the structures of the musical context. We therefore propose a structure-based conditioning approach built on a Transformer-based model, Transformer-XL, which employs a novel attention-selecting module and explicitly makes the model refer to the given structure information during training, to address the loss of structural completeness. We report objective and subjective evaluations of the proposed models and of variants of conventional prompt-based baselines, including comparisons of melody, rhythm, and tonality as well as human listening tests, to show that our approach substantially improves the generation of pop music by efficiently taking advantage of music structure information.
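The attention-selecting idea described in the abstracts can be illustrated with a minimal sketch: per query position, attention is routed either to the keys that lie inside the user-provided structure segment or to the remaining local context. This is a hypothetical NumPy illustration under assumed names and shapes (`attention_select`, the boolean `gate`, single-head attention); it is not the thesis's actual module, which operates inside Transformer-XL.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, keep):
    # scaled dot-product attention restricted to keys where keep is True
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(keep, scores, -1e9)  # mask out disallowed keys
    return softmax(scores, axis=-1) @ v

def attention_select(q, k, v, in_structure, gate):
    """Per-query attention routing (hypothetical sketch).

    in_structure[j] -- True if key j belongs to the reference structure segment
    gate[i]         -- True: query i attends only inside the segment;
                       False: query i attends only to the local context
    """
    out = np.empty_like(q)
    for i in range(q.shape[0]):
        keep = in_structure if gate[i] else ~in_structure
        out[i] = attend(q[i:i + 1], k, v, keep[None, :])
    return out
```

In a trained model the gate would itself be predicted (or derived from the section labels in the token sequence), so that repeated sections copy material from their reference segment while transitional material attends to the surrounding context.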