
Graduate Student: Chang, Yu-Chen (張瑜真)
Thesis Title: Semi-supervised Many-to-many Music Timbre Transfer (基於半監督式學習的多對多域之間樂器音色轉換)
Advisors: Chu, Wei-Ta (朱威達); Hu, Min-Chun (胡敏君)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108 (2019–2020)
Language: English
Number of Pages: 44
Keywords (Chinese): 樂器音色轉換、自編碼、三元組神經網路、半監督式學習
Keywords (English): Music Timbre Transfer, Auto-Encoder, Triplet Network, Semi-supervised
Abstract (Chinese): The problem of music timbre transfer can be defined as converting the timbre of a music clip into the instrument timbre of another target clip while preserving the semantic musical content (notes and melody). Although research on timbre transfer has flourished in recent years, many-to-many timbre transfer between different instruments is still being explored. In this work, we study the feasibility of many-to-many instrument timbre transfer based on an autoencoder framework, which consists of two pretrained encoders and one decoder trained in an unsupervised manner. To let the pretrained encoders learn a latent space that better captures timbre and musical content, we generate a parallel dataset from MIDI data and a digital audio workstation for pretraining.

We evaluate, both objectively and subjectively, how well the notes and melody are preserved after transfer and how successful the timbre transfer is. We use fundamental-frequency (melody) consistency and the timbre-transfer success rate to objectively assess the transferred results and the model's performance. For the subjective evaluation, we recruited 78 participants to fill out a questionnaire and rate the transferred audio clips. The experimental results show that our model outperforms the many-to-many architecture originally applied to voice conversion, and the objective and subjective evaluations together validate the performance of the proposed model.

Beyond these evaluations, we also verify that, with a similarity-based triplet network architecture, the content encoder can learn a content-meaningful latent space from our collected parallel dataset, which carries no manual annotations of musical content, helping the transferred results preserve the notes and melody of the original clip more precisely. We further assume that producing and collecting such a parallel dataset still incurs a nontrivial cost. To make the proposed model more flexible for future applications, we show that the encoders can still learn a musically meaningful latent space when pretrained on a smaller parallel dataset, and that many-to-many instrument timbre transfer can then be achieved through semi-supervised learning.

Abstract (English): This work presents a music timbre transfer model that aims to transfer the style of a music clip while preserving its semantic content. Compared with existing music timbre transfer models, our model can achieve many-to-many timbre transfer between different instruments. The proposed method is based on an autoencoder framework, which comprises two pretrained encoders and one decoder trained in an unsupervised manner. To learn more representative features for the encoders, we produced a parallel dataset, called MI-Para, synthesized from MIDI files and digital audio workstations.
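
A minimal PyTorch sketch of the two-encoder, one-decoder framework described above may help make the data flow concrete. All module names, layer sizes, and the choice of 1-D convolutions over mel-spectrogram frames are illustrative assumptions, not the thesis's actual architecture (see Chapter 4 of the thesis).

```python
# Illustrative sketch only: a content encoder, a style (timbre) encoder,
# and a decoder that recombines content frames with a style vector.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes a spectrogram into a frame-wise content (note/melody) code."""
    def __init__(self, n_mels=80, content_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, content_dim, kernel_size=5, padding=2),
        )

    def forward(self, spec):               # spec: (batch, n_mels, frames)
        return self.net(spec)              # (batch, content_dim, frames)

class StyleEncoder(nn.Module):
    """Encodes a spectrogram into a single timbre (instrument) vector."""
    def __init__(self, n_mels=80, style_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, style_dim, kernel_size=5, padding=2),
        )

    def forward(self, spec):
        return self.net(spec).mean(dim=2)  # temporal average pooling -> (batch, style_dim)

class Decoder(nn.Module):
    """Reconstructs a spectrogram from content frames plus a style vector."""
    def __init__(self, n_mels=80, content_dim=64, style_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(content_dim + style_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, content, style):
        style = style.unsqueeze(2).expand(-1, -1, content.size(2))
        return self.net(torch.cat([content, style], dim=1))

# Timbre transfer: content from the source clip, style from the target clip.
content_enc, style_enc, dec = ContentEncoder(), StyleEncoder(), Decoder()
source = torch.randn(1, 80, 256)           # mel spectrogram of the source clip
target = torch.randn(1, 80, 256)           # mel spectrogram of the target instrument
transferred = dec(content_enc(source), style_enc(target))
```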

We evaluated the content preservation and the success of style transfer both objectively and subjectively. F0 consistency and hit rate were used to objectively evaluate the transferred outputs. For the subjective evaluation, we recruited 78 subjects for a listening test in which they scored the transferred audio. Our model outperforms the architecture proposed for many-to-many voice conversion, and through these evaluations we validated the effectiveness of the proposed framework.
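
As a rough illustration of how the F0 (melody) consistency of a transferred clip could be checked, the following sketch extracts F0 contours with librosa's pyin and scores them with mir_eval's raw pitch accuracy. The choice of pyin, the pitch range, and the default 50-cent tolerance are assumptions; the thesis's exact metric is defined in Section 5.2.1.

```python
# Hedged sketch: compare the F0 contour of the source clip with that of the
# transferred clip and report the fraction of matching voiced frames.
import librosa
import mir_eval
import numpy as np

def f0_contour(path, sr=22050):
    """Extract an F0 contour (Hz, 0 for unvoiced frames) and its time axis."""
    y, sr = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = np.where(voiced, f0, 0.0)          # mir_eval treats 0 Hz as unvoiced
    times = librosa.times_like(f0, sr=sr)
    return times, np.nan_to_num(f0)

def f0_consistency(source_path, transferred_path):
    """Fraction of voiced frames whose pitch stays within the tolerance after transfer."""
    ref_t, ref_f = f0_contour(source_path)
    est_t, est_f = f0_contour(transferred_path)
    ref_v, ref_c, est_v, est_c = mir_eval.melody.to_cent_voicing(
        ref_t, ref_f, est_t, est_f
    )
    return mir_eval.melody.raw_pitch_accuracy(ref_v, ref_c, est_v, est_c)

# Example usage (file names are placeholders):
# print(f0_consistency("source_piano.wav", "transferred_guitar.wav"))
```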

In addition to the performance measurement, we also demonstrated that, based on a state-of-the-art triplet network, the content encoder can learn meaningful content representations from the collected parallel dataset, which has no manually labeled annotations of music content. Moreover, since producing such a parallel dataset still takes a great deal of effort, and to broaden the application scenarios of the proposed method, we also demonstrated that our model can achieve many-to-many style transfer by training in a semi-supervised manner on a smaller parallel dataset.
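
To illustrate how a parallel dataset can supervise the content encoder without manual content labels, the sketch below applies a standard triplet margin loss: the same melody rendered by two different instruments forms the anchor/positive pair, and a different melody is the negative. The margin value, the mean-pooling of frame-wise codes, and the helper names are illustrative assumptions rather than the thesis's exact triplet setup (Section 4.3.2).

```python
# Hedged sketch of triplet supervision for the content encoder.
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

def content_embedding(content_encoder, spec):
    """Pool frame-wise content codes into one vector per clip."""
    return content_encoder(spec).mean(dim=2)      # (batch, content_dim)

def content_triplet_loss(content_encoder, melody_a_piano, melody_a_guitar, melody_b_piano):
    """Same melody on different instruments should map to nearby content codes."""
    anchor = content_embedding(content_encoder, melody_a_piano)
    positive = content_embedding(content_encoder, melody_a_guitar)   # same melody, other timbre
    negative = content_embedding(content_encoder, melody_b_piano)    # different melody
    return triplet_loss(anchor, positive, negative)

# Example with the ContentEncoder sketched earlier and random spectrograms:
# enc = ContentEncoder()
# loss = content_triplet_loss(enc, torch.randn(4, 80, 256),
#                             torch.randn(4, 80, 256), torch.randn(4, 80, 256))
# loss.backward()
```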

Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Related Work
  2.1 Image Style Transfer
  2.2 Music Style Transfer
  2.3 Voice Conversion
Chapter 3 Data Collection
  3.1 The Tools
  3.2 Data Collection Process
Chapter 4 Methodology
  4.1 Overview
  4.2 Data Processing
  4.3 Pretrained Encoder
    4.3.1 Approach
    4.3.2 Content Encoder
    4.3.3 Style Encoder
  4.4 Autoencoder
    4.4.1 Approach
    4.4.2 Architecture
  4.5 Signal Reconstruction
Chapter 5 Experimental Results
  5.1 Pretrained Encoder
    5.1.1 Content Encoder
    5.1.2 Style Encoder
  5.2 Objective Evaluation
    5.2.1 Content Preservation
    5.2.2 Style Transfer
  5.3 Subjective Evaluation
  5.4 Discussion
  5.5 Semi-supervised with Novel Melodies
  5.6 Ablation Study
    5.6.1 Pre-trained Content Encoder
    5.6.2 Number of Target Input Segments
Chapter 6 Conclusion
References

