| Field | Value |
|---|---|
| Graduate Student | 林思彤 (Lin, Szu-Tung) |
| Thesis Title | Transformer 架構的目標函數用於語意理解任務訓練 / T3: Learning to Train Transformer by Transformer for Natural Language Understanding |
| Advisor | 高宏宇 (Kao, Hung-Yu) |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2022 |
| Graduation Academic Year | 110 |
| Language | English |
| Number of Pages | 44 |
| Keywords (Chinese) | 自然語言處理, 目標函數, 元學習 |
| Keywords (English) | Natural Language Processing, Objective Function, Meta-learning |
The objective function is the foundation of deep learning algorithms: it measures the difference between a model's predictions and the ground-truth answers in a dataset, and serves as the basis for evaluating and optimizing the model's performance on a task. However, choosing an objective function requires prior knowledge of the task and the model, which complicates the design of model architectures and training procedures. Previous work has combined the choice of objective function with hyper-parameter design; although such approaches can be optimized for different tasks and models, the hyper-parameter search is computationally expensive in practice. Recently, a dynamic objective function has been proposed to address this problem: after parameterizing the objective, it can be trained directly from the dataset with gradient-descent algorithms to obtain the objective best suited to the task. This thesis continues that line of work on dynamic objective functions. We propose a Transformer-based dynamic objective function (T3) and use it for meta-learning-based training on the RoBERTa model, creating an objective function computed entirely from data. Experimental results show that our method strengthens the performance of large pre-trained language models on downstream tasks: T3 reaches performance close to or exceeding the best objective function found for each individual task, and on topic classification and natural language understanding tasks our method gains roughly a 1% score improvement over the cross-entropy objective.
The loss function is fundamental in deep learning. It measures the difference between model predictions and the ground truth and produces a continuous value for model optimization. The choice of loss function requires prior knowledge of the data and the model, which can be suboptimal when designing a machine learning model. Previous work has proposed several objectives that are optimized for specific tasks by introducing hyper-parameters into the objective function. However, searching over both objectives and hyper-parameters is computationally expensive. Recently, a dynamic objective function has been proposed to address this problem by creating a parameterized objective that is trained directly from data with gradient descent algorithms.
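To make the idea of a parameterized objective concrete, the following is a minimal sketch (an illustration only, not the thesis implementation): a small trainable module `LearnedLoss` maps per-example predictions and labels to a scalar loss, so the loss itself can be updated by gradient descent. The feature choice (softmax probabilities concatenated with one-hot labels) and the softplus output are assumptions made for the sketch; T3 itself is Transformer-based, and the simple feed-forward network here merely stands in for that component.

```python
# Minimal sketch of a parameterized ("learned") objective, assuming a simple
# feed-forward phi; all names here are illustrative, not the thesis code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnedLoss(nn.Module):
    """Trainable objective L_phi(logits, labels) -> scalar."""

    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        # phi maps per-example features [softmax probs ; one-hot label] to a score.
        self.net = nn.Sequential(
            nn.Linear(2 * num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        probs = logits.softmax(dim=-1)                                # (B, C)
        one_hot = F.one_hot(labels, logits.size(-1)).float()          # (B, C)
        per_example = self.net(torch.cat([probs, one_hot], dim=-1))   # (B, 1)
        # softplus keeps the learned loss non-negative; mean reduces over the batch
        return F.softplus(per_example).mean()
```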
In this work, we propose a Transformer-based dynamic objective function (T3) that is meta-trained on the RoBERTa model, creating an objective function that is both data-driven and driven by a pre-trained model. Empirical results show that fine-tuning a large pre-trained NLP model can benefit from our proposed objective, achieving performance similar to or better than the best baseline objective function chosen for each task. On topic classification and NLU tasks, our method achieves an overall score improvement of about 1% over cross-entropy baselines.
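As a rough illustration of how such an objective could be meta-trained, the sketch below shows a single bilevel step: an inner update of the task model with the learned loss, followed by an outer cross-entropy evaluation whose gradient flows back into the loss parameters. The one-step inner update, the placeholder names `model`, `learned_loss`, `x`, `y`, and the use of `torch.func.functional_call` are all assumptions for this sketch, not the thesis procedure.

```python
# Minimal sketch of one bilevel meta-training step for a learned objective.
# `model` is any classifier returning logits (e.g. a RoBERTa-based head);
# `learned_loss` is a trainable objective such as the LearnedLoss sketch above.
import torch
import torch.nn.functional as F
from torch.func import functional_call


def meta_step(model, learned_loss, x, y, inner_lr: float = 1e-3):
    params = dict(model.named_parameters())

    # Inner step: update the task model with the *learned* objective,
    # keeping the graph so gradients can flow back into the loss parameters.
    logits = functional_call(model, params, (x,))
    inner_loss = learned_loss(logits, y)
    grads = torch.autograd.grad(
        inner_loss, list(params.values()), create_graph=True, allow_unused=True
    )
    updated = {
        name: p if g is None else p - inner_lr * g
        for (name, p), g in zip(params.items(), grads)
    }

    # Outer step: evaluate the updated model with an ordinary task loss
    # (cross-entropy here); its gradient reaches learned_loss through `updated`.
    outer_logits = functional_call(model, updated, (x,))
    return F.cross_entropy(outer_logits, y)
```

Calling `.backward()` on the returned value and stepping an optimizer over `learned_loss.parameters()` would complete one meta-update under these assumptions; the task model is then fine-tuned as usual with the resulting objective.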