
Graduate Student: Lin, Szu-Tung (林思彤)
Title (Chinese): Transformer 架構的目標函數用於語意理解任務訓練
Title (English): T3: Learning to Train Transformer by Transformer for Natural Language Understanding
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master's
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Academic Year of Graduation: 110 (2021–2022)
Language: English
Number of Pages: 44
Keywords (Chinese): 自然語言處理、目標函數、元學習
Keywords (English): Natural Language Processing, Objective Function, Meta-learning
  • The objective function is the foundation of deep learning algorithms: it computes the difference between the model's predictions and the dataset's ground-truth answers, serving as the basis for evaluating and optimizing the model's performance on a task. However, choosing an objective function requires prior knowledge of the task and the model, which complicates the design of machine learning architectures and training procedures. Previous work has coupled the choice of objective function with hyper-parameter design; although such approaches can be optimized for different tasks and models, the hyper-parameter search is computationally expensive in practice. Recently, a dynamic objective function has been proposed to address this problem: by introducing a parameterized objective function, the objective can be trained directly from the dataset with gradient descent to obtain the objective best suited to the task. This thesis extends the line of work on dynamic objective functions. We propose a Transformer-based dynamic objective function (T3) and meta-train it on the RoBERTa model, creating an objective function derived entirely from data. Experimental results show that our method strengthens the performance of large pre-trained language models on downstream tasks: T3 matches or exceeds the best objective function found for each task. On topic classification and natural language understanding tasks, our method obtains roughly a 1% score improvement over the cross-entropy objective.

    The loss function is fundamental in deep learning. It measures the difference between the model's predictions and the ground truth and produces a continuous value used to optimize the model. The choice of loss function requires prior knowledge of the data and the model, and can therefore be suboptimal when designing a machine learning system. Previous work has proposed several objectives that are tailored to specific tasks by introducing hyper-parameters into the objective function. However, searching over both objectives and hyper-parameters is computationally expensive. Recently, a dynamic objective function has been proposed to address this problem by creating a parameterized objective that is trained directly from data with gradient descent.
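    As an illustration of this idea, the sketch below shows a parameterized objective trained from data by gradient descent, assuming PyTorch 2.x (for torch.func.functional_call). The LearnedLoss module, its small MLP form, and the single unrolled inner step are illustrative assumptions for a generic classifier, not the formulation used in this thesis, where the target model is RoBERTa and the outer signal is a task performance measure.

# Illustrative sketch (not the thesis's implementation): a learned, parameterized objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedLoss(nn.Module):
    """Parameterized objective: maps (predictions, labels) to a scalar loss via a small MLP."""
    def __init__(self, num_classes: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Softplus(),  # keeps the learned loss non-negative
        )

    def forward(self, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        one_hot = F.one_hot(labels, logits.size(-1)).float()
        per_example = self.net(torch.cat([logits.softmax(-1), one_hot], dim=-1))
        return per_example.mean()

def meta_step(model, learned_loss, meta_opt, x, y, inner_lr=1e-2):
    """One outer update of the loss parameters, unrolling a single inner step of the target model."""
    params = dict(model.named_parameters())
    # Inner step: update the target model with the learned loss (kept differentiable).
    inner_loss = learned_loss(torch.func.functional_call(model, params, (x,)), y)
    grads = torch.autograd.grad(inner_loss, tuple(params.values()), create_graph=True)
    updated = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
    # Outer step: score the updated model with an ordinary task objective and
    # back-propagate through the inner update into the learned-loss parameters.
    outer_loss = F.cross_entropy(torch.func.functional_call(model, updated, (x,)), y)
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
    return outer_loss.item()

# Toy usage: a linear classifier on random data stands in for the target model.
model = nn.Linear(16, 4)
loss_fn = LearnedLoss(num_classes=4)
meta_opt = torch.optim.Adam(loss_fn.parameters(), lr=1e-3)
meta_step(model, loss_fn, meta_opt, torch.randn(8, 16), torch.randint(0, 4, (8,)))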
    In this work, we propose a Transformer-based dynamic objective function (T3) that is meta-trained on the RoBERTa model, creating an objective function that is both data-driven and pre-trained-model-driven. Empirical results show that fine-tuning a large pre-trained NLP model benefits from the proposed objective, achieving performance similar to or better than the best baseline objective function chosen for each task. On topic classification and NLU tasks, our method yields an overall 1% score improvement over the cross-entropy baseline.
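    The sketch below, also in PyTorch, illustrates what a Transformer-based objective of this kind can look like: each example in a batch becomes one token built from the predicted distribution, the one-hot label, and a training-status scalar; the tokens attend to one another through a Transformer encoder; and a pooled representation is mapped to a scalar loss. The embedding scheme, layer sizes, and mean pooling are illustrative assumptions, not the T3 configuration reported in the thesis.

# Illustrative sketch (not the thesis's implementation): a batch-level Transformer objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerObjective(nn.Module):
    """Self-attention over the examples in a batch, pooled to a single scalar loss."""
    def __init__(self, num_classes: int, d_model: int = 64, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # One token per example: predicted distribution + one-hot label + a training-status scalar.
        self.embed = nn.Linear(2 * num_classes + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Softplus())

    def forward(self, logits, labels, status):
        one_hot = F.one_hot(labels, logits.size(-1)).float()
        status = status.expand(logits.size(0), 1)               # broadcast status to every example
        tokens = torch.cat([logits.softmax(-1), one_hot, status], dim=-1)
        hidden = self.encoder(self.embed(tokens).unsqueeze(0))  # treat the batch as one sequence
        return self.head(hidden.mean(dim=1)).squeeze()          # pool over examples -> scalar loss

# Example: a differentiable scalar loss over a batch of 8 predictions for a 4-class task.
loss_fn = TransformerObjective(num_classes=4)
logits = torch.randn(8, 4, requires_grad=True)
labels = torch.randint(0, 4, (8,))
loss = loss_fn(logits, labels, status=torch.tensor([[0.1]]))    # e.g. 10% of training elapsed
loss.backward()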

    Front matter: Chinese Abstract, Abstract, Acknowledgements, Table of Contents, List of Tables, List of Figures
    Chapter 1. Introduction
        1.1 Selection of objective function in machine learning
        1.2 Large scale pre-trained Language Model
        1.3 Motivation
        1.4 Our Work
    Chapter 2. Related Work
        2.1 Learning to Learn: Meta learning
        2.2 Dynamic Objective
        2.3 Neural Objective
        2.4 Summary of the related work
    Chapter 3. Methodology
        3.1 Problem setup
        3.2 How to model the parameterized dynamic objective function LΦ?
            3.2.1 Transformer
            3.2.2 Model Overview
            3.2.3 Example Encoder
            3.2.4 Status Encoder
            3.2.5 Batch Encoder
            3.2.6 Pooling Layer
        3.3 How to optimize the parameter Φ?
            3.3.1 Meta-Training
            3.3.2 Meta-Testing
            3.3.3 Selection of Performance Measuring Function
    Chapter 4. Experiment
        4.1 Dataset
            4.1.1 Topic Classification Dataset
            4.1.2 Natural Language Understanding Dataset
        4.2 Model Configuration and Experimental Setup
            4.2.1 Target Model
            4.2.2 T3 configuration
            4.2.3 Training status
            4.2.4 Training Details
        4.3 The performance of T3
            4.3.1 Learned T3 as objective for the target model
            4.3.2 The impact of target model stability when using T3
        4.4 Generalizability across datasets
        4.5 Fine-tuning on the target dataset to improve performance
        4.6 Fine-tuned for new metrics
    Chapter 5. Analysis
        5.1 Ablation Study
            5.1.1 Training status
            5.1.2 Batch Self Attention
        5.2 Fine-tuning vs. training from scratch
        5.3 Performance under different training configurations
    Chapter 6. Discussion and Conclusion
    References

    Full text available: on campus 2023-09-26; off campus 2023-09-26