
Author: Chen, Kuan-Yu (陳冠友)
Title: LAD: Layer-wise Adaptive Distillation for BERT Model Compression
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Graduate Program of Artificial Intelligence
Year of Publication: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 55
Keywords: Model Compression, Knowledge Distillation, Natural Language Processing
  • In recent years, large-scale pre-trained language models such as BERT have helped researchers achieve state-of-the-art performance on many natural language processing tasks.
    However, these pre-trained models usually contain an enormous number of parameters, which makes them difficult to deploy on devices with limited computational resources.
    This obstacle can keep many natural language processing applications from reaching everyday use.
    To address this problem, we propose a novel downstream-task-oriented knowledge distillation method called LAD,
    which uses knowledge distillation to compress a BERT model to a target size.
    Previous studies have confirmed that distilling the internal knowledge of every layer of the BERT teacher helps improve the capability of the small student model.
    However, because the teacher and the student differ in their number of layers, the internal knowledge of some teacher layers is inevitably ignored.
    To preserve the teacher's complete layer-wise internal knowledge during distillation, we therefore design a novel learning objective.
    This objective lets the student, during training, adaptively select which teacher layers' internal knowledge to learn for the task at hand in order to reach the best performance.
    Our results show that the proposed method, LAD, outperforms previous work on the GLUE benchmark with both six-layer and four-layer student models.

    Large-scale pre-trained language models such as BERT have helped researchers obtain state-of-the-art performance on numerous natural language processing tasks. However, their considerable number of parameters makes it difficult to deploy the models on low-resource devices, hampering real-world natural language processing applications.
    In this work, we present LAD (Layer-wise Adaptive Distillation),
    a task-specific distillation framework that can be used to reduce the model size of BERT.
    Previous studies have shown that distilling intermediate layers increases the performance of the lightweight student model. However, because the teacher and student models have different numbers of layers, the knowledge of specific teacher layers may be ignored.
    To retain the complete knowledge in each layer of the teacher model, we design an iterative aggregation mechanism with multiple gate blocks in LAD to adaptively distill layer-wise internal knowledge from the teacher model to the student model.
    The proposed method enables an effective knowledge transfer for a student model without skipping any teacher layers.
    The experimental results show that both six-layer and four-layer LAD student models outperform previous task-specific distillation approaches on the GLUE tasks.
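
    To make the gated, layer-wise aggregation concrete, below is a minimal PyTorch-style sketch of one way such a mechanism could be wired up. It is illustrative only: the names (GateBlock, aggregate_teacher_layers), the contiguous grouping of teacher layers, the MSE layer-matching loss, and the loss weights are assumptions made for exposition, not the implementation or hyper-parameters described in the thesis.

        # Illustrative sketch only (not the thesis code): gate blocks blend teacher
        # hidden states into aggregated targets for a shallower student model.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GateBlock(nn.Module):
            """Learns a scalar gate that blends a running aggregate with the next teacher layer."""
            def __init__(self, hidden_size: int):
                super().__init__()
                self.gate = nn.Linear(2 * hidden_size, 1)

            def forward(self, aggregated, teacher_layer):
                # aggregated, teacher_layer: (batch, seq_len, hidden)
                g = torch.sigmoid(self.gate(torch.cat([aggregated, teacher_layer], dim=-1)))
                return g * teacher_layer + (1.0 - g) * aggregated

        def aggregate_teacher_layers(teacher_hiddens, gate_blocks, num_student_layers):
            """Fold every teacher layer into one of `num_student_layers` targets,
            so no teacher layer is skipped (a contiguous, even grouping is assumed here)."""
            per_target = len(teacher_hiddens) // num_student_layers
            targets = []
            for s in range(num_student_layers):
                chunk = teacher_hiddens[s * per_target:(s + 1) * per_target]
                agg = chunk[0]
                for h in chunk[1:]:
                    agg = gate_blocks[s](agg, h)  # the gate decides how much of each layer to keep
                targets.append(agg)
            return targets

        # Usage: a 12-layer teacher distilled into a 6-layer student (hidden size 768).
        hidden = 768
        teacher_hiddens = [torch.randn(2, 16, hidden) for _ in range(12)]
        gates = nn.ModuleList([GateBlock(hidden) for _ in range(6)])
        targets = aggregate_teacher_layers(teacher_hiddens, gates, num_student_layers=6)

        # Layer-wise term: match each student layer to its aggregated teacher target.
        student_hiddens = [torch.randn(2, 16, hidden) for _ in range(6)]
        pt_loss = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_hiddens, targets))

        # A typical task-specific distillation objective then combines cross-entropy on the
        # gold labels, a soft-logit distillation term, and the layer-wise term above.
        # The temperature T and the weights alpha/beta are placeholders.
        labels = torch.randint(0, 2, (2,))
        student_logits, teacher_logits = torch.randn(2, 2), torch.randn(2, 2)
        T, alpha, beta = 2.0, 0.5, 1.0
        ce_loss = F.cross_entropy(student_logits, labels)
        ds_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                           F.softmax(teacher_logits / T, dim=-1),
                           reduction="batchmean") * T * T
        total_loss = (1 - alpha) * ce_loss + alpha * ds_loss + beta * pt_loss

    In this sketch every teacher layer contributes to some aggregated target, which mirrors the stated goal of not skipping any teacher layers; the actual grouping, gating, and loss weighting used by LAD are described in Chapter 3 of the thesis.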

    摘要 (Chinese Abstract)
    Abstract
    誌謝 (Acknowledgements)
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
      1.1. Pre-training and Fine-tuning Paradigm in NLP
        1.1.1. Word2Vec: the origin of the pre-trained word embeddings
        1.1.2. ELMo: a new type of deep contextualized word representations
        1.1.3. BERT: state-of-the-art pre-trained language models
      1.2. Recent states of pre-trained language models: battle of the parameters
      1.3. Knowledge distillation
        1.3.1. Knowledge distillation on BERT
      1.4. Drawbacks of current task-specific distillation methods of BERT
      1.5. Motivation and contribution
    Chapter 2. Related Work
      2.1. Pre-trained language model: BERT
        2.1.1. The pre-training procedure of BERT
        2.1.2. Fine-tune BERT to a specific downstream task
      2.2. Task-specific Distillation of BERT: Patient Knowledge Distillation
        2.2.1. DS Loss
        2.2.2. CE Loss
        2.2.3. PT Loss
      2.3. ALP-KD: Attention-Based Layer Projection for Knowledge Distillation
        2.3.1. Mechanism of ALP-KD
        2.3.2. Training objectives of ALP-KD
    Chapter 3. Methodology
      3.1. Problem Definition
      3.2. Gate Block
      3.3. Layer-wise Adaptive Distillation
      3.4. Transfer the teacher model's prediction ability
      3.5. Learn how to solve the specific downstream task
      3.6. Training Objectives
    Chapter 4. Experiments
      4.1. Datasets
        4.1.1. MNLI: Multi-Genre Natural Language Inference
        4.1.2. QNLI
        4.1.3. QQP: Quora Question Pairs
        4.1.4. SST-2: Stanford Sentiment Treebank
        4.1.5. MRPC: Microsoft Research Paraphrase Corpus
        4.1.6. RTE: Recognizing Textual Entailment
      4.2. Teacher Model
      4.3. Baselines and Implementation Details
      4.4. Results on GLUE test sets
      4.5. Results on GLUE development sets
      4.6. Similarity Analysis of the Teacher and the Student
    Chapter 5. Analysis
      5.1. How much aggregated knowledge should we distill?
      5.2. Which type of internal knowledge of the teacher model should we distill?
      5.3. Why should the student model not learn the knowledge from higher layers of the teacher model?
      5.4. Comparison between Gate Network and Attention mechanism
      5.5. Ablation study on the Gate Network
    Chapter 6. Conclusion
    References
    Chapter 7. Appendix
      7.1. Re-implementation of PKD
      7.2. Hyper-parameters of our implemented models


    Full-text availability: On campus: 2022-11-01; Off campus: 2022-11-01