
Graduate Student: 陳均嫚 (Chen, Jiun-Man)
Thesis Title: 基於 Transformer 自然語言模型混合剪枝和知識蒸餾的壓縮方法
Hybrid Pruning and Knowledge Distillation: Effective Compression Methods for Transformer-based Natural Language Models
Advisor: 高宏宇 (Kao, Hung-Yu)
Degree: Master
Department: 電機資訊學院 - 資訊工程學系 (Department of Computer Science and Information Engineering)
Year of Publication: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 46
Chinese Keywords: 模型壓縮、自然語言處理、剪枝、知識蒸餾
Foreign Keywords: Model Compression, Natural Language Processing, Pruning, Knowledge Distillation
  • Pre-trained Transformer-based models have significantly improved performance on a wide range of natural language processing (NLP) tasks. Despite this strong performance, the enormous parameter counts of pre-trained language models make it challenging to deploy them on resource-constrained edge devices under strict latency and memory requirements, which hinders their use in real-world settings. To address this difficulty, we propose a hybrid compression method that combines knowledge distillation and pruning to compress the language model BERT. Both techniques share the same goal: preserving model performance while reducing the number of parameters. By integrating pruning and knowledge distillation, we can take full advantage of the strengths of each method, and our experiments show that applying them together outperforms applying either one alone. The proposed hybrid compression method successfully unifies these techniques, shrinking the model's parameter count by a factor of 10 while retaining more than 96.5% of the original BERT model's performance on the GLUE test set, achieving efficient model compression.

    Pre-trained Transformer-based models have remarkably improved the performance of various natural language processing (NLP) tasks. Despite this impressive performance, their large parameter counts make it challenging to deploy them on resource-constrained edge devices with strict latency and memory requirements, and the resulting computational and memory demands hinder their practical use in real-world scenarios. To tackle this challenge, we propose an integrated approach that combines knowledge distillation and pruning to compress the language model BERT. Both techniques share the same goal: preserving the model's performance while making it more compact. By integrating pruning and knowledge distillation, we can harness the advantages of each method, and our experiments demonstrate that their integration outperforms using either technique alone. The proposed hybrid approach effectively combines these techniques, yielding a 10x smaller model that retains over 96.5% of the performance of the densely trained teacher model on the GLUE benchmark, enabling efficient compression of the model.
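    The record itself contains no code, but the combined procedure described above can be illustrated with a minimal PyTorch-style sketch that interleaves magnitude pruning with a distillation loss in a single training step. This is a sketch under assumptions, not the thesis's implementation: the helper prune_by_magnitude, the hyperparameters alpha and temperature, and the HuggingFace-style .logits outputs are all illustrative choices.

```python
# Minimal sketch: one hybrid compression step combining knowledge distillation
# (soft KL loss + hard cross-entropy loss) with magnitude pruning.
# All names and hyperparameters are illustrative assumptions, not the
# implementation described in the thesis.
import torch
import torch.nn.functional as F


def prune_by_magnitude(model: torch.nn.Module, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights of every Linear layer."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            weights = module.weight.data
            k = int(weights.numel() * sparsity)
            if k == 0:
                continue
            threshold = weights.abs().flatten().kthvalue(k).values
            weights.mul_(weights.abs() > threshold)


def hybrid_compression_step(student, teacher, batch, labels, optimizer,
                            sparsity, alpha=0.5, temperature=2.0):
    """Distillation loss + hard-label loss, followed by re-applying pruning."""
    student.train()
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # assumes HF-style outputs
    student_logits = student(**batch).logits

    # Soft loss: KL divergence between temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard loss: cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * soft_loss + (1.0 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Iterative pruning: zero out the smallest weights at the current sparsity.
    prune_by_magnitude(student, sparsity)
    return loss.item()
```

    The thesis additionally distills embedding and hidden-state representations with layer-wise non-uniform weighting factors and raises the sparsity target gradually through iterative pruning (Chapter 3); the sketch above omits both for brevity.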

    摘要 (Chinese Abstract)
    Abstract
    誌謝 (Acknowledgements)
    Table of Content
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Background - Era of Foundation Models
      1.2 Bidirectional Encoder Representations from Transformers (BERT)
      1.3 Challenges of Transfer Learning
      1.4 Motivation
      1.5 Our Approach
    Chapter 2  Related Works
      2.1 Knowledge Distillation
        2.1.1 Depth Compression: Patient Knowledge Distillation
        2.1.2 Width Compression: MobileBERT
      2.2 Pruning
      2.3 Hybrid Compression
      2.4 Two-stage and One-stage Hybrid Compression
    Chapter 3  Methodology
      3.1 Sparse Model Compression
        3.1.1 Iterative Pruning: A Progressive Approach
      3.2 Knowledge Distillation: Leveraging Dark Knowledge from the Teacher Model
        3.2.1 Soft Loss (KL Divergence)
        3.2.2 Hard Loss (Cross-Entropy)
        3.2.3 Embedding Loss
        3.2.4 Hidden States Loss
        3.2.5 Layer-wise Non-Uniform Weighting Factors
        3.2.6 Total Loss for Knowledge Distillation with Layer Importance
    Chapter 4  Experiments
      4.1 Datasets
      4.2 Baseline
      4.3 Experimental Details
      4.4 Results on the GLUE Benchmark test dataset
      4.5 Results on the GLUE Benchmark dev dataset
    Chapter 5  Analysis
      5.1 Integration Beats Individuality: Harnessing the Power of Knowledge Distillation and Pruning
      5.2 Exploring the Efficacy of Two-Stage and One-Stage Hybrid Compression
        5.2.1 Two-Stage Compression: Knowledge Distillation Followed by Pruning
        5.2.2 Two-Stage Compression: Pruning Followed by Knowledge Distillation
        5.2.3 One-Stage Compression: Our Hybrid Compression Method
      5.3 Analyzing Weight Sparsity Variation Across Layer Subunits in BERT Models Across GLUE Benchmarks
      5.4 Ablation Study on Influence of Layer-wise Non-Uniform Weighting Factor
    Chapter 6  Conclusion
    Reference
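    Chapter 3 lists the individual distillation terms (soft KL loss, hard cross-entropy loss, embedding loss, hidden-state loss) together with layer-wise non-uniform weighting factors. As a hedged sketch of how such components are typically combined (the weights \(\alpha\), \(\beta\), \(\gamma\), and \(w_l\) below are assumed symbols; the exact formulation is the one defined in Section 3.2.6 of the thesis), the total objective can be written as

    \[
    \mathcal{L}_{\text{total}} = \alpha\,\mathcal{L}_{\text{soft}} + \beta\,\mathcal{L}_{\text{hard}} + \gamma\,\mathcal{L}_{\text{emb}} + \sum_{l=1}^{L} w_l\,\mathcal{L}_{\text{hidden}}^{(l)},
    \]

    where \(\mathcal{L}_{\text{soft}}\) is the KL divergence between the temperature-scaled teacher and student output distributions, \(\mathcal{L}_{\text{hard}}\) is the cross-entropy with the ground-truth labels, \(\mathcal{L}_{\text{emb}}\) and \(\mathcal{L}_{\text{hidden}}^{(l)}\) align the embedding outputs and the layer-\(l\) hidden states of student and teacher, and \(w_l\) are the layer-wise non-uniform weighting factors.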

    [1] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv preprint arXiv:1804.07461. 2019.
    [2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. “Attention Is All You Need.” Advances in neural information processing systems, 30. 2017.
    [3] Tom Brown, et al. “Language Models are Few-Shot Learners.” Advances in neural information processing systems, 33, 1877-1901. 2020.
    [4] Chenxing Li, Lei Zhu, Shuang Xu, Peng Gao, Bo Xu. “Compression of Acoustic Model via Knowledge Distillation and Pruning.” In 2018 24th International conference on pattern recognition (ICPR) (pp. 2785-2790). IEEE. 2018.
    [5] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov. “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.” arXiv preprint arXiv:1905.09418. 2019.
    [6] Geoffrey Hinton, Oriol Vinyals, Jeff Dean. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531. 2015.
    [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL-HLT. 2019.
    [8] Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Pablo Villalobos. “Compute Trends Across Three Eras of Machine Learning.” In 2022 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE. 2022.
    [9] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus. “Emergent Abilities of Large Language Models.” arXiv preprint arXiv:2206.07682. 2022.
    [10] Jianping Gou, Baosheng Yu, Stephen John Maybank, Dacheng Tao. “Knowledge Distillation: A Survey.” International Journal of Computer Vision, 129, 1789-1819. 2021.
    [11] Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun. “Dynamic Knowledge Distillation for Pre-trained Language Models.” arXiv preprint arXiv:2109.11295. 2021.
    [12] Liyang Chen, Yongquan Chen, Juntong Xi, Xinyi Le. “Knowledge from the original network: restore a better pruned network with knowledge distillation.” Complex & Intelligent Systems, 1-10. 2021.
    [13] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu. “DynaBERT: Dynamic BERT with Adaptive Width and Depth.” Advances in Neural Information Processing Systems, 33, 9782-9793. 2020.
    [14] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365. 2018.
    [15] Michael Zhu, Suyog Gupta. “To prune, or not to prune: exploring the efficacy of pruning for model compression.” arXiv preprint arXiv:1710.01878. 2017.
    [16] Mitchell A. Gordon, Kevin Duh, Nicholas Andrews. “Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning.” arXiv preprint arXiv:2002.08307. 2020.
    [17] Nima Aghli, Eraldo Ribeiro. “Combining Weight Pruning and Knowledge Distillation for CNN Compression.” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3191-3198). 2021.
    [18] Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett. “Compressing Large-Scale Transformer-Based Models: A Case Study on BERT.” Transactions of the Association for Computational Linguistics, 9, 1061-1080. 2021.
    [19] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin. “Distilling Task-Specific Knowledge from BERT into Simple Neural Networks.” arXiv preprint arXiv:1903.12136. 2019.
    [20] Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu. “Patient Knowledge Distillation for BERT Model Compression.” arXiv preprint arXiv:1908.09355. 2019.
    [21] Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul N. Whatmough, Alexander M. Rush, David Brooks, Gu-Yeon Wei. “EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference.” In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 830-844). 2021.
    [22] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” arXiv preprint arXiv:1910.01108. 2019.
    [23] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju. “FastBERT: a Self-distilling BERT with Adaptive Inference Time.” arXiv preprint arXiv:2004.02178. 2020.
    [24] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. “TinyBERT: Distilling BERT for Natural Language Understanding.” arXiv preprint arXiv:1909.10351. 2019.
    [25] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, Song Han. “AMC: AutoML for Model Compression and Acceleration on Mobile Devices.” In Proceedings of the European conference on computer vision (ECCV) (pp. 784-800). 2018.
    [26] Yongjie Lin, Yi Chern Tan, Robert Frank. “Open Sesame: Getting Inside BERT's Linguistic Knowledge.” arXiv preprint arXiv:1906.01698. 2019.
    [27] Yuanxin Liu, Fandong Meng, Zheng Lin, Weiping Wang, Jie Zhou. “Marginal utility diminishes: Exploring the minimum knowledge for BERT knowledge distillation.” arXiv preprint arXiv:2106.05691. 2021.
    [28] Yuanxin Liu, Zheng Lin, Fengcheng Yuan. “ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques.” In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 10, pp. 8715-8722). 2021.
    [29] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler. “Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books.” In Proceedings of the IEEE international conference on computer vision (pp. 19-27). 2015.
    [30] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou. “MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices.” arXiv preprint arXiv:2004.02984. 2020.

    Full text publicly available: on campus from 2024-08-31; off campus from 2024-08-31.