
Graduate Student: David Eduardo Hernandez Machado (洪途慰)
Thesis Title: Enhancing Model Compression via Knowledge Distillation and Integrated Gradients: A Deep Learning Approach (透過知識蒸餾和整合梯度增強模型壓縮:深度學習方法)
Advisor: Torbjörn Nordling (吳馬丁)
Degree: Master
Department: Department of Mechanical Engineering, College of Engineering
Year of Publication: 2025
Academic Year of Graduation: 113
Language: English
Number of Pages: 120
Keywords: Deep Learning, Knowledge Distillation, Model Compression, Integrated Gradients, MobileNetV2, CIFAR-10, Neural Networks, Resource Efficiency, Statistical Analysis, Monte Carlo Simulation

Abstract:
    Machine learning and deep learning techniques have revolutionised our ability to handle complex computational tasks across various domains by effectively managing large datasets, automating decision-making processes, and refining predictive accuracy over time. While these technological advances have yielded impressive performance gains, they have come at the cost of increasingly complex and computationally intensive models, hindering deployment on resource-constrained devices. Model compression addresses this challenge, with knowledge distillation emerging as an effective technique that transfers knowledge from a larger teacher model to a smaller student model. While conventional knowledge distillation primarily transfers output-level knowledge through softened probability distributions, recent approaches have incorporated attention transfer to align intermediate feature representations. However, these techniques only leverage forward-pass information and neglect the gradient-based decision processes that influence model predictions. This thesis investigates enhancing knowledge distillation with integrated gradients, a gradient-based model interpretability technique that attributes importance to input features.
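
    As a concrete illustration of the output-level transfer described above, the following is a minimal PyTorch sketch of a common soft-target distillation loss, combining a temperature-softened KL-divergence term with ordinary cross-entropy. The temperature T, the weighting alpha, and the function name are generic assumptions rather than values or code taken from this thesis.

        import torch
        import torch.nn.functional as F

        def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
            # Softened probability distributions (temperature T) carry the teacher's output-level knowledge.
            soft_targets = F.softmax(teacher_logits / T, dim=1)
            log_student = F.log_softmax(student_logits / T, dim=1)
            # KL divergence between softened distributions, scaled by T^2 as is conventional.
            kd_term = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
            # Standard cross-entropy against the ground-truth labels.
            ce_term = F.cross_entropy(student_logits, labels)
            return alpha * kd_term + (1.0 - alpha) * ce_term
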
    The methodology implements knowledge distillation to transfer knowledge from a teacher model to a smaller student model, while incorporating integrated gradients as a data augmentation technique. Integrated gradients attribution maps are pre-computed from the teacher model and overlaid onto training images with probability p, highlighting regions the teacher model deems important. This pre-computation strategy transforms the computational burden into a one-time preprocessing step rather than a runtime cost. The framework was evaluated on CIFAR-10 using MobileNetV2 as the teacher architecture.
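
    The two ingredients of this pipeline can be sketched in PyTorch as follows: a Riemann-sum approximation of integrated gradients computed once per training image from the teacher, and an overlay step applied with probability p during augmentation. The black-image baseline, the number of interpolation steps, the blending strength, and the function names are illustrative assumptions, not the exact implementation, which is detailed in Appendix A.1.3 of the thesis.

        import torch

        def integrated_gradients(teacher, image, target, baseline=None, steps=50):
            # Approximates IG_i(x) = (x_i - x'_i) * integral of dF/dx_i along the straight path
            # from the baseline x' to the input x; the teacher is assumed to be in eval() mode.
            if baseline is None:
                baseline = torch.zeros_like(image)              # assumed black-image baseline
            accumulated = torch.zeros_like(image)
            for alpha in torch.linspace(0.0, 1.0, steps):
                point = (baseline + alpha * (image - baseline)).requires_grad_(True)
                score = teacher(point.unsqueeze(0))[0, target]  # class score at the interpolated input
                grad, = torch.autograd.grad(score, point)
                accumulated += grad
            return (image - baseline) * accumulated / steps     # attribution map, same shape as image

        def overlay_attribution(image, attribution, p=0.5, strength=0.5):
            # With probability p, blend the normalised attribution map into the training image.
            if torch.rand(1).item() < p:
                attr = attribution.abs()
                attr = (attr - attr.min()) / (attr.max() - attr.min() + 1e-8)
                image = (1.0 - strength) * image + strength * attr
            return image

    In practice the attribution maps would be computed offline over the whole training set and stored, so that only the cheap overlay step runs inside the data loader.
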
    This study demonstrates that by integrating knowledge distillation with integrated gradients, the student model maintains 98.6% of the teacher model's performance (92.5% vs 93.9% accuracy) despite a 4.1× reduction in parameters. This parameter reduction translates to a 10.8× reduction in inference time, from 140 ms to 13 ms per batch. Experiments across compression factors (2.2× to 1122×) reveal distinct operational regimes, with the approach performing well in the moderate compression range (2× to 12×) relevant for edge deployment.
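
    Efficiency figures of this kind are typically obtained by counting parameters and timing forward passes; the helpers below are a rough sketch of one common measurement protocol (warm-up iterations followed by CUDA synchronisation) and are not necessarily the exact benchmarking setup used in the thesis.

        import time
        import torch

        def count_parameters(model):
            # Total number of learnable parameters, used for compression-factor comparisons.
            return sum(p.numel() for p in model.parameters())

        @torch.no_grad()
        def inference_time_ms(model, batch, warmup=10, iters=100):
            # Mean wall-clock forward-pass time per batch, in milliseconds.
            model.eval()
            for _ in range(warmup):
                model(batch)
            if batch.is_cuda:
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iters):
                model(batch)
            if batch.is_cuda:
                torch.cuda.synchronize()
            return (time.perf_counter() - start) / iters * 1000.0
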
    The ablation study demonstrates that knowledge distillation with integrated gradients (92.6%) outperforms standalone knowledge distillation (92.3%), integrated gradients (92.0%), and attention transfer (91.6%), with improvements statistically validated through 60-run Monte Carlo simulations (p < 0.001). Cross-hardware evaluation on RTX 3060 Ti, RTX 3090, and RTX A5000 GPUs confirms that computational efficiency gains scale across platforms. Despite attention transfer showing slightly lower accuracy when compared with knowledge distillation and integrated gradients, combining all three techniques provides a more robust framework with lower variance in performance across training runs.
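
    The abstract does not state which significance test underlies the reported p-values, so the snippet below is only an illustrative sketch of how per-run accuracies from repeated Monte Carlo trainings might be compared, here with Welch's t-test from SciPy; the function and variable names are assumptions, and the procedure actually used is described in Appendix A.1.5.

        import numpy as np
        from scipy import stats

        def compare_configurations(acc_kd_ig, acc_kd_only):
            # Compare two sets of per-run test accuracies (e.g. 60 Monte Carlo runs per configuration).
            a, b = np.asarray(acc_kd_ig), np.asarray(acc_kd_only)
            t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
            return {"mean_diff": float(a.mean() - b.mean()), "t": float(t_stat), "p": float(p_value)}
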
    Validation on an ImageNet subset aligned with CIFAR-10 classes demonstrates generalisation beyond the initial dataset, with the knowledge distillation and integrated gradients configuration achieving 85.7% accuracy and outperforming the baseline student model (83.8%). This cross-dataset performance confirms that the method helps models develop transferable features rather than dataset-specific characteristics.
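
    A hedged sketch of how such a cross-dataset evaluation set could be assembled with torchvision is given below; the synset-to-class mapping shown is a placeholder, as the actual alignment between ImageNet categories and the ten CIFAR-10 classes is documented in Appendix A.1.7.

        from torch.utils.data import Subset
        from torchvision import datasets, transforms

        # Placeholder mapping from ImageNet synset folder names to CIFAR-10 label indices;
        # illustrative entries only (e.g. a cat synset -> class 3, a dog synset -> class 5).
        SYNSET_TO_CIFAR = {"n02124075": 3, "n02084071": 5}

        def load_aligned_imagenet_subset(root, image_size=32):
            # Resize ImageNet images to CIFAR-10 resolution and keep only the mapped classes.
            tfm = transforms.Compose([transforms.Resize((image_size, image_size)),
                                      transforms.ToTensor()])
            full = datasets.ImageFolder(root, transform=tfm)
            full.target_transform = lambda y: SYNSET_TO_CIFAR[full.classes[y]]
            keep = [i for i, (_, y) in enumerate(full.samples) if full.classes[y] in SYNSET_TO_CIFAR]
            return Subset(full, keep)
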
    The research concludes that feature-level guidance through integrated gradients effectively addresses limitations in traditional knowledge distillation by highlighting important input regions. This approach enables the development of compression techniques that preserve both model performance and interpretability, providing an effective solution for deploying sophisticated neural networks in resource-constrained environments.

Table of Contents:
    Abstract (Chinese) ii
    Abstract iv
    Acknowledgments vi
    Table of Contents viii
    List of Tables ix
    List of Figures x
    Nomenclature xii
    1 Introduction 1
        1.1 Motivation and purpose 1
        1.2 Study Delimitations 2
        1.3 Publications 3
        1.4 Outline 4
    2 Background 6
        2.1 Evolution of Deep Learning Architectures 6
            2.1.1 Multilayer Perceptron (MLP) 6
            2.1.2 The Growing Scale of Neural Networks 7
            2.1.3 Foundations of Neural Networks 10
            2.1.4 Convolutional Neural Networks 12
        2.2 Model Compression 18
            2.2.1 Foundations of Model Compression 18
            2.2.2 Need for Model Compression 20
            2.2.3 Compression Techniques Overview 21
            2.2.4 Knowledge Distillation 24
            2.2.5 Attention Transfer 26
        2.3 Model Interpretability 26
            2.3.1 Foundations of Model Interpretability 26
            2.3.2 Feature Attribution Methods 27
            2.3.3 Integrated Gradients 28
            2.3.4 Challenges in Interpretability 36
        2.4 Evaluation Frameworks for Model Compression 37
            2.4.1 Benchmarks and Datasets 37
            2.4.2 Experimental Design Considerations 39
            2.4.3 Statistical Validation Methods 39
            2.4.4 Current Challenges in Model Compression 41
    3 Present investigations 43
        3.1 Model compression using knowledge distillation with integrated gradients 43
            3.1.1 Introduction 44
            3.1.2 Methodology 47
            3.1.3 Experiments 53
            3.1.4 Results and Discussion 59
            3.1.5 Conclusions 66
    4 Discussions 67
        4.1 Key Findings 67
        4.2 Theoretical and Practical Implications 68
            4.2.1 Complementarity of Knowledge Types 68
            4.2.2 Rethinking Model Capacity Requirements 68
            4.2.3 Data Augmentation as Knowledge Transfer 68
            4.2.4 Practical Applications and Efficiency Benefits 69
        4.3 Limitations 69
            4.3.1 Computational Overhead and Scalability 69
            4.3.2 Methodological Considerations 70
    5 Conclusions 71
        5.1 Future Research Directions 71
            5.1.1 Ultra-Efficient Models and Tiered Deployment 72
            5.1.2 Resolution-Compression Trade-offs 72
            5.1.3 Cross-Domain Applications 72
            5.1.4 Enhanced Knowledge Transfer 73
    References 74
    Appendix A Supplementary of Model compression using KD & IG 81
        A.1 Supplementary 82
            A.1.1 Literature Review Details 82
            A.1.2 Model Architecture and Compression Details 88
            A.1.3 Implementation Details of the Training Method 90
            A.1.4 Attention Map Analysis 96
            A.1.5 Statistical Analyses 96
            A.1.6 Hyperparameter Optimisation Results 98
            A.1.7 ImageNet Validation Details 99
            A.1.8 Detailed Performance Metrics Across Compression Levels 102
