| Graduate Student: | Su-Han Nien (粘舒涵) |
|---|---|
| Thesis Title: | Depthwise Separable Convolution Supported TPU Based on CASLab-GPU (深度分離卷積運算之張量運算單元設計於 CASLab-GPU) |
| Advisor: | Chung-Ho Chen (陳中和) |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2024 |
| Academic Year: | 112 |
| Language: | Chinese |
| Pages: | 86 |
| Keywords: | GPGPU, Depthwise Separable Convolution, TPU, Compiler |
In recent years, the emergence and development of depthwise separable convolution have brought notable advances in fields such as image processing and natural language processing. By factorizing the standard convolution into a depthwise convolution followed by a pointwise convolution, depthwise separable convolution greatly reduces both computation and parameter count while maintaining model performance. It has therefore become the first choice for efficient, lightweight models and is widely deployed on mobile phones and embedded devices. Many well-known neural network models, such as MobileNet and EfficientNet, adopt this technique to improve their computational efficiency and performance.
However, these depthwise separable convolution networks perform poorly on the CASLab-GPU developed in our laboratory. To handle the large number of convolution operations in neural networks, the CASLab-GPU includes a tensor processing unit that accelerates convolution, but this unit is markedly inefficient on depthwise convolution, whose computation pattern is unusual.
To address this problem, this thesis extends the CASLab-GPU's computation units, improves the internal data flow, and adds special instructions to support depthwise convolution. Finally, the software library is optimized to fully utilize the hardware resources. These improvements significantly raise the CASLab-GPU's performance on depthwise separable convolution networks and support its future use with lightweight, efficient neural network models.
Depthwise separable convolution is an efficient variant of the standard convolution operation used in convolutional neural networks. By factorizing a standard convolution into a depthwise convolution followed by a pointwise convolution, it significantly reduces computational complexity and parameter count while maintaining model performance. It has therefore become the preferred choice for efficient, lightweight models and is widely deployed on mobile and embedded devices. Many well-known neural network models, such as Xception [1] and MobileNet [2], apply this technique to enhance computational efficiency and performance.
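The cost saving from this factorization can be made concrete with a short back-of-the-envelope calculation; the layer sizes below are illustrative, not taken from the thesis:

```python
# Multiply-accumulate (MAC) counts for one convolution layer, assuming
# illustrative sizes: 3x3 kernel, 32 input / 64 output channels,
# 56x56 output feature map, stride 1, "same" padding.
K, C_in, C_out, H, W = 3, 32, 64, 56, 56

standard = K * K * C_in * C_out * H * W   # standard convolution
depthwise = K * K * C_in * H * W          # depthwise: one KxK filter per input channel
pointwise = C_in * C_out * H * W          # pointwise: 1x1 conv mixes channels
separable = depthwise + pointwise

print(standard, separable, separable / standard)
# The ratio equals 1/C_out + 1/K**2 (about 0.127 here),
# i.e. roughly an 8x reduction in MACs for these sizes.
```

The closed-form ratio 1/C_out + 1/K² follows by dividing the separable cost by the standard cost; for the common 3x3 kernel the saving approaches 9x as the output channel count grows.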
To handle the extensive convolution workloads in neural networks, our laboratory developed the CASLab-GPU with a tensor processing unit (TPU). The TPU is built around systolic arrays and uses the Im2col+GEMM method to accelerate standard convolution. However, this method suits only standard convolutions, so depthwise separable convolutions have run inefficiently on the CASLab-GPU.
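The Im2col+GEMM lowering mentioned above can be sketched in a few lines of NumPy (stride 1, no padding; the `im2col` helper and all sizes are illustrative, not the thesis implementation). Each k×k patch is unfolded into a column so the whole convolution becomes a single matrix multiply — exactly the shape a systolic array accelerates:

```python
import numpy as np

def im2col(x, k):
    # x: (C, H, W) input; unfold every kxk patch (stride 1, no padding)
    # into one column, giving a (C*k*k, H_out*W_out) matrix.
    C, H, W = x.shape
    Ho, Wo = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

# Standard convolution as one GEMM: (C_out, C_in*k*k) filters x columns.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))       # 4 channels, 8x8 input
w = rng.standard_normal((16, 4, 3, 3))   # 16 output channels, 3x3 kernels
out = (w.reshape(16, -1) @ im2col(x, 3)).reshape(16, 6, 6)

# Sanity check against a direct convolution at one output position.
assert np.allclose(out[0, 0, 0], np.sum(w[0] * x[:, 0:3, 0:3]))
```

A depthwise convolution breaks this mapping: each input channel has its own filter, so instead of one large GEMM the computation degenerates into many tiny per-channel products, leaving most of the systolic array idle — which is why the thesis extends the TPU rather than reusing the Im2col+GEMM path.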
In this thesis, we enhance the CASLab-GPU's support for and efficiency on depthwise convolutions by improving its TPU architecture and optimizing its data transfer paths. We also add a TPU intrinsic function to the compiler, and we optimize the algorithms in the software library to fully utilize the hardware. Together, these enhancements significantly improve the CASLab-GPU's performance when executing depthwise separable convolutions.
[1] François Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251-1258, 2017.
[2] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv preprint arXiv:1704.04861, 2017.
[3] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848-6856, 2018.
[4] Mingxing Tan and Quoc Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105-6114, 2019.
[5] B. Sander, "HSAIL: Portable Compiler IR for HSA," IEEE Hot Chips 25 Symposium (HCS), pp. 1-32, 2013.
[6] "NVIDIA Tensor Cores," [Online]. Available: https://developer.nvidia.com/tensor-cores
[7] Norman P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12, 2017.
[8] Dun-Jie Chen and Chung-Ho Chen, "LLVM-based OpenCL Compiler for CASLab-GPU," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2019.
[9] Feng-Ming Hsu and Chung-Ho Chen, "Tensor Processing Unit (TPU) Design and TPU APIs Implementation for CASLab-GPU," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[10] Jhih-Wei Wang and Chung-Ho Chen, "Computation Optimization for Neural Network on CASLab-GPGPU with TVM," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[11] Sheng-Yao Lin and Chung-Ho Chen, "Optimizing Convolution Computing on CASLab-GPU with Tensor Core," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2022.
[12] Dee-Kai Chuah and Chung-Ho Chen, "8-Bit TPU Design & GEMM (8-8-32) Kernel Implementation for CASLab-GPU," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2023.
[13] "Using CUDA Warp-Level Primitives," [Online]. Available: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
[14] "TensorFlow Official Website," [Online]. Available: https://www.tensorflow.org
[15] "Keras," [Online]. Available: https://keras.io
[16] "PyTorch," [Online]. Available: https://pytorch.org
[17] "Clang: A C Language Family Frontend for LLVM," [Online]. Available: https://clang.llvm.org
[18] "LLVM," [Online]. Available: https://llvm.org/
[19] "CUDA Introduction," [Online]. Available: https://developer.nvidia.com/cuda-zone
[20] Hugh Perkins, "CUDA-on-CL: A Compiler and Runtime for Running NVIDIA® CUDA™ C++11 Applications on OpenCL™ 1.2 Devices," Proceedings of the 5th International Workshop on OpenCL, 2017.
[21] "Parallel Thread Execution ISA Version 8.1," [Online]. Available: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
[22] Rui Xu, Sheng Ma, Yaohua Wang, Yang Guo, Dongsheng Li, and Yuran Qiao, "Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators," IEEE Transactions on Parallel and Distributed Systems, vol. 33, pp. 2860-2871, 2022.
[23] Rui Xu, Sheng Ma, Yaohua Wang, Xinhai Chen, and Yang Guo, "Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks," ACM Transactions on Architecture and Code Optimization, vol. 18, no. 4, pp. 42:1-42:24, 2021.
[24] Hyungmin Cho, "RiSA: A Reinforced Systolic Array for Depthwise Convolutions and Embedded Tensor Reshaping," ACM Transactions on Embedded Computing Systems, vol. 20, no. 5s, pp. 53:1-53:20, 2021.
Full text available on campus from 2029-07-24.