| Graduate Student: | Su-Han Nien (粘舒涵) |
|---|---|
| Thesis Title: | Depthwise Separable Convolution Supported TPU Based on CASLab-GPU (深度分離卷積運算之張量運算單元設計於 CASLab-GPU) |
| Advisor: | Chung-Ho Chen (陳中和) |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2024 |
| Academic Year: | 112 |
| Language: | Chinese |
| Pages: | 86 |
| Keywords: | GPGPU, Depthwise Separable Convolution, TPU, Compiler |
In recent years, the emergence and development of depthwise separable convolution have brought notable advances in fields such as image processing and natural language processing. By factorizing the standard convolution into a depthwise convolution followed by a pointwise convolution, depthwise separable convolution greatly reduces both computation and parameter count while maintaining model performance. It has therefore become the first choice for efficient, lightweight models and is widely deployed on mobile phones and embedded devices. Many well-known neural network models, such as MobileNet and EfficientNet, adopt this technique to improve their computational efficiency and performance.
However, these depthwise separable convolution networks perform poorly on the CASLab-GPU developed in our laboratory. To handle the large number of convolution operations in neural networks, the CASLab-GPU includes a tensor processing unit that accelerates convolution, but this unit is markedly inefficient on depthwise convolution, whose computation pattern is unusual.
To address this problem, this thesis extends the CASLab-GPU's computation units, improves the internal data flow, and adds special instructions to support depthwise convolution. Finally, the software library is optimized to fully utilize the hardware resources. These improvements significantly raise the CASLab-GPU's performance on depthwise separable convolution networks and support its future use with lightweight, efficient neural network models.
Depthwise separable convolution is an efficient variant of the standard convolution operation used in convolutional neural networks. By factorizing a standard convolution into a depthwise convolution followed by a pointwise convolution, it significantly reduces computational complexity and parameter count while maintaining model performance. It has therefore become the preferred choice for efficient, lightweight models and is widely deployed on mobile and embedded devices. Many well-known neural network models, such as Xception [1] and MobileNet [2], apply this technique to enhance computational efficiency and performance.
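The cost saving from this factorization can be made concrete with a short back-of-the-envelope calculation; the layer sizes below are illustrative, not taken from the thesis:

```python
# Multiply-accumulate (MAC) counts for one convolution layer, assuming
# illustrative sizes: 3x3 kernel, 32 input / 64 output channels,
# 56x56 output feature map, stride 1, "same" padding.
K, C_in, C_out, H, W = 3, 32, 64, 56, 56

standard = K * K * C_in * C_out * H * W   # standard convolution
depthwise = K * K * C_in * H * W          # depthwise: one KxK filter per input channel
pointwise = C_in * C_out * H * W          # pointwise: 1x1 conv mixes channels
separable = depthwise + pointwise

print(standard, separable, separable / standard)
# The ratio equals 1/C_out + 1/K**2 (about 0.127 here),
# i.e. roughly an 8x reduction in MACs for these sizes.
```

The closed-form ratio 1/C_out + 1/K² follows by dividing the separable cost by the standard cost; for the common 3x3 kernel the saving approaches 9x as the output channel count grows.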
To handle the extensive convolution workloads in neural networks, our laboratory developed the CASLab-GPU with a tensor processing unit (TPU). The TPU is built around systolic arrays and uses the Im2col+GEMM method to accelerate standard convolution. However, this method suits only standard convolutions, so depthwise separable convolutions have run inefficiently on the CASLab-GPU.
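The Im2col+GEMM lowering mentioned above can be sketched in a few lines of NumPy (stride 1, no padding; the `im2col` helper and all sizes are illustrative, not the thesis implementation). Each k×k patch is unfolded into a column so the whole convolution becomes a single matrix multiply — exactly the shape a systolic array accelerates:

```python
import numpy as np

def im2col(x, k):
    # x: (C, H, W) input; unfold every kxk patch (stride 1, no padding)
    # into one column, giving a (C*k*k, H_out*W_out) matrix.
    C, H, W = x.shape
    Ho, Wo = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

# Standard convolution as one GEMM: (C_out, C_in*k*k) filters x columns.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))       # 4 channels, 8x8 input
w = rng.standard_normal((16, 4, 3, 3))   # 16 output channels, 3x3 kernels
out = (w.reshape(16, -1) @ im2col(x, 3)).reshape(16, 6, 6)

# Sanity check against a direct convolution at one output position.
assert np.allclose(out[0, 0, 0], np.sum(w[0] * x[:, 0:3, 0:3]))
```

A depthwise convolution breaks this mapping: each input channel has its own filter, so instead of one large GEMM the computation degenerates into many tiny per-channel products, leaving most of the systolic array idle — which is why the thesis extends the TPU rather than reusing the Im2col+GEMM path.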
In this thesis, we enhance the CASLab-GPU's support for and efficiency on depthwise convolutions by improving its TPU architecture and optimizing its data transfer paths. We also add a TPU intrinsic function to the compiler, and we optimize the algorithms in the software library to fully utilize the hardware. Together, these enhancements significantly improve the CASLab-GPU's performance when executing depthwise separable convolutions.
[1] François Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251-1258, 2017.
[2] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv preprint arXiv:1704.04861, 2017.
[3] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848-6856, 2018.
[4] Mingxing Tan and Quoc Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105-6114, 2019.
[5] B. Sander, "HSAIL: Portable Compiler IR for HSA," IEEE Hot Chips 25 Symposium (HCS), pp. 1-32, 2013.
[6] "NVIDIA Tensor Cores," [Online]. Available: https://developer.nvidia.com/tensor-cores
[7] Norman P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12, 2017.
[8] Dun-Jie Chen and Chung-Ho Chen, "LLVM-based OpenCL Compiler for CASLab-GPU," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2019.
[9] Feng-Ming Hsu and Chung-Ho Chen, "Tensor Processing Unit (TPU) Design and TPU APIs Implementation for CASLab-GPU," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[10] Jhih-Wei Wang and Chung-Ho Chen, "Computation Optimization for Neural Network on CASLab-GPGPU with TVM," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[11] Sheng-Yao Lin and Chung-Ho Chen, "Optimizing Convolution Computing on CASLab-GPU with Tensor Core," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2022.
[12] Dee-Kai Chuah and Chung-Ho Chen, "8-Bit TPU Design & GEMM (8-8-32) Kernel Implementation for CASLab-GPU," Master's thesis, National Cheng Kung University, Tainan, Taiwan, 2023.
[13] "Using CUDA Warp-Level Primitives," [Online]. Available: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
[14] "TensorFlow Official Website," [Online]. Available: https://www.tensorflow.org
[15] "Keras," [Online]. Available: https://keras.io
[16] "PyTorch," [Online]. Available: https://pytorch.org
[17] "Clang: A C Language Family Frontend for LLVM," [Online]. Available: https://clang.llvm.org
[18] "LLVM," [Online]. Available: https://llvm.org/
[19] "CUDA Introduction," [Online]. Available: https://developer.nvidia.com/cuda-zone
[20] Hugh Perkins, "CUDA-on-CL: A Compiler and Runtime for Running NVIDIA® CUDA™ C++11 Applications on OpenCL™ 1.2 Devices," Proceedings of the 5th International Workshop on OpenCL, 2017.
[21] "Parallel Thread Execution ISA Version 8.1," [Online]. Available: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
[22] Rui Xu, Sheng Ma, Yaohua Wang, Yang Guo, Dongsheng Li, and Yuran Qiao, "Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators," IEEE Transactions on Parallel and Distributed Systems, vol. 33, pp. 2860-2871, 2022.
[23] Rui Xu, Sheng Ma, Yaohua Wang, Xinhai Chen, and Yang Guo, "Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks," ACM Transactions on Architecture and Code Optimization, vol. 18, no. 4, pp. 42:1-42:24, 2021.
[24] Hyungmin Cho, "RiSA: A Reinforced Systolic Array for Depthwise Convolutions and Embedded Tensor Reshaping," ACM Transactions on Embedded Computing Systems, vol. 20, no. 5s, pp. 53:1-53:20, 2021.
Full text available on campus from 2029-07-24.