| Graduate student: | 王祥宇 Wang, Hsiang-Yu |
|---|---|
| Thesis title: | Instruction Scheduling Optimization for Convolution Neural Network on Scalable CASLab-DLA–TVM System (於可擴展CASLab-DLA–TVM系統實現卷積神經網路指令排程優化) |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Publication year: | 2023 |
| Academic year: | 111 |
| Language: | Chinese |
| Pages: | 104 |
| Keywords (Chinese): | 深度學習加速器, 指令層級平行, 指令排程, 多核心系統 |
| Keywords (English): | deep learning accelerator, instruction-level parallelism, instruction scheduling, multi-core system |
CASLab_DLA is a deep neural network accelerator developed for convolutional neural networks (CNNs), currently at the electronic system-level simulation stage. Both the CASLab_DLA instruction set architecture and its microarchitecture are designed around the computations a CNN requires: the instruction set encodes convolutional-layer operation information, and the microarchitecture is optimized for different types of convolutional layers. For example, a 9-MAC PE is designed for 3x3 convolution, and pointwise and depthwise convolutions are accelerated by controlling the hardware dataflow to raise PE utilization. Beyond the accelerator core, a memory subsystem is also built so that the accelerator can run end-to-end simulations with a CPU executing TVM.
Many hardware-level optimizations have previously been implemented, improving the efficiency of executing a single instruction, but how the software schedules those instructions also affects overall computational performance. The earlier CASLab_DLA instruction generator paid little attention to instruction scheduling and had good schedules only for convolutional layers of certain specific sizes. We therefore design the HTLT (High-Throughput Less-Transfers) Instruction Generator, which uses double buffering to increase instruction-level parallelism and achieve high throughput, and exploits the trade-off between input-stationary and weight-stationary dataflows to reduce data transfer volume. Without any change to the hardware microarchitecture, the execution time of each CNN model's convolutional layers is reduced by 10% to 30%.
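The two ideas behind HTLT, choosing a dataflow by estimated off-chip traffic and overlapping loads with compute via double buffering, can be sketched as follows. This is an illustrative toy model, not the thesis implementation: the cost formulas, tile representation, and function names (`pick_dataflow`, `schedule_double_buffered`) are assumptions for exposition only.

```python
def pick_dataflow(h, w, c_in, c_out, k, n_tiles):
    """Pick the stationary scheme with the smaller estimated transfer volume.

    Hypothetical cost model: the stationary operand is loaded once,
    the other operand is re-streamed once per tile.
    """
    input_bytes = h * w * c_in            # one activation map
    weight_bytes = k * k * c_in * c_out   # full kernel set
    ws_traffic = weight_bytes + input_bytes * n_tiles   # weight-stationary
    is_traffic = input_bytes + weight_bytes * n_tiles   # input-stationary
    if ws_traffic <= is_traffic:
        return "weight-stationary", ws_traffic
    return "input-stationary", is_traffic

def schedule_double_buffered(tiles):
    """Interleave the LOAD of tile i+1 with the COMPUTE of tile i,
    ping-ponging between two on-chip buffers so transfers hide
    behind computation."""
    insts = [("LOAD", tiles[0], "buf0")]
    for i, tile in enumerate(tiles):
        if i + 1 < len(tiles):
            # Prefetch the next tile into the other buffer.
            insts.append(("LOAD", tiles[i + 1], f"buf{(i + 1) % 2}"))
        insts.append(("COMPUTE", tile, f"buf{i % 2}"))
    return insts
```

For a 56x56x64 layer with 128 3x3 kernels split into 4 tiles, the model favors input-stationary because the weights are much smaller than the activations and are cheaper to re-stream.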
In addition, CASLab_DLA was originally designed for edge computing, and its current performance falls short of the FPS required by real-time applications by a multiple. We therefore build a multi-core SoC system in which software dispatches instructions according to the number of cores present. Compared with computing a CNN model's convolutional layers on a single-core system, the dual-core system is 1.5 to 1.9 times faster and the quad-core system is 1.9 to 2.7 times faster. We also analyze the performance bottlenecks of the multi-core system as a basis for future optimization of both the system and the core microarchitecture.
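The core-count-aware dispatch described above can be sketched as a simple load-balancing problem: partition a layer's tiles across N cores and estimate the speedup as total work divided by the slowest core's work. This is a hypothetical sketch, not the thesis scheduler; the greedy longest-processing-time heuristic and the `dispatch` signature are assumptions for illustration.

```python
def dispatch(tile_costs, n_cores):
    """Assign tiles (indexed by position in tile_costs) to cores,
    largest first onto the least-loaded core, and report the ideal
    speedup bounded by the busiest core."""
    loads = [0] * n_cores
    queues = [[] for _ in range(n_cores)]
    # Longest-processing-time order: place big tiles first.
    for tile, cost in sorted(enumerate(tile_costs), key=lambda t: -t[1]):
        core = loads.index(min(loads))   # least-loaded core so far
        queues[core].append(tile)
        loads[core] += cost
    speedup = sum(tile_costs) / max(loads)
    return queues, speedup
```

In this model, uneven tile costs keep the measured speedup below the core count, which mirrors why the thesis observes 1.5x to 1.9x on two cores and 1.9x to 2.7x on four rather than ideal 2x and 4x.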
On-campus access: to be made publicly available on 2028-12-27.