
Author: Wang, Hsiang-Yu (王祥宇)
Title: Instruction Scheduling Optimization for Convolution Neural Network on Scalable CASLab-DLA–TVM System
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: Chinese
Number of Pages: 104
Keywords: deep learning accelerator, instruction-level parallelism, instruction scheduling, multi-core system

    CASLab_DLA is a deep neural network accelerator developed for convolutional neural networks (CNNs) and is currently in the electronic system-level simulation phase. Both the CASLab_DLA instruction set architecture and the microarchitecture are designed around the computations a CNN requires: the instruction set carries convolutional-layer operation information, and the hardware microarchitecture is optimized for different types of convolutional layers. For example, nine MAC PEs are designed for 3x3 convolution, and pointwise and depthwise convolutions improve PE utilization by controlling the hardware dataflow. In addition to the accelerator core, a memory subsystem is established so that the accelerator can perform end-to-end simulation with a CPU executing TVM.
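
    As an illustration of the 3x3 mapping, the sketch below is a functional model in Python, not the actual hardware: each of the nine taps of a 3x3 filter is assumed to map to one MAC PE, with all nine accumulating into the same output pixel. The loop structure and the tap-to-PE assignment are assumptions made for clarity.

    import numpy as np

    def conv3x3_9mac(ifmap, weight):
        """Functional model: the 9 filter taps correspond to 9 MAC PEs
        that, in hardware, would accumulate concurrently per output pixel."""
        H, W = ifmap.shape
        out = np.zeros((H - 2, W - 2))
        for y in range(H - 2):
            for x in range(W - 2):
                acc = 0.0
                for ky in range(3):       # the 9 (ky, kx) taps model
                    for kx in range(3):   # the 9 parallel MAC PEs
                        acc += ifmap[y + ky, x + kx] * weight[ky, kx]
                out[y, x] = acc
        return out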

    Many hardware-level optimizations have been implemented in the past to improve the efficiency of executing a single instruction, but how the software schedules instructions also affects overall computational performance. The previous CASLab_DLA instruction generator paid little attention to instruction scheduling, and only convolutional layers of certain specific sizes had a good schedule. The HTLT (High-Throughput Less-Transfers) Instruction Generator was therefore designed: it uses double buffering to raise instruction-level parallelism and achieve high throughput, and it exploits the trade-off between Input Stationary and Weight Stationary dataflows to reduce the amount of data transferred. Without changing the hardware microarchitecture, the execution time of each CNN model's convolutional layers is reduced by 10% to 30%.
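
    To make the double-buffer idea concrete, here is a minimal scheduling sketch. The LOAD/COMPUTE/STORE mnemonics and the two-buffer naming are placeholders, not the actual CASLab-DLA instruction set: while the PE array computes on one buffer, the next tile is prefetched into the other, so data transfer overlaps with computation.

    def schedule_double_buffered(n_tiles):
        """Emit a double-buffered instruction stream for n_tiles tiles.
        Placeholder mnemonics; buffers alternate so LOAD(i+1) overlaps
        COMPUTE(i) on the accelerator."""
        instrs = [("LOAD", 0, "buf0")]                # prologue: fill buffer 0
        for i in range(n_tiles):
            cur = f"buf{i % 2}"
            nxt = f"buf{(i + 1) % 2}"
            if i + 1 < n_tiles:
                instrs.append(("LOAD", i + 1, nxt))   # prefetch into the idle buffer
            instrs.append(("COMPUTE", i, cur))        # overlaps the prefetch
            instrs.append(("STORE", i, cur))          # write results back
        return instrs

    Choosing which operand stays resident (Input vs. Weight Stationary) then determines which operand the LOADs above must re-fetch each iteration, which is where the transfer-volume trade-off enters.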

    In addition, CASLab_DLA was originally designed for edge computing, and its current performance falls short of the FPS required by real-time applications by a multiple. A multi-core SoC system was therefore built, in which software can dispatch instructions according to the number of cores in the system. Compared to computing a CNN model's convolutional layers on a single-core system, the dual-core system reaches 1.5x to 1.9x speedup and the quad-core system reaches 1.9x to 2.7x. The performance bottlenecks of the multi-core system are also analyzed as a basis for future optimization of the system and the core microarchitecture.
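
    One plausible form of the per-core work split is sketched below: a convolution layer's output filters are partitioned evenly across cores. This partitioning policy is an assumption for illustration; the thesis's Instruction Dispatcher (Chapter 4) defines the actual scheme.

    def dispatch_filters(num_filters, num_cores):
        """Split a layer's output filters into one contiguous range per
        core, spreading any remainder so the load stays balanced."""
        base, rem = divmod(num_filters, num_cores)
        ranges, start = [], 0
        for core in range(num_cores):
            count = base + (1 if core < rem else 0)  # first `rem` cores get one extra
            ranges.append((core, start, start + count))
            start += count
        return ranges

    # e.g. dispatch_filters(10, 4) -> [(0, 0, 3), (1, 3, 6), (2, 6, 8), (3, 8, 10)]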

    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
        1.1  Motivation
        1.2  Contributions
        1.3  Thesis Organization
    Chapter 2  Background
        2.1  Electronic System Level Design & SystemC
        2.2  Instruction-based CASLab-DLA
            2.2.1  CASLab-DLA Instruction Set Architecture
            2.2.2  CASLab-DLA Micro-architecture
            2.2.3  CASLab-DLA Dataflow & Software Instruction Generator
            2.2.4  Memory Sub-system
        2.3  NN Compiler - Tensor Virtual Machine (TVM)
            2.3.1  TVM Compilation Flow
            2.3.2  TVM Data Structure of Intermediate Representation
            2.3.3  TVM Graph Level Optimization
            2.3.4  TVM Bring Your Own Codegen (BYOC)
            2.3.5  CASLab-DLA Runtime Library
        2.4  Full System of CASLab-DLA with TVM
            2.4.1  RISC-V QEMU
            2.4.2  Bridge Interface & Inter-process Connection
    Chapter 3  Optimization of Instruction Schedule Method
        3.1  Comprehensive Double Buffer
            3.1.1  Small Input Size Double Buffer
            3.1.2  Tiling of Different Convolution
            3.1.3  High Throughput Instruction Generator
        3.2  Load Balance of Pointwise Convolution
            3.2.1  Filter Segmentation
            3.2.2  Low Transfer Instruction Generator
        3.3  CASLab-DLA Instruction Pipeline View
    Chapter 4  System with Scalable CASLab-DLA
        4.1  Multi-core CASLab-DLA System on Chip
        4.2  Instruction Dispatcher
        4.3  Modification of CASLab-DLA Runtime Library
    Chapter 5  Experimental Environment and Performance Analysis
        5.1  Experimental Environment
        5.2  Performance Analysis of Comprehensive Double Buffer
            5.2.1  Standard Convolution after Double Buffer
            5.2.2  Depthwise Convolution after Double Buffer
        5.3  Performance Analysis of Load Balance of Pointwise Convolution
        5.4  Performance Analysis of Multi-core CASLab-DLA
            5.4.1  Standard Convolution of Yolov3-tiny
            5.4.2  Pointwise and Depthwise Convolution of Mobilenet-v1
        5.5  Overall Performance Improvement
    Chapter 6  Conclusion and Future Work
    References

