
Author: Ching, Ting-Yin (鄭庭茵)
Title: Benchmarking of Static Random Access Memory-Based Computing-in-Memory Processing Element against Systolic Array via SystemC/C++ Modeling
Advisor: Lu, Dar-Sen (盧達生)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year: 109
Language: English
Pages: 63
Keywords: Behavioral coding, SRAM, Computing-in-Memory, Systolic Array

    With the rapid development of the Internet of Things and artificial intelligence in recent years, the need to compute large amounts of data has grown. At the same time, demand for edge computing devices has risen in order to reduce network traffic, protect privacy, and speed up AI inference. Low latency and low power consumption are essential in edge devices. In the traditional von Neumann architecture, computation consumes a great deal of energy and time because data must be moved back and forth between the central processing unit and the memory. This has led to a search for high-speed, low-power computing architectures. A computational approach named computing-in-memory (CIM) addresses this by using memory arrays to compute and store simultaneously and to process large matrix operations in parallel. However, beyond the advantage of low power consumption, whether CIM can meet the low-latency requirements of edge computing is a critical factor for its practicality. This thesis investigates the characteristics of static random access memory (SRAM), whose read/write speeds are comparatively fast, and improves computation speed through structural design and optimized weight mapping. Finally, the result is compared with the systolic array design used in the Google TPU.
    In this thesis, an SRAM CIM behavior simulator written in SystemC/C++ is implemented, and structural design together with weight mapping optimization is used to reduce the time required to compute convolutions. The systolic array's computation of convolution is also simulated in SystemC. In addition, the digital components of both the systolic array and the CIM macro are implemented in Verilog, and their area and power consumption are calculated, in order to compare the advantages of the SRAM CIM processing element against those of the systolic array.
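The systolic array dataflow being compared against can likewise be sketched cycle by cycle in plain C++ (an illustrative output-stationary model of the TPU-style MAC grid, not the thesis's SystemC code): operands from the left and top edges enter with a skew of one cycle per row/column, each processing element multiplies and accumulates locally, and forwards its operands rightward and downward.

```cpp
#include <cassert>
#include <vector>

// Cycle-by-cycle sketch of an output-stationary systolic array computing
// C = A * B for N x N matrices. Row i of A enters from the left delayed by
// i cycles; column j of B enters from the top delayed by j cycles. Each PE
// multiplies the operands passing through it, accumulates the product
// locally, and passes A to the right and B downward.
std::vector<std::vector<int>> systolic_matmul(
    const std::vector<std::vector<int>>& A,
    const std::vector<std::vector<int>>& B)
{
    const int N = static_cast<int>(A.size());
    std::vector<std::vector<int>> C(N, std::vector<int>(N, 0));
    std::vector<std::vector<int>> a_reg(N, std::vector<int>(N, 0));
    std::vector<std::vector<int>> b_reg(N, std::vector<int>(N, 0));

    for (int t = 0; t < 3 * N - 2; ++t) {        // enough cycles to drain the array
        // Update PEs from bottom-right to top-left so that each PE reads its
        // left/upper neighbor's value from the *previous* cycle.
        for (int i = N - 1; i >= 0; --i) {
            for (int j = N - 1; j >= 0; --j) {
                int k = t - i - j;               // skewed injection index
                int a_in = (j == 0)
                           ? ((k >= 0 && k < N) ? A[i][k] : 0)
                           : a_reg[i][j - 1];
                int b_in = (i == 0)
                           ? ((k >= 0 && k < N) ? B[k][j] : 0)
                           : b_reg[i - 1][j];
                C[i][j] += a_in * b_in;          // local multiply-accumulate
                a_reg[i][j] = a_in;              // forward A rightward
                b_reg[i][j] = b_in;              // forward B downward
            }
        }
    }
    return C;
}
```

Because PE(i, j) sees A[i][k] and B[k][j] on the same cycle (t = i + j + k), each accumulator ends up holding one element of the product after 3N - 2 cycles, which is the latency characteristic the thesis measures against CIM.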

    Contents
    Abstract (Chinese) / Abstract / Acknowledgement / Contents / List of Figures / List of Tables
    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objective
    Chapter 2 Literature Review
      2.1 Artificial Intelligence and AI Accelerators
        2.1.1 Multilayer Perceptron (MLP) and Convolutional Neural Network (CNN)
        2.1.2 Google TPU
      2.2 Computing in Memory with SRAM
        2.2.1 6T-SRAM Read and Write
        2.2.2 TSMC's 8T-SRAM-Based CIM Circuit
        2.2.3 NeuroSim
        2.2.4 Overlapped Mapping Method (OMM)
    Chapter 3 Methodology
      3.1 CIM Macro Architecture
        3.1.1 Multibit-Weight-Product-Unit (MWPU)
        3.1.2 Binary Input Mode
        3.1.3 Sense Amplifier
        3.1.4 Ion Variation
      3.2 Weight Mapping and Data Flow of Convolution
        3.2.1 CIM
        3.2.2 Systolic Array
      3.3 NVDLA Virtual Platform
      3.4 Power Estimation
      3.5 Power and SRAM Scaling
    Chapter 4 Results and Discussion
      4.1 C++ Simulation
      4.2 SystemC Behavior Model
      4.3 CIM Convolution Weight Mapping Optimization
      4.4 Timing Estimation
      4.5 Area Estimation
      4.7 CIM vs. Systolic Array
      4.8 CIM vs. Other Devices
    Chapter 5 Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
    Answers to Thesis Defense Questions
    References
    Appendix
      A. CIM Mode in SRAM Array (SystemC code)
      B. MAC Operation in Systolic Array (SystemC code)


    Full-text availability: on campus 2026-08-23; off campus 2026-08-23.
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.