Graduate student: 傅俊輝
Fu, Jun-Hui
Thesis title: 一個適用於卷積神經網路之電荷重新分佈式記憶體內運算加速器
A Charge Redistribution Based Computing-in-Memory Accelerator for Convolutional Neural Networks
Advisor: 張順志
Chang, Soon-Jyh
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: English
Number of pages: 115
Keywords: computing-in-memory (CIM), static random access memory (SRAM), neural network (NN), charge redistribution, analog computation
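  • This thesis proposes a charge-redistribution-based computing-in-memory accelerator for convolutional neural networks. The chip adopts eight-transistor (8T) static random access memory, whose decoupled read port avoids erroneous data writes and allows analog computation to further reduce the energy needed for arithmetic operations in neural networks, thereby lowering power consumption. In addition, two techniques are proposed to improve energy efficiency. The first is a weighted capacitor switching scheme, which offers better linearity than the common current charging/discharging approach and reduces the number of analog-to-digital converters. The second is a low multiply-accumulate (MAC) value skipping scheme, which skips the analog-to-digital conversion of the first few bits to increase computing speed and reduce power consumption.
    The design was implemented as a test chip in TSMC 40-nm CMOS standard 1P9M technology. The total chip area is 2.421 mm², of which the core circuit occupies 18.24%. Measurement results show that, at a supply voltage of 0.9 V and a 100-MHz clock, the chip reaches top-1 accuracies of 99% on MNIST and 91% on CIFAR-10, with an equivalent energy efficiency of 3.7 TOPS/W that rises to 9.88 TOPS/W with the proposed low MAC value skipping technique. At a supply voltage of 0.7 V and an 80-MHz clock, the energy efficiency is 5.1 TOPS/W and reaches up to 12.02 TOPS/W with the proposed low MAC value skipping technique.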

    This thesis presents a charge-redistribution-based computing-in-memory (CIM) accelerator for convolutional neural networks (CNNs). The CIM macro adopts 9T static random access memory (SRAM) with a read-decoupled port, which avoids read disturbance and enables analog computation that further reduces the energy consumed per arithmetic operation. A weighted capacitor switching technique is proposed to achieve better linearity than the conventional current charging/discharging scheme and to reduce the number of analog-to-digital converters (ADCs). Moreover, a low multiply-accumulate (MAC) value skipping technique is proposed to increase the speed and reduce the power consumption of the CIM macro by skipping the first few bits during analog-to-digital conversion.
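    As a rough illustration of the two ideas named above, the following Python sketch models ideal charge sharing across weighted capacitors and a successive-approximation search that skips the leading bits. It is a behavioral toy under stated assumptions, not the circuit described in the thesis: the function names `charge_share` and `sar_convert`, the integer input, and the assumption that a low MAC value is already known to fit in the remaining bits are all hypothetical choices made for this example.

```python
def charge_share(voltages, caps):
    """Voltage after ideal charge sharing: sum(C_i * V_i) / sum(C_i).

    Assumes ideal capacitors with no parasitics or leakage.
    """
    total_cap = sum(caps)
    return sum(c * v for c, v in zip(caps, voltages)) / total_cap


def sar_convert(value, n_bits, skip_msbs=0):
    """Ideal SAR binary search on a non-negative integer input.

    The top `skip_msbs` bits are assumed to be zero and are not tested,
    so only (n_bits - skip_msbs) comparisons are made. If the input does
    not actually fit in the remaining bits, the result saturates.
    Returns (output_code, number_of_comparisons).
    """
    code = 0
    comparisons = 0
    for bit in range(n_bits - skip_msbs - 1, -1, -1):
        trial = code | (1 << bit)
        comparisons += 1
        if value >= trial:
            code = trial
    return code, comparisons


# Binary-weighted capacitors 4C, 2C, 1C driven by 1 V, 0 V, 1 V settle at 5/7 V.
shared = charge_share([1.0, 0.0, 1.0], [4.0, 2.0, 1.0])

full = sar_convert(5, 8)                   # (5, 8): all eight bits resolved
skipped = sar_convert(5, 8, skip_msbs=4)   # (5, 4): same code, half the comparisons
clipped = sar_convert(200, 8, skip_msbs=4) # (15, 4): input too large, code saturates
```

    In this toy model a small input is converted to the same code with fewer comparison cycles, which is the source of the speed and power saving that the skipping idea relies on. The saturating case shows why such a scheme only suits inputs expected to be small.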
    The proof-of-concept prototype was fabricated in TSMC 40-nm CMOS standard 1P9M technology. The chip occupies 2.42 mm², of which the core circuit accounts for 18.24%. Measurement results show top-1 accuracies of 99% on MNIST and 91% on CIFAR-10 at a supply voltage of 0.9 V and a clock frequency of 100 MHz. Under these conditions the equivalent energy efficiency is 3.7 TOPS/W, rising to 9.88 TOPS/W with the proposed low MAC value skipping technique. At a supply voltage of 0.7 V and a clock frequency of 80 MHz, the prototype achieves 5.1 TOPS/W, and up to 12.02 TOPS/W with the proposed low MAC value skipping technique.
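    The reported figures can be put side by side with a few lines of arithmetic. The Python snippet below only recombines numbers quoted in the abstract. The variable names and the idea of expressing the benefit as a ratio are assumptions of this example, not quantities reported in the thesis.

```python
# Figures quoted in the abstract: (baseline TOPS/W, TOPS/W with MAC value skipping).
reported = {
    "0.9 V, 100 MHz": (3.7, 9.88),
    "0.7 V, 80 MHz": (5.1, 12.02),
}


def skipping_gain(baseline, with_skipping):
    """Ratio of energy efficiency with skipping to the baseline efficiency."""
    return with_skipping / baseline


gains = {cond: round(skipping_gain(*vals), 2) for cond, vals in reported.items()}

# Core area implied by the 2.42 mm^2 chip area and the 18.24% core share.
core_area_mm2 = round(2.42 * 0.1824, 3)

print(gains)          # {'0.9 V, 100 MHz': 2.67, '0.7 V, 80 MHz': 2.36}
print(core_area_mm2)  # 0.441
```

    Read this way, the skipping technique corresponds to roughly a 2.4x to 2.7x improvement in the quoted energy efficiency, and the core circuit occupies about 0.44 mm².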

    Chinese Abstract
    Abstract
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Background and Motivation
        1.2 Computing-in-Memory
        1.3 Thesis Organization
    Chapter 2 Basics of Neural Networks and Data Management
        2.1 Fundamentals of Deep Neural Networks
        2.2 Overview of Convolutional Neural Network
            2.2.1 Basics of Convolutional Neural Networks
            2.2.2 Non-Linear Activation Function
            2.2.3 Pooling Function
            2.2.4 Normalization Function
        2.3 Quantization of Neural Networks
            2.3.1 Linear Quantization
            2.3.2 Non-Linear Quantization
            2.3.3 Quantization-Aware Training
        2.4 Introduction to Data Management
            2.4.1 Dataflow
            2.4.2 Data Reuse
    Chapter 3 Fundamentals of Computing-in-Memory
        3.1 Introduction to Computing-in-Memory
            3.1.1 Concept of Computing-in-Memory
            3.1.2 Structure of Computing-in-Memory
        3.2 Digital-to-Analog Conversion
            3.2.1 Basic Concept of DAC
            3.2.2 Static Specifications of DAC
            3.2.3 Implementation of DAC
        3.3 Mixed-Signal Computation in SRAM
            3.3.1 6T SRAM Bit-cell
            3.3.2 Charge-Based Multiplication and Accumulation
        3.4 Analog-to-Digital Conversion
            3.4.1 The Concept and Operation of SAR ADC
            3.4.2 Quantization Error
            3.4.3 Static Specifications of ADC
    Chapter 4 A Charge Redistribution Based CIM Accelerator for CNNs
        4.1 Introduction
        4.2 Proposed Architecture
            4.2.1 Low MAC Value Skipping Technique [32]
        4.3 Adopted Techniques of SAR ADC
            4.3.1 Merged Capacitor Switching Method [39]
            4.3.2 Direct Switching Technique [42] and Compact Combinational Timing Control [43]
        4.4 Circuit Implementation
            4.4.1 Peripheral Circuits of SRAM
            4.4.2 Dynamic Comparator
            4.4.3 Digital Control Logic Circuits
            4.4.4 Capacitive DAC
    Chapter 5 Simulation and Measurement Results
        5.1 Layout and Chip Floor Plan
        5.2 Simulation Results
        5.3 Die Micrograph and Measurement Setup
        5.4 Measurement Results
    Chapter 6 Conclusion and Future Works
    Bibliography

    [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015, doi: 10.1038/nature14539.
    [2] Siri Team, “Hey Siri: An on-device DNN-powered voice trigger for Apple’s personal assistant,” Apple Mach. Learn. J., vol. 1, no. 6, Oct. 2017. [Online]. Available: https://machinelearning.apple.com/2017/10/01/hey-siri.html.
    [3] Computer Vision Machine Learning Team, “An on-device deep neural network for face detection,” Apple Mach. Learn. J., vol. 1, no. 7, Nov. 2017. [Online]. Available: https://machinelearning.apple.com/2017/11/16/face-detection.html.
    [4] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017. [Online]. Available: https://arxiv.org/abs/1704.04861
    [5] D. Silver, A. Huang, C. Maddison et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–489, 2016.
    [6] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
    [7] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv preprint arXiv:1609.07061, 2016.
    [8] J. Wu, C. Leng, Y. Wang, Q. Hu and J. Cheng, “Quantized Convolutional Neural Networks for Mobile Devices,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4820-4828, doi: 10.1109/CVPR.2016.521.
    [9] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” arXiv preprint arXiv:1603.05279, 2016.
    [10] I. Hubara, D. Soudry, and R. El-Yaniv, “Binarized Neural Networks,” in Proc. NIPS, 2016, arXiv preprint arXiv:1602.02505.
    [11] Q. Dong et al., “A 351TOPS/W and 372.4GOPS Compute-in-Memory SRAM Macro in 7nm FinFET CMOS for Machine-Learning Applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2020, pp. 242-244, doi: 10.1109/ISSCC19947.2020.9062985.
    [12] M. Horowitz, “Computing's energy problem (and what we can do about it),” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2014, pp. 10-14, doi: 10.1109/ISSCC.2014.6757323.
    [13] N. Verma et al., “In-Memory Computing: Advances and Prospects,” in IEEE Solid-State Circuits Magazine, vol. 11, no. 3, pp. 43-55, Summer 2019, doi: 10.1109/MSSC.2019.2922889.
    [14] W. -S. Khwa et al., “A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2018, pp. 496-498, doi: 10.1109/ISSCC.2018.8310401.
    [15] M. Kang, S. K. Gonugondla, A. Patil and N. R. Shanbhag, “A Multi-Functional In-Memory Inference Processor Using a Standard 6T SRAM Array,” in IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 642-655, Feb. 2018, doi: 10.1109/JSSC.2017.2782087.
    [16] J. Zhang, Z. Wang and N. Verma, “In-Memory Computation of a Machine-Learning Classifier in a Standard 6T SRAM Array,” in IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915-924, April 2017, doi: 10.1109/JSSC.2016.2642198.
    [17] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, 2013, pp. 1–6.
    [18] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. ICCV, 2015, pp. 1026–1034.
    [19] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” in Proc. ICLR, 2016.
    [20] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, “Improving deep neural network acoustic models using generalized maxout networks,” in Proc. ICASSP, 2014, pp. 215–219.
    [21] Y. Zhang et al., “Towards end-to-end speech recognition with deep convolutional neural networks,” in Proc. Interspeech, 2016.
    [22] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proc. ICML, 2015.
    [23] Y. Ma, N. Suda, Y. Cao, J. Seo and S. Vrudhula, “Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA,” 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1-8, doi: 10.1109/FPL.2016.7577356.
    [24] S. Han, H. Mao, and W. Dally, “Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding,” in Computer Vision and Pattern Recognition, 2016.
    [25] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” in CoRR, 2018.
    [26] E. H. Lee, D. Miyashita, E. Chai, B. Murmann and S. S. Wong, “LogNet: Energy-efficient neural networks using logarithmic computation,” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5900-5904, doi: 10.1109/ICASSP.2017.7953288.
    [27] A. Zhou, A. Yao, Y. Guo, L. Xu and Y. Chen, “Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights,” in Proc. ICLR, 2017.
    [28] V. Sze, Y. Chen, T. Yang and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” in Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017, doi: 10.1109/JPROC.2017.2761740.
    [29] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4013-4021, doi: 10.1109/CVPR.2016.435.
    [30] J. Cong and B. Xiao, “Minimizing computation in convolutional neural networks,” in Proc. ICANN, 2014, pp. 281–290.
    [31] X. Si et al., “A Twin-8T SRAM Computation-In-Memory Macro for Multiple-Bit CNN-Based Machine Learning,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2019, pp. 396-398, doi: 10.1109/ISSCC.2019.8662392.
    [32] X. Si et al., “A 28nm 64Kb 6T SRAM Computing-in-Memory Macro with 8b MAC Operation for AI Edge Chips,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2020, pp. 246-248, doi: 10.1109/ISSCC19947.2020.9062995.
    [33] J. -W. Su et al., “A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2021, pp. 250-252, doi: 10.1109/ISSCC42613.2021.9365984.
    [34] M. E. Sinangil et al., “A 7-nm Compute-in-Memory SRAM Macro Supporting Multi-Bit Input, Weight and Output and Achieving 351 TOPS/W and 372.4 GOPS,” in IEEE Journal of Solid-State Circuits, vol. 56, no. 1, pp. 188-198, Jan. 2021, doi: 10.1109/JSSC.2020.3031290.
    [35] A. J. Bhavnagarwala, X. Tang and J. D. Meindl, “The impact of intrinsic device fluctuations on CMOS SRAM cell stability,” in IEEE Journal of Solid-State Circuits, vol. 36, no. 4, pp. 658-665, April 2001, doi: 10.1109/4.913744.
    [36] A. Biswas and A. P. Chandrakasan, “Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2018, pp. 488-490, doi: 10.1109/ISSCC.2018.8310397.
    [37] B. Murmann, “ADC Performance Survey 1997-2018,” [Online]. Available: http://web.stanford.edu/~murmann/adcsurvey.html.
    [38] H. Jia, H. Valavi, Y. Tang, J. Zhang and N. Verma, “A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing,” in IEEE Journal of Solid-State Circuits, vol. 55, no. 9, pp. 2609-2621, Sept. 2020, doi: 10.1109/JSSC.2020.2987714.
    [39] V. Hariprasath, J. Guerber, S. Lee and U. Moon, “Merged capacitor switching based SAR ADC with highest switching energy-efficiency,” Electronics Letters, vol. 46, no. 9, pp. 620-621, 2010.
    [40] Y. Zhu et al., “A 10-bit 100-MS/s Reference-Free SAR ADC in 90 nm CMOS,” in IEEE Journal of Solid-State Circuits, vol. 45, no. 6, pp. 1111-1121, June 2010, doi: 10.1109/JSSC.2010.2048498.
    [41] C. Liu, S. Chang, G. Huang and Y. Lin, “A 10-bit 50-MS/s SAR ADC With a Monotonic Capacitor Switching Procedure,” in IEEE Journal of Solid-State Circuits, vol. 45, no. 4, pp. 731-740, April 2010, doi: 10.1109/JSSC.2010.2042254.
    [42] G. Huang, S. Chang, Y. Lin, C. Liu and C. Huang, “A 10b 200MS/s 0.82mW SAR ADC in 40nm CMOS,” 2013 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2013, pp. 289-292, doi: 10.1109/ASSCC.2013.6691039.
    [43] C.-H. Kuo, “A 10-bit 120-MS/s SAR ADC with compact architecture and noise suppression technique,” M.S. thesis, Dept. Elect. Eng., National Cheng Kung Univ., Tainan, Taiwan, 2014.
    [44] C. Liu, C. Kuo and Y. Lin, “A 10 bit 320 MS/s Low-Cost SAR ADC for IEEE 802.11ac Applications in 20 nm CMOS,” in IEEE Journal of Solid-State Circuits, vol. 50, no. 11, pp. 2645-2654, Nov. 2015, doi: 10.1109/JSSC.2015.2466475.
    [45] J. Yue et al., “A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2021, pp. 238-240, doi: 10.1109/ISSCC42613.2021.9365958.
    [46] C. -X. Xue et al., “A 22nm 4Mb 8b-Precision ReRAM Computing-in-Memory Macro with 11.91 to 195.7TOPS/W for Tiny AI Edge Devices,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2021, pp. 245-247, doi: 10.1109/ISSCC42613.2021.9365769.

    Full text availability: on campus, open from 2022-09-27
    Off campus, open from 2022-09-27