
Graduate Student: Wang, Yu-Cheng (王昱承)
Thesis Title (Chinese): 利用分段技術釋放 Softmax 潛能之基於 ReRAM 記憶體內運算加速器
Thesis Title (English): An Efficient ReRAM-based Processing-in-Memory Accelerator using Segmentation to Unleash Softmax's Potential
Advisor: Lin, Ing-Chao (林英超)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduating Academic Year: 112 (ROC calendar, 2023-2024)
Language: English
Number of Pages: 58
Keywords (Chinese): 自我注意力、記憶體處理器、可變電阻式記憶體
Keywords (English): Self-attention, Processing-in-memory, ReRAM
Access Statistics: Views: 49; Downloads: 0
Table of Contents:
  摘要 (Abstract in Chinese) i
  Abstract ii
  Table of Contents iii
  List of Tables v
  List of Figures vi
  Chapter 1. Introduction 1
    1.1. Contribution 5
  Chapter 2. Background and Motivation 7
    2.1. Transformer 7
      2.1.1. Structure 7
      2.1.2. Attention Mechanism 8
    2.2. RRAM-based PIM 10
      2.2.1. Vector/Matrix-Matrix Multiplication 10
      2.2.2. Searching and Subtraction 12
    2.3. Motivation 14
      2.3.1. Data Dependency during Softmax Computation 15
      2.3.2. Accuracy Loss when Computing Local Softmax 17
      2.3.3. Ineffective Softmax Computation 18
  Chapter 3. SegTransformer 20
    3.1. Overall Structure 21
      3.1.1. Segmented Transmission Module 21
      3.1.2. Integrated Softmax Processing Unit (ISPU) 21
      3.1.3. Matrix Multiplication Module 22
    3.2. Segmented Transmission Module 22
    3.3. Integrated Softmax Processing Unit (ISPU) 25
      3.3.1. FindMAX 27
      3.3.2. Subtraction and Rounding 28
      3.3.3. Softmax Recalibrate 29
      3.3.4. Exponential Index Rounding 30
    3.4. Matrix Multiplication Module 31
  Chapter 4. Experimental Evaluation 33
    4.1. Experimental Evaluation 33
      4.1.1. Experimental Setup 33
    4.2. Experimental Results 36
      4.2.1. Latency Comparison 36
      4.2.2. Power Comparison 38
      4.2.3. Area Comparison 39
      4.2.4. Computing Efficiency Comparison 40
    4.3. Experimental Results with Different Segment Number 41
      4.3.1. Latency Comparison 41
      4.3.2. Power Comparison 42
      4.3.3. Area Comparison 43
    4.4. Discussion 43
  Chapter 5. Conclusion 46
    5.1. Conclusion 46
  References 47


Availability (on campus): open access from 2029-08-01
Availability (off campus): open access from 2029-08-01
The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.