| Graduate Student: | 呂孟洋 Lu, Meng-Yang |
|---|---|
| Thesis Title: | 應用基底-差值遮罩之低延遲記憶體壓縮架構 (A Low Latency Memory Compression Framework Using Base-Delta Mask) |
| Advisor: | 郭致宏 Kuo, Chih-Hung |
| Degree: | Master (碩士) |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering (電機資訊學院 - 電機工程學系) |
| Year of Publication: | 2017 |
| Academic Year: | 106 |
| Language: | Chinese |
| Keywords (Chinese): | 無損資料壓縮、快取壓縮、主記憶體壓縮、記憶體連接壓縮 |
| Keywords (English): | Lossless Data Compression, Cache Compression, Main Memory Compression, Memory I/O Compression |
| Pages: | 97 |
Abstract (translated from the original Chinese):

In modern computing architectures, processor speed keeps increasing and the performance gap between processor and memory widens accordingly, making memory one of the key bottlenecks of overall performance. By applying lossless data compression algorithms to the memory hierarchy and transferring memory data in compressed form, we can indirectly increase the effective memory capacity or reduce bandwidth consumption, improving overall system performance. This thesis proposes a low-latency memory compression framework developed for the floating-point representation, while also supporting integer-format data. Using the cache block as the compression unit and selecting a reference base, we analyze the compressibility characteristics of memory data and design a compression procedure with low hardware latency, so that the overhead does not cancel out the benefits of compression. Experiments show that the proposed compression framework achieves an average compression ratio of 50.18% across a wide range of benchmarks. When integrated into a graphics processing unit (GPU), applying the memory I/O link compression framework reduces memory bandwidth usage by about 45%; the resulting improvement in memory-access power consumption is evaluated with DRAMPower and PrimeTime.
As the speed of contemporary processors continues to increase, the performance gap between processor and memory keeps widening. Even though we can increase memory capacity or bandwidth, memory remains a critical bottleneck for data-intensive applications. Data compression is a promising approach to this problem because it reduces the amount of data to be stored and transferred. In general-purpose graphics processing units (GPGPUs), many parallel computing applications operate on large amounts of single-precision floating-point data. However, most prior memory compression algorithms were designed for integer data. In this thesis, we propose a compression algorithm with low-latency decompression for GPGPUs. Both floating-point and integer in-memory data can be effectively compressed while introducing minimal impact on access latency and hardware complexity. Simulation results demonstrate that our compression algorithm significantly reduces data size, with an average compression ratio of 50.18%. By integrating the proposed architecture with the memory I/O controller, the GPU reduces memory bandwidth usage by about 45% for large data transfers.
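To make the underlying idea concrete, here is a minimal sketch of generic base-delta cache-block compression, the family of techniques the proposed Base-Delta Mask scheme builds on. This is an illustration only, not the thesis's actual algorithm: the function `bd_compress`, its parameters, and the fixed 4-byte word / 1-byte delta layout are all assumptions made for the example.

```python
def bd_compress(words, delta_bytes=1):
    """Sketch of base-delta compression for one cache block.

    `words` is the block as a list of 4-byte unsigned words. The first
    word is taken as the base; every other word is stored as a narrow
    signed delta from the base if all deltas fit in `delta_bytes` bytes.
    """
    base = words[0]
    deltas = [w - base for w in words[1:]]
    limit = 1 << (8 * delta_bytes - 1)        # signed range of one delta
    if all(-limit <= d < limit for d in deltas):
        size = 4 + delta_bytes * len(deltas)  # base + narrow deltas
        return size, (base, deltas)
    return 4 * len(words), None               # incompressible: keep raw

# A 32-byte block of eight nearby values compresses to 4 + 7 = 11 bytes,
# roughly 34% of the original size.
block = [0x1000_0000 + i for i in range(8)]
print(bd_compress(block)[0])  # prints 11
```

Floating-point words rarely cluster around a single integer base the way the example values do, which is the gap the thesis's Base-Delta Mask targets for single-precision data; the sketch above covers only the plain integer case.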
Campus access: available from 2019-10-26.