
Graduate Student: Lu, Meng-Yang (呂孟洋)
Thesis Title: A Low Latency Memory Compression Framework Using Base-Delta Mask (應用基底-差值遮罩之低延遲記憶體壓縮架構)
Advisor: Kuo, Chih-Hung (郭致宏)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2017
Graduation Academic Year: 106
Language: Chinese
Number of Pages: 97
Keywords: Lossless Data Compression, Cache Compression, Main Memory Compression, Memory I/O Compression
    As processor speeds in modern computer architectures keep increasing, the performance gap between processor and memory widens accordingly and is regarded as a key bottleneck for overall performance. Applying lossless data compression algorithms in the memory hierarchy, so that memory data is transferred in compressed form, can indirectly increase usable memory capacity or reduce bandwidth consumption and thereby improve overall system performance. This thesis proposes a low-latency memory compression framework designed for the floating-point representation while also supporting integer-format data. Using the cache block as the compression unit, a reference base is selected; after analyzing the compressibility characteristics of memory data, a compression procedure with low hardware latency is designed so that the latency overhead does not cancel out the benefits of compression. Experiments show that the proposed framework achieves an average compression ratio of 50.18% across numerous benchmarks. When integrated into a graphics processing unit (GPU), applying the memory I/O link compression architecture reduces memory bandwidth usage by about 45%; the improvement in memory access power is evaluated with DRAMPower and PrimeTime.

    As the speed of contemporary computer architectures increases, the performance gap between processor and memory continues to grow. Even with increased capacity or bandwidth, memory remains a critical bottleneck for data-intensive applications. Data compression is a promising approach to this problem because it reduces data size. In general-purpose graphics processing units (GPGPUs), many parallel computing applications operate on large amounts of single-precision floating-point data. However, most prior memory compression algorithms were designed for integer data. In this thesis, we propose a compression algorithm with low-latency decompression for GPGPUs. Both floating-point and integer in-memory data can be compressed effectively with minimal impact on access latency and hardware complexity. Simulation results demonstrate that our algorithm significantly reduces data size, with an average compression ratio of 50.18%. By integrating the proposed architecture with the memory I/O controller, the GPU reduces memory bandwidth usage by 45% for large data transfers.
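    The base-delta idea the proposed framework extends can be sketched as follows: treat a cache block as an array of words, pick one word as the base, and store only a small delta per word whenever every delta fits in a narrow field. This is a minimal illustrative sketch (the function name and the one-byte delta width are assumptions; the actual LLBΔM pattern codebook and field widths are defined in the thesis, not here):

```python
def base_delta_compress(block, delta_bytes=1):
    """Base-delta compression of a cache block of 32-bit words (sketch).

    Returns (base, deltas) when every word's delta from the base fits in a
    signed `delta_bytes`-byte field, else None (block stays uncompressed).
    """
    base = block[0]
    deltas = [word - base for word in block]
    limit = 1 << (8 * delta_bytes - 1)  # signed range: [-limit, limit - 1]
    if all(-limit <= d < limit for d in deltas):
        # Compressed size: one 4-byte base plus one delta byte per word.
        return base, deltas
    return None  # incompressible: fall back to the raw block

# Neighbouring values in regular data often differ only slightly,
# so an eight-word (32-byte) block shrinks to 4 + 8 = 12 bytes.
compressed = base_delta_compress([1000, 1001, 998, 1003, 1000, 999, 1002, 1001])
```

    Decompression is a single parallel add of each delta to the base, which is what keeps the decompression latency low.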

    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    1 Introduction
      1-1 Preface
      1-2 Motivation
      1-3 Contributions
      1-4 Thesis Organization
    2 Background
      2-1 Graphics Processing Unit (GPU) Architecture
      2-2 IEEE 754 Floating-Point Format
      2-3 Memory Hierarchy
        2-3-1 Cache
        2-3-2 Main Memory
    3 Related Work on Memory Compression
      3-1 General-Purpose Lossless Data Compression Algorithms
        3-1-1 LZ77
        3-1-2 LZO
      3-2 Memory Compression Algorithms
        3-2-1 Frequent Pattern Compression (FPC)
        3-2-2 Base-Delta-Immediate Compression (BΔI)
        3-2-3 Combined Dictionary and Pattern Cache Compression (Cache Packer, C-Pack)
        3-2-4 Comparison of Memory Compression Algorithms
      3-3 Memory-Hierarchy Compression Architectures
        3-3-1 Cache Compression
        3-3-2 Main Memory Compression
        3-3-3 Main Memory I/O Link Compression
        3-3-4 Comparison of Memory-Hierarchy Compression Architectures
    4 Low-Latency Base-Delta Mask Compression Framework (LLBΔM)
      4-1 LLBΔM Memory Compression Algorithm
        4-1-1 LLBΔM Compression Pattern Codebook
        4-1-2 LLBΔM Compression Flow
        4-1-3 LLBΔM Decompression Flow
      4-2 GPGPU Benchmark Data Extraction
      4-3 GPGPU Memory Data Analysis
        4-3-1 Value Distribution of the Individual IEEE 754 Floating-Point Fields
        4-3-2 Base Analysis Architecture for GPU Memory Cache Blocks
        4-3-3 Base-Delta Analysis of the Exponent Field
        4-3-4 Base-Bitmask Analysis of the Fraction Field
      4-4 LLBΔM Compression and Decompression Hardware Architecture
        4-4-1 LLBΔM Compression Engine
        4-4-2 LLBΔM Decompression Engine
      4-5 CASLAB-GPUSIM Memory I/O Link Compression Architecture
        4-5-1 GPU Main Memory Specification
        4-5-2 Memory I/O Link Compression Unit Using LLBΔM
        4-5-3 Compressed Cache-Block Reads from Main Memory
        4-5-4 Compressed Cache-Block Writes to Main Memory
      4-6 Optimizations for Memory I/O Link Compression
        4-6-1 Compression Fragmentation of Memory Bandwidth
        4-6-2 Optimized LLBΔM Memory I/O Compression
        4-6-3 Adaptive LLBΔM
    5 Experimental Results and Analysis
      5-1 Compression-Ratio Experiments: Setup and Results
        5-1-1 LLBΔM Parameter Settings and Analysis
        5-1-2 Distribution of LLBΔM Compression Patterns
        5-1-3 Overall Compression-Ratio Results
      5-2 GDDR3 Memory Bandwidth Analysis
        5-2-1 GDDR3 Memory Bandwidth Utilization
        5-2-2 Memory I/O Read Time
      5-3 CASLAB-GPUSIM Performance Improvement
      5-4 LLBΔM Hardware Synthesis Results and Comparison
        5-4-1 PrimeTime Power Analysis
        5-4-2 System Memory Power Evaluation
        5-4-3 Synthesis Comparison
      5-5 Adaptive LLBΔM Experimental Results
        5-5-1 Adaptive LLBΔM Compression Ratio
        5-5-2 Adaptive LLBΔM Memory Utilization Comparison
    6 Conclusion and Future Work
      6-1 Conclusion
      6-2 Future Work
    References
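    The analysis in sections 4-3-1 through 4-3-4 works on the three IEEE 754 single-precision fields separately: a 1-bit sign, an 8-bit biased exponent, and a 23-bit fraction. That decomposition can be sketched as follows (illustrative only; the thesis's base-delta treatment of exponents and base-bitmask treatment of fractions is not reproduced here):

```python
import struct

def float_fields(x):
    """Split a single-precision float into its IEEE 754 bit fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]  # raw 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 bits
    return sign, exponent, fraction

# Values of similar magnitude share an exponent (all values here lie in
# [2, 4), so the biased exponent is 128), which is why exponents compress
# well with base-delta while the noisier fractions call for a bitmask.
exponents = [float_fields(v)[1] for v in (3.14, 3.15, 3.16, 2.98)]
```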

    [1] A. R. Alameldeen and D. A. Wood, “Frequent pattern compression: A significance-based compression scheme for L2 caches,” Dept. of Computer Sciences, Univ. of Wisconsin-Madison, Tech. Rep. 1500, 2004.
    [2] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Base-delta-immediate compression: practical data compression for on-chip caches,” in Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pp. 377–388, ACM, 2012.
    [3] X. Chen, L. Yang, R. P. Dick, L. Shang, and H. Lekatsas, “C-pack: A high-performance microprocessor cache compression algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8, pp. 1196–1208, 2010.
    [4] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54, Oct 2009.
    [5] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 163–174, April 2009.
    [6] V. Sathish, M. J. Schulte, and N. S. Kim, “Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads,” in Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pp. 325–334, ACM, 2012.
    [7] D. Patterson, “The top 10 innovations in the new nvidia fermi architecture, and the top 3 next challenges,” Nvidia Whitepaper, vol. 47, 2009.
    [8] D. Patterson and J. Hennessy, “Computer organization and design,” Morgan Kaufmann, 2014.
    [9] “IEEE standard for floating-point arithmetic,” IEEE Std 754-2008, pp. 1–70, Aug 2008.
    [10] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
    [11] S. C. Tai, “Data compression,” FLAG, 2009.
    [12] H. Lekatsas, R. P. Dick, S. Chakradhar, and L. Yang, “CRAMES: compressed RAM for embedded systems,” in 2005 Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’05), pp. 93–98, Sept 2005.
    [13] J. Dusser, T. Piquet, and A. Seznec, “Zero-content augmented caches,” in Proceedings of the 23rd international conference on Supercomputing, pp. 46–55, ACM, 2009.
    [14] J. Yang, R. Gupta, and C. Zhang, “Frequent value encoding for low power data buses,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 9, no. 3, pp. 354–384, 2004.
    [15] V. Bui and M. A. Kim, “The cache and codec model for storing and manipulating data,” IEEE Micro, vol. 34, no. 4, pp. 28–35, 2014.
    [16] M. Thuresson and P. Stenstrom, “Accommodation of the bandwidth of large cache blocks using cache/memory link compression,” in Parallel Processing, 2008. ICPP’08. 37th International Conference on, pp. 478–486, IEEE, 2008.
    [17] K. S. Yim, J. Kim, and K. Koh, “Performance analysis of on-chip cache and main memory compression systems for high-end parallel computers,” in PDPTA, pp. 469–475, 2004.
    [18] A. R. Alameldeen and D. A. Wood, “Adaptive cache compression for high performance processors,” in Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on, pp. 212–223, IEEE, 2004.
    [19] M. Ekman and P. Stenstrom, “A robust main-memory compression scheme,” in ACM SIGARCH Computer Architecture News, vol. 33, pp. 74–85, IEEE Computer Society, 2005.
    [20] G. Pekhimenko, T. C. Mowry, and O. Mutlu, “Linearly compressed pages: A main memory compression framework with low complexity and low latency,” in Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pp. 489–490, ACM, 2012.
    [21] M. Thuresson, L. Spracklen, and P. Stenstrom, “Memory-link compression schemes: A value locality perspective,” IEEE Transactions on Computers, vol. 57, pp. 916–927, July 2008.
    [22] B. X. Zeng and C. H. Chen, “Architecture exploration and optimization of CASLAB-GPUSIM memory subsystem,” thesis, Department of Electrical Engineering, National Cheng Kung University, 2017.
    [23] C. M. Chiu and C. H. Chen, “GPU warp scheduling using memory stall sampling on CASLAB-GPUSIM,” thesis, Department of Electrical Engineering, National Cheng Kung University, 2017.
    [24] 蔡森至, “Optimization of workgroup scheduling on CASLAB-GPUSIM,” thesis, Department of Electrical Engineering, National Cheng Kung University, 2017.
    [25] A. Hansson, N. Agarwal, A. Kolli, T. Wenisch, and A. N. Udipi, “Simulating dram controllers for future system architecture exploration,” in Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, pp. 201–210, IEEE, 2014.
    [26] K. Chandrasekar, C. Weis, Y. Li, B. Akesson, N. Wehn, and K. Goossens, “DRAMPower: Open-source DRAM power & energy estimation tool,” http://www.drampower.info, 2012.
    [27] K. Chandrasekar, B. Akesson, and K. Goossens, “Improved power modeling of ddr sdrams,” in Digital System Design (DSD), 2011 14th Euromicro Conference on, pp. 99–108, IEEE, 2011.

    Full-text availability: on campus, public from 2019-10-26; off campus, not public.
    The electronic thesis has not yet been authorized for public release; for the print copy, consult the library catalog.