
Graduate Student: 莊竺家 (Chuang, Chu-Chia)
Thesis Title: An Improved Dynamic Hash Table on GPU with Configurable On-chip Memory (圖形處理器中基於可配置內建記憶體的改良動態雜湊表)
Advisor: 張燕光 (Chang, Yeim-Kuan)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of Publication: 2023
Graduating Academic Year: 111 (ROC calendar)
Language: English
Number of Pages: 80
Keywords (Chinese): 雜湊表、圖形處理器、共享記憶體、SIMT
Keywords (English): Hash Table, GPU, Shared Memory, SIMT

    Hash tables are crucial data structures for the efficient storage and retrieval of data, with wide applications in domains such as computer graphics, machine learning, traffic tracking, and database queries. The main objective of this thesis is to implement a dynamic hash table that achieves high throughput and operates in parallel. We propose a technique for data-parallel hashing on graphics processing units (GPUs) and improve upon the dynamic hash table by leveraging the GPU's configurable on-chip memory. This approach enables high-throughput batch insertion, search, and deletion under workloads that follow the real-world 80/20 rule. We exploit the programming techniques provided by NVIDIA CUDA, minimizing memory accesses through well-designed data structures; atomic operations ensure the absence of race conditions, and cooperative groups maximize the utilization of GPU threads for optimal parallel computation. The GPU executes hash table operations in parallel through the SIMT (Single Instruction, Multiple Threads) execution model. Furthermore, we enhance existing data-parallel hashing techniques on GPUs to address the challenge of throughput degradation when the same keys are frequently updated in the hash table. Our architecture significantly mitigates this issue and efficiently handles the real-world scenario in which 20% of the elements account for 80% of the accesses, including frequent duplicate insertions or updates of the same key. Additionally, we adopt the LRU (Least Recently Used) policy to keep frequently queried data in the configurable on-chip memory, ensuring that important and frequently requested entries are readily accessible. According to the experimental results, under the real-world 80/20 rule our proposed method achieves an 82.64x improvement in insertion throughput over the state-of-the-art parallel hashing techniques on GPUs, and increases search throughput by 5 to 8%.
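The interplay the abstract describes (a skewed 80/20 workload plus an LRU-managed cache kept in fast on-chip memory in front of the main hash table) can be illustrated with a small host-side C++ sketch. This is not the thesis's CUDA implementation: the class name, cache capacity, and hit counters below are illustrative assumptions, and `std::list`/`std::unordered_map` merely stand in for the shared-memory and global-memory tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Hypothetical host-side model of the search path described in the
// abstract: a small LRU cache (standing in for the GPU's configurable
// shared memory) sits in front of the main hash table (global memory).
class LruFrontedTable {
public:
    explicit LruFrontedTable(std::size_t cacheCapacity)
        : capacity_(cacheCapacity) {}

    void insert(std::uint32_t key, std::uint32_t value) {
        mainTable_[key] = value;  // duplicate keys simply update in place
        touch(key, value);        // recently used keys migrate into the cache
    }

    // Returns true on success; counts whether the hit came from the cache.
    bool search(std::uint32_t key, std::uint32_t& value) {
        auto it = cacheIndex_.find(key);
        if (it != cacheIndex_.end()) {
            ++cacheHits_;
            lru_.splice(lru_.begin(), lru_, it->second);  // move to front
            value = it->second->second;
            return true;
        }
        auto mt = mainTable_.find(key);
        if (mt == mainTable_.end()) return false;
        ++mainHits_;
        value = mt->second;
        touch(key, value);  // promote the key just searched
        return true;
    }

    std::size_t cacheHits() const { return cacheHits_; }
    std::size_t mainHits() const { return mainHits_; }

private:
    using Entry = std::pair<std::uint32_t, std::uint32_t>;

    void touch(std::uint32_t key, std::uint32_t value) {
        auto it = cacheIndex_.find(key);
        if (it != cacheIndex_.end()) {
            it->second->second = value;
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        lru_.emplace_front(key, value);
        cacheIndex_[key] = lru_.begin();
        if (lru_.size() > capacity_) {  // evict the least recently used entry
            cacheIndex_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }

    std::size_t capacity_;
    std::size_t cacheHits_ = 0, mainHits_ = 0;
    std::list<Entry> lru_;
    std::unordered_map<std::uint32_t, std::list<Entry>::iterator> cacheIndex_;
    std::unordered_map<std::uint32_t, std::uint32_t> mainTable_;
};
```

Under an 80/20 workload, where roughly 20% of the keys receive 80% of the queries, the hot keys stay resident in the small cache, so most searches never touch the slower main table; this is the effect the shared-memory table with LRU replacement aims for on the GPU.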

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements (誌謝)
    TABLE OF CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    Chapter 1 Introduction
      1.1 Introduction
      1.2 Organization of the Thesis
    Chapter 2 Related Work
      2.1 Graphics Processing Unit (GPU)
        2.1.1 Compute Unified Device Architecture (CUDA)
        2.1.2 The NVIDIA GPU Manycore Architecture
        2.1.3 The Memory Systems of a GPU-Accelerated Computer
      2.2 CUDA Programming Model and GPU Architecture
        2.2.1 Host and Device Memory
        2.2.2 Unified Memory
        2.2.3 Parallel Execution with CUDA Threads and Thread Blocks
        2.2.4 GPU Architecture and Memory Access Optimization
        2.2.5 On-Chip Memory Management and Thread Scheduling
        2.2.6 Configurable Shared Memory and L1 Cache Ratio
      2.3 Hashing Techniques
        2.3.1 Perfect Hashing
        2.3.2 Separate Chaining
        2.3.3 Open Addressing
      2.4 Optimizing GPU Programs
        2.4.1 Warp Divergence Prevention
        2.4.2 Cooperative Groups
        2.4.3 Coalesced Memory Access
        2.4.4 Bank Conflict Prevention
        2.4.5 Atomic Operations
        2.4.6 Configurable Shared Memory Carveout
      2.5 Data-Parallel Hashing for GPU Architectures
        2.5.1 GPU Hashing Techniques with Separate Chaining
        2.5.2 GPU Hashing Techniques with Open Addressing
    Chapter 3 Proposed Scheme
      3.1 Overview and Motivation
      3.2 Main Hash Table Structure
      3.3 Insertion
        3.3.1 Primary Insertion
        3.3.2 Alternative Insertion
        3.3.3 Collision and Eviction
        3.3.4 Insertions/Updates on the Same Key
        3.3.5 Overview of Insertion Algorithm
      3.4 Search
        3.4.1 Search Flow Overview
        3.4.2 Construction Process of the Shared Memory Hash Table
        3.4.3 Resolving Collisions in Shared Memory Hash Table: LRU Approach
        3.4.4 Synchronization of Hash Table on Shared Memory
        3.4.5 Overview of Search Algorithm
      3.5 Deletion
        3.5.1 Deletion Flow Overview
        3.5.2 Overview of Delete Algorithm
    Chapter 4 Experimental Results
      4.1 Experimental Environment
      4.2 Datasets
      4.3 Experimental Results
        4.3.1 Experimental Results of Insertion
        4.3.2 Experimental Results of Search
        4.3.3 Experimental Results of Deletion
    Chapter 5 Conclusion
    References
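The insertion steps outlined in Sections 3.3.3 and 3.3.4 (collision handling and duplicate inserts/updates of the same key) depend on atomic operations to stay race-free, as the abstract notes. The following host-side sketch models that idea with `std::atomic::compare_exchange_strong` standing in for CUDA's `atomicCAS`; the linear-probing scheme, table size, and empty-key sentinel here are illustrative assumptions, not the thesis's actual layout.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical model of race-free slot claiming in an open-addressing
// table. On the GPU the same pattern uses atomicCAS on a key slot so
// that exactly one thread wins each empty slot.
constexpr std::uint32_t kEmpty = 0;  // assume 0 is never a valid key
constexpr std::size_t kSlots = 64;

struct AtomicTable {
    std::array<std::atomic<std::uint32_t>, kSlots> keys;
    std::array<std::uint32_t, kSlots> values{};

    AtomicTable() {
        for (auto& k : keys) k.store(kEmpty, std::memory_order_relaxed);
    }

    // Linear probing: atomically claim the first empty slot, or update
    // in place when the key is already present (duplicate insert).
    bool insert(std::uint32_t key, std::uint32_t value) {
        std::size_t h = key % kSlots;
        for (std::size_t probe = 0; probe < kSlots; ++probe) {
            std::size_t i = (h + probe) % kSlots;
            std::uint32_t expected = kEmpty;
            // Only one thread can swing the slot from kEmpty to key;
            // losers see the winner's key in `expected`.
            if (keys[i].compare_exchange_strong(expected, key) ||
                expected == key) {
                values[i] = value;
                return true;
            }
        }
        return false;  // table full
    }

    std::optional<std::uint32_t> search(std::uint32_t key) const {
        std::size_t h = key % kSlots;
        for (std::size_t probe = 0; probe < kSlots; ++probe) {
            std::size_t i = (h + probe) % kSlots;
            std::uint32_t k = keys[i].load();
            if (k == key) return values[i];
            if (k == kEmpty) return std::nullopt;  // probe chain ends
        }
        return std::nullopt;
    }
};
```

The duplicate-key branch (`expected == key`) is what keeps repeated inserts or updates of the same key from consuming new slots, which is the degenerate case the thesis's same-key handling targets.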

    Full text access: available on campus from 2028-08-23; available off campus from 2028-08-23.
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.