
Graduate Student: 張弘諭 (Chang, Hung-Yu)
Thesis Title: 適應性多應用程序MapReduce處理框架於圖形處理器之研究與實現
Adaptive MapReduce Framework for Multi-Application Processing on GPU
Advisor: 黃悅民 (Huang, Yueh-Min)
Degree: Master
Department: Department of Engineering Science, College of Engineering
Year of Publication: 2013
Academic Year of Graduation: 101 (2012-2013)
Language: Chinese
Number of Pages: 74
Chinese Keywords: MapReduce, GPU, GPGPU, Mars, 操作重疊性 (operation overlapping)
Foreign Keywords: MapReduce, GPU, GPGPU, Mars, Overlapped GPU Operations
Hits: 89; Downloads: 0
    With the rapid progress of electronic and information technology in recent years, the volume of data that enterprises must process has grown steadily. Thanks to the development and evolution of the distributed processing framework MapReduce, processing large amounts of data is no longer a difficult problem. Applications in many fields can use MapReduce frameworks, which commonly run on large numbers of CPUs, to compute over data in a parallel and distributed manner and thus improve processing efficiency. Meanwhile, advances in graphics processing unit (GPU) hardware have given GPUs a large number of computing cores and strong computing power, making them capable of handling heavier workloads. Many MapReduce frameworks have therefore been designed and implemented on GPUs following the GPGPU concept, further improving computing performance.
    However, the MapReduce frameworks currently available on GPUs mainly target a single application. They cannot process multiple applications at the same time, so requests from multiple applications can only be served sequentially, and they lack efficient data partitioning and resource scheduling mechanisms. As a result, the hardware cannot be fully exploited when multiple applications must be processed.
    This study designs a multi-application parallel processing mechanism based on Mars, an existing MapReduce framework for GPUs. According to the processing requirements, hardware resource demands, and data volumes of all current applications, the large amount of data produced by multiple applications is partitioned, and workload fragments suited to the hardware's capacity are dispatched for processing. The overlap of related hardware operations is also taken into account, so as to achieve a higher degree of overlapped hardware operation and better processing performance. Applications commonly used with MapReduce frameworks serve as the experimental workloads, and execution time is used as the indicator of performance improvement. Under this parallel processing mechanism, the average speedup over all multi-application workloads is about 1.3x.
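    A rough host-side sketch of the quota idea described above is given below: a per-batch budget of basic work units is split among the pending applications in proportion to how much data each still has to process. The AppState structure, the field names, and the batch_budget parameter are illustrative assumptions for this sketch, not the thesis's actual decision model (Section 4.3); in the thesis the budget is derived from the hardware's capacity, while here it is simply taken as a parameter.

#include <cstddef>
#include <vector>

// Per-application bookkeeping used by this sketch (hypothetical names).
struct AppState {
    std::size_t remaining_units;  // basic work units this application still has to process
    std::size_t quota;            // basic work units granted to it in the next batch
};

// Split a per-batch budget of basic work units among the pending applications,
// proportionally to how much data each of them still has left.
void assign_quotas(std::vector<AppState> &apps, std::size_t batch_budget) {
    std::size_t total_remaining = 0;
    for (const AppState &a : apps) total_remaining += a.remaining_units;
    if (total_remaining == 0) return;
    for (AppState &a : apps) {
        std::size_t share = batch_budget * a.remaining_units / total_remaining;
        // Never grant more units than the application actually has left.
        a.quota = (share < a.remaining_units) ? share : a.remaining_units;
    }
}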

    With the improvement of electronic and computer technology, the amount of data that each enterprise must process keeps growing. Handling such volumes of data is no longer a major challenge with the help of the MapReduce framework. Applications from many fields can take advantage of MapReduce frameworks running on large numbers of CPUs for efficient distributed and parallel computing. At the same time, graphics processing unit (GPU) technology has also been improving: the many-core GPU offers strong computing power capable of handling heavier workloads and larger data volumes. Many MapReduce frameworks have therefore been designed and implemented on GPU hardware following the general-purpose GPU (GPGPU) concept to achieve better performance.
    However, most GPU MapReduce frameworks so far focus on single-application processing. In other words, they provide no methodology or mechanism for multi-application execution, so multiple applications can only be processed in sequential order. The GPU hardware resources may therefore not be fully utilized or well distributed, which degrades computing performance.
    This study designs and implements a multi-application execution mechanism based on the state-of-the-art GPU MapReduce framework, Mars. It not only provides a problem-partitioning utility that considers the data size and hardware resource requirements of each application, but also feeds an appropriate amount of workload into the GPU with overlapped GPU operations for efficient parallel execution. Finally, several common applications are used to verify the applicability of the mechanism, with execution time as the main evaluation metric. With the proposed method, an overall speedup of about 1.3x is achieved for random combinations of applications.
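    The "overlapped GPU operations" mentioned above are typically realized in CUDA with streams and asynchronous copies, as the minimal sketch below illustrates: chunks are issued on alternating streams so that the host-to-device copy for one chunk can overlap with kernel execution for another. The map_kernel placeholder, the two-stream layout, the chunk sizes, and the int element type are assumptions made for this sketch and are not taken from the Mars-based implementation in the thesis.

#include <cuda_runtime.h>

// Placeholder map stage; the real Mars map/group/reduce kernels are not shown here.
__global__ void map_kernel(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

// h_in and h_out should be pinned host memory (cudaMallocHost) so that
// cudaMemcpyAsync can actually overlap with kernel execution.
void process_chunks(const int *h_in, int *h_out, int total, int chunk) {
    cudaStream_t streams[2];
    int *d_in[2], *d_out[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_in[s], chunk * sizeof(int));
        cudaMalloc(&d_out[s], chunk * sizeof(int));
    }
    for (int off = 0, s = 0; off < total; off += chunk, s ^= 1) {
        int n = (total - off < chunk) ? total - off : chunk;
        // Copy-in, kernel, and copy-out for this chunk are issued on one stream;
        // the next chunk uses the other stream, so the copy engine and the
        // compute engine can work on different chunks at the same time.
        cudaMemcpyAsync(d_in[s], h_in + off, n * sizeof(int),
                        cudaMemcpyHostToDevice, streams[s]);
        map_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], n);
        cudaMemcpyAsync(h_out + off, d_out[s], n * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        cudaStreamDestroy(streams[s]);
    }
}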

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1 Motivation
      1.2 Objectives
      1.3 Thesis Organization
    Chapter 2  Related Standards and Research
      2.1 General-Purpose GPU
        2.1.1 Comparison of GPU and CPU
        2.1.2 GPU Computing Architecture
      2.2 MapReduce
        2.2.1 How MapReduce Works
        2.2.2 The MapReduce Programming Model
      2.3 Compute Unified Device Architecture (CUDA)
        2.3.1 Programming Model
        2.3.2 Memory Hierarchy
        2.3.3 Coalesced Memory Access
        2.3.4 CUDA Streams
      2.4 Literature Review
    Chapter 3  Hardware and Software Platform
      3.1 Tesla K20
      3.2 Mars
        3.2.1 Mars System Architecture
        3.2.2 Mars Execution Flow
    Chapter 4  Design and Implementation of the Multi-Application Parallel Processing Mechanism
      4.1 System Architecture
      4.2 Partitioning into Basic Work Units
        4.2.1 Basic Work Unit Partitioning for Map/Reduce
        4.2.2 Basic Work Unit Partitioning for Group
      4.3 Workload Input Decision Model
        4.3.1 Estimating the Total Basic Work Unit Input per Batch
        4.3.2 Allocating Basic Work Unit Quotas to Applications
      4.4 Hardware Operation Scheduling Algorithm
      4.5 Middleware Manager Workflow
    Chapter 5  Implementation and Result Analysis
      5.1 Test Platform and Environment Setup
      5.2 Workloads
      5.3 Test Results and Analysis
        5.3.1 Verification with a Single Application
        5.3.2 Verification with Two Applications
        5.3.3 Verification with Three Applications
    Chapter 6  Conclusion and Future Work
    References


    Full-text availability: on campus, available from 2018-08-12; off campus, not available.
    The electronic thesis has not yet been authorized for public release; please consult the library catalog for the print copy.