
Graduate student: Yen, Chien-Hsuan (嚴健瑄)
Thesis title: A Memory-Efficient NoC System for Manycore Platform (應用於多核心平台之高效率記憶體網路晶片系統)
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2014
Academic year of graduation: 102 (ROC academic year, 2013-2014)
Language: English
Number of pages: 75
Keywords: DRAM access scheduling, Mesh Network-on-Chip, Multi-layer interconnection, Many-core system, OpenCL framework
    Interconnection and memory performance play an important role in contemporary parallel computing systems. In this thesis, we evaluate a full-system ARM-based many-core platform under the OpenCL framework by offloading kernel programs to the many-core processors. For memory-intensive OpenCL applications, memory access time dominates the total execution time of the many-core processors. Although the physical bandwidth provided by the interconnection and the memory controllers far exceeds the average bandwidth requirement of the applications, the memory contention overhead increases dramatically as the system scales, resulting in poor scalability of the many-core platform.
    Therefore, we first develop a configurable mesh NoC system that offers higher interconnection bandwidth than conventional bus-based on-chip interconnections. However, we find that the native NoC yields only a limited improvement over the interconnection matrix when executing memory-intensive OpenCL applications. We then integrate a DRAM access scheduling approach into the native NoC system, improving overall memory performance by up to 20%. More importantly, thanks to the packet-switching nature of the NoC, the performance improvement from memory access scheduling grows as the system scales, which matches the scalability requirement of many-core systems. When executing memory-intensive OpenCL applications, the proposed memory-efficient NoC system effectively improves both the scalability and the memory performance of the many-core platform.
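    The benefit of DRAM access scheduling under contention can be illustrated with a toy model. The sketch below is not the thesis's scheduler: it assumes a single queue in which all requests are already pending, a simple open-row policy, and hypothetical costs (ROW_HIT_COST, ROW_MISS_COST) chosen only for illustration. It compares first-come-first-served service with a generic "row-hit-first" policy when two cores interleave accesses to different rows of the same bank.

```python
from collections import namedtuple

# Hypothetical request record: issuing core, target DRAM bank, target row.
Request = namedtuple("Request", ["core", "bank", "row"])

# Hypothetical costs in cycles; real DDR timings differ.
ROW_HIT_COST = 4    # access to the row already open in the bank
ROW_MISS_COST = 12  # precharge + activate + access


def service_cost(open_rows, req):
    """Return the cost of serving req and record its row as the open row."""
    hit = open_rows.get(req.bank) == req.row
    open_rows[req.bank] = req.row
    return ROW_HIT_COST if hit else ROW_MISS_COST


def fcfs(requests):
    """Serve requests strictly in arrival order."""
    open_rows = {}
    return sum(service_cost(open_rows, r) for r in requests)


def hit_first(requests):
    """Serve the oldest pending row hit; fall back to the oldest request."""
    pending = list(requests)
    open_rows = {}
    total = 0
    while pending:
        idx = next(
            (i for i, r in enumerate(pending)
             if open_rows.get(r.bank) == r.row),
            0,  # no row hit pending: take the oldest request
        )
        total += service_cost(open_rows, pending.pop(idx))
    return total


# Two cores alternate between row 0 and row 1 of bank 0.
demo_requests = [Request(core=i % 2, bank=0, row=i % 2) for i in range(8)]

print("FCFS cycles:     ", fcfs(demo_requests))
print("Hit-first cycles:", hit_first(demo_requests))
```

    In this toy setting, arrival-order service turns every access into a row miss, while reordering groups accesses to the same row. A pure hit-first policy can starve requests to other rows, so a practical scheduler would also need some form of fairness or aging control.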

    Chapter 1 - Introduction
        1.1 Motivation
        1.2 Contribution
        1.3 Organization
    Chapter 2 - Background
        2.1 Network-on-Chip
            2.1.1 Interconnection for Multiprocessor
            2.1.2 Topology
            2.1.3 Buffer Flow Control
            2.1.4 Routing Protocol
        2.1 DRAM Structure
        2.2 OpenCL Framework
            2.2.1 OpenCL Execution Model
            2.2.2 OpenCL Memory Hierarchy
            2.2.3 OpenCL Runtime System Architecture
    Chapter 3 - Related work
        3.1 Mesh NoC System
            3.1.1 Priority-based Routing Protocol
            3.1.1 Memory controller placement
            3.1.2 Novel Research related to NoC
        3.2 Memory Access Scheduling Scheme
    Chapter 4 - Network-on-Chip Architecture
        4.1 Overview of the NoC System
        4.2 Routing Approach
            4.2.1 Packet Format
            4.2.2 Wormhole Flow Control
            4.2.3 XY Laddering Routing
        4.3 Detailed Descriptions of the Components in NoC System
            4.3.1 Router Architecture
            4.3.2 Hybrid Network Interface
            4.3.3 Memory Controller
    Chapter 5 - Many-Core Platform
        5.1 Overview of the Many-Core Platform
        5.2 Full System Simulation Platform for OpenCL
            5.2.1 Work-item coalescing
            5.2.2 OpenCL Memory Management
        5.3 DRAM Access Scheduling Approach
    Chapter 6 - Experiment
        6.1 Experiment Environment
            6.1.1 System Configuration
            6.1.2 NoC Placement
        6.2 Evaluation Metrics
            6.2.1 Benchmark
            6.2.2 Memory Access Flow
        6.3 Performance Evaluation
            6.3.1 Performance Bottleneck
            6.3.2 Memory Access Scheduler
            6.3.3 Scalability
        6.4 Optimization
            6.4.1 OpenCL Runtime
            6.4.2 DDR3
            6.4.3 Priority-Aging Hit-First Scheduling
    Chapter 7 - Conclusion
    References


    On campus: publicly available from 2016-08-24
    Off campus: publicly available from 2016-08-24