| Author: | Chen, Kuan-Chung (陳冠仲) |
|---|---|
| Title: | A Study of SIMT/MIMD Dual-Mode Multi-Core Processor System Architecture |
| Advisor: | Chen, Chung-Ho (陳中和) |
| Degree: | Doctorate |
| Department: | Institute of Computer & Communication Engineering, College of Electrical Engineering & Computer Science |
| Publication Year: | 2018 |
| Academic Year of Graduation: | 106 (ROC calendar) |
| Language: | English |
| Number of Pages: | 98 |
| Keywords: | Control divergence, data-level parallelism, multithreading, Open Computing Language (OpenCL), MIMD processors, SIMT processors, spatiotemporal SIMT |
SIMT machines are a mainstream computer architecture in high-performance computing, chiefly because the SIMT execution model can exploit data-level parallelism effectively for throughput-oriented computation. This dissertation examines the feasibility and benefits of applying the SIMT execution model to conventional homogeneous multi-core CPUs, which until now have generally supported only the MIMD execution model. To enable the SIMT model on a multi-core CPU, we address three architectural issues: the multithreading execution model, the placement of kernel thread contexts, and the performance loss caused by thread divergence.
We integrate the SIMT execution model into an ARM multi-core processor architecture. To this end, we propose a fine-grained multithreading execution model applicable to conventional multi-core CPU systems. To meet the fine-grained requirement of switching threads every execution cycle, each processor core's L1 cache is used to store kernel thread contexts during SIMT execution. For divergence-intensive kernels, we propose a mechanism called Inner Conditional Statement First (ICS-First) that achieves early re-convergence of divergent threads and thereby improves performance. Compared with the conventional MIMD model, executing OpenCL kernels in SIMT mode on single-issue in-order processor cores reduces the dynamic instruction count by 36% on average and achieves speedups of 1.52x on average and up to 5x. When executing vectorized OpenCL kernels, the SIMT model additionally benefits from the SIMD extension instruction set and runs 1.71x faster than MIMD execution. The SIMT model also applies to superscalar in-order processor cores, where it outperforms a superscalar out-of-order MIMD configuration by 40 percent. The experimental results show that the proposed dual-mode architecture plays a crucial role in improving a multi-core processor system's exploitation of data-level parallelism.
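The fine-grained multithreading described above issues an instruction from a different kernel thread every cycle. The following is a minimal Python sketch of such a round-robin, per-cycle scheduler; the names (`ThreadContext`, `fine_grained_schedule`) are illustrative and not taken from the dissertation, and the real design keeps these contexts in each core's L1 data cache.

```python
from collections import deque

class ThreadContext:
    """Architectural state of one kernel thread (placed in the L1 D-cache
    in the proposed design; modeled here as a plain object)."""
    def __init__(self, tid, num_insts):
        self.tid = tid
        self.pc = 0                  # program counter
        self.num_insts = num_insts   # instructions left to "execute"

    def done(self):
        return self.pc >= self.num_insts

def fine_grained_schedule(contexts):
    """Issue one instruction from a different ready thread each cycle."""
    ready = deque(contexts)
    trace = []                       # (cycle, tid) pairs, for inspection
    cycle = 0
    while ready:
        ctx = ready.popleft()        # round-robin: next ready thread
        trace.append((cycle, ctx.tid))
        ctx.pc += 1                  # "execute" one instruction
        if not ctx.done():
            ready.append(ctx)        # rotate back into the ready queue
        cycle += 1
    return trace

trace = fine_grained_schedule([ThreadContext(t, 3) for t in range(4)])
# Threads interleave every cycle: tids 0,1,2,3, 0,1,2,3, 0,1,2,3
```

Switching threads every cycle is what makes context placement critical: the full register state of many threads must be reachable at pipeline speed, which motivates storing it in the L1 data cache rather than in a small dedicated register file.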
The SIMT machine has emerged as a primary computing paradigm in high-performance computing because the SIMT execution model exploits data-level parallelism effectively. This dissertation explores the potential of SIMT execution on homogeneous multi-core processors, which generally run in MIMD mode when utilizing multi-core resources. We address three architectural issues in enabling the SIMT execution model on a multi-core processor: the multithreading execution model, kernel thread context placement, and thread divergence.
For the SIMT execution model, we propose a fine-grained multithreading mechanism on an ARM-based multi-core system. Each processor core stores its kernel thread contexts in the L1 data cache to meet the per-cycle thread-switching requirement. For divergence-intensive kernels, an Inner Conditional Statement First (ICS-First) mechanism enables early re-convergence and significantly improves performance. The experimental results show that effective data-parallel processing reduces the dynamic instruction count by 36% on average and enables SIMT execution to achieve speedups of 1.52x on average, and up to 5x, over the MIMD counterpart for OpenCL benchmarks on single-issue in-order processor cores. With explicit vectorization of the kernels, the SIMT model gains further benefit from the SIMD extension and achieves a 1.71x speedup over the MIMD approach. The SIMT model using in-order superscalar processor cores outperforms the MIMD model using out-of-order superscalar cores by 40 percent. This study shows that enabling the SIMT model on homogeneous multi-core processors is important for exploiting data-level parallelism.
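To see why branch divergence is costly under SIMT and why early re-convergence (the goal of ICS-First) helps, consider a simplified lockstep model, not the dissertation's actual microarchitecture: when threads in a warp take different sides of a branch, each side executes as a separate serialized pass under an active mask, and all threads re-converge only afterwards. The Python sketch below (function names are illustrative) counts those serialized passes.

```python
def simt_branch(warp_inputs, cond, then_path, else_path):
    """Execute one divergent branch in lockstep: each taken path runs as
    one serialized pass, masked to the threads that took it; all threads
    re-converge after the branch."""
    taken = [cond(x) for x in warp_inputs]   # per-thread branch outcome
    results = list(warp_inputs)
    passes = 0
    if any(taken):                           # then-path pass, if any thread took it
        for i, on in enumerate(taken):
            if on:
                results[i] = then_path(results[i])
        passes += 1
    if not all(taken):                       # else-path pass, if any thread fell through
        for i, on in enumerate(taken):
            if not on:
                results[i] = else_path(results[i])
        passes += 1
    return results, passes

# Fully divergent warp: both paths execute, two serialized passes.
res, p = simt_branch([1, 2, 3, 4], lambda x: x % 2 == 0,
                     lambda x: x * 10, lambda x: -x)
# res == [-1, 20, -3, 40], p == 2

# Uniform warp: only one path executes, a single pass.
res_u, p_u = simt_branch([2, 4], lambda x: x % 2 == 0,
                         lambda x: x * 10, lambda x: -x)
# res_u == [20, 40], p_u == 1
```

In nested control flow the cost compounds, because every instruction after an inner divergent branch may run once per still-divergent path; resolving the inner conditional first shrinks the region executed under partial masks.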