
Author: Huang, Jih-Woei
Title: Processor Mapping Based Block-Cyclic Data Redistribution
Advisor: Chu, Chih-Ping
Degree: Doctor of Philosophy
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2009
Graduation Academic Year: 97
Language of Thesis: English
Number of Pages: 109
Keywords: High Performance Computing, Distributed-Memory Multi-computers, Data Redistribution, Message Communication Scheduling, Processor Mapping

    High performance computing (HPC) systems provide practical solutions to large-scale scientific problems, and distributed-memory multi-computers are one of the main architectures for building such systems. Many HPC applications consist of several computation units, and the data distribution used in one computation unit may differ from that used in subsequent units. To execute data-parallel programs efficiently on distributed-memory machines, arrays may therefore need to be redistributed, so that processors avoid large amounts of inter-processor communication during computation.
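To make the Block-Cyclic setting concrete, the following sketch (illustrative only; the `owner` helper and the parameters are assumptions of this note, not code from the dissertation) shows how a Block-Cyclic(b) distribution assigns elements to processors, and why changing the block size forces data to move:

```python
# Owner of global element i under a Block-Cyclic(b) distribution over P
# processors: blocks of b consecutive elements are dealt out round-robin.
def owner(i: int, b: int, P: int) -> int:
    return (i // b) % P

# Redistributing a 32-element array from Block-Cyclic(2) to Block-Cyclic(4)
# over 4 processors: every element whose owner changes must be communicated.
N, P = 32, 4
moved = [i for i in range(N) if owner(i, 2, P) != owner(i, 4, P)]
print(f"{len(moved)} of {N} elements change owner")  # prints: 24 of 32 elements change owner
```

Only the elements whose owner stays the same avoid communication; everything else must be packed, sent, and unpacked, and organizing that traffic is exactly what redistribution scheduling is about.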
    When array redistribution is performed asynchronously, node contention may arise and incur a higher data communication cost. When it is performed synchronously, an optimal communication schedule is generally required: one that avoids node contention and also makes all messages within a communication step the same length. For Block-Cyclic to Block-Cyclic data redistribution over the same processor set, some of the previously proposed communication scheduling methods are not optimal, while others incur higher data communication costs and higher scheduling overheads in certain cases.
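The contention condition can be stated concretely: a synchronous schedule is contention-free when no node sends or receives more than one message within a step. A small illustrative checker (the `contention_free` helper and the example schedules are hypothetical, not from the dissertation):

```python
# A schedule is a list of steps; each step is a list of (source, destination)
# message pairs. Contention-free: within any step, no source appears twice
# and no destination appears twice.
def contention_free(schedule) -> bool:
    for step in schedule:
        srcs = [s for s, _ in step]
        dsts = [d for _, d in step]
        if len(set(srcs)) < len(srcs) or len(set(dsts)) < len(dsts):
            return False
    return True

ok  = [[(0, 1), (1, 2), (2, 0)], [(0, 2), (1, 0), (2, 1)]]
bad = [[(0, 1), (2, 1)]]   # nodes 0 and 2 both send to node 1 in one step
```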
    In this dissertation, we propose a processor-mapping-based communication scheduling method. It views the data elements on each source node as a ring and, based on the processor mapping, tells each source node the location from which to start sending its messages. Our method is an optimal communication scheduling method: it improves the data communication cost while incurring a smaller scheduling overhead.
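As a sketch of the ring idea (a generic rotation schedule in this spirit, under assumed names; it is not the dissertation's algorithm): if each of the P sources starts from a different position on the destination ring and all advance in lockstep, every step is a permutation of the nodes and hence contention-free.

```python
P = 4
# Step k: source p sends the segment destined for node (p + k) % P, i.e.
# each source starts at a distinct ring position and rotates one slot per step.
schedule = [[(p, (p + k) % P) for p in range(P)] for k in range(P)]

# Sanity check: every step is a permutation, so no destination is hit twice.
for step in schedule:
    assert len({d for _, d in step}) == P
```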
    On the other hand, previously proposed processor mapping techniques aim to reduce as far as possible the amount of data to be redistributed, and thereby the data communication cost of redistribution. To this end, they generally keep the first data block of each source processor on-processor during redistribution. In this dissertation, we propose a flexible processor mapping technique that lets the programmer specify which data blocks on a source processor are retained on-processor (localized) during redistribution, so as to meet practical computation needs. If the computation on localized data can overlap the communication of un-localized data, this technique helps improve overall system performance.
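The role of the processor mapping can be sketched as follows (illustrative only; `sigma`, `localized`, and the parameters are assumed names, not the dissertation's notation): relabeling the destination processors through a mapping changes which elements already sit on their final processor, i.e. which blocks are localized.

```python
def owner(i: int, b: int, P: int) -> int:
    # Block-Cyclic(b) owner of global element i over P processors.
    return (i // b) % P

def localized(sigma, N, b_src, b_dst, P):
    # Elements needing no communication: after relabeling the destination
    # processor through sigma, the element stays on the processor it is on.
    return [i for i in range(N)
            if sigma[owner(i, b_dst, P)] == owner(i, b_src, P)]

N, P = 32, 4
identity = [0, 1, 2, 3]          # keeps the leading blocks local
shifted  = [1, 2, 3, 0]          # a different choice localizes other blocks
a = localized(identity, N, 2, 4, P)
b = localized(shifted,  N, 2, 4, P)
```

With these parameters both mappings keep eight elements local, but different ones; choosing the mapping is what lets a programmer decide which blocks stay on-processor.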
    In addition, we extend the optimal communication scheduling method we proposed for ordinary data redistribution, so that data redistribution with data localization, performed in synchronous communication mode, incurs the least data communication cost.

    Chapter 1 Introduction 1
        1.1 Background 1
        1.2 Related Works 3
        1.3 Our Works 6
    Chapter 2 Synchronous versus Asynchronous Data Communications 9
        2.1 Communication Patterns of Data Redistribution 11
        2.2 Asynchronous Data Communications 14
        2.3 Synchronous Data Communications 17
            2.3.1 Walker’s Method 17
            2.3.2 The Caterpillar Method 20
            2.3.3 Guo’s Method 21
        2.4 Experimental Results 23
    Chapter 3 Processor Mapping Based Optimal Communication Scheduling 31
        3.1 The Processor Mapping Technique 32
        3.2 Our Communication Scheduling Method 36
        3.3 The Communication Scheduling Algorithms 40
            3.3.1 Send Scheduling Algorithm 40
            3.3.2 Receive Scheduling Algorithm 42
        3.4 Experimental Results 44
    Chapter 4 A Data Localization Technique in Data Redistribution 50
        4.1 Motivation Examples 50
        4.2 The Flexible Processor Mapping Technique 54
        4.3 The Efficient Asynchronous Redistribution Method 62
            4.3.1 Send Phase 63
            4.3.2 Receive Phase 66
        4.4 Experimental Results 66
    Chapter 5 Scheduling Communications for Data Localized Array Redistribution 72
        5.1 The Communication Scheduling Method 72
        5.2 The Communication Scheduling Algorithms 75
            5.2.1 Send Scheduling Algorithm 75
            5.2.2 Receive Scheduling Algorithm 77
        5.3 Node Contention using Other Scheduling Method 79
        5.4 Experimental Results 81
    Chapter 6 Conclusions and Future Works 84
    Appendix A 92
    Appendix B 101

    1. J. Li and M. Chen. Compiling Communication-Efficient Programs for Massively Parallel Machines. IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 3, pp. 361-376, October 1991.
    2. J. Ramanujam and P. Sadayappan. Compile-Time Techniques for Data Distribution in Distributed Memory Machines. IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, pp. 472-482, October 1991.
    3. S. Chatterjee, J. R. Gilbert, and S. Robert. The Alignment-Distribution Graph. In Proceedings of the Sixth Workshop on Languages and Compilers for Parallel Computing, pp. 234-252, 1994.
    4. S. Hiranandani, K. Kennedy, J. Mellor-Crummey and A. Sethi. Compilation Techniques for Block-Cyclic Distributions. ACM International Conference on Supercomputing, pp. 392-403, 1994.
    5. D. Bau, I. Kodukula, V. Kotlyar, K. Pingali, and P. Stodghill. Solving alignment using elementary linear algebra. In Proceedings of the 7th Workshop on Languages and Compilers for Parallel Computing, pp. 46-60, 1994.
    6. R. Bixby, K. Kennedy, and U. Kremer. Automatic Data Layout Using 0-1 Integer Programming. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 111-122, 1994.
    7. S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating Local Address and Communication Sets for Data Parallel Programs. Journal of Parallel and Distributed Computing, Vol. 26, pp. 72-84, 1995.
    8. S. K. S. Gupta, S. D. Kaushik, C.-H. Huang, and P. Sadayappan. Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines. Journal of Parallel and Distributed Computing, Vol. 32, pp. 155-172, 1996.
    9. E. Ayguade, J. Garcia, M. L. Grande, and J. Labarta. Data Distribution and Loop Parallelization for Shared-Memory Multiprocessors. In Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing, pp. 41-55, 1996.
    10. M. Dion and Y. Robert. Mapping Affine Loop Nests: New Results. Parallel Computing, Vol. 22, No. 10, pp. 1373-1397, December 1996.
    11. A. W. Lam and M. S. Lam. Maximizing Parallelism and Minimizing Synchronization with Affine Partitions. Parallel Computing, Vol. 24, No. 3-4, pp. 445-475, May 1998.
    12. M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts. IEEE Transactions on Parallel and Distributed Systems, Vol. 10, No. 2, pp. 115-135, February 1999.
    13. J. Garcia, E. Ayguade, and J. Labarta. A Framework for Integrating Data Alignment, Distribution, and Redistribution in Distributed Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 4, pp. 416-431, April 2001.
    14. P. Lee and Z. M. Kedem. Automatic Data and Computation Decomposition On Distributed Memory Parallel Computers. ACM Transactions on Programming Languages and Systems, Vol. 24, No. 1, pp. 1-50, January 2002.
    15. J. Knoop and E. Mehofer. Distribution Assignment Placement: Effective Optimization of Redistribution Costs. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 6, pp. 628-647, June 2002.
    16. A. Navarro, E. Zapata, and D. Padua. Compiler Techniques for the Distribution of Data and Computation. IEEE Transactions on Parallel and Distributed Systems, Vol. 14, No. 6, pp. 545-562, June 2003.
    17. W.-L. Chang, J.-W. Huang, and C.-P. Chu. Using Elementary Linear Algebra to Solve Data Alignment for Arrays with Linear or Quadratic References. IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 1, pp. 28-39, January 2004.
    18. M. Satoh, K. Negishi, and A. Kobayashi. Analysis of Two-level Data Mapping in an HPF Compiler for Distributed-memory Machines. Parallel Computing, Vol. 32, No. 4, pp. 280-300, April 2006.
    19. High Performance Fortran Language Specification, http://hpff.rice.edu/versions/hpf2/hpf-v20/hpf-report.html.
    20. M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishing Company, 1996.
    21. C.-H. Hsu and Y.-H. Chung. Efficient Methods for kr → r and r → kr Array Redistribution. The Journal of Supercomputing, Vol. 12, pp. 253-276, 1998.
    22. N. Park, V. K. Prasanna, and C. S. Raghavendra. Efficient Algorithms for Block-Cyclic Array Redistribution Between Processor Sets. IEEE Transactions on Parallel and Distributed Systems, Vol. 10, No. 12, pp. 1217-1240, December 1999.
    23. R. Thakur, A. Choudhary, and G. Fox. Runtime Array Redistribution in HPF Programs. In Proceedings of Scalable High Performance Computing Conference, pp. 309-316, May 1994.
    24. S. Ramaswamy and P. Banerjee. Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers. In Frontiers '95: The Fifth Symposium on the Frontiers of Massively Parallel Computation, pp. 342-349, February 1995.
    25. S. Ramaswamy, B. Simons, and P. Banerjee. Optimization for Efficient Array Redistribution on Distributed Memory Multicomputers. Journal of Parallel and Distributed Computing, Vol. 38, pp. 217-228, 1996.
    26. R. Thakur, A. Choudhary, and J. Ramanujam. Efficient Algorithms for Array Redistribution. IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 6, pp. 587-594, June 1996.
    27. L. Prylli and B. Tourancheau. Fast Runtime Block Cyclic Data Redistribution on Multiprocessors. Journal of Parallel and Distributed Computing, Vol. 45, pp. 63-72, 1997.
    28. Y.-C. Chung, C.-H. Hsu, and S.-W. Bai. A Basic-Cycle Calculation Technique for Efficient Dynamic Data Redistribution. IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 4, pp. 359-377, April 1998.
    29. C.-H. Hsu, S.-W. Bai, Y.-C. Chung, and C.-S. Yang. A Generalized Basic-Cycle Calculation Method for Efficient Array Redistribution. IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 12, pp. 1201-1216, December 2000.
    30. E. T. Kalns and L. M. Ni. DaReL: A portable data redistribution library for distributed-memory machines. In Proceedings of Scalable Parallel Libraries Conference II, October 1994.
    31. S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan. An Approach to Communication-Efficient Data Redistribution. In Proceedings of International Conference on Supercomputing, pp. 364-373, July 1994.
    32. S. D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan. Multi-Phase Array Redistribution: Modeling and Evaluation. In Proceedings of International Parallel Processing Symposium, pp. 441-445, April 1995.
    33. E. T. Kalns and L. M. Ni. Processor Mapping Techniques Toward Efficient Data Redistribution. IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 12, pp. 1234-1247, December 1995.
    34. A. Wakatani and M. Wolfe. Optimization of Array Redistribution for Distributed Memory Multicomputers. Parallel Computing, Vol. 21, No. 9, pp. 1485-1490, 1995.
    35. C.-H. Hsu, Y.-C. Chung, D.-L. Yang, and C.-R. Dow. A Generalized Processor Mapping Technique for Array Redistribution. IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 7, pp. 743-757, July 2001.
    36. D. W. Walker and S. W. Otto. Redistribution of Block-Cyclic Data Distributions Using MPI. Concurrency: Practice and Experience, Vol. 8, No. 9, pp. 707-728, 1996.
    37. Y. W. Lim, P. B. Bhat, and V. K. Prasanna. Efficient Algorithms for Block-Cyclic Redistribution of Arrays. In Proceedings of IEEE Symposium on Parallel and Distributed Processing, pp. 74-83, 1996.
    38. F. Desprez, J. Dongarra, C. Randriamaro, and Y. Robert. Scheduling Block-Cyclic Array Redistribution. IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 2, pp. 192-205, February 1998.
    39. M. Guo, I. Nakata, and Y. Yamashita. Contention-free Communication Scheduling for Array Redistribution. Parallel Computing, Vol. 26, No. 10, pp. 1325-1343, September 2000.
    40. M. Guo and I. Nakata. A Framework for Efficient Data Redistribution on Distributed Memory Multicomputers. The Journal of Supercomputing, Vol. 20, pp. 243-265, 2001.
    41. M. Guo and Y. Pan. Improving Communication Scheduling for Array Redistribution. Journal of Parallel and Distributed Computing, Vol. 65, pp. 553-563, 2005.
    42. S. Lee, H. Yook, M. Koo, and M. Park. Processor Reordering Algorithms toward Efficient GEN_BLOCK Redistribution. In Proceedings of ACM Applied Computing Symposium, pp. 539-543, 2001.
    43. C.-H. Hsu and K.-M. Yu. A Compressed Diagonals Remapping Technique for Dynamic Data Redistribution on Banded Sparse Matrix. The Journal of Supercomputing, Vol. 29, pp. 125-143, 2004.
    44. C.-H. Hsu, M.-H. Chen, C.-T. Yang, and K.-C. Li. Optimizing Communications of Dynamic Data Redistribution on Symmetrical Matrices in Parallelizing Compilers. IEEE Transactions on Parallel and Distributed Systems, Vol. 17, No. 11, pp. 1226-1241, November 2006.
    45. H. Wang, M. Guo, and D. Wei. A Divide-and-Conquer Algorithm for Irregular Redistribution in Parallelizing Compilers. The Journal of Supercomputing, Vol. 29, pp. 157-170, 2004.
    46. The Message Passing Interface (MPI) standard, http://www-unix.mcs.anl.gov/mpi.

    Full text availability: on campus, immediately; off campus, from 2009-07-23.