簡易檢索 / 詳目顯示

研究生: 莊宜璋
Zhuang, Yi-Chang
論文名稱: 分散式共用記憶體系統採群組工作分配方式之系統重組機制
Adapting Node Configuration on Distributed Shared Memory Systems with Group-Based Workload Redistribution Scheme
指導教授: 梁廷宇
Liang, Tyng-Yeu
謝錫堃
Shieh, Ce-Kuen
學位類別: 博士
Doctor
系所名稱: 電機資訊學院 - 電機工程學系
Department of Electrical Engineering
論文出版年: 2004
畢業學年度: 92
語文別: 英文
論文頁數: 113
中文關鍵詞: 負載平衡效能預測分散式共用記憶體
外文關鍵詞: performance prediction, node reconfiguration, DSM, load balance
相關次數: 點閱:129下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  •   多執行緒的分散式共用記憶體系統是一個由網路上的電腦叢集所組成的平行計算環境。過去分散式共用記憶體系統大多是將程式的執行緒平均分配到所有節點上執行。若節點的電腦效能不同,效能最差的節點將成為程式整體效能的瓶頸並決定程式執行完成的時間。過去的動態負載平衡機制僅利用執行緒遷移,在程式執行過程中,根據節點效能重新分配程式的執行緒,達到系統負載的平衡以減少程式的執行時間。我們發現雖然動態負載平衡機制可以平衡系統負載,可是執行緒的遷移可能使得不同節點的執行緒共用的資料因而大量增加,造成程式執行時網路通訊量激增,嚴重影響程式效能。為了解決此問題,我們採用群組式工作分配的方式,利用執行緒共用的資料量當作重新分配的標準,減少動態負載平衡所引起的網路通訊。實驗結果顯示,群組式負載平衡機制比非群組式負載平衡機制平均減少20%以上的資料分享與網路通訊量。另外,在搬移較少的引線數目情況下,群組式負載平衡機制所造成之網路通訊量與最佳引線分配狀態幾近相同。因此,群組式負載平衡機制不僅可以適當地平衡系統負載,更可以有效降低因引線搬移所造成的資料分享與網路通訊。

      在分散式共用記憶系統上,利用較多的節點數目執行程式並無法保證可以獲得較好的系統效能。由於分散式共用記憶體系統必須透過網路將程式更新的資訊傳遞給其他的節點,因此節點數目越多代表所需傳送的網路訊息也越多。如果節點個數超過了一定的數量,花費在網路通訊的時間會超出平行計算所減少的時間。那麼程式執行的時間反而不會因為節點數目的增加而減少。所以找到程式執行最佳的系統規模是很重要的。因此我們提出了一個程式效能分析模型,以程式執行的行為與系統資訊來評估程式的執行效能。系統重組機制根據所評估之結果自動調整程式執行所需之系統規模,並利用群組工作分配方式重新分配程式的執行緒,以期獲得最佳的執行效率。實驗結果顯示,透過群組工作分配方式之系統重組機制均可以正確預測出應用程式之最佳執行效能與重新調整系統架構所需之額外負擔,並可縮短10%以上之程式執行時間。因此,採群組工作分配方式之系統重組機制對於DSM系統應用程式最佳執行效能的獲得是必要的。應用程式的工作分配在較佳的系統架構下執行時,不僅可以讓應用程式得到較短的執行時間。另一方面,平行系統之計算資源亦可有較好的使用效率。

      Multi-threaded distributed shared memory (DSM) system consists of computer cluster on the network and provides a parallel computing environment. In the past, most DSM systems divided the computation by distributing the threads equally to every node. If the CPU performance of each node is different, the node with the lowest processing power will be the bottleneck of the system performance and becomes the factor to dominate the execution time of the program. Past dynamic load balance mechanisms employed thread migration to balance the system workload and to reduce the program execution time. At run time, the mechanism redistributed the working threads according to the performance of each node. However, we find that there is a problem with using thread migration for the balance of system workload. Although thread migration can effectively balance system workload at run time, it will probably result in large amount of inter-node communication if the threads sharing the same data are redistributed to different nodes. Therefore, the amount of network message will increase substantially and has great impact on system performance. To solve this problem, we propose a group-based load balance scheme in this research. The proposed scheme takes the data sharing factor into consideration to reduce the network communication resulted from thread migration.

      In addition, using more nodes to execute program cannot necessarily guarantee to obtain better performance on DSM systems. Since DSM has to propagate the update of program data among the system nodes, the amount of network message will be proportional to the number of nodes in the system. If the number of nodes is greater than a threshold, the time spent in network communication will be more than the execution time reduced by parallel computation. Execution time of the program will not reduce as the number of nodes increases. Consequently, it is critical to find the best system scale for the performance of a DSM program. In this paper, we propose a model to characterize the program execution behavior on DSM. Besides, we use the run time information of the program and the DSM system to predict the system performance. According to the predicted results, the reconfiguration mechanism adapts system configuration dynamically and redistributes the working threads by group-based load balance scheme to obtain the best system performance. We implement and test the proposed mechanism on a DSM platform, Cohesion. The experimental results show that the system configuration with group-based workload redistribution scheme is necessary for the performance improvement of a DSM program. It can not only reduce the program execution time but it also can increase the utilization of the computing resource of the system.

    Chapter 1. Introduction....................................................1 1.1. Motivation ...........................................................3 1.2. Adapting Node Configuration ..........................................6 1.3. Contributions ........................................................7 1.4. Thesis Organization...................................................8 Chapter 2. Related Works ..................................................9 Chapter 3. Group-Based Load Balance .......................................13 3.1. Sender Group and Receiver Group.......................................13 3.2. Evaluation Functions .................................................19 3.2.1. Workload Threshold Function.........................................19 3.2.2. Inter-Node Sharing Function.........................................20 3.3. Update Method ........................................................21 3.4. Simulation............................................................28 3.5. Theoretical Experimentation and Analysis..............................36 3.5.1. Overview of the Evaluation .........................................44 3.5.2. Further Analysis ...................................................45 3.6. Algorithm Overhead with/without Update Method.........................50 3.7. Comparison between MSMR and Optimal Solution..........................52 Chapter 4. Peak Performance Prediction and Node Reconfiguration............54 4.1. Peak Performance Prediction...........................................54 4.1.1. Program Execution Model ............................................54 4.1.2. Peak Performance Prediction Function................................56 4.1.3. Theoretical Simulation..............................................58 4.2. Applying Group-Based Workload Redistribution in Node Reconfiguration..62 4.2.1. Node Reconfiguration................................................62 4.2.2. Node Reconfiguration Criterion Function.............................64 4.2.3. Performance Evaluation..............................................65 4.2.4. Analysis............................................................79 Chapter 5. Implementation..................................................83 5.1. Node Reconfiguration..................................................83 5.1.1. Node ID Translation.................................................85 5.1.2. Affinity Page Movement..............................................87 5.2. Access Pattern and Access Type Tracking...............................89 5.2.1. Access Pattern Tracking.............................................89 5.2.2. Access Type Tracking ...............................................90 Chapter 6. Performance ....................................................91 6.1. Experimental Environments and Applications ...........................91 6.2. Experimental Results..................................................92 6.2.1. Nodes with 8 P-90...................................................92 6.2.2. Nodes with 8 P-90 and 4 PII-233.....................................98 6.2.3. Nodes with 7 P-90, 4 PII-233 and 1 PIII-550 (Root Node: P-90) ......101 6.2.4. Nodes with 8 P-90, 4 PII-233 and 1 PIII-550 (Root Node: PIII-550) ..103 Chapter 7. Conclusions and Future Work.....................................106 Bibliography ..............................................................108

    [1]Agrawal, V. D., and Chakradhar, S. T., Performance Analysis of Synchronized Iterative Algorithms on Multiprocessor Systems, IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 5, pp. 739-746, November 1992.

    [2]Amza, C., Cox, A. L., Dwarkadas, S., Keleher, P., Lu, H., Rajamony, R., Yu, W., and Zwaenepoel W., TreadMarks: Shared Memory Computing on Networks of Workstations, In IEEE Computer, volume 29, No. 2, pp. 18-28, 1996.

    [3]Bennett, J. K., Carter, J. B and Zwaenepoel, W., Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence, In Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), pp. 168-176, March, 1990.

    [4]Bershad, B. N., Zekauskas, M. J., and Sawdon, W. A., Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors, Annual IEEE International Computer Conference, pp. 528-537, 1993.

    [5]Blanco, V., Gonzalez, J.A., Leon, C., Rodriguez, C., Rodriguez, G., and Printista, M., Predicting the Performance of Parallel Programs, Parallel Computing, vol. 30, issue 3, pp. 337-356, March 2004.

    [6]Calzarossa, M., Massari, L., and Tessera, D., A Methodology towards Automatic Performance analysis of Parallel Applications, Parallel Computing, vol. 30, issue 2, pp. 211-223, February 2004.

    [7]Cater J. B, Efficient Distributed Shared Memory Based on Multi-Protocol Release Consistency, Ph. D. thesis, Rice University, 1993.

    [8]Eom, H., and Hollingsworth, J. K., LBF: A Performance Metric for Program Reorganization, International Conference on Distributed Computing Systems, pp. 222-229, 1998.

    [9]Feeley, M. J., Bershad, B. N., Chase, J. S., and Levy, H. M., Dynamic Node Reconfiguration in a Parallel-Distributed Environment, Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming(PPOPP), SIGPLAN NOTICES, volume 26, Number 7, July 1991.

    [10]Geist, G. A., and Sundern, V. S., Network-Based Concurrent Computing on the PVM system, Concurrency: Practice and Experience, pp. 293-311, 1992.

    [11]Gersho, A. and Gray, R. M., Vector Quantization and Signal Compression, Kluwer Academic Publishers, London, 1992.

    [12]Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., and Hennessy, J., Memory consistency and event ordering in scalable shared-memory multiprocessors, in Proc. 17th Annual International Symposium on Computer Architecture, pp. 15-26, 1990.

    [13]Ghosal, D., Serazzi, G., and Tripathi, S. K., The Processor Working Set and its Use in Scheduling Multiprocessor Systems, IEEE Transactions on Software Engineering, vol. 17, no. 5, pp. 443-453, May 1991.

    [14]Gong, L., Sun, X. H., and Watson, E. F., Linguo Gong, Xian-He Sun, Edward F. Watson, IEEE Transactions on Computers, vol. 51, no. 9, pp. 1041-1055, September 2002.

    [15]Hollingsworth, J. K, and Keleher P., Prediction and adaptation in Active Harmony, Cluster Computing, volume 2, No. 3, pp. 195-205, 1999.

    [16]Hutto, P. W., and Ahamad, M., Slow memory: Weakening consistency to enhance concurrency in distributed shared memories, In Proceeding of 10th International Conference on Distributed Computing Systems, pp. 302-311, May 1990.

    [17]Iftode, L., Singh, J. P., Li, K., Scope Consistency : A Bridge between Release Consistency and Entry Consistency, Theory of Computing Systems, 31, pp. 451-473, 1998.

    [18]Itzkovitz, A., Schuster, A., and Shalev, L., Thread migration and its applications in distributed shared memory systems, The Journal of Systems and Software, volume 42, No. 1, pp.71-87, 1998.

    [19]Johnson, K. L., Kashoek, M. F. and Wallach, D. A., CRL: High-Performance All-Software Distributed Shared Memory, ACM Operating Systems Review, volume 29, No. 5, pp. 213-226, 1995.

    [20]Keleher, P., Cox, A. L. and Zwaenepoel W., Lazy Release Consistency for Software Distributed Shared Memory, Proceeding of the 19th Annual International Symposium on Computer Architecture (ISCA92), pp. 13-21, 1992.

    [21]Lamport, L., How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs, IEEE Transactions on Computer, volume C-28, pp. 690-691, September 1979.

    [22]Li, K., Shared Virtual Memory on Loosely Coupled Multi-processors, Ph.D. Thesis, Yale University, 1986.

    [23]Li, K., IVY: A Shared Virtual Memory System for Parallel Computing, In Proceedings of 1988 IEEE International Conference on Parallel Processing, pp. 94-101, August 1988.

    [24]Liang, T. Y., Ueng, J. C., Shieh, C. K., Zhuang, D. Y., and Lee, J. Q., Distinguishing Sharing Types to Minimize Communication in Software Distributed Shared Memory Systems, Journal of Systems and Software, volume 5, pp.73-85, 2000.

    [25]Linde, Y. , Buzo, A. ,and Gray, R. M., An algorithm for vector quantizer design, IEEE Transaction on Communication, volume COM-28, No. 1, pp. 85-95, 1980.

    [26]Message Passing Interface Forum, MPI: A Message-Passing Interface Standard version 1.0, 1994.

    [27]Musciano, A. J., and Sterling, T. L., Efficient dynamic scheduling of medium-grained tasks for general purpose parallel processing. In: Proceedings of the International Conference on Parallel Processing, pp. 166-175, August 1988.

    [28]M. Castro, C. de Amorim, Efficient Categorization of Memory Sharing Patterns in Software DSM Systems. In Proceedings of 15th International Parallel and Distributed Processing Symposium (IPDPS'01), April 23 - 27, 2001

    [29]Nguyen, T. D., Vaswani, R, Zahorjan, J., Maximizing speedup through self-tuning of processor allocation. In Proceedings of 10th International Parallel Processing Symposium, pp. 463-468, 1996.

    [30]Quinn, M. J., Parallel Computing Theory and Practice, McGraw-Hill Inc. New York, NY, USA.

    [31]Raquel Pinto, Ricardo Bianchini, and Claudio L. Amorim, Comparing Latency-Tolerance Techniques for Software DSM Systems, IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 11, pp. 1180-1190, November 2003.

    [32]Renard, H., Robert, Y., Vivien, F. and Legrand, A., Mapping and Load-Balancing Iterative Computations, IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 546-558, June 2004.

    [33]Sevcik, K .C., Characterizations of parallelism in applications and their use in scheduling. ACM SIGMETRICS Performance Evaluation Review, volume 17, No. 1, pp. 171-180, May, 1989.

    [34]Shieh, C. K., Lai, A. C., Ueng, J. C., Liang, T. Y., Chang, T. Z., and Mac, S. C., Cohesion: An Efficient Distributed Shared Memory System Supporting Multiple Memory Consistency Models, In Proceedings of the First Aizu International Symposium on Parallel Algorithms/Architecture Synthesis, March, 1995.

    [35]Speight, W. E., and Bennett, J. K., Brazos: A Third Generation DSM System, Proceeding of the USENIX Windows NT Workshop, 1997.

    [36]Steed, M. R., and Clement, M. J., Performance prediction of PVM programs, Proceedings of the 10th International Parallel Processing Symposium, pp. 803-807, 1996.

    [37]Thitikamol, K., and Keleher P., Per-Node Multithreading and Remote Latency, IEEE Transactions on Computers, volume 47, No. 4, pp. 414-426, 1998.

    [38]Thitikamol, K., and Keleher, P., Active Correlation Tracking, International Conference on Distributed Computing Systems, pp. 324-331, 1999.

    [39]Thitikamol, K., and Keleher, P., Thread Migration and Load Balancing in Non-Dedicated Environments, 14th International Parallel and Distributed Processing Symposium (IPDPS'00), pp. 583-588, 2000.

    [40]Ueng, J. C., Shieh, C. K., and Liang, T. Y., Proteus: An Efficient Runtime Reconfigurable Distributed Shared Memory System, Journal of Systems and Software, volume 56, pp 247-260, 2001.

    [41]Ueng, J. C., Shieh, C. K, Mac, S. C., Lai, A. C. and Liang, T. Y., Multi-threaded Design for a Software Distributed Shared Memory System, IEICE Transactions on Information and Systems, volume E82-D, No.12, pp.1512-1523, 1999.

    [42]Valiant., L. G., A Bridging Model for Parallel Computation, Communication of ACM, vol. 33, no. 8, pp. 103-111, 1990.

    [43]Violard., E., A Semantic Framework to Address Data Locality in Data Parallel Languages, Parallel Computing, vol. 30, pp. 139-161, 2004.

    [44]Vrsalovic, D. F., Siewiorek, D. P., Segall, Z. Z., and Gehringer, E.F., Performance Prediction and Calibration for a Class of Multiprocessors, IEEE Transactions on Computers, vol. 37, no. 11, pp. 1353-1365, November 1988.

    [45]Xiao, L., Chen, S. Q., Zhang, X. D., Dynamic cluster resource allocations for jobs with known and unknown memory demands, IEEE Transactions on Parallel and Distributed Systems, volume 13, No. 3, pp. 223-240, 2002.

    [46]Zhang, X., and Qin, X., Performance Prediction and Evaluation of Parallel Processing on a NUMA Multiprocessor, IEEE Transactions on Software Engineering, vol. 17, no. 10, pp. 1059-1068, October 1991.

    [47]Zhuang, Y. C., Shieh, C. K., Chen, K. L., Liang, T. Y., and Ueng, J. C., Maximizing Speedup through Self-tuning of Node Allocation on Heterogeneous DSM Systems, International Conference on Open Source, 2002.

    [48]Zhuang, Y. C., Shieh, C. K., and Liang, T. Y., Centralized Load Balance on Distributed Shared Memory Systems, The Fourth Workshop on Compiler Techniques for High-Performance Computing (CTHPC'98), pp. 166-174, 1998.

    [49]Zhuang, Y. C., Shieh, C. K., and Liang, T. Y., The Study of Centralized Load Balance on Distributed Shared Memory Systems, Proceedings of the 1998 International Computer Symposium, pp.137-143, 1998.

    [50]Zhuang, Y. C., Shieh, C. K., and Liang, T. Y., Automatic Reconfiguration for Maximizing System Performance on Software Distributed Shared Memory Systems, The Proceedings of the IASTED 2002 International Conference on Networks, Parallel and Distributed Processing, and Applications, pp. 53-58, 2002.

    [51]Zhuang, Y. C., Shieh, C. K., Liang, T. Y., and Chou, C. H., Maximizing Speedup through Performance Prediction for Distributed Shared Memory Systems, The 21st IEEE International Conference on Distributed Computing Systems, pp. 723-726, 2001.

    [52]Zhuang, Y. C., Liang, T. Y., Shieh, C. K., and Lee, J. Q., A Group-Based Load Balance Scheme for Software Distributed Shared Memory Systems, First IEEE/ACM International Symposium on Cluster Computing and the Grid (IEEE CCGrid'2001), pp. 371-378, 2001.

    [53]Zhou, J. Z., Mizuno, M., and Singh, G., A Sequentially Consistent Distributed Shared Memory, International Conference on Computing and Information, pp. 165-169, 1993.

    下載圖示 校內:2005-07-16公開
    校外:2005-07-16公開
    QR CODE