
Graduate Student: Huang, Xiang (黃翔)
Thesis Title: Design and Implementation of A Parallel Tasks Generator (平行任務產生器的設計與實現)
Advisor: Jou, Jer-Min (周哲民)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2016
Graduation Academic Year: 104 (ROC calendar)
Language: Chinese
Number of Pages: 59
Chinese Keywords: 循序程式平行化、平行萃取、資料驅動多執行緒、訊息傳遞介面
Foreign-Language Keywords: Parallel Execution of Sequential Code, Parallel Extraction, Data-Driven Multithreading, Message Passing Interface
    The rise of multicore systems forces programmers to consider the dependences among instructions in a sequential program, because the resources of a multicore system can be used effectively only when the sequential program can be partitioned and distributed across multiple cores. With static task assignment, programmers can identify only the obvious dependences between tasks; when a program hides a large number of dependences, the system cannot use its resources effectively. Dependence detection at compile time, in turn, cannot cope with the dynamic changes of a multicore environment and degrades execution efficiency. To enable sequential programs to execute efficiently in parallel across a wide range of software and hardware environments, we propose a parallel tasks generator that runs on multicore systems. Through dynamic extraction together with communication and synchronization analysis, it converts a sequential program into a task synchronization graph suitable for parallel computing, which then serves as the basis for dynamic allocation under various parallel computing models, further improving overall system performance. Experimental results show that with 4 to 8 threads the parallel tasks generator runs about 2x faster than its sequential counterpart; the generated task synchronization graphs run about 1.37x faster on a multicore system; and on a computer cluster they run about 1.3x faster at the cost of only 33% additional communication time.
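
    To make the data-driven scheduling idea behind a task synchronization graph concrete, the following minimal C sketch models each task as a node carrying a synchronization count that is decremented whenever one of its producers completes; a task fires once the count reaches zero. The Task structure, the fire routine, and the small diamond-shaped example graph are illustrative assumptions for this sketch only, not the generator's actual data structures or scheduler.

/*
 * Illustrative sketch: a minimal task-synchronization-graph node with a
 * data-driven firing rule (a task becomes ready once all of its producers
 * have completed). Names and structure are assumptions for illustration.
 */
#include <stdio.h>

#define MAX_CONSUMERS 4

typedef struct Task {
    const char   *name;
    int           sync_count;                /* unresolved producer count */
    struct Task  *consumers[MAX_CONSUMERS];  /* tasks that depend on this */
    int           n_consumers;
    void        (*body)(void);               /* the task's work           */
} Task;

static void work(void) { /* placeholder for the task's computation */ }

/* Execute a task, then notify its consumers; any consumer whose last
 * producer has now finished becomes ready and is fired in turn.        */
static void fire(Task *t)
{
    printf("executing %s\n", t->name);
    t->body();
    for (int i = 0; i < t->n_consumers; i++) {
        Task *c = t->consumers[i];
        if (--c->sync_count == 0)
            fire(c);       /* data-driven: all of c's inputs are available */
    }
}

int main(void)
{
    /* A tiny diamond-shaped graph: A -> {B, C} -> D */
    Task D = { "D", 2, { 0 },      0, work };
    Task B = { "B", 1, { &D },     1, work };
    Task C = { "C", 1, { &D },     1, work };
    Task A = { "A", 0, { &B, &C }, 2, work };

    fire(&A);              /* A has no producers, so it is ready at once */
    return 0;
}

    Running the sketch prints the tasks in a valid data-driven order (A, B, C, D); a real scheduler would keep a worklist of ready tasks and dispatch them to threads instead of recursing.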

    The rise of multicore systems forces programmers to consider dependences among instructions in sequential code, because a sequential program can use multicore resources appropriately only when it can be partitioned into tasks that can be assigned to the cores. Programmers can statically observe only the obvious dependences between tasks, so sequential code that hides many dependences cannot use system resources efficiently. Compile-time dependence detection, in turn, cannot account for dynamic changes at run time and reduces execution efficiency. In order to make parallel execution of sequential code time-efficient across a wide range of hardware and software, we propose a parallel tasks generator, which produces task synchronization graphs suitable for parallel computing by performing extraction and analysis of communication and synchronization at run time. According to the task synchronization graphs, tasks can be scheduled dynamically on the parallel computing models, further improving overall system performance. Experimental results show that the parallel tasks generator, using 4 to 8 threads, runs about 2x faster than its sequential version. The generated task synchronization graphs execute about 1.37x faster than their sequential versions on a multicore system, and about 1.3x faster on computer clusters with only 33% extra communication overhead.
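
    The cluster-side result at the end of the summary involves distributing tasks and synchronizing their results over MPI, as Chapter 5 of the table of contents indicates. As a rough illustration of that communication pattern only, here is a minimal, hypothetical master/worker sketch; the tags, payloads, and round-robin assignment are assumptions for this example, not the thesis's actual protocol.

/*
 * Illustrative sketch: a hypothetical master/worker task distribution over
 * MPI, of the general kind measured on a computer cluster. Task payloads,
 * tags, and the round-robin assignment are assumptions for this example.
 */
#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS 8
#define TAG_TASK  1
#define TAG_DONE  2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                        /* need one master and one worker */
        if (rank == 0) fprintf(stderr, "run with at least 2 MPI processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {                       /* master: distribute task ids */
        for (int t = 0; t < NUM_TASKS; t++) {
            int worker = 1 + t % (size - 1);
            MPI_Send(&t, 1, MPI_INT, worker, TAG_TASK, MPI_COMM_WORLD);
        }
        for (int t = 0; t < NUM_TASKS; t++) {   /* synchronize on results */
            int result;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        printf("all %d tasks completed\n", NUM_TASKS);
    } else {                               /* worker: execute assigned tasks */
        int expected = NUM_TASKS / (size - 1)
                     + (rank <= NUM_TASKS % (size - 1) ? 1 : 0);
        for (int i = 0; i < expected; i++) {
            int task, result;
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            result = task * task;          /* stand-in for the task body */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

    Built with, e.g., mpicc and launched with mpirun -np 4, the master's send/receive round trips are the kind of per-task communication overhead that the reported extra 33% communication time refers to.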

    Abstract (Chinese)
    Summary
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Research Motivation
      1.2  Research Objectives
      1.3  Thesis Organization
    Chapter 2  Background and Related Work
      2.1  Parallel Computing Architectures
        2.1.1  Uniprocessor Systems
        2.1.2  Multiprocessor Systems
        2.1.3  Multicomputer Systems
        2.1.4  Flynn's Taxonomy
      2.2  Parallelization of Sequential Programs
        2.2.1  Finding Parallelism
        2.2.2  Writing Parallel Programs
        2.2.3  Foster's Design Methodology
        2.2.4  Issues Arising from Parallelization
        2.2.5  Inter-Process Communication
      2.3  Related Work
    Chapter 3  Design of the Parallel Tasks Generator
      3.1  Design Considerations of the Parallel Tasks Generator
        3.1.1  Data-Driven Multithreaded Programs
        3.1.2  Task Selection
      3.2  Parallelism Extraction Design
      3.3  Level Partitioning Design
      3.4  Computing Dependences
      3.5  Generating the Task Synchronization Graph
      3.6  Chapter Summary
    Chapter 4  Data-Driven Multithreading Resource Manager
      4.1  Components of the Data-Driven Multithreading Resource Manager
      4.2  Behavior of the Data-Driven Multithreading Resource Manager
        4.2.1  Management Strategy
        4.2.2  Task Synchronization Graph Partitioning
        4.2.3  Task Mapping
        4.2.4  Task Synchronization
        4.2.5  Load Balancing
    Chapter 5  Message Passing Interface
      5.1  MPI Environment Configuration and Execution Flow
      5.2  Parallel Program Design
      5.3  Message Passing Behavior on Computer Clusters
        5.3.1  Task Distribution
        5.3.2  Data Synchronization
    Chapter 6  Experimental Environment and Data Analysis
      6.1  Benchmark Programs
      6.2  Experimental Environment Configuration
        6.2.1  Data-Driven Multithreading Resource Manager Configuration
        6.2.2  MPI Environment Configuration
      6.3  Experimental Results
      6.4  Summary of Experiments
    Chapter 7  Conclusions and Future Work
    References


    Full-Text Availability: On campus: released 2018-09-01; Off campus: released 2018-09-01