Author: Shi, Ting-Shian (施廷憲)
Thesis title: Design and Research of A Multi-threaded Computation System Architecture (多執行緒計算系統架構之研究與設計)
Advisor: Jou, Jer-Min (周哲民)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2010
Academic year of graduation: 98
Language: Chinese
Pages: 84
Chinese keywords: 指令層級平行, 執行緒層級平行, 同時多執行緒, 單晶片多處理器, 多執行緒服務
English keywords: Instruction Level Parallelism, Thread Level Parallelism, Simultaneous Multi-thread, Chip Multi-Processor, Multi-threaded Management
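  • Traditional processor architectures rely mainly on instruction-level parallelism (ILP) to improve performance. Because both the limited parallelism inherent in programs and memory access latency make further ILP gains increasingly difficult, designs have moved up to thread-level parallelism (TLP) to hide the limits of ILP and memory access latency.
    Mainstream TLP processor designs fall into two basic categories, Simultaneous Multi-thread (SMT) and Chip Multi-Processor (CMP); this thesis takes CMP as its design basis.
    The multi-threaded computation system architecture in this thesis consists of four units: a Processing Unit (PU), a Multi-thread Service Unit (MSU), a Load Store Unit (LSU), and a Memory Management Unit (MMU).
    Besides executing multiple threads simultaneously, the PU includes a thread pre-fetch function to avoid thread-switch latency, and adds a Distributed Synchronization Controller (DSC) to implement memory synchronization, improving system performance. Thread management is carried out mainly by the MSU, which can dynamically assign threads to the PU for execution; this avoids managing threads through the operating system (software) and increases execution efficiency. The LSU reduces shared-memory access latency, lets multiple threads access memory correctly, and prevents thread data errors. The MMU contains a Translation Look-aside Buffer (TLB) that translates virtual addresses to physical addresses, which reduces how often the Page Mapping Table (PMT) must be searched and lowers memory access latency. The MMU also contains shared level-2 instruction and data caches, which serve the threads' instruction and data accesses.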
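
    The dynamic assignment of threads to processor slots described above can be sketched in software. This is only an illustrative model, not the thesis's hardware design: the class `ServiceUnit`, its methods, the slot count, and the FIFO ordering are all assumptions made for the example (the thesis itself mentions priority and real-time scheduling, which are not modeled here).

```python
from collections import deque


class ServiceUnit:
    """Toy model of a hardware thread dispatcher (all names are hypothetical)."""

    def __init__(self, num_slots):
        self.ready = deque()             # threads waiting for a processor slot
        self.slots = [None] * num_slots  # thread id currently running in each slot

    def spawn(self, thread_id):
        """Queue a new thread; it runs once a slot becomes free."""
        self.ready.append(thread_id)

    def dispatch(self):
        """Fill every idle slot from the ready queue, in FIFO order (assumed policy)."""
        for i, running in enumerate(self.slots):
            if running is None and self.ready:
                self.slots[i] = self.ready.popleft()
        return list(self.slots)

    def finish(self, thread_id):
        """Free the slot held by a finished thread."""
        self.slots[self.slots.index(thread_id)] = None


msu = ServiceUnit(num_slots=2)
for tid in ("T0", "T1", "T2"):
    msu.spawn(tid)
first = msu.dispatch()   # T0 and T1 occupy the two slots
msu.finish("T0")
second = msu.dispatch()  # T2 takes the slot that T0 released
```

    The point of the sketch is that no operating-system call is involved between a slot becoming free and the next thread starting; in the architecture described above that step is performed by the MSU hardware.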

    In this thesis, we analyze and design a multi-threaded computation system composed of a Processing Unit (PU), a Multi-thread Service Unit (MSU), a Load Store Unit (LSU), and a Memory Management Unit (MMU). The MSU contains multi-threaded management mechanisms that keep threads running efficiently. Thread processing instructions, such as real-time scheduling, spawning, waiting, switching, and synchronization, are handled by the MSU. Threads are dynamically dispatched to the PU and run simultaneously. Accesses to the shared L2 data cache are handled by the LSU, which is designed to reduce thread stalls as much as possible. The MMU is composed of Translation Look-aside Buffers (TLBs), a shared L2 instruction cache, and a shared L2 data cache. The TLBs are divided into an Instruction TLB (ITLB) and a Data TLB (DTLB), which translate virtual addresses to physical addresses and reduce how often the Page Mapping Table (PMT) must be accessed.
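
    How a TLB cuts down on page-table accesses can be illustrated with a short software model. This is a minimal sketch under stated assumptions, not the ITLB/DTLB design from the thesis: the 4 KiB page size, the LRU replacement policy, the class name `TLB`, and the dictionary standing in for the PMT are all hypothetical choices made for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size; the abstract does not state one


class TLB:
    """Toy translation look-aside buffer with LRU replacement (assumed policy)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame number
        self.page_table = page_table  # dictionary standing in for the PMT
        self.pmt_lookups = 0          # counts how often the PMT is consulted

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)         # hit: mark as most recently used
        else:
            self.pmt_lookups += 1                 # miss: consult the page table
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


pmt = {0: 7, 1: 3}  # virtual page 0 -> frame 7, virtual page 1 -> frame 3
dtlb = TLB(capacity=2, page_table=pmt)
addrs = [0x0010, 0x0020, 0x1004, 0x0030]
phys = [dtlb.translate(a) for a in addrs]
```

    Four translations touch only two distinct pages, so the stand-in page table is consulted twice and the other two accesses are served from the buffer.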

    Abstract (Chinese)
    Abstract
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation and Objectives
      1.3 Thesis Organization
    Chapter 2 Background and Related Work
      2.1 Instruction Level Parallelism (ILP)
      2.2 Thread Level Parallelism (TLP)
        2.2.1 Processor Architecture Design Categories
        2.2.2 Thread Switch Mechanism
          2.2.2.1 Fixed Cycles
          2.2.2.2 Unfixed Cycles
        2.2.3 Thread Management Mechanism
          2.2.3.1 Priority
          2.2.3.2 Synchronization
    Chapter 3 Design Considerations for the Multi-threaded System
      3.1 Choice of Cache Scheme
      3.2 Choice of Thread Switch Mechanism
        3.2.1 Pre-fetch Mechanism
      3.3 Thread Management Problems and Solutions
        3.3.1 Thread Life Cycle
        3.3.2 Mapping Between the Thread Life Cycle and the Hardware Structure
          3.3.2.1 Thread States in the MSU
          3.3.2.2 Thread States in the PU
      3.4 Improving Thread Execution Performance
        3.4.1 Additional Instructions
    Chapter 4 Multi-threaded System Architecture Design
      4.1 Processing Unit (PU)
        4.1.1 Pre-fetch Architecture
          4.1.1.1 Detect Block (DB)
          4.1.1.2 Level 1 Instruction Cache
          4.1.1.3 Fetch Units (FUs)
          4.1.1.4 Fetch Scheduler (FS)
          4.1.1.5 Decode Scheduler (DS)
          4.1.1.6 Register Sets (RS)
          4.1.1.7 Thread Information Table (TIT)
        4.1.2 Processor Unit (pU)
          4.1.2.1 Parent Processor Unit (PpU)
          4.1.2.2 Child Processor Unit (CpU)
        4.1.3 Distributed Synchronization Controller (DSC)
          4.1.3.1 Instructions Added for the DSC
      4.2 Multi-thread Service Unit (MSU)
        4.2.1 Active Frame Cache and Empty AF Link List
          4.2.1.1 AF Cache Table and Empty AF Link List
          4.2.1.2 AF Cache
        4.2.2 Hardware Scheduler (HS)
        4.2.3 Thread Context Switch Handling Unit (CSHU)
        4.2.4 Multi-thread Request Queue (MTRQ)
      4.3 Load Store Unit (LSU)
        4.3.1 Load Store Request Queue (LSRQ)
        4.3.2 Store Wait Buffer (SWB)
        4.3.3 Load Wait Buffer (LWB)
        4.3.4 Store Logic
        4.3.5 Load Return Logic
        4.3.6 pU Information Table
      4.4 Memory Management Unit (MMU)
        4.4.1 Level 1 Instruction Counter
        4.4.2 Instruction TLB (ITLB)
        4.4.3 Level 2 Instruction Cache
        4.4.4 Data TLB (DTLB)
        4.4.5 Level 2 Data Cache
        4.4.6 Main Memory Interface
    Chapter 5 Experiments and Results
      5.1 System Environment
      5.2 Experiment 1
        5.2.1 Operation of Experiment 1 in the System
        5.2.2 Analysis of Experiment 1 Results
      5.3 Experiment 2
        5.3.1 Operation of Experiment 2 in the System
        5.3.2 Analysis of Experiment 2 Results
    Chapter 6 Conclusions and Future Work
      6.1 Conclusions
      6.2 Future Work
    References


    Full-text availability: on campus, available immediately
    Off campus: available from 2013-08-25