
Author: 莊易騰
Zhuang, Yi-Teng
Thesis title: 支援作業系統之 RISC-V 亂序處理器的快速開發、除錯與效能評估全系統平台
A Full-System Platform for Rapid Development, Debugging, and Performance Evaluation of OS-Capable RISC-V Out-of-Order Processors
Advisor: 陳中和
Chen, Chung-Ho
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2026
Academic year of graduation: 114
Language: Chinese
Number of pages: 95
Keywords (Chinese): RISC-V、亂序處理器、全系統模擬、協同模擬、時間回溯除錯、效能評估
Keywords (English): RISC-V, Out-of-Order Processor, Full-System Simulation, Co-simulation, Time-travel Debugging, Performance Evaluation
ORCID: 0009-0008-0276-8976
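  • In recent years, the RISC-V architecture, being open and extensible, has become an important foundation for processor research and custom design. However, an OS-capable out-of-order processor poses challenges at multiple levels, including microarchitecture design, system integration, software porting, functional verification, and performance analysis. Conventional tools such as instruction set simulators, functional simulators, and RTL simulators are mostly independent of one another and cannot simultaneously deliver execution speed, microarchitectural observability, and RTL-level accuracy. As a result, the processor development flow tends to be fragmented, and locating bugs and analyzing performance bottlenecks become harder.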

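    This thesis proposes NOVA, a full-system platform for the development, debugging, and performance evaluation of OS-capable RISC-V out-of-order processors. The NOVA platform integrates an in-house RISC-V instruction set simulator, a single-issue out-of-order RTL processor, a hierarchical cache and memory subsystem, an AXI interconnect, the DRAMSys main-memory model, and SoC peripherals such as UART, ACLINT, and PLIC, forming a complete system environment that can run real workloads. The NOVA core adopts an 11-stage single-issue out-of-order architecture and implements branch prediction, register renaming, checkpoint recovery, instruction scheduling, a replay mechanism, LSU memory disambiguation, and a non-blocking cache to exploit instruction-level parallelism and hide memory latency.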

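    To improve verification and debugging efficiency, this thesis builds a co-simulation mechanism between the ISS and the RTL that compares architectural state at the instruction commit point, detecting any inconsistency between RTL behavior and the reference model as soon as it occurs. In addition, the platform proposes a time-travel debugging mechanism based on process forking: after an error is detected, the system can return to an earlier execution state and enable waveform capture for only the relevant window, avoiding the high I/O and storage cost of recording waveforms for the entire run. The platform also provides debug and performance visualization tools that turn information such as the ROB, FTQ, physical register file, IPC, cache hit rate, and MSHR utilization into high-level visual data, helping designers quickly understand system behavior and performance bottlenecks.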
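    The commit-point comparison can be illustrated with a minimal sketch. It is not NOVA's implementation: the `CommitRecord` fields and the `first_mismatch` helper are hypothetical names chosen for illustration, and a real co-simulation would compare more state (for example CSRs and memory writes) than the three fields assumed here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class CommitRecord:
    """Architectural effect of one committed instruction (hypothetical format)."""
    pc: int                # program counter of the committed instruction
    rd: Optional[int]      # destination register index, or None if no write
    rd_value: int = 0      # value written to rd (unused when rd is None)


def first_mismatch(
    rtl_commits: Iterable[CommitRecord],
    iss_commits: Iterable[CommitRecord],
) -> Optional[Tuple[int, CommitRecord, CommitRecord]]:
    """Compare two commit streams in lockstep.

    Returns (index, rtl_record, iss_record) for the first divergence, or
    None if every compared pair matches. zip() stops at the shorter stream,
    so a length difference alone is not reported by this sketch.
    """
    for index, (rtl, iss) in enumerate(zip(rtl_commits, iss_commits)):
        if rtl != iss:
            return index, rtl, iss
    return None


# Toy streams: the RTL side writes a wrong value at the third commit.
reference = [
    CommitRecord(pc=0x8000_0000, rd=1, rd_value=5),
    CommitRecord(pc=0x8000_0004, rd=2, rd_value=7),
    CommitRecord(pc=0x8000_0008, rd=3, rd_value=12),
]
rtl_trace = reference[:2] + [CommitRecord(pc=0x8000_0008, rd=3, rd_value=13)]

mismatch = first_mismatch(rtl_trace, reference)
```

    Comparing at commit rather than at execution keeps speculative, later-squashed instructions out of the check, so only architecturally visible state is compared against the reference model.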

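    On the software and experimental side, NOVA supports FreeRTOS, multiple bare-metal benchmark suites, and small AI inference workloads such as TFLM and llama2.c. Experimental results show that, compared with the previous-generation single-issue in-order LUNA Core, NOVA achieves a 1.20x geomean IPC speedup on both MiBench and Embench, demonstrating that out-of-order execution brings real benefit to diverse embedded and IoT workloads; on the PolyBench small dataset it achieves a 1.01x geomean speedup, indicating that regular numerical kernels benefit less from the OoO microarchitecture. Furthermore, NOVA successfully runs FreeRTOS hello world, TFLM MNIST int8 inference, and llama2.c small language model inference, validating the platform's OS support, SoC integration, and ability to run real applications. Overall, this thesis establishes a RISC-V processor development platform that combines RTL accuracy, system completeness, observability, and extensibility, effectively shortening the distance between microarchitecture design and real-workload verification and performance analysis.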
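    For readers unfamiliar with the metric, a geomean IPC speedup is conventionally the geometric mean of per-benchmark IPC ratios. The sketch below assumes that definition; the function name, the benchmark names, and the IPC values are placeholders for illustration and are not measurements from the thesis.

```python
import math
from typing import Mapping


def geomean_speedup(
    baseline_ipc: Mapping[str, float],
    new_ipc: Mapping[str, float],
) -> float:
    """Geometric mean of new/baseline IPC ratios over the baseline's benchmarks."""
    names = list(baseline_ipc)
    if not names:
        raise ValueError("at least one benchmark is required")
    # Average in log space: exp(mean(log(ratio))) equals the n-th root of the product.
    log_sum = sum(math.log(new_ipc[name] / baseline_ipc[name]) for name in names)
    return math.exp(log_sum / len(names))


# Placeholder values, NOT results from the thesis.
baseline = {"bench_a": 0.50, "bench_b": 0.40}
candidate = {"bench_a": 1.00, "bench_b": 0.20}

speedup = geomean_speedup(baseline, candidate)
```

    In this toy input one benchmark doubles its IPC and the other halves it, so the geometric mean comes out at 1.0. An arithmetic mean of the same ratios would report 1.25, which is why the geometric mean is the usual choice for averaging ratios.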

    This thesis presents NOVA, a full-system platform for rapid development, debugging, and performance evaluation of OS-capable RISC-V out-of-order processors. Modern out-of-order CPU development requires not only RTL implementation but also system integration, software support, correctness validation, and performance analysis under realistic workloads. Existing tools often cover only part of this workflow, leaving a gap between fast functional simulation, cycle-accurate RTL behavior, and efficient debugging. NOVA addresses this gap by integrating a RISC-V instruction set simulator, a single-issue out-of-order RTL processor, cache and memory-system models, SoC peripherals, co-simulation, time-travel debugging, and visualization tools into a unified environment. The platform supports bare-metal benchmarks, FreeRTOS, TensorFlow Lite for Microcontrollers, and llama2.c inference workloads. Experimental results show that NOVA achieves a 1.20x geomean IPC speedup over the previous in-order LUNA Core on both MiBench and Embench, while maintaining comparable performance on the PolyBench small dataset, with a 1.01x geomean speedup. These results demonstrate that NOVA provides a practical foundation for studying RISC-V out-of-order processors across software, microarchitecture, memory hierarchy, and debugging infrastructure.
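    The general idea behind fork-based time-travel debugging can be shown with a toy sketch. It assumes a POSIX system and is not NOVA's implementation: `simulate_step`, `run_with_snapshots`, the `fail_step` parameter (standing in for the step where a co-simulation mismatch is detected), and the text trace (standing in for waveform capture) are all hypothetical.

```python
import os


def simulate_step(state: int) -> int:
    """Stand-in for advancing the simulation by one step."""
    return state + 3


def _discard(snapshot) -> None:
    """Tell a sleeping snapshot process to exit and reap it."""
    pid, wake_w, trace_r = snapshot
    os.write(wake_w, b"K")
    os.close(wake_w)
    os.close(trace_r)
    os.waitpid(pid, 0)


def run_with_snapshots(total_steps: int, fail_step: int, interval: int) -> str:
    """Run a toy simulation; on failure, replay from the latest snapshot with tracing.

    Returns the trace text written by the replaying child, or "" if no failure.
    """
    state, tracing, trace_fd, snapshot = 0, False, -1, None
    for step in range(total_steps):
        if not tracing and step % interval == 0:
            if snapshot is not None:
                _discard(snapshot)          # keep only the most recent snapshot
            wake_r, wake_w = os.pipe()
            trace_r, trace_w = os.pipe()
            pid = os.fork()
            if pid == 0:
                # Child: a frozen copy of the simulation state at this step.
                os.close(wake_w)
                os.close(trace_r)
                if os.read(wake_r, 1) != b"R":   # blocks until the parent decides
                    os._exit(0)
                tracing, trace_fd = True, trace_w  # replay from here with tracing on
            else:
                os.close(wake_r)
                os.close(trace_w)
                snapshot = (pid, wake_w, trace_r)
        state = simulate_step(state)
        if tracing:
            os.write(trace_fd, f"step={step} state={state}\n".encode())
        if step == fail_step:
            if tracing:
                os._exit(0)                 # child has replayed the failing window
            pid, wake_w, trace_r = snapshot
            os.write(wake_w, b"R")          # wake the snapshot to replay
            os.close(wake_w)
            chunks = []
            while True:
                data = os.read(trace_r, 4096)
                if not data:
                    break
                chunks.append(data)
            os.close(trace_r)
            os.waitpid(pid, 0)
            return b"".join(chunks).decode()
    if snapshot is not None:
        _discard(snapshot)
    return ""


trace = run_with_snapshots(total_steps=10, fail_step=7, interval=3)
```

    Only the window between the latest snapshot and the failure is traced, which is the property that avoids recording the whole run. The snapshot interval trades memory held by sleeping processes against the length of the replayed window.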

    Abstract (Chinese)
    Extended Abstract (English)
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
    Chapter 2. Background
        2.1 RISC-V Instruction Set Architecture
            2.1.1 Design Philosophy
            2.1.2 Modularity and Extensibility
            2.1.3 Privilege Model and Software Ecosystem
        2.2 Out-of-Order Processor Microarchitecture
            2.2.1 Instruction Fetch
            2.2.2 Instruction Decode
            2.2.3 Register Renaming
            2.2.4 Instruction Dispatch
            2.2.5 Instruction Scheduling
            2.2.6 Register Read
            2.2.7 Execution
            2.2.8 Writeback
            2.2.9 Commit
        2.3 Processor Simulation and Modeling
            2.3.1 Instruction Set Simulation
            2.3.2 Performance Modeling
            2.3.3 RTL Simulation
            2.3.4 Comparison and Discussion
        2.4 Benchmark Suites
    Chapter 3. System Architecture
        3.1 System Overview
        3.2 RISC-V OoO Core Microarchitecture
            3.2.1 Design Overview
            3.2.2 High Performance BPU Design
            3.2.3 Checkpoint Recovery Mechanism
            3.2.4 High Performance LSU Design
            3.2.5 Instruction Scheduling and Speculative Execution
            3.2.6 Replay Mechanism
        3.3 High Performance Cache Modeling
        3.4 Interconnect and Peripherals
    Chapter 4. Simulation and Debug Infrastructure
        4.1 Co-simulation Framework
            4.1.1 Instruction Set Simulator
            4.1.2 SoC Integration
            4.1.3 State Comparison Mechanism
        4.2 Time-travel Debugging
        4.3 Debug and Performance Visualization
            4.3.1 NOVA Debug GUI
            4.3.2 NOVA Perf Web GUI
            4.3.3 Summary
    Chapter 5. Software Stack
        5.1 Software Stack Overview
        5.2 FreeRTOS Porting
        5.3 Model Inference Support
            5.3.1 TFLM
            5.3.2 llama2.c
        5.4 Benchmark Porting
            5.4.1 Bare-metal Benchmark Execution Environment
            5.4.2 Dhrystone
            5.4.3 CoreMark
            5.4.4 PolyBench
            5.4.5 MiBench
            5.4.6 Embench
    Chapter 6. Evaluation
            6.0.1 Experimental Environment Setting
            6.0.2 Dhrystone Score Analysis
            6.0.3 CoreMark Score Analysis
            6.0.4 MiBench Performance Analysis
            6.0.5 Embench Performance Analysis
            6.0.6 PolyBench Performance Analysis
            6.0.7 TFLM MNIST Inference Validation
            6.0.8 llama2.c Inference Validation
            6.0.9 FreeRTOS Execution Validation
    Chapter 7. Conclusion and Future Work
        7.1 Conclusion
        7.2 Future Work
    References

