| Graduate student: | Zhuang, Yi-Teng (莊易騰) |
|---|---|
| Thesis title: | A Full-System Platform for Rapid Development, Debugging, and Performance Evaluation of OS-Capable RISC-V Out-of-Order Processors (支援作業系統之 RISC-V 亂序處理器的快速開發、除錯與效能評估全系統平台) |
| Advisor: | Chen, Chung-Ho (陳中和) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2026 |
| Graduation academic year: | 114 |
| Language: | Chinese |
| Number of pages: | 95 |
| Keywords (Chinese): | RISC-V、亂序處理器、全系統模擬、協同模擬、時間回溯除錯、效能評估 |
| Keywords (English): | RISC-V, Out-of-Order Processor, Full-System Simulation, Co-simulation, Time-travel Debugging, Performance Evaluation |
| ORCID: | 0009-0008-0276-8976 |
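In recent years, the RISC-V architecture, being open and extensible, has become an important foundation for processor research and custom design. However, an out-of-order processor that supports an operating system poses challenges at several levels: microarchitecture design, system integration, software porting, functional verification, and performance analysis. Conventional tools such as instruction set simulators, functional simulators, and RTL simulators are mostly independent of one another, and none of them alone combines execution speed, microarchitectural observability, and RTL-level accuracy. As a result, the processor development flow tends to fragment, and locating bugs and analyzing performance bottlenecks become harder.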
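This study proposes NOVA, a full-system platform for the development, debugging, and performance evaluation of OS-capable RISC-V out-of-order processors. NOVA integrates an in-house RISC-V instruction set simulator, a single-issue out-of-order RTL processor, a hierarchical cache and memory subsystem, an AXI interconnect, the DRAMSys main-memory model, and SoC peripherals such as UART, ACLINT, and PLIC, forming a complete system environment that can run real workloads. The NOVA core uses an 11-stage single-issue out-of-order pipeline and implements branch prediction, register renaming, checkpoint recovery, instruction scheduling, a replay mechanism, LSU memory disambiguation, and a non-blocking cache to exploit instruction-level parallelism and hide memory latency.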
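To make verification and debugging more efficient, this study builds a co-simulation mechanism between the ISS and the RTL that compares architectural state at the instruction commit point, so that any inconsistency between RTL behavior and the reference model is detected immediately. The platform also proposes a time-travel debugging mechanism based on process forking: after an error is detected, the system can return to an earlier execution state and enable waveform capture for only the relevant window, avoiding the high I/O and storage cost of recording waveforms for the entire run. The platform further provides debug and performance visualization tools that convert information such as the ROB, FTQ, physical register file, IPC, cache hit rate, and MSHR utilization into high-level visual data, helping designers quickly understand system behavior and performance bottlenecks.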
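The combination of commit-point comparison and fork-based rollback can be illustrated with a toy model. The following is only a minimal sketch under stated assumptions, not NOVA's implementation: `ref_step` stands in for the ISS, `dut_step` for the RTL simulation, a list of text lines stands in for the captured waveform, and all names, the checkpoint interval, and the injected bug are hypothetical. It relies on POSIX `os.fork`, so it runs on Linux or macOS only.

```python
import os

def ref_step(state):
    """Reference model (stand-in for the ISS): result of one committed instruction."""
    return (state * 3 + 1) % 1000

def dut_step(state, step, bug_at):
    """Toy DUT (stand-in for the RTL): corrupts the result at commit `bug_at`."""
    result = ref_step(state)
    return result ^ 1 if step == bug_at else result

def discard(ckpt):
    """Tell a frozen snapshot process to exit, then reap it."""
    pid, wake_w, out_r = ckpt
    os.write(wake_w, b"x")
    os.close(wake_w)
    os.close(out_r)
    os.waitpid(pid, 0)

def cosim_with_time_travel(total_steps=50, bug_at=37, interval=10):
    """Return (failing_step, trace_lines), or (None, []) if no mismatch occurs."""
    ref = dut = 7
    ckpt = None              # latest snapshot: (pid, wake pipe, trace pipe)
    tracing = False          # True only inside a woken snapshot process
    trace, out_fd = [], None
    step = 0
    while step < total_steps:
        if not tracing and step % interval == 0:
            wake_r, wake_w = os.pipe()
            out_r, out_w = os.pipe()
            pid = os.fork()
            if pid == 0:
                # Child: a frozen copy of the simulation state at this step.
                os.close(wake_w)
                os.close(out_r)
                if os.read(wake_r, 1) != b"r":
                    os._exit(0)          # discarded without being replayed
                tracing, out_fd = True, out_w
            else:
                os.close(wake_r)
                os.close(out_w)
                if ckpt is not None:
                    discard(ckpt)        # keep only the newest snapshot
                ckpt = (pid, wake_w, out_r)
        ref = ref_step(ref)
        dut = dut_step(dut, step, bug_at)
        if tracing:
            trace.append(f"step={step} ref={ref} dut={dut}")
        if ref != dut:                   # commit-point comparison failed
            if tracing:
                break                    # replay reached the bug; report below
            pid, wake_w, out_r = ckpt
            os.write(wake_w, b"r")       # wake the snapshot in tracing mode
            os.close(wake_w)
            chunks = []
            while chunk := os.read(out_r, 4096):
                chunks.append(chunk)
            os.close(out_r)
            os.waitpid(pid, 0)
            return step, b"".join(chunks).decode().splitlines()
        step += 1
    if tracing:
        os.write(out_fd, "\n".join(trace).encode())
        os._exit(0)                      # the replay child never returns
    if ckpt is not None:
        discard(ckpt)
    return None, []
```

With the default arguments the mismatch is detected at commit 37, and only the window from the last snapshot (commit 30) onward is traced, which mirrors the idea of capturing waveforms locally instead of for the whole run.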
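On the software and experimental side, NOVA supports FreeRTOS, several bare-metal benchmark suites, and small AI inference workloads such as TFLM and llama2.c. Experimental results show that, compared with the previous-generation single-issue in-order LUNA Core, NOVA achieves a 1.20x geomean IPC speedup on both MiBench and Embench, showing that out-of-order execution brings real benefit to diverse embedded and IoT workloads; on the PolyBench small dataset it achieves a 1.01x geomean speedup, indicating that regular numerical kernels benefit less from the OoO microarchitecture. In addition, NOVA successfully runs FreeRTOS hello world, TFLM MNIST int8 inference, and llama2.c small language model inference, validating the platform's OS support, SoC integration, and ability to run real applications. Overall, this study establishes a RISC-V processor development platform that combines RTL accuracy, system completeness, observability, and extensibility, and that shortens the path from microarchitecture design to real-workload validation and performance analysis.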
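For readers unfamiliar with the metric, a geomean IPC speedup is the geometric mean of the per-benchmark IPC ratios between the two cores. The sketch below shows that calculation in Python; the function name and the benchmark names and IPC values in the usage note are hypothetical placeholders, not measurements from the thesis.

```python
import math

def geomean_speedup(baseline_ipc, new_ipc):
    """Geometric mean of per-benchmark IPC ratios (new / baseline).

    Both arguments map benchmark name -> IPC and must share the same keys.
    """
    if baseline_ipc.keys() != new_ipc.keys():
        raise ValueError("both cores must be measured on the same benchmarks")
    ratios = [new_ipc[name] / baseline_ipc[name] for name in baseline_ipc]
    # Averaging in log space is equivalent to the n-th root of the product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

As a usage example with made-up numbers, `geomean_speedup({"bench_a": 0.5, "bench_b": 1.0}, {"bench_a": 1.0, "bench_b": 0.5})` gives a value of 1.0 up to floating-point rounding: a 2x gain on one benchmark and a 0.5x loss on the other cancel out, which an arithmetic mean of the ratios (1.25) would hide.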
This thesis presents NOVA, a full-system platform for rapid development, debugging, and performance evaluation of OS-capable RISC-V out-of-order processors. Modern out-of-order CPU development requires not only RTL implementation but also system integration, software support, correctness validation, and performance analysis under realistic workloads. Existing tools often cover only part of this workflow, leaving a gap between fast functional simulation, cycle-accurate RTL behavior, and efficient debugging. NOVA addresses this gap by integrating a RISC-V instruction set simulator, a single-issue out-of-order RTL processor, cache and memory-system models, SoC peripherals, co-simulation, time-travel debugging, and visualization tools into a unified environment. The platform supports bare-metal benchmarks, FreeRTOS, TensorFlow Lite for Microcontrollers, and llama2.c inference workloads. Experimental results show that NOVA achieves a 1.20x geomean IPC speedup over the previous in-order LUNA Core on both MiBench and Embench, while maintaining comparable performance on the PolyBench small dataset with a 1.01x geomean speedup. These results demonstrate that NOVA provides a practical foundation for studying RISC-V out-of-order processors across software, microarchitecture, memory hierarchy, and debugging infrastructure.