
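Graduate student: Tsai, Yi-Ying (蔡宜穎)
Thesis title: Energy-efficient Instruction Delivery for Embedded Processors (適用於嵌入式處理器之低功耗指令遞送機制之研究)
Advisor: Chen, Chung-Ho (陳中和)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: English
Number of pages: 70
Keywords (Chinese, translated): Low power, Cache memory, Embedded processor, Branch prediction, Trace cache
Keywords (English): Embedded processor, Branch prediction, Cache memory, Low power, Trace cache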
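  • For an embedded processor, the energy consumed by the instruction cache accounts for a considerable share of the whole processor's energy, so saving power by improving the energy efficiency of instruction delivery has become an important issue for embedded systems. This thesis first examines a number of schemes for reducing the cost of instruction delivery, and then proposes an architecture called trace reuse to improve the energy efficiency of instruction delivery in embedded processors while also increasing performance. Using a simple hardware structure called the trace reuse cache, the proposed scheme collects instructions that have finished executing at the end of the pipeline and preserves the order among them, that is, the execution trace. It then reuses these previously seen traces when needed, achieving the dual goals of higher energy efficiency and higher performance. According to the experimental results, for an embedded processor with a built-in 16KB instruction cache, adding a trace reuse cache of 2048 instructions improves IPC by 21% while reducing the energy consumed by the instruction cache by 75%. Another feature of the proposed scheme is that it greatly reduces the system's dependence on the conventional cache. A system equipped with the trace reuse cache needs less than half of the conventional instruction cache to reach the same level of performance. This gives embedded system designers a new option: by integrating the proposed trace reuse cache, designers can shrink the area and power taken up by the conventional cache in exchange for greater design flexibility.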

    For an embedded processor, the efficiency of instruction delivery has attracted much attention because instruction cache accesses account for a large share of total processor power dissipation. This thesis explores previous work on instruction delivery and proposes a novel scheme, the Trace Reuse Cache (TR cache) architecture, as an alternative option for energy-efficient instruction delivery. By reusing the instructions retired from the pipeline back-end of the processor, the TR cache improves both performance and power efficiency. Experimental results show that a 2048-entry TR cache provides 75% energy saving for a 16KB instruction cache while boosting the IPC by up to 21%. The scalability of the TR cache is also demonstrated through the estimated area usage and energy-delay product. The evaluation results indicate that the TR cache outperforms the traditional filter cache under all configurations with reduced cache sizes. The TR cache exhibits strong tolerance to the IPC degradation induced by smaller instruction caches, which makes it an ideal design option when trading cache size for better energy and area efficiency.

    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Proposed Method
        1.3 Contributions
        1.4 Organization of Thesis
    Chapter 2 Background
        2.1 Branch Resolution
        2.2 Instruction Reuse
        2.3 Trace Cache
        2.4 Filter Cache
        2.5 Code Compression
        2.6 Address Compression
        2.7 Summary and Discussions
    Chapter 3 Energy-efficient Instruction Delivery
        3.1 Trace Reuse Cache Architecture
        3.2 Design Options for Embedded Processor
        3.3 Discussion
    Chapter 4 Experimental Results
        4.1 Simulation Platform
        4.2 IPC Performance Analysis
        4.3 Impact on Instruction Cache Access
        4.4 Energy Efficiency
        4.5 Summary
    Chapter 5 Implementation Issues
        5.1 Pipeline Modification
        5.2 Access Latency Estimation
        5.3 Area Estimation
        5.4 Cost Analysis and Discussions
    Chapter 6 Conclusions and Future Works
        6.1 Conclusions
        6.2 Future Works
    Reference

    On campus: available immediately
    Off campus: available immediately