| Graduate student: | 許家榮 Hsu, Chia-Jung |
|---|---|
| Thesis title: | Applying Virtual Address Compression in Branch Target Buffer and Load/Store Queue in High-Performance Processors |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 |
| Language: | Chinese |
| Pages: | 79 |
| Keywords (translated from Chinese): | virtual address compression, branch target buffer, load/store queue, power consumption |
| Keywords (English): | energy reduction, BTB, branch target buffer, load store queue, LSQ, virtual address compression |
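This thesis applies compression to the virtual addresses that are stored and compared in the branch target buffer (BTB) and the load/store queue (LSQ) of high-performance processors. Because the BTB is a cache-like structure that stores virtual addresses, compressing those addresses reduces its area and power requirements. The LSQ does more than store virtual addresses: it uses a fully associative content-addressable memory (CAM) to check each virtual address about to be placed in the LSQ for address collisions. The energy consumption and area cost of this structure and of its search-and-compare operations become an increasingly serious concern as the number of in-flight instructions grows.
The BTB design with virtual address compression reduces area by roughly 53.6%-69.3% and BTB energy consumption by roughly 4.2%-28.5%. It places no extra burden on the original clock cycle, and instructions per cycle (IPC) drops by less than 0.4%. With virtual address compression, the LSQ design reduces area by roughly 35%-70% and LSQ energy consumption by roughly 39%-72%; under the best compression configuration finally adopted for the LSQ, IPC drops by less than 0.3%. Finally, combining the BTB and LSQ compression designs reduces processor energy consumption by 2.5%-3.1% and the combined LSQ and BTB area by 45%-52%, with an IPC reduction below 0.2%.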
This thesis proposes a virtual address compression technique for the branch target buffer (BTB) and the load/store queue (LSQ), both of which use virtual addresses for matching or comparison. Since a BTB is a large address cache, applying address compression reduces its area cost. An LSQ typically needs a fully associative CAM structure to search for matching addresses, and it consequently poses scalability challenges for power consumption and area cost as the number of in-flight instructions grows. Using the proposed approach, the BTB design reduces area by 53.6%-69.3% and energy consumption by 4.2%-28.5%, while the LSQ design reduces area by 35%-70% and energy consumption by 39%-72%. An experiment combining the two shows a 45%-52% total area saving for the two components and a 2.5%-3.1% reduction in overall processor energy, at a performance loss of only 0.2%.
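The abstract does not describe the compression mechanism itself. As a rough illustration only, the Python sketch below assumes one common form of address compression: the high-order bits of an address are kept once in a small shared table, and each stored address holds just a short table index plus its low-order bits. The class `HighBitsTable`, its parameters (`entries`, `low_bits`), and the chosen sizes are hypothetical and are not taken from the thesis.

```python
class HighBitsTable:
    """Hypothetical sketch: share the high-order address bits through a small table.

    This is NOT the thesis's design; it only illustrates the general idea of
    storing (index, low bits) instead of a full virtual address.
    """

    def __init__(self, entries=4, low_bits=12):
        self.entries = entries      # assumed number of table entries
        self.low_bits = low_bits    # assumed number of uncompressed low-order bits
        self.table = []             # high-order parts; list position is the index

    def stored_bits(self):
        """Bits needed per compressed address: index bits plus low-order bits."""
        index_bits = max(1, (self.entries - 1).bit_length())
        return index_bits + self.low_bits

    def compress(self, addr):
        """Return (index, low) for addr, allocating a table entry if needed."""
        high = addr >> self.low_bits
        low = addr & ((1 << self.low_bits) - 1)
        if high not in self.table:
            if len(self.table) == self.entries:
                # Simplification: a real design would need a replacement policy.
                raise RuntimeError("high-bits table is full")
            self.table.append(high)
        return self.table.index(high), low

    def decompress(self, index, low):
        """Rebuild the full address from its compressed form."""
        return (self.table[index] << self.low_bits) | low


table = HighBitsTable(entries=4, low_bits=12)
store_addr = table.compress(0x7FFF1234)
load_addr = table.compress(0x7FFF1234)
other_addr = table.compress(0x7FFF1ABC)

# Because each high-order part appears in the table only once, comparing the
# compressed forms gives the same answer as comparing the full addresses.
collision = store_addr == load_addr
no_collision = store_addr == other_addr
```

With these assumed sizes, each entry would hold 14 bits rather than a full 32-bit address, and an LSQ-style collision check would compare only those 14 bits. The savings reported in the abstract come from the thesis's own design and measurements, not from this sketch.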