| Graduate student: | 賴紹銘 Lai, Shao-Ming |
|---|---|
| Thesis title: | 俱雙重浮點精度之多核特殊函數單元設計 Design of Floating-Point Special Function Units with Dual-Precision in Multi-Core Systems |
| Advisor: | 郭致宏 Kuo, Chih-Hung |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 |
| Language: | Chinese |
| Number of pages: | 114 |
| Keywords (translated from Chinese): | special function unit, polynomial approximation, minimax polynomial, dual-precision operation, low-power arithmetic unit, lookup table unit |
| Keywords (English): | special function unit, polynomial approximation, dual-precision operation, low-power arithmetic unit, lookup table |
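Modern graphics processing units are often equipped with special function units to accelerate floating-point function evaluation. Designs that favor computational performance over accuracy commonly use a polynomial-approximation hardware architecture. Typically, the polynomial order is first chosen according to the required accuracy, which in turn determines the hardware resources needed for the arithmetic circuits and lookup tables; the polynomial coefficients are then precomputed in software and stored in the lookup tables. Many related studies investigate area-, power-, or speed-oriented optimizations for single-precision function units, but they do not consider variable-precision function units that can meet a program's dynamic precision requirements. This thesis proposes a special function unit that supports dual floating-point precision. Based on second-order polynomials, we design a low-power, variable-precision hardware architecture and analyze the order-reduction properties of minimax polynomials so that the hardware can share lookup tables, multipliers, and other arithmetic circuits. Dropping the second-order term in low-precision mode reduces power consumption by about 38.1% on average. In addition, we use Horner's method to simplify polynomial evaluation and save multiplier circuits, and we allocate the pipeline stages according to the delay of each unit on the critical path; overall area is reduced by about 49.9% and average function power by about 70%. Furthermore, we adopt non-uniform segmentation to reduce lookup table usage and establish a lookup table sharing mechanism for multi-core computing systems. A bank-partitioned lookup table architecture is designed according to the relationship between system performance and hardware sharing, saving up to 20% of the overall area.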
As the compute capability of computer architectures increases, many graphics processing units are equipped with special function units to speed up floating-point functions. Many related studies explore methods to optimize the area, power consumption, or speed of a single-precision function unit. However, dynamic-precision computation is critical to the energy consumption of mobile devices. This thesis presents an efficient dual-precision floating-point architecture for special function units based on second-order polynomial approximation. Instead of using two function units of different precision, we exploit the properties of the truncated polynomial to share the arithmetic units and lookup tables. To avoid excessive truncation error, we adjust the expression of the minimax polynomial, inspired by the expansion point of the Taylor series. The second-order term is truncated in low-precision mode, and the average power consumption is reduced by 38.1%. In addition, we exploit Horner's method and a pipelined architecture to improve polynomial evaluation. The synthesis results show that our method reduces overall area and power consumption by 49.9% and 70%, respectively. Moreover, we establish a shared lookup table scheme for multicore systems. The bank-partitioned lookup table architecture is configured according to the balance between performance and hardware sharing. The area of the lookup table is reduced by 56%, and the area of the overall multicore system is reduced by 20%.
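The dual-precision idea described in the abstracts can be illustrated with a small software sketch: one table of second-order coefficients serves both modes, the high-precision mode evaluates the full quadratic in Horner form, and the low-precision mode skips the second-order term. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the target function (1/x on [1, 2)), the 64 uniform segments, the names `SEG_BITS`, `build_table`, `reciprocal`, and `max_abs_error`, and the use of a Taylor expansion at each segment midpoint in place of the adjusted minimax coefficients and non-uniform segmentation that the thesis reports. It also runs in double-precision floats rather than modeling fixed-point hardware.

```python
# Illustrative sketch only: a table-based quadratic approximation of 1/x
# with a shared coefficient table and two precision modes.

SEG_BITS = 6                 # hypothetical choice: 2**6 = 64 uniform segments on [1, 2)
NUM_SEGS = 1 << SEG_BITS


def build_table():
    """Return per-segment (x0, C0, C1, C2) for f(x) = 1/x.

    Coefficients come from a Taylor expansion at the segment midpoint x0,
    used here as a simple stand-in for minimax coefficients.
    """
    table = []
    for i in range(NUM_SEGS):
        x0 = 1.0 + (i + 0.5) / NUM_SEGS
        table.append((x0, 1.0 / x0, -1.0 / x0**2, 1.0 / x0**3))
    return table


TABLE = build_table()


def reciprocal(x, high_precision=True):
    """Approximate 1/x for x in [1, 2) using the shared table."""
    if not 1.0 <= x < 2.0:
        raise ValueError("x must lie in [1, 2)")
    i = int((x - 1.0) * NUM_SEGS)      # segment index from the leading bits of x
    x0, c0, c1, c2 = TABLE[i]
    d = x - x0                          # offset from the expansion point
    if high_precision:
        # Horner form of C0 + C1*d + C2*d^2: two multiplies, two adds.
        return c0 + d * (c1 + d * c2)
    # Low-precision mode: the second-order term is dropped entirely.
    return c0 + d * c1


def max_abs_error(high_precision, samples=10000):
    """Largest absolute error over evenly spaced sample points in [1, 2)."""
    worst = 0.0
    for k in range(samples):
        x = 1.0 + k / samples
        worst = max(worst, abs(reciprocal(x, high_precision) - 1.0 / x))
    return worst


if __name__ == "__main__":
    print("high-precision max error:", max_abs_error(True))
    print("low-precision  max error:", max_abs_error(False))
```

Because the coefficients are expanded around the segment midpoint, dropping the second-order term leaves a linear approximation that is still centered on the segment, which is the intuition behind sharing one table across both modes. A true minimax fit would give a smaller worst-case error for the same table size.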
On campus: publicly available from 2020-10-01