National Cheng Kung University Electronic Theses and Dissertations System


Graduate student:	蘇郁翔 Su, Yu-Xiang
Thesis title:	Porting Tensorflow to CASLAB-GPUSIM and Optimization of Matrix Multiplication Library
Advisor:	陳中和 Chen, Chung-Ho
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication:	2018
Graduation academic year:	107 (ROC calendar)
Language:	Chinese
Number of pages:	75
Keywords:	Edge Device, GPGPU, Matrix Multiplication, Machine Learning

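With the rapid growth of cloud computing, machine learning applications are gradually extending to edge devices. To support both the development stage of edge hardware and the performance analysis of edge applications, this thesis integrates the machine learning framework Tensorflow with the OpenCL Runtime developed in our laboratory and ports the Tensorflow Runtime to CASLAB-GPUSIM, a simulation platform also developed in our laboratory. We then verify the system with a series of test programs written in Tensorflow, thereby simulating machine learning application scenarios on edge devices.
Beyond building the edge machine learning simulation platform, this thesis argues that, in solutions that use a GPGPU as the edge accelerator, linear algebra libraries have not been adapted to this application scenario or to the available computing resources. Matrix multiplication is affected most, because it is the basic operation underlying the convolution layers and fully connected layers of convolutional neural network models. We therefore propose an optimization for the matrix multiplication algorithm of the CLBlast library: reducing the library's preprocessing for the computation patterns of edge machine learning applications, which in turn reduces the overall execution time of the matrix multiplication library.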

With the rapid development of cloud computing, machine learning applications have gradually expanded to edge devices. To analyze the performance of edge applications in the early development stage of edge hardware, we integrate Tensorflow with the GPGPU simulator CASLAB-GPUSIM.
In addition to building the edge device simulation platform, we propose a matrix multiplication library for machine learning applications on edge devices that use a GPGPU as the acceleration solution. According to our experimental results, we achieve an average speedup of 5.6x in the fully connected layers of our benchmarks, which include the MNIST model, LeNet-5, and MobileNet.

Abstract (Chinese)	I
Summary	II
Acknowledgments	VI
List of Figures	XI
Chapter 1 Introduction	1
1 Motivation	1
2 Contribution	2
3 Organization	2
Chapter 2 Background	3
1 Tensorflow Runtime	3
1.1 Tensorflow Kernel Operation	3
1.2 Tensorflow Stream Executor	5
1.3 Tf-coriander	6
2 OpenCL Runtime	7
2.1 OpenCL Programming Model	7
2.2 HSA Runtime	9
3 GPGPU Hardware	11
3.1 GPGPU Architecture	11
3.2 GPGPU Memory Model	14
Chapter 3 Related Work on Matrix Multiplication and Machine Learning	15
1 Convolutional Neural Network	15
1.1 Convolution Layer	16
1.2 Pooling Layer	17
1.3 Fully Connected Layer	18
1.4 Activation Function	19
2 Matrix Multiplication in CNN	20
2.1 Implementation of Convolution Layer	20
2.2 Implementation of Fully Connected Layer	21
Chapter 4 Matrix Multiplication Optimization on GPGPU	23
1 Matrix Multiplication on GPGPU	23
2 Matrix Multiplication Optimization	24
2.1 Direct Implementation	25
2.2 Matrix Transposition	28
2.3 Shared Memory	29
2.4 Auto-Tuning Technique	30
3 Matrix Multiplication on Edge Device	33
3.1 Edge Computation	33
3.2 CASLAB Implementation	35
Chapter 5 Tensorflow Porting and Matrix Multiplication Library Implementation	38
1 Platform Introduction	38
2 Running Tensorflow on CASLAB-GPUSIM	42
2.1 OpenCL Runtime Implementation	43
2.2 Finalizer Implementation	44
3 Implementation of Matrix Multiplication	45
3.1 Kernel Operation Implementation	45
3.2 CLBlast Library	48
Chapter 6 Experimental Study of Matrix Multiplication for Edge Machine Learning Applications	52
1 Experiment Environment and Benchmarks	52
2 Verification of Tensorflow Porting	55
3 Performance of CASLAB MM Implementation	64
3.1 Performance Summary	64
3.2 MNIST Benchmarks	66
3.3 MobileNet Fully Connected Layer	69
4 Experiment Limitation and Recommendation	70
Chapter 7 Conclusion	71
References	72

                                    


On campus: available immediately
Off campus: available immediately
