
Graduate Student: Lin, Che-Hsuan (林哲亘)
Thesis Title: Adaptive Cache Pre-forwarding Policy on Distributed Deep Learning (分散式深度學習上的快取記憶體自適應預分派方法)
Advisor: Cheng, Sheng-Tzong (鄭憲宗)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2017
Graduation Academic Year: 105
Language: English
Number of Pages: 46
Keywords: Reinforcement Learning, Cache, Deep Learning, Distributed
Abstract (Chinese):
    With the rapid development of deep learning technology, people can now train high-accuracy models and apply them to a wide range of fields. For example, AlphaGo recently used reinforcement learning to defeat a professional Go champion, and Google uses convolutional neural networks for image recognition and LSTMs for translation. More and more real-world applications of deep learning can be seen, but training a reliable model requires a great deal of time and data for model tuning and model training. To speed up training, many studies therefore use GPUs for acceleration, combined with distributed training, to develop an effective model quickly.
    Distributed training suits applications whose training datasets are very large or that require high throughput, but it also has an unavoidable bottleneck: network latency. In distributed deep learning especially, nodes synchronize with one another frequently, and the data to be synchronized is usually on the order of megabytes, so network communication takes up a large share of the whole training process, sometimes even exceeding the time spent on learning itself. How to use distributed training effectively has therefore become an important issue.
    In this study, we propose a novel distributed computing architecture that optimizes the hidden costs incurred during synchronization. It uses a cache together with an adaptive cache pre-forwarding method to reduce the number of network synchronizations and the time spent blocked on the network. Moreover, by exploiting the properties of reinforcement learning, it can be applied to any cloud computing cluster: even if the computing environment differs between runs, it can still find the best cache pre-forwarding policy during training.

Abstract (English):
    With the rapid growth of deep learning algorithms, people have developed a large number of high-accuracy models and applied them to many real-world domains. For instance, AlphaGo, developed by Google DeepMind, prevailed over the world Go grandmaster, and Google trained convolutional neural networks for image recognition and image search. Even though the prospects of deep learning are inspiring, developers must spend a great deal of time on model training and model tuning. Therefore, GPUs and distributed computing are used to parallelize training in order to reduce training time.
    Deep learning training parallelizes well and is suitable for distributed computing, which can significantly improve system throughput. However, cross-machine training has an unavoidable bottleneck: network latency. Nodes must synchronize frequently, and each synchronization may transfer from several megabytes to hundreds of megabytes. Network communication therefore takes a large proportion of the training time and noticeably degrades system performance. As a result, researchers have proposed many computing architectures to mitigate this problem.
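    To make the scale of this bottleneck concrete, the short sketch below gives a rough back-of-envelope estimate of how much of one training step can go to synchronization. The model size, link bandwidth, and per-step compute time are illustrative assumptions, not measurements from this thesis.

```python
# Rough estimate of the share of one training step spent on parameter
# synchronization. All numbers are illustrative assumptions.
model_size_mb = 100.0     # assumed size of parameters/gradients exchanged per sync
bandwidth_mbps = 1000.0   # assumed 1 Gbps link between worker and parameter server
compute_time_s = 0.5      # assumed GPU compute time for one mini-batch

sync_time_s = model_size_mb * 8 / bandwidth_mbps   # MB -> megabits, then / Mbps
step_time_s = compute_time_s + sync_time_s
print(f"sync: {sync_time_s:.2f}s of a {step_time_s:.2f}s step "
      f"({sync_time_s / step_time_s:.0%} of the step)")
```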
    In this research, we propose a variant distributed computing system for deep learning. Our design aims to reduce the number of synchronizations and the time blocked on the network by adding a cache with a new mechanism named cache pre-forwarding. The main design concept of cache pre-forwarding is to exploit reinforcement learning to train a pre-forwarding policy that increases the cache hit rate. Owing to the characteristics of reinforcement learning, the policy is adaptive and can be applied to different computing environments. Finally, experiments show that our system is feasible.
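    The abstract describes cache pre-forwarding only at a high level. As a concrete illustration of the kind of policy it refers to, the minimal sketch below uses a plain REINFORCE policy-gradient update over a toy cache: the class name PreForwardPolicy, the linear state features, and the hit/miss reward are assumptions made for illustration, not the architecture or TensorFlow implementation described in the thesis.

```python
# A minimal sketch of the pre-forwarding idea, not the thesis system: a softmax
# policy picks which cached parameter block to pre-forward to a worker, and a
# REINFORCE-style update reinforces choices that later turn into cache hits.
import numpy as np

class PreForwardPolicy:
    def __init__(self, n_features, n_blocks, lr=0.01):
        self.w = np.zeros((n_features, n_blocks))  # linear softmax policy weights
        self.lr = lr

    def act(self, state):
        """Sample which parameter block to pre-forward for the given state."""
        logits = state @ self.w
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = np.random.choice(len(probs), p=probs)
        return action, probs

    def update(self, state, action, probs, reward):
        """REINFORCE: scale the log-likelihood gradient by the observed reward."""
        grad = -np.outer(state, probs)   # d log pi(a|s) / dW, softmax part
        grad[:, action] += state         # plus the indicator term for the chosen block
        self.w += self.lr * reward * grad

# Toy training loop: reward = 1 when the pre-forwarded block is the one the
# worker actually requests next (a cache hit), 0 otherwise.
np.random.seed(0)
policy = PreForwardPolicy(n_features=4, n_blocks=3)
hits = 0
for step in range(2000):
    state = np.random.rand(4)                 # stand-in for observed access statistics
    action, probs = policy.act(state)
    next_request = int(state.argmax()) % 3    # stand-in for the worker's next access
    reward = 1.0 if action == next_request else 0.0
    hits += reward
    policy.update(state, action, probs, reward)
print(f"hit rate over the toy run: {hits / 2000:.2f}")
```

    In a real deployment the state would summarize recent access patterns of the parameter blocks and the reward would come from measured cache hits, but the update rule itself is the standard REINFORCE estimator, which is what makes the policy adapt to whatever cluster it runs on.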

    摘要 (Chinese Abstract) I
    Abstract II
    ACKNOWLEDGEMENT III
    TABLE OF CONTENTS IV
    LIST OF TABLES V
    LIST OF FIGURES VI
    Chapter 1. Introduction & Motivation 1
        1.1 Introduction 1
        1.2 Motivation 2
        1.3 Thesis Overview 3
    Chapter 2. Background & Related Work 5
        2.1 Deep Learning 5
        2.2 Tensorflow 6
        2.3 Parallelism in Deep Learning 7
        2.4 Reinforcement Learning 15
        2.5 Cache and Cache Prefetching 16
    Chapter 3. System Design and Approach 18
        3.1 Problem Description 18
        3.2 System Design 19
        3.3 Cache Mechanism 26
    Chapter 4. Implementation and Experiments 32
        4.1 System Implementation 32
        4.2 Experiment Environment and Settings 35
        4.3 Experiment Result 36
    Chapter 5. Conclusion & Future work 44
    Reference 45


Full-text access: on campus from 2020-08-21; off campus from 2020-08-21