
Graduate student: 張瑋霖
Chang, Wei-Lin
Thesis title: 聯合深度學習框架之研究
A Study on Federated Deep Learning Framework
Advisor: 陳敬
Chen, Jing
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2020
Academic year of graduation: 108
Language: Chinese
Number of pages: 133
Keywords (Chinese): 聯合深度學習, 深度學習框架, 檢查點, 智聯網
Keywords (English): Federated Deep Learning, Deep Learning Framework, Checkpoint, AIoT
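  • Federated Deep Learning allows multiple end devices in an Internet of Things (IoT) network to train and update the same deep learning model without transmitting the data the devices collect. Besides addressing the concern that sensitive data may be exposed in transit, it also lets deep learning applications achieve better results. However, because most IoT end devices are resource-constrained, realizing a federated deep learning application requires considering the issues that affect the end device during training, in particular the effect of excessive memory use on the system. Developing federated deep learning applications on resource-constrained IoT devices therefore calls for a stable, reliable, effective, and lighter-weight federated deep learning framework.
    This thesis focuses on developing federated deep learning programs and running training on IoT end devices. It implements a federated deep learning framework designed specifically for IoT end devices in federated deep learning scenarios, so that development engineers can effectively handle abnormal conditions caused by limited resources during training. The design and implementation of the framework draw on prior research and integrate related development experience. The key points of the overall design are: (1) allocating memory dynamically, to keep system execution stable and reduce run-time errors; (2) adding checkpoint and recovery functions, which store the temporary data produced during training and load the results from before an interruption to continue execution, reducing the cost of restarting training; (3) monitoring the execution environment of the training task and the use of system resources to keep the training task stable: when the system is short of resources, the framework promptly releases the hardware resources held by the training task, and it resumes the task once it detects that system resources are sufficient again. To verify this design, the thesis uses the Raspberry Pi 3 Model B+ as the development platform, implements examples, and tests the correctness of the framework's functions.
    The main contributions of this thesis include extending TensorFlow with support for federated deep learning and making it easier for development engineers to develop training tasks on IoT end devices. The modular design also makes it convenient for engineers to develop according to application requirements, which can improve efficiency.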

    Federated Deep Learning (FDL) allows multiple devices to jointly train a deep learning model without any participant having to reveal its local data to a centralized server. However, most IoT edge devices are relatively limited in resources, especially memory. Limited resources add significant overhead when training a deep learning model on an IoT edge device and can make the device's operation unstable. Developers who want to run such programs on resource-constrained IoT edge devices therefore need a stable, reliable, lightweight, and effective framework.
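
The idea of training jointly without sharing local data can be illustrated with a minimal sketch of federated averaging, a widely used aggregation scheme. The abstract does not state which aggregation rule the thesis uses, so this is an assumption for illustration only; the linear model and the names `local_update`, `federated_average`, and `run_round` are hypothetical and are not part of the thesis framework.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair held privately by one device


def local_update(weights: Sequence[float], data: Sequence[Sample],
                 lr: float = 0.1) -> List[float]:
    """One pass of stochastic gradient descent for y = w0 + w1 * x.

    Only the updated weights leave the device; the samples never do.
    """
    w0, w1 = weights
    for x, y in data:
        err = (w0 + w1 * x) - y
        w0 -= lr * err
        w1 -= lr * err * x
    return [w0, w1]


def federated_average(client_weights: List[List[float]],
                      client_sizes: List[int]) -> List[float]:
    """Average client weights, weighted by each client's sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def run_round(global_weights: Sequence[float],
              clients: List[List[Sample]]) -> List[float]:
    """One communication round: local training, then server-side averaging."""
    updates = [local_update(global_weights, data) for data in clients]
    sizes = [len(data) for data in clients]
    return federated_average(updates, sizes)
```

In this sketch the server only ever sees weight vectors, which is the property the abstract highlights; a real deployment would also need to handle communication, stragglers, and device failures.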

    This thesis presents a study on developing an FDL framework that supports developing deep learning programs with stability and reliability. The main features of this FDL framework include: (1) Dynamically allocating available memory to maintain stable system operation. (2) Providing checkpoint and recovery functions so that the system can return to its running state when a training task is terminated before completion. (3) Using a daemon process that monitors the training task. For example, if system operation becomes unstable, the daemon can immediately release the hardware resources occupied by the training task so that the system stays in a stable state.
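
The checkpoint-and-recovery and resource-monitoring features listed above can be sketched roughly as follows. This is not the thesis implementation, which builds on TensorFlow and a separate daemon process; it is a simplified single-process illustration that assumes a Linux system exposing `/proc/meminfo`. The JSON checkpoint format, the 50,000 kB threshold, and the names `read_mem_available_kb`, `save_checkpoint`, `load_checkpoint`, and `train` are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Optional


def read_mem_available_kb(path: str = "/proc/meminfo") -> Optional[int]:
    """Return MemAvailable in kB on Linux, or None if it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Load the last saved state, or start fresh if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int,
          mem_probe: Callable[[], Optional[int]] = read_mem_available_kb,
          min_free_kb: int = 50_000) -> dict:
    """Train until done or until memory runs low; call again to resume."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        free = mem_probe()
        if free is not None and free < min_free_kb:
            # Low memory: stop and release resources. The latest
            # checkpoint is already on disk, so no progress is lost.
            break
        state["weight"] += 0.5  # stand-in for one real training step
        state["step"] += 1
        save_checkpoint(path, state)
    return state
```

Calling `train` again with the same checkpoint path continues from the last saved step rather than from zero, which is the restart-cost saving the abstract describes. Writing to a temporary file and renaming it keeps a crash in the middle of a save from corrupting the previous checkpoint.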

    The main contributions of this thesis include extending the functions of TensorFlow into a framework that takes the constrained hardware resources of IoT devices into account. It helps developers build deep learning models for IoT devices without excessive overhead. It also provides a solution that prevents the system from becoming unstable during training and makes training more efficient and reliable.

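    Chapter 1 Introduction
    1.1 Research Background
    1.2 Research Motivation
    1.3 Research Method
    1.4 Chapter Organization
    Chapter 2 Related Work
    2.1 Machine Learning
    2.1.1 Introduction to Machine Learning
    2.1.2 Deep Learning
    2.1.3 Transfer Learning
    2.1.4 Federated Deep Learning
    2.1.5 Discussion of Machine Learning
    2.2 Machine Learning and IoT Issues
    2.2.1 AIoT
    2.2.2 Use Case Discussion: Machine Vision
    2.3 Deep Learning Frameworks
    2.3.1 Keras
    2.3.2 PyTorch
    2.3.3 TensorFlow
    2.3.4 Discussion of Deep Learning Frameworks
    2.4 Linux Memory Management
    2.4.1 Virtual Memory System
    2.4.2 OOM Management and Execution
    2.4.3 Discussion of Linux Memory Management
    2.5 Checkpoints
    2.5.1 Discussion of Checkpoint Usage Scenarios
    2.5.2 Checkpoint Design and Implementation Issues
    2.5.3 Checkpoint Use Case 1: Windows NT
    2.5.4 Checkpoint Use Case 2: CRIU
    2.5.5 Discussion of Checkpoints
    2.6 Discussion
    Chapter 3 Architecture Design
    3.1 Framework Software Architecture
    3.2 Design of the Framework Operation Model
    3.2.1 Design of the Checkpoint Component
    3.2.2 Design of the System State Recovery Component
    3.2.3 Design of the CSV Reading Component
    3.2.4 Design of the Environment Detection Component
    3.2.5 Design of the File Correctness Component
    3.3 Framework Usage and Operation Flow
    3.4 Design of the Framework Application Programming Interface
    Chapter 4 Framework Implementation
    4.1 Framework Implementation Environment
    4.2 Implementation of the Framework Components
    4.2.1 Implementation of the Checkpoint Component
    4.2.2 Implementation of the System State Recovery Component
    4.2.3 Implementation of the CSV Reading Component
    4.2.4 Implementation of the Environment Detection Component
    4.2.5 Implementation of the File Correctness Component
    4.3 Framework Usage Flow
    4.4 Implementation of the Framework Application Programming Interface
    Chapter 5 Functional Testing and Performance Analysis
    5.1 Framework Test Environment
    5.2 Application Development Examples
    5.2.1 Introduction to the Framework Development Examples
    5.2.2 Development Example Using a Third-Party Framework
    5.2.3 Development Example Using the Framework of This Thesis
    5.3 Functional Tests of the Framework Components
    5.3.1 Functional Test of the CSV Reading Component
    5.3.2 Functional Test of the Environment Detection Component
    5.3.3 Functional Test of the Checkpoint Component
    5.3.4 Test of the System State Recovery Component
    5.3.5 Functional Test of the File Correctness Component
    5.3.6 Framework Performance Test
    5.4 Comparison and Discussion
    Chapter 6 Conclusion and Future Work
    6.1 Conclusion
    6.2 Future Work
    References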


    Full-text availability: on campus, open immediately; off campus, open immediately.