| Graduate student: | 邱楚盦 Chiu, Chu-An |
|---|---|
| Thesis title: | 分散式數據中台-打破資料孤島的頭等公民 The First Class Citizen in the Distributed Data Fabric Attacking Isolated Data Islands |
| Advisor: | 蕭宏章 Hsiao, Hung-Chang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2022 |
| Academic year of graduation: | 110 |
| Language: | Chinese |
| Number of pages: | 48 |
| Keywords (Chinese): | 分散式計算、分散式儲存、資料孤島、數據中台、資料虛擬化 |
| Keywords (English): | Distributed Computing, Distributed Storage, Isolated Data Islands, Data Fabric, Data Virtualization |
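With the arrival of Industry 4.0, enterprises urgently need to move toward big-data-driven "smart manufacturing." However, enterprises already have established data warehousing practices: each department stores its data in different database management systems (DBMS), and data cannot be exchanged between those systems. Data users and data analysts in each department are therefore limited to a very small range of data. Although the company holds a large amount of data from which value could be extracted, that data is spread across many locations that resemble disconnected islands. This is known as the "isolated data islands" problem, and solving it is the first hurdle on the way to big-data-driven smart manufacturing.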
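To address the isolated data islands problem, this research developed the "Distributed Data Fabric" system in collaboration with a well-known integrated circuit packaging and testing company. The goal is an integrated platform that aggregates the tables of heterogeneous database management systems, achieving "DBMS unification." Through this platform, data users and data analysts can directly obtain tables from the heterogeneous databases, run integration (JOIN) computations across tables from multiple heterogeneous databases, and import the results into the Data Fabric for subsequent computation and analysis. This removes the barriers between data islands and brings the data under unified management.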
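The Distributed Data Fabric designed in this thesis supports 7 common SQL / NoSQL database management systems. It also integrates the large open-source projects Hadoop, HBase, and YARN for distributed big data storage and computing, which support storage and computation over files and tables, so users can enjoy the benefits of distributed systems without being proficient in the related technologies. In addition, the platform follows a no-code / low-code design and provides a web interface and a REST API, so users can connect to it painlessly with a near-zero learning cost. It also lowers the cost of accessing heterogeneous storage and heterogeneous database management systems, letting data users focus on developing downstream data analysis and computation applications and improving the overall efficiency of data access and analysis. The Data Fabric also has good computing resource management capabilities: it can dynamically adjust computing resources while computing tasks are running, ensuring high utilization of overall system resources.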
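As verified by the partner company, the Distributed Data Fabric has reduced its data integration cycle time by 60%, and the system serves as the core of applications in areas such as smart information security, smart operations, production quality prediction, and production parameter optimization. The thesis closes with future goals for the Data Fabric. Following recommendations from the well-known IT consulting firm Gartner, the Data Fabric will move further into data governance, which requires a data catalog and a data recommender system to help users find and understand data more quickly. The system could also be deployed and managed with containerization tools such as Docker and Kubernetes (k8s).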
In the era of big data, enterprises have invested in big data storage and computing to reap the benefits of big data. However, as organizational structures grow more complex, the departments of an enterprise use many types of databases that cannot access one another. As a result, each heterogeneous database system resembles an isolated island, a situation referred to as isolated data islands. Data scientists and analysts cannot easily access cross-departmental data, so the value of big data analysis cannot be fully realized.
This thesis presents a Distributed Data Fabric to attack isolated data islands. The Distributed Data Fabric is a unified platform that integrates the open source projects Hadoop, HBase, and YARN for distributed big data storage and computing. Users can import tables from heterogeneous databases or files from heterogeneous storage into the Data Fabric. Through the Data Fabric, they can easily access tables of heterogeneous databases and submit table join tasks to a distributed computing environment. Users do not need to be familiar with distributed technologies to benefit from distributed computing and storage.
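The import-then-join workflow described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes two separate in-memory SQLite databases stand in for the heterogeneous source systems and a third stands in for the Data Fabric store, and the names `make_source`, `import_table`, `lots`, and `inspections` are hypothetical.

```python
import sqlite3


def make_source(schema_sql, insert_sql, rows):
    """Create an isolated in-memory database (a stand-in for one data island)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


def import_table(source, fabric, table, columns):
    """Copy one table from a source database into the central store (hypothetical helper)."""
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    fabric.execute(f"CREATE TABLE {table} ({col_list})")
    rows = source.execute(f"SELECT {col_list} FROM {table}").fetchall()
    fabric.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    fabric.commit()


# Two departments keep their data in separate databases that cannot see each other.
production = make_source(
    "CREATE TABLE lots (lot_id TEXT, machine TEXT)",
    "INSERT INTO lots VALUES (?, ?)",
    [("L001", "M1"), ("L002", "M2"), ("L003", "M1")],
)
quality = make_source(
    "CREATE TABLE inspections (lot_id TEXT, passed INTEGER)",
    "INSERT INTO inspections VALUES (?, ?)",
    [("L001", 1), ("L002", 0)],
)

# Assumed stand-in for the Data Fabric: one store that receives both tables.
fabric = sqlite3.connect(":memory:")
import_table(production, fabric, "lots", ["lot_id", "machine"])
import_table(quality, fabric, "inspections", ["lot_id", "passed"])

# Once both tables live in one place, a cross-department JOIN is an ordinary query.
joined = fabric.execute(
    "SELECT l.lot_id, l.machine, i.passed "
    "FROM lots AS l JOIN inspections AS i ON l.lot_id = i.lot_id "
    "ORDER BY l.lot_id"
).fetchall()
print(joined)  # [('L001', 'M1', 1), ('L002', 'M2', 0)]
```

In the system the abstract describes, the join is submitted to a distributed computing environment rather than run in a single process; the sketch only shows why centralizing the tables makes the join straightforward.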
In addition, the Distributed Data Fabric proposed in this thesis has good computing resource management capabilities: it can dynamically adjust computing resources while computing tasks are running, ensuring high utilization of overall system resources.
At present, the Distributed Data Fabric supports access to seven types of SQL/NoSQL databases and data transfer across eight types of storage protocols. This thesis also implements a web user interface and a REST API for the Data Fabric, which both reduce the cost of accessing and joining tables of heterogeneous databases and minimize the cost of learning the platform.