| Graduate student: | 王喬韋 Wang, Chiao-Wei |
|---|---|
| Thesis title: | 通過整合資料虛擬化和資料編目邁向 Data Fabric (Towards Data Fabric with the Integration of Data Virtualization and Data Catalog) |
| Advisor: | 蕭宏章 Hsiao, Hung-Chang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Graduate Program of Artificial Intelligence |
| Year of publication: | 2022 |
| Academic year of graduation: | 110 |
| Language: | Chinese |
| Number of pages: | 48 |
| Keywords: | Data Fabric, Data Silos (資料孤島), Data Virtualization (資料虛擬化), Data Catalog (資料編目), Heterogeneous Database System (異質資料庫系統) |
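The emergence of "data middle platform" (數據中台) systems has gradually removed the physical barriers between database management systems (DBMS), so integrating heterogeneous database systems is no longer a hard problem. Yet although a data middle platform builds bridges between data silos, that does not mean anyone crosses them. Our industry partner suffers from exactly this: after working inside data silos for a long time, data analysts have developed an affinity for the databases they know well and find it hard to leave their familiar data domains to mine value from unfamiliar data. We call this "data silo inertia", and it keeps data transformation from showing results. How can users of a data middle platform be induced to actively explore unfamiliar data domains and break free of data silo inertia? This is the main question of this study.
To address data silo inertia, we continue our cooperation with a well-known integrated-circuit packaging and testing company and upgrade the previously co-developed "AIoT data middle platform" system into "AIoT Data Fabric". The aim is to apply a search engine and a recommender system so that AIoT Data Fabric users can actively explore the data held within the enterprise, uncover hidden gems, and have them recommended to data analysts who can refine their value. With this as the core value proposition, the traditionally passive database management system is transformed into a Data Fabric that actively provides analytic insight.
The Data Fabric concept proposed in this thesis extends Data Virtualization and Data Catalog: it integrates the two and supplements them with a recommender system to achieve a 1+1>2 effect. Building on this integration concept, the thesis further discusses the design and implementation of a Data Fabric prototype.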
With the prevalence of Data Virtualization, the physical obstacles to integrating heterogeneous DBMSs (Database Management Systems) have largely disappeared. Although Data Virtualization bridges the gaps between Data Silos, that does not necessarily mean those bridges are bustling. In fact, our partner corporation has been suffering from stalled Data Integration. Having been confined to Data Silos for decades, its data analysts are tightly coupled to the DBMSs they know, have a non-negligible affinity for them, and find it hard to discover value in other DBMSs they know little about. We call this phenomenon "The Inertia of Data Silos", and it nullifies our integration efforts. How can we lead our users to overcome The Inertia of Data Silos? This is the topic of our research.
To tackle The Inertia of Data Silos, we continue our cooperation with a well-known semiconductor packaging and testing corporation and evolve our earlier work, "AIoT Data Virtualization", into "AIoT Data Fabric", with the vision of encouraging data analysts to exploit the unexplored Business Intelligence (BI) that heterogeneous data integration makes available. With this value proposition, we aim to transform the traditional, passive DBMS into an insightful Data Fabric.
We introduce our 1+1>2 blueprint for the Data Fabric, which integrates Data Virtualization and Data Catalog and augments them with a recommender system. Beyond the blueprint, we present the design and implementation of a prototype of this 1+1>2 Data Fabric.
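The integration described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which this abstract does not detail. Everything in it is an assumption made for illustration: the two in-memory SQLite databases stand in for heterogeneous silos, the source, table, column, and user names are hypothetical, and a simple co-usage overlap score stands in for the recommender system.

```python
import sqlite3
from collections import Counter

def make_source(script):
    """Create an in-memory SQLite database standing in for one data silo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn

# Two hypothetical silos (names and schemas are illustrative only).
sources = {
    "mes": make_source("""
        CREATE TABLE lots (lot_id TEXT, machine_id TEXT);
        INSERT INTO lots VALUES ('L1', 'M1'), ('L2', 'M2');
    """),
    "iot": make_source("""
        CREATE TABLE sensor_readings (machine_id TEXT, temperature REAL);
        INSERT INTO sensor_readings VALUES ('M1', 71.5), ('M2', 64.0);
    """),
}

def build_catalog(sources):
    """Data catalog: map each 'source.table' name to its column names."""
    catalog = {}
    for name, conn in sources.items():
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
        for (table,) in tables:
            info = conn.execute(f"PRAGMA table_info({table})")
            catalog[f"{name}.{table}"] = [row[1] for row in info]
    return catalog

def search_catalog(catalog, keyword):
    """Search: tables whose name or any column name contains the keyword."""
    kw = keyword.lower()
    return sorted(t for t, cols in catalog.items()
                  if kw in t.lower() or any(kw in c.lower() for c in cols))

def fetch(sources, qualified):
    """Read every row of 'source.table' as a list of dicts."""
    src, table = qualified.split(".")
    cur = sources[src].execute(f"SELECT * FROM {table}")
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur]

def virtual_join(sources, left, right, key):
    """Data virtualization: inner-join tables from two sources in memory."""
    index = {}
    for r in fetch(sources, right):
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in fetch(sources, left)
            for r in index.get(l[key], [])]

def recommend(usage, user, top_n=1):
    """Suggest unvisited tables, scored by overlap with other users' usage."""
    mine = usage[user]
    scores = Counter()
    for other, theirs in usage.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        for table in theirs - mine:
            scores[table] += overlap
    return [t for t, s in scores.most_common(top_n) if s > 0]

catalog = build_catalog(sources)
matches = search_catalog(catalog, "machine")
usage = {"alice": {"mes.lots"}, "bob": {"mes.lots", "iot.sensor_readings"}}
suggested = recommend(usage, "alice")
rows = virtual_join(sources, "mes.lots", suggested[0], "machine_id")
```

Here the catalog tells an analyst which tables exist across silos, the recommendation points "alice" toward a table outside her usual silo, and the virtual join lets her combine it with familiar data without copying either source. A production recommender would more likely learn from implicit feedback at scale than use this overlap count.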