Graduate student: | 李少琪 Li, Shao-Chi |
---|---|
Thesis title: | 數據中台上的資料血緣追蹤 Coarse-Grained Granularity Data Lineage Tracking in Data Fabric |
Advisor: | 蕭宏章 Hsiao, Hung-Chang |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2024 |
Academic year of graduation: | 112 |
Language: | Chinese |
Number of pages: | 47 |
Chinese keywords: | 數據治理、數據追溯、數據血緣、圖形資料庫、數據追蹤 |
English keywords: | Data Governance, Data Tracing, Data Lineage, Graph Database, Data Tracking |
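In today's data-driven era, data is regarded as one of the most valuable assets of enterprises and organizations. As data volumes grow and data applications spread, data processing pipelines become increasingly complex, and data management, quality, and traceability become critical. Particularly now, with the rise of the data fabric (數據中台), traditional data management methods make managing and tracing data especially difficult and can no longer meet growing demand. The causes include gaps in documentation across systems, the complex content of domain-specific system documents, scattered file locations, and time-consuming, labor-intensive manual searches. Data lineage tracking offers a solution: it provides an effective way to track the source, flow, and impact of data.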
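This study explores methods and techniques for building a data lineage system within data governance, in order to address the problems of traditional data tracing methods. With a lineage system, the evolution of data can be tracked and recorded in real time, which enables tracking and management across the data lifecycle and thereby reduces the cost of data tracing.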
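To this end, this study takes the graph structure as its underlying concept: the evolution of data is recorded as a graph, and lineage tracing requests are answered with graph path queries. We use the popular Neo4j native graph database together with the Cypher query language to shorten path query response times.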
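In addition, we open the lineage tracking mechanism to any system so that lineage records can be more complete. Our experiments show that lineage tracing reduces maintenance space overhead by a factor of nine and shortens tracing response time by a factor of ten, which resolves the problem of excessive tracing cost.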
In today's data-driven era, data is considered one of the most valuable assets for enterprises and organizations. As data volumes grow and data applications become more widespread, data governance becomes critical. Data lineage offers an effective way to track the source, flow, and impact of data and to manage data across its whole lifecycle.
This study explores methods and technologies for establishing a data lineage system to address the shortcomings of traditional data tracing methods. Through the lineage system, data evolution processes can be recorded in real time, facilitating tracking and management throughout the data lifecycle and thereby reducing the cost of data tracing. To achieve this, we propose recording data evolution processes in a graph structure and answering lineage tracing requests with graph path queries. We leverage the widely used Neo4j native graph database, coupled with the Cypher query language, to achieve faster path query response times.
Furthermore, we offer an open lineage tracking mechanism that any system can use, which makes lineage records more complete. Our experiments demonstrate that lineage tracing reduces maintenance space overhead by a factor of nine and shortens tracing response time by a factor of ten, effectively addressing the issue of excessive tracing costs.
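The idea in the abstract of recording data evolution as a graph and answering tracing requests with path queries can be illustrated with a minimal sketch. It is not the thesis's implementation: the class `LineageGraph`, its methods `record` and `trace_upstream`, and the dataset names are all hypothetical, and the graph is held in memory in plain Python rather than in Neo4j so that the example runs without a database. In a graph database, the upstream traversal shown here would typically be expressed as a variable-length path query.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Illustrative in-memory, coarse-grained lineage graph (hypothetical API)."""

    def __init__(self):
        # Maps each dataset to the set of (source, operation) pairs it was derived from.
        self._parents = defaultdict(set)

    def record(self, source, target, operation):
        """Record that `target` was produced from `source` by `operation`.

        Any system could call this to report a lineage event.
        """
        self._parents[target].add((source, operation))

    def trace_upstream(self, dataset):
        """Return all ancestors of `dataset`, nearest first (breadth-first)."""
        seen = {dataset}
        order = []
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            # Sort for a deterministic visiting order within one level.
            for source, _operation in sorted(self._parents.get(current, ())):
                if source not in seen:
                    seen.add(source)
                    order.append(source)
                    queue.append(source)
        return order


# Hypothetical lineage events reported by different systems.
graph = LineageGraph()
graph.record("raw_orders.csv", "clean_orders", "cleanse")
graph.record("customers.db", "orders_report", "join")
graph.record("clean_orders", "orders_report", "join")

print(graph.trace_upstream("orders_report"))
```

Storing only dataset-level edges, rather than row-level or cell-level provenance, is what keeps a coarse-grained lineage record small; the trade-off is that a trace identifies which datasets contributed, not which individual records did.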