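Author: 李少琪 (Li, Shao-Chi)
Thesis title: 數據中台上的資料血緣追蹤
Coarse-Grained Granularity Data Lineage Tracking in Data Fabric
Advisor: 蕭宏章 (Hsiao, Hung-Chang)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2024
Academic year of graduation: 112 (ROC academic year)
Language: Chinese
Number of pages: 47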
Keywords (Chinese): 數據治理, 數據追溯, 數據血緣, 圖形資料庫, 數據追蹤
Keywords (English): Data Governance, Data Tracing, Data Lineage, Graph Database, Data Tracking
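  • In today's data-driven era, data is regarded as one of the most valuable assets of enterprises and organizations. As data volumes grow and data applications spread, data processing pipelines become increasingly complex, and data management, quality, and traceability become critical. With the rise of the data fabric in particular, traditional data management methods make managing and tracing data especially difficult and can no longer meet growing demand. The reasons include gaps in documentation across systems, the complex content of domain-specific system documents, scattered file locations, and manual searches that cost time and labor. Data lineage tracking offers a solution: an effective way to track the source, flow, and impact of data.
    This study explores methods and techniques for building a data lineage system within data governance, in order to address the problems of traditional data tracing methods. With a lineage system, the evolution of data can be tracked and recorded in real time, enabling tracking and management across the data lifecycle and thereby reducing the cost of data tracing.
    To this end, this study proposes a graph-based approach: the evolution of data is recorded as a graph, and graph path queries return the lineage tracing results. We use Neo4j, currently the most popular native graph database, together with the Cypher query language to achieve faster path query response times.
    In addition, we open the lineage tracking mechanism to any system so that lineage records are more complete. Our experiments show that lineage tracing reduces the storage overhead of maintenance by a factor of nine and speeds up tracing response time by a factor of ten, resolving the problem of excessive tracing cost.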

    In today's data-driven era, data is considered one of the most valuable assets for enterprises and organizations. As data volumes grow and data applications become more widespread, data governance becomes critical. Data lineage offers a solution: it provides an effective way to track the source, flow, and impact of data and to manage data throughout its lifecycle.
    This study explores methods and technologies for establishing a data lineage system to address the shortcomings of traditional data tracing methods. Through the lineage system, data evolution processes can be recorded in real time, facilitating tracking and management throughout the data lifecycle and thereby reducing the cost of data tracing. To achieve this, we propose recording data evolution processes in a graph structure and answering lineage tracing requests with graph path queries. We use the widely used Neo4j native graph database, together with the Cypher query language, to achieve faster path query response times.
    Furthermore, we offer an open lineage tracking mechanism that any system can use, so that lineage records are more complete. Our experiments demonstrate that lineage tracing reduces the storage overhead of maintenance by a factor of 9 and improves tracing response time by a factor of 10, effectively addressing the issue of excessive tracing costs.
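
The graph-based idea in the abstract can be illustrated with a minimal in-memory sketch. This is not the thesis implementation, which uses Neo4j and Cypher. The class name `LineageGraph`, its methods, and the file names are hypothetical, and the sketch assumes coarse-grained lineage in which nodes are whole datasets and edges are processing actions. In Neo4j, the same upstream question would typically be expressed as a variable-length path pattern in Cypher.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical coarse-grained lineage store.

    Nodes are datasets; each edge records the processing action
    that derived one dataset from another.
    """

    def __init__(self):
        # Maps each derived dataset to a list of (source dataset, action) pairs.
        self._parents = defaultdict(list)

    def record(self, source, action, target):
        """Record that `target` was produced from `source` by `action`."""
        self._parents[target].append((source, action))

    def trace_upstream(self, dataset):
        """Return every ancestor of `dataset`, nearest first (breadth-first)."""
        seen = set()
        order = []
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries for datasets with no parents.
            for source, _action in self._parents.get(current, []):
                if source not in seen:
                    seen.add(source)
                    order.append(source)
                    queue.append(source)
        return order


# Illustrative data only: these file names are made up for the example.
graph = LineageGraph()
graph.record("raw_sensor.csv", "clean", "cleaned.csv")
graph.record("cleaned.csv", "aggregate", "daily_report.csv")
graph.record("calendar.csv", "aggregate", "daily_report.csv")

ancestors = graph.trace_upstream("daily_report.csv")
print(ancestors)
```

Recording lineage as edges at write time means a trace only walks the relevant part of the graph, which is the intuition behind replacing manual searches through scattered documents with a path query.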

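    Abstract (Chinese)
    Extended Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        Section 1 Research Motivation
        Section 2 Research Objectives
        Section 3 Research Methods and Results
        Section 4 Thesis Structure
    Chapter 2 Background
        Data Lineage
        Data Fabric
        Neo4j
    Chapter 3 System Architecture Design
        Section 1 Graph Structure for Data Lineage Tracking
        Section 2 Data Lineage Tracing
        Section 3 Overview of Usage Scenarios
    Chapter 4 Lineage System Features
        Section 1 Lineage Tracing
        Section 2 Retrieving Information on Processing Actions
        Section 3 Applying Lineage Tracking to External Systems
    Chapter 5 Experiments
        Section 1 Experimental Environment
        Section 2 Experimental Method
            Experiment 1 Resource Overhead of the Lineage Tracking Mechanism on the Original Data Fabric
            Experiment 2 Comparison of Storage Space Used by Data Records
            Experiment 3 Graph Tracing Response Time
    Chapter 6 Related Work
    Chapter 7 Conclusion and Future Work
    References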

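    Full text, on campus: not publicly available
    Full text, off campus: not publicly available
    The electronic thesis has not been authorized for public release; for the print copy, please consult the library catalog.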