| Graduate Student: | Liang, Chen-Hsun (梁宸熏) |
|---|---|
| Thesis Title: | A Query-Flooding-Resistant Microaggregation Method Based on Utility-Enhanced Differential Privacy for Data Publishing Services (應用於資料發布服務上基於效用強化差分隱私之抗洪水詢問微聚合方法) |
| Advisor: | Kuo, Yau-Hwang (郭耀煌) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2025 |
| Graduating Academic Year: | 113 |
| Language: | English |
| Pages: | 107 |
| Keywords: | Data Publishing Services, Microaggregation, Differential Privacy, Utility-Enhanced, Query-Flooding Resistance |
With the rapid development of machine learning applications, the large volumes of data provided by Data Publishing Services (DPS) have become a critical foundation for supporting the machine learning industry chain. In general, the access protocol of a DPS should balance privacy protection and data utility: the former defends against various re-identification attacks targeting sensitive personal attributes, while the latter preserves the service quality of the resulting models. Existing solutions commonly adopt microaggregation, where all records within the same cluster are replaced with their centroid values; Differential Privacy (DP) is then applied by injecting noise into the centroids to defend against common re-identification attacks. However, such mechanisms fail to resist more advanced Query-Flooding-Based re-identification attacks. Moreover, the noise introduced by DP often distorts the data, which in turn degrades the utility of machine learning applications. This highlights the urgent need for a DPS that properly resolves the conflict between data security and utility within the machine learning industry chain.
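The microaggregation-plus-DP pipeline described above can be illustrated with a minimal univariate sketch. The function names, the simple sorted fixed-size grouping, and the parameter choices below are assumptions for illustration only; practical microaggregation uses multivariate methods such as MDAV, and this is not the thesis's own algorithm:

```python
import math
import random
import statistics

def microaggregate(values, k):
    """Replace each value with the centroid of its cluster.
    Minimal univariate sketch: sort, cut into groups of k, and merge
    an undersized tail group into its predecessor. Released values
    come out in sorted order."""
    order = sorted(values)
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    return [statistics.mean(g) for g in groups for _ in g]

def laplace_sample(scale):
    """Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_centroids(values, k, sensitivity, epsilon):
    """Microaggregate, then perturb each released value with Laplace
    noise of scale sensitivity / epsilon (the standard Laplace mechanism)."""
    scale = sensitivity / epsilon
    return [c + laplace_sample(scale) for c in microaggregate(values, k)]
```

Because every record in a cluster is released as the same noisy centroid, an attacker who sees only the published table cannot single out an individual record within a cluster; the noise additionally bounds what any one centroid reveals.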
To this end, this study proposes a Query-Flooding-Resistant Microaggregation method based on Utility-Enhanced Differential Privacy. The method not only defends against advanced Query-Flooding-Based re-identification attacks, but also reduces the utility loss typically incurred when privacy protection is applied to machine learning applications. In the Query-Flooding-Resistant Microaggregation component, the Query-Flooding-Resistant Centroid Generation mechanism ensures that the centroid derived for each cluster can resist attacks that attempt to eliminate noise by aggregating a large number of query results. Meanwhile, the Sensitivity-Optimization k-member Clustering mechanism weighs both security and utility requirements to determine the optimal noise magnitude to inject into each centroid, preventing excessive noise that would harm machine learning utility. In the Utility-Enhanced Differential Privacy component, the Discrimination-Aware Privacy Budget Allocator accounts for the differing degrees of potential information leakage across the attributes of the database and allocates each attribute a commensurate privacy budget; the Sensitivity-Reduction Noise Injector then injects the amount of noise corresponding to each attribute's assigned budget.
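A rough numerical intuition for a discrimination-aware budget split: attributes whose value distributions carry more information can be given a larger share of the total privacy budget, so the Laplace scale (sensitivity / epsilon_i) injected there is smaller. The entropy-based proportional rule below is an assumed stand-in for illustration; the thesis's actual allocation rule is not specified in this abstract:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Empirical Shannon entropy (in bits) of one column's values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def allocate_budget(columns, total_epsilon):
    """Split total_epsilon across attributes in proportion to their
    empirical entropy. Assumes at least one column is non-constant,
    otherwise the proportional weights are undefined."""
    scores = {name: shannon_entropy(col) for name, col in columns.items()}
    total = sum(scores.values())
    return {name: total_epsilon * s / total for name, s in scores.items()}
```

Since the Laplace mechanism's noise scale is inversely proportional to the per-attribute budget, the high-leakage attributes end up less distorted, which is the utility-enhancement idea the component targets.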
Finally, this thesis provides a theoretical analysis showing that the proposed method effectively defends against Query-Flooding-Based re-identification attacks. Experimental results further demonstrate that, across various machine learning scenarios, the proposed method achieves up to a 25.58% improvement in accuracy over existing methods; under the same level of privacy protection, the improvement reaches up to 41.29%. These theoretical and empirical results indicate that the proposed method strengthens privacy protection while maintaining the utility of machine learning, offering a reliable foundation for the machine learning industry chain.
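The flooding attack targeted by the analysis can be demonstrated numerically: when a fresh Laplace draw is added to every response, the variance of the average of n responses shrinks by a factor of n, so a large flood of repeated queries recovers the true answer; a noise value fixed once per released centroid, by contrast, cannot be averaged away. This is a toy model under stated assumptions (the `TRUE_ANSWER` value and the consistent-noise defence shown are illustrative, not the thesis's mechanism):

```python
import random

random.seed(0)

def laplace_sample(scale):
    """Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

TRUE_ANSWER = 50.0   # hypothetical exact query answer
SCALE = 10.0         # Laplace scale b; per-response variance is 2 * b**2
N_QUERIES = 100_000  # size of the query flood

# Fresh noise per response: the mean of n responses has variance
# 2 * b**2 / n, so the flood converges on the true answer.
fresh_avg = sum(TRUE_ANSWER + laplace_sample(SCALE)
                for _ in range(N_QUERIES)) / N_QUERIES

# Consistent noise: the same perturbed value is released every time,
# so averaging any number of responses leaves the offset intact.
fixed_offset = laplace_sample(SCALE)
consistent_avg = sum(TRUE_ANSWER + fixed_offset
                     for _ in range(N_QUERIES)) / N_QUERIES
```

With 100,000 fresh-noise responses the averaged answer lands within a fraction of a unit of the true value, while the consistent-noise average never moves off the fixed perturbation, matching the intuition behind flooding resistance.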