Macquarie University

Privacy-Preserving Data Sharing with Machine Learning

posted on 2024-03-25, 00:33 authored by Nan Wu

Data analysis methods using machine learning (ML) can unlock valuable insights from potentially private datasets provided by multiple data owners, improving revenue or quality of service. However, sharing personal data across organizations without consent is prohibited by privacy and confidentiality concerns and by regulation. It is therefore crucial to ease the tension between improving the quality of ML outcomes and protecting the privacy of the datasets being shared. In this thesis, we propose novel methods and techniques for privacy-preserving data sharing with ML that outperform state-of-the-art techniques in terms of privacy, cost, and learning outcomes, both theoretically and empirically, on synthetic and real-world datasets. Differential privacy (DP) is used throughout this thesis to provide privacy guarantees. First, we evaluate the cost of privacy in asynchronous differentially-private ML using data located on multiple private and geographically-scattered servers with different privacy settings, extending our earlier work from my MRes study (Wu et al., 2020b). We show that the cost of privacy has an upper bound that is inversely proportional to the square of the combined size of the training datasets and to the square of the sum of the privacy budgets. Experimental evaluations over financial and medical datasets confirm our theoretical analysis and show that collaboration among more than 10 data owners with privacy budgets of at least 1 yields a superior machine-learning model compared to a model trained in isolation on only one of the datasets, illustrating the value of collaboration and the cost of privacy. Next, we study and optimize the differentially private learning outcomes from data shared among multiple separate data owners, addressing the classical privacy-versus-accuracy trade-off with a game-theoretic approach.
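The intuition behind the cost-of-privacy bound can be illustrated with a minimal differentially private gradient-averaging sketch. This is a generic Laplace-mechanism example, not the thesis algorithm: per-sample gradients are clipped, averaged, and perturbed with noise whose scale shrinks as both the number of samples n and the privacy budget ϵ grow, mirroring how a larger combined dataset and a larger budget reduce the cost of privacy. All function and parameter names here are illustrative.

```python
import numpy as np

def dp_average_gradient(gradients, epsilon, clip_norm=1.0):
    """Differentially private average of per-sample gradients via the
    Laplace mechanism (an illustrative sketch, not the thesis algorithm).

    Each gradient is clipped to L1 norm `clip_norm`, so replacing one
    record changes the average by at most 2 * clip_norm / n. Laplace
    noise with scale sensitivity / epsilon is then added: the noise
    shrinks as both n and epsilon grow.
    """
    n = len(gradients)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g, 1), 1e-12))
               for g in gradients]
    avg = np.mean(clipped, axis=0)
    sensitivity = 2.0 * clip_norm / n   # L1 sensitivity of the mean
    noise = np.random.laplace(0.0, sensitivity / epsilon, size=avg.shape)
    return avg + noise
```

With 1000 samples and ϵ = 10, the noise scale is only 0.0002, so the private average is almost indistinguishable from the true one; with 10 samples and ϵ = 0.1 the same call would be dominated by noise.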
Our novel method uses fixed-point theory to model gradient-descent learning as a contraction mapping and thereby predict the learning outcomes. We use a dynamic non-cooperative game with imperfect information to find the optimal trade-off between privacy budget and learning accuracy of the differentially private models. Our experimental results over a partitioned real financial dataset show that, with the optimal choice of privacy-budget parameter from our non-cooperative game, there are significant gains in social welfare and a 47.5% improvement in privacy. Then, we evaluate the fairness bias with respect to different protected feature groups (e.g., gender or race) in ML outcomes in a real-world application, namely privacy-preserving record linkage (PPRL) between multiple data providers. We propose new notions of fairness-constrained DP and fairness- and cost-constrained DP for PPRL with theoretical proofs, and develop a novel PPRL framework that improves linkage quality under privacy, cost, and fairness constraints. Our experimental results over two datasets containing person-specific data show that, with these new notions of DP and three privacy budgets ϵ ∈ {0.1, 1.0, 10.0}, PPRL achieves up to 51.5% improvement in fairness and 37.5% improvement in linkage performance (in terms of F*-measure) compared to the standard DP notion for PPRL. In real-life applications, generating labelled data, and hence accurate trained models, is challenging due to privacy and confidentiality concerns. We propose a privacy-preserving active learning algorithm for PPRL that is constrained by the DP budget, the labelling budget, the uncertainty of data samples, and the accuracy and fairness bias of the oracle. Our experimental results for linking real and synthetic voter registration datasets show that our active learning algorithm improves linkage quality by 24% on average with an oracle accuracy of 0.8 and a privacy budget of ϵ = 1.0.
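The fixed-point view of gradient descent can be sketched on a toy problem. For a strongly convex quadratic f(w) = ½ wᵀAw − bᵀw, the update w ← T(w) = w − η(Aw − b) is a contraction whenever η < 2/λ_max(A), so by the Banach fixed-point theorem the iterates converge to the unique minimiser w* = A⁻¹b. The matrix A, vector b, and step size η below are made-up illustrative values, not drawn from the thesis.

```python
import numpy as np

# Gradient descent on f(w) = 0.5 * w^T A w - b^T w, viewed as iterating
# the fixed-point map T(w) = w - eta * (A w - b).
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
b = np.array([3.0, 2.0])
eta = 0.4                         # eta < 2/3 = 2/lambda_max, so T is a contraction

w = np.zeros(2)
for _ in range(200):
    w = w - eta * (A @ w - b)     # apply the contraction map T

w_star = np.linalg.solve(A, b)    # exact fixed point of T: the minimiser A^{-1} b
```

Here the per-coordinate contraction factors are |1 − 0.4·3| = 0.2 and |1 − 0.4·1| = 0.6, so the distance to w* shrinks geometrically, which is the property that lets the learning outcome be predicted before training completes.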
In addition, data-quality issues across databases, such as typos, errors, and variations, pose real-world challenges for record linkage. To overcome this, we propose a novel PPRL algorithm that uses unsupervised clustering techniques to link records and count the cardinality of individuals across multiple datasets without compromising their privacy or identity. Our experimental results on real and synthetic datasets are highly promising, reducing the error rate by up to 80% with a privacy budget of ϵ = 1.0 compared to state-of-the-art methods.
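The idea of clustering noisy records before counting can be sketched in a few lines. This toy example, which is not the thesis protocol (the thesis operates on encoded records shared across parties), clusters records by bigram Jaccard similarity so that variants like "Jon Smith" and "John Smith" collapse into one individual, then adds Laplace noise to the cluster count for differential privacy. The threshold and helper names are illustrative assumptions.

```python
import numpy as np

def bigrams(s):
    """Character bigrams of a lowercased, space-stripped string."""
    s = s.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dp_cardinality(records, sim_threshold=0.5, epsilon=1.0, rng=None):
    """Toy clustering-based cardinality count with DP noise.

    Records whose bigram Jaccard similarity with an existing cluster
    exceeds `sim_threshold` are merged into it (tolerating typos and
    variations); Laplace noise with scale 1/epsilon is added to the
    final cluster count.
    """
    clusters = []                         # each cluster = union of member bigram sets
    for rec in records:
        grams = bigrams(rec)
        for c in clusters:
            jacc = len(grams & c) / max(len(grams | c), 1)
            if jacc >= sim_threshold:
                c |= grams                # merge record into this cluster
                break
        else:
            clusters.append(set(grams))   # start a new cluster
    rng = rng or np.random.default_rng(0)
    return len(clusters) + rng.laplace(0.0, 1.0 / epsilon)
```

For ["Jon Smith", "John Smith", "Alice Wu"], the first two records share 6 of 9 bigrams (Jaccard ≈ 0.67) and merge, so the underlying count is 2 individuals before noise is added.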


Australian Government Research Training Program (RTP) Scholarship


Table of Contents

1. Introduction -- 2. Literature Review -- 3. Asynchronous Communication in Differentially-Private ML -- 4. Optimized Data Sharing with Differential Privacy -- 5. Fairness and Cost Constrained Privacy-Aware Record Linkage -- 6. Privacy-Preserving Active Learning for Record Linkage -- 7. Privacy-Preserving Record Linkage for Cardinality Counting -- 8. Discussion -- A. Appendix -- B. Appendix -- References

Awarding Institution

Macquarie University

Degree Type

Thesis PhD


Doctor of Philosophy

Department, Centre or School

School of Computing

Year of Award


Principal Supervisor

Dali Kaafar

Additional Supervisor 1

Hassan Asghar

Additional Supervisor 2

David Smith


Copyright: The Author




216 pages

Former Identifiers

AMIS ID: 294245
