Data privacy comes at a cost. There are security techniques that protect sensitive user data, such as customer addresses, from attackers who might try to extract them from AI models, but these techniques often make the models less accurate.
Researchers at MIT recently developed a framework, based on a new privacy metric called PAC Privacy, that can keep sensitive data such as medical images and financial records safe from attackers while maintaining an AI model’s performance. Now, they have taken this work a step further by making the technique more computationally efficient, improving the trade-off between accuracy and privacy, and creating a formal template that can be used to privatize virtually any algorithm without needing access to that algorithm’s inner workings.
The team utilized a new version of PAC Privacy to privatize some classic algorithms for data analysis and machine learning tasks.
They also demonstrated that more “stable” algorithms are easier to privatize with this technique. A stable algorithm’s predictions remain consistent even when its training data are slightly modified. Greater stability also helps an algorithm make more accurate predictions on previously unseen data.
The researchers say the increased efficiency of the new PAC Privacy framework, and the four-step template that can be followed to implement it, would make the technique easier to deploy in real-world situations.
“We tend to think of robustness and privacy as unrelated to, or perhaps even in conflict with, constructing a high-performance algorithm. First you make a working algorithm, then you make it robust, and then private. We’ve shown that is not always the right framing,” says Mayuri Sridhar, an MIT graduate student and lead author of a paper on this privacy framework.
She was joined on the paper by Hanshen Xiao PhD ’24, who will start as an assistant professor at Purdue University in the fall, and by senior author Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering at MIT. The research will be presented at the IEEE Symposium on Security and Privacy.
Estimating noise
To protect sensitive data that were used to train an AI model, engineers often add noise, or generic randomness, to the model, making it harder for an adversary to infer the original training data. Because noise reduces a model’s accuracy, the less noise one needs to add, the better.
PAC Privacy automatically estimates the smallest amount of noise that must be added to an algorithm to achieve a desired level of privacy.
The original PAC Privacy algorithm runs a user’s AI model many times on different samples of a dataset. It measures the variance, as well as the correlations, among these many outputs and uses this information to estimate how much noise needs to be added to protect the data.
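As a rough illustration of that estimation step (a minimal sketch with placeholder names such as run_algorithm, not the authors’ implementation), it could be written in Python along these lines:

    # Minimal sketch of the estimation step described above.
    # "run_algorithm" and all constants are illustrative placeholders.
    import numpy as np

    def estimate_output_covariance(data, run_algorithm, n_trials=500,
                                   subsample_frac=0.5, seed=0):
        """Run the algorithm on many random subsamples of the dataset and
        estimate the covariance of its vector-valued outputs."""
        rng = np.random.default_rng(seed)
        n = len(data)
        outputs = []
        for _ in range(n_trials):
            idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
            outputs.append(run_algorithm(data[idx]))
        outputs = np.stack(outputs)              # shape: (n_trials, output_dim)
        return np.cov(outputs, rowvar=False)     # full output covariance matrix

    def privatize_with_covariance(output, cov, scale=1.0, seed=0):
        """Release the output plus Gaussian noise shaped by the estimated covariance."""
        rng = np.random.default_rng(seed)
        noise = rng.multivariate_normal(np.zeros(len(output)), scale * cov)
        return output + noise

Note that the sketch never looks inside run_algorithm; it treats the algorithm as a black box and works only with its outputs.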
This new variant of PAC Privacy works in the same way, but it does not need to represent the entire matrix of correlations across the outputs; it only needs the output variances.
“What you’re estimating is much smaller than the entire covariance matrix, so you can do it much faster,” explains Sridhar. This means the technique can scale to much larger datasets.
Adding noise can hurt the utility of the results, so it is important to minimize utility loss. Due to computational cost, the original PAC Privacy algorithm was limited to adding isotropic noise, which is added uniformly in all directions. The new variant estimates anisotropic noise, which is tailored to the specific characteristics of the training data. This allows users to add less overall noise to achieve the same level of privacy, boosting the accuracy of the privatized algorithm.
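In rough code terms, again a hypothetical sketch rather than the released method, the new variant keeps only one variance per output coordinate and shapes the noise to match:

    # Illustrative sketch of the two changes described above: estimate only the
    # per-coordinate output variances (cheaper than a full covariance matrix),
    # then add anisotropic noise scaled to those variances. Placeholder names.
    import numpy as np

    def estimate_output_variances(data, run_algorithm, n_trials=500,
                                  subsample_frac=0.5, seed=0):
        """Same subsampling loop as before, but keep only one variance per coordinate."""
        rng = np.random.default_rng(seed)
        n = len(data)
        outputs = []
        for _ in range(n_trials):
            idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
            outputs.append(run_algorithm(data[idx]))
        return np.stack(outputs).var(axis=0)     # d numbers instead of a d-by-d matrix

    def privatize_anisotropic(output, variances, scale=1.0, seed=0):
        """Add more noise where the output varies a lot and less where it is
        already stable, instead of one uniform (isotropic) noise level."""
        rng = np.random.default_rng(seed)
        return output + rng.normal(0.0, scale * np.sqrt(variances))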
Privacy and stability
As she studied PAC Privacy, Sridhar hypothesized that more stable algorithms would be easier to privatize with this technique. She used the more efficient variant of PAC Privacy to test this theory on several classical algorithms.
More stable algorithms show less variance in their outputs when their training data change slightly. PAC Privacy breaks a dataset into chunks, runs the algorithm on each chunk of data, and measures the variance among the outputs. The greater the variance, the more noise must be added to privatize the algorithm.
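A small toy example (the statistics and data here are invented for illustration, not taken from the paper) shows why stability matters: a trimmed mean barely moves from chunk to chunk, while a maximum swings widely, so the former would need far less added noise.

    # Toy comparison: measure how much two statistics vary across data chunks.
    # The statistics and data are chosen purely for illustration.
    import numpy as np

    def chunk_variance(data, statistic, n_chunks=10, seed=0):
        """Split the data into chunks, apply the statistic to each chunk,
        and return the variance of the resulting outputs."""
        rng = np.random.default_rng(seed)
        chunks = np.array_split(rng.permutation(data), n_chunks)
        outputs = np.array([statistic(c) for c in chunks])
        return outputs.var()

    def trimmed_mean(c):
        """Mean of the middle 80 percent of a chunk: a comparatively stable statistic."""
        lo = len(c) // 10
        return np.mean(np.sort(c)[lo:-lo])

    rng = np.random.default_rng(0)
    data = rng.lognormal(size=10_000)                            # heavy-tailed data
    print("trimmed mean:", chunk_variance(data, trimmed_mean))   # small: stable
    print("maximum:     ", chunk_variance(data, np.max))         # large: fragile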
Using stability techniques to reduce the variance in the output of the algorithm also reduces the amount of noise that needs to be added to privatize it, she explains.
“In the best case, you can get these win-win scenarios,” she says.
The team showed that the privacy guarantees remained strong across the algorithms they tested, and that the new variant of PAC Privacy required several orders of magnitude fewer trials to estimate the noise. They also tested the method in attack simulations, demonstrating that its privacy guarantees could withstand state-of-the-art attacks.
“We want to explore how algorithms could be co-designed with PAC Privacy, so the algorithms are more stable, secure, and robust from the start,” says Devadas. The researchers also want to test the method with more complex algorithms and further explore the trade-off between privacy and utility.
“The question now is: When do these win-win situations happen, and how can we make them happen more often?” says Sridhar.
“I think the key advantage that PAC Privacy has in this setting over other privacy definitions is that it is a black box: there is no need to manually analyze each individual query to privatize the results. It can be done completely automatically,” says a researcher at the University of Wisconsin at Madison who was not involved in this study.
This research is supported, in part, by Cisco Systems, Capital One, the U.S. Department of Defense, and a MathWorks Fellowship.