MIT scientists investigate memorization risks in the age of clinical AI



What is patient privacy for? The Hippocratic Oath, considered one of the oldest and most widely known medical ethics documents in the world, states: “Whatever I see or hear in a patient’s life, whether related to professional practice or not, that ought not to be shared with outsiders, I will keep to myself, considering all such things to be confidential.”

As privacy becomes increasingly scarce in an era of data-hungry algorithms and cyber-attacks, healthcare is one of the few areas where confidentiality remains central to practice, allowing patients to trust their doctors with sensitive information.

But a paper co-authored by MIT researchers examines how artificial intelligence models trained on de-identified electronic health records (EHRs) can memorize patient-specific information. The study, recently presented at the 2025 Conference on Neural Information Processing Systems (NeurIPS), recommends rigorous testing protocols to ensure that no information can be extracted through targeted prompts, and stresses that leaks need to be evaluated in their clinical context to determine whether they meaningfully compromise patient privacy.

Foundation models trained on EHRs typically draw on many patient records, generalizing across them to make better predictions. With memorization, by contrast, the model reproduces information from a single patient’s record in its output, which can compromise that patient’s privacy. Foundation models are already known to be prone to this kind of data leakage.
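To make the distinction concrete, the sketch below shows one minimal form a memorization probe could take, assuming a hypothetical EHR foundation model with a simple text-completion interface; the model object, its complete method, and the record format are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of a memorization probe. The ehr_model object and its
# complete() method are hypothetical stand-ins for an EHR foundation model
# with a text-completion interface; they are not from the paper.

def memorization_probe(ehr_model, patient_record, hidden_field):
    """Withhold one field of a de-identified record and check whether
    the model reproduces that patient's specific value."""
    true_value = str(patient_record[hidden_field])

    # Build a prompt from every field except the withheld one.
    context = ", ".join(
        f"{name}: {value}"
        for name, value in patient_record.items()
        if name != hidden_field
    )
    completion = ehr_model.complete(f"{context}, {hidden_field}:")

    # A match with the withheld, patient-specific value (rather than a
    # population-typical one) suggests the record may have been memorized.
    return true_value in completion
```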

“The knowledge held by these large models can be a resource for many communities, but it can also invite hostile adversaries to try to extract training data from the models,” said Sana Tonekaboni, a postdoctoral fellow at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and lead author of the paper. Given the risk that foundation models may memorize personal data, “this work is a step toward ensuring there are practical evaluation steps the community can take before releasing a model,” she said.

To investigate the potential risks that EHR-based models may pose in medicine, Tonekaboni turned to MIT Associate Professor Marzyeh Ghassemi, a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) and a principal investigator at the Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic). Ghassemi is a faculty member in the MIT Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science, where she leads the Healthy ML group, which focuses on robust machine learning in health.

How much information would a malicious party need in order to expose sensitive data? And how harmful would the leaked information be? To find out, the research team developed a series of tests that they hope will form the basis of future privacy assessments. The tests are designed to gauge the actual risk to patients by measuring different types of uncertainty and the likelihood of success at different stages of an attack.
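As one illustration of how such staged tests could be organized, the sketch below measures how often a withheld value leaks as the attacker’s prior knowledge of the record grows, reusing the hypothetical memorization probe from the earlier sketch. It is a simplified stand-in for the paper’s evaluation, not a reproduction of it.

```python
# A hedged sketch of a staged leakage test: how often does a target field
# leak when the attacker already knows k other fields of the record?
# Reuses the hypothetical memorization_probe helper sketched above.

import random

def leakage_vs_attacker_knowledge(ehr_model, patient_record, target_field,
                                  trials=20):
    """Return an estimated leak rate keyed by the number of known fields."""
    other_fields = [f for f in patient_record if f != target_field]
    leak_rate = {}

    for k in range(1, len(other_fields) + 1):
        hits = 0
        for _ in range(trials):
            # The attacker knows a random subset of k fields and wants the
            # target field; build that partial record and probe it.
            known = random.sample(other_fields, k)
            partial = {f: patient_record[f] for f in known}
            partial[target_field] = patient_record[target_field]
            if memorization_probe(ehr_model, partial, target_field):
                hits += 1
        leak_rate[k] = hits / trials

    return leak_rate
```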

“I was really trying to emphasize practicality here. If an attacker needs to know the dates and values of a dozen lab tests from your record in order to extract information, there is little risk of harm. If you already have access to that level of protected source data, why attack a large foundation model for more?” Ghassemi said.

The digitization of medical records has become inevitable, and data breaches have become more common. In the past 24 months, the U.S. Department of Health and Human Services has recorded 747 breaches of health information, each affecting 500 or more individuals, with the majority classified as hacking/IT incidents.

Patients with unusual symptoms are especially vulnerable because they are easier to identify. “Even with anonymized data, it depends on what information you divulge about the individual,” Tonekaboni said. “Once we identify them, we can learn more.”

In structured tests, the researchers found that the more information an attacker already had about a particular patient, the more likely the model was to leak additional details. They also demonstrated how to distinguish cases of model generalization from cases of patient-level memorization in order to assess privacy risk properly.
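The sketch below illustrates one plausible way to draw that line, again using the hypothetical probe from above: if the model reproduces the real patient’s withheld value but fails to do so for matched counterfactual records, its behavior looks patient-specific rather than population-level. This is an illustrative simplification, not the authors’ actual procedure.

```python
# An illustrative (not the paper's) check separating memorization from
# generalization. Counterfactual records share the real patient's context
# fields but carry different, plausible values for the withheld field.

def memorization_vs_generalization(ehr_model, real_record,
                                   counterfactual_records, hidden_field):
    real_hit = memorization_probe(ehr_model, real_record, hidden_field)
    counterfactual_rate = sum(
        memorization_probe(ehr_model, record, hidden_field)
        for record in counterfactual_records
    ) / len(counterfactual_records)

    # Reproducing only the real patient's value points to memorization;
    # reproducing counterfactual values too suggests the model is simply
    # predicting well from context, i.e., generalizing.
    if real_hit and counterfactual_rate < 0.5:
        return "possible patient-level memorization"
    if real_hit:
        return "consistent with generalization"
    return "no leak detected"
```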

The paper also highlights that some breaches are more harmful than others. A model that reveals a patient’s age or demographics, for example, produces a more benign leak than one that reveals sensitive information such as an HIV diagnosis or a history of alcohol abuse.
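As a toy illustration of that grading, a risk score might weight each leaked attribute by its sensitivity; the weights below are made-up placeholders, not values from the paper.

```python
# Toy severity weighting for leaked attributes. All weights are invented
# placeholders illustrating that not all leaks are equally harmful.

SENSITIVITY = {
    "age": 0.1,
    "sex": 0.1,
    "lab_values": 0.4,
    "hiv_status": 1.0,
    "alcohol_use": 0.9,
}

def weighted_leak_risk(leaked_fields):
    """Aggregate the severity of a set of leaked attributes."""
    return sum(SENSITIVITY.get(field, 0.5) for field in leaked_fields)

# Leaking demographics alone scores far lower than leaking a sensitive
# diagnosis.
print(weighted_leak_risk(["age", "sex"]))   # 0.2
print(weighted_leak_risk(["hiv_status"]))   # 1.0
```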

Because patients with unusual symptoms are easier to single out, the researchers note that they may require higher levels of protection. The team plans to expand the study in a more interdisciplinary direction, bringing in clinicians, legal experts, and privacy specialists.

“There’s a reason our health data is private,” Tonekaboni said. “There’s no reason for anyone else to know about it.”

This research was supported by the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, the Wallenberg AI, Autonomous Systems and Software Program funded by the Knut and Alice Wallenberg Foundation, the National Science Foundation (NSF), a Gordon and Betty Moore Foundation award, a Google Research Scholar Award, and Schmidt Sciences’ AI2050 program. Some of the resources used in preparing this study were provided by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.


