Why it’s important to move beyond overly aggregated machine learning metrics | Massachusetts Institute of Technology News



MIT researchers have identified significant instances in which machine learning models fail when applied to data different from the data they were trained on, underscoring the need to test models each time they are introduced into a new environment.

“Even if you train a model on a large amount of data and choose the best average model, we show that in new settings, this ‘best model’ can end up being the worst model for 6 to 75 percent of the new data,” says Marzyeh Ghassemi, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science, and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).

In a paper presented at the Conference on Neural Information Processing Systems (NeurIPS 2025) in December, the researchers note that a model trained to diagnose a disease from chest X-rays in one hospital, for example, may look effective, on average, when deployed in another hospital. Their finer-grained evaluation, however, revealed that some of the models that performed best in the first hospital performed worst for up to 75 percent of patients in the second hospital, even though their performance averaged over all of that hospital’s patients remained high; the aggregate figure masked the failure.
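
To make the aggregation problem concrete, here is a minimal illustration, using made-up numbers rather than the study’s data, of how a respectable average accuracy can coexist with a subgroup on which a model fails most of the time:

```python
# Illustrative sketch with synthetic numbers (not the researchers' data):
# an aggregate accuracy can hide a subgroup on which the model fails.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical evaluation at a "new hospital": 1,000 patients,
# 250 of whom belong to a subgroup the model handles badly.
n_total, n_subgroup = 1000, 250
in_subgroup = np.zeros(n_total, dtype=bool)
in_subgroup[:n_subgroup] = True

# Whether each prediction is correct: ~40% accuracy in the subgroup,
# ~95% accuracy everywhere else.
correct = np.where(in_subgroup,
                   rng.random(n_total) < 0.40,
                   rng.random(n_total) < 0.95)

print(f"Aggregate accuracy: {correct.mean():.2f}")               # ~0.81, looks fine
print(f"Subgroup accuracy:  {correct[in_subgroup].mean():.2f}")  # ~0.40, hidden failure
```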

Their findings show that spurious correlations still occur and put a model’s reliability at risk in new settings, even though simply improving the model’s performance on the data it has observed is often assumed to mitigate them. A simple example of a spurious correlation: an image classifier that has not “seen” many pictures of cows at the beach may misclassify a photo of a cow on a beach as a killer whale just because of the background. In the areas the researchers examined, such as chest X-rays, histopathology images of cancer, and hate speech detection, spurious correlations are much harder to detect.

For example, a medical diagnostic model trained on chest X-rays might learn to associate certain unrelated marks on X-rays from one hospital with a particular medical condition. In another hospital where those marks are not used, the model may miss the pathology.

Previous research by Ghassemi’s group has shown that models can incorrectly correlate medical findings with factors such as age, gender, and race. For example, if a model is trained mostly on chest X-rays of elderly people with pneumonia and doesn’t “see” as many X-rays of younger people with the condition, it might learn to predict that only older patients have pneumonia.

“We want to teach the model how to see the patient’s anatomy so it can make decisions based on that,” says Olawale Salaudeen, an MIT postdoctoral fellow and lead author of the paper. “But in reality, anything in the data that correlates with decisions can be used by the model. And those correlations may not actually be robust to changes in the environment, potentially making the model’s predictions an unreliable source of information for decisions.”

Spurious correlations also raise the risk of biased decision-making. In the NeurIPS paper, the researchers showed, for example, that a chest X-ray model whose overall diagnostic performance improved actually performed worse for patients with pleural disease or cardiomediastinal enlargement, that is, enlargement of the heart or the middle region of the chest cavity.

Other authors of the paper include doctoral students Haoran Zhang and Kumail Alhamoud, EECS assistant professor Sara Beery, and Ghassemi.

Previous research generally assumed that models ranked from best to worst by performance in one setting would keep that ranking when applied to a new setting, a phenomenon known as “accuracy on the line.” The researchers, however, were able to demonstrate examples where a model that performed best in one setting performed worst in another.
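
One simple way to probe whether accuracy on the line holds is to compare how a collection of models ranks in the two settings. The sketch below uses hypothetical accuracy numbers, not figures from the paper, and a Spearman rank correlation:

```python
# Sketch with assumed numbers (not from the paper): checking whether models
# keep their ranking when moved from the training setting to a new one.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical accuracies for a small collection of trained models.
id_accuracy  = np.array([0.92, 0.90, 0.88, 0.85, 0.80])  # first hospital (in-distribution)
ood_accuracy = np.array([0.55, 0.70, 0.74, 0.78, 0.76])  # second hospital (new setting)

rho, _ = spearmanr(id_accuracy, ood_accuracy)
print(f"Rank correlation between settings: {rho:.2f}")
# A strongly negative value means the best in-distribution models tend to be
# among the worst in the new setting -- the failure mode the researchers report.
```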

Salaudeen devised an algorithm called OODSelect to find instances where accuracy on the line breaks down. In essence, he trained thousands of models on in-distribution data, that is, data from the initial setting, and calculated each model’s accuracy there. He then applied the models to data from a second setting. If the models with the highest accuracy in the first setting were the most incorrect on a group of examples in the second setting, that group identified a problematic subset, or subpopulation. Salaudeen also highlights the dangers of aggregate statistics for evaluation, which can obscure finer-grained and important information about a model’s performance.
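
The researchers’ released code is the authoritative reference for OODSelect; the sketch below is only a simplified reading of the procedure described above, using hypothetical model accuracies and per-example correctness to flag out-of-distribution examples on which the strongest in-distribution models do worst:

```python
# Simplified sketch of the idea described above (not the released OODSelect
# implementation): score each out-of-distribution (OOD) example by how
# negatively the models' in-distribution accuracy relates to getting that
# example right, then keep the examples where the "best" models do worst.
import numpy as np

def select_ood_subset(id_accuracy, ood_correct, fraction=0.1):
    """
    id_accuracy : (n_models,) in-distribution accuracy of each trained model
    ood_correct : (n_models, n_examples) 1.0 if a model got an OOD example right
    fraction    : fraction of OOD examples to flag
    Returns indices of examples where higher in-distribution accuracy tends
    to go with getting the example wrong.
    """
    acc_centered = id_accuracy - id_accuracy.mean()
    ok_centered = ood_correct - ood_correct.mean(axis=0, keepdims=True)
    # Per-example covariance between in-distribution accuracy and OOD correctness.
    score = acc_centered @ ok_centered / len(id_accuracy)
    n_keep = max(1, int(fraction * ood_correct.shape[1]))
    return np.argsort(score)[:n_keep]  # most negative scores first

# Hypothetical usage with random placeholder data:
rng = np.random.default_rng(1)
id_acc = rng.uniform(0.7, 0.95, size=200)               # 200 models
ood_ok = (rng.random((200, 5000)) < 0.8).astype(float)  # 5,000 OOD examples
subset = select_ood_subset(id_acc, ood_ok, fraction=0.05)
print(f"Flagged {len(subset)} examples for closer inspection")
```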

In the course of their work, the researchers isolated the most frequently misclassified examples to avoid confusing spurious correlations in the dataset with cases that were simply difficult to classify.

Alongside the NeurIPS paper, the researchers released their code and several of the identified subsets to support future research.

Once a hospital or other organization employing machine learning identifies a subset of data on which a model underperforms, it can use that information to improve the model for a specific task or setting. The researchers recommend that future work use OODSelect to highlight evaluation targets and to design approaches that improve performance more consistently.

“We hope that the released code and the OODSelect subset will serve as a stepping stone to benchmarks and models that combat the negative effects of spurious correlations,” the researchers wrote.


