New method improves reliability of statistical inference



Suppose that an environmental scientist is studying whether exposure to air pollution is associated with lower birth weight in a particular county.

Because machine learning techniques are particularly good at learning complex relationships, it is tempting to train a machine learning model to estimate the magnitude of this association.

Standard machine learning techniques are good at making predictions, and those predictions can come with uncertainty measures such as confidence intervals. However, they typically do not provide estimates, or confidence intervals, for whether and how strongly two variables are associated. Other methods have been developed specifically to estimate such associations and to provide confidence intervals for them. But the MIT researchers found that, in a spatial setting, these confidence intervals can be completely off the mark.

When variables such as air pollution or precipitation vary from place to place, common methods for generating confidence intervals can claim high confidence even though the intervals fail to capture the true values. Such invalid confidence intervals can mislead users into trusting a model that is actually failing.

After identifying this deficiency, the researchers developed a new method designed to produce valid confidence intervals for problems involving data that vary across space. In simulations and experiments using real data, their method was the only one that consistently produced accurate confidence intervals.

The study could help researchers in fields such as environmental science, economics and epidemiology better understand when to trust the results of a particular experiment.

“There are a lot of problems where people are interested in understanding phenomena in the world, such as weather or forest management. We’ve shown that there are better ways to improve performance, better understand what’s going on, and get more reliable results for this wide range of problems,” says Tamara Broderick, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society (IDSS), an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and senior author of the study.

Broderick is joined on the paper by co-lead authors David R. Burt, a postdoc, and Renato Berlinghieri, an EECS graduate student, as well as Stephen Bates, an assistant professor in EECS and a member of LIDS. The research was recently presented at the Conference on Neural Information Processing Systems (NeurIPS).

An invalid assumption

Studying spatial associations means examining how variables and outcomes are related across geographic areas. For example, one might want to study how tree cover across the United States is related to elevation.

To solve this type of problem, scientists can collect observational data from many locations and use it to infer associations at other locations where data is missing.

The MIT researchers found that existing methods often produced completely wrong confidence intervals in this case. A model might say with 95% confidence that its estimate captures the true relationship between tree cover and elevation, even though it does not capture it at all.

After investigating this issue, the researchers determined that the assumptions these confidence interval methods rely on do not hold when data vary spatially.

Assumptions are like rules that must be followed to ensure that the results of a statistical analysis are valid. Common methods for generating confidence intervals rely on several such assumptions.

First, these methods assume that the source data (the observations collected to train the model) are independent and identically distributed. This means that the probability of one location being included in the data does not depend on whether any other location is included. In practice, however, U.S. Environmental Protection Agency (EPA) air quality sensors are placed with the locations of other sensors in mind.
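As a toy illustration of how sensor placement can violate independence (the scenario, distances, and numbers below are invented for illustration, not taken from the study), compare sites drawn completely at random with sites that are accepted only if they are far from existing ones:

```python
import numpy as np

rng = np.random.default_rng(1)

# i.i.d. placement: each site is drawn without regard to the others
iid_sites = rng.uniform(0, 10, size=(20, 2))

# dependent placement: a candidate site is kept only if it is at least
# 1 km from every existing site, so inclusion depends on the other sites
dependent_sites = []
while len(dependent_sites) < 20:
    candidate = rng.uniform(0, 10, size=2)
    if all(np.linalg.norm(candidate - s) >= 1.0 for s in dependent_sites):
        dependent_sites.append(candidate)
dependent_sites = np.array(dependent_sites)

def closest_pair(pts):
    # smallest distance between any two distinct sites
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return d[d > 0].min()

print(f"closest pair, i.i.d. placement:    {closest_pair(iid_sites):.2f} km")
print(f"closest pair, dependent placement: {closest_pair(dependent_sites):.2f} km")
```

Under the dependent scheme, no two sites are ever closer than 1 kilometer, so whether a location appears in the dataset depends on where the other locations are, which is exactly what the independence assumption rules out.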

Second, existing methods often assume that the model is completely correct, which is essentially never true in practice. Finally, they assume that the source data are similar to the target data for which one wants estimates.

In a spatial setting, however, the source data may differ fundamentally from the target data simply because the two were collected in different places.

For example, scientists might use data from the EPA’s pollution monitors to train a machine learning model that predicts health effects in rural areas that have no monitors. But because the EPA’s monitors tend to be installed in urban areas with heavy traffic and industry, the air quality they measure can differ significantly from the air quality in the rural areas of interest.

In this case, association estimates based on the urban data are biased, because the target data differ systematically from the source data.
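To see how badly coverage can fail under this kind of mismatch, consider a rough, purely illustrative simulation (not the researchers’ method; the slopes, sample sizes, and variable names are all invented): the pollution-outcome slope is made different in urban and rural areas, a line is fit to urban data alone, and we count how often its nominal 95 percent interval contains the rural slope.

```python
import numpy as np

rng = np.random.default_rng(0)
slope_urban, slope_rural = -2.0, -0.5   # the association differs by region
n, trials, covered = 200, 1000, 0

for _ in range(trials):
    x = rng.normal(size=n)                        # pollution at urban monitors
    y = slope_urban * x + rng.normal(size=n)      # outcome observed there
    b, a = np.polyfit(x, y, 1)                    # slope and intercept from urban data
    resid = y - (a + b * x)
    # classical standard error of the fitted slope
    se = np.sqrt(resid @ resid / (n - 2) / np.sum((x - x.mean()) ** 2))
    lo, hi = b - 1.96 * se, b + 1.96 * se
    covered += (lo <= slope_rural <= hi)          # does the CI capture the rural slope?

print(f"nominal 95% intervals cover the rural slope {covered / trials:.1%} of the time")
```

In this toy setup the intervals advertise 95 percent confidence but essentially never contain the target-region slope, which is the kind of failure the researchers observed for existing spatial methods.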

A smooth solution

The researchers’ new method for generating confidence intervals explicitly accounts for this potential bias.

Rather than assuming that the source and target data are similar, the researchers assume only that the data vary smoothly in space.

For example, in the case of particulate air pollution, it is unlikely that the pollution level on one city block will be significantly different from the pollution level on the next city block. Instead, pollution levels decrease smoothly as you move away from the source.

“This spatial smoothness assumption is better for these types of problems; it better matches what’s actually happening in the data,” Broderick says.
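For intuition only, a generic kernel smoother (not the method introduced in the paper) shows how a smoothness assumption lets nearby monitors inform an estimate at an unmonitored spot; the coordinates, readings, and bandwidth below are invented:

```python
import numpy as np

monitor_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # monitor coordinates (km)
readings   = np.array([40.0, 38.0, 41.0, 25.0])                          # PM2.5 at each monitor
target_xy  = np.array([0.5, 0.5])                                        # unmonitored location
bandwidth  = 1.0                                                         # how far smoothness is trusted

dists    = np.linalg.norm(monitor_xy - target_xy, axis=1)
weights  = np.exp(-0.5 * (dists / bandwidth) ** 2)   # Gaussian kernel: nearby monitors count more
estimate = np.sum(weights * readings) / np.sum(weights)
print(f"smoothness-based estimate at the target location: {estimate:.1f}")
```

The bandwidth encodes how far the smoothness is trusted: a small bandwidth leans only on the closest monitors, while a large one averages over a wider neighborhood.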

Comparing their method with other popular approaches, the researchers found that it was the only one that consistently produced reliable confidence intervals for spatial analyses. Moreover, it remained reliable even when the observed data were corrupted by random noise.

In the future, the researchers hope to apply this analysis to different types of variables and explore other applications that can yield more reliable results.

This research was funded in part by an MIT Social and Ethical Responsibilities of Computing (SERC) seed grant, Generali, Microsoft, the Office of Naval Research, and the National Science Foundation (NSF).


