3 Questions: The pros and cons of synthetic data in AI | MIT News



Synthetic data is generated by algorithms to mimic the statistical properties of real data, without containing any information from actual sources. Although it is difficult to pin down exact figures, some estimates suggest that more than 60 percent of the data used in AI applications in 2024 was synthetic, and that share is expected to keep growing across the industry.

Because synthetic data contains no real information, it holds the promise of protecting privacy while cutting costs and speeding the development of new AI models. But using synthetic data requires careful evaluation, planning, and checks and balances to prevent a loss of performance when the resulting AI models are deployed.

To unpack the advantages and disadvantages of using synthetic data, MIT News spoke with Kalyan Veeramachaneni, a principal research scientist in the Laboratory for Information and Decision Systems and co-founder of DataCebo, whose open-core platform, the Synthetic Data Vault, helps users generate and test synthetic data.

Q: How is synthetic data created?

A: Synthetic data is generated algorithmically, but it does not come from real-world events. Its value lies in its statistical similarity to real data. If we are talking about language, for example, synthetic data looks as if a human had written those sentences. Researchers have been creating synthetic data for a long time; what has changed over the past few years is our ability to build generative models from data and use them to create realistic synthetic data. You can take a small amount of real data, build a generative model from it, and then use that model to create as much synthetic data as you want. In addition, the model creates synthetic data in a way that captures the underlying rules and the countless patterns that exist in the real data.

Essentially, there are four data modalities: language, video or images, audio, and tabular data. All four have slightly different ways of building generative models and creating synthetic data. A large language model, for example, is just a generative model from which you sample synthetic data when you ask it a question.

A lot of language and image data is available on the internet. But tabular data, the kind collected when we interact with physical and social systems, is often locked behind enterprise firewalls. Much of it is sensitive or private, such as the customer transactions stored by a bank. For this kind of data, platforms such as the Synthetic Data Vault provide software that can be used to build generative models. Those models maintain customer privacy and create synthetic data that can be shared more widely.
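To make the workflow concrete, here is a minimal sketch using the open-source SDV Python library that the Synthetic Data Vault project publishes. It assumes the current single-table API and a hypothetical CSV of real transactions; the file name and column structure are illustrative, not taken from the interview.

```python
# Minimal sketch: fit a generative model on real tabular data, then sample
# synthetic rows from it. Assumes the SDV 1.x single-table API.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_data = pd.read_csv("transactions.csv")  # hypothetical file of real records

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a generative model on the real data, then sample as much synthetic data as needed.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=10_000)
print(synthetic_data.head())
```

The fitted model, rather than the real data itself, is what gets reused downstream, which is why this approach can be shared more freely than the underlying records.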

One powerful thing about this generative modeling approach to synthesizing data is that companies can build customized, local models of their own data. Generative AI automates what used to be a manual process.

Q: What are the benefits of using synthetic data, and which use cases and applications is it particularly well suited for?

A: One fundamental application that has grown significantly over the past decade is using synthetic data to test software. Many software applications have data-driven logic, so data is needed to test the software and its functionality. In the past, people relied on manually generating data, but now we can use generative models to create as much data as we want.

Users can also generate targeted data for application testing. Say you work for an e-commerce company: you can generate synthetic data that mimics real customers who live in Ohio and made transactions involving a particular product in February or March.
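Targeted data like that can be requested through conditional sampling. Below is a hedged sketch that continues from the earlier SDV example; the column names and values ("state", "transaction_month") are hypothetical, and support for conditional sampling depends on the synthesizer and column types used.

```python
# Hedged sketch: ask a fitted SDV synthesizer for rows matching specific,
# hypothetical column values (customers in Ohio with February transactions).
from sdv.sampling import Condition

ohio_february = Condition(
    num_rows=500,
    column_values={"state": "OH", "transaction_month": "February"},
)

test_data = synthesizer.sample_from_conditions(conditions=[ohio_february])
```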

Because synthetic data is not drawn from real situations, it also protects privacy. One of the biggest problems in software testing is getting access to sensitive real data in order to test software in a non-production environment, which privacy concerns often prohibit. Another immediate benefit is performance testing: you can create a billion transactions from a generative model and test how fast the system can process them.
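For performance testing at that scale, one simplified approach is to sample in batches and append each batch to a file rather than holding everything in memory. The batch size and counts below are illustrative only.

```python
# Illustrative batched generation for load testing: sample the fitted model
# repeatedly and stream each batch to disk.
BATCH_ROWS = 1_000_000
NUM_BATCHES = 1_000  # 1,000 batches of 1M rows is roughly a billion synthetic transactions

for i in range(NUM_BATCHES):
    batch = synthesizer.sample(num_rows=BATCH_ROWS)
    batch.to_csv("load_test_transactions.csv", mode="a", header=(i == 0), index=False)
```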

Another promising application of synthetic data is training machine learning models. Sometimes we want an AI model to predict an event that occurs infrequently. A bank may want to use an AI model to predict fraudulent transactions, but there may be too few real examples to train a model that can identify fraud accurately. Synthetic data provides data augmentation: additional examples that resemble the real data and can substantially improve the accuracy of AI models.
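One common way to carry out this kind of augmentation, sketched here under the assumption of a labeled transactions table with a hypothetical "is_fraud" column, is to fit a synthesizer only on the rare class and append the synthetic examples to the training set.

```python
# Hedged sketch: augment a rare class (fraud) by fitting a synthesizer on just
# those rows and adding synthetic examples to the training data.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

train = pd.read_csv("transactions_labeled.csv")   # hypothetical labeled dataset
fraud_rows = train[train["is_fraud"] == 1]        # the rare class

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(fraud_rows)

fraud_synthesizer = GaussianCopulaSynthesizer(metadata)
fraud_synthesizer.fit(fraud_rows)
synthetic_fraud = fraud_synthesizer.sample(num_rows=5_000)

augmented_train = pd.concat([train, synthetic_fraud], ignore_index=True)
```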

Also, users may not have the time or financial resources to collect all the data they need. Gathering data about customer intent, for instance, can require extensive surveys. If you try to train a model with that limited data, it won't perform well. You can augment it with synthetic data to train those models better.

Q: What are the risks and potential pitfalls of using synthetic data, and are there steps users can take to prevent or mitigate those problems?

A: One of the biggest questions people have when data is created synthetically is: Why should I trust it? Determining whether the data can be trusted often comes down to evaluating the overall system in which it is used.

There are many aspects of synthetic data that we have been able to evaluate for a long time. For example, there are existing methods for measuring how close synthetic data is to real data, for measuring its quality, and for measuring whether it preserves privacy. But there are other important considerations when you use that synthetic data to train a machine learning model for a new use case. How do you know the data will lead to models that still draw valid conclusions?
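Measurements like these are what evaluation tooling around the Synthetic Data Vault automates. A minimal sketch, assuming the SDV evaluation helpers and the real_data, synthetic_data, and metadata objects from the earlier sketch:

```python
# Hedged sketch: score how closely the synthetic data tracks the real data.
from sdv.evaluation.single_table import evaluate_quality, run_diagnostic

# Structural sanity checks (matching columns, valid value ranges, etc.).
diagnostic = run_diagnostic(real_data, synthetic_data, metadata)

# Statistical similarity between real and synthetic columns and column pairs.
quality_report = evaluate_quality(real_data, synthetic_data, metadata)
print(quality_report.get_score())  # overall similarity score between 0 and 1
```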

New efficacy metrics are emerging, with an emphasis on efficacy for a specific task. You really have to dig into the workflow to make sure that, once synthetic data is added to the system, it still leads to valid conclusions. That is something that must be done carefully, application by application.
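One widely used task-specific check, often called "train on synthetic, test on real," is sketched below with scikit-learn. It assumes a held-out DataFrame of real records named real_holdout, and the feature and label columns are hypothetical.

```python
# Hedged sketch of a task-specific check: train a model on synthetic data and
# evaluate it on held-out real data ("train on synthetic, test on real").
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

features = ["amount", "account_age_days"]   # hypothetical feature columns
label = "is_fraud"                          # hypothetical label column

model = RandomForestClassifier(random_state=0)
model.fit(synthetic_data[features], synthetic_data[label])

scores = model.predict_proba(real_holdout[features])[:, 1]
print("AUC on real holdout:", roc_auc_score(real_holdout[label], scores))
```

If the model trained on synthetic data performs comparably on real held-out data to one trained on real data, that is evidence the synthetic data is useful for this particular task.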

Bias can also be an issue. Because synthetic data is created from a small amount of real data, the same biases present in the real data can carry over into the synthetic data. Just as with real data, you have to intentionally make sure biases are removed, using sampling techniques that can build a balanced dataset. It takes some careful planning, but you can calibrate the data generation to prevent bias from proliferating.
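One way to do that balanced sampling, again assuming SDV's conditional sampling and a hypothetical categorical column, is to request an equal number of synthetic rows per group rather than mirroring the skew in the real data.

```python
# Hedged sketch: request an equal number of synthetic rows per group to build
# a balanced dataset instead of reproducing the real data's imbalance.
from sdv.sampling import Condition

groups = ["A", "B", "C"]                    # hypothetical category values
conditions = [
    Condition(num_rows=1_000, column_values={"customer_segment": g})
    for g in groups
]
balanced_data = synthesizer.sample_from_conditions(conditions=conditions)
```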

To aid in that evaluation process, our group created a library of synthetic data metrics. We worried that people would use synthetic data to build models in their own environments and that those models would then draw different conclusions in the real world. We created the metrics and evaluation library to ensure checks and balances. The machine learning community has faced a lot of challenges in getting models to generalize to new situations; using synthetic data adds a whole new dimension to that problem.

The older ways of working with data, whether building software applications, answering analytical questions, or training models, are expected to change dramatically as the construction of these generative models becomes more refined. They will enable many things that we have not been able to do before.


