New model predicts how molecules will dissolve in different solvents | MIT News



Using machine learning, MIT chemical engineers have created a computational model that can predict how well any given molecule will dissolve in an organic solvent, a step that is important in the synthesis of almost every drug. This type of prediction could make it much easier to develop new ways of producing drugs and other useful molecules.

The new model, which predicts how much of a solute will dissolve in a particular solvent, should help chemists choose the right solvent for any given reaction in their synthesis, the researchers say. Common organic solvents include ethanol and acetone, and there are hundreds of others that can also be used in chemical reactions.

“There has been a long-standing interest in predicting solubility better because predicting solubility is a rate-limiting step in the synthesis planning and manufacturing of chemicals, particularly drugs.”

Researchers have made the model freely available, and many companies and labs have already begun to use it. The model could be particularly useful in identifying solvents that are less dangerous than some of the most commonly used industrial solvents, researchers say.

“There are some solvents that are known to dissolve most things. They’re really useful, but they’re damaging to the environment and damaging to people, so many companies require that you minimize the amount of those solvents that you use. Our model is extremely useful in helping you identify the next-best solvent.”

William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, is the senior author of the study, which appears today in Nature Communications. Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, is also an author of the paper.

Solving solubility

The new model grew out of a project that Lucas Attia and Jackson Burns, MIT graduate students who are lead authors of the study, worked on together in an MIT course on applying machine learning to chemical engineering problems. Traditionally, chemists have predicted solubility with a tool known as the Abraham Solvation Model, which estimates a molecule’s overall solubility by adding up the contributions of the chemical structures within the molecule. These predictions are useful, but their accuracy is limited.
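The additive idea behind such traditional estimates can be illustrated with a minimal sketch. It is not the Abraham Solvation Model itself: the fragment names, the contribution values, and the function `additive_log_solubility` are hypothetical placeholders chosen only to show a property being estimated as a sum of per-structure contributions.

```python
from collections import Counter

# Hypothetical per-fragment contributions to log10(solubility).
# These numbers are illustrative placeholders, NOT real Abraham parameters.
FRAGMENT_CONTRIBUTIONS = {
    "CH3": -0.5,
    "CH2": -0.25,
    "OH": 1.0,
}


def additive_log_solubility(fragments, intercept=0.0):
    """Estimate a property by summing one contribution per fragment occurrence."""
    counts = Counter(fragments)
    unknown = set(counts) - set(FRAGMENT_CONTRIBUTIONS)
    if unknown:
        raise KeyError(f"no contribution for fragments: {sorted(unknown)}")
    return intercept + sum(
        FRAGMENT_CONTRIBUTIONS[name] * n for name, n in counts.items()
    )


# Ethanol written as a bag of (hypothetical) fragments.
ethanol = ["CH3", "CH2", "OH"]
print(additive_log_solubility(ethanol))  # 0.25
```

Because every fragment contributes a fixed amount regardless of its surroundings, a scheme like this is simple to apply but cannot capture interactions between parts of a molecule, which is one way to see why its accuracy is limited.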

Over the past few years, researchers have been using machine learning to make more accurate solubility predictions. Before Burns and Attia began working on their new model, the state-of-the-art model for predicting solubility was one developed in Green’s lab in 2022.

That model, known as SolProp, works by predicting a set of related properties and combining them, using thermodynamics, to ultimately predict the solubility. However, the model has difficulty predicting solubility for solutes that it has not seen before.

“In the case of drug and chemical discovery pipelines that are developing new molecules, we want to be able to predict in advance what their solubility will look like,” says Attia.

Part of the reason existing solubility models have not worked well is that there was no comprehensive data set to train them on. However, in 2023 a new data set called BigSolDB was released, compiling data from almost 800 published papers, including information on the solubility of about 800 molecules dissolved in more than 100 organic solvents commonly used in synthetic chemistry.

Attia and Burns decided to train two different types of models on this data. Both models represent the chemical structure of a molecule using numerical representations known as embeddings, which incorporate information such as the number of atoms in a molecule and which atoms are bonded to which other atoms. Models can then use these representations to predict a variety of chemical properties.
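As a rough illustration of what turning a structure into numbers can mean, here is a toy sketch. It is not the featurization used by FastProp or ChemProp: the function `toy_embedding`, the chosen features, and the element list are assumptions made for this example only.

```python
from collections import Counter


def toy_embedding(atoms, bonds, elements=("C", "H", "O", "N")):
    """Return a fixed-length numeric vector describing a molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs saying which atoms are bonded
    """
    element_counts = Counter(atoms)

    # Count how many bonds each atom takes part in.
    degree = Counter()
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    max_degree = max(degree.values(), default=0)

    # Layout: [number of atoms, number of bonds, highest degree, per-element counts...]
    return [len(atoms), len(bonds), max_degree] + [
        element_counts[e] for e in elements
    ]


# Ethanol with hydrogens omitted for brevity: C-C-O
print(toy_embedding(["C", "C", "O"], [(0, 1), (1, 2)]))
# [3, 2, 2, 2, 0, 1, 0]
```

Every molecule maps to a vector of the same length, which is what lets a downstream model treat very different structures as comparable inputs.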

One of the models used in this study, known as FastProp and developed by Burns and others in Green’s lab, incorporates “static embeddings.” This means that the model already knows the embedding for each molecule before it begins any kind of analysis.

The other model, ChemProp, learns an embedding for each molecule during training, at the same time that it learns to associate the features of the embedding with a property such as solubility. Developed across multiple MIT labs, this model has already been used for tasks such as antibiotic discovery, lipid nanoparticle design, and chemical reaction rate prediction.

The researchers trained both types of models on more than 40,000 data points from BigSolDB, including information on the effects of temperature, which plays a critical role in solubility. The models were then tested on approximately 1,000 solutes that had been withheld from the training data. The researchers found that the models’ predictions were two to three times more accurate than those of SolProp, the best previous model, and that the new models were particularly accurate in predicting variations in solubility with temperature.
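Withholding entire solutes, rather than individual measurements, is what makes such a test reflect performance on molecules the model has never seen. A minimal sketch of that kind of split is shown here; the record fields, the values, and the function `split_by_solute` are hypothetical and are not taken from the study’s code.

```python
import random


def split_by_solute(records, test_fraction=0.2, seed=0):
    """Split measurements so that no solute appears in both sets."""
    solutes = sorted({r["solute"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(solutes)

    # Hold out whole solutes, not individual rows.
    n_test = max(1, round(len(solutes) * test_fraction))
    held_out = set(solutes[:n_test])

    train = [r for r in records if r["solute"] not in held_out]
    test = [r for r in records if r["solute"] in held_out]
    return train, test


# Made-up placeholder records: one row per (solute, solvent, temperature).
records = [
    {"solute": "A", "solvent": "ethanol", "temperature_K": 298.15, "log_s": -1.0},
    {"solute": "A", "solvent": "ethanol", "temperature_K": 313.15, "log_s": -0.8},
    {"solute": "B", "solvent": "acetone", "temperature_K": 298.15, "log_s": -2.1},
    {"solute": "C", "solvent": "ethanol", "temperature_K": 298.15, "log_s": -0.3},
    {"solute": "D", "solvent": "acetone", "temperature_K": 303.15, "log_s": -1.7},
]

train, test = split_by_solute(records, test_fraction=0.25)
print(len(train), len(test))
```

A purely random row-level split would let the same solute appear at different temperatures in both sets, which tends to overstate how well a model generalizes to new molecules.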

“Being able to accurately reproduce those small variations in solubility due to temperature, even when the overall experimental noise is very large, was a really positive sign that the network had correctly learned an underlying solubility prediction function,” Burns says.

Accurate predictions

The researchers had expected that the ChemProp-based model, which can learn new representations during training, would make more accurate predictions. To their surprise, however, they found that the two models performed essentially the same. That suggests that the main limitation on their performance is the quality of the data, and that the models are performing as well as is theoretically possible given the data they are using, the researchers say.

“ChemProp should always outperform any static embedding when you have sufficient data,” Burns says. “We were blown away to see that the static and learned embeddings were statistically indistinguishable in performance across all the different subsets. This indicates that the data limitations present in this space dominate the performance of the model.”

The models could become more accurate, the researchers say, if better training and testing data were available: ideally, data obtained by one person or by a group of people all trained to perform the experiments in the same way.

“One major limitation of using these types of compiled datasets is that different labs use different methods and experimental conditions when performing solubility tests. This contributes to this variation between different datasets,” says Attia.

Because the model based on FastProp makes its predictions faster and has code that is easier for other users to adapt, the researchers decided to make that one, known as FastSolv, available to the public. Several pharmaceutical companies have already begun using it.

“There are applications throughout the drug discovery pipeline,” Burns says. “We’re also excited to see where, beyond formulation and drug discovery, people may use this model.”

The study was funded in part by the US Department of Energy.


