Self-improving language models are becoming a reality with MIT’s latest SEAL technology



Researchers at the Massachusetts Institute of Technology (MIT) are drawing new attention for developing and open sourcing a technique that allows large language models (LLMs), such as those powering ChatGPT and most modern AI chatbots, to improve themselves by generating synthetic data to fine-tune on.

The technique, known as SEAL (Self-Adapting LLMs), was first described in a paper published in June and covered by VentureBeat at the time.

A significantly expanded and updated version of the paper was released last month, and the open source code was posted on GitHub under an MIT license (permitting commercial and enterprise use). This week, the work is drawing a fresh wave of attention from AI power users on the social network X.

SEAL allows LLMs to autonomously generate and apply their own fine-tuning strategies. Unlike traditional models that rely on fixed external data and human-written optimization pipelines, SEAL lets a model evolve by generating its own synthetic training data and the corresponding optimization directives.

The development is being led by a team at MIT’s Improbable AI Lab, including Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Their work was recently presented at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

Background: “Beyond static AI” to self-adaptive systems

Earlier this year, VentureBeat first reported on SEAL as an early-stage framework that allows language models to generate their own synthetic data and train on it, a potential solution to pre-trained models becoming static after deployment.

At that stage, SEAL was framed as a proof of concept that could enable enterprise AI agents to learn continuously in dynamic environments without manual retraining.

Since then, the research has advanced significantly. The new version extends the previous framework by demonstrating that SEAL’s self-adaptive ability scales with model size, integrating reinforcement learning more effectively to reduce catastrophic forgetting, and formalizing SEAL’s dual-loop structure (an inner supervised fine-tuning loop and an outer reinforcement optimization loop) to improve reproducibility.

The updated paper also introduces evaluation across different prompt formats, improved stability during learning cycles, and a discussion of real-world deployment challenges during inference.

Dealing with static model limitations

Although LLMs have shown great abilities in text generation and comprehension, adaptation to new tasks and knowledge is often manual, fragile, and context-dependent.

SEAL challenges this status quo by equipping models with what the authors call “self-editing,” the ability to generate natural language output that specifies how the model should update its weights.

These self-edits may take the form of restated information, logical implications, or configurations for data augmentation and training. Once the model has generated the edits, it fine-tunes itself on them. The process is guided by reinforcement learning, with reward signals derived from improved performance on downstream tasks.
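As a rough illustration (not the authors’ released code), the core self-editing step can be sketched as follows; the generate and finetune_on methods are placeholder names for the model’s text-generation and fine-tuning interfaces:

```python
# Minimal sketch of SEAL's self-editing idea (illustrative only; method names
# generate() and finetune_on() are placeholders, not the released API).

SELF_EDIT_PROMPT = (
    "Read the following passage and rewrite its content as a list of "
    "restated facts and logical implications suitable for training:\n\n{passage}"
)

def self_edit_step(model, passage):
    # 1. The model writes its own training data (the "self-edit").
    self_edit = model.generate(SELF_EDIT_PROMPT.format(passage=passage))
    # 2. The model is then fine-tuned on the text it just produced.
    model.finetune_on(self_edit)
    return self_edit
```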

This design mimics the way human learners rephrase and rearrange material to understand it better. Restructuring knowledge before assimilating it is an important advantage over models that passively consume new data “as is.”

Overall task performance

SEAL was tested in two main settings: knowledge incorporation and few-shot learning.

In the knowledge incorporation setting, the researchers evaluated how well the model could absorb new factual content from passages similar to those in the SQuAD dataset, a benchmark reading comprehension dataset introduced by Stanford University in 2016 that consists of over 100,000 crowdsourced question-answer pairs based on Wikipedia articles (Rajpurkar et al., 2016).

Rather than fine-tuning on the passage text directly, the model generated logical implications of the passage and then fine-tuned on them.

After two rounds of reinforcement learning, the model improved its question answering accuracy from 33.5% to 47.0% on the no-context version of SQuAD. This exceeds the results obtained using synthetic data generated by GPT-4.1.

In the few-shot learning setting, SEAL was evaluated on a subset of the ARC benchmark, where tasks require inference from only a small number of examples. Here, SEAL generated self-edits that specified data augmentations and training hyperparameters.
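As an illustration of what such a self-edit might look like (the schema below is an assumption for clarity, not the paper’s exact format), the model’s output can be read as a small configuration naming which augmentations to apply and which hyperparameters to train with:

```python
# Hypothetical shape of a few-shot self-edit: a configuration the model emits
# to steer its own adaptation. Field names are illustrative assumptions.
few_shot_self_edit = {
    "augmentations": ["rotations", "reflections", "repeat_demonstrations"],
    "learning_rate": 1e-4,
    "train_epochs": 3,
    "use_lora": True,
}
```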

After reinforcement learning, the success rate on held-out tasks jumped from 20% (with self-edits generated without reinforcement learning) to 72.5%. A model that relied solely on in-context learning without any adaptation scored 0%.

Technical framework

SEAL operates with two nested loops: an inner loop that performs supervised fine-tuning based on self-edits, and an outer loop that uses reinforcement learning to refine the policy that generates those self-edits.

The reinforcement learning algorithm used is based on ReSTEM, which combines sampling with filtered behavior cloning. During training, only self-edits that lead to improved performance are reinforced. This approach effectively teaches the model which kinds of edits are most beneficial for learning.
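A minimal sketch of how the two loops and the ReSTEM-style filtering fit together is shown below; sample_self_edits, finetune_copy, and evaluate are assumed placeholder helpers standing in for SEAL’s prompting, fine-tuning, and task-evaluation machinery, not the released implementation:

```python
# Illustrative sketch of SEAL's nested loops (not the authors' implementation).
# sample_self_edits, finetune_copy, and evaluate are assumed placeholder helpers.

def seal_training_round(model, tasks, n_samples=4):
    winning_edits = []
    for task in tasks:
        baseline = evaluate(model, task)
        for edit in sample_self_edits(model, task, n=n_samples):
            # Inner loop: supervised fine-tuning on one candidate self-edit.
            candidate = finetune_copy(model, edit)
            # Keep only edits whose fine-tuned model beats the baseline.
            if evaluate(candidate, task) > baseline:
                winning_edits.append(edit)
    # Outer step (ReSTEM-style filtered behavior cloning): fine-tune the model
    # on the self-edits that improved performance, so it learns to propose
    # similar edits in the future.
    return finetune_copy(model, winning_edits)
```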

To increase efficiency, SEAL applies LoRA-based fine-tuning rather than full parameter updates, allowing rapid experimentation and low-cost adaptation.
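For example, a LoRA adapter of this kind can be set up with Hugging Face’s peft library (the specific library and model name here are illustrative assumptions; the paper only states that LoRA is used in place of full-parameter updates):

```python
# Sketch of LoRA-style lightweight fine-tuning with Hugging Face peft
# (tooling and model name are illustrative assumptions).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder base model
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```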

Strengths and limitations

Researchers report that SEAL can generate highly actionable training data with minimal supervision and can outperform even large external models like GPT-4.1 on certain tasks.

The researchers also show that SEAL generalizes beyond its original setting: it continues to perform well when extended from single-pass updates to multi-document continued pre-training scenarios.

However, the framework is not without limitations. One problem is catastrophic forgetting, where updates that incorporate new information can degrade performance on previously learned tasks.

In response to this concern, co-author Jyothish Pari told VentureBeat via email that reinforcement learning (RL) appears to reduce forgetting more effectively than standard supervised fine-tuning (SFT), citing recent papers on the subject. He added that combining this insight with SEAL could lead to new variants in which SEAL learns the reward function as well as the training data.

Another challenge is computational overhead. Evaluation of each self-edit requires fine-tuning and performance testing, and each edit can take 30-45 seconds, significantly longer than standard reinforcement learning tasks.

As Pari explained, “Training SEAL is not trivial because it requires two optimization loops: an outer RL loop and an inner SFT loop. Updating the model weights during inference also requires new system infrastructure.” He emphasized the need for future research on deployment systems as a key path to commercializing SEAL.

Furthermore, the current design of SEAL assumes the existence of paired tasks and reference answers for every context, which limits its direct application to unlabeled corpora. But Pari made it clear that SEAL can be trained to adapt accordingly, even in safety-critical areas, as long as there are downstream tasks with computable rewards. In principle, a SEAL-trained model can learn to avoid training on harmful or malicious inputs if guided by appropriate reward signals.

AI community reaction

The AI research and builder community responded to the SEAL paper with a mix of excitement and speculation. On X (formerly Twitter), several prominent AI-focused accounts weighed in on its potential impact.

User @VraserX, who self-describes as an educator and AI enthusiast, called SEAL “the birth of continuous self-learning AI” and predicted that models like OpenAI’s GPT-6 could adopt a similar architecture.

In their words, SEAL represents “the end of the frozen weight era,” ushering in a system that evolves as the world around it changes.

They highlighted SEAL’s ability to form lasting memories, repair knowledge, and learn from real-time data, framing it as a foundational step toward models that not only use information but absorb it.

Meanwhile, @alex_prompter, co-founder of an AI-powered marketing venture, framed SEAL as a leap toward models that literally rewrite themselves. “MIT has just built an AI that can rewrite its own code to become smarter,” he wrote. Citing the paper’s headline results, a roughly 40% improvement in fact recall using self-generated data that outperformed GPT-4.1, he described the findings as confirmation that “LLMs that fine-tune themselves are no longer science fiction.”

This enthusiasm reflects a broader demand in the AI field for models that can evolve without continuous retraining or human oversight, especially in rapidly changing domains and personalized use cases.

Future directions and open questions

In response to a question about scaling SEAL to larger models and tasks, Pari pointed to experiments (Appendix B.7) showing that as model size increases, so does self-adaptive ability. He compared this to students improving their learning skills over time: simply put, the larger the model, the better it is at producing useful self-edits.

When asked whether SEAL generalizes to new prompting styles, he confirmed that it does, citing Table 10 of the paper. However, he acknowledged that the team has yet to test SEAL’s ability to transfer across entirely new domains and model architectures.

“SEAL is an early example of what’s possible,” he said. “But it requires a lot more testing.” He added that generalizability could improve as SEAL is trained on a broader range of tasks.

Interestingly, the team found that just a few steps of reinforcement learning already led to measurable performance improvements. “This is very interesting because it means that with more computing, we could potentially improve even further,” Pari said. He suggested that future experiments could explore more advanced reinforcement learning techniques beyond ReSTEM, such as group relative policy optimization (GRPO).

Aiming for more adaptive, agentic models

SEAL represents a step toward models that can autonomously improve over time by integrating new knowledge and reconfiguring how they learn. The authors envision future extensions in which SEAL could support self-pretraining, continual learning, and the development of agentic systems, models that interact with and gradually adapt to evolving environments.

In such a setting, a model could use SEAL to synthesize weight updates after each interaction, gradually internalizing its actions and insights. This could reduce the need for repeated supervision and manual intervention, especially in data-constrained or specialized domains.

As public web text becomes saturated and further scaling of LLMs is bottlenecked by data availability, self-directed approaches like SEAL could play an important role in pushing the boundaries of what LLMs can achieve.

The SEAL project, including code and detailed documentation, can be accessed at https://jyopari.github.io/posts/seal.


