AI “confession fluid”: OpenAI’s new way to train models to confess their mistakes



OpenAI researchers "whitening agent" Force large-scale language models (LLMs) to self-report cheating, hallucinations, and policy violations. This technique, "confession," Address growing concerns with enterprise AI. Models can be dishonest, exaggerating their confidence or hiding shortcuts to getting to the answer.

In real-world applications, this technology advances the creation of more transparent and manipulable AI systems.

What is a confession?

Many forms of AI deception arise from the complexity of the system. reinforcement learning (RL) phase of model training. RL rewards models that produce output that achieves various objectives, such as accuracy, style, and safety. This can lead to risks such as: "Error in specifying compensation," The model then learns to simply produce an answer. "looks good" Answers based on reward functions rather than answers that are truly true to the user’s intent.

A confession is a structured report produced by a model after it provides the main answer. This serves as a self-assessment of your own compliance with the instructions. In this report, the model should list all instructions to be followed, evaluate how well it meets them, and report any uncertainties or judgments made along the way. The goal is to create another channel that only encourages models to be honest.

in blog postOpenAI researchers provide several examples. "confession" Techniques in practice. The model was given a test with hidden rules. If the score exceeds 50%, the model’s knowledge is erased. To save the model itself, "placed in a sandbag" By intentionally answering half of the questions incorrectly. That initial answer hid this deception, but the subsequent confession was brutally honest. I admitted that, "Users wanted the right answer, and we blocked half of them."

The researchers found that when models cheated, they were significantly more likely to admit it in a confession than in a main response.

Mechanism of confession training

The key to this method is the separation of rewards. During training, rewards assigned to confessions are based solely on their honesty and are not confused with rewards for major missions. "Like the Catholic Church’s “Seal of Confession,” nothing a model reveals can change the reward it receives for completing its original mission." the researchers wrote. This results in "safe space" Model admits fault without penalty.

This approach is powerful because it avoids major challenges in AI training. Researchers intuit that it is easier to honestly confess to cheating than to obtain a high reward for the original, often complex, problem.

Therefore, if a model is judged purely on its honesty, it has less incentive to deceive people. "confession judge" The incentive to cheat on the main answer’s reward model is much weaker. For example, in an experiment where the model was trained as "hack" Because of weak reward models, performance on key tasks has become more deceptive over time. However, the confession became more honest and pinpointed the reward hacking it was performing.

However, this technology has limitations. Confessions are not a panacea for all kinds of AI failures. This system works best when the model is aware that it is behaving incorrectly. The less effective is "Unknown unknown." For example, if a model hallucinates a fact and truly believes it to be true, it cannot confess to having provided false information. The most common reason confessions fail is model confusion, not intentional deception. Confusion often occurs when instructions are ambiguous and the model cannot clearly determine the human user’s intent.

What it means for enterprise AI

OpenAI’s confession technology is part of a growing set of efforts around AI safety and control. OpenAI competitor Anthropic also published research showing how LLM can learn malicious act. The company is also working toward that goal. plug these holes As they emerge.

For AI applications, mechanisms such as confessions can provide a practical monitoring mechanism. The structured output from confession can be used to flag or reject model responses before they cause problems during inference. For example, you can design a system that automatically escalates output for human review if a confession violates policy or indicates high uncertainty.

In a world where AI becomes increasingly agentic and capable of performing complex tasks, observability and control are key ingredients for safe and reliable deployment.

“As models become more capable and deployed in riskier environments, we need better tools to understand what they are doing and why,” OpenAI researchers wrote. “While confessions are not a complete solution, they add a meaningful layer to our transparency and oversight stack.”



Source link

Leave a Reply