Openai – Humanity’s Cross Test reveals the risk of jailbreak and misuse – which companies must add to their GPT-5 ratings


Need smarter insights in your inbox? Sign up for our weekly newsletter to get only the things that matter to enterprise AI, data and security leaders. Subscribe now


Openai and humanity often oppose the basic models to each other, but the companies came together to evaluate each other’s public models to test the adjustments.

The company said it believes extraordinary accountability and safety will increase transparency by enabling these powerful models, allowing companies to choose the model that suits them best.

“We believe this approach supports accountable and transparent assessments and helps to ensure that each lab’s model continues to be tested against new and challenging scenarios,” Openai said in its findings.

The companies have discovered that inference models such as Openai’s 03, O4-Mini and Claude 4 resist the jailbreak of humanity, while general chat models such as GPT-4.1 are susceptible to misuse. Such assessments will help companies identify potential risks associated with these models, but it should be noted that GPT-5 is not part of the testing.


AI scaling reaches its limit

Power caps, rising token costs, and inference delays are rebuilding Enterprise AI. Join exclusive salons and discover what your top team looks like.

  • Turning energy into a strategic advantage
  • Architects efficient inference for real throughput gain
  • Unlock competitive ROI with a sustainable AI system

Make sure you have your place to stay first: https://bit.ly/4mwgngo


These safety and transparency alignment ratings follow the claims primarily by ChatGpt users, and Openai’s model falls prey to psychofancy and becomes overly cautious. Openai then rolled back the update that caused psychofancy.

“We are primarily interested in understanding the trends in models for harmful behavior,” Humanity states in its report. “We aim to understand the most concerning behaviors that these models may attempt to take when they are given the opportunity, rather than focusing on the opportunities that such opportunities arise and the likelihood that these actions will be completed normally,” he said.

Openai said the test was designed to show how the models interact in intentionally difficult environments. The scenarios they built are mostly edge cases.

The inference model continues to adjust

This test only covers published models from both companies: Anthropic’s Claude 4 Opus and Claude 4 Sonnet, and Openai’s GPT-4o, GPT-4.1 O3 and O4-Mini. Both companies have eased external protection measures for the model.

Openai has tested the public API of the Claude model and mandated it to use the inference feature of Claude 4. Humanity said they did not use Openai’s O3-Pro because “it’s not compatible with the APIs that the tool supports optimally.”

The goal of the test was not to conduct apple-apil comparisons between models, but to determine how often large language models (LLMs) deviate from alignment deviates. Both companies leveraged the Shade-Arena Sabotage evaluation framework and showed that the Claude model has a higher success rate for subtle sabotage behavior.

“These tests assess the orientation of the model for difficult or high-stakes situations in simulated settings rather than regular use cases, and often involve long, multi-turn interactions,” Anthropic reported. “This type of assessment is becoming an important focus for alignment science teams as they are more likely to catch behaviors that are unlikely to appear in normal pre-deployment testing with real users.”

Humanity said tests like these worked better if organizations could compare notes. “Designing these scenarios involves a huge number of degrees of freedom. A single research team cannot explore the full space of productive evaluation ideas.”

Findings showed that inference models are generally robustly implemented and can resist jailbreaking. Openai’s O3 was more aligned than the Claude 4 Opus, but along with the GPT-4O and GPT-4.1, the O4-Mini “was often a bit more concern than the Claude model.”

GPT-4O, GPT-4.1, and O4-MINI also showed their willingness to cooperate with human misuse, giving detailed instructions on how to create drugs, how to develop late living organisms, and what is frightening, and planning terrorist attacks. Because both Claude models had a high rejection rate, the model refused to answer questions that they had no idea of ​​the answer to avoid hallucinations.

The company’s model showed that it was “related to forms of psychofancy,” and at one point examined the simulated user’s harmful decisions.

What businesses need to know

For businesses, understanding the potential risks associated with the model is invaluable. Model evaluations have been largely deligued in many organizations, with many testing and benchmarking frameworks now available.

Companies need to continue to evaluate the models they use, and with the release of GPT-5, they need to perform their own safety assessments with these guidelines in mind.

  • Test both inference and irrational models. Because inference models showed greater resistance to misuse, but could provide hallucinations and other harmful behaviors.
  • Benchmarks between vendors because the model failed with different metrics.
  • Reject both misuse and collaborative stress tests, and their rejection and usefulness, and refuse to show trade-offs between usefulness and guardrails.
  • The model will continue to be audited even after deployment.

While many reviews focus on performance, there are third-party safety alignment tests. For example, this is from Cyata. Last year, Openai released a model alignment education method known as rule-based rewards, but humanity launched an audit agent to check the safety of the model.



Source link