Gemini 3 Pro scored 69% on trust in blind testing, up from 16% for Gemini 2.5 Pro, in an evaluation that measures AI on real-world trust rather than academic benchmarks.



Just a few weeks ago, Google launched Gemini 3, claiming leadership status on multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they’re just that: vendor-provided.

A new vendor-neutral evaluation from Prolific, however, also puts Gemini 3 at the top of the leaderboard. It is not based on a set of academic criteria; rather, it is based on real-world attributes that matter to real users and organizations.

Prolific was founded by researchers at the University of Oxford and provides high-quality, reliable human data to drive rigorous research and ethical AI development. The company’s HUMAINE benchmark applies this approach, using representative human sampling and blind testing to rigorously compare AI models across different user scenarios, measuring not only technical performance but also trustworthiness, adaptability, and communication style.

The latest HUMAINE round put the models through blind testing with 26,000 users. Gemini 3 Pro’s trust score jumped from 16% to 69%, the highest Prolific has ever recorded: Gemini 3 ranks #1 overall for trust, ethics, and safety across demographic subgroups 69% of the time, whereas the previous-generation Gemini 2.5 Pro held the top spot only 16% of the time.

Overall, Gemini 3 ranked first in three of the four rating categories: Performance and Reasoning, Interaction and Adaptability, and Trust and Safety. Communication style was the only category it lost, with DeepSeek V3 topping user preferences there at 43%. HUMAINE testing also showed that Gemini 3 performed consistently well across 22 different demographic user groups spanning age, gender, ethnicity, and political orientation, and that users were five times more likely to select the model in direct blind comparisons.

But the ranking itself matters less than why the model won.

"It’s consistency across a huge range of different use cases, and personality and style that appeals to different types of users." Phelim Bradley, co-founder and CEO of Prolific, told VentureBeat. "In certain instances, other models may be preferred for smaller subgroups or specific conversation types, but it is the model’s breadth of knowledge and flexibility across different use cases and audience types that enabled it to win in this particular benchmark."

How blind tests reveal what academic benchmarks miss

HUMAINE’s methodology addresses a gap in how the industry evaluates models. Users interact with two models simultaneously in multi-turn conversations. They don’t know which vendor is behind each response, and they discuss topics that matter to them rather than predetermined test questions.
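As a rough illustration of how a blind pairwise setup like this can work, here is a minimal Python sketch. It is not Prolific’s actual harness; the model names and the get_response stub are assumptions, and a real harness would record a real user’s vote rather than a random one:

```python
import random

# Hypothetical sketch of one blind, multi-turn comparison trial.
# The model names and get_response() stub are illustrative, not a real API.

def get_response(model: str, user_turn: str) -> str:
    return f"[{model} reply to: {user_turn}]"  # placeholder for a real model call

def run_blind_trial(model_x: str, model_y: str, user_turns: list[str]) -> dict:
    # Randomize which model appears under which label so position never leaks identity.
    labels = {"A": model_x, "B": model_y}
    if random.random() < 0.5:
        labels = {"A": model_y, "B": model_x}

    transcript: list[str] = []
    for turn in user_turns:            # raters choose the topics, not the harness
        transcript.append(f"User: {turn}")
        for label in ("A", "B"):       # the rater sees only anonymous labels
            transcript.append(f"{label}: {get_response(labels[label], turn)}")

    vote = random.choice(["A", "B"])   # stand-in for the rater's preference vote
    # Vendor identity is revealed only to the analysis step, never to the rater.
    return {"winner": labels[vote], "transcript": transcript}

print(run_blind_trial("gemini-3-pro", "deepseek-v3", ["Plan a week of meals"]))
```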

Equally important is the sample itself. HUMAINE uses representative sampling across the US and UK populations, controlling for age, gender, ethnicity, and political orientation. This reveals something static benchmarks cannot capture: model performance varies by audience.

"If you get an AI leaderboard, most likely still have fairly static lists." Bradley said. "But in our case, when we control for the audience, we end up with slightly different leaderboards whether we look at the left-leaning sample, the right-leaning sample, the US, or the UK. And I think the most different condition in our experiment was actually age."

This is important for companies implementing AI across diverse employee populations. A model that performs well in one demographic may underperform in another.

The methodology also speaks to a fundamental question in AI evaluation: why use human judges when AI can evaluate itself? Bradley emphasized that while his company uses AI judges for certain use cases, human evaluation remains essential.

"We believe the greatest benefits come from smart orchestration of both LLM judges and human data. Both have their advantages and disadvantages, and can work more effectively when combined wisely." said Bradley. "But we still believe that human data is alpha. We remain very bullish that human data and human intelligence need to be in the loop."

What is trust in AI evaluation?

HUMAINE’s Trust, Ethics, and Safety category measures users’ confidence in a model’s trustworthiness, factual accuracy, and responsible behavior. In this methodology, trust is not a vendor claim or a technical metric; it is what users report after blind conversations with competing models.

The 69% figure represents the probability that Gemini 3 holds the top spot across demographic groups. That consistency matters more than any single aggregate score, because organizations serve diverse populations.
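To make the arithmetic behind that figure concrete, here is a minimal sketch of how a "ranks #1 in X% of subgroups" number falls out of per-subgroup leaderboards. The subgroups and winners below are invented for illustration; HUMAINE’s actual computation may differ:

```python
# Invented per-subgroup leaders, purely for illustration.
subgroup_leaders = {
    "us_age_18_24": "gemini-3-pro",
    "us_age_65_plus": "gemini-3-pro",
    "uk_age_18_24": "deepseek-v3",
    "uk_age_65_plus": "gemini-3-pro",
    # ...one entry per controlled subgroup (age, gender, ethnicity, politics)
}

def top_spot_share(model: str, leaders: dict[str, str]) -> float:
    """Fraction of subgroups in which `model` tops the trust ranking."""
    wins = sum(1 for leader in leaders.values() if leader == model)
    return wins / len(leaders)

print(f"{top_spot_share('gemini-3-pro', subgroup_leaders):.0%}")  # 75% in this toy data
```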

"I wasn’t aware that I was using Gemini in this scenario, but" Bradley said. "It was based solely on blinded multiturn responses."

This distinguishes earned trust from perceived trust. Users judged model output without knowing which vendor produced it, removing Google’s brand advantage. The distinction matters in customer-facing deployments, where the AI vendor is invisible to the end user.

What companies should do now

The most important step for companies now weighing different models is to adopt an evaluation framework that actually works.

"It is becoming increasingly difficult to evaluate models based on atmosphere alone;" Bradley said. "I think we will increasingly need a more rigorous, scientific approach to truly understand how these models work."

HUMAINE’s data suggests the framework: test for consistency across use cases and user populations, not just peak performance on specific tasks. Blind the testing to separate model quality from brand perception. Use a representative sample that matches your actual user population. And plan for ongoing evaluation as models change.
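As one way to operationalize "consistency over peak performance," a hedged sketch follows, with invented segment names and scores rather than HUMAINE’s actual data. It compares each model’s worst segment alongside its mean, so a model that excels for one audience but fails another stands out:

```python
# Invented per-segment trust scores, purely for illustration.
scores = {
    "model_a": {"age_18_24": 0.71, "age_65_plus": 0.68, "left": 0.70, "right": 0.69},
    "model_b": {"age_18_24": 0.83, "age_65_plus": 0.41, "left": 0.80, "right": 0.52},
}

for model, by_segment in scores.items():
    mean = sum(by_segment.values()) / len(by_segment)
    worst = min(by_segment, key=by_segment.get)  # segment with the lowest score
    print(f"{model}: mean={mean:.2f}, worst={worst} ({by_segment[worst]:.2f})")

# model_a wins on consistency even though model_b has higher peaks.
```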

For companies looking to deploy AI at scale, this shifts the question from "Which model is best?" to "Which model is best suited to your specific use case, user demographics, and required attributes?"

The rigor of representative sampling and blind testing provides the data to make that decision, something technical benchmarks and vibes-based assessments cannot.


