The 70% factuality ceiling: Why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI



There is no shortage of generative AI benchmarks designed to measure the performance and accuracy of specific models across a variety of useful enterprise tasks, from coding to instruction following to web browsing and agentic tool use. Many of these benchmarks share a major drawback, however: they measure a model's ability to complete a specific task or request, not how factual its output is, that is, how reliably it produces objectively correct information grounded in real-world data, especially when that information is contained in images and graphics.

For industries where accuracy is paramount, such as law, finance, and medicine, the lack of a standardized way to measure factuality is a major blind spot.

That changes today. Google's FACTS team and its data science arm Kaggle have released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to fill this gap.

An accompanying research paper offers a more nuanced definition of the problem, splitting "factuality" into two distinct operational scenarios: "contextual factuality" (grounding responses in provided data) and "world knowledge factuality" (retrieving information from memory or the web).

While the headline news is Gemini 3 Pro's top-tier ranking, the deeper story for builders is the industry-wide "factuality wall."

Early results show that none of the models, including Gemini 3 Pro, GPT-5, and Claude 4.5 Opus, managed to clear a 70% accuracy score across the problem set. For technology leaders, this is a signal that the "trust but verify" era is not over yet.

Deconstructing the benchmark

The FACTS suite is more than just Q&A. It consists of four different tests, each simulating a different real-world failure mode that developers might encounter in a production environment.

  1. Parametric benchmark (internal knowledge): Can the model accurately answer trivia-style questions using only its training data?

  2. Search benchmark (tool use): Can the model effectively use a web search tool to retrieve and synthesize live information?

  3. Multimodal benchmark (vision): Can the model accurately interpret charts, diagrams, and images without hallucinating?

  4. Grounding benchmark v2 (context): Can the model stick closely to a provided source text?

Google is making 3,513 examples publicly available, while Kaggle holds back a private set to prevent developers from training on the test data, a common problem known as "contamination."

The leaderboard: A game of inches

In the first run of the benchmark, Gemini 3 Pro led with an overall FACTS score of 68.8%, followed by Gemini 2.5 Pro (62.1%) and OpenAI’s GPT-5 (61.8%). But a closer look at the data reveals where the real battleground lies for engineering teams.

| Model | FACTS score (average, %) | Search (RAG capability, %) | Multimodal (vision, %) |
| --- | --- | --- | --- |
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude 4.5 Opus | 51.3 | 73.2 | 39.2 |

Data source: FACTS team release notes.

For builders: the "search" vs. "parametric" gap

The Search benchmark is the most important metric for developers building RAG (retrieval-augmented generation) systems.

The data show a large gap between what models "know" (parametric) and what they can "look up" (search). Gemini 3 Pro, for example, scored a high 83.8% on the search task but only 76.4% on the parametric task.

This validates the current standard for enterprise architecture: do not rely on the model's internal memory for critical facts.

If you are building an internal knowledge bot, the FACTS results suggest that connecting your model to a search tool or vector database is not optional; it is the only way to get accuracy to an operationally acceptable level.
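As a rough illustration, a grounded answering path looks something like the sketch below. The `vector_store` and `llm_client` objects are hypothetical stand-ins for whatever retrieval index and model SDK your stack already uses, not anything prescribed by the FACTS release.

```python
# Minimal sketch of retrieval-grounded answering. `vector_store` and
# `llm_client` are hypothetical stand-ins for your own retrieval layer
# and model SDK; swap in whatever your stack actually uses.

def answer_with_grounding(question: str, vector_store, llm_client, k: int = 5) -> str:
    """Retrieve supporting passages first, then make the model answer from them."""
    # 1. Pull the top-k passages from your own document index rather than
    #    trusting the model's parametric memory.
    passages = vector_store.search(question, top_k=k)
    context = "\n\n".join(f"[{i}] {p.text}" for i, p in enumerate(passages, start=1))

    # 2. Constrain the model to the retrieved text and give it an explicit
    #    "not found" escape hatch instead of letting it guess.
    prompt = (
        "Answer the question using ONLY the numbered passages below. "
        "Cite the passage numbers you used, and reply 'NOT FOUND' if the "
        "answer is not present.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm_client.generate(prompt)
```

The design choice the benchmark rewards is visible in the prompt: the model is used as a reader over retrieved text, not as an oracle.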

Multimodal warning

The most concerning data point for product managers is multimodal performance. Scores here are low across the board; even Gemini 2.5 Pro, the leader in this category, reached only 46.9% accuracy.

The benchmark tasks include reading charts, interpreting diagrams, and identifying objects in natural images. An overall accuracy below 50% suggests that multimodal AI is not yet ready for unsupervised data extraction.

The takeaway: if your product roadmap includes AI automatically pulling data from invoices or interpreting financial charts without human review, build a significant error rate into your pipeline.
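One way to absorb that error rate is a confidence-gated review queue, sketched below. The `Extraction` record and the 0.9 threshold are illustrative assumptions for this example, not anything specified by the FACTS release.

```python
from dataclasses import dataclass

# Sketch of a human-in-the-loop gate for model-based document extraction.
# The Extraction record and the 0.9 threshold are illustrative assumptions.

@dataclass
class Extraction:
    field: str         # e.g. "invoice_total"
    value: str         # what the model read off the document
    confidence: float  # model- or heuristic-derived score in [0, 1]

def route_extractions(items: list[Extraction], threshold: float = 0.9):
    """Auto-accept only high-confidence fields; queue everything else for review."""
    accepted, review_queue = [], []
    for item in items:
        (accepted if item.confidence >= threshold else review_queue).append(item)
    return accepted, review_queue
```

With multimodal accuracy below 50%, expect the review queue to be large; the point of the gate is to make that cost visible up front rather than discover it in production.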

Why this matters for your stack

The FACTS benchmark could become a standard reference point for procurement. When evaluating models for enterprise use, technology leaders must go beyond the composite score and drill into the specific sub-benchmarks that match their use case; a simple weighting exercise, like the sketch after the list below, makes that comparison concrete.

  • Building a customer support bot? Check the Grounding score to make sure the bot sticks to your policy documents. (Gemini 2.5 Pro actually beat Gemini 3 Pro here, 74.2 vs. 69.0.)

  • Building a research assistant? Prioritize the Search score.

  • Building an image analysis tool? Proceed with extreme caution.
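As a rough illustration of that kind of sub-benchmark weighting, the sketch below re-scores the published table above for a hypothetical RAG-heavy research assistant. The weights are invented for the example and should be tuned to your own workload mix.

```python
# Published FACTS sub-scores from the table above (percent).
FACTS_SCORES = {
    "Gemini 3 Pro":    {"search": 83.8, "multimodal": 46.1},
    "Gemini 2.5 Pro":  {"search": 63.9, "multimodal": 46.9},
    "GPT-5":           {"search": 77.7, "multimodal": 44.1},
    "Grok 4":          {"search": 75.3, "multimodal": 25.7},
    "Claude 4.5 Opus": {"search": 73.2, "multimodal": 39.2},
}

# Illustrative weights for a workload that mostly does retrieval and only
# occasionally reads charts; these are assumptions, not a recommendation.
WEIGHTS = {"search": 0.8, "multimodal": 0.2}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine sub-benchmark scores using the use-case weights."""
    return sum(WEIGHTS[capability] * scores[capability] for capability in WEIGHTS)

for model, scores in sorted(FACTS_SCORES.items(),
                            key=lambda kv: weighted_score(kv[1]),
                            reverse=True):
    print(f"{model}: {weighted_score(scores):.1f}")
```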

As the FACTS team noted in the release: "The overall accuracy of all evaluated models reaches less than 70%, leaving considerable room for future advances." For now, the message to the industry is clear: models are getting smarter, but they are not yet reliable. Design your systems on the assumption that a raw model can be wrong roughly one-third of the time.


