
In a bid to steal the spotlight from Google ahead of the launch of its new flagship AI model Gemini 3, now ranked by multiple independent evaluators as the world’s most powerful LLM, Elon Musk’s rival AI startup xAI last night unveiled its latest large language model, Grok 4.1.
The model is now live for consumers on Grok.com, the social network X (formerly Twitter), and the company’s iOS and Android mobile apps, and brings significant architectural and usability enhancements, among them faster reasoning, improved emotional intelligence, and significantly reduced hallucination rates. xAI also published a white paper covering its evaluations and offering some detail about the training process.
Across public benchmarks, Grok 4.1 rose to the top of the leaderboards, outperforming rival models from Anthropic, OpenAI, and Google, including Google’s Gemini 2.5 Pro (the newer Gemini 3 has since overtaken it). This builds on the success of xAI’s Grok 4 Fast, which was favorably covered by VentureBeat shortly after its release in September 2025.
However, enterprise developers looking to integrate the new and improved Grok 4.1 into their production environments will notice one major limitation: the model is not yet available through xAI’s public API.
Despite the high benchmark scores, Grok 4.1 is still limited to xAI’s consumer interfaces, and no timeline for API availability has been announced. Currently, only older models can be accessed programmatically through the xAI developer API, including Grok 4 Fast (reasoning and non-reasoning variants), Grok 4 0709, and legacy models such as Grok 3, Grok 3 Mini, and Grok 2 Vision. They support up to 2 million context tokens, and token prices range from $0.20 to $3.00 per million depending on configuration.
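For teams that need programmatic access today, the practical path is to target one of those older models. Below is a minimal sketch of what such a call might look like, assuming xAI’s OpenAI-compatible chat completions endpoint at api.x.ai and a model identifier along the lines of “grok-4-fast-reasoning” (the exact names should be confirmed against xAI’s developer documentation):

```python
# Minimal sketch: calling xAI's developer API (OpenAI-compatible) with an
# older model that is exposed programmatically, since Grok 4.1 itself is not.
# Assumptions: base URL https://api.x.ai/v1 and the model identifier
# "grok-4-fast-reasoning" -- check xAI's docs for the exact values.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],   # key issued from the xAI console
    base_url="https://api.x.ai/v1",      # xAI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-4-fast-reasoning",       # Grok 4.1 has no API identifier yet
    messages=[
        {"role": "system", "content": "You are a concise research assistant."},
        {"role": "user", "content": "Summarize the trade-offs of long-context models."},
    ],
)
print(response.choices[0].message.content)
```

Because the endpoint follows the OpenAI wire format, migrating to Grok 4.1 later should, in principle, require little more than swapping in a new model name once one is published.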
For now, this limits Grok 4.1’s usefulness in enterprise workflows that rely on backend integration, fine-tuned agent pipelines, or scalable internal tools. The consumer rollout positions Grok 4.1 as the most capable LLM in xAI’s portfolio, but production deployment in enterprise environments remains pending.
Model design and deployment strategy
Grok 4.1 is available in two configurations: a fast, low-latency mode that responds immediately, and a “Thinking” mode that performs multiple reasoning steps before producing output.
Both versions are publicly available to end users and can be selected from the model picker in the xAI app.
The two configurations differ not only in latency but also in how deeply the model processes prompts. Grok 4.1 Thinking leverages internal planning and deliberation mechanisms, while the standard version prioritizes speed. Despite the architectural differences, both scored higher than competing models in blind head-to-head comparisons and benchmark tests.
Leading the field with human and expert evaluations
On the LMArena Text Arena leaderboard, Grok 4.1 Thinking briefly held the top spot with a standardized Elo score of 1483, but was dethroned a few hours later by the release of Google’s Gemini 3 and its astounding 1501 Elo score.
The non-thinking version of Grok 4.1 also performs well on the same leaderboard, scoring 1465.
These scores place Grok 4.1 ahead of Google’s Gemini 2.5 Pro, Anthropic’s Claude 4.5 series, and OpenAI’s GPT-4.5 Preview.
In the creative writing category, Grok 4.1 ranks second behind Polaris Alpha (an early GPT-5.1 variant), with the Thinking model scoring 1721.9 on the Creative Writing v3 benchmark, an improvement of about 600 points over previous Grok iterations.
Similarly, on the Arena Expert leaderboard, which aggregates feedback from professional reviewers, Grok 4.1 Thinking once again leads the field with a score of 1510.
This advancement is especially notable considering that Grok 4.1 was released just two months after Grok 4 Fast, highlighting the accelerated pace of development at xAI.
Key improvements over the previous generation
Technically, Grok 4.1 represents a major leap forward in real-world usability. It upgrades the visual capabilities that were limited in Grok 4, improving image and video understanding, including chart analysis and OCR-level text extraction. Multimodal reliability, an issue in previous versions, has now been largely resolved.
Token-level latency was reduced by approximately 28% while maintaining inference depth.
For long-context tasks, Grok 4.1 maintains consistent output up to 1 million tokens, improving on Grok 4’s tendency to degrade above the 300,000-token mark.
xAI also improved the model’s tool orchestration capabilities. Grok 4.1 can plan and execute multiple external tools in parallel, reducing the number of interaction cycles required to complete multi-step queries.
Internal testing logs show that some research tasks that previously required four steps can now be completed in one or two.
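To make the parallel orchestration concrete, here is an illustrative sketch of how a client could service several tool calls returned in a single model turn concurrently, using the OpenAI-compatible response shape that xAI’s current API follows. The tool functions and dispatch table are hypothetical stand-ins, not part of xAI’s documented tooling:

```python
# Illustrative sketch: executing multiple tool calls from one assistant turn
# in parallel, then returning all results before the next model round trip.
# The tools below are hypothetical placeholders.
import json
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace

def search_web(query: str) -> str:          # hypothetical tool
    return f"results for {query!r}"

def fetch_stock_price(ticker: str) -> str:  # hypothetical tool
    return f"price for {ticker}"

TOOLS = {"search_web": search_web, "fetch_stock_price": fetch_stock_price}

def run_tool(call):
    """Execute one tool call and wrap the output as a tool-role message."""
    fn = TOOLS[call.function.name]
    args = json.loads(call.function.arguments)
    return {"role": "tool", "tool_call_id": call.id, "content": fn(**args)}

def handle_assistant_turn(message, messages):
    """Run every tool call from the assistant turn concurrently, then append
    the assistant message and all tool results to the conversation."""
    if not getattr(message, "tool_calls", None):
        return messages
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_tool, message.tool_calls))
    return messages + [message] + results

# Simulated assistant turn containing two parallel tool calls.
fake_turn = SimpleNamespace(tool_calls=[
    SimpleNamespace(id="call_1", function=SimpleNamespace(
        name="search_web", arguments='{"query": "Grok 4.1 benchmarks"}')),
    SimpleNamespace(id="call_2", function=SimpleNamespace(
        name="fetch_stock_price", arguments='{"ticker": "TSLA"}')),
])
print(handle_assistant_turn(fake_turn, []))
```

The point of the pattern is that one model turn can fan out to several tools at once, which is what cuts the number of interaction cycles described above.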
Other refinements include truthfulness tuning, which reduces the model’s tendency to hedge or soften politically sensitive output, and support for different speaking styles and accents for more natural, human-like prosody in voice mode.
Safety and adversarial robustness
xAI evaluated Grok 4.1 for refusal behavior, hallucination rates, sycophancy, and dual-use safety as part of its risk management framework.
The hallucination rate in non-reasoning mode was reduced from 12.09 percent in Grok 4 Fast to just 4.22 percent, a drop of 7.87 points from a 12.09-point baseline, or roughly a 65 percent relative improvement.
The model’s error rate on FActScore, a fact-based QA benchmark, also fell to 2.97 percent, down from 9.89 percent in the previous version.
In the area of adversarial robustness, Grok 4.1 was tested against prompt injection attacks, jailbreak prompts, and sensitive chemistry and biology queries.
The safety filters showed low false-negative rates, particularly on restricted chemistry queries (0.00 percent) and restricted biology queries (0.03 percent).
The model’s ability to resist manipulation in persuasion benchmarks such as MakeMeSay also appears to be strong, with an attacker success rate of 0%.
Limited enterprise access via API
Despite these gains, Grok 4.1 remains unavailable to enterprise users through xAI’s API. According to the company’s public documentation, the latest models available to developers are the Grok 4 Fast variants (reasoning and non-reasoning), each supporting up to 2 million context tokens at a price range of $0.20 to $0.50 per million tokens. These are subject to a throughput limit of 4 million tokens per minute and a rate cap of 480 requests per minute (RPM).
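For developers working against those documented caps, some client-side budgeting is usually needed. The sketch below shows one simple way to stay under a requests-per-minute limit and back off on rate-limit failures; the limiter design and retry policy are illustrative assumptions, not xAI-recommended values:

```python
# Illustrative sketch: a naive sliding-window throttle for a 480 RPM cap,
# plus exponential backoff on failed requests. Values are assumptions chosen
# to mirror the documented limits, not an official client implementation.
import time

class RateLimiter:
    """Tracks request timestamps and blocks when the per-minute budget is used."""
    def __init__(self, max_requests_per_minute: int = 480):
        self.max_rpm = max_requests_per_minute
        self.timestamps: list[float] = []

    def wait_for_slot(self) -> None:
        now = time.monotonic()
        # Keep only timestamps from the last 60 seconds.
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if len(self.timestamps) >= self.max_rpm:
            # Sleep until the oldest request ages out of the window.
            time.sleep(60 - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

limiter = RateLimiter()

def call_with_throttle(make_request, max_retries: int = 3):
    """Respect the RPM budget and retry with exponential backoff on failure."""
    for attempt in range(max_retries):
        limiter.wait_for_slot()
        try:
            return make_request()
        except Exception:             # in practice, catch the SDK's rate-limit error
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s before retrying
    raise RuntimeError("request failed after retries")
```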
In contrast, Grok 4.1 is only accessible through xAI’s consumer properties (X, Grok.com, and mobile apps). This means that organizations cannot yet deploy Grok 4.1 via fine-tuned internal workflows, multi-agent chains, or real-time product integration.
Industry reception and next steps
The release has drawn strong feedback from the public and industry. xAI founder Elon Musk posted a brief endorsement, calling it a “great model” and congratulating the team. Commentary on AI benchmarking platforms has lauded the model’s dramatic improvements in usability and linguistic nuance.
However, for enterprise customers, the picture is more complicated. While Grok 4.1’s performance represents a breakthrough for general-purpose and creative tasks, it remains a consumer-first product with limited enterprise applicability until API access is enabled.
As competing models from OpenAI, Google, and Anthropic continue to evolve, xAI’s next strategic move may hinge on when and how it opens Grok 4.1 to outside developers.
