Moonshot’s Kimi K2 Thinking outperforms GPT-5 and Claude Sonnet 4.5 on key benchmarks, emerging as the leading open source AI



As concerns and skepticism grow about U.S. AI startup OpenAI’s data center buildout strategy and high-spending commitments, Chinese open source AI providers are stepping up the competition, with one provider even catching OpenAI’s flagship paid proprietary model GPT-5 on key third-party performance benchmarks with a new free model.

Released today, Chinese AI startup Moonshot AI’s new Kimi K2 Thinking model has leapfrogged both proprietary and open competitors to take the top spot on benchmarks for reasoning, coding, and agentic tool use.

Despite being completely open source, the model currently outperforms OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5 (thinking mode), and xAI’s Grok-4 in several standard assessments, marking an inflection point in the competitiveness of open AI systems.

Developers can access the model via platform.moonshot.ai and kimi.com. Weights and code are hosted on Hugging Face. The open release includes APIs for chat, reasoning, and multi-tool workflows.
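For a sense of what developer access looks like, here is a minimal sketch of building a chat-completions-style request body. The model identifier "kimi-k2-thinking" and the OpenAI-style payload shape are assumptions for illustration, not details confirmed by this article:

```python
import json

# Hypothetical sketch: the model name and payload shape are assumptions,
# following the common OpenAI-compatible chat-completions convention.
def build_chat_request(prompt: str, model: str = "kimi-k2-thinking") -> str:
    """Build a JSON body for a chat-completions-style API call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload)

body = build_chat_request("Summarize today's AI news in three bullets.")
```

In practice this body would be POSTed to the provider's chat endpoint with an API key; the sketch stops at request construction.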

Users can try Kimi K2 Thinking directly through kimi.com, Moonshot’s own ChatGPT-style chatbot site, as well as in a Hugging Face Space.

Modified Standard Open Source License

Moonshot AI has officially released Kimi K2 Thinking under a modified MIT license on Hugging Face.

The license grants full commercial and derivative rights, meaning individual researchers and developers at commercial companies alike can freely access it and use it in commercial applications. However, it carries one additional condition:

"If the software or derivative works have more than 100 million monthly active users or generate more than $20 million in monthly revenue, deployers must prominently display ‘Kimi K2’ on the product’s user interface."

For most research and enterprise applications, this clause serves as a simple attribution requirement while preserving the freedom of the standard MIT License.
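The trigger condition in that clause is mechanical enough to express directly. A minimal sketch, with thresholds taken from the quoted clause (the function name and argument types are illustrative, not part of the license):

```python
def attribution_required(monthly_active_users: int,
                         monthly_revenue_usd: float) -> bool:
    """True if the quoted clause obliges displaying "Kimi K2" in the UI:
    crossing either the 100M MAU or $20M monthly revenue threshold."""
    return monthly_active_users > 100_000_000 or monthly_revenue_usd > 20_000_000

# A small research deployment stays under both thresholds:
small = attribution_required(50_000, 10_000.0)            # False
# A consumer app with 150M monthly users crosses the MAU threshold:
large = attribution_required(150_000_000, 5_000_000.0)    # True
```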

This makes K2 Thinking one of the most permissively licensed frontier-class models available today.

New benchmark leader

Kimi K2 Thinking is a mixture-of-experts (MoE) model built around 1 trillion parameters, 32 billion of which are activated per inference.

It combines long-horizon reasoning with structured tool use to execute 200 to 300 sequential tool calls without human intervention.

According to the test results published by Moonshot, K2 Thinking achieved the following:

  • 44.9% on Humanity’s Last Exam (HLE), a state-of-the-art score.

  • 60.2% on BrowseComp, an agentic web search and reasoning test.

  • 71.3% on SWE-Bench Verified and 83.1% on LiveCodeBench v6, key coding evaluations.

  • 56.3% on Seal-0, a benchmark for real-world information retrieval.

Across these tasks, K2 Thinking consistently outperforms GPT-5 and surpasses the previous open-weight leader, MiniMax-M2, released a few weeks ago by Chinese rival MiniMax.

Open models demonstrate better performance than proprietary systems

GPT-5 and Claude Sonnet 4.5 Thinking remain the leading proprietary “thinking” models.

Yet on the same benchmark suite, K2 Thinking’s agentic reasoning scores top both: on BrowseComp, for example, the open model leads decisively with 60.2%, versus 54.9% for GPT-5 and 24.1% for Claude Sonnet 4.5.

K2 Thinking also edges out GPT-5 on GPQA Diamond (85.7% vs. 84.5%) and matches it on mathematical reasoning tasks such as AIME 2025 and HMMT 2025.

Only in certain heavy-mode configurations, where GPT-5 aggregates multiple trajectories, does the proprietary model regain parity.

That Moonshot’s fully open-weight release can match or exceed GPT-5’s scores marks a tipping point. In high-end reasoning and coding, the gap between closed frontier systems and publicly available models has virtually collapsed.

Beyond MiniMax-M2: the previous open source leader

When VentureBeat profiled MiniMax-M2 just a week and a half ago, it held the highest scores of any open-weight system and was hailed as the “new king of open source LLMs,” posting:

  • τ²-Bench: 77.2

  • BrowseComp: 44.0

  • FinSearchComp-Global: 65.5

  • SWE-Bench Verified: 69.4

Those results brought MiniMax-M2 close to GPT-5-level capability in agentic tool use. Kimi K2 Thinking now clearly exceeds them.

K2 Thinking’s 60.2% on BrowseComp beats M2’s 44.0%, and its 71.3% on SWE-Bench Verified beats M2’s 69.4%. Even on financial reasoning tasks such as FinSearchComp-T3 (47.4%), K2 Thinking delivers comparable performance while maintaining strong general-purpose reasoning.

Technically, both models employ a sparse mixture-of-experts architecture for computational efficiency, but Moonshot’s network activates more parameters per token and deploys quantization-aware training (INT4 QAT).

This design roughly doubles inference speed compared to standard precision without sacrificing accuracy, which matters for long “thinking token” sessions that stretch across the 256,000-token context window.

Agent Reasoning and Tool Usage

The defining strength of K2 Thinking lies in its explicit reasoning traces. The model exposes an auxiliary field, reasoning_content, which reveals its intermediate logic before each final response. This transparency helps it maintain consistency across long multi-turn tasks and multi-step tool calls.

The reference implementation published by Moonshot shows how the model autonomously executes the “Daily News Report” workflow. This means calling date and web search tools, analyzing retrieved content, and creating structured output, all while maintaining internal reasoning state.

This end-to-end autonomy allows the model to plan, search, execute, and synthesize evidence over hundreds of steps, reflecting an emerging class of “agentic AI” systems that operate with minimal supervision.
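The loop described above can be sketched in the abstract. Everything below is illustrative: the step format, the "reasoning" field standing in for reasoning_content, and the tool registry are assumptions, not Moonshot's actual interface.

```python
from typing import Callable

def run_agent(model_step: Callable[[list], dict],
              tools: dict,
              task: str,
              max_turns: int = 300) -> str:
    """Plan/act loop: each step carries a reasoning trace plus either a
    tool call or a final answer; tool results feed back into history."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        step = model_step(history)
        history.append(step)
        if step.get("final"):
            return step["content"]
        result = tools[step["tool"]](step.get("args", ""))
        history.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_turns")

# Toy demonstration: a stub model that calls a date tool, then answers.
def stub_model(history: list) -> dict:
    if len(history) == 1:
        return {"role": "assistant", "reasoning": "need today's date",
                "tool": "date", "final": False}
    return {"role": "assistant", "reasoning": "have the date",
            "content": "report ready", "final": True}

answer = run_agent(stub_model, {"date": lambda _: "2025-11-06"},
                   "daily news report")
```

The real system interleaves hundreds of such turns; the sketch only shows the control flow that makes that possible.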

Efficiency and access

Despite its trillion-parameter scale, the cost of running K2 Thinking remains modest. Moonshot lists usage pricing as:

  • $0.15 / 1 million input tokens (cache hit)

  • $0.60 / 1 million input tokens (cache miss)

  • $2.50 / 1 million output tokens

These rates are competitive with MiniMax-M2’s $0.30 input / $1.20 output pricing and far lower than GPT-5’s ($1.25 input / $10 output).
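To put those rates in concrete terms, a back-of-envelope comparison using the per-million-token prices quoted above (the 2M-input / 500k-output workload is an arbitrary example, not from the article):

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float, out_rate: float) -> float:
    """Total cost in USD given per-million-token input/output rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# The same hypothetical job priced at K2 Thinking cache-miss rates vs. GPT-5:
k2_cost = cost_usd(2_000_000, 500_000, in_rate=0.60, out_rate=2.50)     # $2.45
gpt5_cost = cost_usd(2_000_000, 500_000, in_rate=1.25, out_rate=10.00)  # $7.50
```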

Comparison context: open-weight acceleration

The rapid succession of M2 and K2 Thinking shows how quickly open source research is catching up to frontier systems. MiniMax-M2 demonstrated that open models can approach GPT-5-class agentic capability at a fraction of the computational cost. Moonshot has now pushed that frontier further, taking open weights past their peers to outright leadership.

Although both models rely on sparse activation for efficiency, K2 Thinking’s larger activated parameter count (32B vs. 10B) provides stronger reasoning fidelity across domains. Test-time scaling (expanding “thinking tokens” and tool-call turns) delivers measurable performance gains without retraining, a capability not yet observed in MiniMax-M2.

Technical outlook

Moonshot reports that K2 Thinking supports native INT4 inference and 256k-token contexts with minimal performance degradation. Its architecture integrates quantization, parallel trajectory aggregation (“heavy mode”), and mixture-of-experts routing tailored for reasoning tasks.

In practice, these optimizations let K2 Thinking sustain complex planning loops that compile, test, and modify code, then search, analyze, and summarize, across hundreds of tool invocations. This capability underpins its strong results on BrowseComp and SWE-Bench, where reasoning continuity is critical.

Significant impact on the AI ecosystem

The convergence of open and closed models at the high end represents a tectonic shift in the AI landscape. Enterprises that once relied solely on proprietary APIs can now deploy open alternatives that match GPT-5-level reasoning while retaining full control over weights, data, and compliance.

Moonshot’s open publication strategy follows the precedent set by DeepSeek R1, Qwen3, GLM-4.6, and MiniMax-M2, but extends it to full agent inference.

K2 Thinking offers academic and enterprise developers both transparency and adaptability: the ability to inspect reasoning traces and to fine-tune agentic performance for specific domains.

K2 Thinking’s arrival signals that Moonshot, a young startup founded in 2023 with backing from China’s biggest app and tech companies, has joined the frontier competition. It also comes amid increased scrutiny of the financial sustainability of AI’s biggest companies.

Just the other day, OpenAI’s chief financial officer Sarah Friar caused a stir at a WSJ Tech Live event when she suggested that the U.S. government might eventually need to provide a “backstop” for the company’s $1.4 trillion in computing and data center commitments, a comment widely interpreted as a call for taxpayer-backed loan guarantees.

Although Friar later clarified that OpenAI was not seeking direct federal support, the episode reignited debate about the scale and concentration of AI capital investment.

As OpenAI, Microsoft, Meta, and Google race to lock up long-term chip supplies, commentators are warning of an unsustainable investment bubble and an “AI arms race” driven more by strategic fear than commercial return. With so many deals and valuations premised on continued massive AI investment and outsized returns, any hesitation or market shock could ripple through the entire global economy.

Against this backdrop, the open-weight releases from Moonshot AI and MiniMax increase pressure on homegrown U.S. AI companies and their backers to justify the scale of their investments and their path to profitability.

Enterprise customers are adopting free, open source Chinese AI alongside paid proprietary solutions such as OpenAI’s GPT-5, Anthropic’s Claude Sonnet 4.5, and Google’s Gemini 2.5 Pro. Why keep paying for access to proprietary models if the same or better performance is freely available? Already, Silicon Valley powerhouses like Airbnb have raised eyebrows by admitting they use Chinese open source alternatives such as Alibaba’s Qwen more than OpenAI’s own products.

For investors and businesses, these developments suggest that high-end AI capability is no longer synonymous with high-end capital expenditure. The most advanced reasoning systems may come not from companies building gigascale data centers, but from research groups that optimize architecture and quantization for efficiency.

In that sense, K2 Thinking’s benchmark dominance is not just a technical milestone but a strategic one, reached at a moment when the AI market’s biggest question has shifted from how powerful models can become to who can afford to keep building them.

What the future means for companies

Within weeks of MiniMax-M2’s rise, Kimi K2 Thinking has overtaken it, along with GPT-5 and Claude Sonnet 4.5, on nearly all reasoning and agentic benchmarks.

The model shows that an open-weight system can match or exceed proprietary frontier models in both capability and efficiency.

For the AI research community, K2 Thinking represents more than just an open model; it is evidence that the frontier has become collaborative.

The best-performing inference models available today are open-source systems that are publicly accessible, rather than closed commercial products.
