
Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded in one of the most demanding tests of artificial intelligence. The claim sent ripples of surprise, skepticism, and genuine curiosity throughout the tech industry.
The San Jose-based company said its AI system scored 48.1 percent on Humanity’s Last Exam (HLE), a benchmark built from questions contributed by subject-matter experts around the world and designed to stump even the most advanced AI models. The result beats the previous record holder, Google’s Gemini 3 Pro, which scored 45.8 percent.
"Zoom achieved new state-of-the-art results on the challenging Humanity’s Last Exam full-set benchmark, achieving a score of 48.1%. This represents a significant improvement of 2.3% compared to previous SOTA results." Xuedong Huang, Zoom’s chief technology officer, said in a blog post.
The announcement raised provocative questions that have puzzled AI watchers for days. How did a video conferencing company with no public history of training large language models suddenly overtake Google, OpenAI, and Anthropic on a benchmark built to measure the cutting edge of machine intelligence?
The answer reveals a lot not only about Zoom’s technological ambitions, but also about where AI is headed. And depending on whom you ask, it is either an ingenious piece of practical engineering or an attempt to claim credit for someone else’s research.
How Zoom built an AI traffic controller instead of training its own models
Zoom did not train its own large language model. Instead, the company uses what it calls a "federated AI" approach: a system that routes queries to existing models from OpenAI, Google, and Anthropic, then uses proprietary software to select, combine, and refine their outputs.
At the heart of this system is what Zoom calls the "Z-scorer," a mechanism that evaluates responses from different models and selects the best one for a given task. The company pairs this with what it describes as an "exploration, verification, collaboration" strategy: agentic workflows that balance exploratory reasoning and verification across multiple AI systems.
"Our federation approach combines Zoom’s unique small language model with advanced open and closed source models." Huang wrote. framework "Align diverse models and generate, challenge, and refine inferences through dialectical collaboration."
Simply put, Zoom built advanced traffic controllers for AI, not AI itself.
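Zoom has not published implementation details, but the general pattern is familiar. Below is a minimal sketch, in Python, of what a "route, score, select" loop could look like. Every name here is hypothetical and nothing reflects Zoom’s actual code; the Z-scorer is stood in for by a generic scoring function.

```python
# Illustrative sketch only: a "route, score, select" loop over several
# models. All names are hypothetical; nothing here reflects Zoom's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    model_name: str
    answer: str
    score: float  # quality estimate assigned by the scorer

def federated_answer(
    query: str,
    models: dict[str, Callable[[str], str]],
    scorer: Callable[[str, str], float],
) -> Candidate:
    """Send the query to every model, score each response, keep the best."""
    candidates = []
    for name, generate in models.items():
        answer = generate(query)         # in practice, an external API call
        quality = scorer(query, answer)  # stand-in for something like a Z-scorer
        candidates.append(Candidate(name, answer, quality))
    return max(candidates, key=lambda c: c.score)

# Toy usage with stub models and a placeholder heuristic scorer.
toy_models = {
    "model_a": lambda q: f"model_a's answer to: {q}",
    "model_b": lambda q: f"model_b's answer to: {q}",
}
best = federated_answer("What is 2 + 2?", toy_models, lambda q, a: float(len(a)))
print(best.model_name, best.answer)
```

In a production system, the scoring function would almost certainly be a trained model in its own right; that is where the proprietary value, if any, would live.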
In an industry where bragging rights and multibillion-dollar valuations hinge on who can claim the most capable model, this distinction matters. Major AI labs spend hundreds of millions of dollars training frontier systems on vast computing clusters. Zoom’s result, by contrast, appears to rest on clever integration of those existing systems.
Why AI researchers disagree on what counts as true innovation
Reaction from the AI community was swift and sharply divided.
Max Rumpf, an AI engineer who says he has trained state-of-the-art language models himself, posted a scathing critique on social media. "Zoom strung together API calls to Gemini, GPT, Claude, and more," he wrote. "They eked out a small improvement on a benchmark that provides no value to their customers, and they claim SOTA."
Rumpf did not dismiss the technical approach itself. Using multiple models for different tasks, he noted, is "actually very sensible, and most applications should do this." He cited Sierra, an AI customer service company, as an example of a multi-model strategy executed well.
His objection was more specific: "They did not train a model, but obscured this fact in their tweets. Taking credit for someone else’s work doesn’t sit well with people."
Other observers, however, saw the achievement differently. Developer Hongcheng Zhu offered a more measured assessment: "Topping AI evaluations will likely require model federation, as Zoom has done. Kaggle competitors, for example, know they need to ensemble their models to win a contest."
The comparison to Kaggle, the competitive data science platform where combining multiple models is standard practice for winning teams, reframes Zoom’s approach as established best practice rather than a gimmick. Academic research has long shown that ensemble methods routinely outperform individual models.
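The intuition is easy to demonstrate. Below is a minimal sketch of the simplest ensemble method, majority voting; Zoom’s scoring-based selection is more elaborate, but the underlying logic of pooling independent models is the same.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer among several models' outputs.
    If models err independently, the consensus is right more often
    than any single model is on its own."""
    return Counter(answers).most_common(1)[0][0]

# Three hypothetical models answer the same question; two agree.
print(majority_vote(["Paris", "Paris", "Lyon"]))  # -> Paris
```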
Still, the debate exposed fault lines in how the industry defines progress. Ryan Pream, founder of Exoria AI, was dismissive: "Zoom is simply building a harness around other LLMs and reporting the result. It’s just noise." Another commenter captured the sheer surprise of the news: "The fact that video conferencing app ZOOM has developed a SOTA model that achieves 48% HLE was not on my bingo card."
Perhaps the sharpest criticism concerned priorities. Rumpf argued that Zoom should be focusing its resources on the real problems its customers face. "Retrieving call transcripts is not ‘solved’ by a SOTA LLM," he wrote. "I think Zoom users will care much more about this than HLE."
Microsoft veteran stakes his reputation on a different kind of AI
If Zoom’s benchmark results seemed to come out of nowhere, the company’s chief technology officer didn’t.
Xuedong Huang joined Zoom from Microsoft, where he spent decades building the company’s AI capabilities. He founded Microsoft’s Speech Technology Group in 1993 and led a team that achieved what the company describes as human parity in speech recognition, machine translation, natural language understanding, and computer vision.
Huang holds a Ph.D. in electrical engineering from the University of Edinburgh. He has been elected to the National Academy of Engineering and the American Academy of Arts and Sciences, and is a Fellow of both the IEEE and the ACM. His credentials place him among the most decorated AI executives in the industry.
His presence at Zoom signals that the company is serious about its AI ambitions, even if its methods differ from those of the headline-grabbing research labs. In a tweet celebrating the benchmark results, Huang called the achievement a validation of Zoom’s strategy: "We have surpassed the performance limits of a single model and unlocked more powerful capabilities in exploration, reasoning, and multi-model collaboration."
That phrase, "surpassed the performance limits of a single model," may be the most important part. Huang isn’t claiming that Zoom built a better model. He’s claiming that Zoom built a better system for using models.
Inside a test designed to stump the world’s smartest machines
The benchmark at the heart of this debate, Humanity’s Last Exam, is designed to be brutally difficult. Unlike earlier benchmarks that AI systems could game through pattern matching, HLE poses problems that require genuine understanding, multi-step reasoning, and the synthesis of information across complex domains.
The exam draws on questions from experts around the world, ranging from advanced mathematics to philosophy to specialized scientific knowledge. A score of 48.1 percent may not sound impressive to anyone accustomed to school grading curves, but in the context of HLE it represents the current upper limit of machine performance.
"Developed by subject matter experts around the world, this benchmark is a key metric for measuring AI’s progress toward human-level performance on difficult intellectual tasks." Zoom’s announcement pointed out that:
On its own, a 2.3-percentage-point improvement over Google’s previous record may seem modest. But competitive benchmarks often move by fractions of a point at a time, so a jump of that size attracts attention.
Zoom’s approach reveals the future of enterprise AI
Zoom’s approach has implications that reach far beyond benchmark leaderboards. It represents a fundamentally different vision for enterprise AI than the model-centric strategies pursued by OpenAI, Anthropic, and Google.
Rather than going all in on building the single most capable model, Zoom positions itself as an orchestration layer: a company that integrates the best capabilities from multiple providers and delivers them through products businesses already use every day.
This strategy hedges against a major uncertainty in the AI market: no one knows which model will be best next month, let alone next year. By building infrastructure that can swap between providers, Zoom avoids vendor lock-in while, in theory, giving customers the best AI for any given task, as the sketch below illustrates.
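In practice, provider-swapping often comes down to configuration rather than code. The snippet below is a deliberately simplified illustration of that idea; the task names and model identifiers are invented for the example and imply nothing about Zoom’s product.

```python
# Hypothetical task-to-model routing table. Changing providers is a
# config edit, not a rewrite; model identifiers here are placeholders.
ROUTING_TABLE = {
    "meeting_summary": "provider_a/model-x",
    "action_items":    "provider_b/model-y",
    "open_ended_qa":   "provider_c/model-z",
}

DEFAULT_MODEL = "provider_a/model-x"

def route(task: str) -> str:
    """Return the configured model for a task, falling back to a default."""
    return ROUTING_TABLE.get(task, DEFAULT_MODEL)

print(route("action_items"))  # -> provider_b/model-y
```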
OpenAI’s GPT-5.2 announcement the following day underscored this dynamic. In its own communications, OpenAI named Zoom as a partner that had evaluated the new model’s performance, quoting the company as seeing "visible gains across the board across our AI workloads." In other words, Zoom is both a customer of the frontier labs and a benchmark competitor using their technology.
The arrangement may well prove sustainable: major model providers have every incentive to sell API access broadly, even to companies that aggregate their outputs. The more interesting question is whether Zoom’s orchestration capabilities constitute genuine intellectual property or merely sophisticated prompt engineering that others can replicate.
The real test is when Zoom’s 300 million users start asking questions
Zoom titled a section of its announcement "A Collaborative Future," and Huang leaned into that theme throughout. "The future of AI is collaboration, not competition," he wrote. "By combining the best innovations from across the industry with our own research advances, we create solutions that are greater than the sum of their parts."
That framing positions Zoom as a valuable integrator, bringing together the best of the industry for the benefit of enterprise customers. Critics see it differently: a company claiming the prestige of an AI lab without doing the fundamental research to earn it.
The debate is likely to be settled by product, not leaderboards. When AI Companion 3.0 rolls out to Zoom’s hundreds of millions of users in the coming months, those users will judge for themselves whether meeting summaries capture what matters, whether action items make sense, and whether the AI saves or wastes their time. They won’t be judging it against benchmarks they have never heard of.
In the end, Zoom’s most provocative claim may not be that it topped a benchmark. It is the implicit argument that, in the age of AI, the best models are not the ones you build, but the ones you know how to use.
