The rise of deep research and other AI-powered analysis features has spawned more models and services aiming to simplify that process and read more of the documents that businesses actually use.
Canadian AI company Cohere is banking on its models, including its newly released vision model, and argues that deep research capabilities should be optimized for enterprise use cases as well.
The company has released Command A Vision, a vision model specifically targeting enterprise use cases and built on the back of its Command A model. The 112-billion-parameter model “can unlock valuable insights from visual data and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis,” the company says.
“Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges,” the company said in a blog post.
This means Command A Vision can read and analyze the kinds of images enterprises need most: graphs, charts, diagrams, scanned documents and PDFs.
Because it is built on Command A’s architecture, Command A Vision requires no more than two GPUs, just like the text model. The vision model also retains Command A’s text capabilities, letting it read words on images and understand at least 23 languages. Cohere says that, unlike other models, Command A Vision reduces a company’s total cost of ownership and is fully optimized for enterprise search use cases.
Architecting Command A Vision
Cohere said it built Command A Vision following the LLaVA architecture, which turns visual features into soft vision tokens that can be split into different tiles.
These tiles are passed into Command A’s text tower, “a dense, 111-billion-parameter textual LLM,” the company said. “In this way, a single image consumes up to 3,328 tokens.”
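To make the token accounting concrete, here is a minimal sketch of how a LLaVA-style tiling scheme translates image size into a vision-token budget. The tile resolution and tokens-per-tile values below are illustrative assumptions, not Cohere's published parameters; only the 3,328-token ceiling per image comes from the company's statement.

```python
import math

# Assumed values for illustration only; Cohere has not published these.
TILE_SIZE = 512          # assumed tile resolution in pixels
TOKENS_PER_TILE = 256    # assumed soft vision tokens per tile
MAX_IMAGE_TOKENS = 3328  # per-image ceiling stated in Cohere's blog post

def image_token_cost(width: int, height: int) -> int:
    """Estimate how many vision tokens an image consumes after tiling."""
    tiles_x = math.ceil(width / TILE_SIZE)
    tiles_y = math.ceil(height / TILE_SIZE)
    tokens = tiles_x * tiles_y * TOKENS_PER_TILE
    # Large images are capped at the stated per-image maximum.
    return min(tokens, MAX_IMAGE_TOKENS)

# A 1024x1024 scan splits into 4 tiles under these assumptions.
print(image_token_cost(1024, 1024))  # → 1024
# A very large page hits the stated 3,328-token ceiling.
print(image_token_cost(4096, 4096))  # → 3328
```

The cap matters for cost planning: however large the source document page, a single image never consumes more than 3,328 tokens of the model's context.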
Cohere said it trained the vision model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).
“This approach allows for the mapping of image encoder features into the language model’s embedding space,” the company said. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks.”
Visualizing enterprise AI
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Cohere pitted Command A Vision against OpenAI’s GPT-4.1, Meta’s Llama 4 Maverick, Mistral’s Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral OCR, Mistral’s OCR-focused API.
Command A Vision surpassed the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision’s average score was 83.1%, compared with 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
Most recent large language models (LLMs) are multimodal, meaning they can generate or understand visual media such as photos and videos. But enterprises generally work with graphical documents such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.
The rise of deep research has made it more important to bring in models that can read, analyze and even download unstructured data.
Cohere also said it is offering Command A Vision with open weights, in the hopes that enterprises looking to move away from closed or proprietary models will begin using its products. So far, there has been interest from developers.
