The rise of deep research and other AI-powered analysis features has spawned more models and services aiming to simplify that process and read more of the documents that businesses actually use.
Canadian AI company Cohere is banking on its models, including its newly released vision model, and argues that deep research capabilities should be optimized for enterprise use cases as well.
The company has released Command A Vision, a vision model specifically targeting enterprise use cases and built on the back of its Command A model. The 112-billion-parameter model “can unlock valuable insights from visual data and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis,” the company says.
“Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise vision challenges,” the company said in a blog post.
This means Command A Vision can read and analyze the kinds of images enterprises need most: graphs, charts, diagrams, scanned documents and PDFs.
Because it is built on Command A’s architecture, Command A Vision requires no more than two GPUs, just like the text model. The vision model also retains Command A’s text capabilities, letting it read words on images and understand at least 23 languages. Cohere says that, unlike other models, Command A Vision reduces a company’s total cost of ownership and is fully optimized for enterprise search use cases.
Architecting Command A Vision
Cohere said it built Command A Vision following the LLaVA architecture, which turns visual features into soft vision tokens that can be split into different tiles.
These tiles are passed into Command A’s text tower, “a dense, 111-billion-parameter textual LLM,” the company said. “In this way, a single image consumes up to 3,328 tokens.”
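To make the token accounting concrete, here is a minimal sketch of how a LLaVA-style tiling scheme translates image size into a vision-token budget. The tile resolution and tokens-per-tile values below are illustrative assumptions, not Cohere's published parameters; only the 3,328-token ceiling per image comes from the company's statement.

```python
import math

# Assumed values for illustration only; Cohere has not published these.
TILE_SIZE = 512          # assumed tile resolution in pixels
TOKENS_PER_TILE = 256    # assumed soft vision tokens per tile
MAX_IMAGE_TOKENS = 3328  # per-image ceiling stated in Cohere's blog post

def image_token_cost(width: int, height: int) -> int:
    """Estimate how many vision tokens an image consumes after tiling."""
    tiles_x = math.ceil(width / TILE_SIZE)
    tiles_y = math.ceil(height / TILE_SIZE)
    tokens = tiles_x * tiles_y * TOKENS_PER_TILE
    # Large images are capped at the stated per-image maximum.
    return min(tokens, MAX_IMAGE_TOKENS)

# A 1024x1024 scan splits into 4 tiles under these assumptions.
print(image_token_cost(1024, 1024))  # → 1024
# A very large page hits the stated 3,328-token ceiling.
print(image_token_cost(4096, 4096))  # → 3328
```

The cap matters for cost planning: however large the source document page, a single image never consumes more than 3,328 tokens of the model's context.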
Cohere said it trained the vision model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).
“This approach allows for the mapping of image encoder features into the language model’s embedding space,” the company said. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks.”
Visualizing enterprise AI
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Cohere pitted Command A Vision against OpenAI’s GPT-4.1, Meta’s Llama 4 Maverick, Mistral’s Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral OCR, Mistral’s OCR-focused API.
Command A Vision surpassed the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision’s average score was 83.1%, compared with 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
Most recent large language models (LLMs) are multimodal, meaning they can generate or understand visual media such as photos and videos. But enterprises generally work with graphical documents such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.
The rise of deep research has made it more important to bring in models that can read, analyze and even download unstructured data.
Cohere also said it is offering Command A Vision with open weights, in the hopes that enterprises looking to move away from closed or proprietary models will begin using its products. So far, there has been interest from developers.
