
Chinese AI startup Zhipu AI, also known as Z.ai, has released the GLM-4.6V series, a new generation of open-source vision-language models (VLMs) optimized for multimodal inference, front-end automation, and high-efficiency deployment.
This release includes two models, one "big" and one "small":
- GLM-4.6V (106B): a larger, 106-billion-parameter model for cloud-scale inference.
- GLM-4.6V-Flash (9B): a smaller model with only 9 billion parameters, designed for low-latency local applications.
Recall that, generally speaking, a model with more parameters (the internal settings, i.e. weights and biases, that govern its behavior) is more capable and performs better across a more diverse range of tasks.
However, smaller models are more efficient for edge and real-time applications where latency and resource constraints are important.
The distinctive innovation of this series is native function calling for vision: the models can invoke tools such as visual search, image cropping, and chart recognition directly on visual inputs.
With a context length of 128,000 tokens (roughly 300 pages of novel-length text in a single input/output exchange) and state-of-the-art (SoTA) results across more than 20 benchmarks, the GLM-4.6V series positions itself as a competitive alternative to both closed and open-source VLMs. The models are available through the following channels:
- API access via an OpenAI-compatible interface (see the sketch after this list)
- A live demo on Zhipu’s web interface
- Downloadable weights on Hugging Face
- A desktop assistant app available on Hugging Face Spaces
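For teams starting with the hosted API, the OpenAI-compatible interface means existing client libraries work largely unchanged. Below is a minimal sketch in Python; the base URL, environment variable, and model identifier are assumptions for illustration and should be checked against Z.ai's API documentation.

```python
# Minimal sketch: calling GLM-4.6V through an OpenAI-compatible endpoint.
# The base URL, env var name, and model id below are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],          # hypothetical env var
    base_url="https://api.z.ai/api/paas/v4/",   # assumed endpoint
)

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text",
                 "text": "Summarize the main trend shown in this chart."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```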
Licensing and corporate use
GLM‑4.6V and GLM‑4.6V‑Flash are distributed under the MIT License, a permissive open-source license that permits free commercial and noncommercial use, modification, redistribution, and local deployment, with no obligation to open-source derivative works.
This licensing model makes the series suitable for enterprise deployments, including scenarios that require complete control of infrastructure, compliance with internal governance, or air-gapped environments.
Model weights and documentation are published on Hugging Face, and supporting code and tools are available on GitHub.
The MIT license ensures maximum flexibility for integration into your own systems, including internal tools, production pipelines, and edge deployments.
Architecture and technology
The GLM-4.6V models follow a traditional encoder/decoder architecture adapted for multimodal inputs.
Both models pair a Vision Transformer (ViT) encoder based on AIMv2-Huge with an MLP projector that aligns visual features with a Large Language Model (LLM) decoder.
The video input benefits from 3D convolution and temporal compression, while spatial encoding is handled using 2D-RoPE and bicubic interpolation with absolute position embedding.
An important technical feature is that the system supports any image resolution and aspect ratio, including wide panoramic inputs up to 200:1.
In addition to parsing static images and documents, GLM-4.6V ingests temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal inference.
On the decoding side, the model supports token generation aligned with function call protocols, allowing structured inference across text, images, and tool output. This is supported by an extended tokenizer vocabulary and output formatting templates to ensure consistent API or agent compatibility.
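To make the encoder-projector-decoder flow concrete, here is a deliberately simplified sketch of how visual patch features might be mapped into an LLM's embedding space. All module choices and dimensions are illustrative assumptions and do not reproduce the actual GLM-4.6V implementation.

```python
# Toy sketch of the described layout: a ViT-style encoder feeding an MLP
# projector that produces tokens in the LLM's embedding space.
import torch
import torch.nn as nn

class VisionToLLMBridge(nn.Module):
    def __init__(self, vit_dim=1536, llm_dim=4096):
        super().__init__()
        # Stand-in for the AIMv2-Huge ViT encoder (hypothetical stub).
        self.vit = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vit_dim, nhead=12, batch_first=True),
            num_layers=2,
        )
        # MLP projector aligning visual features with the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, vit_dim), produced from an
        # image of any resolution/aspect ratio after patchification.
        visual_tokens = self.vit(patch_embeddings)
        return self.projector(visual_tokens)  # ready to prepend to LLM inputs

bridge = VisionToLLMBridge()
fake_patches = torch.randn(1, 256, 1536)   # one image, 256 patches
print(bridge(fake_patches).shape)          # torch.Size([1, 256, 4096])
```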
Using native multimodal tools
GLM-4.6V introduces native multimodal function calls, allowing you to pass visual assets such as screenshots, images, and documents directly to tools as parameters. This eliminates the need for intermediate text-only conversions, which previously introduced information loss and complexity.
The tool invocation mechanism works in both directions:
- On the input side, tools can be passed images or videos directly, for example document pages to be cropped or analyzed (see the sketch after this list).
- On the output side, tools such as chart renderers and web snapshot utilities return visual data, which GLM-4.6V integrates directly into the inference chain.
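The snippet below sketches how such a visual tool might be declared and exposed to the model through an OpenAI-style tools parameter. The crop_image tool, its schema, the endpoint, and the model identifier are hypothetical; Z.ai's published tool protocol may differ.

```python
# Hedged sketch: exposing a hypothetical image-cropping tool to GLM-4.6V
# via the OpenAI-style "tools" parameter.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["ZAI_API_KEY"],
                base_url="https://api.z.ai/api/paas/v4/")  # assumed endpoint

crop_tool = {
    "type": "function",
    "function": {
        "name": "crop_image",                      # hypothetical tool
        "description": "Crop a region from one of the input images.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_index": {"type": "integer",
                                "description": "Which input image to crop."},
                "bbox": {"type": "array", "items": {"type": "number"},
                         "description": "[x0, y0, x1, y1] in pixels."},
            },
            "required": ["image_index", "bbox"],
        },
    },
}

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    tools=[crop_tool],
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/paper_page3.png"}},
            {"type": "text",
             "text": "Extract the architecture diagram from this page."},
        ],
    }],
)
# If the model chooses to call the tool, its arguments reference the image
# directly rather than a lossy text description of it.
print(response.choices[0].message.tool_calls)
```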
In practice, this means that GLM-4.6V can complete tasks such as:
- Generating structured reports from mixed-format documents
- Performing visual audits of candidate images
- Automatically cropping diagrams out of papers during generation
- Performing visual web searches and answering multimodal queries
High benchmark performance compared to other similarly sized models
GLM-4.6V was evaluated across more than 20 public benchmarks covering general VQA, chart understanding, OCR, STEM reasoning, front-end replication, and multimodal agents.
According to the benchmark chart released by Zhipu AI:
- GLM-4.6V (106B) achieves SoTA or near-SoTA scores among open-source models of comparable size on benchmarks such as MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, and TreeBench.
- GLM-4.6V-Flash (9B) outperforms other lightweight models (Qwen3-VL-8B, GLM-4.1V-9B, etc.) across nearly every category tested.
- The 106B model’s 128K-token window allows it to outperform larger models such as Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summarization, and structured multimodal inference.
Examples of leaderboard scores include:
- MathVista: 88.2 (GLM-4.6V) vs. 84.6 (GLM-4.5V) vs. 81.4 (Qwen3-VL-8B)
- WebVoyager: 81.0 vs. 68.4 (Qwen3-VL-8B)
- Ref-L4 test: 88.9 vs. 89.5 (GLM-4.5V), although grounding fidelity is better for the Flash model at 87.7 vs. 86.8
Both models are evaluated using the vLLM inference backend and support SGLang for video-based tasks.
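For local experimentation, a rough offline-inference sketch with vLLM might look like the following. The Hugging Face repository id, context length, and multimodal prompt format are assumptions; the official model card is the authoritative reference.

```python
# Rough sketch: offline multimodal inference with vLLM (the backend the
# article says was used for evaluation). Repo id and settings are assumed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.6V-Flash",        # assumed HF repo id (9B variant)
    limit_mm_per_prompt={"image": 4},      # allow a few images per prompt
    max_model_len=32768,                   # trimmed for local experimentation
)

sampling = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.chat(
    [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/invoice.png"}},
            {"type": "text", "text": "List the line items on this invoice."},
        ],
    }],
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```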
Front-end automation and long-context workflows
Zhipu AI highlighted GLM-4.6V’s capabilities for front-end development workflows. The model can:
- Clone pixel-accurate HTML/CSS/JS from UI screenshots
- Accept natural-language editing commands to change layouts
- Visually identify and interact with specific UI components
This functionality is integrated into an end-to-end visual programming interface, and the model uses a native understanding of screen capture to iterate through layout, design intent, and output code.
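A plausible way to drive this screenshot-to-code loop through the API is sketched below: submit a UI screenshot, request an HTML reproduction, then apply a natural-language edit in a follow-up turn. The endpoint, model identifier, and image URL are assumptions, as before.

```python
# Hedged sketch of a screenshot-to-code loop with a follow-up edit.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["ZAI_API_KEY"],
                base_url="https://api.z.ai/api/paas/v4/")  # assumed endpoint

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/dashboard_screenshot.png"}},
        {"type": "text",
         "text": "Reproduce this dashboard as a single self-contained HTML file."},
    ],
}]
first = client.chat.completions.create(model="glm-4.6v", messages=messages)

# Natural-language edit applied on top of the generated layout.
messages += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "Move the sidebar to the right and use a dark theme."},
]
second = client.chat.completions.create(model="glm-4.6v", messages=messages)
print(second.choices[0].message.content)
```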
In long document scenarios, GLM-4.6V can process up to 128,000 tokens, allowing a single inference pass for:
- 150 pages of text (input)
- A 200-slide deck
- One hour of video
Zhipu AI reported successful use of the model in financial analysis across multiple document corpora and in summarizing full-length sports broadcasts using time-stamped event detection.
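For a long-document pass like the financial-analysis use case above, one approach is to encode every page image into a single request and let the 128K window absorb it. The sketch below is illustrative only; the file paths are hypothetical, and the API's per-request image limits may be lower than 150.

```python
# Rough sketch: a single-pass long-document request built from page images.
import base64, glob, os
from openai import OpenAI

client = OpenAI(api_key=os.environ["ZAI_API_KEY"],
                base_url="https://api.z.ai/api/paas/v4/")  # assumed endpoint

content = []
for path in sorted(glob.glob("report_pages/*.png"))[:150]:  # hypothetical pages
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"}})
content.append({"type": "text",
                "text": "Summarize the key financial risks across all pages."})

resp = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```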
Training and reinforcement learning
The model was trained using multi-stage pre-training followed by supervised fine-tuning (SFT) and reinforcement learning (RL). Key innovations include:
- Reinforcement Learning with Curriculum Sampling (RLCS): dynamically adjusts the difficulty of training samples based on model progress.
- Multi-domain reward systems: task-specific verification tools for STEM, chart reasoning, GUI agents, video QA, and spatial grounding.
- Function-aware training: structured tags (such as <|begin_of_box|>) that format the model’s reasoning and answers consistently.
The reinforcement learning pipeline emphasizes reinforcement learning with verifiable rewards (RLVR) over reinforcement learning from human feedback (RLHF) for scalability, and omits KL and entropy losses to stabilize training across multimodal domains.
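As a rough illustration of the curriculum-sampling idea, the toy sketch below weights training tasks by how informative they currently are, so tasks the model solves about half the time get sampled most often. This is a conceptual reconstruction, not Zhipu AI's actual RLCS implementation.

```python
# Toy illustration of curriculum sampling driven by a rolling success rate.
import random
from collections import defaultdict

success_rate = defaultdict(lambda: 0.5)   # per-task rolling success estimate

def sampling_weight(task_id: str) -> float:
    # Tasks solved ~half the time are most informative; tasks the model
    # always solves or always fails contribute little training signal.
    p = success_rate[task_id]
    return max(p * (1.0 - p), 1e-3)

def sample_batch(task_ids, batch_size=8):
    weights = [sampling_weight(t) for t in task_ids]
    return random.choices(task_ids, weights=weights, k=batch_size)

def update(task_id: str, solved: bool, momentum: float = 0.9):
    # Exponential moving average of whether the model solved the task.
    success_rate[task_id] = momentum * success_rate[task_id] + (1 - momentum) * float(solved)

tasks = [f"task_{i}" for i in range(100)]
print(sample_batch(tasks))
```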
Pricing (API)
Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and its lightweight variant priced for broad accessibility.
- GLM-4.6V: $0.30 (input) / $0.90 (output) per million tokens
- GLM-4.6V-Flash: free
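At the rates listed above, per-workload costs are straightforward to estimate; the token volumes in the sketch below are hypothetical and serve only to illustrate the arithmetic.

```python
# Quick cost estimate at the listed GLM-4.6V rates
# ($0.30 input / $0.90 output per million tokens).
INPUT_RATE = 0.30 / 1_000_000    # USD per input token
OUTPUT_RATE = 0.90 / 1_000_000   # USD per output token

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical workload: 500M input tokens and 50M output tokens per month.
print(f"${monthly_cost(500_000_000, 50_000_000):,.2f}")  # $195.00
```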
Compared to leading vision-enabled and text-first LLMs, GLM-4.6V is one of the most cost-effective for large-scale multimodal inference. Below is a comparison snapshot of prices between providers.
USD per million tokens, sorted by total cost (lowest to highest)

| Model | Input | Output | Total cost | Provider |
| --- | --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 | Baidu Qianfan |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non-reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| DeepSeek Chat (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek Reasoner (V3.2-Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| GLM-4.6V | $0.30 | $0.90 | $1.20 | Z.ai |
| Qwen 3 Plus | $0.40 | $1.20 | $1.60 | Alibaba Cloud |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Baidu Qianfan |
| Qwen Max | $1.60 | $6.40 | $8.00 | Alibaba Cloud |
| GPT-5.1 | $1.25 | $10.00 | $11.25 | OpenAI |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 | Google |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 | Google |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 | xAI |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 | Anthropic |
Previous releases: GLM‑4.5 series and enterprise applications
Prior to GLM‑4.6V, Z.ai released the GLM‑4.5 family in mid-2025, establishing the company as a leading contender in open-source LLM development.
Both the flagship GLM-4.5 and its smaller sibling, GLM-4.5-Air, support inference, tool usage, coding, and agent behavior, and provide strong performance across standard benchmarks.
The series introduces two reasoning modes (‘thinking’ and ‘non-thinking’) and can automatically generate a complete PowerPoint presentation from a single prompt, a feature intended for corporate reporting, education, and internal communication workflows. Z.ai has also expanded the GLM-4.5 series with additional variants such as GLM-4.5-X, AirX, and Flash, targeting ultra-fast inference and low-cost scenarios.
Together, these features position the GLM‑4.5 series as a cost-effective, open, and production-ready alternative for enterprises that require autonomy for model deployment, lifecycle management, and integrated pipelines.
Ecosystem impact
The GLM-4.6V release represents a significant advancement in open source multimodal AI. Large-scale vision language models have proliferated over the past year, but few offer the following:
- Integrated visual tool use
- Structured multimodal generation
- Agent-oriented memory and decision logic
Zhipu AI’s focus on “closing the loop” from perception to action through native function calls represents a step toward agentic multimodal systems.
The model’s architecture and training pipeline demonstrate the continued evolution of the GLM family, placing it competitively alongside products such as OpenAI’s GPT-4V and Google DeepMind’s Gemini-VL.
Key points for corporate leaders
With GLM-4.6V, Zhipu AI introduces an open-source VLM that supports native visual tool use, long-context inference, and front-end automation. It sets new performance marks among similarly sized models and provides a scalable platform for building agentic and multimodal AI systems.
