In the Agent era, model capabilities are defined by two dimensions: model intelligence and the capacity of context they can handle. A foundational coding model that natively processes multimodal contexts such as images, videos, and text, while excelling in complex programming, long-term planning, and action execution, will be the cornerstone of all AI-native applications.
Today, we are launchingGLM-5V-Turbo, a multimodal foundational coding model designed for visual programming.
GLM-5V-Turbo deeply integrates visual and textual capabilities from the pre-training stage, allowing programming to go beyond pure text input. The model understands design drafts, screenshots, and web interfaces, generating fully operational code accordingly, truly achieving the ability to comprehend visuals and produce code.
Key takeaways are as follows:
– Native multimodal coding foundation: Natively comprehends multimodal inputs like images, videos, design drafts, and document layouts, supporting multimodal tool use such as framing, screenshotting, and reading web pages. With an extended context window of up to 200k tokens, it expands the perception-action chain of Agents from pure text to visual interaction.
– Balances visual and coding capabilities: Achieves leading performance on core benchmarks like multimodal coding, tool usage, and GUI agents. Through techniques such as multi-task collaborative reinforcement learning, it ensures no degradation in coding, reasoning, or tool-use abilities in pure text scenarios.
– Deeply adapted to Claude Code and lobster scenarios: Deep collaboration with agents such as Claude Code and OpenClaw/AutoClaw, supporting a complete closed-loop of 'understanding the environment → planning actions → executing tasks,' while providing a full set of official skills ready for use right out of the box.
Multimodal Coding foundation
In evaluation benchmarks across multimodal coding, agentic tasks, and pure text coding dimensions, GLM-5V-Turbo has achieved leading performance with a smaller size.

GLM-5V-Turbo has demonstrated leading performance in benchmarks like design draft restoration, visual code generation, multimodal retrieval and Q&A, and visual inspection; it also shows outstanding results in AndroidWorld and WebVoyager, which measure real GUI environment control capabilities. In terms of pure text coding ability, GLM-5V-Turbo maintains stable performance in the three core benchmark tests of Backend, Frontend, and Repo Exploration in CC-Bench-V2, indicating thatafter introducing visual capabilities, pure text programming and reasoning abilities remain at the same level.。

After integrating GLM-5V-Turbo into lobster agents such as AutoClaw,the lobster gained true visual capabilities, enabling it to understand information on the screen.The model achieved excellent results in PinchBench, ClawEval, and ZClawBench, which measure the task execution quality of lobster agents, validating its comprehensive capabilities in complex task scenarios.
During the beta testing phase, major internet companies such as ByteDance, Meituan, and Kuaishoupartners highly praised GLM-5V-Turbo.:
"GLM-5V-Turbo has achieved a complete transformation from design drafts to code. As a visual understanding model, it can well meet the front-end development scenarios for developers." — TRAE Model Evaluation Team
"The introduction of native multimodal capabilities has not weakened its programming logic; its programming ability still ranks among the top tier domestically. It has enhanced work experiences in areas such as D2C and image processing under the AI at Work domain." — A team from Meituan
"It has given Agent 'eyes,' while demonstrating superior capabilities compared to similar multimodal models in the programming field, making it more competitive in visual programming scenarios." — Kuaishou Wanqing Model Evaluation Team
GLM-5V-Turbo's leading performance is due to itssystematic upgrades across four levels: model architecture, training methods, data construction, and toolchain:
– Native multimodal fusion: GLM-5V-Turbo starts deep integration of text and visual capabilities from the pre-training stage and achieves multimodal collaborative optimization in the post-training phase. We have developed a new generation of CogViT visual encoder, which achieves optimal performance in general object recognition, fine-grained understanding, geometry, and spatial awareness. Additionally, we designed the MTP structure that is compatible with multimodal inputs and inference-friendly, achieving higher inference efficiency in multimodal scenarios.
– Reinforcement learning with over 30 tasks: During the reinforcement learning phase, over 30 task types are optimized simultaneously, covering subfields such as STEM, grounding, video, and GUI Agent. The model shows robust improvements in perception, reasoning, Agentic execution, and human sensory experience, effectively alleviating instability issues associated with single-domain training through collaborative reinforcement learning.
– Agentic data and task construction: In response to the industry challenges of scarce Agent data and difficult validation, we have constructed a multi-level system ranging from element perception to sequence-level action prediction. This is based on generating large-scale controllable and verifiable training data in synthetic environments. From the pre-training stage, we infuse Agentic meta-capabilities (such as incorporating GUI Agent PRM data into pre-training to reduce hallucination). At the same time, we explore asymmetric optimization, leveraging multimodal evaluation tasks to enhance stronger Agent capabilities.
– Multimodal Toolchain Expansion: Building on text-based tools, GLM-5V-Turbo now supports additional multimodal tools such as multimodal search, bounding box drawing, screenshot capturing, and webpage reading, extending the perception-action chain of programming and task execution from pure text to visual interaction. The synergy with frameworks like Claude Code and AutoClaw has been further enhanced, supporting a complete closed loop of 'understanding the environment → planning actions → executing tasks'.
Typical Scenario Showcase
1. Image as Code
GLM-5V-Turbo excels particularly in core visual programming scenarios.
– Frontend Replication: By sending sketches, design drafts, screenshots, or screen recordings of reference websites, the model can directly understand layouts, color schemes, component hierarchies, and interaction logic, generating fully operational frontend projects that accurately reproduce layout, color schemes, animations, and other visual details.
– GUI Autonomous Exploration and Replication: Combined with frameworks like Claude Code, GLM-5V-Turbo leverages its robust GUI Agent capabilities to autonomously explore target websites, browse page structures, organize navigation relationships between pages, collect visual materials and interaction details, and finally generate code based on the recorded exploration results to replicate the entire site, achieving an advancement from 'image replication' to 'GUI exploration replication'.
– Interactive editing: Supports adding or removing page modules as needed, modifying copy and styles, adjusting layout structures, and supplementing interactive functions such as button feedback, popup switching, and form interactivity to achieve visual iterative editing.
2. Give the lobster eyes
The mission boundaries of the lobster have been significantly expanded, for example, it can browse webpages and documents, generate reports and presentations with graphics, and even query and interpret complex charts like candlestick patterns.
AutoClaw has launched the 'Stock Analyst' Skill, utilizing GLM-5V-Turbo’s native visual capabilities so that the lobster can directly understand candlestick trends, valuation range charts, and brokerage research report graphs. It enables parallel data collection from four sources within 60 seconds and outputsinterwoven text and imagesresearch reports. Immediately switch to GLM-5V-Turbo in AutoClawand try asking, 'Help me analyze today’s XXX stock price and generate a professional analysis report.'
In addition to visual programming and lobster tasks, GLM-5V-Turbo has achieved significant performance improvements in a broader range of Agentic scenarios such as multimodal search, in-depth research, GUI Agent, and perceptual grounding. To this end, we provide a set ofOfficial Skillsthat cover native capabilities such as image captioning, visual grounding, document-based writing, resume screening, and prompt generation, as well as text recognition, table recognition, handwriting recognition, formula recognition, and text-to-image capabilities built on GLM-OCR and GLM-Image, helping users unlock the model's multimodal potential across more scenarios. The aforementioned Skills are now available on ClawHub, where all capabilities can be experienced with one-click installation.
Try it now
We welcome all users to access GLM-5V-Turbo through the following methods:
1. Product experience
2. Official API Access
– Coding PlanNow open for application to Coding Plan users; GLM Coding Plan will also include GLM-5V-Turbo in the future, so stay tuned. Application questionnaire: https://zhipu-ai.feishu.cn/share/base/form/shrcndgpmRlJoD5rMmIavUrPwzg
Risk Disclaimer: The above content only represents the author's view. It does not represent any position or investment advice of Futu. Futu makes no representation or warranty.Read more
Comments
to post a comment
3
4
