From 3-minute Vibe Coding to 30-minute Agentic Engineering, and now the 8-hour Long Horizon Task model we’ve introduced, GLM-5.1 has once again achieved a breakthrough.
GLM-5.1 is our smartest flagship model to date and currently the strongest open-source model globally. GLM-5.1 significantly enhances coding capabilities, with particularly notable improvements in handling long-duration tasks. Unlike previous models that interacted in minute-long sessions, GLM-5.1 can independently and continuously work on a single task for over 8 hours, autonomously planning, executing, self-evolving, and ultimately delivering complete, engineering-grade results.
The improvement in coding ability is key to further enhancing the intelligence of the model. The chart below shows the average results from three of the most representative code evaluation benchmarks in the industry: SWE-Bench Pro, which measures professional-level software development work by models; Terminal-Bench 2.0, which evaluates problem-solving through command-line operations like an engineer; and NL2Repo, which builds complete code repositories from scratch. By composite average score across these three benchmarks, GLM-5.1 ranks third globally among all models, first among domestically produced models, and first among open-source models.

In the SWE-Bench Pro benchmark, which most closely simulates real-world software development, GLM-5.1 sets a new global best, outperforming GPT-5.4 and Claude Opus 4.6. SWE-Bench Pro requires models to locate and fix high-complexity engineering bugs in real GitHub repositories, making it the hardest metric for assessing whether a model can handle professional software development.

The 8 hours you spend sleeping are the 8 hours the model spends working.
Over the past two years, the industry has measured model intelligence using benchmarks. We believe the next phase of measurement should focus on 'how long it can work': a Long-Horizon Task measures how long a model can independently work to complete human tasks.
This presents a deeper challenge for the model. To maintain stable output during long-horizon tasks, the model must handle not just more lines of code but a series of complex engineering decision points: proactively running benchmarks, identifying bottlenecks, revising plans, and rerunning tests. The model needs to function like a real engineer, forming a complete closed loop of 'experiment → analysis → optimization' rather than stopping after writing one version of the code to wait for evaluation.
Under the same evaluation criteria on the METR leaderboard, GLM-5.1 is the only open-source model capable of sustaining 8 hours of continuous work, and globally it is one of the few models with this ability, alongside Claude Opus 4.6. Our ultimate goal is the fully autonomous agent, where the model operates non-stop 24/7, breaking down objectives, executing deliverables, self-evaluating and correcting, and evolving autonomously without any human intervention.
Let's take a look at what the model’s 8-hour work can achieve.
Scenario One: Building a Linux desktop from scratch in 8 hours
The architecture blueprint was drawn during the day and handed over to GLM-5.1 before bedtime; by morning, it had produced a complete system. Over exactly 8 hours and more than 1,200 steps, the first meaningful outcome emerged at the 20-minute mark, and by the end of the 8 hours a fully functional Linux desktop system had been generated, including a complete desktop, window manager, status bar, applications, VPN manager, Chinese font support, a game library, and more, accompanied by a 4.8MB supporting file. This is equivalent to a week's workload for a team of four developers.
The following video shows the code commits made by GLM-5.1 over the course of 8 hours: These are not small patches of four or five lines; each commit represents a substantial system-level evolution, with no human involvement in testing or reviewing the code throughout the process. The model even wrote some regression tests for its own code and passed them.
Scenario Two: 655 iterations break through the bottleneck of vector database optimization
Vector databases serve as the core engine behind AI search and recommendation systems, with approximate nearest neighbor retrieval being a crucial component that also heavily tests algorithmic and engineering capabilities. This process requires the model to master underlying algorithmic knowledge such as IVF, HNSW, and vector quantization while also possessing practical engineering judgment, enabling it to actively identify bottlenecks and switch strategies when one optimization path hits a wall, rather than blindly repeating the same direction.
GLM-5.1 does not merely fine-tune parameters; it independently completed the entire optimization chain: switching from full scans to IVF bucketed recall, introducing half-precision compression, adding coarse quantization ranking, implementing two-tier routing, and performing early pruning. Over 655 iterations, it continuously ran benchmarks, identified bottlenecks, and adjusted its approach autonomously, ultimately raising the vector database's query throughput from 3,108 QPS in the initial delivery to 21,472 QPS, about 6.9 times that of the initial official version.
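To make the first links of that chain concrete, here is a minimal sketch of IVF bucketed recall with half-precision coarse ranking followed by full-precision re-ranking. It is an illustration under stated assumptions, not the code GLM-5.1 produced: the function names (`nearest`, `build_ivf`, `ivf_search`), the simple k-means, and the parameters `nprobe` and `rerank` are all hypothetical choices, and two-tier routing and early pruning are not shown.

```python
import numpy as np


def nearest(points, centroids):
    """Index of the closest centroid for every point (squared L2 distance)."""
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def build_ivf(vectors, n_lists, n_iters=10, seed=0):
    """Cluster vectors into n_lists buckets with a few k-means iterations."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), n_lists, replace=False)
    centroids = vectors[start].copy()
    for _ in range(n_iters):
        assign = nearest(vectors, centroids)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    assign = nearest(vectors, centroids)
    buckets = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, buckets


def ivf_search(query, vectors, vectors_fp16, centroids, buckets,
               k=5, nprobe=4, rerank=4):
    """Approximate top-k search; returns indices into `vectors`."""
    # 1. Bucketed recall: scan only the nprobe buckets closest to the query.
    d_cent = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(d_cent)[:nprobe]
    cand = np.concatenate([buckets[c] for c in probe])
    # 2. Coarse ranking on the half-precision copy of the candidates.
    diff = vectors_fp16[cand] - query.astype(np.float16)
    d_coarse = (diff.astype(np.float32) ** 2).sum(axis=1)
    shortlist = cand[np.argsort(d_coarse)[: k * rerank]]
    # 3. Exact re-ranking of the shortlist in full precision.
    d_exact = ((vectors[shortlist] - query) ** 2).sum(axis=1)
    return shortlist[np.argsort(d_exact)[:k]]
```

In this sketch, raising `nprobe` trades throughput for recall: with `nprobe` equal to the number of buckets the search degenerates to a full scan with a half-precision prefilter.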

Scenario Three: 1,000 rounds of tool invocation optimize real machine learning model workloads
GLM-5.1's demonstrated ability for prolonged work and self-evolution has transformed it from a mere 'code generator' into an 'active system optimizer.' On KernelBench Level 3—an optimization benchmark covering 50 real machine learning computational workloads—we allowed GLM-5.1 to continuously optimize each workload independently. Over more than 24 hours of uninterrupted iterations, GLM-5.1 autonomously completed multiple rounds of compile-test-analyze-rewrite cycles, eventually achieving a geometric mean speedup of 3.6x, significantly higher than the 1.49x achieved under PyTorch’s max-autotune mode.
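Speedups are ratios, which is why the figure above is reported as a geometric mean rather than an arithmetic one. The sketch below shows the calculation; the function name is an assumption, and any input values used with it here are made up for illustration, not KernelBench results.

```python
import math


def geometric_mean_speedup(speedups):
    """Geometric mean of per-workload speedup ratios (each must be > 0)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive ratios")
    # Averaging in log space is equivalent to taking the n-th root of the product.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

For instance, hypothetical speedups of 0.5x and 2.0x on two workloads give a geometric mean of 1.0 (no net gain), whereas an arithmetic mean would report 1.25.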
The depth and creativity of GLM-5.1’s optimizations deserve particular attention. GLM-5.1 can autonomously write custom Triton Kernels and CUDA Kernels, utilize cuBLASLt epilogue fusion, and implement shared memory tiling and CUDA Graph optimizations. These optimization strategies span the full technical stack, from high-level operator fusion to microarchitecture-level tuning, with every step being an autonomous decision made by the model.
These results indicate that in the field of GPU kernel optimization—traditionally highly reliant on expert experience—AI models have already demonstrated end-to-end autonomous working capabilities from problem analysis, solution design, to iterative optimization. In GPU and broader high-performance computing fields, long-standing optimization bottlenecks that have constrained engineering efficiency are gradually being overcome by AI.

Behind the 8 Hours
Running the model for eight hours is not difficult; what's truly challenging is ensuring that the work remains effective even in the eighth hour.
Previously, models including GLM-5 often hit a bottleneck when facing complex optimization tasks. After achieving quick gains early on, they would repeatedly attempt known optimization methods but fail to switch strategies when one approach proved ineffective.
The training goal of GLM-5.1 is to break through this bottleneck. In vector database optimization tasks, we observed a typical 'step-like' optimization trajectory: the model incrementally fine-tunes within a fixed strategy, and when returns plateau, it actively analyzes benchmark logs, identifies current bottlenecks, and then transitions to structurally different solutions: from full scans to IVF bucketing, from single precision to quantized coarse ranking, from single-layer routing to two-level pruning. Each jump is accompanied by a brief decline in recall because the model temporarily breaks constraints while exploring new directions, then adjusts them back. This 'break-fix' cycle itself is a hallmark of effective optimization.
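The step-like trajectory can be pictured as a loop that tunes within one strategy until gains flatten, then jumps to the next, while rejecting results that fall below a recall constraint. The sketch below is a toy model of that behavior, not a description of how GLM-5.1 works internally: the plateau rule, the `recall_floor` threshold, and the two example strategies with their numbers are all hypothetical.

```python
def has_plateaued(history, window=2, min_gain=0.01):
    """True if the last `window` results beat the earlier best by < min_gain."""
    if len(history) <= window:
        return False
    before = max(history[:-window])
    recent = max(history[-window:])
    return recent < before * (1 + min_gain)


def optimize(strategies, max_steps=50, recall_floor=0.95, window=2, min_gain=0.01):
    """Tune each (name, tune_fn) strategy in turn, switching on a plateau.

    tune_fn(i) returns (qps, recall) for the i-th tuning step of that strategy.
    """
    log, best = [], 0.0
    idx, local_step, history = 0, 0, []
    for step in range(max_steps):
        if idx >= len(strategies):
            break  # no structurally different strategy left to try
        name, tune = strategies[idx]
        qps, recall = tune(local_step)
        local_step += 1
        accepted = recall >= recall_floor  # the "fix" half of break-fix
        log.append((step, name, qps, recall, accepted))
        if accepted:
            history.append(qps)
            best = max(best, qps)
        if has_plateaued(history, window, min_gain):
            idx, local_step, history = idx + 1, 0, []
    return best, log


# Hypothetical strategies with made-up numbers, for illustration only.
def full_scan(i):
    return min(3000 + 100 * i, 3200), 1.0


def ivf(i):
    # The first attempt breaks the recall constraint; later ones restore it.
    return min(8000 + 500 * i, 9000), (0.90 if i == 0 else 0.96)
```

Calling `optimize([("full_scan", full_scan), ("ivf", ivf)])` returns the best accepted throughput together with a per-step log, in which the rejected first IVF attempt appears as the brief recall dip that accompanies each jump.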
On KernelBench, by comparing the optimization curves of multiple models, we directly observed this difference. GLM-5 rose quickly in the early stages but plateaued early; GLM-5.1 continued to rise longer within the same time window, eventually reaching 1.4 times the performance of GLM-5. The key lies in how far the model can extend the window of 'effective optimization.'
In the Linux desktop construction task, the challenge is different. The first two scenarios had clear numerical metrics (QPS, speedup ratio) to measure the effectiveness of each step, but building a complete desktop system has no single metric; what constitutes 'good' depends on the comprehensive judgment of functionality completeness, visual consistency, and interaction quality. This requires the model to have preliminary self-assessment capabilities: after each round of execution, review its output, determine where improvements are needed, and continue optimizing. This is the scenario with the weakest feedback signal among the three, and also the direction most in need of breakthroughs at present.
We believe that extending the 'effective working duration' of models is a fundamental dimension in enhancing agent capabilities. There are still significant technical challenges along this path: overcoming context anxiety when dealing with complex tasks, maintaining execution consistency after thousands of tool invocations, exiting local optima earlier, and more importantly, establishing a reliable self-assessment mechanism for tasks without clear numerical metrics. GLM-5.1 represents a step forward in this direction, and we will continue to push ahead.
GLM-5.1 is not just a stronger model but the beginning of a new technical paradigm. At this moment, give it an instruction, then leave it for eight hours.
Open Source and Usage
Starting today, GLM-5.1 is open-sourced simultaneously on Hugging Face and ModelScope. The model weights are released under the MIT License.
GLM-5.1 is included in the GLM Coding Plan (Max/Pro/Lite) and supports mainstream development tools such as Claude Code and OpenCode.
1. Official API access
- BigModel Open Platform: https://docs.bigmodel.cn/cn/guide/models/text/glm-5.1
2. Product Experience
- GLM-5.1 is coming soon to Z.ai: https://chat.z.ai
3. Open source links
- GitHub: https://github.com/zai-org/GLM-5
- Hugging Face: https://huggingface.co/zai-org/GLM-5.1
- ModelScope: https://modelscope.cn/models/ZhipuAI/GLM-5.1
