
On February 12, Zhipu released its new flagship foundation model, GLM-5, which moves from "Vibe Coding" (writing code and front-end tasks) to "Agentic Engineering" (handling engineering tasks and large-scale projects).
GLM-5 adopts a Mixture of Experts (MoE) architecture with 744 billion total parameters, of which 40 billion are activated, and achieves state-of-the-art open-source performance in coding and agent capabilities:
– In the globally authoritative Artificial Analysis rankings, GLM-5 ranks fourth worldwide and first among open-source models;
– It achieves the highest scores among open-source models on key coding and agent evaluations such as SWE-bench-Verified and BrowseComp;
– Real-world coding experience approaches that of Claude Opus 4.5.
GLM-5 is the first to implement W4A8 mixed-precision quantization on Ascend. It can be deployed on a single Atlas 800T A3 machine, with out-of-the-box performance comparable to dual H100 setups, and it reduces deployment costs by 50% in long-sequence, low-latency scenarios.
The core work includes the following:
– W4A8 Quantization: Applying W4A8 quantization to the model weight files significantly reduces device memory usage and speeds up execution during the Decode phase;
– High-performance fused operators: Fused operators such as Lightning Indexer, Sparse Flash Attention, and MLAPO accelerate end-to-end model inference;
– Inference engines: The vLLM-Ascend and SGLang inference engines further improve the model's inference performance.
W4A8 Quantization
1. The MsModelSlim quantization tool is easy to extend, which simplifies end-to-end quantization.
a) Differentiating quantization bits and algorithms by module: For instance, use W8A8 for the Attention and MLP components and W4A8 for the MoE experts; sensitive layers such as gates can be rolled back to higher precision as needed to avoid excessive precision loss.
b) Subgraph-level switches: enable_subgraph_type controls fusion and smoothing for the OV, norm-linear, and up-down subgraphs, which makes it easier for the inference framework to use fused operators to improve performance.
c) One-click quantization: The complete GLM-5 quantization pipeline is supported, including preprocessing, subgraph fusion, and hierarchical linear quantization. After installation, run the following command to complete the quantization:
msmodelslim quant --model_path ${model_path} --save_path ${save_path} --model_type GLM-5 --quant_type w4a8 --trust_remote_code True
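The per-module differentiation described in a) can be pictured as a lookup from module name to quantization type. The following is only a minimal sketch: the pattern table, the module names (Hugging Face style), and the function `quant_type_for` are hypothetical and are not the MsModelSlim configuration format.

```python
import fnmatch

# Hypothetical policy table; the first matching pattern wins.
# Module names are illustrative, not taken from the GLM-5 checkpoint.
POLICY = [
    ("*.mlp.gate", "float"),        # sensitive router gate rolled back to high precision
    ("*.mlp.experts.*", "w4a8"),    # MoE experts
    ("*.self_attn.*", "w8a8"),      # Attention
    ("*.mlp.*", "w8a8"),            # dense MLP
]

def quant_type_for(module_name, policy=POLICY, default="float"):
    """Return the quantization type assigned to a module name."""
    for pattern, qtype in policy:
        if fnmatch.fnmatchcase(module_name, pattern):
            return qtype
    return default

print(quant_type_for("model.layers.3.mlp.experts.17.down_proj"))
print(quant_type_for("model.layers.3.mlp.gate"))
```

Placing the gate rule first lets a sensitive layer override the broader MLP rule without changing the rest of the table.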
2. MsModelSlim provides a rich set of quantization strategies for quick accuracy alignment.
a) QuaRot rotation algorithm: Applies a Hadamard rotation to the weights and fuses it with LayerNorm to reduce activation outliers and improve the numerical distribution for subsequent quantization.
b) Multiple outlier suppression algorithms: A hybrid strategy uses the Flex_AWQ_SSZ and Flex_Smooth_Quant algorithms, calibrates weights with SSZ (Smooth Scale Zero), and supports hyperparameters such as scaling factors, preserving both precision and stability at low bit-widths.
c) Linear layer quantization strategy: W8A8 or W4A8 is applied to individual Linear layers. Activation quantization typically uses per-token granularity with the min-max algorithm, and weight quantization uses per-channel granularity. The msModelSlim tool offers configurable quantization strategies, so different quantization granularities and algorithms can be configured per module.
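To make the granularity in c) concrete, here is a minimal pure-Python sketch of symmetric min-max quantization with one scale per activation token (8 bits) and one scale per weight output channel (4 bits). It is an illustration under simplifying assumptions (symmetric ranges, no calibration, no outlier suppression), and the function names are hypothetical rather than part of msModelSlim.

```python
def quantize_rows(matrix, bits):
    """Symmetric min-max quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / qmax if absmax > 0 else 1.0
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def w4a8_matmul(activations, weights):
    """activations: [tokens][in]; weights: [out][in]; returns [tokens][out]."""
    qa, sa = quantize_rows(activations, bits=8)  # per-token activation scales
    qw, sw = quantize_rows(weights, bits=4)      # per-output-channel weight scales
    out = []
    for t, a_row in enumerate(qa):
        out.append([
            # integer dot product, then one rescale per (token, channel) pair
            sum(a * w for a, w in zip(a_row, w_row)) * sa[t] * sw[c]
            for c, w_row in enumerate(qw)
        ])
    return out
```

Because each token and each output channel carries its own scale, the integer dot product needs only a single multiplication by `sa[t] * sw[c]` to return to floating point.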
High-performance fused operators
1. Lightning Indexer fused kernel
In long-sequence scenarios, the TopK operation can become a bottleneck. We have introduced the Lightning Indexer fused operator, which combines operations such as Score Batchmatmul, ReLU, ReduceSum, and TopK, so that their computation time overlaps with other operations and the pipeline yields a net speedup.
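The operator chain named above (score matmul, ReLU, ReduceSum, TopK) can be written as an unfused reference for a single query token. This is a sketch, not the Ascend kernel: the per-head weights are an assumption (suggested by the indexer weight projection mentioned in the SGLang section), and the function name and shapes are hypothetical.

```python
import heapq

def lightning_indexer(q_heads, head_weights, keys, top_k):
    """q_heads: [H][D] for one query token; head_weights: [H]; keys: [S][D]."""
    scores = []
    for key in keys:
        total = 0.0
        for q, w in zip(q_heads, head_weights):
            dot = sum(a * b for a, b in zip(q, key))  # score matmul
            total += w * max(dot, 0.0)                # ReLU, then weighted ReduceSum over heads
        scores.append(total)
    # TopK: indices of the highest-scoring cached tokens
    top = heapq.nlargest(top_k, range(len(keys)), key=lambda i: scores[i])
    return sorted(top), scores

indices, scores = lightning_indexer(
    q_heads=[[1.0, 0.0], [0.0, 1.0]],
    head_weights=[1.0, 0.5],
    keys=[[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [3.0, 3.0]],
    top_k=2,
)
print(indices, scores)
```

A fused kernel would compute the same result without writing the intermediate scores back to memory between steps.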
2. Sparse Flash Attention fused kernel
We have introduced Sparse Flash Attention, which selects the TopK-related tokens from the complete KVCache and performs sparse Flash Attention over them. While the discretely located entries are gathered from memory, computation overlaps with other operations, which yields pipelined parallel speedups.
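As a reference for what the sparse step computes, the sketch below gathers the selected positions from a KV cache and runs ordinary softmax attention over only those entries. It is a single-head, single-query illustration; the function name and the scaling by the square root of the head dimension are assumptions, and none of the tiling that makes Flash Attention fast is modelled.

```python
import math

def sparse_attention(query, k_cache, v_cache, selected):
    """Attend only over the cache positions listed in `selected`."""
    scale = 1.0 / math.sqrt(len(query))
    keys = [k_cache[i] for i in selected]      # discrete gather from the KV cache
    values = [v_cache[i] for i in selected]
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)                         # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dim)]
```

The cost of this step scales with the number of selected tokens rather than with the full sequence length, which is why the TopK selection matters for long sequences.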
3. MLAPO fused kernel
In the preprocessing stage of Sparse Flash Attention in GLM-5, the query and KV undergo dimensionality reduction, and the activations of the reduced query are passed to the Indexer module for sparse selection. MLAPO uses VV fusion (fusion of multiple Vector operators) to merge 13 small preprocessing operators into a single large operator. Inside the MLAPO operator, performance is further improved through parallel processing and pipelining across the Vector and Cube computing units.
vLLM inference engine
1. Prefix Cache
In the vLLM framework, cache structure optimizations and idle double-ended queues extend the storage space for the KV Cache from limited HBM memory to larger system memory (e.g., DDR) or shared storage. This significantly reduces wasted computation and end-to-end latency, with the largest gains in GLM-5's long-sequence scenarios.
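The core lookup behind a prefix cache can be sketched with hash-chained blocks: each full block of tokens is hashed together with its parent block's hash, so a hit on block N implies the whole prefix up to N matches. This is a simplified illustration, not the vLLM-Ascend implementation; the block size, class name, and hashing scheme are assumptions, and eviction and offloading to DDR are left out.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of the block before it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = set()  # hashes of blocks whose KV entries are cached

    def insert(self, token_ids):
        self.blocks.update(block_hashes(token_ids))

    def match(self, token_ids):
        """Number of leading tokens whose KV blocks are already cached."""
        hit = 0
        for digest in block_hashes(token_ids):
            if digest not in self.blocks:
                break
            hit += 1
        return hit * BLOCK_SIZE
```

Only the unmatched tail of a new request needs a prefill pass, which is where the latency saving in long-sequence scenarios comes from.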
2. Asynchronous scheduling
During the Decode phase of inference, significant scheduling bubbles often occur between two Decode steps because of synchronization between the CPU and the NPU. For example, at the end of the current Decode step, the output of the sample operation is copied from the NPU to the CPU (a D2H operation) and then returned. The next Decode step cannot begin until the sample operation has completed, which leaves the NPU idle. vLLM-Ascend minimizes this synchronization overhead with asynchronous scheduling, which overlaps the model execution of the current Decode step with the preparation of the next one. By running operations such as prepare_input and update_states for the next Decode step ahead of time, it hides the D2H operation behind the model execution of the current Decode step and so minimizes scheduling bubbles between Decode steps.
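A small timeline model shows where the saving comes from. The sketch below compares a fully serialized Decode loop with one in which CPU-side preparation and the D2H copy overlap with NPU model execution. It is an idealized model under stated assumptions (fixed per-step costs, one CPU worker, next-step inputs preparable before the previous tokens reach the CPU); the function names and timings are hypothetical.

```python
def sync_decode_time(steps, prepare, model, d2h):
    """Every step runs prepare -> model -> D2H copy back-to-back."""
    return steps * (prepare + model + d2h)

def async_decode_time(steps, prepare, model, d2h):
    """CPU work for neighbouring steps overlaps with NPU model execution."""
    cpu_free = 0            # time at which the CPU is next available
    npu_free = 0            # time at which the NPU is next available
    prev_model_done = None
    for _ in range(steps):
        # prepare_input / update_states for this step run on the CPU,
        # possibly while the NPU still executes the previous step.
        cpu_free += prepare
        model_start = max(npu_free, cpu_free)
        npu_free = model_start + model
        # The previous step's sampled tokens are copied back (D2H)
        # while the NPU is busy with the current step.
        if prev_model_done is not None:
            cpu_free = max(cpu_free, prev_model_done) + d2h
        prev_model_done = npu_free
    if prev_model_done is None:
        return 0
    # The last step's D2H copy has nothing left to hide behind.
    return max(cpu_free, prev_model_done) + d2h

print(sync_decode_time(4, prepare=1, model=5, d2h=1))
print(async_decode_time(4, prepare=1, model=5, d2h=1))
```

In this model the asynchronous total approaches `steps * model` when the CPU work per step is shorter than the model execution, which is the case the bubble-hiding argument relies on.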
3. Local TP Parallel Split
The deployment uses Attention DP + MoE EP. Because the O_proj and LM_Head weights have a large memory footprint and become a clear memory-access bottleneck during the Decode phase, local TP parallelism is applied to them. To reduce device memory usage, the Embedding layer also uses TP splitting. To minimize the communication overhead introduced by TP parallelism, the TP domain is confined to the high-speed interconnected HCCS domain.
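The effect of TP splitting on a single linear layer can be illustrated with a row-parallel matrix-vector product: each rank holds a slice of the weight along the input dimension, computes a partial output, and the partial outputs are summed. This is a sketch only; the source does not say along which dimension O_proj or LM_Head is split, so the row-parallel choice and the function names are assumptions.

```python
def matvec(weight, x):
    """weight: [out][in], x: [in] -> [out]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def row_parallel_matvec(weight, x, tp_size):
    """Split the input dimension across tp_size ranks and sum the partial outputs."""
    in_dim = len(x)
    assert in_dim % tp_size == 0, "input dimension must divide evenly"
    shard = in_dim // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w_shard = [row[lo:hi] for row in weight]  # each rank stores 1/tp_size of the weight
        partials.append(matvec(w_shard, x[lo:hi]))
    # Summing the partial outputs stands in for the cross-rank reduction.
    return [sum(vals) for vals in zip(*partials)]
```

Each rank reads only its own shard of the weight per Decode step, which is how the split relieves the memory-access bottleneck at the cost of one reduction inside the TP domain.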
4. FlashComm
By decomposing the AllReduce communication process into ReduceScatter and AllGather, and deeply integrating and optimizing these with subsequent computational modules, both the amount of communication data and intermediate operator calculations are reduced. This significantly decreases communication latency and enhances the inference performance of large models.
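The decomposition itself is easy to check on plain lists: a ReduceScatter followed by an AllGather produces the same result as an AllReduce. The sketch below simulates the ranks in one process; it illustrates the identity only and does not model the fused computation that FlashComm places between the two phases.

```python
def all_reduce(rank_buffers):
    """Every rank ends up with the element-wise sum of all buffers."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]

def reduce_scatter(rank_buffers):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    n = len(rank_buffers)
    chunk = len(rank_buffers[0]) // n
    return [
        [sum(buf[i] for buf in rank_buffers) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(n)
    ]

def all_gather(rank_chunks):
    """Every rank ends up with the concatenation of all chunks."""
    full = [v for chunk in rank_chunks for v in chunk]
    return [list(full) for _ in rank_chunks]

buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_gather(reduce_scatter(buffers)) == all_reduce(buffers))
```

Between the two phases each rank holds only 1/n of the reduced data, so any element-wise work scheduled there runs on a smaller tensor.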
SGLang Inference Engine
1. Multi-stream Parallel Architecture
The Sparse Attention Indexer section adopts a multi-stream parallel strategy. The main stream is responsible for the calculation and management of Key vectors, including Key projection, RoPE positional encoding, read/write operations of the KV cache, and the final invocation of the sparse indexer. The auxiliary stream focuses on asynchronous computation of Query vectors, performing Query projection and RoPE positional encoding in parallel. Synchronization with the main stream is achieved through an event mechanism, effectively hiding the computation latency of the Query path. The weight stream independently computes the weight projection of the indexer, further enhancing hardware utilization efficiency through parallel computation.
2. MTP
Multi-Token Prediction (MTP) addresses the limitation that a traditional model produces one token per step, with each step depending on the previous step's output; generating multiple tokens in a single inference pass substantially reduces the time required for sequence generation. In long-sequence inference, it also improves NPU parallel efficiency by raising computational density, which improves the utilization of compute resources.
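One common way to use multi-token drafts at inference time is to verify them against the main model and keep the longest agreeing prefix. The source does not describe the verification scheme, so the sketch below is an assumption (greedy, exact-match acceptance) with a hypothetical function name; it shows only the acceptance rule, not the model calls.

```python
def accept_draft(draft_tokens, verified_tokens):
    """Keep drafted tokens until the first disagreement with the main model.

    verified_tokens holds the main model's own choice at each drafted
    position, plus one extra token for the position after the draft.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            accepted.append(verified)  # replace the first mismatch and stop
            return accepted
        accepted.append(drafted)
    # All drafts matched, so the extra verified token is also usable.
    accepted.append(verified_tokens[len(draft_tokens)])
    return accepted

print(accept_draft([5, 7, 9], [5, 7, 2, 4]))
print(accept_draft([5, 7, 9], [5, 7, 9, 4]))
```

Every pass yields at least one token and up to one more than the draft length, so the output matches plain greedy decoding while needing fewer sequential steps.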
3. Two-Batch-Overlap (TBO)
TBO splits requests into smaller batches, alternately performing attention computation and dispatch/merge operations, thereby increasing overall throughput without causing a surge in peak memory. Additionally, computational tasks are submitted to the NPU before communication blocking occurs, ensuring that the NPU's computing units remain active during the communication process.
4. RadixCache
Prefix Sharing enables efficient reuse of KV caches, while RadixCache uses a tree structure to store and match prefixes across requests, allowing multiple requests sharing input sequences to reuse previously computed KV cache entries. This reduces redundant computations and improves NPU memory utilization. Performance improvements are more pronounced in long-sequence request scenarios.
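The tree-based matching can be sketched with a plain token trie. This is a simplification: a real radix tree compresses runs of tokens into single edges and attaches KV cache handles and eviction metadata to nodes, none of which is modelled here, and the class and method names are hypothetical rather than SGLang's API.

```python
class PrefixTree:
    """Token-level trie; each node is a dict mapping token id -> child node."""

    def __init__(self):
        self.root = {}

    def insert(self, token_ids):
        """Record a processed sequence so later requests can match against it."""
        node = self.root
        for tok in token_ids:
            node = node.setdefault(tok, {})

    def match_prefix(self, token_ids):
        """Length of the longest prefix of token_ids already stored in the tree."""
        node, matched = self.root, 0
        for tok in token_ids:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

tree = PrefixTree()
tree.insert([1, 2, 3, 4])
tree.insert([1, 2, 5])
print(tree.match_prefix([1, 2, 3, 9]))
```

Two requests that diverge after a shared prefix branch at the divergence point, so the shared part is stored once and reused by both.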
When millions of real requests arrived suddenly after GLM-5 went online, it was these domestic chip clusters that absorbed the spike in compute demand and completed emergency capacity expansion. In the future, Zhipu AI and Huawei will continue to deepen their cooperation, focusing on joint efforts in model training, inference optimization, and industrial implementation, collectively driving the coordinated evolution of domestic large models and domestic computing power, and building an independent and controllable full-stack technology ecosystem for China's AI industry.
