From 'Benchmarking' to 'Deployment': AI Inference Set for Explosive Growth in 2026

Source: Era Finance, Author: Yao Tingting

As artificial intelligence becomes the core battleground of global technological competition, the strategic value of AI chips as the cornerstone of computing power is becoming increasingly prominent.

Over the past two years, discussions in the industry about AI chips often focused on metrics such as training clusters, peak computing power, interconnection bandwidth, and '10,000-card scale.' However, entering 2025, a proposition closer to real business is becoming increasingly clear: when large models move out of the lab and towards scaled application, the key determinant of commercial success or failure is often no longer 'how fast training runs,' but rather 'how efficiently inference operates, how long it remains stable, and whether it can deliver consistent performance.'

As a result, the focus of competition has shifted from single hardware metrics to 'software-hardware synergy, engineering, and ecosystem delivery.' Research and adaptation investments have significantly increased, making capital an important driving force for industrial acceleration.

Driven by favorable industrial policies, capital support, and the explosion of domestic open-source large models, domestic AI chip companies are accelerating their breakout. A group of domestic chip companies, including the 'Four Little Dragons of Domestic GPUs,' are entering the capital market in quick succession to fund technological iteration. Already listed companies are also actively expanding their financing channels. For instance, Yuntian Lifly (Intellifusion, 688343.SH) officially launched its Hong Kong IPO process in the middle of last year, potentially becoming one of the few 'dual-capital platform (A+H)' companies in the domestic AI chip sector; Baidu (BIDU.O, 09888.HK) also announced in January this year the spin-off of its subsidiary Kunlun Chip for a Hong Kong listing.

The influx of capital, breakthroughs in differentiated technical routes, and structural shifts in market demand have outlined a new development landscape for the domestic AI chip industry. Entering 2026, a China-based AI computing power ecosystem centered on independent innovation and oriented towards global competition is beginning to take shape.

A Pivotal Year

Looking back at the journey of domestic AI chips, the first wave of pioneers appeared as early as ten years ago.

At that time, AI was still in the intelligent perception stage represented by visual recognition, with mainstream models being relatively small-scale networks like CNN. The industry focused more on getting algorithms to work in real-world scenarios, and there were hardly any dedicated inference chips domestically.

'We were among the earliest Chinese ventures in artificial intelligence inference chips. When we returned to China to start our business in 2014, AI was still in the intelligent perception stage of visual recognition, and there were hardly any dedicated inference chips domestically,' recalled Chen Ning, Chairman of Yuntian Lifly.

At that time, research and engineering work on Neural Processing Units (NPUs) began to take root domestically. The core proposition of this phase was 'from 0 to 1': models were relatively well defined, operators were relatively concentrated, and although inference workloads already had energy-efficiency and cost requirements, these had not yet become an industry-wide consensus.

As technology evolved and market demand shifted after 2020, the Transformer architecture rose to prominence, AIGC became the industry mainstream, and large models began to rise rapidly. The 'hundred-model battle' first drove explosive demand for training computing power, and for a time the industry concentrated on the race for large-scale clusters and extreme performance.

However, as large models move out of laboratories into various industries, more and more businesses are placing the continuous operational costs and real-time response speed during the inference phase at a more central position. Inference has thus gradually shifted from being a 'subsidiary step' of training to becoming a key factor determining the success or failure of AI commercialization, leading to a trend where the industry focus migrates from training towards inference.

These industry developments coincide with the 'concentrated emergence' of domestic AI chip companies around 2025.

Cambricon experienced significant stock price volatility in the second half of 2025, with its intraday price surpassing that of Kweichow Maotai at one point. In mid-2025, Intellifusion announced its intention to list in Hong Kong, aiming for a dual A+H listing. By the end of 2025, Moore Threads had gone public on the STAR Market as the 'first domestic GPU stock,' followed by Maxa Technologies, also on the STAR Market. The trend continued into 2026: Biren Technology went public in Hong Kong on January 2, closely followed by Tianshu Zhixin on January 8. In addition, Enflame Tech completed its pre-IPO tutoring, while Baidu's Kunlun Chip submitted its listing application to the Hong Kong Stock Exchange.

From an industrial perspective, 2025 became a concentrated breakout point due to the combined effect of multiple forces.

On one hand, as large models moved out of labs and into various industries, the continuous operational costs and real-time response demands during the inference stage were significantly amplified, driving the demand for computing power to extend from 'peak training' to 'long-term inference.'

On the other hand, factors such as policy dividends, capital push, and the vibrant ecosystem of domestic open-source large models have accelerated companies' financing and product development pace.

At the same time, supply uncertainties brought about by changes in the external environment have prompted the industry chain to evaluate diversified computing solutions more actively. With these factors reinforcing one another, corporate moves clustered in the same period.

The landscape has changed.

The biggest variable in 2025 occurred at the beginning of the year. The rapid rise of domestic open-source large models represented by DeepSeek, coupled with features like 'open-source usability and low-threshold invocation,' significantly lowered the barrier to using AI technology, shifting industry discussions from 'whether the model can be made' to 'whether the model can be scaled up for use.'

As large models move out of labs and into various industries, the ongoing operational costs and real-time response speed during the inference stage have become critical in determining the success or failure of AI commercialization, creating a clear distinction from the pursuit of peak computing power during the training phase.

Global giants have keenly sensed this shift. At the end of last year, NVIDIA reached a 'non-exclusive technology licensing' agreement with Groq and brought in Groq’s core executives and part of its engineering team to join NVIDIA. This move has been interpreted by some industry insiders as being 'tantamount to an acquisition'.

Groq's LPU is a chip architecture designed for large model inference, focusing on low latency and deterministic execution. For NVIDIA, this move aims to quickly address its AI inference shortcomings.

Google, for its part, continues to expand its TPU deployment, strengthening its energy-efficiency advantages in inference scenarios through architectural optimization. Also targeting inference workloads, Microsoft launched the Azure Maia accelerator, and Amazon continues to iterate on Inferentia and its other self-developed AI inference chips. The parallel advancement of cloud providers' self-developed routes and GPU routes further solidifies the industry consensus that 'inference has become the new battleground'.

In China, AI inference has also brought about a huge market.

According to the CIC Report, China's AI inference chip-related products and services industry is in a rapid growth phase, with the market size increasing from 11.3 billion yuan in 2020 to 162.6 billion yuan in 2024, at a compound annual growth rate (CAGR) of 94.9%. It is expected to grow at a CAGR of 53.4% from 2024 to 2029, reaching 1,383 billion yuan by 2029.
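The growth figures above can be sanity-checked with the standard compound annual growth rate formula. The following is a minimal sketch, not part of the CIC Report: the function names `cagr` and `project` are illustrative, and the inputs are the rounded figures quoted above, so small rounding differences from the reported values are to be expected.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


def project(start: float, rate: float, years: int) -> float:
    """Value after compounding `start` at `rate` for `years` years."""
    return start * (1 + rate) ** years


# Figures quoted in the article, in billions of yuan.
historical_rate = cagr(11.3, 162.6, 2024 - 2020)
projected_2029 = project(162.6, 0.534, 2029 - 2024)

print(f"2020-2024 CAGR: {historical_rate:.1%}")
print(f"Projected 2029 market size: {projected_2029:,.0f} billion yuan")
```

Run on the rounded inputs, this gives a 2020-2024 rate just under 95% and a 2029 figure of roughly 1.38 trillion yuan, both in line with the reported numbers once rounding is allowed for.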

On the critical track of inference chips, a group of local companies is accelerating its breakout.

For example, Huawei's Ascend series chips adopt Application-Specific Integrated Circuit (ASIC) design, based on the self-developed Da Vinci architecture, optimized specifically for efficient execution of AI neural network computing tasks.

Cambricon has launched the Siyuan 590 chip, a domestically produced AI chip built on a 7-nanometer process with an inference computing power of 512 TOPS, compatible with almost all mainstream domestic large models. Meanwhile, Tianshu Zhixin focuses on the 'integrated training and inference' general-purpose GPU route; its publicly available information shows that it has released general-purpose GPU products for inference, emphasizing support for mainstream deep learning frameworks and multi-precision inference computing.

Yuntian Lifly proposes a new 'GPNPU' architecture, emphasizing design tailored for AI inference, and combines system-level methods such as packaging and storage to alleviate bandwidth bottlenecks, attempting to carve out a differentiated technical path.

Overall, as inference becomes the main battlefield of the industry, the focus of competition is expanding from single-point computing power metrics to comprehensive capabilities including 'software-hardware synergy, cost structure, delivery, and operations'. Domestic manufacturers are thus entering a new competitive window, one that demands engineering delivery.

Future Breakthrough

Entering 2026, rising inference demand is more clearly intertwined with two industrial threads: one is the evolution of application forms from 'dialogue' to 'action,' and the other is the shift of inference systems from 'homogeneous stacking' to 'engineering decomposition.'

At CES 2026, NVIDIA CEO Jensen Huang repeatedly emphasized the arrival of 'agentic AI' and 'Physical AI,' describing it as the 'ChatGPT moment for Physical AI.' The core implication is that AI will extend further from content generation into understanding, planning, and execution, applied in scenarios closer to the real world such as robotics, autonomous driving, and industrial systems.

Meanwhile, AMD CEO Lisa Su also used the term 'yottaflops' at CES 2026 to describe the magnitude of AI computing power demand in the coming years, signaling that as more complex applications come online, the bottleneck of computing power will shift from 'whether it can be trained' to 'whether sustainable inference is possible.'

The 'world model/spatial intelligence' approach represented by World Labs, founded by Fei-Fei Li, pushes inference workloads further from 2D content generation towards 'building interactive 3D worlds,' implying longer chains, stronger real-time capabilities, and higher-frequency online inference calls becoming the norm.

These trends together point to the same conclusion: the total volume of inference will increase, and the requirements for latency, stability, and cost structures will rise simultaneously.

In China, opportunities on the inference side are similarly more certain. On one hand, China's AI inference chip sector has not yet formed a 'one dominant player' landscape, and different technical routes are still advancing in parallel, providing space for latecomers to differentiate and validate engineering approaches.

On the other hand, China is a major application market, and its policies set clear diffusion goals for the large-scale adoption of AI. The AI+ action plan released by the State Council mentions that by 2030, the penetration rate of 'new-generation smart terminals and AI agents' will exceed 90%, meaning that inference demand will come not only from leading large models but also from the long-tail diffusion of numerous industry applications and end products.

Against this backdrop, the key to 'future breakthroughs' is no longer a single-point performance narrative but whether software-hardware synergy, ecosystem adaptation, delivery operations, and cost structure can form a replicable engineering system in the era of inference. As inference demand continues to grow and becomes increasingly segmented, the market will leave room for more pragmatic approaches.

In the inference market, different chip companies have made various plans.

Chen Ning stated that over the next one to two years, the company's focus will be on getting the GPNPU architecture and its chip series through market validation and on empowering more AI-native hardware. Huawei previously planned to launch several Ascend chips over the next three years, including the 950PR, 950DT, Ascend 960, and Ascend 970, with the 950PR mainly targeting the Prefill stage of inference and recommendation business scenarios. For inference scenarios, Tianshu Zhixin launched the 'Zhi Kai' series of general-purpose GPUs optimized for AI inference. Its prospectus states that it plans to continue iterating its product lines for both training and inference scenarios in the future...

The competition among domestic AI chips has moved from an arms race focused on single-point computing power to a systematic contest revolving around inference efficiency, engineering delivery, and ecosystem collaboration. The competition has entered a more challenging second half, and the contest for 2026 has already begun.
