Pony AI

Pony AI World Model Evolution: From Virtual Driving Schools Teaching AI to Drive, to Self-Evolving Physical AI Engines

01 Autonomous driving is much harder than playing Go

Exactly ten years ago in March 2016, AlphaGo, which used reinforcement learning through self-play, defeated top professional Go player Lee Sedol in a five-game match with a score of 4:1, becoming the first Go AI to defeat a professional 9-dan player without handicap. This marked a milestone in the artificial intelligence industry. AlphaGo successfully demonstrated the potential of AI, sparking an explosion in the AI industry. Many tech giants shifted their strategies and began investing heavily in AI. Many AI companies, including Pony AI, were also founded in 2016.

At that time, some in the industry optimistically believed that by using manually labeled data, AI could possess human-like perception capabilities, thereby quickly achieving human-level driving ability and realizing Level 4 autonomous driving. However, driving is far more complex than recognizing cats in photos:

On one hand, a 99% success rate in image recognition is good enough for commercial use, but a 1% error in L4 autonomous driving scenarios means running red lights, collisions, violations, and accidents—completely unacceptable. Especially since human drivers making mistakes isn't news, but AI drivers making mistakes definitely is. The public's expectations for AI drivers are significantly higher than for human drivers.

On the other hand, driving is a scenario with strong interaction among surrounding traffic participants, not just about following specific rules—even if perception results are absolutely accurate, the final driving decisions and behaviors may not necessarily be safe or smooth enough.

Therefore, until 2019, no company in the industry had truly achieved fully driverless operations on open urban roads with a fleet of vehicles at scale. Why emphasize 'at scale'? Because scale represents statistically high enough safety. A few vehicles operating without safety drivers can rely on probability and luck, but only when a large-scale fleet operates safely without frequent accidents can the overall system safety and statistical reliability be proven.
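The link between fleet scale and statistical evidence can be made concrete with a standard textbook result. The sketch below is an illustration only, not Pony AI's safety methodology: it assumes incidents follow a Poisson process with a constant rate, and the function name and mileage figures are hypothetical. Under that assumption, n incident-free kilometers bound the incident rate at -ln(0.05)/n with 95% confidence, roughly 3/n (the "rule of three").

```python
import math

def incident_rate_upper_bound(km_without_incident: float, confidence: float = 0.95) -> float:
    """Upper confidence bound on incidents per km after zero observed incidents.

    Assumes a Poisson process with constant rate, so
    P(zero incidents in n km) = exp(-rate * n).
    Solving exp(-rate * n) = 1 - confidence gives the largest rate
    that is still consistent with a clean record.
    """
    if km_without_incident <= 0:
        raise ValueError("mileage must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / km_without_incident

# Hypothetical mileages: a handful of cars versus a large fleet.
for km in (10_000, 1_000_000, 100_000_000):
    bound = incident_rate_upper_bound(km)
    print(f"{km:>11,} clean km -> rate below {bound * 1e6:.3f} per million km (95%)")
```

The bound shrinks in direct proportion to mileage, which is why a few driverless vehicles cannot demonstrate what a large fleet can.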

The divergence of two approaches: imitation learning vs reinforcement learning

At this point, the industry began to see clearly different technical development paths toward achieving true autonomous driving:

Some companies emphasized collecting more human driving data to improve model performance, using 'shadow mode' to gather massive amounts of human driving data, especially data where human behavior differed from AI. This approach resembles the scaling law of large language models—achieving breakthroughs through sheer volume—using more driving data to cover long-tail scenarios and waiting for the 'aha moment.'

Pony AI chose a different path because its technical team had already realized by then that driving differently from humans doesn’t mean driving incorrectly, and driving 'very human-like' with minor differences could still be catastrophically wrong. The goal of L4 autonomous driving shouldn’t be to compare decision-making and behavior with humans; the goal should simply be to 'drive well'—specifically, achieving statistically high enough safety, comfort, and traffic efficiency.

Moreover, since L4 autonomous driving cannot rely on human intervention, it is fundamentally different from L2/L2++ assisted driving. Even if 99.99% of scenarios are driven better than humans, the remaining 0.01% being dangerous is still unacceptable. For L4 autonomous driving, guaranteeing the lower bound of the model’s performance is as important as pushing its upper bound. This is completely different from large language models, where occasional 'hallucinations' are tolerated, and also distinct from L2 assisted driving, where responsibility always lies with the driver.

Once the learning objective shifts from 'driving like humans' to 'driving well,' this signifies a paradigm shift—from imitation learning to reinforcement learning. AlphaGo achieved reinforcement learning by self-playing on the board, focusing on 'winning' as the learning goal rather than 'playing like humans.'

Starting in 2020, Pony AI spent several years gradually refining a system that allows AI to enhance driving capabilities through reinforcement learning, enabling AI to repeatedly drive and train vehicle-side models in a 'virtual driving school.' This is what we now refer to as the 'PonyWorld world model.'

02 What is the World Model? How to improve its accuracy?

It is not a realistic game engine, but an entire system

The two technological approaches developed in parallel for years. However, by 2024-2025, leading companies such as Waymo and Pony AI had successively launched large-scale commercial operations of driverless Robotaxi fleets in multiple cities. The industry gradually realized that simply increasing human driving data cannot indefinitely enhance the capabilities of autonomous driving models. L2-level assisted driving cannot continuously improve safety to become L4-level autonomous driving merely by collecting human driving data. An increasing number of companies (including assisted driving algorithm firms and automakers) began shifting their technical routes to reinforcement learning and the World Model approach. By 2026, achieving autonomous driving that meets L4 standards through reinforcement learning and the World Model (simulation training environments) has become an industry consensus in both China and the US, with Pony AI undoubtedly taking the lead.

However, many companies in the industry, as well as the public, simplistically view the World Model as a simulation environment capable of generating virtual data, as if a sufficiently realistic game engine could teach AI how to drive. In contrast, Pony AI’s World Model has never been a single module but rather a comprehensive system spanning the cloud and the vehicle, under construction since 2020 and deployed progressively. Each layer is already operational in real mass-production systems:

It must define what “driving well” means, which is the reward function for reinforcement learning — this cannot be defined by simple rules; it also needs to be trained via neural networks.

The modeling of the physical world must be sufficiently accurate, including precisely reflecting the kinematic model of the self-driving vehicle and the kinematic models of surrounding traffic participants.

Most importantly, autonomous driving involves strong interaction. The World Model must not only generate data for corner cases but also enable traffic participants in long-tail scenarios and all virtual scenarios to interact with the AI-driven vehicle in ways consistent with human behavioral distributions. For example, when the AI-driven vehicle suddenly changes lanes while there is a car in the adjacent lane, the behavior of the car in the adjacent lane will be influenced by the AI's actions, with a certain probability of slowing down to yield and another probability of accelerating to compete, leaving no room for the AI to change lanes. These varying probabilities of behavior should be reflected in the scenarios generated by the World Model.
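The lane-change example above can be sketched as sampling from a reaction distribution conditioned on the AI vehicle's action. This is a minimal illustration under loud assumptions: the action names, reaction labels, and probabilities are invented for the sketch, whereas a real world model would learn such distributions from data rather than read them from a lookup table.

```python
import random

# Hypothetical reaction probabilities for the car in the adjacent lane,
# conditioned on how the AI vehicle changes lanes. Illustrative numbers only.
REACTION_PROBS = {
    "gentle_cut_in": {"yield": 0.7, "hold_speed": 0.2, "accelerate": 0.1},
    "abrupt_cut_in": {"yield": 0.4, "hold_speed": 0.2, "accelerate": 0.4},
}

def sample_reaction(ego_action: str, rng: random.Random) -> str:
    """Draw one reaction of the neighboring car given the ego vehicle's action."""
    probs = REACTION_PROBS[ego_action]
    reactions = list(probs)
    weights = [probs[r] for r in reactions]
    return rng.choices(reactions, weights=weights, k=1)[0]

def empirical_distribution(ego_action: str, n: int, seed: int = 0) -> dict:
    """Simulate n encounters and return the observed reaction frequencies."""
    rng = random.Random(seed)
    counts = {r: 0 for r in REACTION_PROBS[ego_action]}
    for _ in range(n):
        counts[sample_reaction(ego_action, rng)] += 1
    return {r: c / n for r, c in counts.items()}

if __name__ == "__main__":
    for action in REACTION_PROBS:
        print(action, empirical_distribution(action, n=10_000))
```

Because the reaction is sampled rather than fixed, the same AI maneuver sometimes succeeds and sometimes gets blocked, which is the behavior the text says generated scenarios should reflect.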

Accuracy is everything: Whether the World Model is good or not depends on whether the AI becomes increasingly incorrect as it learns.

Only when the World Model achieves these three points (of course, each is challenging) can it adequately allow the AI driver to obtain positive training results in this environment. Otherwise, the AI model's driving ability may be 'self-indulging' in unrealistic scenarios, becoming increasingly incorrect, and even inferior to imitation learning that introduces vast amounts of human driving data. This capacity of the World Model to 'simulate the world' is what we call 'accuracy.' After the initial version of the World Model went live and its trained in-vehicle models were deployed, as the accuracy of the World Model improved, the performance of the continuously trained reinforcement learning in-vehicle models would also increase. The process of enhancing Pony AI’s autonomous driving capabilities essentially became one of improving the accuracy of the World Model. Over the past few years, we have made efforts in several aspects to improve accuracy.
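One way to make 'accuracy' measurable, offered here as an assumption rather than as Pony AI's actual metric, is to compare how often real road users react in each way with how often the simulated ones do. The sketch below uses total variation distance between two reaction distributions; the reaction labels and all frequencies are hypothetical.

```python
def total_variation_distance(real: dict, simulated: dict) -> float:
    """Half the L1 distance between two discrete distributions.

    0.0 means the simulated reactions match reality exactly;
    1.0 means the two distributions do not overlap at all.
    """
    outcomes = set(real) | set(simulated)
    return 0.5 * sum(abs(real.get(o, 0.0) - simulated.get(o, 0.0)) for o in outcomes)

# Hypothetical observed frequencies from road data for one scenario type.
real_world = {"yield": 0.50, "hold_speed": 0.20, "accelerate": 0.30}

# Two hypothetical world-model versions simulating the same scenario type.
world_model_v1 = {"yield": 0.80, "hold_speed": 0.10, "accelerate": 0.10}
world_model_v2 = {"yield": 0.55, "hold_speed": 0.20, "accelerate": 0.25}

for name, sim in (("v1", world_model_v1), ("v2", world_model_v2)):
    print(f"{name}: distance to reality = {total_variation_distance(real_world, sim):.2f}")
```

In this toy case the first version overestimates how often other drivers yield, so a policy trained against it would learn to cut in more aggressively than real traffic tolerates.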

Collecting 'high-end chess matches' to enhance the accuracy of the World Model

As previously mentioned, the world model needs to simulate how other road users respond to the AI, which poses an interesting challenge. Even when the AI improves its driving through reinforcement learning rather than imitation learning, its simulation environment still has to mimic how humans (or other AIs) interact with and make decisions around an AI driver. The world model must therefore simulate not only human-to-human interactions but also human-to-AI interactions, and this capability becomes even more critical when the AI driver's behavior is not entirely "human-like."

How will humans react to AI drivers with specific capabilities? This behavior cannot be imagined out of thin air; only by putting AI drivers on the road can we know. Thus, the most crucial step in improving and aligning the accuracy of the world model is real-world road testing of AI drivers: collecting not ordinary human driving data but AI driver data. Once AI capabilities, particularly in safety, surpass those of humans, only AI driver data can be used to enhance the world model, because other road users react to AI drivers differently than they react to human drivers. A world model trained solely on human driving data will always lack this crucial data and the accuracy that depends on it.

Looking at historical data from Pony AI, the fastest improvement in safety didn't occur before driverless road testing began but after a certain scale of fully autonomous test vehicles hit the roads. By that point, the AI drivers had already surpassed human-level performance, and the collected data could better improve the accuracy of the world model, further enhancing the onboard model's capabilities.

The data flywheel of the world model: High-precision models and high-precision data mutually reinforce each other.

At this stage, a deeper structural barrier emerges. Once AI driving capabilities surpass those of ordinary human drivers, human driving data can no longer effectively improve the precision of world models. It’s like having a Go grandmaster repeatedly study amateur players' game records - it won’t make them stronger. AI has now reached a level of ten dan or higher, and for it to continue improving, it needs to face entirely new scenarios beyond its existing experience.

For autonomous driving world models, the only source of these 'ten-dan-level new games' is the data generated by L4 fully autonomous fleets during real-world commercial operations. The unique value of this data lies in the fact that it comes from AI independently driving in real traffic environments. AI encounters situations that human drivers would never face because human drivers have different reaction patterns, and surrounding traffic participants interact with them differently. The traffic interaction patterns triggered by autonomous vehicles are themselves one-of-a-kind. Only companies operating large-scale L4 autonomous fleets in the real world can continuously produce this high-value data.

This creates a self-reinforcing flywheel:

Large-scale L4 autonomous fleet operations → Generate high-value real-world data → Improve the accuracy of the world model → Continuously enhance the onboard model → Support larger-scale L4 deployments → Generate more high-precision data → ……

Once this flywheel gets going, the data it produces is exclusive, its evolutionary path is self-directed, and its efficiency increases with scale.

Companies without the ability to operate large-scale L4 fully autonomous fleets cannot start this flywheel. The gap cannot be closed by spending more money on GPUs, hiring more annotators, or training on L2 data for more rounds.

This is a structural moat.

Intention: Adding an 'intention layer' to the in-car model

There was once a relatively popular technical approach in the industry that attempted to insert a language model between perception and action — allowing AI to first describe in words the scene it sees, such as 'a tricycle is crossing at the intersection ahead, I need to slow down,' and then generate driving actions based on this textual description — known as VLA.

But this goes against the first principles of driving. A truly seasoned driver does not silently recite lines during emergency maneuvers. The core of human driving lies in immediate spatial awareness and subconscious muscle memory. Language, on the other hand, is a low-dimensional representation that lossily compresses complex four-dimensional physical spacetime. Using subject-predicate-object sentences to describe millisecond-level dynamic interactions among vehicles, pedestrians, and lane markings is not only sluggish but also loses a significant amount of information.

Pony AI has chosen a more direct path: sensor data directly maps to driving actions without going through a language layer. By skipping this unnecessary intermediary, it not only significantly reduces computational power consumption but also allows the system to allocate every saved computing resource to what truly matters — understanding the physical world, simulating future scenarios, and making decisions. In Pony AI's seventh-generation Robotaxi, the entire onboard computing platform operates at 1016 TOPS, with the main system powered by three NVIDIA DRIVE Orin-X chips and the redundant system by one DRIVE Orin-X chip. The redundant system can independently complete driving tasks and continue normal operation even if the main system fails, eventually pulling over safely at an appropriate location.

Without this 'middleman,' collecting physical data and improving the physical accuracy of the world model becomes more direct and efficient. Many believe that the choice of in-car architecture, VLA or otherwise, is independent of using a world model as the training environment. That is only partially correct: when the in-car model is efficient enough, training and iteration also become significantly more efficient.

To achieve better iterations, Pony AI introduced an Intention semantic layer during the training process of its in-car model.

Initially, the inputs to the in-car model were sensor data, and the outputs were driving actions (steering wheel angle, throttle, brake). It could drive well, but its decision-making process was unreadable by humans.

In later versions, while making each driving action, the model internally generates structured intent expressions. Translated into human-understandable language, it would be 'I choose to slow down and wait before the intersection because there is a pedestrian approaching the crosswalk from the right front, and I predict that he is likely to cross.' These intent messages are not 'explained' post hoc by another model, nor are they inserted as a separate language model during inference — which would turn into a 'language middleman.' They are jointly learned alongside driving actions during the training phase. Intention, as a structured representation within the model, ensures that what the model 'thinks' and 'does' are aligned from the start of training.
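The idea of learning intent jointly with actions can be sketched as a single training objective with two terms. The text does not describe how Pony AI actually supervises the Intention representation, so everything here is an assumption made for illustration: the intent label set, the supervised-style losses (mean squared error for actions, cross-entropy for intent), and the `intent_weight` parameter.

```python
import math

# Hypothetical intent vocabulary; the real structured representation is not public.
INTENTS = ["keep_lane", "yield_to_pedestrian", "change_lane"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_loss(predicted, target):
    """Mean squared error over (steering, throttle, brake)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def intent_loss(intent_logits, true_intent):
    """Cross-entropy of the intent head against the labeled intent."""
    probs = softmax(intent_logits)
    return -math.log(probs[INTENTS.index(true_intent)])

def joint_loss(pred_action, target_action, intent_logits, true_intent, intent_weight=0.5):
    """One objective for both heads, so intent and action are optimized together."""
    return (action_loss(pred_action, target_action)
            + intent_weight * intent_loss(intent_logits, true_intent))

if __name__ == "__main__":
    loss = joint_loss(
        pred_action=(0.02, 0.10, 0.30),     # steering, throttle, brake
        target_action=(0.00, 0.00, 0.40),
        intent_logits=[0.2, 2.5, -1.0],     # the model leans toward yielding
        true_intent="yield_to_pedestrian",
    )
    print(f"joint loss: {loss:.4f}")
```

Because both terms feed one gradient, the shared layers cannot drift toward an action that contradicts the stated intent without paying for it, which is the contrast with a post hoc explainer model.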

The triple value brought by explainability:

First, it is auditable. When a driving behavior needs to be analyzed retrospectively, whether for regulatory review, accident investigation, or internal quality review, engineers no longer need to stare at a neural network of astronomical dimensionality and guess 'what it was thinking.' The Intention layer provides a human-readable decision summary.

Second, it is debuggable. When the model makes a mistake in a certain scenario, the engineering team can directly examine its intention representation: Was the obstacle not identified at the perception level? Or was it identified, but there was a deviation in risk assessment during the intention generation phase? Or was the intention correct, but the final action execution had an issue? The precision of fault localization improves from 'something went wrong somewhere' to 'exactly which layer and why something went wrong.'

Third, it supports iteration. This point is crucial because it relates directly to the evolutionary flywheel discussed later: when the system can clearly express its intentions, it also has the foundational ability for self-diagnosis. 'My intention generation in these scenarios is always inaccurate': this self-awareness is precisely the starting point for the world model's self-evolution.

03 World Model 2.0: A self-iterative, scenario-unlimited physical AI engine

The previous discussion was about why Pony AI's world model is necessary and how it works. Now comes the more fundamental question: Why does it keep getting stronger? What is its ceiling?

As noted earlier, the process of enhancing Pony AI's autonomous driving capability essentially becomes the process of improving the accuracy of the world model. We continuously collect L4 autonomous driving data to improve that accuracy. However, once the Robotaxi fleet reaches a sufficient scale and the world model is accurate enough, most Robotaxi data contributes very little to further accuracy; it only adds unnecessary storage costs and increases the burden of filtering data for training the world model. More importantly, when AI driving capability far exceeds that of humans, guidance provided by humans could be wrong.

Self-diagnosis: AI knows where it falls short.

What World Model 2.0 changes is precisely this logic.

Combined with the previously mentioned intention layer, when the in-vehicle model can clearly articulate 'why I made this decision,' an extremely important capability is unlocked: self-diagnosis.

The system can automatically, and at large scale, review every decision made by the in-vehicle model, or even every step of the process used to train it, comparing deviations between its expressed intention and the actual outcome:

In which scenarios was the model’s intention correct, but execution deviated – requiring further training in the world model.

In which scenarios was the model’s intention itself wrong – requiring further training in the world model.

In which scenarios was the model’s intention incorrect due to inconsistencies between real interactions and simulated scenarios in reinforcement learning – indicating an issue with the precision of the world model.

These diagnostic results are fed directly back into the world model. The first two types can improve the iterative efficiency of the world model’s training of the in-vehicle model – focusing on practicing difficult problems while skipping the 'easy ones.' Extracting the third type of diagnostic result represents the most significant leap in capability for version 2.0: improving the precision of the world model's scenarios is no longer a broad-based effort, but rather targeted.
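The three diagnostic categories can be sketched as a simple triage over logged decisions. This is a sketch under stated assumptions: the record fields, the enum names, and the idea that each flag is already available per decision are hypothetical, and the text does not say how the real system determines them.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Diagnosis(Enum):
    OK = "no issue found"
    EXECUTION_DEVIATION = "intent correct, execution deviated: train further in the world model"
    WRONG_INTENT = "intent itself wrong: train further in the world model"
    WORLD_MODEL_GAP = "intent wrong because simulation differs from reality: fix world model accuracy"

@dataclass(frozen=True)
class DecisionRecord:
    """One logged decision; all three flags are hypothetical inputs."""
    intent_was_correct: bool
    execution_matched_intent: bool
    sim_matched_reality: bool

def diagnose(record: DecisionRecord) -> Diagnosis:
    """Map one decision onto the three categories from the text (plus OK)."""
    if not record.intent_was_correct:
        # A wrong intent traced to a simulation/reality mismatch points at the
        # world model itself, not at the in-vehicle model.
        if not record.sim_matched_reality:
            return Diagnosis.WORLD_MODEL_GAP
        return Diagnosis.WRONG_INTENT
    if not record.execution_matched_intent:
        return Diagnosis.EXECUTION_DEVIATION
    return Diagnosis.OK

def triage(records) -> Counter:
    """Count how many logged decisions fall into each category."""
    return Counter(diagnose(r) for r in records)

if __name__ == "__main__":
    log = [
        DecisionRecord(True, True, True),
        DecisionRecord(True, False, True),
        DecisionRecord(False, True, True),
        DecisionRecord(False, True, False),
    ]
    for diagnosis, count in triage(log).items():
        print(f"{count} x {diagnosis.name}")
```

Only the last category changes what data needs to be collected; the first two change what the in-vehicle model practices.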

Targeted data collection: Engineers become AI data collectors.

World Model 2.0 not only enhances the performance of in-vehicle models more efficiently, but also automatically improves the accuracy of the world model itself: allowing AI to tell humans what data should be collected. When the system detects through self-diagnosis that the world model is not stable enough in certain real-world scenarios — for example, at certain intersections in a city during the evening backlight hours, the confidence level of the model in generating simulated data for specific types of obstacles drops — it will automatically generate a targeted data collection task and send it to the testing operations team:

"In the next week, between 4:30 PM and 5:30 PM, please focus on collecting driving data under backlight conditions at the following three intersections. Prioritize mixed traffic scenarios involving non-motorized vehicles and pedestrians."

After receiving this instruction, the test engineers assign the task to the test vehicles for data collection. The real-world data collected is then transmitted back to the cloud, where the world model calibrates its scenario generation model accordingly and generates a set of more realistic data to fine-tune the in-vehicle model specifically. Humans are no longer the teachers of AI but rather its data collectors. R&D personnel, test engineers, and the operations team: the entire organization starts operating around the "accuracy requirements" of World Model 2.0. Wherever it identifies weakness, humans step in to fill the data gaps. If it determines that more real-world samples are needed for a particular scenario, humans drive out to collect data from those scenarios.
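The task-generation step described above can be sketched as a filter over per-scenario confidence readings. The sketch is hypothetical throughout: the data classes, field names, the 0.8 threshold, and the three-task cap are invented for illustration and do not describe Pony AI's internal tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfidenceReading:
    """World model's confidence in its simulated data for one scenario slice."""
    location: str
    time_window: str
    condition: str
    confidence: float  # 0.0 (no confidence) to 1.0 (full confidence)

@dataclass(frozen=True)
class CollectionTask:
    """A request sent to the testing operations team."""
    location: str
    time_window: str
    condition: str
    confidence_gap: float  # how far below the threshold the model fell

def plan_collection(readings, threshold=0.8, max_tasks=3):
    """Turn the weakest scenario slices into data collection tasks, weakest first."""
    weak = sorted(
        (r for r in readings if r.confidence < threshold),
        key=lambda r: r.confidence,
    )
    return [
        CollectionTask(r.location, r.time_window, r.condition,
                       round(threshold - r.confidence, 3))
        for r in weak[:max_tasks]
    ]

if __name__ == "__main__":
    readings = [
        ConfidenceReading("Intersection A", "16:30-17:30", "backlight", 0.55),
        ConfidenceReading("Intersection B", "12:00-13:00", "clear", 0.95),
        ConfidenceReading("Intersection C", "16:30-17:30", "backlight", 0.70),
    ]
    for task in plan_collection(readings):
        print(f"Collect at {task.location}, {task.time_window}, {task.condition} "
              f"(gap {task.confidence_gap})")
```

Sorting by confidence means the operations team's limited vehicle hours go first to the slices where the simulation is least trustworthy.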

"The R&D staff are working for World Model 2.0." — This is not a joke but represents an entirely new paradigm of research and development.

When you ask the world model what simulation capabilities are still missing

Pony AI's tens of millions of kilometers of autonomous driving data, especially millions of kilometers of fully driverless data, continue to refine the world model. This includes data from Robotaxis operating in urban areas, on highways, and in enclosed parks and parking lots, as well as data from Robotrucks in scenarios such as main roads and ports. Even so, the AI clearly recognizes that its dataset is limited to 'structured road driving' scenarios.

If you ask it where further improvements can be made and what data is needed to enhance the accuracy of physical simulations, besides identifying the need to collect driving data for a specific new scenario in a newly deployed country or city, it might also respond that there is a lack of data regarding sidewalks, non-motorized vehicle lanes, overpasses, and even indoor scenes. As an autonomous driving world model, it indeed lacks indoor data — but who says the PonyWorld world model can only be used for autonomous driving?

A self-evolving world model with highly efficient precision enhancement capabilities can meet the requirements of physical AI beyond autonomous driving — covering scenarios with complexity levels many orders of magnitude higher than structured road driving.

No matter how much data or computing power is available, it will never be enough. Efficiency will remain a critical factor for the continued iteration of AI in the future. Whether it's improving autonomous driving capabilities that already surpass human safety levels or addressing scenarios far more complex than driving, such as general-purpose physical AI and embodied intelligence, directed evolution of world models will be an essential capability. Only world models capable of directed and autonomous evolution can support training environments for higher-dimensional and more complex physical AI, enabling AI to achieve capabilities far beyond human performance across tasks beyond just driving.

As world models enter the 2.0 era, PonyWorld will not only focus on optimizing autonomous driving scenarios but also explore possibilities for other physical AI applications and use cases.
