
On the day before the May Day holiday, DeepSeek suddenly released a technical report on multimodal vision.
Before I opened it, I had some expectations, mainly about how far and how clearly the model could see.
After all, over the past year, most multimodal models have been heading in this direction. OpenAI talked about 'thinking with images,' allowing models to crop, zoom, and rotate images during the reasoning process; Gemini and Claude have also been working on enabling models to handle higher-resolution and more complex visual inputs.
The common assumption has been that as long as the model sees more details, its visual reasoning will naturally improve.
However, after reading through DeepSeek's report, you'll realize they've gone down a completely different path.
Instead of focusing on 'letting the model see more pixels,' they concentrated on a more fundamental issue.
Even if the model can see clearly, how can you ensure that the model is referring to the same thing as you during the reasoning process?
In fact, this is the most easily overlooked fatal flaw in multimodal reasoning.
When humans look at images, they can use their fingers to point at objects. For instance, 'This person is so-and-so,' or 'That person is so-and-so.' But how does the model know which one you're referring to?
The model can only use language to say things like 'the one on the left,' 'the one above,' or 'this line.' Once the image gets complicated, linguistic references start to drift, and reasoning falls apart.
So DeepSeek said, why not just give the model a 'finger'?
It turns points and bounding boxes into the basic units of the model's thinking, allowing the model to point at objects with this cybernetic finger while performing reasoning.
01
From continuous vision to discrete symbols
In this technical report, DeepSeek raises an interesting question. They argue that the real challenge for multimodal models isn't seeing the image but maintaining stable reference to the same visual object during continuous reasoning.
For example, if you tell your friend, 'At the market, the vegetables sold at Grandma Zhang’s stall are the freshest.' But there are so many elderly people at the market, which one is Grandma Zhang?
But if you directly point and say, 'It’s that one,' your friend will immediately understand.
DeepSeek refers to this issue as the 'Reference Gap.'
Over the past year, almost all cutting-edge multimodal models have been working to address the 'Perception Gap' issue.
Imagine a photo placed in front of you. If the photo is too blurry or has very low resolution, you may not be able to see small text or distant details clearly. The same applies to AI. If the input image quality is insufficient or the processing method is incorrect, it will 'fail to see clearly,' which is what we call the perception gap.
Models like GPT, Claude, and Gemini have been improving resolution by introducing high-resolution cropping, dynamic tiling, and multi-scale processing, all aimed at enabling the model to perceive more details.
While this direction is certainly valuable, DeepSeek's report points out that even if the model sees more clearly, it can still experience logical breakdowns when handling complex spatial reasoning tasks.
The problem lies in natural language itself.
If there are a dozen dogs in a photo and you say 'the dog on the left,' the model won't be able to understand exactly which one you mean.
Even worse, if you ask the model to count the number of dogs in a photo, it can easily lose track of which ones it has already counted and which ones remain uncounted during the reasoning process.
The report also mentions extreme cases like maze navigation, where pure language cannot accurately describe irregular paths or complex topological relationships.
As a referential tool, language is inherently ambiguous in continuous visual space. It excels at abstract concepts and causal relationships but faces fundamental limitations in expressing spatial positioning and topological relations.
However, since DeepSeek itself is a general-purpose language model, how should it address this issue?
This brings us to the 'finger' mentioned at the beginning of the article.
The core concept they proposed is 'Visual Primitives.' Specifically, they elevate bounding boxes and points—the two most fundamental spatial markers in computer vision—to become 'the smallest units of thought.'
Previous multimodal models could also draw bounding boxes to annotate objects, but only showed you the final result as proof that 'I found it.' It's like taking an exam where you only submit the answer without showing your work.
Some research allowed AI to draw boxes during the reasoning process, but only for the purpose of 'seeing more accurately.' The boxes were merely an auxiliary tool. It's akin to using scratch paper when solving math problems; the scratch paper helps you calculate more clearly, but it’s not part of the problem-solving thought process.
DeepSeek takes a completely different approach.
They embed these spatial markers directly into the model's reasoning process, making them an integral part of its reasoning. When thinking, the model doesn’t just describe in words, 'I see a dog,' but simultaneously outputs, 'I see a dog, and it is here: [[x1,y1,x2,y2]].'
This mechanism is referred to by DeepSeek as 'point while it reasons.'

Every step of the model's reasoning is anchored to specific coordinates on the image.
The technical report provides an example: starting from the initial point, the model explores, backtracks, tries again, and finally outputs a complete path of coordinates, each corresponding to a point traveled in the maze.
In this way, the model won't get 'lost' during the reasoning process. It won’t lose track of what it is referring to or describing. Each visual object has a clear spatial anchor, making the reasoning process traceable and verifiable.
This technical approach forms an interesting contrast with the direction taken by OpenAI.
In the official introduction of o3 and o4-mini, OpenAI explicitly mentioned the concept of 'thinking with images,' meaning the model can incorporate images into the reasoning chain and process them by cropping, zooming, rotating, and other methods. The focus of this direction is to make images themselves part of the reasoning chain, allowing the model to generate new images, modify images, and manipulate them during the reasoning process.
OpenAI's approach emphasizes general capabilities, with vision, code, search, documents, and tool invocation all working together. The model has a powerful 'visual workbench' that can flexibly handle various visual tasks.
DeepSeek’s approach, on the other hand, is more 'symbolic.' It brings coordinates into the reasoning chain. The model explicitly writes out bounding boxes and point coordinates in the reasoning text, turning visual objects into reusable anchors during reasoning.
This results in OpenAI’s visual reasoning taking place internally, where users can only see the final answer and necessary explanations, while the intermediate visual processing remains a black box. DeepSeek, however, deliberately makes intermediate visual anchors explicit, making the entire reasoning process transparent.
The advantage of DeepSeek’s approach is that the reasoning process becomes easier to train, inspect, and score. This also makes it easier to design format, quality, and task-level rewards. Especially in tasks like mazes and path tracing, it allows for finer feedback on path legality, trajectory coverage, and so on.
The model not only learns to output the correct answer but also masters the method of reasoning using visual primitives.
02
Efficiency is at the core.
One easily overlooked but extremely important detail in DeepSeek’s report is that their model uses far fewer tokens than other cutting-edge models when processing images.
The report contains a comparison chart showing the number of tokens consumed by different models when processing an 800×800 resolution image.
Gemini-3-Flash has approximately 1,100 tokens, Claude-Sonnet-4.6 around 870 tokens, GPT-5.4 about 740 tokens, Qwen3-VL roughly 660 tokens, and DeepSeek approximately 361 tokens, while retaining only about 90 entries in the KV cache.
This gap is not small. The number of tokens used by DeepSeek is only one-third of Gemini's, and the number of KV cache entries is only about one-tenth.
How is such extreme efficiency achieved?
DeepSeek uses a mechanism called 'Compressed Sparse Attention (CSA)'.
You can think of it this way: if you show your friend a family portrait, you wouldn’t say, 'Starting from the 237th pixel on the left, there’s a red area…'; instead, you’d directly say, 'My mom is on the left, and my dad is on the right.'
DeepSeek-ViT first compresses the image into fewer visual tokens, and CSA further compresses the representation of these visual tokens in the KV cache.
This mechanism was previously used in the DeepSeek-V4-Flash model and is now applied to multimodal vision tasks.
The specific compression process works as follows: A 756x756 image contains 571,536 pixels. These pixels are first processed by ViT, divided into patches with a size of 14x14, generating 2,916 patch tokens. Then, a 3x3 spatial compression is performed, compressing every 9 adjacent tokens along the channel dimension into 1, resulting in 324 visual tokens.
These 324 tokens enter the large language model for pre-filling. Finally, the CSA mechanism compresses these visual tokens in the KV cache by another 4x, leaving only 81 entries.
From 571,536 pixels to 81 KV cache entries, the overall compression ratio reaches 7,056 times.
Typically, large AI companies rely on brute-force methods to amass computing resources, whereas DeepSeek makes trade-offs at the information theory level, retaining only the most straightforward and comprehensible information.
The most immediate result is a significant increase in inference speed.

The number of image tokens directly affects the model's inference latency. During the autoregressive generation process, every time a new token is generated, the model needs to perform attention calculations on the KV cache of all previous tokens. If an image occupies 1,000 tokens, then each generation step requires performing attention on those 1,000 tokens. If it only occupies 90 tokens, the computational load decreases significantly.
For application scenarios requiring real-time response, such as robotic vision, autonomous driving, and real-time video analysis, the improvement in inference speed plays a decisive role.
Additionally, it consumes less memory.
KV cache is the memory bottleneck for large model inference. Especially when handling long contexts or batch inference, the KV cache can consume a large amount of GPU memory. By compressing the visual token KV cache to 90 entries, DeepSeek allows processing more images on the same hardware or managing longer multi-turn conversations.
This is crucial for actual deployment. Many companies' multimodal models perform well in labs but encounter cost issues during practical deployment. The more tokens each image consumes, the higher the inference costs, and the fewer concurrent users the system can support. DeepSeek's efficiency advantages are magnified during large-scale deployments.
It also indirectly increases the model's context capacity.
If an image takes up 1,000 tokens, then within a 128k context window, only about 100 images can fit. If it uses only 300 tokens, over 400 images can be accommodated. This is essential for scenarios that require processing multi-image conversations, long video analysis, and extensive document comprehension.
DeepSeek's model can handle more images within a single conversation, compare and analyze dozens or even hundreds of images, and track long-term changes in videos.
The most critical aspect is the training cost.
Although the report mainly discusses inference efficiency, this compression mechanism is equally effective during the training phase. Fewer visual tokens mean a smaller computational graph, faster training speed, and lower hardware requirements.
DeepSeek has always been known for achieving better results with fewer resources. From the reinforcement learning training of R1, to the MoE architecture of V4, to the current visual multimodal model, this efficiency-first philosophy has been consistently applied.
However, there is a key question: Will compression lead to information loss?
DeepSeek does not deny that compression may result in some information loss. Its argument is that, for these spatial reasoning and counting tasks, the compressed representations remain sufficiently effective.
Each step of compression retains the most crucial information for reasoning while discarding redundancy and noise.
In fact, the visual primitive mechanism mentioned earlier by DeepSeek is itself a form of information compression. A bounding box can precisely locate an object using four numbers, and a point can mark a position using two numbers. These discrete symbols carry much higher information density compared to raw pixels.
From the experimental results, this compression has not harmed performance; instead, it has brought improvements in certain tasks.
This suggests that for many visual reasoning tasks, the bottleneck is not about seeing things less clearly but rather about failing to find the right representation method.
This efficiency advantage also demonstrates that multimodal intelligence does not necessarily require larger models, more computing power, or higher costs.
Since the inception of DeepSeek, there has been a hidden theme within this company: 'True intelligence does not lie in computational power but in understanding the essence of problems.'
When you truly understand what visual reasoning requires, you don’t need as many tokens. When you find the right representation, you don’t need such a large model.
From this perspective, DeepSeek’s extreme efficiency is not the goal but a byproduct. The real objective is to find the correct paradigm for visual reasoning. Efficiency simply proves that this paradigm is correct.
03
Unfinished Business
In the limitations section of its report, DeepSeek candidly listed several issues with the current approach. These are not minor technical flaws but point to the next phase of visual reasoning.
The first issue is trigger-word dependency.
The report explicitly states that the current ability to 'think using visual primitives' requires explicit trigger words to activate. In other words, the model cannot naturally or autonomously decide 'when to draw boxes or place points.'
This means the model has not yet truly learned when to use visual primitives and when language alone suffices.
Ideally, the model should be able to make autonomous decisions based on the nature of the task. But when a user asks, 'Count how many dogs are in the picture,' the model should automatically switch to visual primitive mode and use bounding boxes to assist with counting.
Technically speaking, this requires building a metacognitive layer into the model. This metacognitive layer can assess the complexity of the current task, determine whether pure language reasoning is sufficient, and decide whether to invoke visual primitives.
DeepSeek has not yet implemented this metacognitive layer, but they have already clarified the direction. Future versions may allow the model to learn how to independently decide reasoning strategies rather than relying on external triggers.
The second issue is resolution limitation.
The report mentions that due to input resolution constraints, the model's performance in fine-grained scenarios is still insufficient, with output visual primitives sometimes lacking precision.
This issue is related to DeepSeek’s efficiency-first strategy. To control the number of tokens, they have limited the range of visual tokens between 81 and 384. Images exceeding this range will undergo scaling.
This design is reasonable in most scenarios, but it encounters bottlenecks in tasks requiring extremely high precision. For example, medical image analysis needs to identify tiny lesions, and industrial quality inspection requires detecting subtle defects, both of which demand high resolution.
DeepSeek mentioned in the report that this issue can be resolved by integrating existing high-resolution methods. In other words, their visual primitive framework and traditional high-resolution cropping methods are not mutually exclusive but complementary.
I believe DeepSeek could introduce a hybrid solution.
Specifically, for most routine tasks, compressed visual representations and visual primitive reasoning would be used to maintain high efficiency. For local areas requiring fine-grained analysis, high-resolution cropping would be dynamically invoked to extract more detailed visual information. This approach maintains overall efficiency while meeting the need for local precision.
The key to this hybrid solution is enabling the model to determine which areas require high-resolution processing. This brings us back to the earlier issue of metacognition.
The third issue is cross-scenario generalization.

The report mentions that using points as visual primitives to solve complex topological reasoning problems remains difficult, and the model's cross-scene generalization ability is limited.
This issue is particularly evident in maze navigation and path tracing tasks. Although DeepSeek achieved 66.9% and 56.7% accuracy on its self-constructed test set, surpassing other models, these figures are still insufficient.
More importantly, these tasks are trained and tested on synthetic data. Mazes are algorithmically generated, and path-tracing curves are procedurally drawn. When the model encounters real-world topological reasoning problems, such as planning paths on actual maps or tracking connections in complex pipeline diagrams, performance may decline.
DeepSeek's approach is to enhance generalization capabilities through large-scale, high-diversity data. They crawled 97,984 data sources, retaining 31,701 after strict filtering, ultimately yielding over 40 million samples. For maze and path tracing tasks, they also designed various topologies, visual styles, and difficulty levels to cover as many variations as possible.
However, data diversity is only part of generalization ability. Does the model truly understand the essence of topological reasoning, or has it merely memorized patterns within the training data?
Additionally, DeepSeek’s visual primitives represent a new representation system requiring specialized data formats, training processes, and evaluation methods. This is not entirely compatible with the existing multimodal ecosystem.
Most multimodal datasets and evaluation benchmarks are designed based on the traditional 'image + text' paradigm and do not account for visual primitives. To evaluate DeepSeek’s model on these benchmarks, either the visual primitive functionality must be disabled, or the evaluation method needs to be redesigned.
Other researchers who wish to reproduce or improve upon this work need to reconstruct the entire data and training process, presenting a relatively high barrier to entry.
The fact that DeepSeek addresses these issues in their report indicates they have a clear understanding of their own work.
This may be more valuable than providing perfect answers. Because what truly drives societal progress is often not the answers, but the questions. $DeepSeek Beneficiaries (LIST23585.US)$$DeepSeek Beneficiaries$$DeepSeek Beneficiaries$$AI (LIST0535.SH)$$Artificial Intelligence (LIST2136.US)$$Artificial Intelligence (LIST23586.HK)$$Alphabet-C (GOOG.US)$$Alphabet-A (GOOGL.US)$
Risk Disclaimer: The above content only represents the author's view. It does not represent any position or investment advice of Futu. Futu makes no representation or warranty.Read more
Comments
to post a comment
2
