
Gemini 3 Flash’s new ‘Agentic Vision’ improves image responses

Agentic Vision is a new capability for the Gemini 3 Flash model to make image-related tasks more accurate by “grounding answers in visual evidence.”

Frontier AI models like Gemini typically process the world in a single, static glance. If they miss a fine-grained detail — like a serial number on a microchip or a distant street sign — they are forced to guess.

This new approach “treats vision as an active investigation” by combining visual reasoning with code execution, with other tools to follow in the future.

To answer prompts with images, Gemini 3 Flash will formulate “plans to zoom in, inspect and manipulate images step-by-step.” Specifically, Agentic Vision leverages a “Think, Act, Observe loop.”


  1. Think: The model analyzes the user query and the initial image, formulating a multi-step plan.
  2. Act: The model generates and executes Python code to actively manipulate images (e.g. cropping, rotating, annotating) or analyze them (e.g. running calculations, counting bounding boxes); see the sketch after this list.
  3. Observe: The transformed image is appended to the model’s context window, letting the model inspect the new data with better context before generating a final response.
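
Conceptually, the loop looks something like the following Python sketch. To be clear, Google has not published its implementation; the `plan`, `execute`, and `answer` callables below are our own placeholders, and only the loop structure mirrors the description above.

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical sketch of the Think-Act-Observe loop. `plan`, `execute`,
# and `answer` stand in for model/tool calls Google hasn't published.

@dataclass
class Step:
    done: bool        # True once the model is ready to answer
    code: str = ""    # model-generated Python (crop, rotate, annotate, ...)

def agentic_vision(query, image, plan, execute, answer, max_steps=5):
    context: List[Any] = [query, image]    # everything the model has seen
    for _ in range(max_steps):
        step = plan(context)               # Think: formulate the next step
        if step.done:
            break
        image = execute(step.code, image)  # Act: run the generated code
        context.append(image)              # Observe: result joins the context
    return answer(context)                 # final, evidence-grounded response
```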

Instead of just describing an image it’s given, Gemini 3 Flash “can execute code to draw directly on the canvas to ground its reasoning.” One example of this image annotation in the Gemini app is asking it “to count the digits on a hand.”


To avoid counting errors, it uses Python to draw bounding boxes and numeric labels over each finger it identifies. This “visual scratchpad” ensures that its final answer is based on pixel-perfect understanding.
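
In the spirit of that scratchpad, here is what such an annotation step could look like using Pillow. The detection coordinates are invented for illustration; in practice the model itself would produce both the detections and the code.

```python
from PIL import Image, ImageDraw

# Illustrative "visual scratchpad": draw a numbered box over each detected
# finger. These coordinates are made up for the example.
image = Image.open("hand.jpg")
draw = ImageDraw.Draw(image)
detections = [
    (40, 30, 90, 160),
    (100, 20, 150, 150),
    (160, 25, 210, 155),
    (220, 40, 270, 165),
    (280, 90, 330, 200),
]

for i, (left, top, right, bottom) in enumerate(detections, start=1):
    draw.rectangle((left, top, right, bottom), outline="red", width=3)
    draw.text((left, top - 14), str(i), fill="red")  # numeric label per box

image.save("hand_annotated.jpg")
print(f"Counted {len(detections)} digits")
```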

Meanwhile, Gemini 3 Flash will zoom in when it detects fine-grained details in the image. Agentic Vision can also “parse high-density tables and execute Python code to visualize the findings.”
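
A “zoom” step is likewise a simple crop-and-enlarge in code. The sketch below is our own illustration, with a placeholder crop box standing in for a region the model flags:

```python
from PIL import Image

# Minimal sketch of a "zoom" step: crop a suspected fine-detail region and
# enlarge it before re-inspection. The crop box is a placeholder; in
# practice the model decides what to look at.
image = Image.open("street_scene.jpg")
box = (420, 310, 520, 370)  # (left, upper, right, lower), hypothetical
region = image.crop(box)
zoomed = region.resize(
    (4 * region.width, 4 * region.height),  # 4x enlargement
    resample=Image.Resampling.LANCZOS,
)
zoomed.save("street_sign_zoom.jpg")  # this is what rejoins the context
```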

Standard LLMs often hallucinate during multi-step visual arithmetic. Gemini 3 Flash bypasses this by offloading computation to a deterministic Python environment… This replaces probabilistic guessing with verifiable execution.
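
The point is that once values are parsed out of an image, the arithmetic happens in ordinary Python rather than in the model’s token sampling. A toy example, with invented figures:

```python
# Deterministic offload: after the model parses table cells from an image,
# plain Python computes the answer instead of the model "guessing" it.
# All figures here are invented for illustration.
rows = {"Q1": 12_430, "Q2": 15_210, "Q3": 9_875, "Q4": 18_640}

total = sum(rows.values())

print(f"Total: {total}")
for quarter, value in rows.items():
    print(f"{quarter}: {value / total:.1%}")  # verifiable share per quarter
```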

Agentic Vision results in a “consistent 5-10% quality boost across most vision benchmarks” for Gemini 3 Flash.

This is starting to roll out to the Gemini app with the Thinking model. For developers, it’s available today with the Gemini API in Google AI Studio and Vertex AI.

Today, Agentic Vision will implicitly decide when to zoom. In the future, Gemini 3 Flash will get better at rotating images or performing visual math without an “explicit prompt nudge to trigger.”

In addition to code execution, future tools will allow Gemini to use web and reverse image search to “ground its understanding of the world even further.” Agentic Vision will also be available with other Gemini models.




Author

Abner Li

Editor-in-chief. Interested in the minutiae of Google and Alphabet. Tips/talk: abner@9to5g.com