
Gemini 3 Flash’s new ‘Agentic Vision’ improves image responses

Agentic Vision is a new capability for the Gemini 3 Flash model to make image-related tasks more accurate by “grounding answers in visual evidence.”

Frontier AI models like Gemini typically process the world in a single, static glance. If they miss a fine-grained detail — like a serial number on a microchip or a distant street sign — they are forced to guess.

This new approach “treats vision as an active investigation” by combining visual reasoning with code execution, with other tools to follow in the future.

To answer prompts with images, Gemini 3 Flash will formulate “plans to zoom in, inspect and manipulate images step-by-step.” Specifically, Agentic Vision leverages a “Think, Act, Observe loop.”


  1. Think: The model analyzes the user query and the initial image, formulating a multi-step plan.
  2. Act: The model generates and executes Python code to actively manipulate images (e.g. cropping, rotating, annotating) or analyze them (e.g. running calculations, counting bounding boxes); see the sketch after this list.
  3. Observe: The transformed image is appended to the model’s context window, letting the model inspect the new data with better context before generating a final response.
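
Conceptually, the loop looks something like the following Python sketch. To be clear, Google has not published its implementation; the `plan`, `execute`, and `answer` callables below are our own placeholders, and only the loop structure mirrors the description above.

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical sketch of the Think-Act-Observe loop. `plan`, `execute`,
# and `answer` stand in for model/tool calls Google hasn't published.

@dataclass
class Step:
    done: bool        # True once the model is ready to answer
    code: str = ""    # model-generated Python (crop, rotate, annotate, ...)

def agentic_vision(query, image, plan, execute, answer, max_steps=5):
    context: List[Any] = [query, image]    # everything the model has seen
    for _ in range(max_steps):
        step = plan(context)               # Think: formulate the next step
        if step.done:
            break
        image = execute(step.code, image)  # Act: run the generated code
        context.append(image)              # Observe: result joins the context
    return answer(context)                 # final, evidence-grounded response
```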

Instead of just describing an image it’s given, Gemini 3 Flash “can execute code to draw directly on the canvas to ground its reasoning.” One example of this image annotation in the Gemini app is asking it “to count the digits on a hand.”


To avoid counting errors, it uses Python to draw bounding boxes and numeric labels over each finger it identifies. This “visual scratchpad” ensures that its final answer is based on pixel-perfect understanding.
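
In the spirit of that scratchpad, here is what such an annotation step could look like using Pillow. The detection coordinates are invented for illustration; in practice the model itself would produce both the detections and the code.

```python
from PIL import Image, ImageDraw

# Illustrative "visual scratchpad": draw a numbered box over each detected
# finger. These coordinates are made up for the example.
image = Image.open("hand.jpg")
draw = ImageDraw.Draw(image)
detections = [
    (40, 30, 90, 160),
    (100, 20, 150, 150),
    (160, 25, 210, 155),
    (220, 40, 270, 165),
    (280, 90, 330, 200),
]

for i, (left, top, right, bottom) in enumerate(detections, start=1):
    draw.rectangle((left, top, right, bottom), outline="red", width=3)
    draw.text((left, top - 14), str(i), fill="red")  # numeric label per box

image.save("hand_annotated.jpg")
print(f"Counted {len(detections)} digits")
```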

Meanwhile, Gemini 3 Flash will zoom in when it detects fine-grained details in the image. Agentic Vision can also “parse high-density tables and execute Python code to visualize the findings.”
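
A “zoom” step is likewise a simple crop-and-enlarge in code. The sketch below is our own illustration, with a placeholder crop box standing in for a region the model flags:

```python
from PIL import Image

# Minimal sketch of a "zoom" step: crop a suspected fine-detail region and
# enlarge it before re-inspection. The crop box is a placeholder; in
# practice the model decides what to look at.
image = Image.open("street_scene.jpg")
box = (420, 310, 520, 370)  # (left, upper, right, lower), hypothetical
region = image.crop(box)
zoomed = region.resize(
    (4 * region.width, 4 * region.height),  # 4x enlargement
    resample=Image.Resampling.LANCZOS,
)
zoomed.save("street_sign_zoom.jpg")  # this is what rejoins the context
```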

Standard LLMs often hallucinate during multi-step visual arithmetic. Gemini 3 Flash bypasses this by offloading computation to a deterministic Python environment… This replaces probabilistic guessing with verifiable execution.
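
The point is that once values are parsed out of an image, the arithmetic happens in ordinary Python rather than in the model’s token sampling. A toy example, with invented figures:

```python
# Deterministic offload: after the model parses table cells from an image,
# plain Python computes the answer instead of the model "guessing" it.
# All figures here are invented for illustration.
rows = {"Q1": 12_430, "Q2": 15_210, "Q3": 9_875, "Q4": 18_640}

total = sum(rows.values())

print(f"Total: {total}")
for quarter, value in rows.items():
    print(f"{quarter}: {value / total:.1%}")  # verifiable share per quarter
```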

Agentic Vision results in a “consistent 5-10% quality boost across most vision benchmarks” for Gemini 3 Flash.

This is starting to roll out to the Gemini app with the Thinking model. For developers, it’s available today with the Gemini API in Google AI Studio and Vertex AI.

Today, Agentic Vision will implicitly decide when to zoom. In the future, Gemini 3 Flash will get better at rotating images or performing visual math without an “explicit prompt nudge to trigger.”

In addition to code execution, future tools will allow Gemini to use web and reverse image search to “ground its understanding of the world even further.” Agentic Vision will also be available with other Gemini models.




Author

Abner Li

Editor-in-chief. Interested in the minutiae of Google and Alphabet. Tips/talk: abner@9to5g.com