Gemini 3.5 Flash lands on Google’s Android coding rankings, but it’s 3x the cost for slower performance

Andrew Romero | Jun 12 2026 - 8:10 am PT

Google has released another set of benchmark results to determine the best AI models for Android coding, along with how much each model costs per token. Google’s Gemini 3.5 Flash is easily the most resource-intensive in Android development, and it doesn’t even make the top five.

As the hype for general chatbots is dying down, companies like Google, OpenAI, and Anthropic are shifting towards agentic models with a strength in coding. Users have begun relying on these models for “vibe coding,” which essentially offloads the bulk of software development to LLMs.

Recent models have dramatically improved their Android coding, and Google has kept tabs on which models perform best over the past few months. The “Android Bench” goes through updates as Google releases its own models, like the recent Gemini 3.5 Flash, and compares them to the competition.

The main takeaway is how Google breaks these models down. Each model gets a score out of 100, indicative of the percentage of Android coding cases it can successfully solve across 10 runs. Google lists expected performance and the date the last test was run, with some high performers sticking around since February.


In the latest edition of Android Bench, the results paint a more expensive picture. Gemini 3.5 Flash ranks 6th in the Android Bench list under models like GPT 5.5 and Gemini 3.1 Pro Preview, which was tested in February.

Gemini 3.5 Flash was touted as a cheaper and faster alternative to Gemini 3.1 Pro, with an expected performance gap of 6.1%. The new benchmark results say otherwise with regard to Android development: Gemini 3.5 Flash has a higher average latency (14.2 versus 11.1) and trails Gemini 3.1 Pro Preview by 8.7 points in score, a gap of roughly 9%.

The kicker: Google’s latest model uses an average of 355.9 tokens at a cost of $147.1 for one benchmark run, compared to Gemini 3.1 Pro Preview’s 73.3 tokens at around a third of that cost ($47.9).

Of course, it’s worth noting that Google lists the preview version of Gemini 3.1 Pro. That being said, the preview model scores higher than a model meant to be faster and more efficient.

GPT 5.5 ranks similarly in cost per run, but Gemini 3.5 Flash used 5.5x as many tokens in Android Bench tests. Anthropic’s previous model, Claude Opus 4.7, ranked 4th with a slightly lower run cost ($124.3) and far lower token usage (90.0) than Gemini 3.5 Flash, sitting right in the middle of the pack. Google has not released benchmark scores for Opus 4.8 or Fable 5, for that matter.

Here are the top 11 models ranked by Google in the latest Android Bench release:

Model	Score	Avg Latency	Avg Total Tokens	Avg Cost
GPT 5.5	74	15.7	64.7	$134.2
GPT 5.4	72.4	21.2	64.2	$91.7
Gemini 3.1 Pro Preview	72.4	11.1	73.3	$47.9
Claude Opus 4.7	68.7	11.6	90.0	$124.3
Claude Opus 4.6	66.6	9.9	69.5	$84.4
Gemini 3.5 Flash	63.7	14.2	355.9	$147.1
GLM 5.1	59.7	33.4	80.2	$46.7
Kimi K2.6	58.6	29.9	94.3	$42.5
Claude Sonnet 4.6	58.4	8.2	47.9	$40.4
DeepSeek V4 Pro	55.4	35.8	132.7	$13.7
Claude Sonnet 4.5	53.7	13.1	94.2	$61.0
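
For readers who want to check the ratios quoted in this article, here is a minimal Python sketch that recomputes them from three rows of the table. The `ROWS` dictionary and the `compare` helper are illustrative names, not anything Google publishes, and the figures are copied as listed without assuming what units the token and cost columns use.

```python
# Rows copied from the Android Bench table in this article:
# (score, avg latency, avg total tokens, avg cost in dollars).
# Units are taken as listed; this sketch does not assume what they are.
ROWS = {
    "GPT 5.5": (74.0, 15.7, 64.7, 134.2),
    "Gemini 3.1 Pro Preview": (72.4, 11.1, 73.3, 47.9),
    "Gemini 3.5 Flash": (63.7, 14.2, 355.9, 147.1),
}


def compare(model, baseline):
    """Hypothetical helper: score gap and token/cost ratios of model vs. baseline."""
    score_m, _, tokens_m, cost_m = ROWS[model]
    score_b, _, tokens_b, cost_b = ROWS[baseline]
    return {
        "score_gap": round(score_b - score_m, 1),     # points the model trails by
        "token_ratio": round(tokens_m / tokens_b, 1),  # tokens relative to baseline
        "cost_ratio": round(cost_m / cost_b, 1),       # cost relative to baseline
    }


flash_vs_pro = compare("Gemini 3.5 Flash", "Gemini 3.1 Pro Preview")
flash_vs_gpt = compare("Gemini 3.5 Flash", "GPT 5.5")
print(flash_vs_pro)
print(flash_vs_gpt)
```

Against Gemini 3.1 Pro Preview, this works out to roughly 3.1x the cost and 4.9x the tokens for a score 8.7 points lower. Against GPT 5.5, the cost is about 1.1x and the token count about 5.5x.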

The list includes several open-weight models alongside well-known closed-weight models like Claude and GPT. The high end of the list has effectively remained unchanged since the last Android Bench, with the exception of GPT 5.3 Codex, which has been removed from the list.

You can see the full rankings on Google’s website.

Google has regularly updated this list as more models are tested. At its core, it seems like a solid indicator of model performance in Android development. Gemini 3.5 Flash has been a solid improvement for other LLM and agentic tasks, even as Google has shifted cost and usage limits around. Google’s release numbers can’t be disregarded completely, though Android coding is apparently not Gemini 3.5 Flash’s strong suit.