All Google chip design now takes place in the cloud

Abner Li | Mar 28 2023 - 10:30 am PT

Besides the Tensor SoC in Pixel phones, Google has developed other custom chips, and all of its silicon design now takes place in the cloud.

Previously, Google’s chip development infrastructure team used “dozens of racks and hundreds of servers” in a data center.

As projects began to mount, so did the implementation challenges, with hardware costs doubling annually, and each new initiative requiring new engineers and infrastructure. When the team was prioritizing hiring engineers simply to manage and optimize legacy machines, they knew they were losing sight of their core focus: growth and innovation.

Later on, this team “explored a hybrid solution using Google’s internal software design environment, and some Electronic Design Automation (EDA) workloads sent into Google Cloud.”

While the approach was reliable in the short-term, delays in transferring workloads for analysis would leave engineers waiting around for results. The added burden of having two desktops running concurrently, one for their design environment and one for their results in Google Cloud, led to a rethink.

The chip division eventually decided to migrate fully to the cloud with help from an internal “Alphabet Cloud” team that’s responsible for “helping teams across Alphabet accelerate their adoption of Google Cloud’s unique offerings to drive faster development and scale, just like a customer’s platform team would.” The chip team uses Google Kubernetes Engine (GKE) for containers, along with Cloud Storage, Filestore, Cloud Spanner, BigQuery, and Pub/Sub for data.

This transition allowed the chip group to use Google Cloud’s existing ML algorithms to “efficiently navigate large search spaces and apply unique optimizations at various stages of chip design.”

This resulted in a shortened chip design process, reduced time-to-market, expanded product areas for ML accelerators, and improved efficiency.

Since it’s easier to add more compute resources, “chip designers were able to run more jobs to weed out bugs.”

Since moving to Google Cloud, the team has increased daily job submissions by 170% over the past year while keeping scheduling latency flat. The workload runs on more than 250 GKE clusters spanning multiple Google Cloud regions.

On the business side, the move reduced operational costs, surfaced infrastructure bugs faster, and meant spending “less time on data center maintenance.”

According to the team, “all Google chip design projects are using Google Cloud” now.

The chip design team has launched full designs built using Google Cloud, including the last two generations of TPUs and YouTube’s video accelerator program, Argos VCU.