Starting as early as 2006, we discussed deploying GPUs, FPGAs, or custom ASICs in our datacenters.

Google started the accelerator discussion so early!

The philosophy of the TPU microarchitecture is to keep the matrix unit busy.
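One way to read that philosophy in code: a toy double-buffering loop (my own sketch, not anything from the paper) where the fetch of the next tile's operands overlaps compute on the current tile, so the matrix unit never sits idle waiting on memory. `pipelined`, `fetch`, and `matmul` are hypothetical names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined(tiles, fetch, matmul):
    """Overlap fetching tile i+1 with compute on tile i (double buffering)."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch, tiles[0])              # prime the pipeline
        for i in range(len(tiles)):
            ops = pending.result()                        # wait for this tile's operands
            if i + 1 < len(tiles):
                pending = io.submit(fetch, tiles[i + 1])  # prefetch the next tile
            results.append(matmul(ops))                   # compute overlaps the prefetch
    return results
```

Once the pipeline is primed, each iteration's `matmul` runs while the background thread fetches the next operands, which is the same "keep the compute unit busy" idea at a much smaller scale.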

They chose CISC over RISC. Any reasons?

Drivers:

Like GPUs, the TPU stack is split into a User Space Driver and a Kernel Driver. The Kernel Driver is lightweight, handles only memory management and interrupts, and is designed for long-term stability. The User Space Driver, in contrast, changes frequently.
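A hypothetical sketch of that split (class and method names are mine, not Google's): the kernel side exposes only a small, stable surface, while the user-space side owns all the fast-moving logic and talks to the device through it.

```python
class KernelDriver:
    """Small, rarely-changing surface: memory mapping and interrupts only."""
    def map_device_memory(self, size):
        return bytearray(size)          # stand-in for an mmap'd device region
    def wait_interrupt(self):
        return "done"                   # stand-in for blocking on an IRQ

class UserSpaceDriver:
    """Fast-moving logic: builds command buffers, drives the kernel surface."""
    def __init__(self, kernel):
        self.kernel = kernel
        self.queue = self.kernel.map_device_memory(4096)
    def run(self, program):
        self.queue[:len(program)] = program   # enqueue "compiled" commands
        return self.kernel.wait_interrupt()   # block until hardware signals
```

The design payoff is that the kernel component can stay unchanged across compiler and runtime updates, since only `UserSpaceDriver` needs to track new workloads.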

While most architects have been accelerating CNNs, they represent just 5% of our datacenter workload.

This is quite interesting. I would have expected CNNs to be the dominant model in 2014, but apparently what's popular in academia isn't necessarily what's popular in industry. Facebook's measurement paper somewhat coincides with this: it reports the share of time spent in different operator types as fully connected 42%, embedding lookup 17%, tensor manipulation 17%, convolution 4%, and the rest in other operators. Given these data points, this TPU paper's emphasis on MLPs and LSTMs seems reasonable.
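A quick sanity check on those numbers (the four shares are as cited from the Facebook paper; the "other" bucket is just the remainder I computed): convolution is dwarfed by MLP-style work.

```python
# Operator-time shares (%) from the Facebook measurement paper cited above.
shares = {
    "fully_connected": 42,
    "embedding_lookup": 17,
    "tensor_manipulation": 17,
    "convolution": 4,
}
shares["other"] = 100 - sum(shares.values())   # remainder, not reported per-op

# Fully connected layers alone get >10x the time convolution does.
fc_over_conv = shares["fully_connected"] / shares["convolution"]
```

So even before the TPU paper's own 5% figure, the operator breakdown already suggests that accelerating only CNNs leaves most datacenter inference time on the table.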