TLDR: See Table 1 for conclusions.

Scope:

  • Training, not inference
  • Hardware: TPU v2/v3, NVIDIA V100, Intel Skylake CPU
  • A single accelerator, i.e., no multi-GPU or 256-node TPU setups

Models and hyperparameters in ParaDnn:

  • Fully Connected: Input → [Layer[Node]] → Output

    The tunable hyperparameters are the number of layers, the number of nodes per layer, and the numbers of input and output units of the dataset (see the FC builder sketch after this list).

  • Convolutional Neural Networks: Input → [Residual/Bottleneck Block]×4 → FC → Output

    A residual network contains four groups of blocks [21], followed by a fully-connected layer; each block is either a residual block or a bottleneck block. Residual blocks have two convolutional layers and two batch-normalization layers, while bottleneck blocks have three of each. Typically the minimum number of filters of a residual network is 64, and it doubles in every group, so the maximum is 512 filters. We sweep the number of blocks per group, the minimum number of filters, and the dataset parameters: the input images and the number of output categories. An input image is square with three channels, so it is represented by its side length. To keep the study tractable, every group has the same number of blocks (see the CNN builder sketch after this list).

  • Recurrent Neural Networks: Input → [RNN/LSTM/GRU Cell] → Output

    Each token of the input sequence is embedded into a fixed-length vector, whose length is the embedding size. In ParaDnn, the number of layers and the embedding size are variable. The dataset variables are the maximum length per input sequence and the vocabulary size (see the RNN builder sketch after this list).

  • The tunable ParaDnn models range from about 10k to 1B parameters.
  • Real-world models: Transformer, RetinaNet, ResNet-50, DenseNet, MobileNet, SqueezeNet
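
A minimal Keras sketch of the FC template, with the swept axes (layer count, nodes per layer, input/output units) as arguments. The function name, the ReLU activation, and the example sweep point are my assumptions, not ParaDnn's actual code.

```python
# Minimal sketch of the ParaDnn FC template (Input → [Layer[Node]] → Output).
import tensorflow as tf

def build_fc(num_layers, nodes_per_layer, input_units, output_units):
    inputs = tf.keras.Input(shape=(input_units,))
    x = inputs
    for _ in range(num_layers):                       # [Layer[Node]] stack
        x = tf.keras.layers.Dense(nodes_per_layer, activation="relu")(x)
    outputs = tf.keras.layers.Dense(output_units)(x)  # output units from the dataset
    return tf.keras.Model(inputs, outputs)

# Example sweep point: 8 layers × 4096 nodes, 2000 input units, 1000 output units.
model = build_fc(num_layers=8, nodes_per_layer=4096,
                 input_units=2000, output_units=1000)
```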
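
A similar sketch for the CNN template: four groups of residual blocks with the filter count doubling per group, then a fully-connected layer. The strides, the pooling between groups, and the 1×1 shortcut projection are illustrative assumptions; the bottleneck variant (three conv + three batch-norm layers) is omitted for brevity.

```python
# Sketch of the ParaDnn CNN template: Input → [Residual Block]×4 groups → FC → Output.
import tensorflow as tf
L = tf.keras.layers

def residual_block(x, filters):
    shortcut = x
    y = L.Conv2D(filters, 3, padding="same")(x)       # conv 1
    y = L.BatchNormalization()(y)
    y = L.ReLU()(y)
    y = L.Conv2D(filters, 3, padding="same")(y)       # conv 2
    y = L.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:                 # match channels for the skip
        shortcut = L.Conv2D(filters, 1)(shortcut)
    return L.ReLU()(L.Add()([y, shortcut]))

def build_cnn(blocks_per_group, min_filters, image_size, num_classes):
    inputs = tf.keras.Input(shape=(image_size, image_size, 3))  # square, 3 channels
    x = L.Conv2D(min_filters, 7, strides=2, padding="same")(inputs)
    for group in range(4):                            # four groups of blocks
        filters = min_filters * (2 ** group)          # filters double every group
        for _ in range(blocks_per_group):             # same block count per group
            x = residual_block(x, filters)
        x = L.MaxPooling2D()(x)                       # downsample between groups
    x = L.GlobalAveragePooling2D()(x)
    outputs = L.Dense(num_classes)(x)                 # final fully-connected layer
    return tf.keras.Model(inputs, outputs)

# Example sweep point: 3 blocks per group, filters 64..512, 224×224 inputs, 1000 classes.
model = build_cnn(blocks_per_group=3, min_filters=64, image_size=224, num_classes=1000)
```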
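
And a sketch for the RNN template: an embedding lookup feeding a stack of RNN/LSTM/GRU layers. Reusing the embedding size as the hidden size and ending with a vocabulary-sized Dense layer are assumptions made for this sketch.

```python
# Sketch of the ParaDnn RNN template: Input → [RNN/LSTM/GRU Cell] → Output.
import tensorflow as tf
L = tf.keras.layers

CELLS = {"rnn": L.SimpleRNN, "lstm": L.LSTM, "gru": L.GRU}

def build_rnn(cell_type, num_layers, embedding_size, max_length, vocab_size):
    inputs = tf.keras.Input(shape=(max_length,), dtype="int32")   # token ids
    x = L.Embedding(vocab_size, embedding_size)(inputs)           # fixed-length vectors
    for i in range(num_layers):
        # all layers except the last return full sequences so they can be stacked
        x = CELLS[cell_type](embedding_size,
                             return_sequences=(i < num_layers - 1))(x)
    outputs = L.Dense(vocab_size)(x)
    return tf.keras.Model(inputs, outputs)

# Example sweep point: 2-layer LSTM, embedding size 512, 64-token sequences.
model = build_rnn("lstm", num_layers=2, embedding_size=512,
                  max_length=64, vocab_size=32000)
```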

[Related works…] Their limitation is that they only contain today’s deep learning models, which may become obsolete as DL models evolve rapidly.

But ParaDnn also only contains today's model types, doesn't it? And MLPerf states that more models can be added to it later.

Hardware: the GPU and TPU both use HBM instead of DDR, which provides far more memory bandwidth.

Roofline Model:

  • FC: larger batch sizes make FCs more compute-bound, while more nodes per layer make FCs more memory-bound (see the roofline sketch after this list).
  • CNN: models close to ResNet-50 are compute-bound, while a majority of the CNNs are bottlenecked by memory bandwidth. The CNNs’ higher FLOPS comes from higher arithmetic intensity caused by more filters.
  • The only compute-bound operation is large fused MatMul.
  • Even compute-bound FC/CNN models contain a noticeable fraction of memory-bound operations.
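
A back-of-the-envelope arithmetic-intensity check for a fused MatMul illustrates the FC bullet above. The traffic model (read activations and weights once, write outputs once, 2-byte elements) and the V100-style ridge point (~125 TFLOPS fp16 / ~900 GB/s HBM2) are simplifying assumptions, not measurements from the paper.

```python
# Roofline-style estimate for a fused MatMul of shape (B×N_in)·(N_in×N_out):
# larger batches raise arithmetic intensity (more compute-bound), while larger
# layers make weight traffic dominate (more memory-bound).

def matmul_intensity(batch, n_in, n_out, bytes_per_elem=2):
    flops = 2 * batch * n_in * n_out                  # multiply-accumulate count
    bytes_moved = bytes_per_elem * (batch * n_in      # read activations
                                    + n_in * n_out    # read weights
                                    + batch * n_out)  # write outputs
    return flops / bytes_moved                        # FLOPs per byte

ridge = 125e12 / 900e9   # ≈139 FLOPs/byte: above → compute-bound, below → memory-bound

for batch in (64, 512, 16384):
    ai = matmul_intensity(batch, n_in=4096, n_out=4096)
    kind = "compute-bound" if ai > ridge else "memory-bound"
    print(f"batch={batch:6d}  intensity={ai:7.1f} FLOPs/B  {kind}")
```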

Multi TPU cores:

  • Inter-core communication overhead is significant.
  • CNNs scale better than FCs because FCs contain more parameters, so the per-step weight/gradient synchronization across cores is larger (see the sketch below).
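
A rough estimate of per-step gradient traffic under data parallelism shows why parameter count matters for multi-core scaling. The parameter counts and the ring all-reduce cost model (2·(p−1)/p of the gradient bytes per core) are illustrative assumptions, not measurements from the paper.

```python
# Per-step communication volume for synchronizing gradients across cores:
# it grows linearly with the number of model parameters.

def allreduce_bytes_per_core(num_params, num_cores, bytes_per_param=4):
    grad_bytes = num_params * bytes_per_param
    return 2 * (num_cores - 1) / num_cores * grad_bytes   # ring all-reduce volume

for name, params in [("large FC (~1B params)", 1_000_000_000),
                     ("ResNet-50-like CNN (~25M params)", 25_000_000)]:
    gb = allreduce_bytes_per_core(params, num_cores=8) / 1e9
    print(f"{name:33s} ≈ {gb:5.2f} GB of gradient traffic per core per step")
```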

Software stack advances: upgrading TensorFlow and upgrading CUDA both help performance.