• I think the idea of this paper is easy to accept.
  • The paper first points out the problems with existing data-parallel and model-parallel approaches, which it collectively calls intra-batch parallelism.
  • Although data parallelism is easy to use, its communication overhead is high because workers must synchronize all model parameters, which especially hurts models with many parameters.
    • Bottlenecked by inter-server links
    • Communication overhead increases as the number of workers increases
    • Why? Each worker must synchronize gradients for the full set of parameters every step, and with more workers more of that traffic crosses the slower inter-server links, so synchronization takes up a larger share of each step.
    • Communication overhead (as a fraction of training time) also increases as GPU compute gets faster, since computation time shrinks while the amount of data to synchronize does not (rough sketch below).
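    • Rough back-of-the-envelope sketch of this effect (my own assumed numbers and an assumed ring all-reduce over fp32 gradients, not the paper's measurements):

      # All numbers are assumptions for illustration (roughly VGG-16-sized model, 10 Gbps links).
      def overhead_fraction(num_params, num_workers, link_gbps, step_compute_s):
          grad_bytes = num_params * 4                        # fp32 gradients
          # Ring all-reduce: each worker sends/receives ~2*(N-1)/N of the gradient.
          comm_bytes = 2 * (num_workers - 1) / num_workers * grad_bytes
          comm_s = comm_bytes * 8 / (link_gbps * 1e9)        # time spent on the inter-server link
          return comm_s / (comm_s + step_compute_s)

      params = 138_000_000                                   # assumed parameter count (~VGG-16 scale)
      for workers in (2, 4, 8, 16):
          for compute_s in (0.4, 0.1):                       # slower vs. faster GPU per-step compute
              f = overhead_fraction(params, workers, link_gbps=10, step_compute_s=compute_s)
              print(f"workers={workers:2d} compute={compute_s}s -> comm fraction {f:.0%}")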
  • Model parallelism only communicates intermediate results (activations and, on the backward pass, their gradients) between workers, which is typically much less data than synchronizing all parameters (rough comparison below).
    • Suffers from resource under-utilization (Figure 2); it is more applicable when the model is too large to fit on a single GPU.
    • Need to manually partition models.
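    • Rough comparison (my own illustrative shapes, not the paper's measurements) of what crosses the network per step: full gradients for data parallelism vs. boundary activations for model parallelism:

      batch_size   = 32
      num_params   = 138_000_000                  # assumed VGG-16-scale parameter count
      boundary_act = batch_size * 7 * 7 * 512     # assumed activation shape at a late conv layer

      grad_mb = num_params * 4 / 1e6              # fp32 gradients synchronized every step
      act_mb  = boundary_act * 4 / 1e6            # fp32 activations sent forward (an equal amount
                                                  # of activation gradients flows backward)
      print(f"gradients synchronized per step: {grad_mb:7.1f} MB")
      print(f"activations at one boundary    : {act_mb:7.1f} MB")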
  • They propose pipeline parallelism.
    • Partition model layers into consecutive stages.
    • Each stage maps to a GPU (model parallelism)
    • To keep the pipeline efficient, each stage should take approximately the same amount of time (a minimal balancing sketch is below). When such a balanced partition is not possible, they also replicate a stage across multiple GPUs and run different training inputs on the replicas (data parallelism).
    • In steady state, each worker alternates between a forward pass for one minibatch and a backward pass for an earlier one (the 1F1B schedule; see the sketch at the end of these notes).
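    • Minimal sketch of the stage-balancing idea: split consecutive layers into stages so the slowest stage is as fast as possible. The layer times below are made-up stand-ins for the profiled values a real partitioner would use, and this ignores communication cost and stage replication:

      from functools import lru_cache

      layer_times = [4, 9, 3, 7, 6, 2, 5, 8]   # assumed per-layer forward+backward times (ms)
      num_stages  = 4

      @lru_cache(maxsize=None)
      def best_split(start, stages_left):
          """Smallest achievable max-stage time for layer_times[start:] using stages_left stages."""
          if stages_left == 1:
              return sum(layer_times[start:])
          best, running = float("inf"), 0
          # Try every cut point for the current stage; recurse on the remaining layers.
          for end in range(start + 1, len(layer_times) - stages_left + 2):
              running += layer_times[end - 1]
              best = min(best, max(running, best_split(end, stages_left - 1)))
          return best

      print("best achievable max stage time:", best_split(0, num_stages), "ms")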
  • How do they partition layers: by hand or automatically? And how would they partition models with non-sequential structure, such as GoogLeNet?
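  • Tiny illustration (my own code, not the paper's) of the steady-state one-forward-one-backward (1F1B) pattern: after a short warm-up, every stage alternates a forward pass for a new minibatch with a backward pass for an older one:

    def schedule(stage, num_stages, num_minibatches):
        ops, fwd, bwd = [], 0, 0
        warmup = num_stages - stage          # later stages keep fewer minibatches in flight
        # Warm-up: admit enough minibatches to fill the rest of the pipeline.
        while fwd < warmup and fwd < num_minibatches:
            ops.append(f"F{fwd}"); fwd += 1
        # Steady state: strictly alternate backward (oldest) and forward (newest).
        while bwd < num_minibatches:
            ops.append(f"B{bwd}"); bwd += 1
            if fwd < num_minibatches:
                ops.append(f"F{fwd}"); fwd += 1
        return ops

    for s in range(4):
        print(f"stage {s}: {' '.join(schedule(s, num_stages=4, num_minibatches=8))}")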