SlowMo Distributed Data Parallel
Training neural networks in a distributed data-parallel manner scales sub-linearly because of the time spent on communication between nodes and, to a lesser extent, on synchronization between them. As a result, a distributed training run with 8 nodes is not 8x faster than a run with 1 node, as one might expect.
SlowMo Distributed Data Parallel aims to solve this by replacing the typical exact allreduce of gradients with an approximate averaging of parameters. This approximate averaging reduces both the time spent on communication and the synchronization between different nodes. It uses one of the following two algorithms (configurable) as a base algorithm for this purpose:
Local SGD (papers #1 and #2). This algorithm does an allreduce of the parameters every few iterations.
Stochastic Gradient Push (SGP). This algorithm involves one-to-one communications between nodes.
These base algorithms (LocalSGD and SGP), when used by themselves, result in reduced model quality (measured as accuracy in a classification setting). The SlowMo algorithm alleviates this issue by performing a slow momentum step, typically every 48 iterations.
The training process with SlowMo looks as follows:
Compute the forward pass.
Compute the backward pass.
During the backward pass, using a backward hook, on each node, the gradients are synchronized using allreduce across the different GPUs on that node.
Perform the optimizer.step() to update parameters on each node with the gradients of that node.
Approximately average the parameters using a base algorithm - one of LocalSGD or SGP (both are described above).
Perform the slow momentum update step once every slowmo_frequency (typically 48) iterations. In this step, the parameters on different nodes are (exactly) averaged, followed by a slowmo_optimizer.step(). Note that this slowmo_optimizer is different from the original optimizer, and it is done in a Zero-1 like manner to save memory.
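The steps above can be sketched as a single-process simulation on a toy problem. This is only an illustration, not the library's implementation: each simulated worker stands for one node, the intra-node gradient allreduce is omitted, and the slow momentum update rule (a momentum buffer over the displacement since the last slow step, scaled by the base learning rate) is assumed from the SlowMo paper rather than taken from this page. The function name and the `base_lr`, `num_iterations`, and `seed` parameters are hypothetical; `slowmo_lr` is also an assumption here.

```python
import random


def slowmo_simulation(num_workers=4, localsgd_frequency=3, slowmo_frequency=48,
                      slowmo_momentum=0.5, slowmo_lr=1.0, base_lr=0.05,
                      num_iterations=1440, seed=0):
    """Simulate Local SGD + slow momentum on a 1-D toy problem in one process."""
    rng = random.Random(seed)
    # Worker i minimizes 0.5 * (x - targets[i])**2, so the global optimum
    # is the mean of the targets.
    targets = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    params = [5.0] * num_workers   # all workers start from the same point
    anchor = params[0]             # parameters right after the last slow step
    slow_buf = 0.0                 # slow momentum buffer

    for it in range(1, num_iterations + 1):
        # Forward/backward and optimizer.step(): one noisy SGD step per worker.
        for i in range(num_workers):
            grad = params[i] - targets[i] + rng.gauss(0.0, 0.1)
            params[i] -= base_lr * grad

        # Base algorithm (Local SGD): average parameters every few iterations.
        if it % localsgd_frequency == 0:
            mean = sum(params) / num_workers
            params = [mean] * num_workers

        # Slow momentum step every slowmo_frequency iterations.
        if it % slowmo_frequency == 0:
            mean = sum(params) / num_workers  # exact average across workers
            # Assumed update rule: momentum over the displacement since the anchor.
            slow_buf = slowmo_momentum * slow_buf + (anchor - mean) / base_lr
            anchor = anchor - slowmo_lr * base_lr * slow_buf
            params = [anchor] * num_workers

    optimum = sum(targets) / num_workers
    return anchor, optimum


final_params, optimum = slowmo_simulation()
print(f"distance from optimum: {abs(final_params - optimum):.4f}")
```

Setting `slowmo_momentum=0.0` in this sketch reduces the slow step to a plain exact average every `slowmo_frequency` iterations, which makes it easy to compare against the momentum variant.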
Best practices for using SlowMoDistributedDataParallel
SlowMo is useful in deep learning workloads that run on more than 2 nodes in clusters with a slow interconnect, e.g. Ethernet.
SlowMo should be useful in your workload if the following condition holds:
\(\textrm{time_taken_for_all_reduce_of_gradients} \times (1 - \frac{1}{\textrm{localsgd_frequency}} ) > \textrm{time_taken_for_backward_pass}\)
Notes:
In case you are using SGP as the base algorithm, the value of localsgd_frequency can be plugged in as 2.
The formula above is a simplified version of: \(\textrm{time_taken_for_all_reduce_of_gradients} > \textrm{time_taken_for_backward_pass} + \frac{\textrm{time_taken_for_all_reduce_of_gradients}}{\textrm{localsgd_frequency}}\) The left and right hand sides denote the total backward duration (combining the computation of gradients in the backward pass and the communication cost) for DDP and SlowMo DDP, respectively. Since DDP overlaps the computation of gradients with their communication, it is bottlenecked by the latter. In contrast, there is an extra time_taken_for_backward_pass on the right hand side because we do not overlap the backward pass with communication in the current implementation of SlowMo.
In clusters with slower interconnect, time_taken_for_all_reduce_of_gradients will go up, leading to SlowMo being more useful. localsgd_frequency is also an important factor here. More details on varying that to affect performance are in tip 2 of Performance tips for SlowMoDistributedDataParallel.
slowmo_momentum will need to be tuned to obtain good model quality. A grid search across {0.0, 0.1, 0.2, 0.4, 0.6} should be good enough for tuning. The best slowmo_momentum value stays consistent across multiple runs with similar settings. When the number of nodes is increased, however, a higher value of slowmo_momentum will likely be needed. More details about this can be found in the documentation.
Adding SlowMo to existing Distributed Data Parallel code involves two steps, which can be found in the tutorial.
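The condition on the all-reduce and backward-pass times given earlier in this section can be checked with a few lines of Python once both durations have been measured. This is a minimal sketch: the function name and the `use_sgp` flag are hypothetical, and the timings in the usage lines are made-up illustrative numbers, not measurements.

```python
def slowmo_likely_faster(allreduce_time, backward_time,
                         localsgd_frequency=3, use_sgp=False):
    """Evaluate the rule-of-thumb condition; both times must use the same unit."""
    if use_sgp:
        # Per the note above, plug in 2 when SGP is the base algorithm.
        localsgd_frequency = 2
    if localsgd_frequency < 1:
        raise ValueError("localsgd_frequency must be at least 1")
    return allreduce_time * (1 - 1 / localsgd_frequency) > backward_time


# Hypothetical timings in milliseconds.
print(slowmo_likely_faster(allreduce_time=300.0, backward_time=150.0))
print(slowmo_likely_faster(allreduce_time=100.0, backward_time=150.0))
```

With `localsgd_frequency=1` the left hand side is zero, so the condition can never hold; this matches the intuition that averaging every iteration saves no communication.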
Performance tips for SlowMoDistributedDataParallel
nprocs_per_node should be set to the number of GPUs on a node (this number should be the same on each node). This allows the API to exploit the fast interconnect between different GPUs on a node.
Increasing the localsgd_frequency results in an increase in speed. However, it comes with a tradeoff of reducing the model quality. We recommend keeping the localsgd_frequency at 3.
slowmo_memory_efficient should typically be used (this is the default behavior). It reduces memory usage by sharding the additional slow momentum optimizer’s parameters in a Zero-1 like manner.
A call to model.zero_grad(set_to_none=True) should be made after optimizer.step() in order to save memory for the model.perform_slowmo() step. More details about this can be found in the documentation for perform_slowmo().