Efficient Data Parallel Training with SlowMo Distributed Data Parallel
SlowMo Distributed Data Parallel reduces communication between nodes during data parallel training. It is primarily useful on clusters with low interconnect speeds between nodes. With SlowMo, the model replicas on different nodes are no longer kept in sync after every iteration, which changes the optimization dynamics: the end result is close to that of Distributed Data Parallel, but not exactly the same.
If you have code that is set up to use Distributed Data Parallel, switching to SlowMo Distributed Data Parallel simply requires replacing the DDP call with a call to fairscale.experimental.nn.data_parallel.SlowMoDistributedDataParallel, and adding a model.perform_slowmo(optimizer) call after optimizer.step(), preceded by model.zero_grad(set_to_none=True) in order to reduce peak memory usage. The different points at which use_slowmo is used below help demonstrate these changes:
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

from fairscale.experimental.nn.data_parallel import SlowMoDistributedDataParallel as SlowMoDDP


def train(
    rank: int,
    world_size: int,
    epochs: int,
    use_slowmo: bool):

    # process group init
    dist_init(rank, world_size)

    # Problem statement
    model = MyAwesomeModel().to(rank)
    if use_slowmo:
        # Wrap the model into SlowMoDDP
        model = SlowMoDDP(model, slowmo_momentum=0.5, nprocs_per_node=8)
    else:
        model = DDP(model, device_ids=[rank])

    dataloader = MySuperFastDataloader()
    loss_fn = MyVeryRelevantLoss()
    optimizer = MyAmazingOptimizer()

    # Any relevant training loop, with a line at the very end specific to SlowMoDDP, e.g.:
    model.train()
    for e in range(epochs):
        for (data, target) in dataloader:
            data, target = data.to(rank), target.to(rank)

            # Train
            outputs = model(data)
            loss = loss_fn(outputs, target)
            loss.backward()
            optimizer.step()
            model.zero_grad(set_to_none=use_slowmo)  # free memory for the perform_slowmo() call below
            if use_slowmo:
                model.perform_slowmo(optimizer)
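The train function above expects one process per GPU, each receiving its rank as the first argument. A minimal sketch of launching it on a single node with torch.multiprocessing.spawn, assuming the hypothetical helpers above are defined and all GPUs on the node are visible:

import torch.multiprocessing as mp

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per GPU on this node
    epochs = 10
    # Spawn one training process per GPU; each process receives its rank
    # as the first argument, followed by the entries of args.
    mp.spawn(train, args=(world_size, epochs, True), nprocs=world_size, join=True)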
In the example above, when using SlowMoDDP, we reduce the total communication between nodes by a factor of 3, since the default localsgd_frequency is 3: model parameters are averaged across nodes only once every 3 iterations rather than after every step.
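If your interconnect is particularly slow, communication can be reduced further by raising this frequency when constructing the wrapper. The snippet below is a sketch of the same wrapping call with localsgd_frequency made explicit; the value 6 is only an illustration and trades extra staleness between nodes for less communication:

model = SlowMoDDP(
    model,
    slowmo_momentum=0.5,
    nprocs_per_node=8,
    localsgd_frequency=6,  # default is 3; parameters are averaged across nodes every 6 steps
)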
SlowMoDDP takes slowmo_momentum as a parameter; this parameter may need to be tuned depending on your use case. It also takes nprocs_per_node, which should typically be set to the number of GPUs on a node. Please refer to the SlowMoDistributedDataParallel documentation for more details on these parameters as well as other advanced settings of the SlowMo algorithm.
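Rather than hard-coding nprocs_per_node, one common pattern is to derive it from the number of visible GPUs. A small sketch, assuming one training process per GPU on each node:

nprocs_per_node = torch.cuda.device_count()  # GPUs on this node, one process per GPU
model = SlowMoDDP(
    model,
    slowmo_momentum=0.5,  # starting point; may need tuning for your workload
    nprocs_per_node=nprocs_per_node,
)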