Model sharding using Pipeline Parallel¶

Let us start with a toy model that contains two linear layers.

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5)

    def forward(self, x):
        x = self.relu(self.net1(x))
        return self.net2(x)

model = ToyModel()

To run this model on 2 GPUs we need to convert the model to torch.nn.Sequential and then wrap it with fairscale.nn.Pipe.

import fairscale
import torch
import torch.nn as nn

model = nn.Sequential(
            torch.nn.Linear(10, 10),
            torch.nn.ReLU(),
            torch.nn.Linear(10, 5)
        )

model = fairscale.nn.Pipe(model, balance=[2, 1])

This will run the first two layers on cuda:0 and the last layer on cuda:1. To learn more, visit the Pipe documentation.

You can then define any optimizer and loss function

import torch.optim as optim
import torch.nn.functional as F

optimizer = optim.SGD(model.parameters(), lr=0.001)
loss_fn = F.nll_loss

optimizer.zero_grad()
target = torch.randint(0,2,size=(20,1)).squeeze()
data = torch.randn(20, 10)

Finally, to run the model and compute the loss function, make sure that outputs and target are on the same device.

device = model.devices[0]
## outputs and target need to be on the same device
# forward step
outputs = model(data.to(device))
# compute loss
loss = loss_fn(outputs.to(device), target.to(device))

# backward + optimize
loss.backward()
optimizer.step()