Pipe

class fairscale.nn.Pipe(module: torch.nn.modules.container.Sequential, balance: Optional[Iterable[int]] = None, *, devices: Optional[Union[Iterable[Union[torch.device, int, str]], List[Union[torch.device, int, str]]]] = None, chunks: int = 1, checkpoint: str = 'except_last', deferred_batch_norm: bool = False)

Wraps an arbitrary nn.Sequential module so it can be trained with pipeline parallelism. If the module requires lots of memory, Pipe can be very efficient, since partitioning and checkpointing reduce the peak memory needed on each device.

import torch.nn as nn
from fairscale.nn import Pipe

model = nn.Sequential(a, b, c, d)  # a, b, c, d are arbitrary nn.Module layers
model = Pipe(model, balance=[1, 1, 1, 1], chunks=8)
output = model(input)

Pipe combines pipeline parallelism with checkpointing to reduce peak memory required to train while minimizing device under-utilization.

You should determine the balance when defining a Pipe module, because balancing is not done automatically. The module is partitioned across multiple devices according to the given balance; you may rely on heuristics to find your own optimal configuration.

Parameters
  • module (torch.nn.Sequential) – sequential module to be parallelized

  • balance (list of int) – list of the number of layers in each partition

Keyword Arguments
  • devices (iterable of devices) – devices to use (default: all CUDA devices)

  • chunks (int) – number of micro-batches (default: 1)

  • checkpoint (str) – when to enable checkpointing, one of 'always', 'except_last', or 'never' (default: 'except_last')

  • deferred_batch_norm (bool) – whether to use deferred BatchNorm moving statistics (default: False, see Deferred Batch Normalization for more details)

Raises
  • TypeError – the module is not a nn.Sequential instance

  • ValueError – invalid arguments, or wrong balance

  • IndexError – the number of devices is fewer than the number of partitions
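
Putting the constructor arguments together, here is a minimal sketch (the layers, sizes, and device ids are placeholders chosen for illustration, not part of the fairscale API; it assumes two CUDA devices are available):

import torch.nn as nn
from fairscale.nn import Pipe

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model = Pipe(
    model,
    balance=[2, 1],                # two layers on the first partition, one on the second
    devices=['cuda:0', 'cuda:1'],  # one device per partition (default: all CUDA devices)
    chunks=4,                      # split each mini-batch into 4 micro-batches
    checkpoint='always',           # checkpoint every micro-batch, not just all but the last
    deferred_batch_norm=False,     # keep ordinary BatchNorm statistics
)
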
chunks: int = 1

The number of micro-batches.

checkpoint: str = 'except_last'

The checkpoint mode to determine when to enable checkpointing. It is one of 'always', 'except_last', or 'never'.

balance: List[int] = []

The number of layers in each partition.

devices: List[torch.device] = []

The devices mapped to each partition.

devices[-1] refers to the device of the last partition, which is the output device. You will typically need it to move the target to the output device so that the loss can be computed without a device-mismatch RuntimeError. For example:

import torch.nn.functional as F

out_device = pipe.devices[-1]

for input, target in loader:
    # move the target to the output device before computing the loss
    target = target.to(out_device, non_blocking=True)
    output = pipe(input)
    loss = F.cross_entropy(output, target)
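
From here the backward pass proceeds as in any PyTorch training loop; a minimal sketch of the rest of the loop body, assuming an optimizer constructed over the pipe's parameters (not shown above):

    optimizer.zero_grad()
    loss.backward()   # gradients propagate back through all pipeline partitions
    optimizer.step()
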
__len__() → int

Returns the number of layers in the underlying sequential module.

__getitem__(index: int) → torch.nn.modules.module.Module

Gets a layer in the underlying sequential module.

__iter__() → Iterable[torch.nn.modules.module.Module]

Iterates over children of the underlying sequential module.
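
These container protocols make a Pipe inspectable like the nn.Sequential it wraps; a minimal sketch, assuming pipe is the Pipe instance built earlier:

n_layers = len(pipe)   # number of layers across all partitions
first = pipe[0]        # first layer of the underlying sequential module

for layer in pipe:     # iterate over every wrapped layer
    print(type(layer).__name__)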

cuda(device: Optional[Union[torch.device, int, str]] = None) → fairscale.nn.pipe.pipe.Pipe
cpu() → fairscale.nn.pipe.pipe.Pipe
to(*args: Any, **kwargs: Any) → fairscale.nn.pipe.pipe.Pipe

Deny these usages:

  • to(device[, dtype, non_blocking])

  • to(tensor[, non_blocking])

But allow this:

  • to(dtype[, non_blocking])
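
A minimal sketch of the distinction, assuming pipe is a Pipe instance. Pipe manages device placement itself, so whole-module moves are rejected; treat the exact exception type below as an assumption:

import torch

pipe = pipe.to(torch.float16)        # allowed: dtype-only conversion
try:
    pipe.to(torch.device('cuda:0'))  # denied: moving the whole pipeline
except TypeError:                    # assumption: the denial is raised as a TypeError
    pass                             # each partition stays on its own device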

forward(input: Union[torch.Tensor, Tuple[torch.Tensor, ...]]) → Union[torch.Tensor, Tuple[torch.Tensor, ...]]

Pipe is a fairly transparent module wrapper. It doesn't modify the input and output signature of the underlying module, but there is a type restriction: the input and output must be a Tensor or a tuple of tensors. This restriction applies at partition boundaries too.

Parameters
  • input (torch.Tensor or Tuple[torch.Tensor, ...]) – input mini-batch

Returns
  • tensor or tensors – output mini-batch

Raises
  • TypeError – input is not a tensor or a tuple of tensors
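
A minimal usage sketch, assuming the two-partition pipe built in the constructor example above (the batch shape is a placeholder):

import torch

x = torch.randn(32, 128, device=pipe.devices[0])  # input on the first partition's device
y = pipe(x)                                       # output lands on pipe.devices[-1]

# a tuple of tensors is accepted as well, e.g. pipe((x1, x2)),
# provided the wrapped module expects one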