Activation Checkpoint
- class fairscale.nn.checkpoint.checkpoint_wrapper(module: torch.nn.modules.module.Module, offload_to_cpu: bool = False)
A friendlier wrapper for performing activation checkpointing.
Compared to the PyTorch version, this version:
wraps an nn.Module, so that all subsequent calls will use checkpointing
handles keyword arguments in the forward
handles non-Tensor outputs from the forward
supports offloading activations to CPU
Usage:
    checkpointed_module = checkpoint_wrapper(my_module, offload_to_cpu=True)
    a, b = checkpointed_module(x, y=3, z=torch.Tensor([1]))
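For a more complete picture, here is a self-contained sketch; the toy MyModule below is an assumption chosen for illustration and is not part of fairscale:

    import torch
    import torch.nn as nn
    from fairscale.nn.checkpoint import checkpoint_wrapper

    # Hypothetical module whose forward takes keyword arguments and
    # returns a non-Tensor value alongside a Tensor.
    class MyModule(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(4, 4)

        def forward(self, x, y=0, z=None):
            out = self.linear(x) + y
            if z is not None:
                out = out + z
            return out, "extra"  # non-Tensor output, handled by the wrapper

    checkpointed_module = checkpoint_wrapper(MyModule(), offload_to_cpu=True)
    x = torch.randn(2, 4, requires_grad=True)
    a, b = checkpointed_module(x, y=3, z=torch.Tensor([1]))
    a.sum().backward()  # inner activations are recomputed during this pass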
To understand the benefits of checkpointing and the offload_to_cpu flag, divide activations into two types with respect to the checkpointed module: inner activations and outer activations. GPU memory for the inner ones is saved by activation checkpointing (they are recomputed during the backward pass); memory for the outer ones is saved by offload_to_cpu (they are moved to CPU). A sketch illustrating the distinction follows the list below.
In terms of GPU memory savings:
When the inner activations are large and the outer ones are small, checkpointing helps a lot and offload_to_cpu may help a little.
When the inner activations are small and the outer ones are large, checkpointing helps little and offload_to_cpu helps a lot.
When both are large, both help and the benefits are additive.
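To make the distinction concrete, here is a minimal sketch; the three-stage model and layer sizes are assumptions chosen only for illustration:

    import torch
    import torch.nn as nn
    from fairscale.nn.checkpoint import checkpoint_wrapper

    # Only the middle stage is checkpointed. Activations created inside it
    # are the "inner" ones (dropped and recomputed in backward); the tensors
    # flowing into it are the "outer" ones, which offload_to_cpu moves off GPU.
    model = nn.Sequential(
        nn.Linear(1024, 4096),          # stage 1, not checkpointed
        checkpoint_wrapper(
            nn.Sequential(
                nn.Linear(4096, 4096),  # inner activations live here
                nn.ReLU(),
                nn.Linear(4096, 4096),
            ),
            offload_to_cpu=True,        # stage-2 input (an outer activation) goes to CPU
        ),
        nn.Linear(4096, 10),            # stage 3, not checkpointed
    )

    x = torch.randn(32, 1024)
    model(x).sum().backward()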
Note:
    The first and last layers are not likely to benefit from the `offload_to_cpu` flag because (1) there are typically other references to the first layer's input, so its GPU memory won't be freed; and (2) the input to the last layer is used immediately by the backward pass, so offloading it won't result in memory savings.
- Parameters
module (nn.Module) – The module to be wrapped
offload_to_cpu (bool) – Whether to offload activations to CPU.
- Returns
(nn.Module) – Wrapped module