# Implementing Synchronized Multi-GPU Batch Normalization: Doing It Exactly Right

### What is Batch Normalization (BN) and how does it work?

Batch Normalization was introduced in the paper *Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift*, which dramatically speeds up network training (it enables larger learning rates) and makes the network less sensitive to weight initialization. The idea is to perform the normalization within the mini-batch. The training mode:

Forward Pass: For the input data $X=\{x_1, ...x_N\}$, the data are normalized to be zero-mean and unit-variance, then scaled and shifted:

$$y_i=\gamma\cdot\frac{x_i-\mu}{\sigma}+\beta,$$

where $\mu=\frac{\sum_i^N x_i}{N} , \sigma = \sqrt{\frac{\sum_i^N (x_i-\mu)^2}{N}+\epsilon}$ and $\gamma, \beta$ are the learnable scale and shift parameters.
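The forward pass above can be sketched in a few lines of NumPy (the function name `bn_forward` is my own; this is an illustrative single-feature-axis sketch, not a framework implementation):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    # Normalize over the batch axis to zero mean and unit variance,
    # then apply the learnable scale (gamma) and shift (beta).
    mu = x.mean(axis=0)
    sigma = np.sqrt(((x - mu) ** 2).mean(axis=0) + eps)
    x_hat = (x - mu) / sigma
    y = gamma * x_hat + beta
    # Cache what the backward pass will need.
    return y, (x_hat, mu, sigma)
```

With `gamma=1` and `beta=0`, the output of `bn_forward` has (approximately) zero mean and unit variance along the batch axis regardless of the input distribution.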

Backward Pass:

We need to consider the partial gradients from the output $\frac{d_\ell}{d_{y_i}}$, as well as the gradients through $\frac{d_\ell}{d_\mu}$ and $\frac{d_\ell}{d_\sigma}$, because the mean and variance are functions of the input (we use partial-gradient notation here):

$$\frac{d_\ell}{d_{x_i}}=\frac{d_\ell}{d_{y_i}}\cdot\frac{d_{y_i}}{d_{x_i}}+\frac{d_\ell}{d_\mu}\cdot\frac{d_\mu}{d_{x_i}}+\frac{d_\ell}{d_\sigma}\cdot\frac{d_\sigma}{d_{x_i}},$$

where $\frac{d_{y_i}}{d_{x_i}}=\frac{\gamma}{\sigma}, \frac{d_\ell}{d_\mu}=-\frac{\gamma}{\sigma}\sum_i^N\frac{d_\ell}{d_{y_i}}, \frac{d_\ell}{d_\sigma}=-\frac{\gamma}{\sigma^2}\sum_i^N\frac{d_\ell}{d_{y_i}}(x_i-\mu), \frac{d_\mu}{d_{x_i}}=\frac{1}{N} \text{ and } \frac{d_\sigma}{d_{x_i}}=\frac{x_i-\mu}{N\sigma}$.
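Putting the three chain-rule terms together gives a direct NumPy sketch of the backward pass (again, `bn_backward` is my own name; in a real layer the statistics would come from the forward-pass cache rather than being recomputed):

```python
import numpy as np

def bn_backward(dy, x, gamma, eps=1e-5):
    # Recompute the batch statistics (in practice these are cached
    # from the forward pass).
    N = x.shape[0]
    mu = x.mean(axis=0)
    sigma = np.sqrt(((x - mu) ** 2).mean(axis=0) + eps)
    x_hat = (x - mu) / sigma
    # Gradients flowing through the mean and the standard deviation.
    dmu = -(gamma / sigma) * dy.sum(axis=0)                # dl/dmu
    dsigma = -(gamma / sigma) * (dy * x_hat).sum(axis=0)   # dl/dsigma
    # Chain rule: dl/dx = dl/dy*dy/dx + dl/dmu*dmu/dx + dl/dsigma*dsigma/dx
    dx = dy * gamma / sigma + dmu / N + dsigma * x_hat / N
    # Gradients of the learnable parameters.
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    return dx, dgamma, dbeta
```

A quick sanity check is to compare `dx` against a central finite-difference estimate of the loss gradient; the two agree to numerical precision.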

### Why synchronize the BN layer?

In popular deep learning frameworks (Caffe, Torch, TensorFlow, PyTorch, etc.), the implementation of Batch Normalization calculates the mean and variance within each single GPU. This is because the synchronized version is slower due to cross-GPU communication, and synchronizing BN brings almost no benefit for most vision tasks. However, we have to synchronize BN for some particular tasks such as semantic segmentation: per-pixel prediction is very memory-hungry, so the mini-batch size that fits on a single GPU is too small for reliable BN statistics. Therefore, we discuss the synchronized implementation here.

### Synchronized Batch Normalization implementation

• For forward pass:

The mean and variance need to be calculated across all the GPUs. We cannot simply calculate the variance on each GPU individually and then average the results, which would produce the wrong value (the global variance also depends on how the per-GPU means differ from each other). The easiest way is summing up the values $\sum_i x_i$ and the squares $\sum_i x_i^2$ across GPUs (using reduce sum), then calculating the global mean and variance from those sums.
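The reduce-sum trick can be sketched in a single process with NumPy shards standing in for per-GPU tensors (`sync_bn_stats` is a hypothetical name; real code would use a cross-GPU reduce-sum instead of a Python `sum`):

```python
import numpy as np

def sync_bn_stats(shards, eps=1e-5):
    # Each "device" contributes its local sum(x) and sum(x^2); the
    # reduce-sum of these two quantities is enough to recover the
    # global mean and variance: var = E[x^2] - (E[x])^2.
    total = sum(s.sum(axis=0) for s in shards)            # reduce-sum of x
    total_sq = sum((s ** 2).sum(axis=0) for s in shards)  # reduce-sum of x^2
    N = sum(s.shape[0] for s in shards)
    mu = total / N
    var = total_sq / N - mu ** 2
    sigma = np.sqrt(var + eps)
    return mu, sigma
```

The statistics computed this way match those of the concatenated global batch exactly, which is what naively averaging per-shard variances fails to do.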

• Considering the backward function:

The first term can be calculated locally within each GPU. For the second and third terms of the backward function, we need to calculate the gradients with respect to the mean and variance across all the GPUs, then continue the backward pass.
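A NumPy sketch of this split, again with Python lists standing in for per-GPU tensors and Python `sum` standing in for the cross-GPU reduce-sum (`sync_bn_backward` is a hypothetical name, not a framework API):

```python
import numpy as np

def sync_bn_backward(dy_shards, x_shards, gamma, mu, sigma):
    # The two reductions below are the second and third terms of the
    # backward function; in a real implementation they are reduce-sums
    # across GPUs rather than Python sums over shards.
    N = sum(s.shape[0] for s in x_shards)
    sum_dy = sum(d.sum(axis=0) for d in dy_shards)
    sum_dy_xhat = sum((d * (s - mu) / sigma).sum(axis=0)
                      for d, s in zip(dy_shards, x_shards))
    dmu = -(gamma / sigma) * sum_dy
    dsigma = -(gamma / sigma) * sum_dy_xhat
    # The first term (dy * gamma / sigma) stays local to each device;
    # the globally reduced mean/variance terms are added to it.
    return [d * gamma / sigma + dmu / N + dsigma * ((s - mu) / sigma) / N
            for d, s in zip(dy_shards, x_shards)]
```

Concatenating the per-shard gradients reproduces exactly what a single-device BN backward would compute on the full global batch.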

• Synchronized DataParallel:

The standard DataParallel pipeline of public frameworks (MXNet, PyTorch…) in each training iteration:

• duplicate the network (weights) to all the GPUs,
• split the training batch to each GPU,
• forward and backward to calculate gradient,
• update the network parameters (weights), then go to the next iteration.

Therefore, communication across different GPUs is not supported. To address this problem, we introduce a SelfDataParallel mode, which enables each layer to accept multi-GPU inputs directly. Those self-parallel layers are provided in encoding.nn.
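The four standard steps can be illustrated with a toy single-process sketch (pure Python replicas stand in for GPUs, and `TinyLinear`/`data_parallel_step` are illustrative names, not framework APIs):

```python
import copy
import numpy as np

class TinyLinear:
    """A one-parameter 'network': y = w * x, with L = sum(y)."""
    def __init__(self, w):
        self.w = np.array(w, dtype=float)
    def forward(self, x):
        self.x = x
        return x * self.w
    def backward(self, dy):
        # dL/dw for y = w * x is sum(dy * x).
        return (dy * self.x).sum()

def data_parallel_step(net, batch, n_dev=2, lr=0.1):
    replicas = [copy.deepcopy(net) for _ in range(n_dev)]  # 1. duplicate weights
    shards = np.array_split(batch, n_dev)                  # 2. split the batch
    grads = []
    for rep, shard in zip(replicas, shards):               # 3. forward + backward
        y = rep.forward(shard)
        grads.append(rep.backward(np.ones_like(y)))
    net.w -= lr * sum(grads)                               # 4. update, next iter
    return net.w
```

Note that each replica only ever sees its own shard; nothing in this loop lets a layer on one "device" talk to its sibling on another, which is exactly why synchronized BN needs a different mode.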

Because BN layers are used many times throughout a network, such a complicated backward graph would mess up the PyTorch autograd engine. To address this problem, we provide an autograd function :class:`encoding.parallel.AllReduce` to handle the cross-GPU gradient calculation.
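The key property that makes an all-reduce autograd-friendly is that it is linear, so its backward pass is the same all-reduce applied to the incoming gradients. A pure-NumPy stand-in illustrates the idea (this mimics the forward/backward shape of an autograd function but is not the actual implementation of encoding.parallel.AllReduce):

```python
import numpy as np

class AllReduce:
    """Toy all-reduce: forward sums per-device tensors and broadcasts
    the result back; backward applies the identical reduction to the
    incoming gradients, keeping the backward graph simple."""
    @staticmethod
    def forward(*inputs):
        total = sum(inputs)                      # reduce-sum across devices
        return [total.copy() for _ in inputs]    # broadcast back to each device
    @staticmethod
    def backward(*grad_outputs):
        # All-reduce is linear, so its Jacobian-transpose is itself.
        total = sum(grad_outputs)
        return [total.copy() for _ in grad_outputs]
```

In the synchronized BN forward pass, the per-device partial sums ($\sum x_i$ and $\sum x_i^2$) go through such a function, and the gradients of the global mean and variance flow back through it automatically.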