Weight Initialization In Neural Networks
Notes based on Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
What not to do
Zero Initialization
All weights are set to zero, creating what is known as a dead neuron: the information reaching each neuron is zero, no matter the input xi.
Input fed to a neuron = xi * wi
Since every wi = 0,
xi * wi = 0 for any xi
Because of this, the gradient ∇ computed during backpropagation is also zero (see the backpropagation weight update formula), so the weights never change.
The network fails to learn anything from the input data, or in other words, fails to map the relationship between the input and the output.
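A minimal sketch of this failure mode, using NumPy and an assumed tanh hidden layer (not code from the book): with zero weights the hidden activations are zero for every input, so the input has no effect on the output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # 4 samples, 3 features
W1 = np.zeros((3, 5))         # zero-initialized hidden layer
h = np.tanh(x @ W1)           # hidden activations

print(h.sum())                # 0.0 -- all zeros, regardless of x
```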
Symmetrical/Constant Initialization
All weights are assigned the same constant value. This is a bad idea: every neuron in a layer computes the same output and receives the same gradient, so the neurons remain identical copies of one another. Although the outputs are no longer zero, the neurons never learn anything distinct.
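A similar sketch for constant initialization (layer sizes and the constant are assumptions, not from the book): every hidden unit computes exactly the same value, so gradient descent updates them identically and they never differentiate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = np.full((3, 5), 0.5)         # every weight set to the same constant
h = np.tanh(x @ W)

print(np.allclose(h, h[:, :1]))  # True -- all 5 hidden units are identical copies
```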
Random Initialization
Randomly select values from a Gaussian distribution with mean µ and standard deviation σ:
N(µ, σ)
The question is what the standard deviation σ of these random weights should be. If σ is very small, this is close to symmetric initialization; if σ is large, we move towards the exploding and vanishing gradient problems.
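A rough sketch of both failure modes (the setup is an assumption, not from the book: a 10-layer tanh network with 256 units per layer), tracking the spread of the activations as depth grows:

```python
import numpy as np

def activation_std(sigma, n=256, depth=10, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(64, n))                 # a batch of 64 random inputs
    for _ in range(depth):
        W = rng.normal(0.0, sigma, size=(n, n))  # weights ~ N(0, sigma)
        h = np.tanh(h @ W)
    return h.std()

print(activation_std(0.01))   # tiny sigma: activations shrink towards 0 (vanishing)
print(activation_std(1.0))    # large sigma: activations saturate near +/-1
```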
Solutions
Choose the variance of the weights so that the variance of the signal is preserved from layer to layer.
LeCun Initialization
Weights are scaled so that the variance of a neuron's output closely matches the variance of its inputs.
The output of a neuron with a linear activation function is given by
y = w1x1 + w2x2 + … + wnxn + b
var(y) = var(w1x1 + w2x2 + … + wnxn + b)
As the bias parameter is a constant, it has zero variance, so we drop the bias term.
Assuming the weights and inputs are independent and zero-mean,
var(y) = var(w1)var(x1) + var(w2)var(x2) + … + var(wn)var(xn)
As the weights are i.i.d. (independent and identically distributed) and the inputs share the same variance,
var(y) = N * var(w) * var(x)
where N is the dimension of the input vector. Since our goal is to match the variance of the output to that of the input, we need var(w) = 1/N.
LeCun's initialization therefore draws weights from a Gaussian distribution with mean 0 and standard deviation 1/√N.
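A minimal sketch of this rule (the helper name and layer sizes are assumptions, not from the book):

```python
import numpy as np

def lecun_init(fan_in, fan_out, seed=0):
    """Draw weights from N(0, 1/fan_in), i.e. std = 1/sqrt(fan_in)."""
    rng = np.random.default_rng(seed)
    std = 1.0 / np.sqrt(fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = lecun_init(256, 128)
print(W.std())          # close to 1/sqrt(256) = 0.0625
```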
Xavier Glorot Initialization
For backpropagation to work well, the chosen variance should also account for the backward pass through the network.
Weights are drawn from a Gaussian with zero mean and variance given by the following formula:
var(w) = 2 / (fanin + fanout)
where fanin is the number of inputs coming into the layer and fanout is the number of outputs going to the next layer.
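A minimal sketch of the Glorot rule (the helper name and layer sizes are assumptions):

```python
import numpy as np

def glorot_init(fan_in, fan_out, seed=0):
    """Draw weights from N(0, 2 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = glorot_init(256, 128)
print(W.std())          # close to sqrt(2 / 384) ≈ 0.072
```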
He Initialization
The ReLU activation function is defined as f(x) = max(0, x).
It is not a zero-mean function, which breaks the zero-mean assumption used in the variance derivation above: roughly half of the activations are set to zero.
To account for this, the Xavier/Glorot method is slightly modified: He initialization doubles the variance and draws weights from a Gaussian with mean 0 and standard deviation √(2/fanin).
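A minimal sketch of He initialization and its effect on a single ReLU layer (the helper name, layer sizes, and input distribution are assumptions):

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    """Draw weights from N(0, 2 / fan_in) to compensate for ReLU zeroing half the units."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 256))               # unit-variance inputs
h = np.maximum(0.0, x @ he_init(256, 128))   # one ReLU layer

print((h ** 2).mean())   # average squared activation stays near 1 instead of halving
```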
Summary
Zero initialization doesn't work, and neither does initializing to some constant. This leads to what's called the symmetry problem.
Random initialization can be used to break the symmetry. But if the weights are too small, the activations lose variance as we go deeper into the network; if the weights are too large, the activations saturate.
LeCun initialization can be used to make sure that the activations keep a healthy variance, but the gradients can still suffer during backpropagation.
Xavier initialization maintains the variance of the signal in both the forward pass and the backward pass (backpropagation).
But Xavier initialization fails for ReLU, so He initialization is used with the ReLU activation function.

