Weight Initialization In Neural Networks
Notes based on Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
What not to do
Zero Initialization
All weights are set to zero, creating what is known as a dead neuron: the information reaching each neuron is zero, no matter the input xi.
Input fed to a neuron = xi * wi
Since every wi = 0,
xi * wi = 0 for any xi
Because of this, the gradient ∇ computed during backpropagation is also zero (see the backpropagation weight update formula), so the weights never change.
The network fails to learn anything from the input data, or in other words, fails to map the relationship between the input and the output.
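A minimal sketch of this failure mode, using NumPy and an assumed tanh hidden layer (not code from the book): with zero weights the hidden activations are zero for every input, so the input has no effect on the output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # 4 samples, 3 features
W1 = np.zeros((3, 5))         # zero-initialized hidden layer
h = np.tanh(x @ W1)           # hidden activations

print(h.sum())                # 0.0 -- all zeros, regardless of x
```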
Symmetrical/Constant Initialization
All weights are assigned the same constant value. This is a bad idea: every neuron in a layer computes the same output and receives the same gradient, so the neurons remain identical copies of one another. Although the outputs are no longer zero, the neurons never learn anything distinct.
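A similar sketch for constant initialization (layer sizes and the constant are assumptions, not from the book): every hidden unit computes exactly the same value, so gradient descent updates them identically and they never differentiate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = np.full((3, 5), 0.5)         # every weight set to the same constant
h = np.tanh(x @ W)

print(np.allclose(h, h[:, :1]))  # True -- all 5 hidden units are identical copies
```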
Random Initialization
Randomly select values from a Gaussian distribution with mean µ and standard deviation σ:
N(µ, σ)
The question is what the standard deviation σ of these random weights should be. If σ is very small, this is close to symmetric initialization; if σ is large, we move towards the exploding and vanishing gradient problems.
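A rough sketch of both failure modes (the setup is an assumption, not from the book: a 10-layer tanh network with 256 units per layer), tracking the spread of the activations as depth grows:

```python
import numpy as np

def activation_std(sigma, n=256, depth=10, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(64, n))                 # a batch of 64 random inputs
    for _ in range(depth):
        W = rng.normal(0.0, sigma, size=(n, n))  # weights ~ N(0, sigma)
        h = np.tanh(h @ W)
    return h.std()

print(activation_std(0.01))   # tiny sigma: activations shrink towards 0 (vanishing)
print(activation_std(1.0))    # large sigma: activations saturate near +/-1
```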
Solutions
Choose the variance of the weights so that the variance of the signal is preserved from layer to layer.
LeCun Initialization
Weights are scaled so that the variance of a neuron's output closely matches the variance of its inputs.
The output of a neuron with a linear activation function is given by
y = w1x1 + w2x2 + … + wnxn + b
var(y) = var(w1x1 + w2x2 + … + wnxn + b)
As the bias parameter is a constant, it has zero variance, so we drop the bias term.
Assuming the weights and inputs are independent and zero-mean,
var(y) = var(w1)var(x1) + var(w2)var(x2) + … + var(wn)var(xn)
As the weights are i.i.d. (independent and identically distributed) and the inputs share the same variance,
var(y) = N * var(w) * var(x)
where N is the dimension of the input vector. Since our goal is to match the variance of the output to that of the input, we need var(w) = 1/N.
LeCun's initialization therefore draws weights from a Gaussian distribution with mean 0 and standard deviation 1/√N.
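A minimal sketch of this rule (the helper name and layer sizes are assumptions, not from the book):

```python
import numpy as np

def lecun_init(fan_in, fan_out, seed=0):
    """Draw weights from N(0, 1/fan_in), i.e. std = 1/sqrt(fan_in)."""
    rng = np.random.default_rng(seed)
    std = 1.0 / np.sqrt(fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = lecun_init(256, 128)
print(W.std())          # close to 1/sqrt(256) = 0.0625
```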
Xavier Glorot Initialization
For backpropagation to work well, the chosen variance should also account for the backward pass through the network.
Weights are drawn from a Gaussian with zero mean and variance given by the following formula:
var(w) = 2 / (fanin + fanout)
where fanin is the number of inputs coming into the layer and fanout is the number of outputs going to the next layer.
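A minimal sketch of the Glorot rule (the helper name and layer sizes are assumptions):

```python
import numpy as np

def glorot_init(fan_in, fan_out, seed=0):
    """Draw weights from N(0, 2 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = glorot_init(256, 128)
print(W.std())          # close to sqrt(2 / 384) ≈ 0.072
```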
He Initialization
The ReLU activation function is defined as f(x) = max(0, x).
It is not a zero-mean function, which breaks the zero-mean assumption used in the variance derivation above: roughly half of the activations are set to zero.
To account for this, the Xavier/Glorot method is slightly modified: He initialization doubles the variance and draws weights from a Gaussian with mean 0 and standard deviation √(2/fanin).
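A minimal sketch of He initialization and its effect on a single ReLU layer (the helper name, layer sizes, and input distribution are assumptions):

```python
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    """Draw weights from N(0, 2 / fan_in) to compensate for ReLU zeroing half the units."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 256))               # unit-variance inputs
h = np.maximum(0.0, x @ he_init(256, 128))   # one ReLU layer

print((h ** 2).mean())   # average squared activation stays near 1 instead of halving
```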
Summary
Zero initialization doesn't work, and neither does initializing to some constant. This leads to what's called the symmetry problem.
Random initialization can be used to break the symmetry. But if the weights are too small, the activations lose variance as we go deeper into the network; if the weights are too large, the activations saturate.
LeCun initialization can be used to make sure that the activations keep a healthy variance, but the gradients can still suffer during backpropagation.
Xavier initialization maintains the variance of the signal in both the forward pass and the backward pass (backpropagation).
But Xavier initialization fails for ReLU, so He initialization is used with the ReLU activation function.

