[Google_Bootcamp_Day4]
Activation Functions
- Sigmoid
  - If z is very large or very small, the gradient is close to zero, which slows down gradient descent
  - The main exception for using sigmoid is the output layer of a binary classification problem, where the output needs to be between 0 and 1
- Tanh
  - Almost always works better than the sigmoid function because its output is centered around zero
  - If z is very large or very small, the gradient is close to zero, which slows down gradient descent
- ReLU
  - Usually the default choice of activation function
  - If z < 0 the gradient is zero, but in practice enough hidden units have z >= 0, so learning does not slow down
- Leaky ReLU
  - Like ReLU, but with a small slope (e.g. 0.01) for z < 0, so the gradient is never exactly zero
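
A minimal NumPy sketch of the four activations above (function names are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^-z); saturates (gradient near zero) for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # zero-centered output, which usually helps the next layer learn
    return np.tanh(z)

def relu(z):
    # max(0, z); gradient is 0 for z < 0 and 1 for z > 0
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for z < 0 keeps the gradient from dying
    return np.where(z > 0, z, alpha * z)
```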

Why we need non-linear activation functions
- A composition of linear activation functions is just another linear function
- This means the network can only compute a linear function of its input, and has the same expressive power no matter how many hidden layers or units it has
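
A quick NumPy check of this point (layer sizes and variable names are my own): two stacked linear layers collapse into a single linear layer with weights W2 @ W1 and bias W2 @ b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))                       # a single input with 3 features
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

two_linear_layers = W2 @ (W1 @ x + b1) + b2           # two stacked "linear activation" layers
single_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # one equivalent linear layer

assert np.allclose(two_linear_layers, single_linear_layer)
```
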
Derivatives of activation functions
- Sigmoid: g'(z) = g(z)(1 - g(z))
- Tanh: g'(z) = 1 - (tanh(z))^2
- ReLU: g'(z) = 0 if z < 0, 1 if z > 0 (either value works at z = 0)
- Leaky ReLU: g'(z) = 0.01 if z < 0, 1 if z > 0
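
The same derivatives as a NumPy sketch (function names are my own):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                 # g(z)(1 - g(z))

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2         # 1 - tanh(z)^2

def relu_prime(z):
    return (z > 0).astype(float)         # 0 for z < 0, 1 for z > 0

def leaky_relu_prime(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)   # alpha for z < 0, 1 for z > 0
```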

Gradient descent for neural networks
- Parameters: W[1], b[1], W[2], b[2]; cost function J = (1/m) * sum of the losses L(yhat, y)
- Repeat: compute predictions, compute the gradients dW[l], db[l] with backpropagation, then update W[l] := W[l] - alpha * dW[l] and b[l] := b[l] - alpha * db[l]

Formulas for computing derivatives
- For a two-layer network with a sigmoid output, vectorized over m examples (implemented in the sketch under Summary of gradient descent below):
- dZ[2] = A[2] - Y
- dW[2] = (1/m) * dZ[2] A[1].T
- db[2] = (1/m) * np.sum(dZ[2], axis=1, keepdims=True)
- dZ[1] = W[2].T dZ[2] * g[1]'(Z[1])  (elementwise product)
- dW[1] = (1/m) * dZ[1] X.T
- db[1] = (1/m) * np.sum(dZ[1], axis=1, keepdims=True)

Backpropagation review
- Logistic regression: backpropagate through a single step (z = w.T x + b, a = sigmoid(z)), which gives dz = a - y, dw = x * dz, db = dz
- Neural network: the same chain rule applied twice, propagating the gradient back through layer 2 and then layer 1
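
As an illustration of the logistic regression case, a vectorized sketch of the gradients over m examples (shape conventions assumed: X is (n_x, m), Y is (1, m), w is (n_x, 1)):

```python
import numpy as np

def logreg_gradients(X, Y, w, b):
    # Forward pass: z = w.T X + b, a = sigmoid(z)
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    m = X.shape[1]
    # Backward pass: dz = a - y, gradients averaged over the m examples
    dZ = A - Y
    dw = (X @ dZ.T) / m
    db = np.sum(dZ) / m
    return dw, db
```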

Summary of gradient descent
- Vectorized implementation of forward propagation, backpropagation, and the parameter update (see the sketch below)
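
Putting the formulas together, a sketch of one vectorized gradient-descent step for a 2-layer network with a tanh hidden layer and sigmoid output (the parameter dictionary layout and learning rate are my own assumptions):

```python
import numpy as np

def gradient_descent_step(X, Y, params, lr=0.01):
    """One vectorized forward/backward pass and update. X: (n_x, m), Y: (1, m)."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]

    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))        # sigmoid output

    # Backward propagation (formulas above)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # tanh'(Z1) = 1 - A1^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    params["W1"] = W1 - lr * dW1
    params["b1"] = b1 - lr * db1
    params["W2"] = W2 - lr * dW2
    params["b2"] = b2 - lr * db2
    return params
```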

Random Initialization
- For logistic regression, it was okay to initialize the weights to zero
- But for a neural network, if we initialize all the weight parameters to zero and then apply gradient descent, it won't work
- Initializing the bias terms to zero is fine
- Problem of initializing to zeros: all hidden units in a layer compute the same function and receive identical gradient updates, so they stay symmetric no matter how long we train

- Random initialization: set the weights to small random values (e.g. multiply np.random.randn by 0.01) so that z starts out where the tanh/sigmoid gradients are not close to zero (see the sketch below)
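
A minimal sketch of this initialization for a 2-layer network (layer sizes n_x, n_h, n_y are placeholders):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    # Small random weights break the symmetry between hidden units;
    # the 0.01 scale keeps z small so tanh/sigmoid do not start out saturated.
    W1 = np.random.randn(n_h, n_x) * scale
    b1 = np.zeros((n_h, 1))               # zeros are fine for biases
    W2 = np.random.randn(n_y, n_h) * scale
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```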

[Source] https://www.coursera.org/learn/neural-networks-deep-learning