[Google_Bootcamp_Day4]
Activation Functions
- Sigmoid
- If z is very large or very small, the gradient approaches zero, which slows down gradient descent
- The one case where sigmoid is still the right choice is the output layer of a binary classification problem
- Tanh
- Almost always works better than the sigmoid function because its output is centered around zero
- Like sigmoid, if z is very large or very small the gradient approaches zero, which slows down gradient descent
- ReLU
- Usually the default choice of activation function
- For z < 0 the gradient is zero, but in practice enough hidden units have z >= 0, so learning does not slow down much
- Leaky ReLU
- Like ReLU, but keeps a small slope for z < 0 so the gradient is never exactly zero (see the sketch after this list)
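Below is a minimal NumPy sketch of these four activations (my own code, not from the course); the 0.01 slope used for leaky ReLU is just a common default, not a value fixed in the notes.

```python
import numpy as np

def sigmoid(z):
    # Squashes z into (0, 1); saturates (gradient ~ 0) for large |z|.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes z into (-1, 1); zero-centered, but still saturates for large |z|.
    return np.tanh(z)

def relu(z):
    # max(0, z): no saturation for z > 0, gradient exactly 0 for z < 0.
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but keeps a small slope for z < 0 so the gradient never dies.
    return np.where(z > 0, z, slope * z)
```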
Why we need non-linear activation functions
- A composition of linear activation functions is itself just a linear function
- This means the network can only represent a linear function of the input, no matter how many hidden layers or units it has (a short check follows below)
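A quick NumPy check of this point, using made-up weights and shapes: two stacked linear layers collapse into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two "linear activation" layers...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...equal one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

assert np.allclose(two_layers, one_layer)
```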
Derivatives of activation functions
- Sigmoid
- Tanh
- ReLU
- Leaky ReLU (the derivatives of all four are sketched below)
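A minimal sketch of these derivatives, again in NumPy; each g'(z) is written in terms of the activation output a = g(z) where that simplifies the expression.

```python
import numpy as np

def sigmoid_grad(z):
    a = 1.0 / (1.0 + np.exp(-z))
    return a * (1 - a)                   # g'(z) = a(1 - a)

def tanh_grad(z):
    a = np.tanh(z)
    return 1 - a ** 2                    # g'(z) = 1 - a^2

def relu_grad(z):
    return (z > 0).astype(float)         # 1 if z > 0, else 0

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)   # 1 if z > 0, else the small slope
```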
Gradient descent for neural networks
Formulas for computing derivatives
Backpropagation review
- Logistic regression
- Neural network
Summary of gradient descent
- Vectorized implementation (a NumPy sketch of one vectorized step follows below)
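A sketch of one vectorized gradient descent step for a network with one tanh hidden layer and a sigmoid output, assuming X has shape (n_x, m) and Y has shape (1, m); the parameter names W1, b1, W2, b2 and the learning rate lr are illustrative, not taken from the course code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, Y, W1, b1, W2, b2, lr=0.01):
    m = X.shape[1]

    # Forward propagation (vectorized over all m examples)
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # Backward propagation (binary cross-entropy loss)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Parameter update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```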
Random Initialization
- For logistic regression, it was okay to initialize the weights to zero
- But for a neural network, if we initialize all the weights to zero and then apply gradient descent, it won't work
- Initializing the bias terms to zero is fine
- Problem of initialization to zeros: every hidden unit computes the same function and receives the same update, so the units stay identical (symmetry is never broken)
- Random initialization: initialize the weights to small random values to break symmetry, keeping z small so tanh/sigmoid do not start out saturated (see the sketch below)
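A minimal sketch of this initialization scheme for a one-hidden-layer network; the 0.01 scale and the layer-size arguments are illustrative, the point is simply small random weights plus zero biases.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=1):
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * 0.01  # small random values break symmetry
    b1 = np.zeros((n_h, 1))                      # bias can safely start at zero
    W2 = rng.standard_normal((n_y, n_h)) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```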
[Source] https://www.coursera.org/learn/neural-networks-deep-learning