[Google_Bootcamp_Day4]
Activation Functions
- Sigmoid
  - If z is very large or very small, the gradient is close to zero, which slows down gradient descent
  - The main exception for using sigmoid is the output layer of a binary classification problem, where the output needs to be between 0 and 1
- Tanh
  - Almost always works better than the sigmoid function because its output is centered around zero
  - If z is very large or very small, the gradient is close to zero, which slows down gradient descent
- ReLU
  - Usually the default choice of activation function
  - If z < 0 the gradient is zero, but in practice enough hidden units have z >= 0, so learning does not slow down
- Leaky ReLU
  - Like ReLU, but with a small slope (e.g. 0.01) for z < 0, so the gradient is never exactly zero
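
A minimal NumPy sketch of the four activations above (function names are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^-z); saturates (gradient near zero) for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # zero-centered output, which usually helps the next layer learn
    return np.tanh(z)

def relu(z):
    # max(0, z); gradient is 0 for z < 0 and 1 for z > 0
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for z < 0 keeps the gradient from dying
    return np.where(z > 0, z, alpha * z)
```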

Why we need non-linear activation functions
- A composition of linear activation functions is just another linear function
- This means the network can only compute a linear function of its input, and has the same expressive power no matter how many hidden layers or units it has
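
A quick NumPy check of this point (layer sizes and variable names are my own): two stacked linear layers collapse into a single linear layer with weights W2 @ W1 and bias W2 @ b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))                       # a single input with 3 features
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

two_linear_layers = W2 @ (W1 @ x + b1) + b2           # two stacked "linear activation" layers
single_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # one equivalent linear layer

assert np.allclose(two_linear_layers, single_linear_layer)
```
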
Derivatives of activation functions
- Sigmoid: g'(z) = g(z)(1 - g(z))
- Tanh: g'(z) = 1 - (tanh(z))^2
- ReLU: g'(z) = 0 if z < 0, 1 if z > 0 (either value works at z = 0)
- Leaky ReLU: g'(z) = 0.01 if z < 0, 1 if z > 0
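
The same derivatives as a NumPy sketch (function names are my own):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                 # g(z)(1 - g(z))

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2         # 1 - tanh(z)^2

def relu_prime(z):
    return (z > 0).astype(float)         # 0 for z < 0, 1 for z > 0

def leaky_relu_prime(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)   # alpha for z < 0, 1 for z > 0
```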

Gradient descent for neural networks
- Parameters: W[1], b[1], W[2], b[2]; cost function J = (1/m) * sum of the losses L(yhat, y)
- Repeat: compute predictions, compute the gradients dW[l], db[l] with backpropagation, then update W[l] := W[l] - alpha * dW[l] and b[l] := b[l] - alpha * db[l]

Formulas for computing derivatives
- For a two-layer network with a sigmoid output, vectorized over m examples (implemented in the sketch under Summary of gradient descent below):
- dZ[2] = A[2] - Y
- dW[2] = (1/m) * dZ[2] A[1].T
- db[2] = (1/m) * np.sum(dZ[2], axis=1, keepdims=True)
- dZ[1] = W[2].T dZ[2] * g[1]'(Z[1])  (elementwise product)
- dW[1] = (1/m) * dZ[1] X.T
- db[1] = (1/m) * np.sum(dZ[1], axis=1, keepdims=True)

Backpropagation review
- Logistic regression: backpropagate through a single step (z = w.T x + b, a = sigmoid(z)), which gives dz = a - y, dw = x * dz, db = dz
- Neural network: the same chain rule applied twice, propagating the gradient back through layer 2 and then layer 1
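
As an illustration of the logistic regression case, a vectorized sketch of the gradients over m examples (shape conventions assumed: X is (n_x, m), Y is (1, m), w is (n_x, 1)):

```python
import numpy as np

def logreg_gradients(X, Y, w, b):
    # Forward pass: z = w.T X + b, a = sigmoid(z)
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    m = X.shape[1]
    # Backward pass: dz = a - y, gradients averaged over the m examples
    dZ = A - Y
    dw = (X @ dZ.T) / m
    db = np.sum(dZ) / m
    return dw, db
```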

Summary of gradient descent
- Vectorized implementation of forward propagation, backpropagation, and the parameter update (see the sketch below)
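
Putting the formulas together, a sketch of one vectorized gradient-descent step for a 2-layer network with a tanh hidden layer and sigmoid output (the parameter dictionary layout and learning rate are my own assumptions):

```python
import numpy as np

def gradient_descent_step(X, Y, params, lr=0.01):
    """One vectorized forward/backward pass and update. X: (n_x, m), Y: (1, m)."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]

    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))        # sigmoid output

    # Backward propagation (formulas above)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # tanh'(Z1) = 1 - A1^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    params["W1"] = W1 - lr * dW1
    params["b1"] = b1 - lr * db1
    params["W2"] = W2 - lr * dW2
    params["b2"] = b2 - lr * db2
    return params
```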

Random Initialization
- For logistic regression, it was okay to initialize the weights to zero
- But for a neural network, if we initialize all the weight parameters to zero and then apply gradient descent, it won't work
- Initializing the bias terms to zero is fine
- Problem of initializing to zeros: all hidden units in a layer compute the same function and receive identical gradient updates, so they stay symmetric no matter how long we train

- Random initialization: set the weights to small random values (e.g. multiply np.random.randn by 0.01) so that z starts out where the tanh/sigmoid gradients are not close to zero (see the sketch below)
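
A minimal sketch of this initialization for a 2-layer network (layer sizes n_x, n_h, n_y are placeholders):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    # Small random weights break the symmetry between hidden units;
    # the 0.01 scale keeps z small so tanh/sigmoid do not start out saturated.
    W1 = np.random.randn(n_h, n_x) * scale
    b1 = np.zeros((n_h, 1))               # zeros are fine for biases
    W2 = np.random.randn(n_y, n_h) * scale
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```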

[Source] https://www.coursera.org/learn/neural-networks-deep-learning