[Google_Bootcamp_Day4]

Activation Functions

  • Sigmoid
    • If z is very large or very small, the gradient approaches zero, which slows down gradient descent
    • The only case where sigmoid is still preferred is the output layer of a binary classification problem
  • Tanh
    • Almost always works better than the sigmoid function for hidden units
    • If z is very large or very small, the gradient approaches zero, which slows down gradient descent
  • ReLU
    • Usually the default choice of activation function
    • If z < 0 the gradient is zero, but enough hidden units have z >= 0, so in practice learning does not slow down
  • Leaky ReLU
    • Like ReLU, but with a small slope for z < 0 so the gradient is never exactly zero (see the code sketch below)

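For reference, a minimal NumPy sketch of the four activations above (the function names are mine, and the 0.01 slope for Leaky ReLU is just a common default):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^(-z)); saturates (gradient near zero) for very large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # zero-centered output in (-1, 1); also saturates for very large |z|
    return np.tanh(z)

def relu(z):
    # max(0, z); gradient is 1 for z > 0 and 0 for z < 0
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # like ReLU, but keeps a small slope for z < 0
    return np.where(z > 0, z, slope * z)
```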

Why we need non-linear activation functions

  • A composition of linear functions is itself just a linear function
  • This means that no matter how many hidden layers or units we add, a network with only linear activations can represent nothing more than a linear function of its input (demonstrated below)
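
A quick NumPy check of this claim: two stacked linear layers compute exactly the same outputs as a single linear layer with W = W2 @ W1 and b = W2 @ b1 + b2 (the shapes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))  # 3 input features, 5 examples

# Two "linear activation" layers
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
two_layers = W2 @ (W1 @ X + b1) + b2

# One equivalent linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ X + b

print(np.allclose(two_layers, one_layer))  # True: the extra layer adds no expressive power
```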

Derivatives of activation functions

  • Sigmoid: g'(z) = g(z)(1 - g(z))
  • Tanh: g'(z) = 1 - (tanh(z))^2
  • ReLU: g'(z) = 0 for z < 0 and 1 for z > 0 (either value can be used at z = 0)
  • Leaky ReLU: g'(z) = 0.01 for z < 0 and 1 for z > 0
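
These derivatives are what backpropagation multiplies by at each hidden layer; a minimal NumPy sketch (function names are mine):

```python
import numpy as np

def sigmoid_derivative(z):
    # g'(z) = g(z) * (1 - g(z))
    a = 1.0 / (1.0 + np.exp(-z))
    return a * (1.0 - a)

def tanh_derivative(z):
    # g'(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def relu_derivative(z):
    # g'(z) = 1 for z > 0, 0 otherwise
    return (z > 0).astype(float)

def leaky_relu_derivative(z, slope=0.01):
    # g'(z) = 1 for z > 0, a small slope otherwise
    return np.where(z > 0, 1.0, slope)
```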

Gradient descent for neural networks

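The lecture figure is not reproduced here, but the core of the loop is: compute predictions, compute the gradients dW and db for each layer, then update W := W - alpha * dW and b := b - alpha * db. A hedged sketch of the update step (the dictionary layout for params and grads is my own convention):

```python
def update_parameters(params, grads, learning_rate=0.01):
    # One gradient descent step for a 1-hidden-layer network:
    # W[l] := W[l] - alpha * dW[l], and likewise for b[l]
    for name in ("W1", "b1", "W2", "b2"):
        params[name] = params[name] - learning_rate * grads["d" + name]
    return params
```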

Formulas for computing derivatives

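The slide is not reproduced here; for the one-hidden-layer network (sigmoid output, m training examples stacked as columns), the standard vectorized formulas are:

  • dZ[2] = A[2] - Y
  • dW[2] = (1/m) dZ[2] A[1].T
  • db[2] = (1/m) np.sum(dZ[2], axis=1, keepdims=True)
  • dZ[1] = W[2].T dZ[2] * g[1]'(Z[1])  (element-wise product)
  • dW[1] = (1/m) dZ[1] X.T
  • db[1] = (1/m) np.sum(dZ[1], axis=1, keepdims=True)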

Backpropagation review

  • Logistic regression
  • Neural network
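
For the logistic regression case (one training example, sigmoid output, cross-entropy loss), the backward pass reduces to a few lines; a minimal sketch:

```python
import numpy as np

def logistic_regression_gradients(x, y, w, b):
    # Forward pass: z = w.T x + b, a = sigmoid(z)
    z = np.dot(w.T, x) + b
    a = 1.0 / (1.0 + np.exp(-z))
    # Backward pass through the computation graph:
    # dz = a - y (sigmoid + cross-entropy), dw = x * dz, db = dz
    dz = a - y
    dw = x * dz
    db = dz
    return dw, db
```

A neural network chains this same pattern backwards through each layer, which is what the vectorized formulas in the previous section compute for a whole batch at once.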

Summary of gradient descent

  • Vectorized Implementation (see the sketch below)
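
A hedged, self-contained NumPy sketch of one vectorized training step for a one-hidden-layer network with tanh hidden units and a sigmoid output (the function name and the params dictionary layout are my own convention; X has shape (n_x, m) and Y has shape (1, m)):

```python
import numpy as np

def train_step(params, X, Y, learning_rate=0.01):
    m = X.shape[1]
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]

    # Vectorized forward pass over all m examples
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))  # sigmoid output

    # Vectorized backward pass (the formulas from the previous sections)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)  # tanh'(Z1) = 1 - A1^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    params["W1"] = W1 - learning_rate * dW1
    params["b1"] = b1 - learning_rate * db1
    params["W2"] = W2 - learning_rate * dW2
    params["b2"] = b2 - learning_rate * db2
    return params
```

A params dictionary with keys W1, b1, W2, b2 can be produced by the initialization sketch in the next section.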

Random Initialization

  • For logistic regression, it was okay to initialize the weights to zero
  • But for a neural network, if we initialize all the weight parameters to zero and then apply gradient descent, it won't work
  • Initializing the bias terms to zero is fine
  • Problem of initializing to zeros: every hidden unit in a layer computes exactly the same function, and gradient descent never breaks this symmetry, so the hidden units stay identical
  • Random initialization: use small random values for the weights (e.g. scaled by 0.01) so that tanh/sigmoid activations do not start out in their flat, saturated regions (see the sketch below)
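
A minimal sketch of this, assuming one hidden layer with n_h units (the function name is mine; the 0.01 scale is a common small value, not a required one):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    # Small random weights break the symmetry between hidden units;
    # zero biases are fine. n_x, n_h, n_y = input, hidden, output layer sizes.
    return {
        "W1": np.random.randn(n_h, n_x) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": np.random.randn(n_y, n_h) * scale,
        "b2": np.zeros((n_y, 1)),
    }
```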

[Source] https://www.coursera.org/learn/neural-networks-deep-learning
