

Batch Normalization

  • makes hyperparameter search much easier


Normalizing inputs to speed up learning

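  • Normalizing the input features helps train the parameters 'w, b' more efficiently
  • Batch Norm applies the same process to hidden-layer values (e.g., hidden layer 2), so the next layer's parameters train more efficiently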
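
The input normalization above can be sketched in NumPy as follows. This is a minimal illustration, assuming the examples are stored as columns of X (shape: features x examples); the function name `normalize_inputs`, the toy data, and the small `eps` guard against division by zero are assumptions made for the example, not part of the notes.

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Zero-center each feature (row of X) and scale it to unit variance."""
    mu = X.mean(axis=1, keepdims=True)       # per-feature mean
    sigma2 = X.var(axis=1, keepdims=True)    # per-feature variance
    X_norm = (X - mu) / np.sqrt(sigma2 + eps)
    return X_norm, mu, sigma2

# Hypothetical toy data: 2 features on very different scales, 4 examples.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 200.0, 300.0, 400.0]])
X_norm, mu, sigma2 = normalize_inputs(X)
```

The returned `mu` and `sigma2` would typically be reused to normalize test data the same way, so that training and test inputs go through an identical transformation.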

Implementing Batch Norm


Adding Batch Norm to a network


With mini-batch


Implementing gradient descent


Why Batch Norm works

  • Problem 7
    • Even if the exact values of Z1, Z2, …, Z4 (all in the second layer) change, at least their mean and variance remain the same
  • Solution
    • Batch Norm reduces the amount by which the distribution of hidden unit values shifts around
    • Limits how much updating the parameters in earlier layers can affect the distribution of values that later layers see
    • Batch Norm also has a slight regularization effect
      • Each mini-batch is scaled by the mean/variance computed on just that mini-batch
      • This adds some noise to the values z[l] within that mini-batch
      • So, similar to dropout, it adds some noise to each hidden layer’s activations
      • Larger mini-batch size -> less noise -> weaker regularization effect
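
The per-mini-batch scaling and the noise it introduces can be sketched as follows. This is a rough NumPy illustration, not a full implementation: `gamma` and `beta` are the usual learnable scale and shift parameters of Batch Norm, while the function name `batch_norm_forward`, the `eps` value, and the random toy data are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z (units x batch size) using this mini-batch's own statistics."""
    mu = Z.mean(axis=1, keepdims=True)        # mean over the mini-batch only
    var = Z.var(axis=1, keepdims=True)        # variance over the mini-batch only
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta           # learnable scale and shift
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))   # 4 hidden units, 64 examples
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)

# The mini-batch mean is a noisy estimate of the true mean (5.0 here).
# Compare how much it fluctuates for small versus large mini-batches.
small_means = [rng.normal(5.0, 3.0, size=8).mean() for _ in range(200)]
large_means = [rng.normal(5.0, 3.0, size=512).mean() for _ in range(200)]
noise_small = np.std(small_means)
noise_large = np.std(large_means)
```

In this sketch `noise_small` comes out larger than `noise_large`, which matches the point above: a larger mini-batch gives steadier statistics and therefore less of the noise that acts as regularization.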

Batch Norm at test time


Softmax regression

  Ex. Recognizing cats(1), dogs(2), and baby chicks(3)
  • Softmax regression generalizes logistic regression to C classes
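
A minimal NumPy sketch of the softmax function, assuming the logits are stored with one column per example (shape: C x examples). The function name `softmax`, the example logits, and the max-subtraction step for numerical stability are illustrative choices, not taken from the notes.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for logits z of shape (C, m)."""
    shifted = z - z.max(axis=0, keepdims=True)   # avoids overflow in exp
    t = np.exp(shifted)
    return t / t.sum(axis=0, keepdims=True)      # each column sums to 1

# Hypothetical logits for one example over the 3 classes (cat, dog, baby chick).
z = np.array([[2.0], [1.0], [0.1]])
a = softmax(z)
predicted_class = int(a.argmax(axis=0)[0])       # 0-based row index of the largest probability

# With C = 2 and one logit fixed at 0, softmax reduces to the logistic sigmoid.
z2 = np.array([[1.5], [0.0]])
p_softmax = softmax(z2)[0, 0]
p_sigmoid = 1.0 / (1.0 + np.exp(-1.5))
```

The last three lines illustrate the generalization claim: for two classes, the softmax probability of the first class equals the sigmoid of its logit.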

Softmax layer


