Optimizers
Gradient Descent
- Updates the parameters once per epoch, using the gradient computed over the entire training set
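A minimal NumPy sketch of this full-batch update, assuming a made-up linear-regression loss; the data, model, and learning rate below are illustrative, not from the video.

```python
import numpy as np

# Toy data and linear model (illustrative assumption): predict y from X @ w
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1

for epoch in range(100):
    # One update per epoch, using the gradient over the ENTIRE dataset
    grad = 2 * X.T @ (X @ w - y) / len(X)   # gradient of the mean squared error
    w -= lr * grad
```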
Stochastic Gradient Descent
- Updates the parameters after every single sample
- Takes a long time, since it processes and updates on one sample at a time
- Sensitive to each individual sample, so the updates are noisy
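For contrast, a sketch of plain SGD on the same assumed toy setup: one parameter update per sample, which is why each step is cheap but noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.01

for epoch in range(10):
    for i in rng.permutation(len(X)):
        xi, yi = X[i], y[i]
        # One update per sample: sensitive to whatever this single sample says
        grad = 2 * xi * (xi @ w - yi)
        w -= lr * grad
```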
Mini-batch Gradient Descent
- Reduces the noise of SGD by averaging the gradient over a batch of samples
- Updates the parameters every N samples (one mini-batch)
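A sketch of the mini-batch variant under the same assumed setup; averaging the gradient over N samples smooths out the per-sample noise.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr, batch_size = 0.05, 16    # batch size N is an illustrative choice

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # One update per mini-batch of N samples
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
        w -= lr * grad
```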
SGD + Momentum
- Reduces the noise of SGD by accumulating past gradients into a velocity term
- The parameters tend to change in a consistent direction across updates, since training examples are similar to one another
- Pros
- Using momentum to exploit this property, the model learns faster while paying less attention to individual noisy samples
- Cons
- Samples chosen blindly do not guarantee that the accumulated direction is the right one
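A sketch of the classical momentum update; the quadratic stand-in loss and the decay factor γ = 0.9 are illustrative assumptions.

```python
import numpy as np

def grad_fn(w):
    # Stand-in for a noisy mini-batch gradient (illustrative quadratic loss)
    return 2 * (w - np.array([1.0, -2.0, 0.5]))

w = np.zeros(3)
v = np.zeros(3)           # velocity: exponentially weighted sum of past gradients
lr, gamma = 0.1, 0.9      # gamma controls how long past gradients persist

for step in range(100):
    g = grad_fn(w)
    v = gamma * v + lr * g   # gradients pointing in a consistent direction add up
    w -= v                   # noisy, conflicting directions tend to cancel out
```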
SGD + Momentum + Acceleration
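This heading presumably refers to Nesterov accelerated gradient (NAG), where the gradient is evaluated at the look-ahead position the velocity is about to reach rather than at the current parameters; the sketch below shows that variant under the same illustrative setup as above.

```python
import numpy as np

def grad_fn(w):
    # Stand-in for a mini-batch gradient (illustrative quadratic loss)
    return 2 * (w - np.array([1.0, -2.0, 0.5]))

w = np.zeros(3)
v = np.zeros(3)
lr, gamma = 0.1, 0.9

for step in range(100):
    # Look ahead to where momentum is about to take the parameters, then correct
    g = grad_fn(w - gamma * v)
    v = gamma * v + lr * g
    w -= v
```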
Adaptive Learning Rate Optimizers
- Let the model learn more along one parameter direction than another by adapting the learning rate per parameter
Adagrad
- For every update step:
- θ ← θ − (η / √(G + ε)) · g, where g is the current gradient
- G: sum of squares of all past gradients, accumulated per parameter up to that point
- Problem:
- G is monotonically increasing over iterations
- LR will decay to a point where the parameter will no longer update (no learning)
- The effective learning rate η / √(G + ε) tends to 0
- Learning becomes slower and slower
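A sketch of the Adagrad update under the same illustrative setup; note that the accumulator G only ever grows, so the effective step size η / √(G + ε) shrinks toward zero.

```python
import numpy as np

def grad_fn(w):
    # Stand-in for a mini-batch gradient (illustrative quadratic loss)
    return 2 * (w - np.array([1.0, -2.0, 0.5]))

w = np.zeros(3)
G = np.zeros(3)            # per-parameter sum of squared gradients
lr, eps = 0.5, 1e-8

for step in range(200):
    g = grad_fn(w)
    G += g ** 2                        # monotonically increasing over iterations
    w -= lr / np.sqrt(G + eps) * g     # effective learning rate decays toward 0
```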
Adadelta
- For every update step:
- Reduces the influence of past squared gradients by weighting them with a decay factor γ
- The effect of older gradients decays exponentially
- E[g²] ← γ · E[g²] + (1 − γ) · g²
- θ ← θ − (η / √(E[g²] + ε)) · g
- The denominator no longer grows without bound, so the effective learning rate does not tend to 0
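A sketch of the decaying-average idea described above; this simplified form is often presented as RMSProp, while full Adadelta additionally replaces the fixed η with a running average of past squared updates. Hyperparameters are illustrative.

```python
import numpy as np

def grad_fn(w):
    # Stand-in for a mini-batch gradient (illustrative quadratic loss)
    return 2 * (w - np.array([1.0, -2.0, 0.5]))

w = np.zeros(3)
Eg2 = np.zeros(3)               # decaying average of squared gradients
lr, gamma, eps = 0.1, 0.9, 1e-8

for step in range(200):
    g = grad_fn(w)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2   # old gradients fade exponentially
    w -= lr / np.sqrt(Eg2 + eps) * g           # denominator no longer grows forever
```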
Adam (Adadelta + momentum)
- For every update step:
- Adadelta-style decaying average of squared gradients, plus a decaying average (expected value) of the past gradients themselves
- m ← β₁ · m + (1 − β₁) · g (first moment: expected value of the gradients)
- v ← β₂ · v + (1 − β₂) · g² (second moment: expected value of the squared gradients)
- θ ← θ − η · m̂ / (√v̂ + ε), using bias-corrected m̂ and v̂
- Slow initially, then picks up speed over time
- similar to momentum
- different step size for different parameters
- With momentum applied to every parameter, Adam leads to faster convergence
- This is why Adam is the de facto optimizer for many projects.
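A sketch of the Adam update combining both ideas: a decaying average of gradients (per-parameter momentum) and of squared gradients (per-parameter step size), with bias correction; β₁ = 0.9 and β₂ = 0.999 are the commonly used defaults, shown here for illustration.

```python
import numpy as np

def grad_fn(w):
    # Stand-in for a mini-batch gradient (illustrative quadratic loss)
    return 2 * (w - np.array([1.0, -2.0, 0.5]))

w = np.zeros(3)
m = np.zeros(3)       # 1st moment: decaying average of gradients (momentum)
v = np.zeros(3)       # 2nd moment: decaying average of squared gradients
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for step in range(1, 201):
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** step)   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** step)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```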

ref: https://youtu.be/mdKjMPmcWjY