Optimizers

https://youtu.be/mdKjMPmcWjY

Gradient Descent

  • Updates once per epoch, i.e., after a full pass over the training set (see the sketch below)
  • θ = θ − α ∇_θ J(θ)
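
A minimal NumPy sketch of this full-batch update, assuming an illustrative mean-squared-error loss on a linear model (the loss, toy data, and learning rate are my own example choices, not from the post):

```python
import numpy as np

def gd_step(theta, X, y, alpha=0.1):
    """One full-batch gradient descent step: theta = theta - alpha * grad J(theta).
    J is assumed to be the mean squared error of a linear model X @ theta (illustrative)."""
    grad = 2.0 / len(X) * X.T @ (X @ theta - y)  # gradient averaged over the whole dataset
    return theta - alpha * grad

# Toy usage: fit y = 2x; one update per epoch (full batch).
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(1)
for epoch in range(100):
    theta = gd_step(theta, X, y)
print(theta)  # approaches [2.0]
```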

Stochastic Gradient Descent

  • Updates after every single sample (see the sketch below)
    • Takes a long time, since there is one update per sample
    • Sensitive to each individual sample (noisy updates)
  • θ = θ − α ∇_θ J(θ), with the gradient computed from a single sample
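
A sketch of the same update applied one sample at a time, again assuming an illustrative per-sample squared-error loss on a linear model; the noisy per-sample gradient is what makes SGD cheap per step but sensitive to individual samples:

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha=0.01):
    """One epoch of SGD: a parameter update after every single sample."""
    for i in np.random.permutation(len(X)):
        grad = 2.0 * X[i] * (X[i] @ theta - y[i])  # gradient of one sample's squared error
        theta = theta - alpha * grad
    return theta
```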

Mini-batch Gradient Descent

  • Decreases the noise of SGD
  • Updates every N samples (see the sketch below)
  • θ = θ − α ∇_θ J(θ), with the gradient averaged over a mini-batch
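
The mini-batch variant is the same loop, but the gradient is averaged over N samples before each update, which smooths the per-sample noise (the batch size and loss here are illustrative choices):

```python
import numpy as np

def minibatch_epoch(theta, X, y, alpha=0.05, batch_size=32):
    """One epoch of mini-batch gradient descent: one update every `batch_size` samples."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ theta - yb)  # average gradient over the batch
        theta = theta - alpha * grad
    return theta
```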

SGD + Momentum

  • Decreases the noise of SGD (see the sketch after this list)
  • v = γv + η ∇_θ J(θ)
  • θ = θ − αv
  • The parameters of a model tend to change in a single direction, since consecutive examples are similar.
    • Pros
      • With this property, momentum lets the model learn faster by paying less attention to individual samples.
    • Cons
      • Choosing samples blindly does not guarantee the right direction.
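
A sketch of the two momentum equations above as a single stateful step; `grad_fn` is an assumed callback that returns the (mini-batch) gradient of J at a given θ, and the hyperparameter defaults are illustrative:

```python
def momentum_step(theta, v, grad_fn, gamma=0.9, eta=0.01, alpha=1.0):
    """SGD with momentum, following the post's notation:
        v     = gamma * v + eta * grad J(theta)
        theta = theta - alpha * v
    """
    v = gamma * v + eta * grad_fn(theta)  # velocity accumulates past gradient directions
    theta = theta - alpha * v
    return theta, v
```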

SGD + Momentum + Acceleration

  • v = γv + η ∇_θ J(θ − γv)
  • θ = θ − αv
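
The accelerated (Nesterov-style) version only changes where the gradient is evaluated: at the look-ahead point θ − γv instead of at θ. A sketch under the same assumptions as above:

```python
def nesterov_step(theta, v, grad_fn, gamma=0.9, eta=0.01, alpha=1.0):
    """SGD with momentum + acceleration, following the post's notation:
        v     = gamma * v + eta * grad J(theta - gamma * v)
        theta = theta - alpha * v
    """
    v = gamma * v + eta * grad_fn(theta - gamma * v)  # gradient at the look-ahead point
    theta = theta - alpha * v
    return theta, v
```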

Adaptive Learning Rate Optimizers

  • Learn more along one parameter direction than another, by giving each parameter its own effective learning rate

Adagrad

  • For every epoch t
    • For every parameter θ_i
  • θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) ∇_{θ_i} J(θ_{t,i})
  • G_{t,ii} = G_{t−1,ii} + (∇_{θ_i} J(θ_{t,i}))²
  • or G_{t,ii} = (∇_{θ_i} J(θ_{1,i}))² + (∇_{θ_i} J(θ_{2,i}))² + (∇_{θ_i} J(θ_{3,i}))² + ⋯ + (∇_{θ_i} J(θ_{t,i}))²
    • G_{t,ii}: the sum of squares of the gradients w.r.t. θ_i up to that point
  • Problem (see the sketch after this list):
    • G is monotonically increasing over iterations
      • The effective learning rate η / √(G_{t,ii} + ε) tends to 0
      • So the learning rate decays to a point where the parameter no longer updates (no learning)
      • Learning becomes slower and slower
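
A per-parameter NumPy sketch of the Adagrad update above; `theta`, `G`, and `grad` are arrays of the same shape, so each parameter i gets its own effective learning rate η / √(G_ii + ε), which can only shrink as G accumulates:

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    """Adagrad: theta_i -= eta / sqrt(G_ii + eps) * grad_i,
    where G_ii is the running sum of squared gradients for parameter i."""
    G = G + grad ** 2                              # monotonically increasing accumulator
    theta = theta - eta / np.sqrt(G + eps) * grad  # effective LR shrinks over time
    return theta, G
```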

Adadelta

  • For every epoch t
    • For every parameter θ_i
  • Reduce the influence of past squared gradients by weighting them with a decay factor γ (see the sketch after this list)
    • The effect of old gradients decays exponentially
    • θ_{t+1,i} = θ_{t,i} − (η / √(E[G_{t,ii}] + ε)) ∇_{θ_i} J(θ_{t,i})
    • G_{t,ii} = γ G_{t−1,ii} + (1 − γ) (∇_{θ_i} J(θ_{t,i}))²
    • The effective learning rate η / √(E[G_{t,ii}] + ε) does not tend to 0
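
A sketch of the update exactly as written above, where the accumulator is an exponential moving average of squared gradients (this particular form is also commonly presented as RMSprop; the post labels it Adadelta):

```python
import numpy as np

def adadelta_step(theta, G, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """Exponentially decayed accumulator, following the post's equations:
        G     = gamma * G + (1 - gamma) * grad**2
        theta = theta - eta / sqrt(G + eps) * grad
    Old squared gradients decay, so the effective step size no longer shrinks to zero."""
    G = gamma * G + (1.0 - gamma) * grad ** 2
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G
```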

Adam (Adadelta + momentum)

  • For every epoch t
    • For every parameter θ_i
  • Adadelta + an expected value of the past gradients (see the sketch after this list)
    • θ_{t+1,i} = θ_{t,i} − (η / √(E[G_{t,ii}] + ε)) E[g_{t,i}]
    • G_{t,ii} = γ G_{t−1,ii} + (1 − γ) (∇_{θ_i} J(θ_{t,i}))²
    • E[g_{t,i}] = β E[g_{t−1,i}] + (1 − β) ∇_{θ_i} J(θ_{t,i})
      • Slow initially, picks up speed over time
      • Similar to momentum
    • A different step size for each parameter
    • With momentum applied per parameter, Adam leads to faster convergence
    • This is why Adam is the de facto optimizer in many projects.
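
A sketch combining the two moving averages above: the squared-gradient average sets a per-parameter step size, and the gradient average acts as momentum. Note that the published Adam algorithm additionally applies bias correction to both averages; that is omitted here to stay close to the post's equations:

```python
import numpy as np

def adam_step(theta, m, G, grad, eta=0.001, beta=0.9, gamma=0.999, eps=1e-8):
    """Adam-style update following the post's notation (no bias correction):
        G     = gamma * G + (1 - gamma) * grad**2   # EMA of squared gradients (Adadelta part)
        m     = beta  * m + (1 - beta)  * grad      # EMA of gradients (momentum part)
        theta = theta - eta / sqrt(G + eps) * m
    """
    G = gamma * G + (1.0 - gamma) * grad ** 2
    m = beta * m + (1.0 - beta) * grad
    theta = theta - eta / np.sqrt(G + eps) * m
    return theta, m, G
```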

[Image: optimizer-equations.png — summary of the optimizer update equations]

ref: https://youtu.be/mdKjMPmcWjY
