RMS Prop

Extends Adagrad to avoid the effect of a monotonically decreasing learning rate. It maintains a decaying average of squared gradients and works well in nonconvex settings. This average is updated according to:

\[ \hat{s}^{(k+1)} = \gamma \hat{s}^{(k)} + (1-\gamma)(g^{(k)} \odot g^{(k)}) \]
  • \(\odot\) is the element-wise product

  • \(\gamma \in [0,1]\) is the decay rate often set to \(0.9\)
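
As a concrete illustration, here is a minimal NumPy sketch of this decaying-average update; the function name `update_squared_gradient_average` and the variables `s`, `g`, and `gamma` are illustrative choices that mirror the symbols above, not part of any particular library.

```python
import numpy as np

def update_squared_gradient_average(s, g, gamma=0.9):
    """Return the updated decaying average of squared gradients.

    s     -- current average, same shape as the gradient
    g     -- current gradient
    gamma -- decay rate in [0, 1], typically 0.9
    """
    return gamma * s + (1.0 - gamma) * (g * g)  # g * g is the element-wise product
```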

The update rule is the same as for Adagrad but with the decaying average:

\[\begin{split} x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\epsilon + \sqrt{\hat{s}_i^{(k)}}} g_i^{(k)} \\ x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\epsilon + \text{RMS}(g_i)} g_i^{(k)} \end{split}\]
  • \(\text{RMS}(g_i)\) is shorthand for the decaying root mean square of the gradient component \(g_i\)
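
A sketch of one such parameter update using the decaying average; the names (`x`, `g`, `s`, `alpha`, `epsilon`) simply mirror the symbols in the formula above.

```python
import numpy as np

def rmsprop_step(x, g, s, alpha=0.001, epsilon=1e-8):
    """Apply one RMSProp update to the parameters x.

    x       -- current parameters
    g       -- gradient at x
    s       -- decaying average of squared gradients
    alpha   -- learning rate
    epsilon -- small constant preventing division by zero
    """
    return x - alpha * g / (epsilon + np.sqrt(s))  # element-wise division
```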

Algorithm

Initialize:

\(\alpha\) learning rate, \(\gamma\) decay rate, \(\epsilon\) a small constant for numerical stability, \(s = 0\) decaying average of squared gradients

Iterate:

\[\begin{split} g = \nabla f(x) \\ s = \gamma s + (1 - \gamma)(g \odot g) \\ \text{return } x - \alpha \, g / (\sqrt{s} + \epsilon) \end{split}\]
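
Putting the pieces together, a minimal sketch of the full loop, assuming the objective is supplied through a gradient function `f_grad`; the function name, default hyperparameters, and the quadratic test problem are illustrative.

```python
import numpy as np

def rmsprop(f_grad, x0, alpha=0.01, gamma=0.9, epsilon=1e-8, n_iter=1000):
    """Run RMSProp for n_iter iterations and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                       # decaying average of squared gradients
    for _ in range(n_iter):
        g = f_grad(x)                          # g = grad f(x)
        s = gamma * s + (1 - gamma) * (g * g)  # decaying average update
        x = x - alpha * g / (np.sqrt(s) + epsilon)
    return x

# Example: minimize f(x) = x1^2 + 10 * x2^2, whose gradient is [2 x1, 20 x2]
x_min = rmsprop(lambda x: np.array([2 * x[0], 20 * x[1]]), x0=[3.0, -2.0])
print(x_min)  # should be close to [0, 0]
```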