RMS Prop

Extends Adagrad to avoid the effect of a monotonically decreasing learning rate. It maintains a decaying average of squared gradients and works well in nonconvex settings. This average is updated according to:

\[ \hat{s}^{(k+1)} = \gamma \hat{s}^{(k)} + (1-\gamma)(g^{(k)} \odot g^{(k)}) \]
  • \(\odot\) is the element-wise product

  • \(\gamma \in [0,1]\) is the decay rate often set to \(0.9\)
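
As a concrete illustration, here is a minimal NumPy sketch of this decaying-average update; the function name `update_squared_gradient_average` and the variables `s`, `g`, and `gamma` are illustrative choices that mirror the symbols above, not part of any particular library.

```python
import numpy as np

def update_squared_gradient_average(s, g, gamma=0.9):
    """Return the updated decaying average of squared gradients.

    s     -- current average, same shape as the gradient
    g     -- current gradient
    gamma -- decay rate in [0, 1], typically 0.9
    """
    return gamma * s + (1.0 - gamma) * (g * g)  # g * g is the element-wise product
```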

The update rule is the same as for Adagrad but with the decaying average:

\[\begin{split} x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\epsilon + \sqrt{\hat{s}_i^{(k)}}} g_i^{(k)} \\ x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\epsilon + \text{RMS}(g_i)} g_i^{(k)} \end{split}\]
  • \(\text{RMS}(g_i)\) is shorthand for the decaying root mean square of the gradient component \(g_i\)
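
A sketch of one such parameter update using the decaying average; the names (`x`, `g`, `s`, `alpha`, `epsilon`) simply mirror the symbols in the formula above.

```python
import numpy as np

def rmsprop_step(x, g, s, alpha=0.001, epsilon=1e-8):
    """Apply one RMSProp update to the parameters x.

    x       -- current parameters
    g       -- gradient at x
    s       -- decaying average of squared gradients
    alpha   -- learning rate
    epsilon -- small constant preventing division by zero
    """
    return x - alpha * g / (epsilon + np.sqrt(s))  # element-wise division
```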

Algorithm

Initialize:

\(\alpha\) learning rate, \(\gamma\) decay rate, \(\epsilon\) a small constant for numerical stability, \(s = 0\) decaying average of squared gradients

Iterate:

\[\begin{split} g = \nabla f(x) \\ s = \gamma s + (1 - \gamma)(g \odot g) \\ \text{return } x - \alpha \, g / (\sqrt{s} + \epsilon) \end{split}\]
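
Putting the pieces together, a minimal sketch of the full loop, assuming the objective is supplied through a gradient function `f_grad`; the function name, default hyperparameters, and the quadratic test problem are illustrative.

```python
import numpy as np

def rmsprop(f_grad, x0, alpha=0.01, gamma=0.9, epsilon=1e-8, n_iter=1000):
    """Run RMSProp for n_iter iterations and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                       # decaying average of squared gradients
    for _ in range(n_iter):
        g = f_grad(x)                          # g = grad f(x)
        s = gamma * s + (1 - gamma) * (g * g)  # decaying average update
        x = x - alpha * g / (np.sqrt(s) + epsilon)
    return x

# Example: minimize f(x) = x1^2 + 10 * x2^2, whose gradient is [2 x1, 20 x2]
x_min = rmsprop(lambda x: np.array([2 * x[0], 20 * x[1]]), x0=[3.0, -2.0])
print(x_min)  # should be close to [0, 0]
```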