Proximal algorithms
We consider a convex objective function of the form \(L(\theta) + R(\theta)\), where
\(L(\theta)\) is convex and differentiable
\(R(\theta)\) is convex but not necessarily differentiable
Then the proximal operator has the form
\[ prox_R(y) = \operatorname{argmin}_z \left( R(z) + \frac{1}{2} ||z - y||_2^2 \right) \]
The goal of the proximal operator is to minimize R while also staying close to y. The key is to be able to compute the proximal operator efficiently for different choices of R.
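As a minimal sketch of this definition, we can evaluate the proximal operator numerically for a small example by minimizing \(R(z) + \frac{1}{2}||z - y||_2^2\) directly with a generic solver (the helper name prox_numeric is illustrative; assumes numpy and scipy are available):

```python
import numpy as np
from scipy.optimize import minimize

def prox_numeric(R, y):
    # Approximate prox_R(y) = argmin_z R(z) + 0.5 * ||z - y||_2^2 by brute force;
    # real proximal methods use closed-form prox operators instead.
    obj = lambda z: R(z) + 0.5 * np.sum((z - y) ** 2)
    return minimize(obj, x0=np.array(y, dtype=float), method="Nelder-Mead").x

# Example: R(z) = ||z||_1 should reproduce soft thresholding with lambda = 1.
R = lambda z: np.sum(np.abs(z))
print(prox_numeric(R, np.array([3.0, -0.5, 0.2])))   # approximately [2, 0, 0]
```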
Proximal operators
L1 norm
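For \(R(\theta) = \lambda ||\theta||_1\), the proximal operator is the standard elementwise soft-thresholding operator, \(prox_{\lambda ||\cdot||_1}(y)_j = \mathrm{sign}(y_j) \max(|y_j| - \lambda, 0)\). A minimal numpy sketch (the function name soft_threshold is illustrative):

```python
import numpy as np

def soft_threshold(y, lam):
    # Elementwise soft thresholding: prox of lam * ||.||_1.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
```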
L2 norm
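For \(R(\theta) = \lambda ||\theta||_2\) (the unsquared 2-norm; for the squared norm the prox is just the scaling \(y/(1+2\lambda)\)), the proximal operator is the standard block soft-thresholding operator, which shrinks the whole vector toward zero and sets it to zero when \(||y||_2 \le \lambda\). A minimal sketch (function name illustrative):

```python
import numpy as np

def block_soft_threshold(y, lam):
    # Prox of lam * ||.||_2: shrink the whole vector, zero it if its norm is small.
    norm = np.linalg.norm(y)
    return np.zeros_like(y) if norm <= lam else (1.0 - lam / norm) * y
```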
Indicator function
The proximal operator of the indicator function \(I_C(\theta)\) of a convex set \(C\), which is 0 if \(\theta \in C\) and \(\infty\) otherwise, is the projection operator \(proj_C\). For many simple convex sets the projection can be computed in closed form.
For a rectangular box \(C = \{ \theta : l_j \le \theta_j \le \mu_j \}\) the projection is applied componentwise:
\[ proj_C(\theta)_j = \begin{cases} l_j & \theta_j \le l_j \\ \theta_j & l_j \le \theta_j \le \mu_j \\ \mu_j & \theta_j \ge \mu_j \end{cases} \]
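A one-line numpy sketch of the box projection, assuming lower and upper hold the per-coordinate bounds \(l_j\) and \(\mu_j\):

```python
import numpy as np

def project_box(theta, lower, upper):
    # Componentwise projection onto {theta : lower_j <= theta_j <= upper_j}.
    return np.clip(theta, lower, upper)
```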
For the Euclidean ball \(C = \{ \theta: ||\theta||_2 \le 1 \}\):
\[ proj_C(\theta) = \begin{cases} \frac{\theta}{||\theta||_2} & ||\theta||_2 > 1 \\ \theta & ||\theta||_2 \le 1 \end{cases} \]
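A corresponding sketch for the Euclidean ball (radius 1, as above):

```python
import numpy as np

def project_l2_ball(theta):
    # Projection onto {theta : ||theta||_2 <= 1}: rescale only if outside the ball.
    norm = np.linalg.norm(theta)
    return theta if norm <= 1.0 else theta / norm
```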
The projection onto the 1-norm ball \(C = \{ \theta: ||\theta||_1 \le 1 \}\) is given by soft thresholding, \(proj_C(\theta)_j = \mathrm{sign}(\theta_j) \max(|\theta_j| - \lambda, 0)\), where
\(\lambda = 0\) if \(||\theta||_1 \le 1\), otherwise \(\lambda\) is the solution of \(\sum_{j=1}^D \max (|\theta_j| - \lambda, 0) = 1\)
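A sketch of this projection that finds \(\lambda\) with the standard sort-and-threshold procedure and then soft thresholds (function name illustrative):

```python
import numpy as np

def project_l1_ball(theta):
    # Projection onto {theta : ||theta||_1 <= 1}.
    if np.sum(np.abs(theta)) <= 1.0:
        return theta                          # already inside the ball: lambda = 0
    u = np.sort(np.abs(theta))[::-1]          # magnitudes in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.max(np.where(u - (css - 1.0) / j > 0)[0]) + 1
    lam = (css[rho - 1] - 1.0) / rho          # lambda solving the equation above
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)
```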
Proximal gradient method
We can extend gradient descent to also apply the proximal operator. The basic idea is to minimize R plus a quadratic approximation of the loss function around \(\theta_k\):
\[ \theta_{k+1} = \operatorname{argmin}_z \left( R(z) + L(\theta_k) + g_k^T (z - \theta_k) + \frac{1}{2 t_k} ||z - \theta_k||_2^2 \right) \]
where
\(g_k = \nabla L(\theta_k)\) is the gradient at \(\theta_k\)
\(\frac{1}{t_k}I \approx \nabla^2 L(\theta_k)\) approximates the Hessian, so the method stays first order
We can simplify by dropping the terms that do not depend on z, multiplying by \(t_k\), and completing the square (constant terms do not affect the minimizer). This gives the update
\[ \theta_{k+1} = \operatorname{argmin}_z \left( t_k R(z) + \frac{1}{2} ||z - (\theta_k - t_k g_k)||_2^2 \right) = prox_{t_k R}(\theta_k - t_k g_k) \]
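A minimal sketch of the resulting iteration in Python, assuming a fixed step size and user-supplied callables grad_L and prox_R (all names are illustrative):

```python
import numpy as np

def proximal_gradient(grad_L, prox_R, theta0, step_size, num_iters=100):
    # Proximal gradient method: gradient step on L, then prox step on R.
    # grad_L(theta) returns the gradient of the smooth part L;
    # prox_R(y, t) returns prox_{t R}(y).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        theta = prox_R(theta - step_size * grad_L(theta), step_size)
    return theta
```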
Depending on \(R(\theta)\) we get:
\(R(\theta) = 0\) we get gradient descent
\(R(\theta) = I_C(\theta)\) we get projected gradient
\(R(\theta) = \lambda ||\theta||_1\) we get iterative soft thresholding (ISTA), sketched below
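As an illustration of the last case, here is a sketch of ISTA for the lasso objective \(\frac{1}{2}||X\theta - y||_2^2 + \lambda ||\theta||_1\), with hypothetical toy data:

```python
import numpy as np

def ista(X, y, lam, step_size, num_iters=200):
    # Iterative soft thresholding: proximal gradient with the L1 prox.
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y)               # gradient of the smooth part
        z = theta - step_size * grad               # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - step_size * lam, 0.0)  # prox step
    return theta

# Toy usage; the step size 1 / ||X||_2^2 keeps the iteration stable.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
theta_true = np.zeros(10)
theta_true[:3] = [2.0, -1.0, 0.5]
y = X @ theta_true + 0.1 * rng.normal(size=50)
print(ista(X, y, lam=0.5, step_size=1.0 / np.linalg.norm(X, 2) ** 2))
```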