\(l_1\) regularization¶

Here we encourage \(w_j=0\) by placing a prior on \(w\) that assigns a lot of probability density near the origin, such as a zero-mean Laplace distribution. This should yield a sparse solution.

\[p(w|\lambda) = \prod_{j=1}^D Lap(w_j|0,1/\lambda) \propto \prod_{j=1}^D e^{-\lambda |w_j|}\]
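As a quick illustration of why this prior encourages sparsity, the sketch below (assuming scipy is available; \(\lambda = 1\) is just an illustrative value) compares a zero-mean Laplace prior with a Gaussian of matched variance: the Laplace puts more density right at the origin and also in the tails.

```python
import numpy as np
from scipy.stats import laplace, norm

lam = 1.0
lap = laplace(loc=0.0, scale=1.0 / lam)        # Laplace(0, 1/lam), variance 2/lam^2
gauss = norm(loc=0.0, scale=np.sqrt(2) / lam)  # Gaussian with the same variance

print("density at 0:  Laplace =", lap.pdf(0.0), " Gaussian =", gauss.pdf(0.0))
print("density at 4:  Laplace =", lap.pdf(4.0), " Gaussian =", gauss.pdf(4.0))
# The Laplace prior puts more mass right at the origin (favouring w_j = 0)
# and also has heavier tails than the variance-matched Gaussian.
```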

MAP estimation¶

We can view \(l_1\) regularization as using a Laplace prior and performing MAP estimation. This is also known as the lasso algorithm.

If we assume a uniform prior on the bias term, \(p(w_0) \propto 1\), and perform MAP estimation with this prior, the penalized negative log likelihood has the form:

\[ f(w) = - \log p(D|w) - \log p(w|\lambda) = NLL(w) + \lambda ||w||_1 \]
  • \(||w||_1 = \sum_{j=1}^D|w_j|\) is the \(l_1\) norm of w.

For a suitably large \(\lambda\) the estimate \(\hat{w}\) will be sparse. We can view this as a convex approximation to

\[ \argmin_w NLL(w) + \lambda ||w||_0 \]

the \(l_0\)-regularized objective (the Bernoulli-Gaussian model), where \(||w||_0 = \sum_{j=1}^D \mathbb{I}(w_j \neq 0)\) counts the number of nonzero entries.

For linear regression this becomes:

\[ f(w) = RSS(w) + \lambda'||w||_1 \]
  • \(\lambda' = 2\lambda \sigma^2\)
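For concreteness, here is a minimal numpy sketch of this objective (the function name and arguments are just illustrative):

```python
import numpy as np

def lasso_objective(w, X, y, lam_prime):
    """f(w) = RSS(w) + lam' * ||w||_1 for linear regression."""
    rss = np.sum((y - X @ w) ** 2)   # residual sum of squares
    l1 = np.sum(np.abs(w))           # l1 norm of the weights
    return rss + lam_prime * l1
```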

Regularization path¶

As we increase \(\lambda\), the solution \(\hat{w}(\lambda)\) will tend to get sparser (though not necessarily monotonically). If we plot \(\hat{w}_j(\lambda)\) vs \(\lambda\) for each \(j\), we get the regularization path.

Here we plot the regularization path against \(t = \frac{1}{\alpha}\); thus a smaller value of \(t\) corresponds to a larger \(\alpha\), i.e. stronger regularization.
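A minimal sketch of such a plot using scikit-learn (here the path is plotted directly against sklearn's alpha on a log scale rather than against \(t\); the synthetic dataset is just illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Synthetic regression data; any dataset works here.
X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# lasso_path computes \hat{w}(alpha) over a grid of alphas (sklearn's alpha plays the role of lambda).
alphas, coefs, _ = lasso_path(X, y)

for j in range(coefs.shape[0]):
    plt.plot(alphas, coefs[j], label=f"w_{j}")
plt.xscale("log")
plt.gca().invert_xaxis()          # read left-to-right as regularization weakens
plt.xlabel("alpha")
plt.ylabel("coefficient value")
plt.title("Lasso regularization path")
plt.legend(fontsize=8)
plt.show()
```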

High dimension¶

When \(D > N\), the lasso can select at most \(N\) nonzero parameters.
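A quick empirical illustration of this (assuming scikit-learn; the dataset and alpha are arbitrary): with \(N = 20\) observations and \(D = 100\) features, the fitted lasso should report at most 20 nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, D = 20, 100                                 # more features than observations
X = rng.standard_normal((N, D))
w_true = np.zeros(D)
w_true[:5] = 3.0                               # only 5 truly relevant features
y = X @ w_true + 0.1 * rng.standard_normal(N)

coef = Lasso(alpha=0.1, max_iter=50_000).fit(X, y).coef_
print(f"nonzero coefficients: {np.count_nonzero(coef)} (N = {N}, D = {D})")
```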

Model selection¶

It is tempting to use \(l_1\) regularization to estimate the set of relevant variables. In some cases we can recover the true parameters \(w^*\). If a method recovers the true set of relevant variables as \(N \rightarrow \infty\), we say it is model selection consistent.

However, \(l_1\) regularization does not guarantee this in general.

  1. Since the parameters are shrunken, to recover the true data-generating parameters we may first find the nonzero parameters and then fit regular OLS on just those features. This process is called debiasing (see the sketch after this list).

  2. If the data is perturbed slightly, \(l_1\) regularization can give quite different results. We can, however, use the bootstrap to find the most stable variables.
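A minimal sketch of the debiasing procedure from item 1, assuming scikit-learn (the function name and alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def debiased_lasso(X, y, alpha=0.1):
    """Use the lasso only for support selection, then refit OLS on the selected features."""
    lasso = Lasso(alpha=alpha).fit(X, y)
    support = np.flatnonzero(lasso.coef_)           # indices of nonzero coefficients
    if support.size == 0:
        return lasso.coef_                          # nothing selected: return the all-zero fit
    ols = LinearRegression().fit(X[:, support], y)  # unpenalized refit on the support
    w = np.zeros(X.shape[1])
    w[support] = ols.coef_
    return w
```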

Fitting¶

In general the fastest approach is coordinate descent.
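A minimal pedagogical sketch of cyclic coordinate descent for the objective \( \frac{1}{2}||y - Xw||_2^2 + \lambda ||w||_1 \) (function names are illustrative; library solvers add many refinements):

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator, the proximal map of lam * |.|."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    N, D = X.shape
    w = np.zeros(D)
    col_sq_norms = (X ** 2).sum(axis=0)   # X_j^T X_j for each feature j
    residual = y - X @ w                  # kept up to date as w changes
    for _ in range(n_iters):
        for j in range(D):
            # rho = X_j^T (partial residual with feature j removed)
            rho = X[:, j] @ residual + col_sq_norms[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq_norms[j]
            residual += X[:, j] * (w[j] - w_new)   # incremental residual update
            w[j] = w_new
    return w
```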

Alternatively, we can use proximal gradient descent.
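A similar sketch of proximal gradient descent (ISTA) for the same objective; the proximal operator of the \(l_1\) penalty is soft-thresholding:

```python
import numpy as np

def ista_lasso(X, y, lam, n_iters=500):
    """Proximal gradient (ISTA) for 0.5 * ||y - Xw||^2 + lam * ||w||_1."""
    N, D = X.shape
    w = np.zeros(D)
    L = np.linalg.norm(X, ord=2) ** 2     # Lipschitz constant of the smooth part's gradient
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)          # gradient of the smooth RSS term
        z = w - grad / L                  # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of the l1 term
    return w
```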

Variants¶

Fused lasso¶

In some problem settings (e.g., functional data analysis), we want neighboring coefficients to be similar to each other (so the estimated function becomes smoother), in addition to being sparse. This can be done by using a special prior.
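A common form of the corresponding penalized objective (a sketch; \(\lambda_1\) and \(\lambda_2\) are illustrative tuning parameters and exact formulations vary) is

\[ f(w) = NLL(w) + \lambda_1 \sum_{j=1}^D |w_j| + \lambda_2 \sum_{j=1}^{D-1} |w_{j+1} - w_j| \]

where the second penalty term encourages neighboring coefficients to be close to each other.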

Group lasso¶

Here we partition the features into groups; the penalty acts on whole groups of coefficients, so entire groups are selected or set to zero together, and each group can be given its own regularization weight.
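A typical form of the penalty (a sketch, writing \(w_g\) for the subvector of coefficients in group \(g\) and \(\lambda_g\) for that group's regularization weight) is

\[ f(w) = NLL(w) + \sum_{g} \lambda_g ||w_g||_2 \]

Because the group norm is not squared, entire groups of coefficients tend to be set to zero at once.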

Nonconvex regularizers¶

Here we put a prior that has a larger spike at 0 and heavier tails than the Laplace prior.
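One example of such a prior (a sketch; many other nonconvex penalties exist) is the exponential power (bridge) prior with exponent \(b < 1\),

\[ p(w_j|\lambda, b) \propto e^{-\lambda |w_j|^b}, \quad 0 < b < 1 \]

which corresponds to the nonconvex penalty \(\lambda \sum_j |w_j|^b\).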