Forward vs reverse KL divergence¶
KL divergence is not symmetric in its arguments: minimizing \(KL(q||p)\) with respect to \(q\) gives different behavior than minimizing \(KL(p||q)\).
Figure: (a) minimizing \(KL(p||q)\); (b), (c) minimizing \(KL(q||p)\).
We can see that minimizing \(KL(p||q)\) over-estimates the support of \(p\), whereas minimizing \(KL(q||p)\) locks onto only one mode.
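To make the asymmetry concrete, here is a small numerical sketch (plain NumPy; the two discrete distributions are arbitrary choices for illustration):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p||q) = sum_x p(x) log(p(x)/q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.1, 0.4, 0.5])  # arbitrary example distribution
q = np.array([0.3, 0.3, 0.4])  # arbitrary approximating distribution

print(kl(p, q))  # ~0.117
print(kl(q, p))  # ~0.154  -> different value: KL is not symmetric
```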
Reverse \(KL(q||p)\) (I-projection, information projection)¶
By definition we have:
\[KL(q||p) = \sum_x q(x) \log \frac{q(x)}{p(x)}\]
This is infinite if \(p(x) = 0\) and \(q(x) > 0\). Thus if \(p(x) = 0\) we must ensure \(q(x) = 0\). We say that the reverse KL is zero forcing for \(q\). Hence \(q\) will typically under-estimate the support of \(p\).
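The mode-seeking behaviour can be checked numerically. The sketch below (assuming NumPy/SciPy; the bimodal target, grid approximation, and starting point are arbitrary illustrative choices) fits a single Gaussian \(q\) to a two-mode mixture \(p\) by minimizing the grid-approximated reverse KL:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Bimodal target p: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2).
xs = np.linspace(-8, 8, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -2, 0.5) + 0.5 * norm.pdf(xs, 2, 0.5)

def reverse_kl(params):
    """Grid approximation of KL(q||p) = ∫ q log(q/p) dx for a Gaussian q."""
    mu, log_sigma = params
    q = norm.pdf(xs, mu, np.exp(log_sigma))
    return np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12))) * dx

# Zero forcing: q is penalized wherever p ~= 0, so the fit avoids the
# trough between the modes and settles on a single mode.
res = minimize(reverse_kl, x0=[1.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # roughly mu ~= 2, sigma ~= 0.5 (one mode only)
```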
Forward \(KL(p||q)\) (M-projection, moment projection)¶
By definition we have:
\[KL(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\]
This is infinite if \(q(x) = 0\) and \(p(x) > 0\). So if \(p(x) > 0\) we must ensure that \(q(x) > 0\). We say that the forward KL is zero avoiding for \(q\). Hence \(q\) will typically over-estimate the support of \(p\). The reason why it is called moment projection is that minimizing it forces \(q\) to match the empirical moments of \(p\).
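A matching sketch for the forward direction (same assumptions and grid setup as above): minimizing the grid-approximated \(KL(p||q)\) over a single Gaussian reproduces the mean and standard deviation of \(p\), which is the moment-matching behaviour described above:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Same bimodal target p as above.
xs = np.linspace(-8, 8, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -2, 0.5) + 0.5 * norm.pdf(xs, 2, 0.5)

def forward_kl(params):
    """Grid approximation of KL(p||q) up to a constant: -∫ p log q dx."""
    mu, log_sigma = params
    q = norm.pdf(xs, mu, np.exp(log_sigma))
    return -np.sum(p * np.log(q + 1e-12)) * dx

res = minimize(forward_kl, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Moment matching: the optimum reproduces the mean and std of p,
# so q covers both modes (zero avoiding / over-estimated support).
mean_p = np.sum(xs * p) * dx
std_p = np.sqrt(np.sum((xs - mean_p) ** 2 * p) * dx)
print(mu_hat, mean_p)    # both ~= 0
print(sigma_hat, std_p)  # both ~= sqrt(2^2 + 0.5^2) ~= 2.06
```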
Alpha divergence¶
Often if we minimize \(KL(q||p)\) where \(q\) is factorized, the result is an approximation that is overconfident.
We can create a family of divergence measures indexed by a parameter \(\alpha \in \mathbb{R}\) by defining the alpha divergence as follows:
\[D_{\alpha}(p||q) = \frac{4}{1 - \alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2} q(x)^{(1-\alpha)/2} \, dx\right)\]
This measure satisfies \(D_{\alpha}(p||q) = 0 \iff p = q\), but it is not symmetric, hence it is not a metric. \(KL(p||q)\) corresponds to the limit \(\alpha \rightarrow 1\), whereas \(KL(q||p)\) corresponds to \(\alpha \rightarrow -1\). When \(\alpha = 0\), we get a symmetric divergence measure that is linearly related to the Hellinger distance, defined by:
\[D_H(p||q) = \int \left(p(x)^{1/2} - q(x)^{1/2}\right)^2 dx\]
\(\sqrt{D_H (p||q)}\) is a valid distance metric.
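As a numerical sanity check of these limits, the sketch below (same NumPy/SciPy grid setup as above, with an arbitrary bimodal \(p\) and Gaussian \(q\)) evaluates \(D_{\alpha}\) near \(\alpha = \pm 1\) and at \(\alpha = 0\):

```python
import numpy as np
from scipy.stats import norm

# Bimodal p and a single Gaussian q on a grid (arbitrary illustration).
xs = np.linspace(-8, 8, 4001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -2, 0.5) + 0.5 * norm.pdf(xs, 2, 0.5)
q = norm.pdf(xs, 0.0, 2.0)

def alpha_div(p, q, alpha):
    """D_alpha(p||q) = 4/(1-alpha^2) * (1 - ∫ p^((1+alpha)/2) q^((1-alpha)/2) dx)."""
    integral = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2)) * dx
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

hellinger = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx

# Each pair should agree approximately on this grid.
print(alpha_div(p, q, 0.999), kl(p, q))     # alpha -> 1 recovers KL(p||q)
print(alpha_div(p, q, -0.999), kl(q, p))    # alpha -> -1 recovers KL(q||p)
print(alpha_div(p, q, 0.0), 2 * hellinger)  # at alpha = 0: D_0 = 2 * D_H
```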