Forward vs reverse KL divergence

KL divergence is not symmetric in its arguments: minimizing KL(q||p) with respect to q gives different behavior than minimizing KL(p||q).

Figure: (a) minimizing KL(p||q); (b), (c) minimizing KL(q||p).

We can see that minimizing KL(p||q) produces an approximation that over-estimates the support of p (spreading over all modes), whereas minimizing KL(q||p) locks onto a single mode.
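As a quick numerical illustration of this asymmetry (the two discrete distributions below are arbitrary choices, not taken from the text), we can compute both directions of the KL divergence:

```python
# Minimal sketch: the two KL directions generally give different values.
import numpy as np

def kl(a, b):
    """KL(a || b) = sum_x a(x) ln(a(x) / b(x)) for discrete distributions."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = a > 0  # terms with a(x) = 0 contribute 0 by convention
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

p = [0.8, 0.15, 0.05]   # hypothetical distributions chosen for illustration
q = [0.4, 0.3, 0.3]

print(kl(p, q))   # KL(p || q) ~= 0.36
print(kl(q, p))   # KL(q || p) ~= 0.47
```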

Reverse KL(q||p) (I-projection, information projection)

By definition we have:

$$\mathrm{KL}(q \,\|\, p) = \sum_x q(x) \ln \frac{q(x)}{p(x)}$$

This is infinite if p(x) = 0 and q(x) > 0. Thus if p(x) = 0 we must ensure q(x) = 0. We say that the reverse KL is zero forcing for q. Hence q will typically under-estimate the support of p.
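A minimal sketch of the zero-forcing behaviour in the discrete case (the particular distributions are hypothetical):

```python
# Any q with mass where p(x) = 0 gets an infinite reverse KL, so the
# minimizer must satisfy q(x) = 0 wherever p(x) = 0.
import numpy as np

def reverse_kl(q, p):
    """KL(q || p); returns inf if q puts mass where p has none."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    if np.any((q > 0) & (p == 0)):
        return np.inf
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = [0.5, 0.5, 0.0]          # p has no mass on the third state
q_bad = [0.4, 0.4, 0.2]      # q spills outside the support of p
q_ok = [0.3, 0.7, 0.0]       # q stays inside the support of p

print(reverse_kl(q_bad, p))  # inf
print(reverse_kl(q_ok, p))   # ~= 0.08 (finite)
```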

Forward KL(p||q) (M-projection, moment projection)

$$\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}$$

This is infinite if q(x) = 0 and p(x) > 0. So if p(x) > 0 we must ensure that q(x) > 0. We say that the forward KL is zero avoiding for q. Hence q will typically over-estimate the support of p. It is called the moment projection because minimizing it forces q to match the moments of p (exactly so when q lies in an exponential family).
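To see both behaviours side by side, the sketch below fits a single Gaussian q = N(m, s²) to a bimodal mixture by brute-force minimization of each divergence on a grid. The mixture, grid, and search ranges are illustrative choices, not anything prescribed by the text; the forward-KL fit should recover the moment-matched (wide) Gaussian, while the reverse-KL fit should collapse onto one mode.

```python
# Forward KL covers both modes (zero avoiding, moment matching);
# reverse KL picks a single mode (zero forcing).
import numpy as np

x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]

def log_normal_pdf(x, mean, std):
    # log density of N(mean, std^2), evaluated pointwise
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

# Bimodal target: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2)
p = 0.5 * np.exp(log_normal_pdf(x, -2.0, 0.5)) + 0.5 * np.exp(log_normal_pdf(x, 2.0, 0.5))
log_p = np.log(p)

def forward_kl(m, s):
    # KL(p || q) with q = N(m, s^2), approximated by a Riemann sum on the grid
    log_q = log_normal_pdf(x, m, s)
    return np.sum(p * (log_p - log_q)) * dx

def reverse_kl(m, s):
    # KL(q || p) with q = N(m, s^2)
    log_q = log_normal_pdf(x, m, s)
    q = np.exp(log_q)
    return np.sum(q * (log_q - log_p)) * dx

means = np.linspace(-4.0, 4.0, 81)
stds = np.linspace(0.2, 4.0, 77)

def best_fit(objective):
    # exhaustive grid search over (mean, std) of q
    return min((objective(m, s), m, s) for m in means for s in stds)

_, m_fwd, s_fwd = best_fit(forward_kl)
_, m_rev, s_rev = best_fit(reverse_kl)

print("forward KL fit:", m_fwd, s_fwd)  # mean ~ 0, std ~ 2: covers both modes
print("reverse KL fit:", m_rev, s_rev)  # mean ~ +/-2, std ~ 0.5: a single mode

# Moment matching: the forward-KL fit reproduces the mean and std of p
mean_p = np.sum(x * p) * dx
std_p = np.sqrt(np.sum((x - mean_p) ** 2 * p) * dx)
print("moments of p:", mean_p, std_p)   # ~ 0 and ~ 2.06
```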

Alpha divergence

Often, if we minimize KL(q||p) where q is factorized, the result is an approximation that is overconfident.

We can create a family of divergence measures indexed by a parameter α ∈ ℝ by defining the alpha divergence as follows:

$$D_\alpha(p \,\|\, q) = \frac{4}{1 - \alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\, dx\right)$$
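To make the definition concrete, here is a small numerical sketch in the discrete case (the distributions p and q are arbitrary choices); as discussed just below, taking α close to ±1 recovers the two KL divergences:

```python
# Evaluate D_alpha near alpha = +1 and alpha = -1 and compare to the KLs.
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

def alpha_div(p, q, alpha):
    """D_alpha(p||q) = 4/(1 - alpha^2) * (1 - sum_x p^{(1+alpha)/2} q^{(1-alpha)/2})."""
    s = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

print(alpha_div(p, q, 0.999), kl(p, q))    # both ~ KL(p||q) ~= 0.40
print(alpha_div(p, q, -0.999), kl(q, p))   # both ~ KL(q||p) ~= 0.37
print(alpha_div(p, q, 0.0))                # the symmetric alpha = 0 case
```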

This measure satisfies D_α(p||q) = 0 if and only if p = q, but it is not symmetric in general, hence it is not a metric. KL(p||q) corresponds to the limit α → 1, whereas KL(q||p) corresponds to α → −1. When α = 0, we get a symmetric divergence measure that is linearly related to the Hellinger distance, defined by:

$$D_H(p \,\|\, q) = \int \left(p(x)^{1/2} - q(x)^{1/2}\right)^2 dx$$
  • The square root of D_H(p||q) is a valid distance metric: it is symmetric, non-negative, and satisfies the triangle inequality.
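A small check of both claims on random discrete distributions (the test itself is my own sketch, not from the text): expanding the square gives D_H = 2 − 2∫√(p q) dx for normalized densities, so D_0 = 4(1 − ∫√(p q) dx) = 2 D_H, and √D_H behaves like a metric.

```python
# Verify D_0 = 2 * D_H and spot-check the triangle inequality for sqrt(D_H).
import numpy as np

rng = np.random.default_rng(0)

def rand_dist(k):
    w = rng.random(k)
    return w / w.sum()

def hellinger_sq(p, q):
    """D_H(p||q) = sum_x (sqrt(p(x)) - sqrt(q(x)))^2 (discrete analogue)."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def alpha_zero(p, q):
    """D_0(p||q) = 4 * (1 - sum_x sqrt(p(x) q(x)))."""
    return 4.0 * (1.0 - float(np.sum(np.sqrt(p * q))))

def hell_dist(a, b):
    return np.sqrt(hellinger_sq(a, b))

for _ in range(1000):
    p, q, r = rand_dist(5), rand_dist(5), rand_dist(5)
    assert np.isclose(alpha_zero(p, q), 2.0 * hellinger_sq(p, q))            # linear relation
    assert np.isclose(hellinger_sq(p, q), hellinger_sq(q, p))                # symmetry
    assert hell_dist(p, r) <= hell_dist(p, q) + hell_dist(q, r) + 1e-12      # triangle inequality
print("all checks passed")
```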