Forward vs reverse KL divergence¶
KL divergence is not symmetric in its arguments: minimizing KL(q||p) with respect to q gives different behavior than minimizing KL(p||q).
Figure: (a) minimizing the forward KL(p||q); (b), (c) minimizing the reverse KL(q||p).
We can see that minimizing KL(p||q) over-estimates the support of p (spreading q over both modes), whereas minimizing KL(q||p) picks out only one mode.
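To make this concrete, here is a minimal numerical sketch (my own illustration, not code from the text; the bimodal target, the grid, and the parameter ranges are all assumptions): we fit a single Gaussian q to a two-mode mixture p, once by minimizing the forward KL(p||q) and once the reverse KL(q||p).

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: equal mixture of N(-3, 1) and N(3, 1)
p = 0.5 * gauss(x, -3.0, 1.0) + 0.5 * gauss(x, 3.0, 1.0)

def kl(a, b):
    # Riemann-sum approximation of the integral of a(x) log(a(x)/b(x)) dx;
    # b is clamped to avoid division by an underflowed zero
    mask = a > 1e-12
    b = np.maximum(b, 1e-300)
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Brute-force search over the Gaussian parameters (mu, sigma) of q
best = {"forward": (np.inf, None), "reverse": (np.inf, None)}
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 5, 46):
        q = gauss(x, mu, sigma)
        for name, d in (("forward", kl(p, q)), ("reverse", kl(q, p))):
            if d < best[name][0]:
                best[name] = (d, (round(mu, 2), round(sigma, 2)))

print("argmin KL(p||q):", best["forward"][1])  # broad q covering both modes
print("argmin KL(q||p):", best["reverse"][1])  # narrow q sitting on one mode
```

The forward fit lands near the moment-matched solution (mean between the modes, large variance), while the reverse fit sits on a single mode.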
Reverse KL(q||p) (I-projection, information projection)¶
By definition we have:

$$\mathrm{KL}(q\|p) = \sum_x q(x)\log\frac{q(x)}{p(x)}$$
This is infinite if p(x)=0 and q(x)>0. Thus if p(x)=0 we must ensure q(x)=0. We say that the reverse KL is zero forcing for q. Hence q will typically under-estimate the support of p.
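A tiny discrete example (my own, not from the text; the three-state distributions are arbitrary) makes the zero-forcing constraint concrete: as soon as q places mass on a state where p has none, the reverse KL becomes infinite.

```python
import numpy as np

def kl(a, b):
    """KL(a||b) = sum_x a(x) log(a(x)/b(x)), with the convention 0 log 0 = 0."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

p = np.array([0.5, 0.5, 0.0])       # p puts no mass on the third state
q_bad = np.array([1/3, 1/3, 1/3])   # q > 0 where p = 0
q_ok = np.array([0.3, 0.7, 0.0])    # q = 0 wherever p = 0

print(kl(q_bad, p))  # inf: reverse KL rules this q out entirely
print(kl(q_ok, p))   # finite: a zero-forcing q is acceptable
```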
Forward KL(p||q) (M-projection, moment projection)¶
By definition we have:

$$\mathrm{KL}(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$$

This is infinite if q(x)=0 and p(x)>0. So if p(x)>0 we must ensure that q(x)>0. We say that the forward KL is zero avoiding for q. Hence q will typically over-estimate the support of p. The reason why it is called moment projection is that it forces q to match the empirical moments of p.
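As a quick sanity check on the moment-matching claim (a standard calculation, not reproduced from the text), take q to be a Gaussian N(μ, σ²) and minimize the forward KL over its parameters:

$$\mathrm{KL}(p\|q) = -\mathbb{E}_p[\log q(x)] + \text{const} = \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{\mathbb{E}_p[(x-\mu)^2]}{2\sigma^2} + \text{const}$$

Setting the derivative with respect to μ to zero gives μ = 𝔼_p[x], and setting the derivative with respect to σ² to zero gives σ² = 𝔼_p[(x−μ)²] = Var_p[x], i.e. q matches the first two moments of p.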
Alpha divergence¶
Often if we minimize KL(q||p) where q is factorized, the result is an approximation that is overconfident.
We can create a family of divergence measures indexed by a parameter α∈R by defining the alpha divergence as follows:

$$D_\alpha(p\|q) = \frac{4}{1-\alpha^2}\left(1 - \sum_x p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\right)$$
This measure satisfies Dα(p||q)=0⟺p=q, but it is not symmetric, hence it is not a metric. KL(p||q) corresponds to the limit α→1, whereas KL(q||p) corresponds to α→−1. When α=0, we get a symmetric divergence measure that is linearly related to the Hellinger distance, defined by:

$$D_H(p\|q) = \sum_x \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2$$
$\sqrt{D_H(p\|q)}$ is a valid distance metric.
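Here is a small numerical sketch (my own check, using arbitrary example distributions, not from the text) of these limiting relationships for discrete p and q:

```python
import numpy as np

def alpha_div(p, q, alpha):
    """D_alpha(p||q) = 4/(1-alpha^2) * (1 - sum_x p^{(1+alpha)/2} q^{(1-alpha)/2})."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 4.0 / (1.0 - alpha**2) * (1.0 - np.sum(p**((1 + alpha) / 2) * q**((1 - alpha) / 2)))

def kl(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(alpha_div(p, q, 0.999), kl(p, q))    # alpha -> +1 approaches KL(p||q)
print(alpha_div(p, q, -0.999), kl(q, p))   # alpha -> -1 approaches KL(q||p)

hellinger = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
print(alpha_div(p, q, 0.0), 2 * hellinger)  # alpha = 0 gives exactly 2 * D_H(p||q)
```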