Mixture of multinoullis

Consider a mixture model where the data consists of D-dimensional bit vectors. The class-conditional density is a product of Bernoullis:

$$p(x_i \mid z_i = k, \theta) = \prod_{j=1}^{D} \mathrm{Ber}(x_{ij} \mid \mu_{jk})$$
  • $\mu_{jk}$ is the probability that bit j turns on in cluster k.
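
As a minimal sketch in NumPy (the name `log_bernoulli_product` is my own), the class-conditional log-density of a single bit vector can be evaluated in log space, which avoids underflow for large D:

```python
import numpy as np

def log_bernoulli_product(x, mu_k):
    """log p(x | z = k, theta) for a bit vector x and cluster means mu_k."""
    eps = 1e-12  # guard against log(0) when some mu_jk is exactly 0 or 1
    return np.sum(x * np.log(mu_k + eps) + (1 - x) * np.log(1 - mu_k + eps))

# e.g. a D = 3 bit vector under a cluster that favours the first two bits:
print(log_bernoulli_product(np.array([1, 1, 0]), np.array([0.9, 0.8, 0.1])))
```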

The mean and covariance of this mixture model are given by:

$$\mathbb{E}[x] = \sum_k \pi_k \mu_k$$
$$\mathrm{cov}[x] = \sum_k \pi_k \left[ \Sigma_k + \mu_k \mu_k^T \right] - \mathbb{E}[x]\,\mathbb{E}[x]^T$$
  • $\Sigma_k = \mathrm{diag}(\mu_{jk}(1 - \mu_{jk}))$

Unlike a single product of Bernoullis, the mixture distribution can capture correlations between variables, since $\mathrm{cov}[x]$ is not diagonal; the numerical sketch below illustrates this.
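
A quick numerical check of these formulas, with made-up two-cluster parameters, shows nonzero off-diagonal covariance entries:

```python
import numpy as np

# Made-up two-cluster model in D = 2 dimensions.
pi = np.array([0.5, 0.5])                    # mixing weights pi_k
mu = np.array([[0.9, 0.9],                   # row k holds the cluster mean mu_k
               [0.1, 0.1]])

mean = pi @ mu                               # E[x] = sum_k pi_k mu_k

cov = -np.outer(mean, mean)                  # start from -E[x] E[x]^T
for k in range(len(pi)):
    Sigma_k = np.diag(mu[k] * (1 - mu[k]))   # diag(mu_jk (1 - mu_jk))
    cov += pi[k] * (Sigma_k + np.outer(mu[k], mu[k]))

print(cov)  # off-diagonal entries are 0.16, not 0: the bits are correlated
```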

Clustering

We can use a mixture model to create a generative classifier, where we model each class-conditional density p(x|y=c) by a mixture distribution.

We fit a mixture model and compute $p(z_i = k \mid x_i, \theta)$, which represents the posterior probability that point i belongs to cluster k. This is known as the responsibility of cluster k for point i, and can be computed as:

$$r_{ik} \triangleq p(z_i = k \mid x_i, \theta) = \frac{p(z_i = k \mid \theta)\, p(x_i \mid z_i = k, \theta)}{\sum_{k'=1}^{K} p(z_i = k' \mid \theta)\, p(x_i \mid z_i = k', \theta)}$$

This is also known as soft clustering. We can convert it into hard clustering by taking the MAP estimate:

$$\hat{z}_i = \arg\max_k r_{ik} = \arg\max_k \left[ \log p(x_i \mid z_i = k, \theta) + \log p(z_i = k \mid \theta) \right]$$
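
A minimal sketch of both the soft and hard assignments, assuming the per-cluster log-likelihoods $\log p(x_i \mid z_i = k, \theta)$ have already been evaluated (the function names here are my own):

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(log_lik, log_pi):
    """Soft clustering: r[i, k] = p(z_i = k | x_i, theta).

    log_lik[i, k] = log p(x_i | z_i = k, theta); log_pi[k] = log p(z_i = k | theta).
    """
    log_joint = log_lik + log_pi   # log numerator; denominator is its logsumexp
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

def hard_assignments(log_lik, log_pi):
    """Hard clustering via the MAP estimate z_hat_i = argmax_k r_ik."""
    return np.argmax(log_lik + log_pi, axis=1)
```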

Mixtures of experts

We can also use mixture models for discriminative classification or regression. For example, we can fit three different linear regression functions, each applying to a different part of the input space:

$$p(y_i \mid x_i, z_i = k, \theta) = \mathcal{N}(y_i \mid w_k^T x_i, \sigma_k^2)$$
$$p(z_i \mid x_i, \theta) = \mathrm{Cat}(z_i \mid \mathcal{S}(V^T x_i))$$
  • $\mathcal{S}$ is the softmax function.

This is called a mixture of experts: each submodel is an expert for its region of input space, and $p(z_i \mid x_i, \theta)$ is the gating function that decides which expert to use.
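
A small sketch of the resulting predictive density $p(y \mid x, \theta) = \sum_k p(z = k \mid x, \theta)\, p(y \mid x, z = k, \theta)$; all parameter values below are invented for illustration:

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import norm

# Invented parameters for K = 3 experts on 1-D inputs with a bias term,
# so x_tilde = [1, x], W is K x 2 and V is K x 2.
W = np.array([[0.0, 1.0], [2.0, -1.0], [-1.0, 0.5]])   # expert weights w_k
sigma = np.array([0.1, 0.2, 0.1])                      # expert noise scales sigma_k
V = np.array([[0.0, -2.0], [0.0, 0.0], [0.0, 2.0]])    # gating weights

def predictive_density(y, x):
    """p(y | x, theta) = sum_k Cat(k | S(V x)) * N(y | w_k^T x, sigma_k^2)."""
    x_tilde = np.array([1.0, x])
    gates = softmax(V @ x_tilde)                       # gating probabilities
    return np.sum(gates * norm.pdf(y, loc=W @ x_tilde, scale=sigma))

print(predictive_density(0.8, 1.0))
```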