Maximum entropy derivation
The exponential family is the distribution that makes the fewest assumptions about the data, subject to a specific set of user-specified constraints. In particular, suppose all we know are the expected values of certain features or functions:

\(\sum_x f_k(x) p(x) = F_k\)

where \(F_k\) are known constants and \(f_k(x)\) is an arbitrary function.
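To make the constraints concrete, here is a minimal sketch (with invented samples and arbitrary feature functions, not anything from the derivation) of computing the empirical moments \(F_k\) that the maxent distribution would be required to match:

```python
import numpy as np

# Hypothetical samples of a discrete variable x (values invented).
samples = np.array([0, 1, 1, 2, 3, 1, 2, 0, 1, 2])

# Two arbitrary feature functions: f_1(x) = x and f_2(x) = x**2.
def f1(x): return x
def f2(x): return x ** 2

# Empirical moments F_k = mean of f_k over the samples; these become the
# constraints sum_x f_k(x) p(x) = F_k on the maxent distribution.
F = np.array([f1(samples).mean(), f2(samples).mean()])
print(F)   # [1.3, 2.5] for the samples above
```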
The principle of maximum entropy, or maxent, says we should pick the distribution with maximum entropy (closest to uniform), subject to the constraint that the moments of the distribution match the empirical moments of the specified functions.
To maximize the entropy subject to these moment constraints, and the constraints that \(p(x) \ge 0\) and \(\sum_x p(x) = 1\), we need to use Lagrange multipliers. The Lagrangian is given by:

\(J(p, \lambda) = -\sum_x p(x) \log p(x) + \lambda_0 \left(1 - \sum_x p(x)\right) + \sum_k \lambda_k \left(F_k - \sum_x p(x) f_k(x)\right)\)
We treat \(p\) as a fixed-length vector (since we assume that \(x\) is discrete); then we have:

\(\frac{\partial J}{\partial p(x)} = -1 - \log p(x) - \lambda_0 - \sum_k \lambda_k f_k(x)\)
If we set this to 0, we get:

\(p(x) = \frac{1}{Z} \exp\left(-\sum_k \lambda_k f_k(x)\right)\)

where \(Z = e^{1 + \lambda_0}\).
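As a quick numerical sanity check, the sketch below (with an invented toy support \(\{0, 1, 2\}\), feature \(f_1(x) = x\), and multiplier \(\lambda_1 = 0.7\)) verifies that a distribution of this Gibbs form makes the derivative above vanish once \(\lambda_0 = \log Z - 1\), i.e. \(Z = e^{1 + \lambda_0}\):

```python
import numpy as np

# Toy support and a single arbitrary feature f_1(x) = x (values assumed).
xs = np.array([0.0, 1.0, 2.0])
lam1 = 0.7

w = np.exp(-lam1 * xs)                        # exp(-sum_k lambda_k f_k(x))
Z = w.sum()
p = w / Z                                     # candidate Gibbs distribution
lam0 = np.log(Z) - 1.0                        # equivalent to Z = e^(1 + lambda_0)

grad = -1.0 - np.log(p) - lam0 - lam1 * xs    # dJ/dp(x) from above
print(np.allclose(grad, 0.0))                 # True: a stationary point
```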
Since \(p(x)\) is a probability distribution, it has to sum to 1:

\(1 = \sum_x p(x) = \frac{1}{Z} \sum_x \exp\left(-\sum_k \lambda_k f_k(x)\right)\)
Hence the normalization constant becomes:
\(Z = \sum_x \exp(- \sum_k \lambda_k f_k(x))\)
Thus the maxent distribution \(p(x)\) has the form of the exponential family, also known as the Gibbs distribution.
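Putting the pieces together, here is a minimal sketch of actually fitting a maxent distribution: we solve for the multipliers \(\lambda_k\) by minimizing the convex dual \(\log Z(\lambda) + \sum_k \lambda_k F_k\), whose gradient is \(F_k - \mathbb{E}_p[f_k(x)]\), so the minimizer matches the moments. The support, feature, and target moment below are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Invented toy problem: support {0,...,5}, one feature f_1(x) = x,
# and a target moment F_1 = E[f_1(X)] = 1.5.
xs = np.arange(6)
feats = np.stack([xs])          # shape (K, |X|), here K = 1 feature
F = np.array([1.5])

def dual(lam):
    # Convex dual objective: log Z(lambda) + sum_k lambda_k F_k.
    logits = -lam @ feats       # -sum_k lambda_k f_k(x), one entry per x
    return logsumexp(logits) + lam @ F

lam = minimize(dual, x0=np.zeros(len(F))).x
p = np.exp(-lam @ feats)
p /= p.sum()                    # the maxent / Gibbs distribution

print(p)                        # fitted distribution over {0,...,5}
print(feats @ p)                # its moment E[f_1(X)]; should be close to 1.5
```

Minimizing this dual is one standard way to fit maxent models; the same objective shows up when training logistic regression and conditional random fields.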