Exponential family

The exponential family is important because:

  • It can be shown that, under certain regularity conditions, the exponential family is the only family of distributions with finite-sized sufficient statistics, meaning that we can compress the data into a fixed-sized summary without loss of information.

  • The exponential family is essentially the only family of distributions for which conjugate priors exist, which simplifies the computation of the posterior.

  • The exponential family can be shown to be the family of distributions that makes the fewest assumptions subject to some user-chosen constraints.

  • The exponential family is at the core of generalized linear models.

  • The exponential family is at the core of variational inference.

Definition

A distribution belongs to the exponential family if it can be written in the form:

\[p(x|\theta) = \frac{1}{Z(\theta)} h(x) \exp{[\theta^T \phi(x)]} = h(x)\exp[\theta^T \phi(x) - A(\theta)]\]
  • \(Z(\theta) = \int_{X^m} h(x) \exp[\theta^T \phi(x)]dx\) is the partition function, which is log-convex in \(\theta\).

  • \(h(x)\) is the scaling constant (also called the base measure), often set to 1.

  • \(A(\theta) = \log Z(\theta)\) is the log partition function, or cumulant function.

  • \(\theta\) are called the natural parameters or canonical parameters. Because \(A(\theta)\) is convex, the density is log-concave in its natural parameters.

  • \(\phi(x) \in \mathbb{R}^d\) is called the vector of sufficient statistics. If \(\phi(x) = x\), we say it is a natural exponential family.
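As a concrete instance of the definition, the Bernoulli distribution can be written in this form. The sketch below assumes \(h(x) = 1\), \(\phi(x) = x\), natural parameter \(\theta = \log\frac{\mu}{1-\mu}\) and \(A(\theta) = \log(1 + e^\theta)\); the function names and the value \(\mu = 0.3\) are illustrative choices, not part of the original notes.

```python
import math

def bernoulli_pmf(x, mu):
    """Standard Bernoulli pmf with mean parameter mu."""
    return mu if x == 1 else 1.0 - mu

def log_partition(theta):
    """A(theta) = log Z(theta) = log(1 + e^theta) for the Bernoulli."""
    return math.log1p(math.exp(theta))

def expfam_pmf(x, theta):
    """h(x) * exp(theta * phi(x) - A(theta)) with h(x) = 1 and phi(x) = x."""
    return math.exp(theta * x - log_partition(theta))

mu = 0.3                             # illustrative mean parameter
theta = math.log(mu / (1.0 - mu))    # natural parameter (log-odds)

# Both parameterizations should assign the same probability to each outcome.
for x in (0, 1):
    print(x, bernoulli_pmf(x, mu), expfam_pmf(x, theta))
```

Both columns should agree up to floating-point error, which shows that the mean parameter \(\mu\) and the natural parameter \(\theta\) describe the same distribution.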

We can further generalize the exponential family by writing:

\[p(x|\theta) = h(x)\exp[\eta(\theta)^T \phi(x) - A(\eta(\theta))]\]
  • \(\eta\) is a function that maps the parameter \(\theta\) to the canonical parameter \(\eta = \eta(\theta )\).

    • If \(\dim(\theta) < \dim(\eta(\theta))\), it is called a curved exponential family, which means we have more sufficient statistics than parameters.

    • If \(\eta(\theta) = \theta\), the model is said to be in canonical form.
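
A standard illustration of a curved exponential family, added here as a sketch: take a Gaussian whose variance is tied to its mean, \(\sigma^2 = \mu^2\) with \(\mu \neq 0\) (this constraint is an assumption chosen for the example). The Gaussian has sufficient statistics and canonical parameters

\[\phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \qquad \eta = \begin{pmatrix} \mu / \sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}\]

so with the single free parameter \(\theta = \mu\) we get

\[\eta(\theta) = \begin{pmatrix} 1/\mu \\ -1/(2\mu^2) \end{pmatrix}\]

Here \(\dim(\theta) = 1 < 2 = \dim(\eta(\theta))\): the canonical parameters trace out a one-dimensional curve inside a two-dimensional space.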

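Minimal exponential distribution

An exponential family is said to be minimal if each distribution in it is associated with a unique parameter \(\theta\). This holds when the components of the sufficient statistics are linearly independent, that is, when no nonzero vector \(a\) makes \(a^T \phi(x)\) constant for all \(x\).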
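
To see what happens when the sufficient statistics are linearly dependent, the sketch below uses a hypothetical over-complete Bernoulli with \(\phi(x) = [I(x=0), I(x=1)]\) and \(h(x) = 1\). The two indicators always sum to 1, so shifting every component of \(\theta\) by the same constant should leave the distribution unchanged. The function name and parameter values are assumptions made for illustration.

```python
import numpy as np

def overcomplete_bernoulli(theta):
    """p(x) for x in {0, 1} with phi(x) = [I(x=0), I(x=1)] and h(x) = 1."""
    phi = np.eye(2)                          # row x holds phi(x)
    scores = phi @ theta                     # theta^T phi(x) for each x
    log_z = np.log(np.sum(np.exp(scores)))   # A(theta) = log Z(theta)
    return np.exp(scores - log_z)

theta_a = np.array([0.0, 1.0])
theta_b = theta_a + 5.0   # add the same constant to every component

# Two different parameter vectors, one and the same distribution.
print(overcomplete_bernoulli(theta_a))
print(overcomplete_bernoulli(theta_b))
```

Dropping one of the two indicators, so that \(\phi(x) = x\), removes the redundancy and gives a minimal representation.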

Examples

What does not belong

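Log partition function

An important property of the exponential family is that the derivatives of the log partition function generate the cumulants of the sufficient statistics: the first derivative gives the mean, \(\nabla A(\theta) = E[\phi(x)]\), and the second derivative gives the covariance, \(\nabla^2 A(\theta) = \mathrm{Cov}[\phi(x)]\). For this reason \(A(\theta)\) is sometimes called a cumulant function.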
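
A quick numerical check of this property, sketched for the Bernoulli distribution with \(A(\theta) = \log(1 + e^\theta)\): the first derivative of \(A\) should equal the mean \(\mu\) and the second derivative the variance \(\mu(1-\mu)\). The central finite-difference helpers, the step size and \(\mu = 0.3\) are assumptions chosen for the example.

```python
import math

def A(theta):
    """Log partition function of the Bernoulli: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def first_derivative(f, t, eps=1e-4):
    """Central finite-difference estimate of f'(t)."""
    return (f(t + eps) - f(t - eps)) / (2 * eps)

def second_derivative(f, t, eps=1e-4):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + eps) - 2 * f(t) + f(t - eps)) / eps**2

mu = 0.3                           # illustrative mean parameter
theta = math.log(mu / (1 - mu))    # corresponding natural parameter

mean_from_A = first_derivative(A, theta)   # should be close to mu
var_from_A = second_derivative(A, theta)   # should be close to mu * (1 - mu)

print(mean_from_A, mu)
print(var_from_A, mu * (1 - mu))
```

In practice an automatic differentiation library would replace the finite differences; they are used here only to keep the sketch dependency-free.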

Pitman-Koopman-Darmois theorem

Under certain regularity conditions, the exponential family is the only family of distributions with sufficient statistics of finite size, that is, of a size that does not grow with the size of the dataset.
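
To illustrate what a fixed-size summary means, the sketch below compresses a Gaussian sample into three numbers, the count and the sums of \(\phi(x) = (x, x^2)\), and recovers the maximum likelihood estimates from them alone. The sample size, seed and true parameters are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

# Fixed-size summary: the count plus the summed sufficient statistics (x, x^2).
n = data.size
s1 = data.sum()
s2 = np.square(data).sum()

# Maximum likelihood estimates computed from the summary alone.
mu_from_stats = s1 / n
var_from_stats = s2 / n - mu_from_stats**2

# The same estimates computed from the full dataset.
mu_from_data = data.mean()
var_from_data = data.var()   # divides by n, matching the MLE

print(mu_from_stats, mu_from_data)
print(var_from_stats, var_from_data)
```

The summary stays three numbers whether the dataset has a thousand points or a billion, and new observations can be folded in by adding to the running sums.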

Bayesian view

Essentially the only family for which conjugate priors exist is the exponential family. Conjugate priors can simplify exact Bayesian analysis, because the posterior has the same functional form as the prior.
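
A minimal sketch of conjugacy, using the Beta prior for a Bernoulli likelihood: the posterior is again a Beta distribution whose parameters are the prior parameters plus the observed counts. The function name, prior values and observations are hypothetical, chosen only for illustration.

```python
def beta_bernoulli_update(a, b, data):
    """Return the Beta posterior parameters after observing 0/1 data."""
    ones = sum(data)
    zeros = len(data) - ones
    # Conjugacy: the update only adds the sufficient statistics to the prior.
    return a + ones, b + zeros

a0, b0 = 2.0, 2.0                    # illustrative Beta(2, 2) prior
observations = [1, 0, 1, 1, 0, 1]    # illustrative Bernoulli draws

a_n, b_n = beta_bernoulli_update(a0, b0, observations)
posterior_mean = a_n / (a_n + b_n)

print(a_n, b_n, posterior_mean)
```

No integration is needed: the whole posterior computation reduces to adding counts.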

Maximum entropy derivation

The exponential family is the family of distributions that makes the fewest assumptions about the data, subject to a specific set of user-specified constraints. In particular, suppose all we know is the expected values of certain features or functions.

Equivalently, the distribution that maximizes the entropy \(H(p)\) under the constraint \(E_p[\phi(x)] = \alpha\) (the expected sufficient statistics equal some given value \(\alpha\)) belongs to the exponential family.
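
A brief sketch of why this holds, assuming a continuous \(x\) and \(h(x) = 1\). The Lagrange multipliers \(\theta\) (for the moment constraint) and \(\lambda_0\) (for normalization) are names introduced here for illustration. We form the Lagrangian

\[J(p, \theta, \lambda_0) = -\int p(x) \log p(x)\,dx + \theta^T \left(\int p(x) \phi(x)\,dx - \alpha\right) + \lambda_0 \left(\int p(x)\,dx - 1\right)\]

and set its derivative with respect to \(p(x)\) to zero:

\[-\log p(x) - 1 + \theta^T \phi(x) + \lambda_0 = 0 \quad\Rightarrow\quad p(x) = \exp[\theta^T \phi(x) + \lambda_0 - 1]\]

The normalization constraint then fixes \(\exp[1 - \lambda_0] = Z(\theta)\), which gives

\[p(x) = \frac{1}{Z(\theta)} \exp[\theta^T \phi(x)] = \exp[\theta^T \phi(x) - A(\theta)]\]

so the multipliers of the moment constraints play the role of the natural parameters.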

Markov random fields

Learning Markov random fields makes extensive use of exponential families.