# Learning Markov Random Fields
Markov random fields (MRFs) are more expressive than directed graphical models; unfortunately, this also makes them harder to deal with.
Let's assume an MRF of the form:

$$ p(x_1,\dots,x_n) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c) $$
We can parametrize \(p\) as follows:

$$ p(x;\theta) = \frac{\exp(\theta^T f(x))}{Z(\theta)} $$

where:
- \(f(x)\) is a vector of indicator functions (one per clique assignment)
- \(\theta\) is the vector of all model parameters, with one entry \(\log \phi_c(x'_c)\) per clique \(c\) and clique assignment \(x'_c\)
- \(Z(\theta)\) is the partition function (it differs for each setting of the parameters \(\theta\)). In Bayesian networks we have \(Z(\theta) = 1\) for all \(\theta\); in MRFs we do not make this modeling assumption, which makes them more flexible but harder to learn.
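To make the parametrization concrete, here is a minimal sketch in Python. The toy chain MRF, its cliques, and the potential values are invented for illustration; the sketch builds the indicator features \(f(x)\), the parameters \(\theta\), and the partition function \(Z(\theta)\) by brute-force enumeration, which is feasible only for tiny models:

```python
import itertools
import math

# Hypothetical toy chain MRF over three binary variables x1 - x2 - x3
# with pairwise cliques (x1, x2) and (x2, x3).
cliques = [(0, 1), (1, 2)]  # variable indices of each clique

# theta has one entry log phi_c(x'_c) per (clique, assignment) pair;
# here phi_c = 2 when both variables are 1, and 1 otherwise.
theta = {(c, a): math.log(2.0) if a == (1, 1) else 0.0
         for c in range(len(cliques))
         for a in itertools.product([0, 1], repeat=2)}

def f(x):
    """Indicator vector of x, represented sparsely as the set of
    (clique, assignment) pairs that are active in x."""
    return {(c, tuple(x[i] for i in idx)) for c, idx in enumerate(cliques)}

def score(x):
    # theta^T f(x): sum of the log-potentials of the active indicators
    return sum(theta[k] for k in f(x))

# Partition function by brute-force enumeration over all 2^3 states
Z = sum(math.exp(score(x)) for x in itertools.product([0, 1], repeat=3))

def p(x):
    return math.exp(score(x)) / Z

# Probabilities sum to one; note Z != 1, unlike a Bayesian network
total = sum(p(x) for x in itertools.product([0, 1], repeat=3))
```

Note how \(Z(\theta)\) depends on \(\theta\): changing any potential value changes the normalizer, which is exactly what Bayesian networks avoid by construction.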
## Exponential families
The distribution

$$ p(x;\theta) = \frac{\exp(\theta^T f(x))}{Z(\theta)} $$

is an example of an exponential family, which has the following properties:
- it is log-concave in its natural parameters \(\theta\)
- \(f(x)\) is the vector of sufficient statistics, which fully describe the distribution \(p\)
- exponential families make the fewest unnecessary assumptions about the data distribution (they are the maximum-entropy distributions consistent with the given sufficient statistics).
## Maximum likelihood learning
Here we assume the following average log-likelihood:

$$ \frac{1}{|D|} \log p(D;\theta) = \frac{1}{|D|}\sum_{x \in D}\theta^T f(x) - \log Z(\theta) $$
Unfortunately, optimizing this objective is intractable in general: the gradient involves expectations under the model distribution, and computing them (like computing \(\log Z(\theta)\)) requires inference.
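For intuition, the objective can be evaluated exactly on a toy model where \(\log Z(\theta)\) is still computable by enumerating all \(2^n\) states. The feature choice below (one indicator per variable/value pair) and all sizes are illustrative assumptions, not the general clique features:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)

n = 4        # number of binary variables (tiny on purpose)
d = 2 * n    # illustrative features: one indicator per (variable, value)
theta = rng.normal(size=d)

def f(x):
    """Sufficient statistics: indicator per (variable, value) pair."""
    v = np.zeros(d)
    for i, xi in enumerate(x):
        v[2 * i + xi] = 1.0
    return v

def log_Z(theta):
    # Brute force over all 2^n states -- this sum over exponentially
    # many configurations is what makes exact learning intractable.
    return math.log(sum(math.exp(theta @ f(x))
                        for x in itertools.product([0, 1], repeat=n)))

# Average log-likelihood of a toy dataset
D = [tuple(rng.integers(0, 2, size=n)) for _ in range(20)]
avg_ll = np.mean([theta @ f(x) for x in D]) - log_Z(theta)
```

For \(n\) in the hundreds the sum inside `log_Z` has more terms than could ever be enumerated, which is why the following section turns to approximations.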
## Approximate learning
We can reduce maximum-likelihood learning to inference, in the sense that we repeatedly use inference to compute the gradient and then update the model weights with gradient descent. When exact inference is too expensive, we can plug in approximate inference instead, e.g.:
- Persistent contrastive divergence: after a step of gradient descent the model changes only a little, so we keep sampling from the same Gibbs chain instead of restarting a new one.
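A minimal sketch of persistent contrastive divergence for a pairwise binary MRF \(p(x) \propto \exp(\frac{1}{2} x^T W x)\); the sizes, learning rate, and number of steps are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, steps, lr = 5, 10, 200, 0.05                      # illustrative sizes

data = rng.integers(0, 2, size=(50, n)).astype(float)   # toy dataset
emp = data.T @ data / len(data)                         # empirical second moments

W = np.zeros((n, n))                                    # symmetric couplings
chains = rng.integers(0, 2, size=(k, n)).astype(float)  # persistent Gibbs chains

def gibbs_sweep(x, W):
    # One systematic-scan Gibbs sweep: resample each x_i given the rest.
    for i in range(n):
        logits = x @ W[:, i] - W[i, i] * x[:, i]        # exclude self-coupling
        p_on = 1.0 / (1.0 + np.exp(-logits))
        x[:, i] = (rng.random(k) < p_on).astype(float)
    return x

for _ in range(steps):
    chains = gibbs_sweep(chains, W)      # chains persist across parameter updates
    model = chains.T @ chains / k        # Monte Carlo estimate of model moments
    W += lr * (emp - model)              # moment-matching gradient step
    np.fill_diagonal(W, 0.0)             # keep the model pairwise, no self-terms
```

The key line is that `chains` is carried over between updates: because `W` moves only slightly per step, the old samples remain an almost-valid starting point for the next Gibbs sweep.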
## Pseudo-likelihood
We replace the likelihood

$$ \frac{1}{|D|}\sum_{x \in D} \log p(x;\theta) $$

with the following approximation, the pseudo-likelihood:

$$ \ell_{\text{PL}}(\theta) = \frac{1}{|D|}\sum_{x \in D}\sum_{i} \log p(x_i \mid x_{N(i)};\theta) $$

where:
- \(x_i\) is the \(i\)-th variable in \(x\)
- \(N(i)\) are the neighbors of \(i\) in the graph \(g\) (the Markov blanket of \(i\))
Here \(\log p(x_i \mid x_{N(i)};\theta)\) involves only the single variable \(x_i\), so its partition function is just a sum over the values of that one variable and is cheap to compute.
The pseudo-likelihood assumes that \(x_i\) depends mainly on its neighbors in the graph, and ignores the dependencies on other, more distant, variables.
Pseudo-likelihood is concave and works well in practice (though there are some exceptions).
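A minimal sketch of the pseudo-likelihood objective for a hypothetical pairwise binary chain MRF (the graph and weights are invented for illustration). Note that each conditional's normalizer sums over just the two values of \(x_i\), so no global \(Z(\theta)\) ever appears:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pairwise binary MRF on a chain 0-1-2-3:
# p(x) ∝ exp(sum over edges of W_ij * x_i * x_j)
n = 4
edges = [(0, 1), (1, 2), (2, 3)]
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 0.5
# Neighbors of each variable = its Markov blanket in this pairwise model
nbrs = {i: [j for j in range(n) if W[i, j] != 0.0] for i in range(n)}

def log_pseudo_likelihood(x, W):
    """sum_i log p(x_i | x_{N(i)}); each normalizer sums over just
    the two values {0, 1} of x_i, so no global Z is needed."""
    total = 0.0
    for i in range(n):
        field = sum(W[i, j] * x[j] for j in nbrs[i])
        log_norm = np.logaddexp(0.0, field)   # log(e^0 + e^field)
        total += x[i] * field - log_norm
    return total

# Average pseudo-likelihood over a toy dataset
D = rng.integers(0, 2, size=(10, n))
avg_pl = np.mean([log_pseudo_likelihood(x, W) for x in D])
```

With all weights zero, each conditional is uniform over \(\{0, 1\}\), so each data point contributes \(n \log \frac{1}{2}\); this is a handy sanity check when implementing the objective.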
## Moment matching
If we take the log-likelihood:

$$ \frac{1}{|D|} \log p(D;\theta) = \frac{1}{|D|}\sum_{x \in D}\theta^T f(x) - \log Z(\theta) $$

and differentiate it, we get:

$$ \nabla_\theta \frac{1}{|D|} \log p(D;\theta) = \mathbb{E}_{x \sim D}[f(x)] - \mathbb{E}_{x \sim p(x;\theta)}[f(x)] $$

This is the difference between the expectations of the sufficient statistics under the empirical distribution and under the model distribution. Since \(f\) in an MRF is a vector of indicator functions, with one entry \(I[x_c = \bar{x}_c]\) per clique assignment, the corresponding entry of the gradient equals:

$$ \hat{p}(x_c = \bar{x}_c) - p(x_c = \bar{x}_c;\theta) $$
Thus by training an MRF we force the model marginals to match the empirical marginals. This property is known as moment matching. More generally, when maximum likelihood chooses a distribution \(q\) within an exponential family to minimize the inclusive KL divergence \(KL(p\|q)\), the minimizer \(q^*\) matches the moments of the sufficient statistics to the corresponding moments of \(p\); this is why minimizing \(KL(p\|q)\) is called moment matching.
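To see moment matching concretely, here is a toy sketch (the singleton features, sizes, and learning rate are all illustrative assumptions): fit a tiny fully observed binary model with features \(f_i(x) = I[x_i = 1]\) by exact gradient ascent, then check that the learned model's marginals equal the empirical marginals:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy model: three binary variables with singleton features
# f_i(x) = I[x_i = 1], so E[f(x)] is exactly the vector of marginals.
n = 3
states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

data = rng.integers(0, 2, size=(40, n)).astype(float)
emp = data.mean(axis=0)              # empirical E[f(x)] = empirical marginals

theta = np.zeros(n)
for _ in range(2000):
    probs = np.exp(states @ theta)
    probs /= probs.sum()             # exact p(x; theta) by enumeration
    model = probs @ states           # model E[f(x)] = model marginals
    theta += 0.5 * (emp - model)     # gradient of the average log-likelihood

# At convergence the model marginals match the empirical marginals
probs = np.exp(states @ theta)
probs /= probs.sum()
model = probs @ states
```

The update rule is literally "empirical moments minus model moments", so the only fixed points are parameter settings whose marginals reproduce the data's marginals.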
## Learning in conditional random fields
Here we extend these learning techniques to conditional random fields.