
HMMs and CRFs

A comparison of Hidden Markov Models and Conditional Random Fields, two kinds of probabilistic graphical models.

I found that there was a surprising lack of comparisons available online between linear-chain conditional random fields and hidden Markov models, despite the many similarities between the two. So, in this post, I'll cover some of the differences between these two types of probabilistic graphical models. Specifically, I'll describe the different interfaces that two algorithms, the Viterbi algorithm and the forward-backward algorithm, present when used with each kind of model.

Hidden Markov Models

The diagram below illustrates the architecture of a hidden Markov model, with the hidden states on the left in blue and the observed states on the right in green. The parameters of the model are the arrows in the diagram, where the transition probabilities represent the probability of going from one hidden state to another, and the emission probabilities represent the probability of going from one hidden state to one observed state. This diagram illustrates a single time step. The observed states being modeled can be any kind of discrete data; for example, each observed state could represent a word (such as with the Subreddit Simulator).

HMM over a single time step

To expand this model over multiple time steps, we can draw consecutive samples from the model. The probability of each state depends only on the probability distribution over the states in the previous timestep and the probability of transitioning between states. This leads to a forward-generative model, as in the image below.

HMM over multiple time steps

Mathematically, the probability of an observed sequence $X = x_1, x_2, \dots, x_N$ is defined by summing over every possible sequence of hidden states $H = h_1, h_2, \dots, h_N$ in the space of hidden sequences $\mathscr{H}$:

$$p(X) = \sum_{H \in \mathscr{H}} \prod_{i = 1}^{N} p(x_i | h_i) \, p(h_i | h_{i - 1})$$
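
To make the sum concrete, here is a brute-force sketch of this quantity. The argument layout (an initial distribution, a transition matrix, and an emission matrix) is my own assumption for illustration, not part of the interfaces later in this post.

import itertools
import numpy as np


def hmm_observation_prob(obs, init, trans, emit):
    """Brute-force p(X): sums prod_i p(x_i | h_i) p(h_i | h_{i-1}) over every hidden sequence.

    obs:   length-N sequence of observed state indices.
    init:  (num_hidden,) vector, init[h] = p(h_1 = h).
    trans: (num_hidden, num_hidden) matrix, trans[a, b] = p(h_i = b | h_{i-1} = a).
    emit:  (num_hidden, num_observed) matrix, emit[h, x] = p(x | h).
    """
    num_hidden = trans.shape[0]
    total = 0.0
    for hidden in itertools.product(range(num_hidden), repeat=len(obs)):
        prob = init[hidden[0]] * emit[hidden[0], obs[0]]
        for i in range(1, len(obs)):
            prob *= trans[hidden[i - 1], hidden[i]] * emit[hidden[i], obs[i]]
        total += prob
    return total

Enumerating $\mathscr{H}$ this way costs $O(K^N)$ for $K$ hidden states, which is exactly the blowup the forward-backward algorithm below avoids.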

The forward-backward algorithm, as described in this tutorial by Michael Collins, is used to find the marginal probability $p(h_i | X)$. Assuming we know the emission probabilities $p(x_i | h_i)$ and the transition probabilities $p(h_i | h_{i - 1})$, the forward-backward algorithm is given by the equations below.

$$p(h_i | X) \propto \alpha_i \beta_i$$

where

$$\alpha_i = p(h_i, x_{1 : i}) \qquad \beta_i = p(x_{i + 1 : N} | h_i)$$

These values are computed using the recurrence relations below.

$$\alpha_i = \sum_{h_{i - 1}} p(x_i | h_i) \, p(h_i | h_{i - 1}) \, \alpha_{i - 1}$$

$$\beta_i = \sum_{h_{i + 1}} p(x_{i + 1} | h_{i + 1}) \, p(h_{i + 1} | h_i) \, \beta_{i + 1}$$
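
As a sanity check on these recurrences, here is a minimal numpy transcription. The parameter names mirror the brute-force sketch above and are assumptions for illustration; `init` stands in for the initial hidden-state distribution $p(h_1)$.

import numpy as np


def forward_backward(obs, init, trans, emit):
    """Computes the marginals p(h_i | X) via the alpha/beta recurrences above."""
    N, num_hidden = len(obs), trans.shape[0]
    alpha = np.zeros((N, num_hidden))
    beta = np.zeros((N, num_hidden))
    alpha[0] = init * emit[:, obs[0]]                          # alpha_1 = p(h_1) p(x_1 | h_1)
    for i in range(1, N):
        alpha[i] = emit[:, obs[i]] * (alpha[i - 1] @ trans)    # sum over h_{i-1}
    beta[N - 1] = 1.0                                          # empty suffix
    for i in range(N - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])  # sum over h_{i+1}
    marginals = alpha * beta                                   # proportional to p(h_i | X)
    return marginals / marginals.sum(axis=1, keepdims=True)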

For a more in-depth math explanation, I highly recommend the Mathematical Monk series.

Interface

The code below gives the interface for interacting with this model. Although the algorithms aren't actually filled in, here is a brief overview of what they do:

  • We can decode the most likely sequence of hidden states using the Viterbi algorithm. I've covered this in a previous post. Effectively, this finds the single most probable sequence of hidden states, given the observed states.
  • Training this model involves using the forward-backward algorithm to find the marginal probability of each hidden state at each timestep (in other words, the probability independent of what any other hidden state is). We can then perform expectation maximization to push these probabilities in the direction that maximizes the likelihood of generating some training sample $X$. This procedure is known as the Baum-Welch algorithm; a rough usage sketch follows the interface below.
"""Defines the Hidden Markov Model interface.

Compare this interface with the CRF interface below. Helper types are added to
make the code more readable.

We assume that the hidden and output states are numbered [0, 1, ... , n - 1].
Additionally, unlike in typical neural network notation, we assume we're dealing
with one batch at a time, so the first dimension will not be a `bsz` dimension.
"""

import abc
import numpy as np
from typing import Tuple

FloatVector = np.ndarray        # 1-D with float values.
FloatMatrix = np.ndarray        # 2-D with float values.
IntVector = np.ndarray          # 1-D with int values.
IntMatrix = np.ndarray          # 2-D with int values.


class HiddenMarkovModel(metaclass=abc.ABCMeta):
    """Defines the HMM interface."""

    def __init__(self, num_observed: int, num_hidden: int) -> None:
        self.num_observed = num_observed
        self.num_hidden = num_hidden

        # Model parameters: hidden-to-hidden transitions and hidden-to-observed emissions.
        self.hidden_transitions = np.random.randn(num_hidden, num_hidden)
        self.emission_probs = np.random.randn(num_hidden, num_observed)

    @abc.abstractmethod
    def viterbi(self, observed_states: IntVector) -> Tuple[IntVector, float]:
        """Returns the most likely sequence of hidden states.

        Args:
            observed_states: the sequence of `num_time` observed states.

        Returns:
            hidden_states: the most probable sequence of `num_time` hidden states
                associated with the sequence of observed states.
            sequence_prob: the probability of the maximum-likelihood sequence.
        """

        raise NotImplementedError

    @abc.abstractmethod
    def forward_backward(self, observed_states: IntVector) -> FloatMatrix:
        """Returns the marginal probability of each hidden state.

        Args:
            observed_states: the sequence of `num_time` observed states.

        Returns:
            hidden_state_probs: matrix with shape `(num_time, num_hidden)`, where
                index `(i, j)` represents the marginal probability of the `j`th
                hidden state at time `i`.
        """

        raise NotImplementedError
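
As mentioned above, a Baum-Welch-style training loop would drive this interface roughly as follows. `ConcreteHMM`, its placeholder method bodies, and the toy training sequences are all hypothetical, included only to make the sketch runnable.

# Hypothetical stub implementation and toy data, only to show how the interface is used.
class ConcreteHMM(HiddenMarkovModel):
    def viterbi(self, observed_states: IntVector) -> Tuple[IntVector, float]:
        # Placeholder only: a real implementation runs the Viterbi recursion.
        return observed_states % self.num_hidden, 0.0

    def forward_backward(self, observed_states: IntVector) -> FloatMatrix:
        # Placeholder only: a real implementation runs the alpha/beta recursions.
        return np.full((len(observed_states), self.num_hidden), 1.0 / self.num_hidden)


model = ConcreteHMM(num_observed=5, num_hidden=3)
training_sequences = [np.array([0, 2, 4, 1]), np.array([3, 3, 0])]
for observed_states in training_sequences:
    marginals = model.forward_backward(observed_states)  # E-step: p(h_i | X) at each timestep
    # M-step (not shown): re-estimate the transition and emission parameters from the
    # expected counts implied by `marginals`, then repeat until convergence.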

Can we reframe this as a neural network layer?

An interesting thought experiment to try is to consider how this model could be reframed in the way that we usually think about neural networks. HMMs are generative models, trained to maximize the probability of the observed training data.

If we constrain the number of hidden states to equal the number of observed states, we can treat the HMM as a refinement layer, which takes a sequence of predictions from an upstream neural network and outputs the maximum-likelihood sequence from the Viterbi algorithm as a conditional prediction. This is illustrated in the figure below, where the red text indicates the training procedure.

HMM as a refinement layer
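
Sketched in code, that wiring might look something like the snippet below. It reuses the hypothetical `ConcreteHMM` stub from above, and the random logits stand in for the upstream network's output; none of this is prescribed by the interface itself.

# The network's per-timestep argmax ignores context; Viterbi refines it using the
# learned transition structure (num_hidden == num_observed == 5 here).
network_logits = np.random.randn(20, 5)              # stand-in (num_time, num_classes) output
noisy_predictions = network_logits.argmax(axis=-1)

refinement_hmm = ConcreteHMM(num_observed=5, num_hidden=5)
refined_predictions, _ = refinement_hmm.viterbi(noisy_predictions)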

Connectionist Temporal Classification

Another common application of HMMs in deep learning is the Connectionist Temporal Classification (CTC) loss function, which is covered in a great article here. In this formulation, the HMM takes on a specific form, where the transition probabilities are treated as uniform (there is an equal likelihood of going from any state to any other state), meaning we can discard the $p(h_i | h_{i - 1})$ term up to a constant factor. If we flip the generative probability around using Bayes' rule, we get the CTC learning criterion:

$$\begin{aligned} p(X) & \propto \sum_{H \in \mathscr{H}} \prod_{i = 1}^{N} \frac{p(h_i | x_i) \, p(x_i)}{p(h_i)} \\ & \propto \sum_{H \in \mathscr{H}} \prod_{i = 1}^{N} p(h_i | X) \end{aligned}$$

The training and inference algorithms associated with CTC learning can therefore be thought of in the same terms as HMM algorithms.
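
To make the connection concrete, here is a deliberately simplified sketch of the alignment-summing dynamic program. It drops CTC's blank token and repeat-collapsing rules, so it illustrates the idea of summing per-frame posteriors over monotonic alignments rather than the exact CTC loss.

import numpy as np


def simple_alignment_loss(posteriors, labels):
    """Negative log of the total probability of all monotonic alignments of `labels`.

    posteriors: (T, num_labels) per-frame posteriors p(h_t | X) from the network.
    labels:     length-L target label sequence, with L <= T.
    """
    T, L = posteriors.shape[0], len(labels)
    dp = np.zeros((T, L))    # dp[t, j]: prob of aligning frames 0..t to labels 0..j
    dp[0, 0] = posteriors[0, labels[0]]
    for t in range(1, T):
        for j in range(L):
            stay = dp[t - 1, j]                             # keep emitting the same label
            advance = dp[t - 1, j - 1] if j > 0 else 0.0    # move on to the next label
            dp[t, j] = posteriors[t, labels[j]] * (stay + advance)
    return -np.log(dp[T - 1, L - 1])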

Conditional Random Fields

The diagram below illustrates the architecture of a linear chain conditional random field, with the input states on the right in green and the output states on the left in blue. The parameters of the model are the arrows in the diagram, where the unary potentials represent the probability of going from one input state to one output state, and the binary potentials represent the probability of going from one output state to another output state. The diagram illustrates a single time step.

Note that the only difference between this diagram and the HMM diagram above is the direction of the arrows! While HMMs are generative, CRFs are conditional.

CRF over a single time step

When we expand this model over multiple time steps, we get something like the image below. Note that the arrows of the binary potentials go in both directions. Rather than being forward generative, this model conditions on the entire input sequence, with neighboring output states constraining one another in both directions.

CRF over multiple time steps

One reason for the popularity of CRFs over HMMs in recent literature is that the math is more forgiving. Mathematically, the probability of an output sequence $y = y_1, y_2, \dots, y_N$ given an input sequence $X = \vec{x}_1, \vec{x}_2, \dots, \vec{x}_N$ is

$$p(y | X) = \frac{\exp\left(\sum_{i=1}^{N} U(\vec{x}_i, y_i) + \sum_{i=1}^{N-1} T(y_i, y_{i+1})\right)}{Z(X)}$$

$U(\vec{x}_i, y_i)$ is the learned unary score, proportional to the probability of $y_i$ being generated by $\vec{x}_i$. $T(y_i, y_{i + 1})$ is the learned binary score, representing the likelihood of $y_i$ and $y_{i+1}$ appearing next to each other. These scores aren't constrained to be probabilities, so they can be produced directly by neural networks.
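
To make the scoring concrete, here is a small sketch of the exponent in the numerator above. The layout of the arguments (a per-timestep unary score matrix and a dense binary score matrix) is an assumption for illustration, not part of the post's interface.

import numpy as np


def crf_score(unary, binary, outputs):
    """Unnormalized log-score of one output sequence: sum_i U(x_i, y_i) + sum_i T(y_i, y_{i+1}).

    unary:   (num_time, num_output) matrix with unary[i, y] = U(x_i, y).
    binary:  (num_output, num_output) matrix with binary[y, y'] = T(y, y').
    outputs: length-`num_time` sequence of output state indices y_1, ..., y_N.
    """
    outputs = np.asarray(outputs)
    unary_term = unary[np.arange(len(outputs)), outputs].sum()
    binary_term = binary[outputs[:-1], outputs[1:]].sum()
    return unary_term + binary_term

Exponentiating this score and dividing by $Z(X)$ gives $p(y | X)$.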

The partition function $Z(X)$ is the normalizing factor that turns these scores into a proper probability. It represents the total score of all possible output sequences.

$$\begin{aligned} Z(X) & = \sum_{y'_1 \in y} \sum_{y'_2 \in y} \dots \sum_{y'_i \in y} \dots \sum_{y'_N \in y} \exp\left(\sum_{i=1}^{N} U(\vec{x}_i, y'_i) + \sum_{i=1}^{N-1} T(y'_i, y'_{i+1})\right) \\ & = \sum_{y'_N \in y} \exp(U(\vec{x}_N, y'_N)) \sum_{y'_{N - 1} \in y} \exp(U(\vec{x}_{N - 1}, y'_{N - 1}) + T(y'_{N - 1}, y'_N)) \; \dots \; \sum_{y'_1 \in y} \exp(U(\vec{x}_1, y'_1) + T(y'_1, y'_2)) \end{aligned}$$

This partition function is computed using the forward-backward algorithm, an example of a dynamic programming algorithm. The algorithm is usually written in terms of $\alpha$ and $\beta$, the forward and backward dynamic programming tables, where $\alpha_i$ (a row vector) and $\beta_i$ (a column vector) are the $i$th entries in these tables. Both are defined in terms of a per-position transition matrix $M_i$, indexed by the previous output state $y$ and the current output state $y'$:

$$[M_i]_{y, y'} = \exp(U(\vec{x}_i, y') + T(y, y'))$$

$$\alpha_{i} = \alpha_{i - 1} M_i \qquad \alpha_1(y) = \exp(U(\vec{x}_1, y))$$

$$\beta_{i - 1} = M_i \, \beta_i \qquad \beta_N(y) = 1$$

$$Z(X) = \sum_{y} \alpha_{N}(y)$$

Once these tables have been computed, the marginal probabilities are then expressible as

$$p(Y_i = y | X) = \frac{\alpha_i(y) \, \beta_i(y)}{Z(X)}$$
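
Putting the $\alpha$, $\beta$, and $M_i$ pieces together, here is a minimal sketch of the computation, working directly in probability space. The argument layout mirrors the `crf_score` sketch above and is an assumption, not the post's interface.

import numpy as np


def crf_forward_backward(unary, binary):
    """Returns Z(X) and the marginals p(Y_i = y | X) for the linear-chain CRF above.

    unary:  (num_time, num_output) matrix with unary[i, y] = U(x_i, y).
    binary: (num_output, num_output) matrix with binary[y, y'] = T(y, y').
    """
    N, K = unary.shape
    # M[i, y, y'] = exp(U(x_i, y') + T(y, y')), the transition matrix at position i.
    M = np.exp(unary[:, None, :] + binary[None, :, :])
    alpha = np.zeros((N, K))
    beta = np.zeros((N, K))
    alpha[0] = np.exp(unary[0])           # no incoming transition at the first position
    for i in range(1, N):
        alpha[i] = alpha[i - 1] @ M[i]
    beta[N - 1] = 1.0                     # empty suffix
    for i in range(N - 2, -1, -1):
        beta[i] = M[i + 1] @ beta[i + 1]
    Z = alpha[N - 1].sum()
    return Z, alpha * beta / Z            # marginals: alpha_i(y) beta_i(y) / Z(X)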

On GPUs, it is more efficient to use the tree reduction approach described, for example, here.

For both CRFs and HMMs, repeatedly multiplying probabilities quickly produces very small numbers, so it is usually better to perform the computations in log space for numerical stability. Most implementations are written this way.
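
For instance, the forward pass from the CRF sketch above can be rewritten entirely in log space with numpy's `np.logaddexp`; this is an illustrative variant, not the post's implementation.

import numpy as np


def crf_log_partition(unary, binary):
    """Returns log Z(X) for the linear-chain CRF without ever leaving log space."""
    log_alpha = unary[0]                                  # log alpha_1
    for i in range(1, unary.shape[0]):
        # scores[y, y'] = log alpha_{i-1}(y) + T(y, y') + U(x_i, y')
        scores = log_alpha[:, None] + binary + unary[i][None, :]
        log_alpha = np.logaddexp.reduce(scores, axis=0)   # logsumexp over the previous state
    return np.logaddexp.reduce(log_alpha)                 # logsumexp over the final state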

Interface

The code below gives the interface for interacting with this model. The marginals returned by the forward-backward algorithm are effectively normalized per-timestep distributions over the output states, which is what makes them usable as soft targets during training.

"""Defines the Conditional Random Field interface.

Compare this interface with the HMM interface above. This implementation uses the
same helper types as above.
"""

import abc
import numpy as np
from typing import Tuple

FloatVector = np.ndarray        # 1-D with float values.
FloatMatrix = np.ndarray        # 2-D with float values.
IntVector = np.ndarray          # 1-D with int values.
IntMatrix = np.ndarray          # 2-D with int values.


class ConditionalRandomField(metaclass=abc.ABCMeta):
    """Defines the CRF interface."""

    def __init__(self, num_input: int, num_output: int) -> None:
        self.num_input = num_input
        self.num_output = num_output

        # Defines the unary and binary score matrices inside the model.
        self.unary_probs = np.random.randn(num_input, num_output)
        self.binary_probs = np.random.randn(num_output, num_output)

    @abc.abstractmethod
    def viterbi(self, unary_potentials: FloatMatrix) -> Tuple[IntVector, float]:
        """Returns the most likely sequence of hidden states.

        Args:
            unary_potentials: matrix with shape `(num_time, num_input)`, where index
                `(i, j)` represents the unary potential for the `j`th input state
                at time `i`.

        Returns:
            output_states: the most probable sequence of `num_time` output states
                associated with the sequence of input states.
            sequence_prob: the probability of the maximum-likelihood sequence.
        """

        raise NotImplementedError

    @abc.abstractmethod
    def forward_backward(self, unary_potentials: FloatMatrix) -> FloatMatrix:
        """Returns the marginal probability of each output state.

        Args:
            unary_potentials: matrix with shape `(num_time, num_input)`, where index
                `(i, j)` represents the unary potential for the `j`th input state
                at time `i`.

        Returns:
            output_marginals: matrix with shape `(num_time, num_output)`, where
                index `(i, j)` represents the marginal probability of the `j`th
                output state at time `i`.
        """

        raise NotImplementedError

Can we reframe this as a neural network layer?

Hugo Larochelle has a great series about doing just this, and there is a PyTorch-based package from Harvard NLP which includes implementations of CRFs. CRFs can be used as a drop-in layer in a neural network architecture, as in the diagram below.

CRF as a neural network layer

Intuitively, CRFs are a way of doing "path-max" for a restricted type of model. Rather than choosing uncorrelated maximum values at each timestep, the CRF scores whole paths, which can be sampled from, and the Viterbi algorithm can be used to find the single best path through the model, as in the closing sketch below.
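
As a closing sketch, this is roughly how the drop-in layer and the "path-max" decoding fit together. `StubCRF`, its placeholder method bodies, and the stand-in encoder features are hypothetical, included only to show how a concrete implementation of the interface above would slot in.

# Hypothetical stub: a real implementation fills in the Viterbi and forward-backward recursions.
class StubCRF(ConditionalRandomField):
    def viterbi(self, unary_potentials: FloatMatrix) -> Tuple[IntVector, float]:
        return unary_potentials.argmax(axis=-1) % self.num_output, 0.0   # placeholder path

    def forward_backward(self, unary_potentials: FloatMatrix) -> FloatMatrix:
        return np.full((unary_potentials.shape[0], self.num_output), 1.0 / self.num_output)


features = np.random.randn(20, 8)                  # stand-in encoder output, (num_time, num_input)
crf = StubCRF(num_input=8, num_output=4)

output_marginals = crf.forward_backward(features)  # soft per-timestep targets for training
best_path, path_prob = crf.viterbi(features)       # hard "path-max" decoding at inference time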