This tutorial explains how to code the Viterbi algorithm in Numpy, and gives a brief explanation of the math behind it. I’m using Numpy version `1.18.1` and Python `3.7`, although this should work for any future Python or Numpy versions.

The Viterbi algorithm has been widely covered in many areas. One of my favorite explanations is from the YouTuber Mathematical Monk. This tutorial expects some background familiarity with hidden Markov models; if you’ve never heard of them, check out those videos! The Wikipedia page is also pretty good. Here’s a diagram of an unfolded hidden Markov model:

The goal of the Viterbi algorithm is to compute the most probable sequence of hidden states $z^*$ for a hidden Markov model, given an observed sequence $x$, where the maximization is over the set of possible hidden state sequences $z$:

`z^* = \underset{z}{\mathrm{argmax}}[p(z,x)]`

We assume that we know the observed sequence $x = (x_1, …, x_n)$, the initial hidden state probabilities $z_1 = (z_1^1, …, z_1^L)$, the emission probabilities $p(x_i | z_i)$, and the transition probabilities $p(z_i | z_{i-1})$. In Numpy, these can be represented as arrays:
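For concreteness, here is one hypothetical instantiation with two hidden states and three observation symbols (all numbers are made up, and the names `init_probs`, `transition_probs`, and `emission_probs` are just illustrative):

```python
import numpy as np

# Hypothetical 2-hidden-state, 3-observation-symbol HMM (all numbers made up).
# init_probs[i] = p(z_1 = i)
init_probs = np.array([0.6, 0.4])

# transition_probs[i, j] = p(z_k = i | z_{k-1} = j); each column sums to 1
transition_probs = np.array([
    [0.7, 0.4],
    [0.3, 0.6],
])

# emission_probs[i, j] = p(x_k = i | z_k = j); each column sums to 1
emission_probs = np.array([
    [0.1, 0.7],
    [0.4, 0.2],
    [0.5, 0.1],
])
```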

Note that these matrices represent the probability distributions in *column-stochastic* format, meaning that each column sums to 1. Note also that if we represent the hidden state probabilities as a vector, then multiplying the `transition_probs` matrix by that vector gives us the new hidden state probabilities, and multiplying the `emission_probs` matrix by that vector gives us the probabilities over the observed states.
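As a quick sanity check of those matrix-vector products (a sketch with made-up numbers, using `@` for matrix multiplication):

```python
import numpy as np

# Made-up column-stochastic matrices (each column sums to 1).
transition_probs = np.array([[0.7, 0.4], [0.3, 0.6]])
emission_probs = np.array([[0.1, 0.7], [0.4, 0.2], [0.5, 0.1]])

hidden = np.array([0.6, 0.4])           # a distribution over hidden states

new_hidden = transition_probs @ hidden  # distribution after one transition
observed = emission_probs @ hidden      # induced distribution over observations

# Both results are still valid probability distributions (they sum to 1).
```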

The maximum value of $p(z_{1:k}, x_{1:k})$, maximized over the hidden states at all the timesteps $1, …, k-1$ preceding the current timestep $k$, is

```
\begin{aligned}
\mu_k(z_k) & = \max_{z_{1:k-1}} p(z_{1:k}, x_{1:k}) \\
& = \max_{z_{1:k-1}} p(x_k|z_k) p(z_k|z_{k-1}) p(z_{1:k-1}, x_{1:k-1}) \\
& = \max_{z_{k-1}} p(x_k|z_k) p(z_k|z_{k-1}) \max_{z_{1:k-2}} p(z_{1:k-1},x_{1:k-1})
\end{aligned}
```

Since the first two terms depend only on $z_k$ and $z_{k-1}$, you can take them out of the max over $z_{1:k-2}$; the remaining inner max is exactly $\mu_{k-1}(z_{k-1})$, which gives the recursion

`\mu_k(z_k) = p(x_k|z_k) \max_{z_{k-1}} p(z_k|z_{k-1}) \mu_{k-1}(z_{k-1})`
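One step of this recursion can be sketched in Numpy as follows (made-up numbers, using the column-stochastic convention described above, where `transition_probs[i, j]` is $p(z_k = i | z_{k-1} = j)$):

```python
import numpy as np

# Made-up column-stochastic matrices, as in the text.
transition_probs = np.array([[0.7, 0.4], [0.3, 0.6]])
emission_probs = np.array([[0.1, 0.7], [0.4, 0.2], [0.5, 0.1]])

mu_prev = np.array([0.06, 0.28])  # mu_{k-1}(z_{k-1}) from the previous step
x_k = 2                           # the observation at timestep k

# transition_probs * mu_prev scales column j by mu_{k-1}(j), so taking the
# max over axis=1 maximizes over z_{k-1} separately for each value of z_k.
scores = transition_probs * mu_prev
mu_k = emission_probs[x_k] * scores.max(axis=1)
backpointer = scores.argmax(axis=1)  # best previous state for each z_k
```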

For the recursion base case, we get

```
\begin{aligned}
\mu_1(z_1) & = p(z_1, x_1) \\
& = p(z_1) p(x_1 | z_1)
\end{aligned}
```
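In Numpy the base case is a single elementwise product (again with made-up numbers):

```python
import numpy as np

# Base case: mu_1(z_1) = p(z_1) p(x_1 | z_1), computed elementwise.
init_probs = np.array([0.6, 0.4])
emission_probs = np.array([[0.1, 0.7], [0.4, 0.2], [0.5, 0.1]])

x_0 = 0                                  # the first observation
mu_1 = init_probs * emission_probs[x_0]  # -> array([0.06, 0.28])
```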

Knowing the maximizing value, we can find the *argmax* by using *backpointers*: for each max value $\mu_k(z_k)$, we record the incoming index $\operatorname{argmax}_{z_{k-1}} p(z_k | z_{k-1}) \mu_{k-1}(z_{k-1})$. By following these pointers backwards from the final maximum, we recover the (reversed) sequence of indices that led to it.
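Putting the base case, the recursion, and the backpointers together, a minimal Viterbi sketch might look like this (illustrative names; it assumes the column-stochastic matrix conventions described above):

```python
import numpy as np

def viterbi(x, init_probs, transition_probs, emission_probs):
    """Most probable hidden state sequence for the observations x.

    Assumes column-stochastic conventions:
    transition_probs[i, j] = p(z_k = i | z_{k-1} = j),
    emission_probs[i, j]   = p(x_k = i | z_k = j).
    """
    n, num_hidden = len(x), len(init_probs)
    mu = np.empty((n, num_hidden))
    backpointers = np.zeros((n, num_hidden), dtype=int)

    # Base case: mu_1(z_1) = p(z_1) p(x_1 | z_1)
    mu[0] = init_probs * emission_probs[x[0]]

    # Recursion, recording the argmax (backpointer) at every step
    for k in range(1, n):
        scores = transition_probs * mu[k - 1]
        mu[k] = emission_probs[x[k]] * scores.max(axis=1)
        backpointers[k] = scores.argmax(axis=1)

    # Follow the backpointers from the final maximum to recover z^*
    z = np.empty(n, dtype=int)
    z[-1] = mu[-1].argmax()
    for k in range(n - 2, -1, -1):
        z[k] = backpointers[k + 1, z[k + 1]]
    return z, mu[-1].max()

# Example run with made-up matrices:
init_probs = np.array([0.6, 0.4])
transition_probs = np.array([[0.7, 0.4], [0.3, 0.6]])
emission_probs = np.array([[0.1, 0.7], [0.4, 0.2], [0.5, 0.1]])
z_star, p_star = viterbi([0, 1, 2], init_probs, transition_probs, emission_probs)
```

Note that for long sequences you would normally work with log probabilities to avoid numerical underflow (the products become sums); this sketch keeps raw probabilities to match the equations above.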

The **time complexity** of the whole inference is $O(N M^2)$ for $N$ time steps and $M$ hidden states (the $M^2$ term comes from a broadcasted product and max over an $M \times M$ array at each step, which Numpy vectorizes efficiently). The **space complexity** is $O(N M)$, since we have to store the most likely previous state for each hidden state at each time step.