Hao Tang 唐顥

Hidden Markov Models (Part 1)

We have seen how we can model speech frames with Gaussian mixture models, but a GMM assumes that the frames are independent of each other, which is unlikely to hold for speech. In this post, we will discuss hidden Markov models (HMMs) and how they model the dependencies between frames. We will mainly focus on the definition, the properties, and how HMMs are used, and will leave the training of HMMs to the next post.

Definition

Given a sequence $x_1, \dots, x_T$, an HMM assumes that the sequence is generated iteratively with \begin{align} z_1 & \sim \text{Categorical}(\pi) \\ z_t & \sim \text{Categorical}(\mathbf{a}_{z_{t-1}}) \\ x_t & \sim f_\theta(x_t | z_t) \end{align} where the second line holds for $t = 2, \dots, T$ and the third for $t = 1, \dots, T$.

The variable $\pi$ is a $K$-dimensional stochastic vector, where a stochastic vector means that the entries are probabilities. That means $0 \leq \pi_k \leq 1$ and $\sum_{k=1}^K \pi_k = 1$.

The variable $\mathbf{a}_k$ is also a $K$-dimensional stochastic vector for $k=1, \dots, K$, and because we have $K$ of them, it is more convenient to pack the vectors into a matrix $A = \begin{bmatrix}\mathbf{a}_1 & \cdots & \mathbf{a}_K\end{bmatrix}$.

Each $z_t \in \{1, \dots, K\}$, and the $z_t$'s are called states. Since $z_t \sim \text{Categorical}(\mathbf{a}_{z_{t-1}})$, the stochastic vector $\mathbf{a}_{z_{t-1}}$ defines the probability of transitioning from a state (which $z_{t-1}$ is at) to other states (that gets assigned to $z_t$). The probabilities in $A$ are called transition probabilities, and $A$ is called the transition matrix.

The function $f_\theta(x | z)$ is a conditional probability distribution parameterized by $\theta$. Note that $f$ does not depend on $t$. In other words, the same function is used at every time step. This function $f$ is often called the emission probability, where $x_t$ is considered to be the output at time $t$.

The parameters of an HMM are $\pi$, $A$, and $\theta$.
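To make the generative story concrete, here is a small sampler for a toy HMM with $K = 2$ states and discrete emissions over three symbols. The parameters `pi`, `A`, and `B` are made up for illustration: `A[i]` plays the role of $\mathbf{a}_i$, and the table `B` stands in for $f_\theta$.

```python
import random

# Toy HMM with K = 2 states and discrete emissions over 3 symbols.
# All parameter values are illustrative.
pi = [0.6, 0.4]                       # initial state probabilities
A  = [[0.7, 0.3],                     # A[i][j] = p(z_t = j | z_{t-1} = i)
      [0.2, 0.8]]
B  = [[0.5, 0.4, 0.1],                # B[k][v] = f(x = v | z = k)
      [0.1, 0.3, 0.6]]

def sample(T, seed=0):
    """Draw (x_1..T, z_1..T) following the generative story above."""
    rng = random.Random(seed)
    zs, xs = [], []
    z = rng.choices(range(2), weights=pi)[0]            # z_1 ~ Categorical(pi)
    for t in range(T):
        if t > 0:
            z = rng.choices(range(2), weights=A[z])[0]  # z_t ~ Categorical(a_{z_{t-1}})
        x = rng.choices(range(3), weights=B[z])[0]      # x_t ~ f(x | z_t)
        zs.append(z)
        xs.append(x)
    return xs, zs

xs, zs = sample(10)
```

With a fixed seed the draw is reproducible, which is convenient for debugging the algorithms that follow.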

(In)dependencies

In HMMs, the distribution of $z_t$ is completely determined once $z_{t-1}$ is known. In probability, instead of describing the dependencies, we only have the notion of statistical independence. To put it in the language of probability, the variable $z_t$ is independent of any other variable $v$ given $z_{t-1}$. Formally, \begin{align} p(z_t | v, z_{t-1}) = \frac{p(z_t, v | z_{t-1})}{p(v | z_{t-1})} = \frac{p(z_t | z_{t-1}) p(v | z_{t-1})}{p(v | z_{t-1})} = p(z_t | z_{t-1}). \end{align} If we compare the first term and the last term, we see that we can drop the other variables being conditioned on, in this case the variable $v$, whenever $z_t$ is independent of those variables given $z_{t-1}$.

Similarly, when we write $p(x_t | z_t)$, it means that the distribution of $x_t$ is completely determined once $z_t$ is known, and $x_t$ is independent of everything else given $z_t$.

The property that the current state depends only on the immediately preceding state (more generally, on a finite number of past states) is called the Markov property; hence the word Markov in HMM.

The Joint Probability

Given the conditional independence, we have \begin{align} p(x_{1:T}, z_{1:T}) & = p(z_1) p(x_1 | z_1) p(z_2 | z_1, x_1) p(x_2 | z_2, z_1, x_1) p(z_3 | z_2, z_1, x_2, x_1) \cdots \\ & = p(z_1) p(x_1 | z_1 )\prod_{t=2}^T p(z_t | z_{1:t-1}, x_{1:t-1}) p(x_t | z_{1:t}, x_{1:t-1}) \\ & = p(z_1) p(x_1 | z_1) \prod_{t=2}^T p(z_t | z_{t-1}) p(x_t | z_t) \end{align} where the first line is based on the definition of conditional probability, the second line is a cleaner version of the first line, and the third line is where we use the conditional independence to drop the variables.

Using the parameters, we can write the probabilities as \begin{align} p(z_1) &= \pi_{z_1} \\ p(z_t | z_{t-1}) &= a_{z_t z_{t-1}} \\ p(x_t | z_t) &= f_\theta(x_t | z_t) \end{align}
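The factorization of the joint probability can be accumulated term by term in the log domain. The following sketch uses a toy discrete-emission HMM; the parameters `pi`, `A`, and `B` are illustrative, with `A[i][j]` standing for $p(z_t = j \mid z_{t-1} = i)$ and `B[k][v]` for $f_\theta(v \mid k)$.

```python
import math

# Toy parameters (illustrative, not fixed by the post).
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.2, 0.8]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def log_joint(xs, zs):
    """log p(x_{1:T}, z_{1:T}) as a sum of the factored terms."""
    lp = math.log(pi[zs[0]]) + math.log(B[zs[0]][xs[0]])
    for t in range(1, len(xs)):
        lp += math.log(A[zs[t-1]][zs[t]])   # log p(z_t | z_{t-1})
        lp += math.log(B[zs[t]][xs[t]])     # log p(x_t | z_t)
    return lp

lp = log_joint([0, 2, 1], [0, 1, 1])
```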

Inferring the Hidden States: the Viterbi Algorithm

An HMM is most useful when the states are not observed. If the states $z_1, \dots, z_T$ are observed, the outputs $x_1, \dots, x_T$ are independent of each other, and the transition probabilities are not even involved.

If the states are not observed, we can ask what could have been the most likely state sequence that produces the outputs. This can be written formally as \begin{align} \operatorname*{\arg\!\!\max}_{z_{1:T}} \log p(z_{1:T} | x_{1:T}) & = \operatorname*{\arg\!\!\max}_{z_{1:T}} \log \frac{p(x_{1:T}, z_{1:T})}{p(x_{1:T})} \\ & = \operatorname*{\arg\!\!\max}_{z_{1:T}} \log p(x_{1:T}, z_{1:T}) \end{align} where the definition of conditional probability is used in the first line, and we drop $p(x_{1:T})$ in the second line because it does not involve $z_{1:T}$.

Based on the joint probability, we can derive the recursion \begin{align} \log p(x_{1:t}, z_{1:t}) & = \log p(z_1) + \log p(x_1 | z_1) + \sum_{i=2}^t \left[ \log p(z_i | z_{i-1}) + \log p(x_i | z_i) \right] \\ & = \log p(z_1) + \log p(x_1 | z_1) + \sum_{i=2}^{t-1} \left[ \log p(z_i | z_{i-1}) + \log p(x_i | z_i) \right] \\ & \qquad + \log p(z_t | z_{t-1}) + \log p(x_t | z_t) \\ & = \log p(x_{1:t-1}, z_{1:t-1}) + \log p(z_t | z_{t-1}) + \log p(x_t | z_t). \end{align} This is nothing but separating out the last two terms, but now we are in a good position to build efficient dynamic programming algorithms.

Since the last two terms depend on both $z_{t-1}$ and $z_t$, we delay finding the maximum over $z_t$ and instead focus on finding the maximum up until $z_{t-1}$. Using the same approach of separating out the last terms, we have the recursion \begin{align} \max_{z_{1:t-1}} \log p(x_{1:t}, z_{1:t}) & = \max_{z_{t-1}} \max_{z_{1:t-2}} \log p(x_{1:t}, z_{1:t}) \\ & = \max_{z_{t-1}} \max_{z_{1:t-2}} \Bigg[ \log p(x_{1:t-1}, z_{1:t-1}) + \log p(z_t | z_{t-1}) + \log p(x_t | z_t) \Bigg] \\ & = \max_{z_{t-1}} \left[ \log p(z_t | z_{t-1}) + \log p(x_t | z_t) + \max_{z_{1:t-2}} \log p(x_{1:t-1}, z_{1:t-1}) \right]. \end{align} Note that $\max_{z_{1:t-1}} \log p(x_{1:t}, z_{1:t})$ is a $K$-dimensional vector, because $z_t$ is free and can take values from $1$ to $K$. We give the term a new name and rewrite the recursion as \begin{align} d_1(z_1) &= \log \pi_{z_1} + \log p(x_1 | z_1) \\ d_t(z_t) &= \max_{z_{1:t-1}} \log p(x_{1:t}, z_{1:t}) \\ & = \max_{z_{t-1}} \left[\log p(z_t | z_{t-1}) + \log p(x_t | z_t) + d_{t-1}(z_{t-1}) \right]. \end{align} To compute the actual value of the max, we have \begin{align} \max_{z_{1:T}} \log p(x_{1:T}, z_{1:T}) = \max_{z_T} d_T(z_T). \end{align}

To find the hidden states that attain the maximum, we follow the recursion \begin{align} \delta_t(z_t) = \operatorname*{\arg\!\!\max}_{z_{t-1}} \left[\log p(z_t | z_{t-1}) + \log p(x_t | z_t) + d_{t-1}(z_{t-1}) \right] \end{align} and save the hidden states as we compute $d$. Once the dynamic programming is done, we go backwards, following the recursion \begin{align} \hat{z}_T &= \operatorname*{\arg\!\!\max}_{z_T} d_T(z_T) \\ \hat{z}_{t-1} &= \delta_t(\hat{z}_t), \end{align} for $t = T, \dots, 2$, and the sequence of hidden states that attains the maximum is $\hat{z}_1, \dots, \hat{z}_T$. Due to the way states are saved and the backward computation, this step is commonly known as backtracking.
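The whole procedure, the forward recursion on $d$ plus the backtracking through $\delta$, fits in a short function. This is a sketch on a toy discrete-emission HMM; the parameters `pi`, `A`, and `B` are illustrative, not anything fixed by the post.

```python
import math

# Illustrative toy parameters.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.2, 0.8]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
K  = 2

def viterbi(xs):
    """Return the most likely state sequence and its log joint probability."""
    T = len(xs)
    # d[t][k] = max over z_{1:t-1} of log p(x_{1:t}, z_{1:t-1}, z_t = k)
    d = [[math.log(pi[k]) + math.log(B[k][xs[0]]) for k in range(K)]]
    back = []  # back[t][k] = the argmax state at the previous time step
    for t in range(1, T):
        row, ptr = [], []
        for k in range(K):
            scores = [d[t-1][j] + math.log(A[j][k]) for j in range(K)]
            j_best = max(range(K), key=lambda j: scores[j])
            row.append(scores[j_best] + math.log(B[k][xs[t]]))
            ptr.append(j_best)
        d.append(row)
        back.append(ptr)
    # backtracking
    z = max(range(K), key=lambda k: d[T-1][k])
    path = [z]
    for t in range(T - 2, -1, -1):
        z = back[t][z]
        path.append(z)
    return path[::-1], max(d[T-1])

path, score = viterbi([0, 0, 2, 2])
```

The saved back-pointers `back` play the role of $\delta$, and reversing the collected states at the end is exactly the backtracking step.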

The entire algorithm is known as the Viterbi algorithm. Note that the algorithm only relies on the recursion of the joint probability. It does not even need to know the implementation of $p(z_t | z_{t-1})$ and $p(x_t | z_t)$ as long as they can be evaluated.

The Marginal Probability

Assuming the data is generated by an HMM, we can also compute the probability of observing the data itself. This requires us to marginalize over all possible sequences of hidden states, so the resulting probability is also called the marginal probability. Formally, we want to compute $p(x_{1:T})$, and by the property of joint probability, we have \begin{align} p(x_{1:T}) = \sum_{z_{1:T}} p(x_{1:T}, z_{1:T}). \end{align}

We follow the same approach and compute $\sum_{z_{1:t-1}} p(x_{1:t}, z_{1:t})$ instead, delaying the marginalization of the last time step $z_t$. It is now a reflex to separate out the last time step, and we have \begin{align} \sum_{z_{1:t-1}} p(x_{1:t}, z_{1:t}) & = \sum_{z_{t-1}} \sum_{z_{1:t-2}} p(x_{1:t}, z_{1:t}) \\ & = \sum_{z_{t-1}} \sum_{z_{1:t-2}} \Bigg[ p(x_{1:t-1}, z_{1:t-1}) p(z_t | z_{t-1}) p(x_t | z_t) \Bigg] \\ & = \sum_{z_{t-1}} \left[ p(z_t | z_{t-1}) p(x_t | z_t) \sum_{z_{1:t-2}} p(x_{1:t-1}, z_{1:t-1}) \right]. \end{align} We give the term a new name and rewrite the recursion as \begin{align} \alpha_1(z_1) &= \pi_{z_1} f_\theta(x_1 | z_1) \\ \alpha_t(z_t) & = \sum_{z_{1:t-1}} p(x_{1:t}, z_{1:t}) \\ & = \sum_{z_{t-1}} p(z_t | z_{t-1}) p(x_t | z_t) \alpha_{t-1}(z_{t-1}), \end{align} and the marginal probability at the end is $p(x_{1:T}) = \sum_{z_T} \alpha_T(z_T)$. Note how everything is in parallel with the Viterbi algorithm. The only change here is replacing the max with the sum. Similar to the Viterbi algorithm, computing the marginal probability only relies on the recursion of the joint probability and does not need to know the implementation of $p(z_t | z_{t-1})$ and $p(x_t | z_t)$.
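The forward recursion translates directly into code. Again a sketch on a toy discrete-emission HMM with illustrative parameters `pi`, `A`, and `B`; summing the final row of $\alpha$ over states gives the marginal $p(x_{1:T})$.

```python
# Illustrative toy parameters.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.2, 0.8]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
K  = 2

def forward(xs):
    """alpha[t][k] = p(x_{1:t+1}, z_{t+1} = k), with 0-indexed time."""
    alpha = [[pi[k] * B[k][xs[0]] for k in range(K)]]
    for t in range(1, len(xs)):
        alpha.append([
            B[k][xs[t]] * sum(A[j][k] * alpha[t-1][j] for j in range(K))
            for k in range(K)])
    return alpha

alpha = forward([0, 2, 1])
marginal = sum(alpha[-1])   # p(x_{1:T})
```

The cost is $O(TK^2)$, compared to the $K^T$ terms of the naive sum over all state sequences.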

The marginalization can also be done in the reversed order. Instead of separating out the last time step, we separate out the first and have the recursion \begin{align} \sum_{z_{t:T}} p(x_{t:T}, z_{t:T} | z_{t-1}) & = \sum_{z_t} \sum_{z_{t+1:T}} p(z_t | z_{t-1}) p(x_t | z_t) p(x_{t+1:T}, z_{t+1:T} | z_t) \\ & = \sum_{z_t} \left[ p(z_t | z_{t-1}) p(x_t | z_t) \sum_{z_{t+1:T}} p(x_{t+1:T}, z_{t+1:T} | z_t) \right]. \end{align} We give the term a name, and write the recursion more compactly as \begin{align} \beta_{t-1}(z_{t-1}) & = \sum_{z_{t:T}} p(x_{t:T}, z_{t:T} | z_{t-1}) \\ & = \sum_{z_t} p(z_t | z_{t-1}) p(x_t | z_t) \beta_{t}(z_t), \end{align} with the base case $\beta_T(z_T) = 1$.

Even though $\alpha$ and $\beta$ are constructed for dynamic programming, it is easy to forget that they themselves are probabilities. \begin{align} \alpha_t(z_t) &= \sum_{z_{1:t-1}} p(x_{1:t}, z_{1:t}) = p(x_{1:t}, z_t) \\ \beta_t(z_t) &= \sum_{z_{t+1:T}} p(x_{t+1:T}, z_{t+1:T} | z_t) = p(x_{t+1:T} | z_t) \end{align}
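The probability interpretations give a handy sanity check: since $\alpha_t(z_t) \beta_t(z_t) = p(x_{1:t}, z_t) \, p(x_{t+1:T} | z_t) = p(x_{1:T}, z_t)$, summing the product over states at any $t$ should recover the same marginal $p(x_{1:T})$. A sketch with the same toy discrete-emission HMM and illustrative parameters:

```python
# Illustrative toy parameters.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.2, 0.8]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
K  = 2

def forward(xs):
    alpha = [[pi[k] * B[k][xs[0]] for k in range(K)]]
    for t in range(1, len(xs)):
        alpha.append([B[k][xs[t]] * sum(A[j][k] * alpha[t-1][j] for j in range(K))
                      for k in range(K)])
    return alpha

def backward(xs):
    T = len(xs)
    beta = [[1.0] * K]                   # base case: beta_T(k) = 1
    for t in range(T - 1, 0, -1):        # fill beta_{t-1} from beta_t
        beta.append([sum(A[j][k] * B[k][xs[t]] * beta[-1][k] for k in range(K))
                     for j in range(K)])
    return beta[::-1]                    # beta[t][k] for t = 0..T-1

xs = [0, 2, 1]
alpha, beta = forward(xs), backward(xs)
# each entry should equal p(x_{1:T})
checks = [sum(a * b for a, b in zip(alpha[t], beta[t])) for t in range(len(xs))]
```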

Implementation

In practice, probabilities are less than 1, and when multiplying many small real numbers, we quickly run out of precision to represent the results. A numerically stable approach is to store the results in the log domain and exponentiate the numbers only when needed. The recursion becomes \begin{align} \alpha'_t(z_t) & = \log \alpha_t(z_t) \\ & = \log \sum_{z_{t-1}} \left[ p(z_t | z_{t-1}) p(x_t | z_t) \alpha_{t-1}(z_{t-1}) \right] \\ & = \log \sum_{z_{t-1}} \exp \left[ \log p(z_t | z_{t-1}) + \log p(x_t | z_t) + \log \alpha_{t-1}(z_{t-1}) \right] \\ & = \log \sum_{z_{t-1}} \exp \left[ \log p(z_t | z_{t-1}) + \log p(x_t | z_t) + \alpha'_{t-1}(z_{t-1}) \right]. \end{align} The log-sum-exp itself should be evaluated by subtracting the largest term inside the exponentials and adding it back after the log; otherwise the exponentials can still underflow.
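A sketch of the log-domain forward pass, again on the toy discrete-emission HMM with illustrative parameters; `logsumexp` subtracts the maximum before exponentiating so the sum stays in range.

```python
import math

# Illustrative toy parameters.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.2, 0.8]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
K  = 2

def logsumexp(vals):
    """Stable log(sum(exp(v))) by factoring out the largest term."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_forward(xs):
    """alpha'_t(k) = log alpha_t(k), computed entirely in the log domain."""
    la = [math.log(pi[k]) + math.log(B[k][xs[0]]) for k in range(K)]
    for t in range(1, len(xs)):
        la = [logsumexp([la[j] + math.log(A[j][k]) for j in range(K)])
              + math.log(B[k][xs[t]]) for k in range(K)]
    return la

log_marginal = logsumexp(log_forward([0, 2, 1]))   # log p(x_{1:T})
```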

The Posterior Probability

Since we know how to compute the joint probability and the marginal probability, computing the posterior is simply \begin{align} p(z_{1:T} | x_{1:T}) = \frac{p(x_{1:T}, z_{1:T})}{p(x_{1:T})}. \end{align}
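Putting the pieces together, the posterior of a particular state sequence is the ratio of the joint (from the factorization) and the marginal (from the forward pass). A sketch with the same toy discrete-emission HMM and illustrative parameters:

```python
# Illustrative toy parameters.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.2, 0.8]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
K  = 2

def joint(xs, zs):
    """p(x_{1:T}, z_{1:T}) from the factored terms."""
    p = pi[zs[0]] * B[zs[0]][xs[0]]
    for t in range(1, len(xs)):
        p *= A[zs[t-1]][zs[t]] * B[zs[t]][xs[t]]
    return p

def marginal(xs):
    """p(x_{1:T}) via the forward recursion."""
    alpha = [pi[k] * B[k][xs[0]] for k in range(K)]
    for t in range(1, len(xs)):
        alpha = [B[k][xs[t]] * sum(A[j][k] * alpha[j] for j in range(K))
                 for k in range(K)]
    return sum(alpha)

xs = [0, 2, 1]
posterior = joint(xs, [0, 1, 1]) / marginal(xs)   # p(z_{1:T} | x_{1:T})
```

By construction, summing this posterior over all $K^T$ state sequences gives 1.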