Hao Tang 唐顥

Hidden Markov Models (Part 2)

We have discussed how to use an HMM when the parameters are given. In this part, we will look at how the parameters are learned given the data.

There are many approaches to training HMMs, but we will focus only on the most commonly used one, expectation maximization (EM).

Recap of Expectation Maximization

We start with the variational lower bound of the log likelihood. \begin{align} \sum_{n=1}^N \log p(x^n) &= \sum_{n=1}^N \log \sum_{z^n} p(x^n, z^n) \\ &= \sum_{n=1}^N \log \sum_{z^n} q^n(z^n) \frac{p(x^n, z^n)}{q^n(z^n)} \\ &\geq \sum_{n=1}^N \sum_{z^n} q^n(z^n) \log \frac{p(x^n, z^n)}{q^n(z^n)} \\ &= \sum_{n=1}^N \left[ \sum_{z^n} q^n(z^n) \log p(x^n, z^n) - \sum_{z^n} q^n(z^n) \log q^n(z^n) \right] = \text{ELBO} \end{align} The inequality is Jensen's inequality, and follows from the concavity of the log function. Note that we use $x^n$ to denote the $n$-th sequence in the data set, and reserve subscripts for indexing time steps.
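As a quick numerical sanity check of the bound, consider a toy model with a single observation and a binary latent variable. The joint table and the auxiliary distribution below are made-up numbers for illustration only.

```python
import numpy as np

# Toy check of the variational lower bound: one observation x with a
# binary latent z. The joint table p(x, z) and the auxiliary q are
# made-up numbers for illustration.
p_xz = np.array([0.3, 0.1])    # p(x, z=0), p(x, z=1) for the observed x
q = np.array([0.6, 0.4])       # an arbitrary auxiliary distribution

log_lik = np.log(p_xz.sum())                     # log p(x)
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))    # the lower bound for this q

# Equality holds exactly when q(z) = p(z | x).
q_post = p_xz / p_xz.sum()
elbo_tight = np.sum(q_post * (np.log(p_xz) - np.log(q_post)))
```

For any choice of `q` the ELBO stays below the log likelihood, and setting `q` to the posterior closes the gap.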

The lower bound is a function of the HMM parameters and the auxiliary probability distributions $q^1, \dots, q^N$. When $q^n(z^n) = p(z^n|x^n)$, the inequality becomes an equality. The EM algorithm iterates between two steps: the E-step, which sets $q^n(z^n) = p(z^n | x^n)$ under the current parameters, and the M-step, which maximizes the lower bound with respect to the parameters.

We know from part 1 how to compute the conditional probability $p(z^n | x^n)$ in the E-step. In the M-step, we need to maximize \begin{align} Q(\Lambda) &= \sum_{n=1}^N \sum_{z^n} p(z^n|x^n) \log p_{\Lambda}(x^n, z^n), \end{align} where $\Lambda$ is used to denote the HMM parameters. Recall that \begin{align} p(z_1) &= \pi_{z_1} \\ p(z_{t+1} | z_t) &= a_{z_t z_{t+1}} \\ p(x_t | z_t) &= f_\theta(x_t|z_t) \end{align} so $\Lambda = \{\pi, A, \theta\}$.
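To make the factorization of $\log p_\Lambda(x, z)$ concrete, here is a minimal sketch for a discrete-emission HMM, where $f_\theta(x_t | z_t)$ reduces to a lookup table. The parameter values `pi`, `A`, and `B` are made up for illustration and are not from the text.

```python
import numpy as np

# log p(x, z) = log pi_{z_1} + sum_t log a_{z_{t-1} z_t} + sum_t log f(x_t | z_t)
# for a discrete-emission HMM. pi, A, B are illustrative, not from the text.
pi = np.array([0.7, 0.3])                  # prior over K = 2 states
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # A[i, j] = p(z_{t+1} = j | z_t = i)
B = np.array([[0.5, 0.5], [0.1, 0.9]])     # B[k, v] = p(x_t = v | z_t = k)

def log_joint(x, z):
    """Complete-data log likelihood log p(x_{1:T}, z_{1:T})."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(A[z[t - 1], z[t]]) + np.log(B[z[t], x[t]])
    return lp
```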

Computing the Lower Bound

To compute the lower bound, we can focus on one particular sequence of outputs $x_{1:T}$ and sum over all state sequences $z_{1:T}$. \begin{align} \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \log p_{\Lambda}(x_{1:T}, z_{1:T}) &= \frac{1}{p(x_{1:T})} \sum_{z_{1:T}} p(x_{1:T}, z_{1:T}) \log p_{\Lambda}(x_{1:T}, z_{1:T}) \end{align} From part 1, we know how to compute the marginal probability $p(x_{1:T})$. For the rest, we plug in the joint distribution of the HMM. As in part 1, we separate the last time step to derive a dynamic program. Define $\alpha'_t(z_t) = \sum_{z_{1:t-1}} p(x_{1:t}, z_{1:t}) \log p(x_{1:t}, z_{1:t})$, so that the sum we need is $\sum_{z_T} \alpha'_T(z_T)$. Expanding the definition gives the recursion \begin{align} \alpha'_{t}(z_t) &= \sum_{z_{1:t-1}} p(x_{1:t-1}, z_{1:t-1}) p(z_t | z_{t-1}) p(x_t | z_t) \log \Bigg[ p(x_{1:t-1}, z_{1:t-1}) p(z_t | z_{t-1}) p(x_t | z_t) \Bigg] \\ &= \sum_{z_{1:t-1}} p(x_{1:t-1}, z_{1:t-1}) p(z_t | z_{t-1}) p(x_t | z_t) \Bigg[ \log p(x_{1:t-1}, z_{1:t-1}) + \log p(z_t | z_{t-1}) p(x_t | z_t) \Bigg] \\ &= \sum_{z_{t-1}} \sum_{z_{1:t-2}} p(x_{1:t-1}, z_{1:t-1}) p(z_t | z_{t-1}) p(x_t | z_t) \Bigg[ \log p(x_{1:t-1}, z_{1:t-1}) + \log p(z_t | z_{t-1}) p(x_t | z_t) \Bigg] \\ &= \sum_{z_{t-1}} p(z_t | z_{t-1}) p(x_t | z_t) \Bigg[ \sum_{z_{1:t-2}} p(x_{1:t-1}, z_{1:t-1}) \log p(x_{1:t-1}, z_{1:t-1}) \\ & \qquad + \Big( \log p(z_t | z_{t-1}) p(x_t | z_t) \Big) \left( \sum_{z_{1:t-2}} p(x_{1:t-1}, z_{1:t-1}) \right) \Bigg] \\ &= \sum_{z_{t-1}} p(z_t | z_{t-1}) p(x_t | z_t) \Big[ \alpha'_{t-1}(z_{t-1}) + \alpha_{t-1}(z_{t-1}) \log p(z_t | z_{t-1}) p(x_t | z_t) \Big] \end{align}
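The recursion above can be sketched as follows for a discrete-emission HMM. The parameters `pi`, `A`, `B` are illustrative, and a brute-force enumeration over all state sequences is included so the recursion can be checked on short sequences.

```python
import numpy as np
import itertools

# A sketch of the alpha'-recursion for sum_z p(x, z) log p(x, z), with
# illustrative parameters (not from the text). B[k, v] = f(x_t = v | z_t = k).
pi = np.array([0.7, 0.3])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])

def expected_log_joint(x):
    # alpha_t(k)  = p(x_{1:t}, z_t = k)
    # alphap_t(k) = sum over z_{1:t-1} of p(x_{1:t}, z_{1:t}) log p(x_{1:t}, z_{1:t})
    alpha = pi * B[:, x[0]]
    alphap = alpha * np.log(alpha)
    for t in range(1, len(x)):
        step = A * B[:, x[t]][None, :]     # step[i, j] = a_{ij} f(x_t | j)
        # alpha'_t(j) = sum_i step[i, j] * (alpha'_{t-1}(i)
        #                                   + alpha_{t-1}(i) * log step[i, j])
        new_alphap = alphap @ step + (alpha[:, None] * step * np.log(step)).sum(axis=0)
        alpha, alphap = alpha @ step, new_alphap
    return alphap.sum()                    # sum over z_T of alpha'_T(z_T)

def brute_force(x):
    # Direct enumeration of sum_z p(x, z) log p(x, z) for checking.
    total = 0.0
    for z in itertools.product(range(len(pi)), repeat=len(x)):
        p = pi[z[0]] * B[z[0], x[0]]
        for t in range(1, len(x)):
            p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
        total += p * np.log(p)
    return total
```

The recursion runs in $O(TK^2)$ time, while the brute force is $O(K^T)$, so the check is only feasible for short sequences.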

Maximizing the Lower Bound

The general recipe for deriving the update rules is to take the derivative of $Q$ with respect to the parameter, set it to zero, and solve. It is not always possible to derive closed-form solutions in general, but luckily for HMMs we do have them.

The Update Rule for the Prior $\pi$

The derivative of $Q$ with respect to $\pi_k$ is as follows. \begin{align} & \frac{\partial}{\partial \pi_k} \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \log p(x_{1:T}, z_{1:T}) \\ &\quad = \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \frac{\partial}{\partial \pi_k} \log p(x_{1:T}, z_{1:T}) \\ &\quad = \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \frac{\partial}{\partial \pi_k} \log \pi_{z_1} \\ &\quad = \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \frac{\mathbb{1}_{k = z_1}}{\pi_k} \\ &\quad = \sum_{z_1} p(z_1|x_{1:T}) \frac{\mathbb{1}_{k = z_1}}{\pi_k} \\ &\quad = \frac{p(z_1 = k|x_{1:T})}{\pi_k} \end{align} The final result involves marginalizing over all states except $z_1$, and this is exactly what the backward algorithm from part 1 computes. With the backward variable $\beta_t(k) = p(x_{t+1:T} | z_t = k)$, \begin{align} p(z_1 = k | x_{1:T}) = \frac{p(x_{1:T}| z_1 = k) p(z_1 = k)}{p(x_{1:T})} = \frac{p(x_1 | z_1 = k) \beta_1(k) \pi_k}{p(x_{1:T})}. \end{align}

The derivative of $Q$ alone is not sufficient, because $\pi$ needs to be constrained to be a stochastic vector. The actual objective with the constraint becomes \begin{align} L = Q + \eta \left(1 - \sum_{k=1}^K \pi_k \right), \end{align} where $\eta$ is the Lagrange multiplier. Now we can set the derivative of $L$ to zero. \begin{align} \frac{\partial}{\partial \pi_k} L = \sum_{n=1}^N \frac{p(z^n_1 = k | x^n)}{\pi_k} - \eta = 0 \end{align} We get the closed-form solution \begin{align} \pi_k = \frac{1}{\eta} \sum_{n=1}^N p(z^n_1 = k | x^n). \end{align} To get $\eta$, we can plug the solution for $\pi$ into the constraint. \begin{align} \sum_{k=1}^K \pi_k = \frac{1}{\eta} \sum_{k=1}^K \sum_{n=1}^N p(z^n_1 = k | x^n) = 1. \end{align} This gives \begin{align} \eta = \sum_{k=1}^K \sum_{n=1}^N p(z^n_1 = k | x^n), \end{align} which is simply $N$, since each posterior sums to one over $k$. If we plug $\eta$ back into $\pi$, we arrive at \begin{align} \pi_k = \frac{\sum_{n=1}^N p(z^n_1 = k | x^n)}{\sum_{k=1}^K \sum_{n=1}^N p(z^n_1 = k | x^n)}. \end{align} The resulting update has an intuitive interpretation: the probability of the first state being $k$ is the sum of the posterior probabilities of the first state being $k$ after seeing the data, normalized.
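The update can be sketched as follows for a discrete-emission HMM, using a backward recursion with the convention $\beta_t(k) = p(x_{t+1:T} \mid z_t = k)$. The parameters and sequences are illustrative assumptions, not from the text.

```python
import numpy as np

# A sketch of the pi update. pi, A, B are illustrative current parameters.
# B[k, v] = f(x_t = v | z_t = k).
pi = np.array([0.7, 0.3])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])

def backward(x):
    """beta_t(k) = p(x_{t+1:T} | z_t = k), computed right to left."""
    T, K = len(x), len(pi)
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def update_pi(sequences):
    """New pi: posteriors p(z_1 = k | x^n), summed over sequences, normalized."""
    counts = np.zeros(len(pi))
    for x in sequences:
        beta = backward(x)
        joint = pi * B[:, x[0]] * beta[0]      # p(x, z_1 = k)
        counts += joint / joint.sum()          # p(z_1 = k | x)
    return counts / counts.sum()
```

Note that the denominator `counts.sum()` equals the number of sequences, matching the observation that $\eta = N$.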

The Update Rule for the Transition Matrix $A$

The update rule for the transition matrix is derived in a similar way. \begin{align} & \frac{\partial}{\partial a_{ij}} \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \log p(x_{1:T}, z_{1:T}) \\ &\quad = \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \frac{\partial}{\partial a_{ij}} \log p(x_{1:T}, z_{1:T}) \\ &\quad = \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \sum_{t=2}^T \frac{\partial}{\partial a_{ij}} \log p(z_t|z_{t-1}) \\ &\quad = \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \sum_{t=2}^T \frac{\partial}{\partial a_{ij}} \log a_{z_{t-1} z_t} \\ &\quad = \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \sum_{t=2}^T \frac{\mathbb{1}_{i = z_{t-1}, j = z_t}}{a_{ij}} \\ &\quad = \frac{1}{p(x_{1:T})} \sum_{z_{1:T}} p(x_{1:T}, z_{1:T}) \sum_{t=2}^T \frac{\mathbb{1}_{i = z_{t-1}, j = z_t}}{a_{ij}} \\ &\quad = \frac{1}{p(x_{1:T})} \sum_{t=2}^T \sum_{z_{t-1} z_t} \sum_{z_{1:t-2}} \sum_{z_{t+1:T}} p(x_{t+1:T}, z_{t+1:T} | z_t) p(x_t | z_t) p(z_t|z_{t-1}) p(x_{1:t-1}, z_{1:t-1}) \frac{\mathbb{1}_{i = z_{t-1}, j = z_t}}{a_{ij}} \\ &\quad = \frac{1}{p(x_{1:T})} \sum_{t=2}^T \sum_{z_{t-1} z_t} p(x_t | z_t) p(z_t|z_{t-1}) \frac{\mathbb{1}_{i = z_{t-1}, j = z_t}}{a_{ij}} \sum_{z_{1:t-2}} p(x_{1:t-1}, z_{1:t-1}) \sum_{z_{t+1:T}} p(x_{t+1:T}, z_{t+1:T} | z_t) \\ &\quad = \frac{1}{p(x_{1:T})} \sum_{t=2}^T \sum_{z_{t-1} z_t} p(x_t | z_t) p(z_t|z_{t-1}) \frac{\mathbb{1}_{i = z_{t-1}, j = z_t}}{a_{ij}} \alpha_{t-1}(z_{t-1}) \beta_{t}(z_t) \end{align} The dynamic program is constructed by choosing a time step $t$ and splitting the marginalization into the part before $t$ and the part after, one becoming the forward variable $\alpha$ and the other the backward variable $\beta$.

We add the constraints and have the Lagrangian \begin{align} L = Q + \sum_{i=1}^K \lambda_i \left(1 - \sum_{j=1}^K a_{ij} \right), \end{align} where $\lambda_1, \dots, \lambda_K$ are the Lagrange multipliers. We set the derivative of the Lagrangian to zero. \begin{align} \frac{\partial}{\partial a_{ij}} L = - \lambda_i + \frac{\partial}{\partial a_{ij}} \sum_{z_{1:T}} p(z_{1:T}|x_{1:T}) \log p(x_{1:T}, z_{1:T}) = 0 \end{align} The resulting closed-form solution is \begin{align} a_{ij} = \frac{1}{\lambda_i}\frac{1}{p(x_{1:T})} \sum_{t=2}^T \sum_{z_{t-1} z_t} p(x_t | z_t) p(z_t|z_{t-1}) \mathbb{1}_{i = z_{t-1}, j = z_t} \alpha_{t-1}(z_{t-1}) \beta_{t}(z_t). \end{align} Plugging the solution back into the constraint, \begin{align} \sum_{j=1}^K a_{ij} = \sum_{j=1}^K \frac{1}{\lambda_i}\frac{1}{p(x_{1:T})} \sum_{t=2}^T \sum_{z_{t-1} z_t} p(x_t | z_t) p(z_t|z_{t-1}) \mathbb{1}_{i = z_{t-1}, j = z_t} \alpha_{t-1}(z_{t-1}) \beta_{t}(z_t) = 1. \end{align} We get \begin{align} \lambda_i = \sum_{j=1}^K \frac{1}{p(x_{1:T})} \sum_{t=2}^T \sum_{z_{t-1} z_t} p(x_t | z_t) p(z_t|z_{t-1}) \mathbb{1}_{i = z_{t-1}, j = z_t} \alpha_{t-1}(z_{t-1}) \beta_{t}(z_t). \end{align} Putting this back into the closed-form solution, we arrive at \begin{align} a_{ij} = \frac{\sum_{t=2}^T \sum_{z_{t-1} z_t} p(x_t | z_t) p(z_t|z_{t-1}) \mathbb{1}_{i = z_{t-1}, j = z_t} \alpha_{t-1}(z_{t-1}) \beta_{t}(z_t)}{\sum_{j=1}^K \sum_{t=2}^T \sum_{z_{t-1} z_t} p(x_t | z_t) p(z_t|z_{t-1}) \mathbb{1}_{i = z_{t-1}, j = z_t} \alpha_{t-1}(z_{t-1}) \beta_{t}(z_t)}. \end{align} The solution has an intuitive interpretation: the probability of transitioning from state $i$ to state $j$ is the expected number of times this transition occurs anywhere in the sequence, given the data, normalized over all transitions out of state $i$.
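The update can be sketched as follows for a single sequence of a discrete-emission HMM, with $\alpha_t(k) = p(x_{1:t}, z_t = k)$ and $\beta_t(k) = p(x_{t+1:T} \mid z_t = k)$. The parameters `pi`, `A`, `B` are illustrative assumptions; the factor $1/p(x_{1:T})$ is omitted because it cancels in the normalization.

```python
import numpy as np

# A sketch of the transition-matrix update for one sequence, with
# illustrative current parameters (not from the text).
# B[k, v] = f(x_t = v | z_t = k).
pi = np.array([0.7, 0.3])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])

def forward(x):
    """alpha_t(k) = p(x_{1:t}, z_t = k)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha

def backward(x):
    """beta_t(k) = p(x_{t+1:T} | z_t = k)."""
    T, K = len(x), len(pi)
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def update_A(x):
    """New a_ij: expected transition counts, normalized per row.
    The common 1 / p(x) factor cancels in the normalization."""
    alpha, beta = forward(x), backward(x)
    counts = np.zeros_like(A)
    for t in range(1, len(x)):
        # counts[i, j] += alpha_{t-1}(i) a_{ij} f(x_t | j) beta_t(j)
        counts += alpha[t - 1][:, None] * A * (B[:, x[t]] * beta[t])[None, :]
    return counts / counts.sum(axis=1, keepdims=True)
```

For multiple sequences, the counts would be accumulated over all sequences before normalizing, just as in the $\pi$ update.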