Hao Tang 唐顥

Template Matching

One great advantage of representing speech as spectrograms is that it enables us to visually analyze and classify speech segments. Below are two speech segments of the word two and one speech segment of the word one, all spoken by the same speaker. We can tell that the segments in Figures 1 and 2 are visually similar to each other and quite different from the one in Figure 3. In this post, we will formalize this intuition.

Figure 1. An example speech segment of the word two.
Figure 2. Another example speech segment of the word two spoken by the same speaker in the speech in Figure 1.
Figure 3. An example speech segment of the word one spoken by the same speaker in the speech in Figure 1.

Spectral Distortion

Recall that speech is nonstationary, meaning that its content changes over time. If we take a closer look at the speech segment in Figure 1, we can see several distinct regions, for example, the silence at the beginning and the end, the first phone [t], and the second phone [u]. To see how these regions differ, the same speech segment is shown in Figure 4 with more details. In particular, if we take slices of the spectrogram, often called spectra or frames, we can see that the spectra at times 0.1s, 0.25s, and 0.35s (marked in Figure 4 and shown in Figure 5) are significantly different from each other.

Figure 4. The spectrogram in Figure 1 annotated. The phones are annotated at the top of the spectrogram. The vertical lines show where we take spectral slices, and the spectra are shown in Figure 5.
Figure 5. The spectra of the corresponding time points in Figure 4.

The problem of measuring the similarity of speech segments inevitably involves measuring the similarity of frames. In this post, we simply use the $\ell_2$ norm as the distance of two spectra. More formally, we represent a frame as a vector, in our case, a 40-dimensional vector. The distance between two frames $x$ and $y$ is measured by \begin{align} \|x - y\|_2 = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}. \end{align} There are many other alternatives, and the family of distances is generally called spectral-distortion measures. We will stick to $\ell_2$ for the rest of the post, but it is easy to plug and play different distance functions into the formulation we are going to discuss.
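As a quick sanity check, the $\ell_2$ distance between two frames can be written in a few lines of NumPy. The 40-dimensional vectors here are random stand-ins for real spectral slices:

```python
import numpy as np

def frame_distance(x, y):
    """l2 (Euclidean) distance between two spectral frames."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two toy 40-dimensional "frames"; real frames would be spectrogram slices.
rng = np.random.default_rng(0)
a, b = rng.normal(size=40), rng.normal(size=40)
print(frame_distance(a, b))   # same value as np.linalg.norm(a - b)
print(frame_distance(a, a))   # 0.0 -- a frame is at distance 0 from itself
```

Swapping in a different spectral-distortion measure only requires replacing the body of `frame_distance`.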

Given two speech segments, one of $T_1$ frames and the other of $T_2$ frames, there are only $T_1 T_2$ distances to consider. The distances can be represented as a matrix, which is shown in Figure 6 along with the speech segments. We can see a dark diagonal line around the center of the distance matrix. The dark line shows us a path from the bottom left to the top right that incurs the lowest distortion. In other words, the dark line tells us how the frames in the two speech segments should be matched, or aligned.
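The full $T_1 \times T_2$ distance matrix can be computed with NumPy broadcasting; random arrays stand in for the two spectrograms here:

```python
import numpy as np

def distance_matrix(X, Y):
    """All pairwise l2 distances between frames.
    X: (T1, d) array, Y: (T2, d) array -> (T1, T2) matrix D,
    with D[s, t] = ||X[s] - Y[t]||_2."""
    diff = X[:, None, :] - Y[None, :, :]   # broadcast to (T1, T2, d)
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))   # toy segment of T1 = 5 frames
Y = rng.normal(size=(7, 40))   # toy segment of T2 = 7 frames
D = distance_matrix(X, Y)
print(D.shape)                 # (5, 7): only T1 * T2 distances to consider
```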

Figure 6. The distances for all pairs of frames in the speech segments of Figure 1 and Figure 2.

Since no two words, even from the same speaker, are ever pronounced exactly the same way, we have to account for a certain amount of variation when comparing two speech segments. One hypothesis is that the words being uttered remain the same to a listener as long as certain targets are met as the speaker produces the speech. As a consequence, we should be able to align two speech segments if they contain the same set of targets.

The approach we are going to discuss lets the spectral-distortion measure handle spectral variation and lets the alignment handle variation in speaking rate. This approach assumes that spectral changes are completely synchronized in time. This is of course idealized, but we can go really far with this simple idea.

Dynamic Time Warping (DTW)

There are two constraints to consider when aligning two speech segments.

  1. Every frame needs to be aligned.
  2. When a frame is aligned to a frame at a particular time, subsequent frames can only be aligned to frames no earlier than that time.

The goal is to find an alignment that achieves the lowest distortion (in our case the $\ell_2$ norm) satisfying the constraints. Suppose we are given two spectrograms, represented as matrices, $X$ and $Y$. We use $X_s$ to denote the frame at time $s$ in $X$, and $X_{s:t}$ to denote the frames from time $s$ to time $t$ in $X$. We use $d[s, t]$ to denote the minimum distortion when aligning $X_{1:s}$ and $Y_{1:t}$.

The problem of finding the alignment of minimum distortion can be decomposed into sub-problems. In particular, we can ask the question: Does knowing the result of aligning $X_{1:s-1}$ to $Y_{1:t}$ and similar cases make it simpler to align $X_{1:s}$ and $Y_{1:t}$? Indeed, we only need to consider how to align $X_s$ if we know how to align $X_{1:s-1}$ and $Y_{1:t}$. In general, we have the recursion \begin{align} d[s, t] = \min \Bigg\{ d[s, t-1], d[s-1, t], d[s-1, t-1] \Bigg\} + \|X_{s} - Y_{t}\|. \end{align} In words, to align $X_{1:s}$ and $Y_{1:t}$, we can first align $X_{1:s}$ with $Y_{1:t-1}$, align $X_{1:s-1}$ with $Y_{1:t}$, or align $X_{1:s-1}$ with $Y_{1:t-1}$, then align the final frames $X_s$ and $Y_t$ and take whichever option gives the minimum distortion.
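The recursion translates directly into a table-filling procedure. A minimal sketch (0-indexed, with out-of-range entries treated as infinite; toy 1-D frames stand in for spectra):

```python
import numpy as np

def dtw(X, Y):
    """Minimum alignment distortion between frame sequences X (T1, d)
    and Y (T2, d), using the recursion
    d[s, t] = min(d[s, t-1], d[s-1, t], d[s-1, t-1]) + ||X[s] - Y[t]||."""
    T1, T2 = len(X), len(Y)
    d = np.full((T1, T2), np.inf)
    for s in range(T1):
        for t in range(T2):
            cost = np.linalg.norm(X[s] - Y[t])
            if s == 0 and t == 0:
                d[s, t] = cost
            else:
                d[s, t] = cost + min(
                    d[s, t-1] if t > 0 else np.inf,
                    d[s-1, t] if s > 0 else np.inf,
                    d[s-1, t-1] if s > 0 and t > 0 else np.inf,
                )
    return d[-1, -1]

# Toy 1-D "spectra": Y is X with its first frame repeated, so a perfect
# warping exists and the minimum distortion is 0.
X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [2.0]])
print(dtw(X, Y))   # 0.0
```

Keeping backpointers alongside `d` recovers the alignment path itself, not just its cost.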

The algorithm above is called dynamic time warping, and is based on classical dynamic programming. Other variants exist, for example, allowing frames to be discarded instead of aligned. The alignment of minimum distortion is shown in Figure 7 as a red line.

Figure 7. The minimum distortion path (in red) overlaid on top of Figure 6.

Note that the alignment is off diagonal at the beginning. This is because the silence frames all look more or less the same, so it does not matter much how we align them. To avoid this issue, we can excise the silence. The resulting alignment is shown in Figure 8.

Figure 8. The alignment of minimum distortion with silence removed.
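One simple way to excise silence is to threshold frame energy. A minimal sketch; the cutoff of 10% of the peak frame energy is an assumed heuristic, not a value from the post:

```python
import numpy as np

def excise_silence(frames, threshold=None):
    """Drop leading and trailing low-energy frames.
    frames: (T, d) magnitude spectra; threshold is an energy cutoff,
    defaulting to a fraction of the peak frame energy (a heuristic)."""
    frames = np.asarray(frames)
    energy = (frames ** 2).sum(axis=1)
    if threshold is None:
        threshold = 0.1 * energy.max()
    voiced = np.where(energy > threshold)[0]
    if len(voiced) == 0:
        return frames[:0]                        # everything is silence
    return frames[voiced[0] : voiced[-1] + 1]

# Toy segment: 3 silent frames, 5 voiced frames, 2 silent frames.
frames = np.vstack([np.zeros((3, 4)), np.ones((5, 4)), np.zeros((2, 4))])
trimmed = excise_silence(frames)
print(trimmed.shape)   # (5, 4): only the voiced frames remain
```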

Template Matching

In addition to producing an alignment between speech segments, the minimum distortion can serve as a distance measure between speech segments. Table 1 shows the minimum distortions of the excised two against the other speech segments. It is clear that the distortions are lower when the words are the same than when they are different. In other words, we can use distortions to detect the presence of a word. We refer to the word spectrograms without silence as templates. To recognize speech, a naive approach is to run an utterance through DTW against a library of templates. The sequence of word templates with the lowest distortion gives the hypothesized words in the utterance.

Table 1. The minimum distortion of different pairs of segments.
                          Two (Figure 1)   Two (Figure 2)   One (Figure 3)
  Two (without silence)       379.10           585.71           917.95
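Putting the pieces together, isolated-word recognition by template matching reduces to picking the template with the lowest DTW distortion. A toy sketch with hypothetical 1-D templates standing in for excised word spectrograms:

```python
import numpy as np

def dtw_cost(X, Y):
    # Standard DTW minimum-distortion cost (the recursion from the post).
    d = np.full((len(X), len(Y)), np.inf)
    for s in range(len(X)):
        for t in range(len(Y)):
            c = np.linalg.norm(X[s] - Y[t])
            prev = 0.0 if s == t == 0 else min(
                d[s, t-1] if t else np.inf,
                d[s-1, t] if s else np.inf,
                d[s-1, t-1] if s and t else np.inf)
            d[s, t] = prev + c
    return d[-1, -1]

def classify(segment, templates):
    """Return the word whose template has the lowest DTW distortion."""
    return min(templates, key=lambda w: dtw_cost(segment, templates[w]))

# Toy 1-D "spectra": a rising contour vs. a flat one.
templates = {"two": np.array([[0.0], [1.0], [2.0]]),
             "one": np.array([[2.0], [2.0], [2.0]])}
segment = np.array([[0.0], [0.0], [1.0], [2.0]])   # rising, slightly slower
print(classify(segment, templates))                # "two"
```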

To illustrate this idea, consider the speech in Figure 9. It consists of the three words two eight two. To recognize the words, we can run DTW against various combinations of templates. The alignment of Figure 9 with the concatenation of the templates two eight two is shown in Figure 10.

Figure 9. An example speech segment of the words two eight two.
Figure 10. The alignment of the speech in Figure 9 with the concatenation of the templates two eight two.

We obviously do not want to run DTW on combinatorially many word sequences. We again leverage dynamic programming to turn this into a tractable problem. The key observation is that computing DTW against a sequence of word templates is relatively cheap as long as the DTW of a prefix of the sequence has already been computed. In other words, we can keep the DTW traces and keep concatenating word templates to match longer word sequences. When we decide to concatenate a word template onto a sequence of templates that has already been matched, we can simply continue the dynamic programming by extending the table with more entries. The remaining decision is which word template to choose for concatenation.
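For contrast, here is the deliberately naive version that enumerates every length-$n$ template sequence, concatenates the templates, and reruns DTW from scratch each time; the prefix-sharing idea above exists precisely to avoid this combinatorial blow-up. The length-2 templates are hypothetical 1-D stand-ins:

```python
import itertools
import numpy as np

def dtw_cost(X, Y):
    # Standard DTW minimum-distortion cost (the recursion from the post).
    d = np.full((len(X), len(Y)), np.inf)
    for s in range(len(X)):
        for t in range(len(Y)):
            c = np.linalg.norm(X[s] - Y[t])
            prev = 0.0 if s == t == 0 else min(
                d[s, t-1] if t else np.inf,
                d[s-1, t] if s else np.inf,
                d[s-1, t-1] if s and t else np.inf)
            d[s, t] = prev + c
    return d[-1, -1]

def best_sequence(utterance, templates, n_words):
    """Brute force: DTW against every concatenation of n_words templates.
    Only for illustration; sharing DTW tables across common prefixes
    avoids recomputing each sequence from scratch."""
    best = None
    for seq in itertools.product(templates, repeat=n_words):
        concat = np.concatenate([templates[w] for w in seq])
        cost = dtw_cost(utterance, concat)
        if best is None or cost < best[1]:
            best = (seq, cost)
    return best

templates = {"two": np.array([[0.0], [2.0]]),
             "eight": np.array([[5.0], [5.0]])}
utt = np.array([[0.0], [2.0], [5.0], [5.0], [0.0], [2.0]])  # "two eight two"
seq, cost = best_sequence(utt, templates, 3)
print(seq)   # ('two', 'eight', 'two')
```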

We can organize the decisions of what word template to use as a graph (shown in Figure 11). There are two benefits of representing the decisions of word templates as a graph. First, it is now clear what computation of dynamic programming can be reused. More importantly, we can now see template matching as a graph search problem, where the weights on the graph are the minimum distortion computed by DTW.

Figure 11. Template matching as a search problem.