Hao Tang 唐顥

CTC does not always induce a proper probability distribution

This note discusses why connectionist temporal classification (CTC) does not always induce a proper probability distribution.1 The result is perhaps not surprising to many, but it is never formally written down. At the high level, the probability distribution is not proper when repeating labels appear in the label sequence. I will briefly review the definition of CTC, formally define the problem, and discuss the sufficient conditions for the induced distribution to be proper.

Suppose we have a sequence of $T$ frames $x = (x_1, \dots, x_T)$ and a sequence of $K$ labels $y = (y_1, \dots, y_K)$ for $y_k \in L$ for $k = 1, \dots, K$ and for some label set $L$. Let $B : (L \cup \{\varnothing\})^* \to L^*$ be a function that collapses repeating labels and removes blanks ($\varnothing$). CTC defines the probability distribution \begin{align} p(y | x) = \sum_{z \in B^{-1}(y)} p(z | x) \end{align} where $B^{-1}$ is the pre-image of $B$. In words, if $z \in B^{-1}(y)$, then $y = B(z)$. Each $z_t$ is assumed to be independent of $z_{t'}$ for $t' \neq t$, so we have \begin{align} p(z | x) = \prod_{t=1}^T p(z_t | x). \end{align} The sequence $z$ is often called an alignment or a path.

Every time we refer to a function as a probability distribution, we should always ask ourselves whether the distribution is well defined. For a function $p$ to be a proper probability distribution, it should satisfy $p(\omega) \geq 0$ for all $\omega$ is the domain and $\sum_{\omega} p(\omega) = 1$. Here is the central question of this note: Given the definition of CTC above, is it true that $\sum_{y} p(y|x) = 1$?

Sufficient conditions for properness

Before looking at $p(y|x)$, we first show that $p(z|x)$ is proper. Since $p(z_t|x)$'s are almost always the result of a softmax, we can assume $p(z_t|x)$ is proper. It is also not hard to see that \begin{align} \sum_{z} p(z|x) & = \sum_{z_1} \dots \sum_{z_T} p(z|x) = \sum_{z_1} \dots \sum_{z_T} \prod_{t=1}^T p(z_t|x) \\ & = \prod_{t=1}^T \sum_{z_t} p(z_t|x) = \prod_{t=1}^T 1 = 1, \end{align} so $p(z|x)$ is proper as long as $p(z_t|x)$'s are.

Given that $p(z|x)$ is proper and that $p(y|x)$ is a sum based on $p(z|x)$, $p(y|x)$ should be proper if the sum is carefully handled. We will use the following fact about probability distributions. Let $\Omega$ be the event space of distribution $p$. If $A_1, \dots, A_n$ are disjoint subsets of $\Omega$ and $\bigcup_{i=1}^n A_i = \Omega$, then \begin{align} \sum_{i=1}^n p(A_i) = p(\bigcup_{i=1}^n A_i) = p(\Omega) = 1. \end{align} The key condition here is disjointness ($A_i \cap A_j = \emptyset$ for any $i$ and $j$) and coverage ($\bigcup_{i=1}^n A_i = \Omega$, or for any $\omega \in \Omega$, there exists $i \in \{1, \dots, n\}$ such that $\omega \in A_i$).

To apply the above fact, let $\Omega$ be the domain of $p(z|x)$, i.e., $(L \cup \{\varnothing\})^T$. The function $p(y|x)$ sums over $B^{-1}(y)$, a subset of $\Omega$. If the subsets are disjoint and they cover everything in $\Omega$, then $\sum_{y} p(y|x) = 1$ and $p(y|x)$ is proper. More formally, disjointness requires $B^{-1}(y) \cap B^{-1}(y') = \emptyset$ for any $y$ and $y'$, and coverage requires that there is a $y$ for every $z$ such that $z \in B^{-1}(y)$. These form the sufficient condition for $p(y|x)$ to be proper.

In the case of CTC, it is trivially true that any alignment $z$ there is a label sequence $y$ such that $y = B(z)$—we simply apply $B$ on $z$ to get a $y$. The deciding condition is whether $B^{-1}(y)$ and $B^{-1}(y')$ are disjoint for any $y$ and $y'$. What happens when there are $y$ and $y'$ such that $B^{-1}(y) \cap B^{-1}(y') \neq \emptyset$? If we train to maximize $p(y|x)$, the probability mass of $p(y'|x)$ would also increase.

Implementation of $B^{-1}$

To compute the marginalization efficiently, care needs to be taken when implementing $B^{-1}$. A common approach is to implement $B^{-1}$ as a graph (or more precisely a finite-state acceptor), and run dynamic programming on it.

Since $B$ removes duplicate symbols and blanks, a common approach to implement $B^{-1}$ is to insert zero or more blanks between each symbol and to repeat the symbols one or more times. For example, $B^{-1}(\text{c a t}) = \varnothing^* \text{c}^+ \varnothing^* \text{a}^+ \varnothing^* \text{t}^+ \varnothing^*$, where I have used regular expression to indicate how many times a symbol can be repeated. However, this implementation causes a problem when there are repeating symbols in the label sequence. For example, if $B^{-1}(\text{m e e t}) = \varnothing^* \text{m}^+ \varnothing^* \text{e}^+ \varnothing^* \text{e}^+ \varnothing^* \text{t}^+ \varnothing^*$, the sequence $\text{m m e e e t t}$ belongs to $B^{-1}(\text{m e e t})$, but the sequence also belongs to $B^{-1}(\text{m e t})$. In other words, $B^{-1}(\text{m e t}) \cap B^{-1}(\text{m e e t}) \neq \emptyset$. This particular implementation not only fails to be faithful to the defition of $B$ (because $B(\text{m m e e e t t}) = \text{m e t}$), but also violates the sufficient conditions of properness.2

A more careful implementation of $B^{-1}$ would be to insert one or more blanks between repeating symbols, to insert zero or more blanks otherwise, and to repeat the symbols one or more times. With this implementation, $B^{-1}(\text{m e e t}) = \varnothing^* \text{m}^+ \varnothing^* \text{e}^+ \varnothing^{\textcolor{red}{\boldsymbol{+}}} \text{e}^+ \varnothing^* \text{t}^+ \varnothing^*$. Note that there has to be at least one blank between the two e's.3

Other variants

Since properness depends on the choice of $B$, there are a few other variants that guarantee properness. For example, avoid repeating symbols while allowing blanks (Graves, 2012); only allow repeating symbols and avoid using blanks altogether (Collobert et al., 2016); create a separate symbol for each pair of symbols (Zweig et al., 2017). These variants are considered generalizations of CTC and some are not called CTC anymore.

Acknowledgement

I would like to thank Shubham Toshniwal for the wonderful discussion on the different CTC implementations and their consequences.
1 This note is the result of preparing a guest lecture on end-to-end speech recognition for MIT 6.345. The properness was supposed to be a sanity check but turned out to be false.
2 Note that blanks need to be removed after removing the duplicate symbols. Otherwise, $B(\text{m m e } \varnothing \text{ e t t}) = \text{m e t}$, and all the sequences that we think should go into $B^{-1}(\text{m e e t})$ belong to $B^{-1}(\text{m e t})$.
3 In fact, the subtlety was taken care of in equation (6) of (Graves et al., 2006).