Acoustic modelling draws on as many as forty years of ingenious experiment and elaborate theory. Yet, to date, state-of-the-art automatic speech recognition (ASR) systems are still built on Hidden Markov Models (HMMs), a relatively simple model that has been in use for nearly three decades.
A major drawback of the HMM is the so-called "conditional independence assumption": given the discrete state of a frame, the acoustic observation of that frame is modelled as independent of all other frames. While this assumption yields efficient algorithms for HMMs, it is too restrictive to model the dynamic aspects of human speech. For instance, it is known that the same phoneme can be pronounced differently depending on its surrounding phoneme context, which ensures a smooth transition between syllables. This phenomenon, called coarticulation in phonology, leads to strong correlations between adjacent speech segments and is difficult to model with an HMM.
Partly motivated by these insights, this research will investigate alternative acoustic models that better capture the temporal correlations in speech. More specifically, the following research issues will be addressed in this work:
The highly dynamic nature of speech is due to the joint effect of the vocal tract movement, the motion of articulators, and the underlying neurological faculties which are responsible for the production of speech. A better understanding of the sources of acoustic dynamics will help derive proper acoustic models for ASR.
Current HMM-based ASR systems try to capture short-term speech dynamics by appending feature derivatives to the acoustic vectors. Although this method works well in practice, it contradicts the independence assumption made by HMMs, because the derivatives of neighbouring frames are computed from overlapping static features. Alternative methods that handle the temporal constraints properly can be expected to perform better.
Many promising models, such as segment models and the trajectory model, are known to work well on small datasets, but a clear superiority in performance over ordinary HMMs on large speech recognition tasks remains to be shown.
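The dynamic-feature technique mentioned in the second issue above can be illustrated with a minimal sketch. It assumes the common regression formula for first- and second-order coefficients; the function name `append_deltas`, the half-window of 2 frames, and the edge padding at utterance boundaries are illustrative choices, not details taken from any particular ASR system.

```python
import numpy as np

def append_deltas(static, window=2):
    """Append first- and second-order regression coefficients to static features.

    static: array of shape (T, D), one row per frame.
    window: half-width of the regression window (an assumed value).
    Returns an array of shape (T, 3 * D): [static, delta, delta-delta].
    """
    static = np.asarray(static, dtype=float)

    def regress(x):
        T = x.shape[0]
        # Repeat the first and last frames so that every frame has neighbours.
        padded = np.pad(x, ((window, window), (0, 0)), mode="edge")
        denom = 2.0 * sum(k * k for k in range(1, window + 1))
        out = np.zeros_like(x, dtype=float)
        for k in range(1, window + 1):
            ahead = padded[window + k : window + k + T]    # x[t + k]
            behind = padded[window - k : window - k + T]   # x[t - k]
            out += k * (ahead - behind)
        return out / denom

    delta = regress(static)    # first-order coefficients
    delta2 = regress(delta)    # second-order coefficients
    return np.hstack([static, delta, delta2])

# Usage: a linear ramp has unit slope and zero curvature away from the edges.
feats = append_deltas(np.arange(10.0).reshape(-1, 1))
```

Each output row is a deterministic function of several neighbouring static frames, so consecutive rows share information. Treating them as independent observations, as a standard HMM does, is the inconsistency described above.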
In this research we propose to model the dynamic patterns of speech using a trajectory model (Keiichi Tokuda, 2004), which is a properly normalised version of the HMM that models feature derivatives explicitly without imposing any conditional independence assumption. The use of trajectory models in speech recognition is still at an early stage, although the model has been successfully applied to speech synthesis (Keiichi Tokuda et al., 2000).
The conditional independence assumption imposed by Hidden Markov Models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending the first- and second-order regression coefficients (dynamic features) to the acoustic feature vectors. Although this workaround improves speech recognition performance, we argue that a straightforward use of dynamic features in HMMs results in an inferior model, and that this can be fixed by using a trajectory model that correctly handles the dynamic constraints. It can be shown that an HMM can be transformed into a trajectory model by performing a per-utterance normalisation. In contrast to the band-diagonal temporal covariance matrix of an HMM, the new model has a full covariance matrix capable of modelling short-range temporal dynamics of speech.
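The per-utterance normalisation can be sketched numerically under some simplifying assumptions. The sketch follows the usual trajectory-HMM formulation as we understand it: the observation sequence is written as o = W c, where c is the static feature sequence and W is a window matrix, and the HMM Gaussians along a fixed state sequence are renormalised into a single Gaussian over c. One-dimensional features, a static-plus-delta window only, and the names `window_matrix` and `trajectory_gaussian` are all assumptions made for brevity.

```python
import numpy as np

def window_matrix(T):
    """Build W of shape (2T, T) so that o = W @ c interleaves [static, delta]."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                        # static row: c[t]
        # Delta row: 0.5 * (c[t+1] - c[t-1]), clamped at the utterance edges.
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
    return W

def trajectory_gaussian(means, variances):
    """Renormalise frame-wise HMM Gaussians into one Gaussian over c.

    means, variances: arrays of shape (T, 2) holding the [static, delta]
    mean and variance of the state aligned to each frame.
    Returns (mean trajectory, temporal covariance, temporal precision).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    T = means.shape[0]
    W = window_matrix(T)
    # Frame-independent precision of the HMM over the stacked observations.
    precision_o = np.diag(1.0 / variances.reshape(-1))
    R = W.T @ precision_o @ W                    # banded precision over c
    r = W.T @ precision_o @ means.reshape(-1)
    cov = np.linalg.inv(R)                       # dense temporal covariance
    mean = cov @ r                               # smooth mean trajectory
    return mean, cov, R

# Usage: two states with static means 0 and 1, three frames each.
mu = np.array([[0.0, 0.0]] * 3 + [[1.0, 0.0]] * 3)
mean, cov, R = trajectory_gaussian(mu, np.ones((6, 2)))
```

In this sketch the precision matrix `R` is banded, while its inverse `cov` is dense, so every frame of the static trajectory is correlated with every other frame. The mean trajectory moves gradually between the two state means instead of jumping at the state boundary.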
We hope this research will deepen our understanding of the statistical speech processing enterprise, which inevitably brings with it some insight into the nature of the human speech production process.
You are invited to have a look at my past research done at the Natural Language Processing Lab of Northeastern University. Here is the (slightly outdated) project description entry on CSTR's web page.