Research Interest

Currently I'm looking at the use of trajectory models in speech recognition. The title of my research proposal is "Modelling Speech Dynamics with a Trajectory Model".

Motivation of Research

Acoustic modelling is embedded in a context that may look back over as many as forty years of cunning experiment and elaborate theory. Yet, to date, state-of-the-art automatic speech recognition (ASR) systems are still built on Hidden Markov Models (HMMs), a relatively simple model that has been in use for nearly three decades.

A major drawback of the HMM is the so-called "conditional independence assumption", i.e. the acoustic observation of each frame is modelled independently of all other frames given the discrete state of that frame. While this assumption yields efficient algorithms for HMMs, it is too restrictive to model the dynamic aspects of human speech. For instance, it is known that in human speech the same phoneme can be pronounced differently depending on its surrounding phonemes, to ensure a smooth transition between syllables. This phenomenon, called coarticulation in phonology, leads to strong correlations between adjacent speech segments and is difficult to model with an HMM.
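
Written out, the assumption says that the likelihood of an utterance factorises over frames once the state sequence is fixed. The symbols below are my own notation for illustration: o_t is the acoustic observation at frame t, q_t is the discrete state at frame t, and T is the number of frames.

```latex
p(o_1, \dots, o_T \mid q_1, \dots, q_T) \;=\; \prod_{t=1}^{T} p(o_t \mid q_t)
```

Any dependence between o_t and its neighbours that is not explained by the states is therefore ignored by the model.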

Partly motivated by these insights, this research will investigate alternative acoustic models that can better model the temporal correlations in speech. More specifically, the following research issues will be addressed in this work:

  • Developing a better understanding of acoustic variability
    The highly dynamic nature of speech is due to the joint effect of the
    vocal tract movement, the motion of articulators, and the underlying
    neurological faculties which are responsible for the production of speech.
    A better understanding of the sources of acoustic dynamics will help
    derive proper acoustic models for ASR.

  • Exploring alternative acoustic models that can handle temporal constraints in a principled way

    Current HMM-based ASR systems try to capture short-term speech
    dynamics by appending feature derivatives to the acoustic
    vectors. This method, though it works well in practice, contradicts the
    independence assumption made by HMMs. Alternative methods that properly
    handle the temporal constraints should yield better performance.

  • Evaluating the use of new acoustic models on large speech recognition tasks
    Many promising models, such as segment models and the trajectory model,
    are known to work well on small datasets, but a clear superiority in
    performance over ordinary HMMs on large speech recognition
    tasks still remains to be shown.
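
The dynamic-feature workaround mentioned in the second item above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `delta` and `append_dynamic_features`, the regression window of two frames on each side, and the edge padding at utterance boundaries are assumptions made for the example, not a description of any particular ASR front end.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based time derivative over a window of +/- `width` frames.

    `features` is a (T, D) array with one row per frame. The window width
    and the edge padding are illustrative assumptions.
    """
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    # Repeat the first and last frames so every frame has full context.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(tau * tau for tau in range(1, width + 1))
    out = np.zeros_like(features)
    for tau in range(1, width + 1):
        ahead = padded[width + tau: width + tau + T]
        behind = padded[width - tau: width - tau + T]
        out += tau * (ahead - behind)
    return out / denom


def append_dynamic_features(static, width=2):
    """Stack static, first-order and second-order coefficients per frame."""
    static = np.asarray(static, dtype=float)
    d1 = delta(static, width)
    d2 = delta(d1, width)
    return np.hstack([static, d1, d2])
```

Each row of the result is a deterministic function of several neighbouring static frames, which is exactly why treating the rows as independent given the state is inconsistent.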

The Proposal

In this research we propose to model the dynamic patterns of speech using a trajectory model (Keiichi Tokuda, 2004), which is a properly normalised version of the HMM that models feature derivatives explicitly without imposing any conditional independence assumption. The use of trajectory models in speech recognition is still in its early stages, although the model has been successfully applied to speech synthesis (Keiichi Tokuda et al., 2000).

The conditional independence assumption imposed by Hidden Markov Models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending the first- and second-order regression coefficients (dynamic features) to the acoustic feature vectors. Although this workaround leads to improved performance in speech recognition, we argue that a straightforward use of dynamic features in HMMs results in an inferior model, which can be fixed by using a trajectory model that correctly handles the dynamic constraints. It can be shown that an HMM can be transformed into a trajectory model by performing a per-utterance normalisation. In contrast to the band-diagonal temporal covariance matrix of an HMM, the new model has a full covariance matrix capable of modelling short-range temporal dynamics of speech.
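
To make the transformation more concrete, here is a minimal NumPy sketch of how the trajectory-model parameters could be computed for a single utterance. It rests on several simplifying assumptions that are mine, not taken from the references: the static feature is one-dimensional, the state sequence is already fixed, only a first-order dynamic feature is used, the delta window is the simple difference c_t - c_{t-1}, and the first frame carries no delta constraint. The names `window_matrix` and `trajectory_parameters` are hypothetical. Writing the augmented observation as o = W c, the normalised model over the static sequence c is Gaussian with precision R = W' S^-1 W, covariance P = R^-1 and mean P W' S^-1 m, where m and S hold the per-frame HMM means and (diagonal) variances.

```python
import numpy as np


def window_matrix(T):
    """Build the 2T x T matrix W that maps static features c to [static, delta] pairs."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                  # static row: c_t
        if t > 0:                          # delta row: c_t - c_{t-1}
            W[2 * t + 1, t] = 1.0
            W[2 * t + 1, t - 1] = -1.0
        # Assumption: the delta row of the first frame is left at zero.
    return W


def trajectory_parameters(means, variances):
    """Mean trajectory and temporal covariance for a fixed state sequence.

    `means` and `variances` are (T, 2) arrays holding, for each frame, the
    [static, delta] Gaussian parameters of the state assigned to that frame.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    T = means.shape[0]
    W = window_matrix(T)
    mu = means.reshape(-1)                 # interleaved [s_0, d_0, s_1, d_1, ...]
    prec = 1.0 / variances.reshape(-1)     # diagonal of the inverse covariance
    R = W.T @ (prec[:, None] * W)          # band-diagonal precision over c
    r = W.T @ (prec * mu)
    P = np.linalg.inv(R)                   # full temporal covariance over c
    c_bar = P @ r                          # mean static trajectory
    return c_bar, P, R
```

With this window, R is tridiagonal while P has non-zero entries between every pair of frames, so the static features of distant frames are correlated even though each frame still uses only its own state's Gaussian parameters. Passing step-like static means (for example three frames at 0 followed by three at 1) with zero delta means gives a smoothed mean trajectory rather than a piecewise-constant one.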

We hope this research will deepen our understanding of the statistical speech processing enterprise, which inevitably brings with it some insight into the nature of the human speech production process.

Main Reference

  • Reformulating the HMM as a trajectory model, K. Tokuda, H. Zen, T. Kitamura, Proc. of Beyond HMM -- Workshop on statistical modeling approach for speech recognition, Kyoto, Dec. 2004.
  • A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features, H. Zen, K. Tokuda, T. Kitamura, Proc. of ICASSP 2004, pp.837-840, Montreal, May 2004.
  • Towards Better Understanding of the Model Implied by the use of Dynamic Features in HMMs, J. Bridle, Proc. of ICSLP 2004.
  • How to pretend that correlated variables are independent by using difference observations, C. K. I. Williams, Neural Computation 17(1), pp.1-6, 2005.

Other Stuff

You are invited to have a look at my past research done at the Natural Language Processing Lab of Northeastern University. Here is the (slightly outdated) project description entry on CSTR's web page.
