Symposium on Machine Learning in Speech and Language Processing (MLSLP)

September 14, 2012
Portland, Oregon, USA
Speaker: Dirk van Compernolle (KULeuven, Belgium)

Title: Large Vocabulary Template Based Speech Recognition

Abstract:
Acoustic modeling, the core component of any speech recognition system, has been dominated by the HMM/GMM paradigm for the past decades. The strength of this approach lies in its machine learning algorithms, which allow the parameters of the system to be learned and optimized from thousands of hours of speech data. Its weakness is that it requires a crude approximation of reality, in which short-term speech frames are modeled as independent observations generated by a first-order Markov process. In recent years a variety of exemplar-based approaches have been developed as an alternative to the HMM paradigm. The main promise of any exemplar-based approach is that it can form a new local reference model on demand. Such a model is more relevant to the test sample than a global HMM can ever be and can tell us more about the test sample, and hence should result in better recognition.

The system we have developed at KULeuven-ESAT is at its basis very similar to the small vocabulary Dynamic Time Warping (DTW) systems used in the early days of isolated word recognition. We will focus on the novel components that were essential to make DTW work for a large vocabulary continuous speech recognition task. Unit (class) definition and the local distance metric (especially within class, thus very local) needed a critical review. Data sharpening, a technique commonly used in non-parametric classifiers to deal with outliers and mislabeled data, proved to be an essential complement to any local distance metric. We will review both single best and k-NN decoding strategies, each of which offers different options for dealing with long span continuity constraints in the decoded match. We will then dig deeper into how we can query the most similar exemplars for properties beyond classical spectral envelope features. These meta-features may have been derived from the signal observed over a much wider time window, from any available annotation, or from both: e.g. gender, speaking style, phonetic context, speaking rate, background ... Given the variable nature of such meta-features, merging them into a global score is accomplished with SCRFs (Segmental Conditional Random Fields).
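
As an illustration of the classical isolated word DTW matching mentioned above, here is a minimal Python sketch. It is not the KULeuven-ESAT system. It assumes that each template and each test utterance is a list of feature vectors of equal dimension, and it uses plain Euclidean distance as the local distance together with the simplest symmetric step pattern. The names dtw_distance and nearest_template are hypothetical, and none of the large vocabulary components discussed in the talk (data sharpening, k-NN decoding, meta-features) are included.

```python
import math


def dtw_distance(template, test):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumption: Euclidean local distance and the basic step pattern
    (horizontal, vertical, diagonal), chosen for illustration only.
    """
    n, m = len(template), len(test)
    # cost[i][j] = best accumulated cost aligning template[:i] with test[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(template[i - 1], test[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # template frame advances, test frame repeats
                cost[i][j - 1],      # test frame advances, template frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def nearest_template(templates, test):
    """Return the label of the template with the lowest DTW cost.

    `templates` is a list of (label, frames) pairs (hypothetical layout).
    """
    best_label, _ = min(templates, key=lambda t: dtw_distance(t[1], test))
    return best_label
```

The double loop visits every pair of template and test frames, so the cost of one comparison grows with the product of the two sequence lengths, and it is repeated for every stored template.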

Doing time warping at the template level makes our system computationally intensive. We will highlight some of the resulting issues and the solutions we adopted. Finally, we will compare our approach with some of the other exemplar-based systems.