Symposium on Machine Learning in Speech and Language Processing (MLSLP)
September 14, 2012
Portland, Oregon, USA
Speaker: Dirk van Compernolle (KULeuven, Belgium)
Title: Large Vocabulary Template Based Speech Recognition
Abstract:
Acoustic modeling, the core component of any
speech recognition system, has been dominated by the HMM/GMM modeling
paradigm for the past decades. This approach is powerful thanks
to machine learning algorithms that allow the parameters of the
system to be learned and optimized from thousands of hours of speech
data. Its weakness is that it rests on a crude approximation of
reality, in which short-term speech frames are modeled as independent
observations generated by a first-order Markov process. In recent
years a variety of exemplar-based approaches have been developed as an
alternative to the HMM paradigm. The main promise of any
exemplar-based approach is that it can form, on demand, a local
reference model that is more relevant to the test sample than a global
HMM can ever be; such a model can tell us more about the test sample
and hence should yield better recognition.
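As a toy illustration of the exemplar idea only (not the system described in this talk), a feature frame can be labeled by looking up its nearest labeled neighbors instead of scoring it against a global parametric model. All names and data below are invented for the sketch:

```python
import numpy as np

def knn_label(frame, exemplars, labels, k=3):
    """Label a feature frame by majority vote among its k nearest exemplars."""
    dists = np.linalg.norm(exemplars - frame, axis=1)  # Euclidean distance to every exemplar
    nearest = np.argsort(dists)[:k]                    # indices of the k closest exemplars
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)            # majority vote

# Toy 2-D "feature frames": two clusters standing in for two phone classes.
exemplars = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                      [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
labels = ["aa", "aa", "aa", "iy", "iy", "iy"]

print(knn_label(np.array([0.05, 0.1]), exemplars, labels))  # → aa
```

The contrast with the HMM/GMM view is that the "model" here is assembled from the stored data at test time, which is what allows it to adapt to the particular test sample.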
The system we have developed at KULeuven-ESAT is in essence very
similar to the small-vocabulary Dynamic Time Warping (DTW) systems
used in the early days of isolated word recognition. We will focus on
the novel components that were essential to make DTW work for a large
vocabulary continuous speech recognition task. Unit (class)
definition and the local distance metric (especially within class,
thus very local) needed critical review. Data sharpening - a
technique commonly used in non-parametric classifiers to deal with
outliers and mislabeled data - proved to be an essential complement to
any local distance metric. We will review both single-best and k-NN
decoding strategies, each of which offers different options for
handling long-span continuity constraints in the decoded match. We
will then dig deeper into how we can query the most similar exemplars
for properties beyond classical spectral envelope features. These
meta-features may have been derived from the signal observed over a
much wider time window, from any available annotation, or both: e.g.
gender, speaking style, phonetic context, speaking rate, background
... Given the variable nature of such meta-features, merging them
into a global score is accomplished using SCRFs (Segmental Conditional
Random Fields).
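At the scoring level, an SCRF combines heterogeneous segment-level feature functions log-linearly and normalizes over competing segment hypotheses. The following is a deliberately simplified sketch of that scoring form, with invented feature names and weights, and not the actual system:

```python
import math

def combine_scores(feature_scores, weights):
    """Log-linear combination, the scoring form at the heart of an SCRF:
    score(segment) = sum_i w_i * f_i(segment)."""
    return sum(w * f for f, w in zip(feature_scores, weights))

def segment_posterior(candidate_scores, weights):
    """Normalize the log-linear scores over competing candidates (softmax)."""
    raw = [combine_scores(s, weights) for s in candidate_scores]
    z = sum(math.exp(r) for r in raw)
    return [math.exp(r) / z for r in raw]

# Two competing segment hypotheses, each described by hypothetical
# meta-feature scores: (dtw_match, rate_match, gender_match).
candidates = [(2.0, 1.0, 1.0), (1.5, 0.0, 1.0)]
weights = (1.0, 0.5, 0.2)
posteriors = segment_posterior(candidates, weights)
print(posteriors)
```

The appeal for meta-features of variable nature is that each feature function can be computed from a different information source, while the weights are trained jointly.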
Doing time warping at the template level makes our system
computationally intensive. We will highlight some of the issues and
the solutions we adopted. Finally, we will compare our approach with
some of the other exemplar-based systems.
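The template-level time warping mentioned above is, at its core, classic DTW. A minimal sketch of that dynamic program follows; it illustrates why matching against many templates is costly (each comparison fills an n-by-m table), and is of course not the large-vocabulary decoder itself:

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping distance between two frame sequences.

    D[i][j] = local cost + min over the three allowed predecessor cells,
    so the alignment may stretch or compress either sequence in time.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch sequence a
                                 D[i][j - 1],      # stretch sequence b
                                 D[i - 1][j - 1])  # step both (match)
    return D[n][m]

# Same shape spoken at a different tempo: DTW distance stays small.
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4]))  # → 0.0
```

The quadratic cost per template pair is what motivates the pruning and efficiency measures highlighted in the talk.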