Gary Cook, James Christie, Dan Ellis, Eric Fosler-Lussier, Yoshi Gotoh, Brian Kingsbury, Nelson Morgan, Steve Renals, Tony Robinson and Gethin Williams
This paper describes a broadcast news transcription system developed jointly by several research groups as part of a European Union sponsored project known as SPRACH (SPeech Recognition Algorithms for Connectionist Hybrids). It presents an overview of the complete system; further details of MLP acoustic modelling and automatic pronunciation modelling are given in the companion papers [1,2].
The rest of this paper is laid out as follows. First we describe the Hub-4E evaluation system, including the acoustic features, the acoustic and language models, and the recognition procedure. Next we describe the modifications to this system necessary to ensure that it runs in less than 10 times real-time. Section 3 introduces the CHRONOS decoder, outlining the search strategy employed. We then describe some recent features added to CHRONOS to allow the SPRACH system to use a single-pass recognition procedure. Finally we describe a system designed to operate in real-time.
This section describes the SPRACH broadcast news evaluation system. This uses multiple acoustic models to produce three acoustic probability streams. Search is performed on each of these streams to provide three hypotheses, and these are merged to form the final system output. The following sections describe the system in detail.
Acoustic segmentation for the hub system is performed using the method developed by the Cambridge University HTK group [3]. The 10x real-time spoke system used the CMU segmentation tools [4] to perform acoustic segmentation. Two sets of acoustic features are used: PLP, 12th-order cepstral coefficients derived using perceptual linear prediction, plus log energy; and MSG, modulation-filtered spectrogram features [5] derived from data that are first down-sampled to 8 kHz. The modulation-filtered spectrogram is a robust speech representation for automatic speech recognition. The robustness of the representation is based on two signal-processing strategies modelled after human speech perception. The first strategy is the emphasis of changes in the spectral structure of the speech signal (measured with critical-band-like resolution) occurring at rates of 16 Hz or less. The second is adaptation to slowly-varying components of the speech signal, which functions as a form of automatic gain control (AGC). To increase the robustness of the system to environmental conditions, the statistics of each feature channel were normalised to zero mean and unit variance over each segment.
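The per-segment normalisation described above can be illustrated with a minimal sketch. It assumes that a segment is held as a list of per-frame feature vectors; the function name and the handling of constant channels are illustrative assumptions, not details of the SPRACH implementation.

```python
import math

def normalise_segment(frames):
    """Normalise each feature channel of one segment to zero mean, unit variance.

    frames: list of feature vectors (lists of floats), one per frame.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        # Assumption: a constant channel is left unscaled to avoid dividing by zero.
        stds.append(math.sqrt(var) if var > 0.0 else 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]
```

Because the statistics are computed over each segment separately, a change of channel or recording conditions between segments does not carry over into the normalised features.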
The SPRACH system uses both recurrent neural network (RNN) and multi-layer perceptron (MLP) models to estimate a posteriori context-independent (CI) phone class probabilities. Forward-in-time and backward-in-time RNN models were trained using the 104 hours of broadcast news training data released in 1997. These models use PLP acoustic features. The outputs of the forward and backward models are merged in the log domain to form the final CI RNN probability estimates. The MLP has 8000 hidden units and was trained on all 200 hours of broadcast news training data and uses MSG features [1].
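One plausible reading of merging in the log domain is an average of the log posteriors of the two models (a geometric mean), followed by renormalisation. The sketch below makes that assumption explicit; the equal weighting, the renormalisation step, the probability floor and the function name are all hypothetical and are not taken from the SPRACH system.

```python
import math

def merge_log_domain(p_forward, p_backward):
    """Average two posterior distributions in the log domain, then renormalise."""
    eps = 1e-12  # hypothetical floor to avoid log(0)
    logs = [0.5 * (math.log(max(a, eps)) + math.log(max(b, eps)))
            for a, b in zip(p_forward, p_backward)]
    peak = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - peak) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Under this scheme a phone class receives a high merged probability only if both models assign it a reasonably high probability.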
Context-dependent (CD) RNN acoustic models were trained by factorisation of conditional context-class probabilities [6]. The joint a posteriori probability of context class j and phone class i is given by y_ij(t) = y_i(t) y_j|i(t). The CI RNN estimates y_i(t), and single-layer perceptrons are used to estimate the conditional context-class posterior, y_j|i(t). The input to each module is the internal state of the CI RNN, since it is assumed that the state vector contains all the relevant contextual information necessary to discriminate between different context classes of the same monophone. Phonetic decision trees were used to choose the CD phone classes, and the SPRACH system uses 676 word-internal CD phones.
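The factorisation can be written out directly for a single frame. In the sketch below the list layout and the function name are illustrative assumptions; the only operation taken from the text is the product of the CI phone posterior and the conditional context-class posterior.

```python
def joint_context_posteriors(y_phone, y_context_given_phone):
    """Combine CI phone posteriors with conditional context-class posteriors.

    y_phone[i] is the CI posterior for phone class i at one frame.
    y_context_given_phone[i][j] is the posterior of context class j given phone i.
    Returns joint[i][j], the product of the two.
    """
    return [[y_i * y_ji for y_ji in y_context_given_phone[i]]
            for i, y_i in enumerate(y_phone)]
```

If each conditional distribution sums to one, the joint posteriors sum to one over all (i, j) pairs, so the CD outputs form a valid distribution without any further normalisation.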
Around 450 million words of text data were used to generate back-off n-gram language models. Specifically, these models were estimated from:
Table 1 shows the language model sizes (in terms of number of n-grams) and perplexity on the 1997 evaluation data. For use in the 10x real-time spoke we pruned the trigram model using the Seymore-Rosenfeld method [7]. Despite the large increase in perplexity of the pruned model, only a very small increase in word error rate (0.1%) was observed.
The recognition lexicon contains 65,432 words, including every word that appears in the broadcast news training data. The dictionary was constructed using phone decision tree smoothed acoustic alignments. Full details of the automatic pronunciation modelling used are given in [2].
As described in Section 2.2, the SPRACH system uses a set of three different acoustic models. In order to use each of these models, a method for combining their estimates is necessary. Frame-level acoustic combination is effective for a set of estimators with the same output classes. However, it is more problematic to combine hypotheses with different output classes, such as context-independent and context-dependent acoustic models.
One method for combining models with different output classes is to
combine at the hypothesis level as opposed to the acoustic probability
level. We have employed the NIST recogniser output voting error
reduction (ROVER) system for hypothesis combination [8].
ROVER may be operated either as a purely voting system, or in a mode
in which confidence scores are taken into account. We have used the
local phone posterior probability-based confidence measure as the
confidence score for ROVER. This confidence measure, the
duration-normalised log phone posterior probability, is based purely
on the connectionist acoustic model [9]:
[Equation (1)]
[Equation (2)]
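As an illustration of a duration-normalised log phone posterior, the sketch below averages the log posterior of the hypothesised phone over the frames it spans. This form is inferred from the name of the measure alone; the function name, the argument layout and the inclusive frame indexing are assumptions, and the definition in [9] should be treated as authoritative.

```python
import math

def npp_phone(posteriors, start, end):
    """Duration-normalised log posterior of one hypothesised phone.

    posteriors[n] is the acoustic model's posterior for the hypothesised
    phone at frame n; the phone spans frames start..end inclusive.
    """
    duration = end - start + 1
    total = sum(math.log(posteriors[n]) for n in range(start, end + 1))
    return total / duration
```

Dividing by the duration keeps the score comparable between short and long phones, which matters when the scores are used to weight competing hypotheses.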
A schematic of the system can be seen in Figure 1. Since the SPRACH system employs multiple acoustic models, a number of recognition passes are required. The recognition process can be summarised as follows:
The SPRACH 10x real-time system is very similar to the hub system described above. The differences are:
As can be seen from Table 3, the system runs in considerably less than 10 times real-time. The word error rates are shown in Table 4.
for j = 1 to N
    for t = 1 to T
        [equation not recovered]
with the constraint that the HMMs are left-to-right. The search traverses the tree-structured pronunciation lexicon in this manner, sharing computations between words with common pronunciation prefixes. This approach also allows large branches of the tree to be pruned out of the search in one decision, and is very memory efficient.
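The loop ordering above, with states in the outer loop and time in the inner loop, can be sketched for a single left-to-right HMM as follows. This is a minimal log-domain Viterbi recursion under stated assumptions: each state has only a self-loop and a transition to the next state, the path starts in the first state and ends in the last, and all names are illustrative. It is not the CHRONOS implementation.

```python
import math

NEG_INF = float("-inf")

def time_first_viterbi(log_obs, log_self, log_next):
    """Best log path probability through a left-to-right HMM.

    log_obs[j][t]: log observation score of state j at frame t.
    log_self[j]:   log probability of staying in state j.
    log_next[j]:   log probability of moving from state j to state j + 1.
    """
    n_states = len(log_obs)
    n_frames = len(log_obs[0])
    prev = None  # scores of state j - 1 over all frames
    for j in range(n_states):        # outer loop over states
        cur = [NEG_INF] * n_frames
        for t in range(n_frames):    # inner loop over time
            if t == 0:
                best = 0.0 if j == 0 else NEG_INF  # paths start in state 0
            else:
                stay = cur[t - 1] + log_self[j]
                enter = prev[t - 1] + log_next[j - 1] if j > 0 else NEG_INF
                best = max(stay, enter)
            cur[t] = best + log_obs[j][t]
        prev = cur                   # only the previous state's column is kept
    return prev[n_frames - 1]
```

Because state j depends only on itself and on state j - 1, two score vectors of length T are sufficient, which is one way to see why this ordering is economical in memory.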
Continuous recognition is achieved by growing a tree of word hypotheses, where each node corresponds to an element on a single stack that is ordered by time. An adaptive beam-width is used to limit the stack size. Processing involves popping the hypothesis at the top of the stack, extending it by each word in the lexicon, and pushing all resulting hypotheses back onto the stack. The finite-state property of n-gram language models can be exploited by keeping only the most probable hypothesis for each unique language model history. Within the time-first framework it is beneficial to group hypotheses with common histories into a single stack item, thus grouping future extensions into a single computation.
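The merging of hypotheses that share a language model history can be sketched as follows. The sketch assumes that all hypotheses passed in end at the same frame, that a hypothesis is a (word list, log probability) pair, and that the history of an n-gram model is its last n - 1 words; the data layout and the function name are illustrative only.

```python
def merge_by_history(hypotheses, order):
    """Keep the most probable hypothesis for each unique n-gram history.

    hypotheses: list of (words, log_prob) pairs ending at the same frame.
    order: the n-gram order, so the history is the last order - 1 words.
    """
    best = {}
    for words, log_prob in hypotheses:
        history = tuple(words[-(order - 1):]) if order > 1 else ()
        if history not in best or log_prob > best[history][1]:
            best[history] = (words, log_prob)
    return list(best.values())
```

Two hypotheses with the same history receive identical language model scores for every future extension, so the less probable one can never overtake the other and can be discarded safely.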
A record of the best path probability to every frame is maintained and the search is pruned if the current hypothesis is less likely than a fixed fraction of the highest path probability. An online garbage model [11] is used to control the beam and so to limit the growth of novel path extensions [12]. Processing is complete when there are no items remaining on the stack.
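The pruning test against the best path probability at each frame can be sketched in the log domain, where a fixed fraction becomes a fixed negative offset. The dictionary of per-frame best scores, the function name and the parameter names are assumptions made for illustration, and the garbage-model beam control of [11,12] is not modelled here.

```python
def keep_hypothesis(log_prob, frame, best_log_prob, log_beam):
    """Return True if a hypothesis ending at `frame` survives pruning.

    best_log_prob: dict mapping frame -> best log path probability seen so
                   far (updated in place when a new best is found).
    log_beam:      log of the fixed fraction, a negative number.
    """
    best = best_log_prob.get(frame, float("-inf"))
    if log_prob > best:
        best_log_prob[frame] = log_prob
        return True
    return log_prob >= best + log_beam
```

For example, with `log_beam = -3.0` and a best score of -10.0 at some frame, a hypothesis scoring -12.0 at that frame is kept and one scoring -14.0 is pruned.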
We have recently added functionality to CHRONOS to allow its use for broadcast news evaluations. To this end we have implemented support for arbitrary n-gram language models, state-based decoding, and word-level confidence score output. With these facilities it is possible to replace the NOWAY and lattice decoder stages of the hub system with a single pass using CHRONOS. In addition, the time-first search strategy results in a significant reduction in search times, and so we have been able to produce a system for the 10x real-time spoke which uses a 4-gram language model and has only a very small search error.
A comparison between the CHRONOS hub and 10x real-time systems is shown in Table 5 (results with NOWAY are shown in brackets). From these results it can be seen that almost all of the increase in error rate seen in the 10x real-time system is due to the use of the CMU segmentation tools as opposed to HTK segmentation. Run times for the 10x real-time system are 8.2 times real-time for h4e_98_1 and 7.8 times real-time for h4e_98_2.
We have developed a real-time system based on the acoustic and language models used for the Hub-4E evaluation. The real-time system has the following features:
This paper has described the SPRACH broadcast news transcription system, and presented results from the 1998 DARPA Hub-4E evaluation. The system makes use of multiple acoustic probability streams, and we chose these to be as diverse as possible by using both PLP and MSG features. The use of diverse feature representations was found to reduce error rates for non-studio speech. We have used computationally efficient confidence scores based on the a posteriori phone class probabilities produced by connectionist acoustic models. These confidence scores were used in conjunction with ROVER for hypothesis combination.
In addition to describing the evaluation system, we have outlined the time-first search procedures employed by the CHRONOS decoder. We have shown that, using this search method, it is possible to run a system in less than 10 times real-time with a negligible increase in error rate. By reducing the number of acoustic models we have also shown that it is possible to run a system in real-time.
The translation was initiated by Tony Robinson on 1999-04-06