Steve Renals, Yoshihiko Gotoh, Robert Gaizauskas, Mark Stevenson
University of Sheffield, Department of Computer Science
Regent Court, 211 Portobello St., Sheffield S1 4DP, UK
This paper describes our participation in the Hub-4E IE-NE spoke. The SPRACH/LaSIE system for named entity (NE) identification in Broadcast News consists of two baseline systems:
SPRACH-S: a statistical system based on the NE tagged language modelling approach , which was originally introduced to enable name category information to be used in the construction of language models for very large vocabulary speech recognisers;
SPRACH-R: a rule-based approach [2, 3], ported from the LaSIE system used for text-based NE identification.
The stochastic finite state model approach is based on explicit word-level n-gram relations. We present an overview of the statistical system and a procedure for NE annotation. Then we describe the key features of the rule-based approach. Comparison is made between the original text- and speech-based systems. The SPRACH-S and SPRACH-R systems were employed in the 1998 Hub-4E IE-NE evaluation. We report our results on five sets of transcripts: a reference transcription, a transcription produced by the 1998 SPRACH recogniser used in the transcription evaluation (21% WER), and the three baseline transcriptions provided by NIST.
The SPRACH-S system consists of an NE tagged language model and a recently developed statistical NE tagger. A formal description of the NE tagged LM is provided in . Technical details for the development and for the annotation procedure are presented in .
The basic idea of the NE tagged language model (LM) is to use NE tags as categories in a class-based n-gram language model. This enables the construction of extensible vocabulary speech recognition systems, along with the identification of named entities in spoken language. An NE tagged LM is derived from a corpus marked with named entities. It is a backed off n-gram model with the vocabulary entries being the most frequent words attributed with their name category information. Unigram extensions for less frequent names are attached in order to increase the overall vocabulary size.
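This token construction can be sketched as follows. The tag-word token format, tag names, and vocabulary below are illustrative assumptions, not the SPRACH implementation:

```python
def to_tagged_tokens(words, tags, vocabulary):
    """Map each (word, NE tag) pair to a single LM token.

    Frequent (in-vocabulary) words keep their own identity, attributed
    with their name category; out-of-vocabulary names back off to a
    per-class unigram extension token.
    """
    tokens = []
    for word, tag in zip(words, tags):
        if word in vocabulary:
            tokens.append(f"{tag}:{word}" if tag else word)
        else:
            # unigram extension: the class token stands in for the rare name
            tokens.append(f"{tag}:<OOV>" if tag else "<OOV>")
    return tokens

sent = ["PRESIDENT", "BILL", "CLINTON", "VISITED", "SHEFFIELD"]
tags = [None, "person", "person", None, "location"]
vocab = {"PRESIDENT", "BILL", "CLINTON", "VISITED"}
print(to_tagged_tokens(sent, tags, vocab))
# ['PRESIDENT', 'person:BILL', 'person:CLINTON', 'VISITED', 'location:<OOV>']
```

An ordinary backed-off n-gram model estimated over such token streams then carries both word and name-category information.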
For the evaluation, three NE tagged trigram LMs were estimated, each with an independent vocabulary set plus unigram extensions:
H4-train LM: derived from transcripts of the Hub-4E acoustic training data (approximately one million words with manual NE annotations) consisting of an 18k trigram vocabulary (i.e., tag-word tokens), with a further 4k vocabulary in unigram extensions;
BN96 LM: estimated from 1996 BN text corpus for training/test data (150 million words with automatic NE annotations), consisting of a 65k trigram vocabulary, with a further 85k vocabulary in unigram extensions;
NA98 LM: estimated from a part of the 1998 North American News (NA News) corpus (1997-98 LA Times/Washington Post, 1996-98 Associated Press; 133 million words with automatic NE annotations), consisting of a 65k trigram vocabulary, with a further 145k vocabulary in unigram extensions.
Manual NE annotations were provided by MITRE and BBN (through NIST) and conformed to the Hub-4E NE task specification. Automatic annotations were produced using the LaSIE-II system . Because LaSIE-II was developed according to the MUC-7 NE task specification, relative time expressions were also tagged in the BN and NA News corpora, conflicting with the Hub-4E specification.
After several trial runs using the development set (described later), an NE annotation procedure was settled as follows:
Because this n-gram based NE tagger did not explicitly handle multiple-word named entities, we made post-corrections according to a simple rule: if multiple consecutive words were all marked with the same name tag, we assumed they belonged to one named entity. For example, if ``BILL'' and ``CLINTON'' were both marked as <person>, they were treated as a single <person> entity, ``BILL CLINTON''.
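This post-correction can be sketched as follows (an illustrative implementation, not the actual SPRACH-S code):

```python
def merge_entities(tagged):
    """Merge consecutive words carrying the same NE tag into one entity.

    tagged: list of (word, tag) pairs, where tag is None for untagged words.
    Returns a list of (entity_string, tag) pairs.
    """
    entities = []
    current_words, current_tag = [], None
    for word, tag in tagged + [(None, None)]:  # sentinel flushes the last run
        if tag is not None and tag == current_tag:
            current_words.append(word)       # extend the current entity
        else:
            if current_tag is not None:      # close off the previous entity
                entities.append((" ".join(current_words), current_tag))
            current_words = [word] if tag is not None else []
            current_tag = tag
    return entities

print(merge_entities([("BILL", "person"), ("CLINTON", "person"),
                      ("VISITED", None), ("NEW", "location"),
                      ("YORK", "location")]))
# [('BILL CLINTON', 'person'), ('NEW YORK', 'location')]
```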
The SPRACH-R system described in this section was specifically developed for the 1998 Hub-4E IE-NE spoke. It uses a restricted and slightly modified version of the NE annotation component of the Sheffield LaSIE-II information extraction system, as entered in MUC-7  and described in detail in .
The rule-based approach relies on: finite state matching against lists of single or multi-word names and NE cue words, part-of-speech tagging, and specialised NE parsing based on phrasal grammars for the NE classes. The key stages of processing are as follows:
Pseudo sentence segmenter. Since part-of-speech and parsing components of the system require text units of reasonable length (ideally less than 40 words; anything over 100 words becomes excessively slow), a trivial text segmenter breaks the text into ``pseudo sentences'' by breaking before certain closed class words (determiners, nominal pronouns, certain prepositions). The aim is not to find true sentence boundaries but to produce sensible length text chunks which are not broken in the middle of named entities.
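A minimal sketch of such a segmenter is given below; the break-word list and the minimum chunk length are assumptions chosen for illustration:

```python
# Break the word stream before certain closed-class words so that chunks
# stay a sensible length without splitting named entities mid-phrase.
BREAK_WORDS = {"THE", "A", "AN", "HE", "SHE", "IT", "THEY", "IN", "ON", "AT"}

def segment(words, min_len=5, max_len=40):
    chunks, current = [], []
    for word in words:
        # break before a closed-class word once the chunk is long enough,
        # or unconditionally once it exceeds the hard length limit
        if current and ((word in BREAK_WORDS and len(current) >= min_len)
                        or len(current) >= max_len):
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks

text = "THE PRESIDENT VISITED SHEFFIELD YESTERDAY AND HE SPOKE ABOUT THE ECONOMY"
print(segment(text.split()))
# [['THE', 'PRESIDENT', 'VISITED', 'SHEFFIELD', 'YESTERDAY', 'AND'],
#  ['HE', 'SPOKE', 'ABOUT', 'THE', 'ECONOMY']]
```

Note that the second ``THE'' does not trigger a break because the chunk is still below the minimum length; the goal is sensible chunk sizes, not true sentence boundaries.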
Gazetteer lookup. Lists of single and multi-word names and name cues are used to tag the input. These lists include male and female first names, person titles (e.g., ``Mr.'', ``Mayor''), well-known locations and organisations, location cue words (e.g., ``Bay'', ``Harbour'') and company designators (e.g., ``Corporation''). Case-insensitive finite state matching is carried out. Multiple tags may be assigned per word or multi-word.
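The lookup step can be sketched as longest-match, case-insensitive matching against a table of single- and multi-word entries, with ambiguous entries keeping all their candidate tags. The gazetteer contents and tag names here are illustrative assumptions:

```python
GAZETTEER = {
    ("new", "york"): ["location"],
    ("jordan",): ["location", "person_first"],  # ambiguous: keep both tags
    ("mr.",): ["person_title"],
    ("corporation",): ["company_designator"],
}
MAX_NAME_LEN = max(len(key) for key in GAZETTEER)

def gazetteer_tag(words):
    """Assign each word a (possibly empty) list of gazetteer tags."""
    tags = [[] for _ in words]
    i = 0
    while i < len(words):
        # prefer the longest gazetteer entry starting at position i
        for n in range(min(MAX_NAME_LEN, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in GAZETTEER:
                for j in range(i, i + n):
                    tags[j].extend(GAZETTEER[key])
                i += n
                break
        else:
            i += 1  # no entry matched here; move on
    return tags

print(gazetteer_tag(["MR.", "JORDAN", "VISITED", "NEW", "YORK"]))
# [['person_title'], ['location', 'person_first'], [], ['location'], ['location']]
```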
Part-of-speech tagging. A version of the Brill transformation based part-of-speech tagger retrained for all upper case text is used to assign one of the Penn Treebank word classes to each word in the input.
NE parsing. A bottom-up partial chart parser applies a set of regular NE grammars (one for each NE class, plus a general NE grammar and a default NE grammar). A typical rule rewrites a sequence of gazetteer and part-of-speech tags into a named entity phrase.
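One pass of this rule-application step can be sketched as below. The rule names and tag inventory are hypothetical, not LaSIE's actual grammar:

```python
RULES = [
    # hypothetical: a proper noun followed by a location cue word, e.g.
    # ``HUDSON BAY'', becomes a <location>
    ("location", ["NNP", "loc_cue"]),
    # hypothetical: a person title followed by a proper noun, e.g.
    # ``MAYOR JONES'', becomes a <person>
    ("person", ["person_title", "NNP"]),
]

def apply_rules(tagged):
    """tagged: list of (word, tag). Returns (entity_class, phrase) matches."""
    found = []
    for label, pattern in RULES:
        for i in range(len(tagged) - len(pattern) + 1):
            if [t for _, t in tagged[i:i + len(pattern)]] == pattern:
                phrase = " ".join(w for w, _ in tagged[i:i + len(pattern)])
                found.append((label, phrase))
    return found

print(apply_rules([("MAYOR", "person_title"), ("JONES", "NNP"),
                   ("OF", "IN"), ("HUDSON", "NNP"), ("BAY", "loc_cue")]))
# [('location', 'HUDSON BAY'), ('person', 'MAYOR JONES')]
```

In the real system each NE class has its own grammar and the chart parser combines partial constituents bottom-up; this sketch only shows the flavour of a single rule pass.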
The SPRACH-R system was derived from one designed to do full information extraction on well punctuated, mixed-case newswire text. In addition to pseudo sentence segmenting and retraining the part-of-speech tagger on upper-case only text, there are a number of other differences between this system and the original LaSIE-II system:
We participated in the evaluation using the statistical and the rule-based NE annotation systems. For a development data set (1997 Hub-4E evaluation data) we report results of experiments using the manually verified reference transcriptions and the transcriptions from the 1997 CU-CON system (27% WER). A test data set (1998 evaluation data) consisted of reference transcriptions, the 1998 SPRACH recogniser output with 21% WER, and three baseline recogniser outputs.
System Development. For each of three individual LM sets, Table 1 shows NE identification results on the development set. This table indicates that the H4-train LM, obtained from the limited amount of manually annotated training data, resulted in a much higher precision than the other two, but had a poor recall owing to its limited vocabulary. The LMs trained on the automatically annotated data resulted in lower precision NE tagging but a higher recall score. Table 1 also shows the results for the merged system. On hand transcriptions, merged results did not reduce the precision but improved the recall; on the 27% WER transcriptions, merging did result in a slightly reduced precision with respect to the H4-train model, but again gave an improvement in recall and F-measure.
In the following, we analyse NE annotation errors by closer inspection of the mark-ups on the development data set.  provides further description of errors using graphs and examples found in the annotated transcriptions.
Recall scores. The name categories <location> (38.6% of total NE occurrences in the annotated reference), <person> (28.3%), and <organisation> (22.3%) far outnumbered the temporal and number expressions. Recall scores for <location> and <person> were substantially higher for the BN96-tagged and NA98-tagged transcripts than for the H4-tagged transcripts.
The initial marking on speech transcripts was done solely using the backed-off trigram relation. By inspection of annotated transcripts, it was found that most correctly marked NEs were identified through bigram or trigram constraints around each NE (i.e., an NE itself and words before/after that NE). When the LM was forced to back off to unigram statistics, the LM often estimated a bigram of an unknown word (with no tag) followed by some other word, rather than the unigram of a tagged word. Larger LMs were more likely to include the required bigrams and trigrams: thus it is not very surprising that the recall score using the H4-train LM (uni/bi/trigram: 19k, 96k, 86k entries) was less than the BN96 LM (65k, 4.3M, 12.9M entries) or the NA98 LM (65k, 4.9M, 14.5M entries).
When using the H4-train LM, the recall score for the subclass <organisation> (.63) was higher than those for <person> (.37) and <location> (.42), since there were more cue words around <organisation> names than around the other two (although this is an informal observation without statistical backing); as a consequence, bigrams and trigrams were more likely to be present in the LM. Furthermore, even without any cues, many <organisation> names contained multiple words, resulting in sufficiently high probability scores.
A secondary cause of inaccurate NE identification was errors in the BN and NA News training data produced by the automatic tagger. Occasionally it also marked corpora with <name> tags when unresolvable type ambiguity occurred between <organisation>, <person>, and <location>. This inaccuracy seemed to contribute to some of the failures, for <organisation> in particular, when using the BN96 and NA98 LMs.
Precision scores. Except for temporal expressions, NE annotation using the BN96 and NA98 LMs achieved about the same level of precision as that using the H4-train model. In particular, precision scores for <person>, <location>, <money>, and <percentage> were easily over 90% for the former. Although the automatic marking contained some errors, this was compensated for by a more reliable estimate of model parameters due to the increase in corpus size. Because of the specification conflict, the BN96-tagged and NA98-tagged transcripts matched temporal expressions poorly (a precision of just over .3 for <date> and well below .2 for <time>).
Hub-4E Evaluation Results. Table 2 shows the NE identification results for the SPRACH-S system on the 1998 evaluation data. The n-gram approach presented in this paper resulted in precision and recall scores that were 5-10% worse than those reported by BBN and MITRE, even though those systems were trained only on the one million words of manually annotated H4-train data. Ignoring technicalities, their methods both modelled transitions to the current word and class, conditioned on the previous word and class: i.e., transitions between classes were explicit. In contrast, we have constructed an n-gram model directly on word-to-word transitions, with class information treated as a word attribute. This is a serious drawback of the direct n-gram approach. As described above, the successful recovery of name expressions is heavily dependent on the existence of higher order n-grams in the model. The most straightforward way to improve the direct n-gram approach seems to be the incorporation of constraints at the class level.
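Schematically, with $c_i$ the name class and $w_i$ the word, the contrast between the two modelling styles can be written as follows (notation assumed; the first line is only a rough paraphrase of the BBN/MITRE models):

```latex
% explicit class-transition factorisation (roughly the BBN/MITRE style):
P(c_i \mid c_{i-1}, w_{i-1}) \; P(w_i \mid c_i, c_{i-1}, w_{i-1})

% direct n-gram over tag--word tokens (the SPRACH-S trigram):
P(\langle w_i, c_i \rangle \mid \langle w_{i-1}, c_{i-1} \rangle,
                                \langle w_{i-2}, c_{i-2} \rangle)
```

In the first form, a class transition can be estimated even when the particular word pair was never observed; in the second, class information is available only insofar as the specific tag-word n-gram exists in the model.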
The results of the rule-based approach are shown in Table 3, for both the reference transcriptions and the SPRACH and baseline recogniser transcriptions. Breakdown of the results by NE category for both 1998 test sets is shown in Table 4. Note that due to a porting bug no time, money or percentage NEs were identified by the SPRACH-R system.
On the reference transcriptions, the rule-based system returns an overall F-measure that is 20-25% lower than those returned in MUC-6 and MUC-7. The errors committed by the system may be divided into three classes:
First, newswire stories provide clearly delineated discourses in which a limited set of entities is introduced and then referred to in various ways (e.g., ``Winston Scott'', later just ``Scott''; ``Bloomberg News Service'', later just ``Bloomberg''). The broadcast news transcriptions are not segmented into discourse units in any clearly identifiable way. Without a notion of ``story boundary'', techniques developed for matching variable forms of names across one story in newswire texts could not be used. Since the initial form of reference is usually fuller and hence more easily classified, the inability to resolve subsequent references with earlier ones led to a significant drop in recall (but attempts to do name matching across arbitrarily segmented portions of the transcriptions led to even more exaggerated drops in precision).
Second, the use of company designators (e.g., ``Inc.'', ``Ltd.'') and personal titles (e.g., ``Mr.'', ``Dr.'') appears to be much more limited in spoken news. These terms provide significant clues in text-based news stories, of which the rule-based system takes considerable advantage.
These were preliminary experiments using baseline statistical and rule-based systems. The statistical system does not specifically model extent and has a context limited by the history of a trigram language model. It is also dependent on annotated training data, which we expanded using the existing LaSIE-II system. However, the statistical system is a very close match to the language model currently used in LVCSR systems, and it is straightforward to see how the NE tagged LM could be integrated into an LVCSR system. The rule-based system -- which has produced good performance in previous MUC evaluations -- was minimally adapted to spoken rather than textual data and was not modified for the broadcast news domain. Although both systems are still under development, we are in a good position to investigate differences between statistical and rule-based approaches for information extraction. We also hope to investigate the possibility of constructing a hybrid system.