An obvious danger in using relevant documents retrieved from a database of automatically transcribed spoken documents is that the query expansion may include recognition errors (eg [2]). One way of avoiding this problem is by using a secondary corpus of documents from a similar domain that do not contain recognition errors. For our application an obvious choice for such a corpus is contemporaneous newswire or newspaper text. This secondary collection is ranked with respect to the query. A query expansion algorithm may then be applied using this ranking to find those terms in the secondary collection that have the largest mutual information (or related statistic) with the query terms.
In interactive systems, where there is a human in the loop, it is
possible to definitely mark documents as relevant or non-relevant, and
such documents can be used as training data for a relevance
feedback query expansion approach. In the purely automatic case,
in which no relevance judgements are available, it is assumed that the
top nr documents are relevant. This process is sometimes termed
pseudo-relevance feedback. We have used such an algorithm,
based on the local context analysis algorithm of Xu and
Croft [10]. In this algorithm, the query expansion
weight for a term given a query and the secondary collection is based
on the nr top ranked documents in the secondary collection:
The potential query expansion terms e are simply those terms in the
relevant documents. The term AF(e,t) measures the term frequency
correlation of two terms e and t across collection of documents di:
![]() |
(4) |
|
|
In practice the values of nr and nt are maximum limits, since we threshold so that only those documents with a score greater than 0.8 times the score of the top-ranked document are considered, and only those terms with QEW(Q,e) greater than an empirically-determined threshold are added.
We experimented with this query expansion algorithm on the TREC-7 SDR corpus. Figure 1 shows the effect on the interpolated recall-precision curve for the reference and speech recognition conditions; the query-by-query effect over the 23 queries in the TREC-7 SDR evaluation is shown in figure 2.