next up previous
Next: Segmentation Up: Text Retrieval Previous: Text Retrieval

   
Query Expansion

Under the bag of words model, if a relevant document does not contain the terms that are in the query, then that document will not be retrieved. The aim of query expansion is to reduce this query/document mismatch by expanding the query using words or phrases with a similar meaning or some other statistical relation to the set of relevant documents. This procedure may have even greater importance in spoken document retrieval, since the word mismatch problem is heightened by the presence of errors in the automatic transcription of spoken documents.

An obvious danger in using relevant documents retrieved from a database of automatically transcribed spoken documents is that the query expansion may include recognition errors (eg [2]). One way of avoiding this problem is by using a secondary corpus of documents from a similar domain that do not contain recognition errors. For our application an obvious choice for such a corpus is contemporaneous newswire or newspaper text. This secondary collection is ranked with respect to the query. A query expansion algorithm may then be applied using this ranking to find those terms in the secondary collection that have the largest mutual information (or related statistic) with the query terms.

In interactive systems, where there is a human in the loop, it is possible to definitely mark documents as relevant or non-relevant, and such documents can be used as training data for a relevance feedback query expansion approach. In the purely automatic case, in which no relevance judgements are available, it is assumed that the top nr documents are relevant. This process is sometimes termed pseudo-relevance feedback. We have used such an algorithm, based on the local context analysis algorithm of Xu and Croft [10]. In this algorithm, the query expansion weight for a term given a query and the secondary collection is based on the nr top ranked documents in the secondary collection:
\begin{multline}QEW(Q, e) = \sum_{t \in Q}
\log \left ( \frac {\log(AF(e,t)) * CFW(e)} {\log(nr)}
+ \delta \right ) \\
* CFW(t) .
\end{multline}

The potential query expansion terms e are simply those terms in the relevant documents. The term AF(e,t) measures the term frequency correlation of two terms e and t across collection of documents di:

\begin{displaymath}AF(e,t) = \sum_{i=1}^{nr} TF(e,d_i)*TF(t,d_i) .
\end{displaymath} (4)

The nt possible expansion terms with the largest weights are then added to the original query, weighted as 1/rank. Note that this algorithm is not discriminative, since it does consider (pseudo-)non-relevant documents.


  
Figure 1: Effect of query expansion on recall-precision for TREC-7 SDR using reference transcriptions (R1) and the output of the ABBOT recognizer (S1).
\includegraphics[width=\columnwidth]{shef-qe-bw.eps}


  
Figure 2: Query-by-query effect of query expansion in terms of change in average precision compared with no query expansion.
\includegraphics[width=0.6\textwidth,height=70mm]{QEDelta.eps}

In practice the values of nr and nt are maximum limits, since we threshold so that only those documents with a score greater than 0.8 times the score of the top-ranked document are considered, and only those terms with QEW(Q,e) greater than an empirically-determined threshold are added.

We experimented with this query expansion algorithm on the TREC-7 SDR corpus. Figure 1 shows the effect on the interpolated recall-precision curve for the reference and speech recognition conditions; the query-by-query effect over the 23 queries in the TREC-7 SDR evaluation is shown in figure 2.


next up previous
Next: Segmentation Up: Text Retrieval Previous: Text Retrieval
Steve Renals
1999-04-30