Annie Louis: Software and Resources
Topic Words Tool
The idea of topic words for summarization was introduced by
C. Lin and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics.
A topic word is a word whose probability in a given text is significantly greater than its probability in a large background corpus. For each word in the text, the likelihood ratio between two hypotheses, (a) and (b), is computed.
a) The word is not a topic word, so its probability is the same in both the given text and the background collection.
b) The word is a topic word, so its probability in the given text is greater than in the background collection.
This likelihood ratio, lambda, is estimated for each word, and the statistic -2 * log(lambda) asymptotically follows a chi-square distribution. A cutoff can therefore be chosen for a given significance level, and the words whose statistic exceeds the cutoff are called topic signatures.
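The test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code distributed with this tool: it assumes a binomial model for word occurrences, and the names neg2_log_lambda and topic_words, the argument names, and the default cutoff are hypothetical choices made for the example. The default cutoff of 10.83 is the chi-square critical value for one degree of freedom at a significance level of 0.001.

```python
import math
from collections import Counter


def _log_likelihood(p, k, n):
    """Binomial log-likelihood of k occurrences in n tokens (constant term dropped)."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def neg2_log_lambda(k1, n1, k2, n2):
    """Return -2 * log(lambda) for one word.

    k1, n1: count of the word and total tokens in the given text.
    k2, n2: count of the word and total tokens in the background corpus.
    """
    p = (k1 + k2) / (n1 + n2)  # hypothesis (a): one shared probability
    p1 = k1 / n1               # hypothesis (b): separate probabilities
    p2 = k2 / n2
    return 2.0 * (
        _log_likelihood(p1, k1, n1) + _log_likelihood(p2, k2, n2)
        - _log_likelihood(p, k1, n1) - _log_likelihood(p, k2, n2)
    )


def topic_words(text_tokens, background_counts, background_total, cutoff=10.83):
    """Map each word that passes the test to its -2 * log(lambda) score."""
    counts = Counter(text_tokens)
    n1 = len(text_tokens)
    selected = {}
    for word, k1 in counts.items():
        k2 = background_counts.get(word, 0)
        # Keep only words that are more probable in the text than in the background.
        if k1 / n1 <= k2 / background_total:
            continue
        score = neg2_log_lambda(k1, n1, k2, background_total)
        if score > cutoff:
            selected[word] = score
    return selected
```

Here text_tokens is the tokenized input (one document, or several concatenated), and background_counts is a dictionary of word counts from the background corpus with background_total tokens in all. Raising the cutoff yields fewer, more strongly indicative topic words.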
Topic signatures are a very useful feature for content selection from source documents in summarization.
This code computes topic words for a given document or a set of documents considered as a whole. The background corpus used contains 5000 documents from the English GigaWord Corpus.
Please send any questions or suggestions to: Annie Louis
alouis at inf dot ed dot ac dot uk