Global Inference for Summarization Using Integer Linear Programming, Research Project Grant, funded by the EPSRC, 01/2009-01/2012.

Principal Investigators: Mirella Lapata and Andreas Grothey

Research Fellow: Kristian Woodsend

Summarization is the process of condensing a source text into a shorter version while preserving its information content. Its applications are many and varied, ranging from quick access to news and scientific articles to meeting browsers and systems that help physicians gather patient information. Humans summarize effortlessly on a daily basis (e.g., by describing the contents of a lecture, a meeting or a movie), but producing high-quality summaries automatically remains a challenge. The difficulty lies primarily in the nature of the task, which is complex, must satisfy many constraints (e.g., summary length, informativeness, coherence, grammaticality), and ultimately requires large-scale text understanding.

Since robust text understanding is beyond the capabilities of current NLP technology, most work today focuses on extractive summarization. The idea here is to create a summary simply by identifying and subsequently concatenating the most important sentences in a document. Because this requires little linguistic analysis, it is possible to create summaries for a wide range of documents, independently of style, text type, and subject matter. Unfortunately, the resulting extracts often suffer from low readability and poor text quality.

In this project we will develop novel models for single-document summarization that break away from the sentence extraction paradigm. We will model summarization as an optimization problem and use integer linear programming (ILP) to find a summary that is best for the application, task, or user at hand. The ILP formulation is advantageous for two reasons. First, it allows us to explicitly encode the constraints our output summaries must meet. Second, ILP is a well-studied optimization problem with efficient algorithms for finding a globally optimal solution in the presence of many conflicting constraints. This proposal aims to shift the summarization paradigm by developing novel and unified models based on the ILP framework that are able to identify what is important in a document and express it appropriately. The success of this research will have a significant and far-reaching impact on summarization and related areas (e.g., information retrieval) that could not be brought about by incrementally extending conventional models.
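
As a rough illustration of the kind of formulation described above, the sketch below casts a toy content-selection task as a 0-1 program: one binary variable per sentence, a linear objective over importance scores, a length constraint, and pairwise redundancy constraints. Everything in it is an assumption made for illustration and not the project's actual model: the function name select_sentences, the scores, the word budget, and the redundancy pairs are all hypothetical, and exhaustive enumeration stands in for a real ILP solver (it is only practical for a handful of variables). The toy still selects whole sentences, whereas the models proposed here aim to go beyond sentence extraction.

```python
from itertools import product


def select_sentences(sentences, scores, max_words, redundant_pairs=()):
    """Solve a tiny 0-1 program by brute force (hypothetical helper).

    maximize    sum_i scores[i] * x[i]
    subject to  sum_i words(i) * x[i] <= max_words      (length constraint)
                x[i] + x[j] <= 1 for (i, j) in pairs    (redundancy constraint)
                x[i] in {0, 1}
    """
    lengths = [len(s.split()) for s in sentences]
    best_value, best_x = float("-inf"), None

    # Enumerate every binary assignment; a real ILP solver would use
    # branch-and-bound or cutting planes instead of full enumeration.
    for x in product((0, 1), repeat=len(sentences)):
        if sum(n * xi for n, xi in zip(lengths, x)) > max_words:
            continue  # violates the length constraint
        if any(x[i] + x[j] > 1 for i, j in redundant_pairs):
            continue  # violates a redundancy constraint
        value = sum(c * xi for c, xi in zip(scores, x))
        if value > best_value:
            best_value, best_x = value, x

    chosen = [s for s, xi in zip(sentences, best_x) if xi]
    return chosen, best_value


# Toy input: the scores and the redundancy pair are made up for illustration.
sentences = [
    "The council approved the new transport budget on Monday.",
    "The transport budget was approved by the council.",
    "Bus fares will rise by five percent in April.",
    "Several residents attended the meeting.",
]
scores = [5, 4, 3, 1]
summary, value = select_sentences(
    sentences, scores, max_words=20, redundant_pairs=[(0, 1)]
)
print(summary, value)
```

On this toy input the optimum keeps the first and third sentences (18 words, total score 8): the redundancy constraint rules out selecting both paraphrases of the budget sentence, and adding the fourth sentence would exceed the 20-word budget. Stating such requirements as explicit constraints, and obtaining a solution that is optimal with respect to all of them at once, is the property the ILP formulation is meant to provide.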