Sporleder, Caroline and Mirella Lapata. 2006. Broad Coverage Paragraph Segmentation across Languages and Domains. ACM Transactions on Speech and Language Processing, 3:2, 1-35.

This paper considers the problem of automatic paragraph segmentation. The task is relevant for speech-to-text applications, whose output transcripts do not usually contain punctuation or paragraph indentation and are therefore difficult to read and process. Text-to-text generation applications (e.g., summarisation) could also benefit from an automatic paragraph segmentation mechanism which indicates topic shifts and provides visual targets to the reader. We present a paragraph segmentation model which exploits a variety of knowledge sources (including textual cues and syntactic and discourse-related information) and evaluate its performance in different languages and domains. Our experiments demonstrate that the proposed approach significantly outperforms our baselines and in many cases comes to within a few percent of human performance. Finally, we integrate our method with a single-document summariser and show that it is useful for structuring the output of automatically generated text.


@Article{Sporleder:Lapata:06,
  author  = {Caroline Sporleder and Mirella Lapata},
  title   = {Broad Coverage Paragraph Segmentation across Languages and Domains},
  journal = {ACM Transactions on Speech and Language Processing},
  volume  = 3,
  number  = 2,
  pages   = {1--35},
  year    = 2006
}