Web Data for Language Processing


Lapata, Mirella and Frank Keller. 2004. The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 121-128. Boston.

Previous work demonstrated that web counts can be used to approximate bigram frequencies, and thus should be useful for a wide variety of NLP tasks. So far, only two generation tasks (candidate selection for machine translation and confusion-set disambiguation) have been tested using web-scale data sets. The present paper investigates whether these results generalize to tasks covering both syntax and semantics, both generation and analysis, and a larger range of n-grams. For the majority of tasks, we find that simple, unsupervised models perform better when n-gram frequencies are obtained from the web rather than from a large corpus. However, in most cases, web-based models fail to outperform more sophisticated state-of-the-art models trained on small corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard models.


@InProceedings{Lapata:Keller:04,
  author =       {Mirella Lapata and Frank Keller},
  title =        {The Web as a Baseline: Evaluating the Performance of
                  Unsupervised Web-based Models for a Range of {NLP} Tasks},
  crossref =     {NAACL:04},
  pages =        {121--128}
}

@Proceedings{NAACL:04,
  title =        {Proceedings of the Human Language Technology Conference of the North
                  American Chapter of the Association for Computational Linguistics},
  booktitle =    {Proceedings of the Human Language Technology Conference of the North
                  American Chapter of the Association for Computational Linguistics},
  year =         2004,
  address =      {Boston}
}