Lapata, Mirella and Frank Keller. 2005. Web-based Models for Natural Language Processing. ACM Transactions on Speech and Language Processing 2:1, 1-31.

Previous work demonstrated that web counts can be used to approximate bigram counts, thus suggesting that web-based frequencies should be useful for a wide variety of NLP tasks. However, only a limited number of tasks have so far been tested using web-scale data sets. The present paper overcomes this limitation by systematically investigating the performance of web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine web counts and corpus counts. However, unsupervised web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.

  author =       {Mirella Lapata and Frank Keller},
  title =        {Web-based Models for Natural Language Processing},
  journal =      {ACM Transactions on Speech and Language Processing},
  volume =       2,
  number =       1,
  pages =        {1--31},
  year =         2005