C++ bindings for Storm.

FSD Twitter Corpus

Indic multi-parallel corpora

Twitter have asked us to stop distributing our tweets.

We have released our Bloom filter-based language model code: RandLM. See our ACL 2007 paper for details.

The maxent package we use is now available.

Meet our no-cost(*) Hadoop cluster! This was cobbled together from machines we had sitting in a basement. This setup crawls a third of a million blog posts and one million tweets per day. (*) Almost: I bought some Ethernet cables and the hub, and Philipp put in new hard disks. This was our first Hadoop cluster and it has been running continuously since about 2008. Another Hadoop cluster we have uses much fancier hardware.

One billion web pages