Twitter have asked us to stop distributing our tweets.
We have released our Bloom Filter-based Language Model code: RandLM. See our ACL07 paper for details.
The maxent package we use is now available.
Meet our no-cost(*) Hadoop cluster! It was cobbled together from machines we had sitting in a basement. The setup crawls a third of a million blog posts and one million tweets per day. (*) Almost: I bought some Ethernet cables and the hub, and Philipp put in new hard disks. This was our first Hadoop cluster, and it has been running continuously since about 2008. Another Hadoop cluster of ours uses much fancier hardware.
One billion web pages