Currently, I'm most interested in social media, working with massive streams of language, and (statistical) machine translation.
Cross-cutting these broad areas is the question of how to train and apply large models. For example, our machine translation systems need to run on a cluster of machines. Throwing more machines at such models helps, but it is clear that the most interesting models and datasets will make computational demands that far outstrip whatever resources we have available. Imagine training on all the data that appears on the Web, each and every day. Scaling our machine learning methods will become crucial. Randomised and streaming algorithms will prove essential here: our work using Bloom Filters for Language Models is a start in this direction, as is our work on locality sensitive hashing for real-time event detection in Twitter. Infrastructure to support large-scale experiments is also vital. I have been installing and experimenting with Hadoop for quite a while now, and more recently we have started looking at Storm for real-time event processing.
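To give a flavour of the Bloom filter idea: the trick is to store a huge set of n-grams in a fixed-size bit array, trading a small, tunable false-positive rate for massive space savings (and guaranteeing no false negatives). The sketch below is a minimal generic Bloom filter, not the actual language-model code; the class name, sizes, and hashing scheme are illustrative choices.

```python
import hashlib

class BloomFilter:
    """Approximate set membership in fixed space.

    Guarantees: no false negatives; false positives occur with a
    probability controlled by the bit-array size and hash count.
    (A generic sketch -- parameters here are illustrative.)
    """

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes independent positions by salting one hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# An n-gram set that would normally need gigabytes fits in a
# fixed bit budget; queries touch only num_hashes bit positions.
lm_ngrams = BloomFilter()
lm_ngrams.add("the quick brown")
```

For language modelling the payoff is that membership (and, with extensions, quantised counts) for billions of n-grams fits in memory on a single machine.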
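The locality sensitive hashing idea can likewise be sketched in a few lines. One standard construction (random hyperplanes for cosine similarity) maps each tweet's bag of words to a short bit signature, so that similar tweets are likely to collide in the same bucket and candidate near-duplicates can be found without all-pairs comparison. This is a generic illustration of the technique, not our event-detection system; the function names and parameters are assumptions.

```python
import random

def make_planes(vocab, num_bits, seed=0):
    # One random Gaussian hyperplane per signature bit,
    # represented sparsely as term -> weight.
    rng = random.Random(seed)
    return [{t: rng.gauss(0, 1) for t in vocab} for _ in range(num_bits)]

def signature(bag, planes):
    # Each bit records which side of a hyperplane the tweet's
    # term-count vector falls on; similar vectors tend to agree.
    sig = 0
    for plane in planes:
        dot = sum(plane.get(t, 0.0) * c for t, c in bag.items())
        sig = (sig << 1) | (dot >= 0)
    return sig

vocab = ["earthquake", "tokyo", "magnitude", "cat", "video"]
planes = make_planes(vocab, num_bits=16, seed=42)
tweet = {"earthquake": 2, "tokyo": 1}
bucket = signature(tweet, planes)
```

In a streaming setting, each incoming tweet is hashed into its bucket in time proportional to its length, which is what makes real-time event detection over the full Twitter stream feasible.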