My general interests lie in the development of machine learning techniques to address challenges in scientific modelling. By definition, all models are incomplete abstractions of the real systems they describe, and therefore, as George Box famously pointed out, wrong. However, some of them are extremely useful, as they allow us to explore the conceptual implications of these abstractions in silico. To assess the usefulness of a model, scientists normally perform further experiments to validate the model's predictions.
But what does it mean for a model's predictions to agree with data? Traditionally, particularly in physics, one could create very well-controlled conditions, such that the agreement between models and data was essentially complete. Greater challenges appear when one tries to apply this paradigm to biology, where noise and incomplete observations are the norm. To assess the agreement of a model's predictions with data, it is therefore imperative to first quantify the inherent uncertainty in those predictions. This uncertainty is the direct consequence of the model being an incomplete representation of the system, and of the fact that incomplete and noisy observations have been used to calibrate the model.
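To make this concrete, here is a minimal illustrative sketch (not any specific model from my own work) of how Bayesian calibration turns noisy observations into quantified prediction uncertainty. It uses conjugate Bayesian linear regression, where the predictive variance splits explicitly into a parameter-uncertainty term and an observation-noise term; all names and numerical values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated noisy observations of a linear system y = 1 + 2x (values illustrative)
x = np.linspace(0.0, 1.0, 15)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)

Phi = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
noise_var = 0.3 ** 2                          # observation noise variance (assumed known)
prior_var = 10.0                              # broad Gaussian prior on the weights

# Conjugate update: the posterior over the weights is Gaussian N(m, S)
S_inv = np.eye(2) / prior_var + Phi.T @ Phi / noise_var
S = np.linalg.inv(S_inv)                      # posterior covariance
m = S @ Phi.T @ y / noise_var                 # posterior mean

# Posterior predictive at a new input x = 0.5: the predictive variance is
# parameter uncertainty (from limited, noisy data) plus observation noise.
x_new = np.array([1.0, 0.5])
pred_mean = x_new @ m
pred_var = x_new @ S @ x_new + noise_var

print(f"prediction: {pred_mean:.2f} +/- {np.sqrt(pred_var):.2f}")
```

The key point is the last line: even a perfectly specified model cannot predict more precisely than the calibration data allow, and the predictive variance makes that limit explicit.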
In recent years, I have developed Bayesian Machine Learning tools to quantify model prediction uncertainty for specific classes of models in biology. Here is a list of some of my active areas of research, and some areas where I would like to become more active:
Protein levels in cells change dynamically in response to external stimuli or internal programmes (e.g. development). This is a tightly controlled process, in which auxiliary proteins (e.g. transcription factors) play a prominent but often subtle role. There exist established models for individual subsystems, and conceptual models of the main processes. However, the specific workings of any single subsystem are mostly unknown; reconstructing them from data poses plenty of interesting inferential questions. Most of my group works on problems related to this: in particular, two BBSRC-funded post-docs are investigating gene regulation in E. coli following exposure to stress (B. Cseke and R. Begg).
Good background reading in this area includes Genes and Signals, the seminal book by Ptashne and Gann elucidating the biological principles, and Learning and Inference in Computational Systems Biology, a book I co-edited, which presents a good and still up-to-date overview of machine learning in systems biology. For an idea of my own work in the area, this paper is a good starting point.
The overwhelming majority of quantitative biology has focused on studying molecules like mRNA, which decay within hours at most. How can this help us explain phenomena that take years to establish, e.g. ageing, cancer, or neurodegenerative diseases? People increasingly think that a determining factor is so-called "epigenetics", i.e. changes in the spatial organisation or chemical state of DNA (e.g. how it is wrapped around histones, or its methylation state; for a very accessible review see here). Data about these epigenetic factors are becoming increasingly available thanks to next-generation sequencing. Can we use computational methods to discover whether there are networks connecting these various epigenetic factors, and connecting epigenetics with genetics? Gabriele Schweikert is working jointly in my group and in Adrian Bird's group on some of these questions.
Most people would agree that we exist in both space and time. Yet an overwhelming majority of scientific work assumes spatial homogeneity; while this is a convenient simplification, it is probably not an appropriate one in most cases. I have recently become increasingly interested in methodologies for spatio-temporal modelling, partly due to collaboration with Visakan Kadirkamanathan and our joint student Andrew Zammit-Mangion.
Some useful references for this area of work are the following two papers, exploring stochasticity in Drosophila development and presenting a general estimation tool for a class of spatio-temporal models.
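For readers unfamiliar with this model class, the following is a minimal illustrative sketch (my own toy example, not the models from the papers above): a stochastic reaction-diffusion equation on a periodic 1-D grid, simulated with an Euler-Maruyama step. Such equations are among the simplest spatio-temporal models combining spatial coupling, decay, and noise; all parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

n, dx, dt = 50, 0.1, 0.001
D, decay, noise = 1.0, 0.5, 0.05        # diffusion, decay rate, noise amplitude

# Initial condition: a localised bump of activity on a periodic domain
u = np.exp(-((np.arange(n) * dx - 2.5) ** 2))

for _ in range(500):
    # Finite-difference Laplacian with periodic boundary conditions
    lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / dx**2
    # Euler-Maruyama step: deterministic drift plus space-time white noise
    u = u + dt * (D * lap - decay * u) + np.sqrt(dt) * noise * rng.normal(size=n)
```

The bump spreads and decays while the noise keeps the field fluctuating; the inferential challenge in this area is the inverse problem, i.e. recovering parameters such as D and the decay rate from partial, noisy observations of u.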