Out of date, will be updated soon (2019.10.31)


My research is about the development of interactive systems that can understand human communication. A lot of this work is grounded in speech recognition, and is based on building and applying statistical models to interpret communication signals. The Natural Speech Technology programme grant is concerned with core work in speech recognition and speech synthesis.

Speech to text transcription is a highly challenging task in itself, but ultimately we want to understand human communication, rather than only transcribing the words. Along these lines, we have done work concerned with interpreting and accessing information from speech, and multimodal interaction. For more than a decade now, a lot of our work has focussed on the recognition and interpretation of multiparty meetings, as part of the M4, AMI, AMIDA, and InEvent projects.

Speech Recognition and Synthesis

I am interested in developing better models for speech recognition, trainable from large amounts of data. I have worked in most aspects of speech recognition including (deep) neural networks, other discriminative acoustic modelling approaches, ongoing attempts to develop language models that work significantly better than smoothed n-grams, and efficient search. I am also interested in statistical modelling for speech synthesis. This work is supported by the Natural Speech Technology programme grant, and by the InEvent, EU-Bridge, and uDialogue projects.

Acoustic modelling

In the late 1980s and early 1990s, I worked on hybrid neural network / HMM systems (with Tony Robinson, Mike Hochberg, Nelson Morgan, Herve Bourlard, and Richard Rohwer). This resulted in systems that were state of the art at the time, but during the 1990s GMM/HMM systems proved to be more accurate, mainly due to better context-dependent acoustic modelling, speaker adaptation, and the ability to train larger models on more data due to easy parallelism across compute clusters. Since about 2010, neural network based systems have redefined the state of the art in speech recognition owing to the availability of low cost GPGPU-based computing, the use of large-scale context dependence, and the development of methods to train deep neural networks.
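
As a rough illustration of the hybrid idea, the network is usually trained to estimate state posteriors, which are divided by state priors to give scaled likelihoods that the HMM can use in place of GMM emission scores. The sketch below assumes that standard formulation; the function name, the flooring constant, and the toy numbers are hypothetical and are not taken from any of the systems mentioned here.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-state posteriors P(q|x) into scaled log likelihoods.

    By Bayes' rule p(x|q) is proportional to P(q|x) / P(q), so in the log
    domain we subtract the log prior from the log posterior. The floor is
    an illustrative guard against log(0), not a recommended setting.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]

# Toy example: two HMM states for a single acoustic frame.
frame_posteriors = [0.5, 0.5]    # hypothetical network outputs
state_priors = [0.25, 0.75]      # hypothetical relative state frequencies
scores = scaled_log_likelihoods(frame_posteriors, state_priors)
```

In this toy frame the rarer state receives the higher score even though both posteriors are equal, which is the effect that dividing by the prior is meant to have.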

One of the most exciting aspects of neural network acoustic models is their ability to automatically learn representations of speech. Recent and current work has looked at this in the context of cross-lingual speech recognition, domain adaptation, and distant speech recognition. I'm working on these topics with Pawel Swietojanski, Peter Bell, and (until recently) Arnab Ghoshal.

I've also investigated a variety of acoustic modelling techniques which attempt to address the drawbacks of conventional GMM/HMM systems including factorised and subspace models, discriminative training, richer spectral representations, trajectory models, and better dynamic models of speech. I'm currently working on some of these ideas with Liang Lu.

In the context of distant speech recognition, in particular, issues such as acoustic segmentation and speaker diarization become very important. I recently worked on this with Erich Zwyssig, who also worked on the design, implementation, and deployment of digital MEMS microphone arrays.

Selected Publications - Neural Network Acoustic Models

Selected Publications - Miscellaneous Acoustic Models

Language modelling

Fred Jelinek's keynote at Eurospeech '91 was entitled "Up from trigrams! The struggle for improved language models". It has taken nearly two decades of work for the state of the art in language modelling to move on from smoothed trigram or 4-gram language models. Neural network language models now define the state of the art in language modelling (although nearly always in combination with a large-scale n-gram model).
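
One common way to combine a neural network language model with an n-gram model is linear interpolation of the two word probabilities. The sketch below assumes that approach; the function names, the interpolation weight, and the toy probabilities are hypothetical, and other combination schemes exist.

```python
import math

def interpolate(p_nn, p_ngram, lam=0.5):
    """Linearly interpolate two probabilities for the same word and history.

    lam is the weight on the neural network model; in practice it would be
    tuned on held-out data (0.5 here is only a placeholder).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_nn + (1.0 - lam) * p_ngram

def perplexity(word_probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Toy per-word probabilities from two hypothetical models on three words.
nn_probs = [0.20, 0.05, 0.10]
ngram_probs = [0.10, 0.15, 0.10]
combined = [interpolate(a, b, lam=0.5) for a, b in zip(nn_probs, ngram_probs)]
combined_ppl = perplexity(combined)
```

Comparing the perplexity of the combined model with that of each model alone on held-out text is the usual way to check whether the interpolation helps.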

Jelinek's struggle continues, and we are interested in neural network language models and hierarchical Bayesian approaches - both of which can provide a framework for the inclusion of additional variables for language modelling. Neural network models are especially interesting because of their flexibility, their performance, and the possibility of learning representations of language. Rapid adaptation of language models across domains is also an area of interest. Currently, I'm working on language modelling with Siva Reddy Gangireddy and Fergus McInnes.

Selected Publications - Language Models

Speech Recognition Systems

We are interested in building speech recognition systems that work on natural speech in realistic acoustic environments. Currently we are working on systems for meeting recognition (often recorded using microphone arrays), recognition of lectures and talks (e.g. TED talks), and recognition of BBC broadcasts, across a wide range of genres. We all contribute to the systems, but in particular I'm working with Joris Driesen, Peter Bell, and Fergus McInnes. Going back in time... during the 1990s we worked very hard on a neural network/HMM hybrid system that we called ABBOT. I enjoyed writing decoders then (and still would, if I had the time...) - the NOWAY decoder was designed to decode WSJ sentences (20,000 word vocabulary) in real time (on a 120 MHz Pentium with 64-96 MB of RAM!).

We're also interested in developing systems that do more than simply transcribe speech. With Joris Driesen, Alexandra Birch, and Philipp Koehn we are working on speech translation systems.

Selected Publications - Systems

Speech synthesis

The biggest innovation in speech technology over the past decade has been the development of the trajectory HMM, and HTS, the HMM-based speech synthesis system, by Tokuda and co-workers at NITech. I previously did some work with Joao Cabral on glottal source modelling for HMM-based synthesis, and am currently working with Benigno Uria on deep neural network acoustic models and density estimators.

Selected Publications - Speech Synthesis

Acoustic-articulatory models

What can we infer about the state of the articulatory system from the acoustic signal? This is an intriguing machine learning problem - and solutions are likely to benefit recognition and synthesis. Recently I've been interested in acoustic-articulatory modelling with trajectory HMMs and with neural networks. With Korin Richmond, in the Ultrax project, we are interested in developing a simplified real-time visualisation of ultrasound images of the tongue.

Selected Publications - Acoustic/Articulatory

Multimodal Interaction

Interaction and communication are multimodal. We have developed an instrumented meeting room to capture human communication in meetings across multiple modalities, and are working on automatic approaches to recognize, interpret and structure meetings. This work is currently supported by the InEvent and uDialogue projects.


In the AMI and AMIDA projects we were interested in recognizing, interpreting, summarizing and structuring multiparty meetings. Summarization, dialogue act recognition, and meeting phase segmentation are examples of things that we are pursuing, along with meeting speech recognition. This work is continued in the InEvent project. With Catherine Lai and Jean Carletta I am working on things like summarization and detection of involvement in meetings and lectures.

Selected Publications - Meetings

Multimodal dialogue

As well as human-human communication, we are looking at multimodal human-computer dialogues. I previously did some work on using reinforcement learning for optimising spoken dialogues, and am currently working with Qiang Huang on multimodal dialogue in the uDialogue project.

Selected Publications

Information Access from Speech

In addition to work on meetings and multimodal interaction, when at Sheffield we constructed systems for spoken document retrieval, named entity identification, summarization and automatic segmentation of speech such as broadcast news and voicemail. In the late 1990s we put in a good deal of effort to develop systems for NIST evaluations in these areas.

Selected Publications