Out of date, will be updated soon (2019.10.31)

Research

My research is about the development of interactive systems that can understand human communication. A lot of this work is grounded in speech recognition, and is based on building and applying statistical models to interpret communication signals. The Natural Speech Technology programme grant is concerned with core work in speech recognition and speech synthesis.

Speech to text transcription is a highly challenging task in itself, but ultimately we want to understand human communication, rather than only transcribing the words. Along these lines, we have done work concerned with interpreting and accessing information from speech, and multimodal interaction. For more than a decade now, a lot of our work has focussed on the recognition and interpretation of multiparty meetings, as part of the M4, AMI, AMIDA, and InEvent projects.

Speech Recognition and Synthesis

I am interested in developing better models for speech recognition, trainable from large amounts of data. I have worked in most aspects of speech recognition including (deep) neural networks, other discriminative acoustic modelling approaches, ongoing attempts to develop language models that work significantly better than smoothed n-grams, and efficient search. I am also interested in statistical modelling for speech synthesis. This work is supported by the Natural Speech Technology programme grant, and by the InEvent, EU-Bridge, and uDialogue projects.

Acoustic modelling

In the late 1980s and early 1990s, I worked on hybrid neural network / HMM systems (with Tony Robinson, Mike Hochberg, Nelson Morgan, Herve Bourlard, and Richard Rohwer). This resulted in state-of-the-art systems, at the time, but during the 1990s GMM/HMM systems proved to be more accurate, mainly due to better context-dependent acoustic modelling, speaker adaptation, and the ability to train larger models on more data due to easy paralellism across compute clusters. Since about 2010, neural network based systems have redefined the state of the art in speech recognition owing to the availability of low cost GPGPU-based computing, the use of large-scale context dependence, and the development of methods to train deep neural networks.

One of the most exciting aspects of neural network acousitic models is their ability to automatically learn representations of speech. Recent and current work has looked at this in the context of cross-lingual speech recognition, domain adaptation, and distant speech recognition. I'm working on these topics with Pawel Swietojanski, Peter Bell, and (until recently) Arnab Ghoshal.

I've also investigated a variety of acoustic modelling techniques which attempt to address the drawbacks of conventional GMM/HMM systems including factorised and subspace models, discriminative training, richer spectral representations, trajectory models, and better dynamic models of speech. I'm currently working on some of these ideas with Liang Lu.

In the context of distant speech recognition, in particular, issues such as acoustic segmentation and speaker diarization become very important. I recently worked on this with Erich Zwyssig who also worked on the design, implmenmtation, and deployment of digital MEMS microphone arrays.

Selected Publications - Neural Network Acoustic Models

Pawel Swietojanski, Arnab Ghoshal, and Steve Renals (2013). "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition", Proc. IEEE ASRU. [pdf]
Peter Bell, Pawel Swietojanski, and Steve Renals (2013). "Multi-level adaptive networks in tandem and hybrid ASR systems", Proc. IEEE ICASSP. [pdf]
Arnab Ghoshal, Pawel Swietojanski, and Steve Renals (2013). "Multilingual training of deep neural networks", Proc. IEEE ICASSP. [pdf]
Pawel Swietojanski, Arnab Ghoshal, and Steve Renals (2012). "Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR", Proc. IEEE SLT. [pdf]
Gethin Williams and Steve Renals (1999). "Confidence measures from local posterior probability estimates", Computer Speech and Language, 13:395-411. [pdf]
Jean Hennebert, Christophe Ris, Hervé Bourlard, Steve Renals, and Nelson Morgan (1997). "Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems", Proc. Eurospeech, pages 1951-1954. [pdf]
Tony Robinson, Mike Hochberg, and Steve Renals (1996). "The use of recurrent networks in continuous speech recognition", In C.-H. Lee, K. K. Paliwal, and F. K. Soong, editors, Automatic Speech and Speaker Recognition - Advanced Topics, pages 233-258, Kluwer Academic Publishers. [pdf]
Steve Renals, Nelson Morgan, Hervé Bourlard, Mike Cohen, and Horacio Franco (1994). "Connectionist probability estimators in HMM speech recognition", IEEE Trans. on Speech and Audio Processing, 2:161-175. [pdf]

Selected Publications - Miscellaneous Acoustic Models

Liang Lu, Arnab Ghoshal, and Steve Renals (2014). "Cross-lingual Subspace Gaussian Mixture Models for Low-resource Speech Recognition", IEEE Trans Audio, Speech and Language Processing, 22(1):17-27. [pdf | doi]
Liang Lu, KK Chin, Arnab Ghoshal, and Steve Renals (2013). "Joint uncertainty decoding for noise robust subspace Gaussian mixture models", IEEE Trans Audio, Speech and Language Processing, 21(9):1791-1804. [pdf | doi]
Erich Zwyssig, Steve Renals, and Mike Lincoln (2012). "On the effect of SNR and superdirective beamforming in speaker diarisation in meetings", Proc IEEE ICASSP. [pdf]
Liang Lu, Arnab Ghoshal, and Steve Renals (2011). "Regularized subspace Gaussian mixture models for speech recognition", IEEE Signal Processing Letters, 18(7):419-422. [pdf | doi]
Ravi Chander Vipperla, Steve Renals, and Joe Frankel (2010). "Ageing voices: The effect of changes in voice parameters on ASR performance", EURASIP Journal on Audio, Speech, and Music Processing. [pdf | doi]
Yasser Hifny and Steve Renals (2009). "Speech Recognition Using Augmented Conditional Random Fields", IEEE Trans. Audio, Speech and Language Processing, 17:354-365. [pdf | doi]
Giulia Garau and Steve Renals (2008). "Combining Spectral Representations for Large Vocabulary Continuous Speech Recognition", IEEE Trans. Audio, Speech and Language Processing, 16:508-518. [pdf | doi]
Le Zhang and Steve Renals (2006). "Phone recognition analysis for trajectory HMM", Proc. Interspeech 2006 [pdf]
Vincent Wan and Steve Renals (2005), "Speaker verification using sequence discriminant support vector machines", IEEE Trans. on Speech and Audio Processing, 13:203-210. [pdf | doi]

Language modelling

Fred Jelinek's keynote at Eurospeech '91 was entitled Up from trigrams! The struggle for improved language models. It has taken nearly two decades of work for the state-of-the-art in language modelling to move on from smoothed trigram or 4-gram language models.... Neural network language models now define the state of the art in language modelling (although nearly always in combination with a large-scale n-gram model).

Jelinek's struggle continues, and we are interested in neural network language models and hierarchical Bayesian approaches - both of which can provide a framework for the inclusion of additional variables for language modelling. Neural network models are especially interesting both because of their flexibility, their performance, and also the possibility to learn representations of language. Rapid adaptation of language models across domains is also an area of interests. Currently, I'm working on language modelling with Siva Reddy Gangireddy and Fergus McInnes.

Selected Publications - Language Models

Songfang Huang and Steve Renals (2010). "Hierarchical Bayesian language models for conversational speech recognition", IEEE Transactions on Audio, Speech and Language Processing, 18(8):1941-1954. [pdf]
Songfang Huang and Steve Renals (2010). "Power law discounting for n-gram language models", In Proc. IEEE ICASSP-10, 5178-5181. [ pdf ]
Songfang Huang and Steve Renals (2008). "Using Prosodic Features in Language Models for Meetings", in Proc. MLMI-07, Springer LNCS 4892, 192-203. [ pdf ]
Yoshi Gotoh and Steve Renals (2000). "Variable word rate n-grams", Proc IEEE ICASSP-2000 [pdf]
Yoshi Gotoh and Steve Renals (1999). "Topic-based mixture language modelling", Journal of Natural Language Engineering, 5:355-375. [pdf]

Speech Recognition Systems

We are interested in building speech recognition systems that work in natural spoken speech in realistic acoustic environments. Currently we are working on systems for meeting recognition (often recorded using microphone arrays), recognition of lectures and talks (e.g. TED talks), and recognition of BBC broadcasts, across a wide range of genres. We all contribute to the systems, but in particular I'm working with Joris Driesen, Peter Bell,and Fergus McInnes. Going back in time.... during the 1990s we worked very hard on a neural network/HMM hybrid system that we called ABBOT. I enjoyed writing decoders then (and still would, if I had the time...) - the NOWAY decoder was designed to decode WSJ sentences (20,000 word vocabulary) in realtime (on a 120MHz pentium and 64-96Mb RAM!).

We're also interested in developing systems that do more than simply transcribe speech. With Joris Driesen, Alexandra Birch, and Philipp Koehn we are working on speech translation systems.

Selected Publications - Systems

Joris Driesen and Steve Renals (2013). "Lightly supervised automatic subtitling of weather forecasts", Proc IEEE ASRU-2013. [pdf]
Pawel Swietojanski, Arnab Ghoshal, and Steve Renals (2013). "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition", Proc IEEE ASRU-2013. [pdf]
Peter Bell, Hitoshi Yamamoto, Pawel Swietojanski, Youzheng Wu, Fergus McInnes, Chiori Hori, and Steve Renals (2013). "A lecture transcription system combining neural network acoustic and language models", Proc Interspeech. [pdf]
Steve Renals, Thomas Hain and Hervé Bourlard (2007). "Recognition and understanding of meetings: The AMI and AMIDA projects", in Proc. IEEE ASRU'07. [ pdf]
AJ Robinson, GD Cook, DPW Ellis, E Fosler-Lussier, SJ Renals, and DAG Williams (2002). "Connectionist speech recognition of broadcast news", Speech Communication, 37:27-45. [pdf]
Steve Renals and Michael Hochberg (1999). "Start-synchronous search for large vocabulary continuous speech recognition", IEEE Trans. on Speech and Audio Processing, 7:542-553. [pdf]
The noway decoder was packaged up as part of the SPRACHcore distribution. The code was written in the mid-90s - pre STL! - but thanks to people at ICSI, it compiles under modern gcc (noway.tar.gz) - and Paul Dixon made some updates so it compiles under OS X (https://github.com/edobashira/noway)

Speech synthesis

The biggest innovation in speech technology over the past decade has been the development of the trajectory HMM, and HTS the HMM-based speech synthesis system, by Tokuda and co-workers at NITech. I previously did some work with Joao Cabral on glottal source modelling for HMM-based sysnthesis, and am currently working with Benigno Uriaon deep neural network acoustic models and density estimators.

Selected Publications - Speech Synthesis

J. Yamagishi, T. Nose, H. Zen, Z. Ling, T. Toda, K. Tokuda, S.King and S. Renals (2009). "Robust Speaker-Adaptive HMM-based Text-to-Speech Synthesis", IEEE Trans. Audio, Speech and Language Processing, 17:1208-1230. [pdf | doi]
Joao P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi, "Towards an improved modeling of the glottal source in statistical parametric speech synthesis", Proc. Sixth ISCA Speech Synthesis Workshop (SSW6) [pdf]

Acoustic-articulatory models

What can we infer about the state of the articulatory system from the acoustic signal? This is an intriguing machine learning problem - and solutions are likely to benefit recognition and synthesis. Recently I've been interested in acoustic-articulatory modelling with trajectory HMMs and with neural networks. With Korin Richnmond, in the Ultrax project, we are interested in developing a simpolified real-time visulaisation of ultrasound images of the tongue.

Selected Publications - Acoustic/Articulatory

Korin Richmond and Steve Renals (2012). "Ultrax: An animated midsagittal vocal tract display for speech therapy", Proc. Interspeech. [ pdf]
Benigno Uria, Iain Murray, Steve Renals, and Korin Richmond (2012). "Deep architectures for articulatory inversion" Proc. Interspeech. [pdf]
L. Zhang and S. Renals (2008). "Acoustic-Articulatory Modelling with the Trajectory HMM", IEEE Signal Processing Letters, 15:245-248. [pdf | doi]
Miguel Carreira-Perpiñán and Steve Renals (1999). "A latent-variable modelling approach to the acoustic-to-articulatory mapping problem", Proc. 14th Int. Congress of Phonetic Sciences, pages 2013-2016. [pdf]
Miguel Carreira-Perpiñán and Steve Renals (1998). "Dimensionality reduction of electropalatographic data using latent variable models", Speech Communication, 26:259-282. [pdf]

Multimodal Interaction

Interaction and communication is multimodal. We have developed an instrumented meeting room to capture human communication in meetings across multiple modalities, and are working on automatic approaches to recognize, interpret and structure meetings. This work is currently supported by the InEvent and uDialogue projects.

Meetings

In the AMI and AMIDA projects we were interested in recognizing, interpreting, summarizing and structuring multiparty meetings. Summarization, dialog act recognition, meeting phase segmentation are examples of things that we are pursuing, along with meeting speech recognition. This work is continued in the inEvent project. With Catherine Lai and Jean Carletta I am working on things like summarization and detection of involvement of meetings and lectures.

Selected Publications - Meetings

Catherine Lai, Jean Carletta, and Steve Renals (2013). "Detecting summarization hot spots in meetings using group level involvement and turn-taking features", Proc Interspeech. [pdf]
The AMI Corpus
Steve Renals (2010). Recognition and understanding of meetings. Proc. NAACL/HLT. [ pdf ]
A. Dielmann and S. Renals (2008). "Recognition of Dialogue Acts in Multiparty Meetings using a Switching DBN", IEEE Trans. Audio, Speech and Language Processing, in press. [ pdf ]
Gabriel Murray and Steve Renals (2008). "Term-Weighting for Summarization of Multi-party Spoken Dialogues", in Proc. MLMI-07, Springer LNCS 4892, 156-167. [ pdf ]
Alfred Dielmann and Steve Renals (2007). "DBN based joint dialogue act recognition of multiparty meetings", Proc. IEEE ICASSP'07'. [pdf]
Alfred Dielmann and Steve Renals (2007). "Automatic meeting segmentation using dynamic Bayesian networks", IEEE Trans. on Multimedia, 9(1):25-36. [pdf]
G. Murray, S. Renals, J. Moore, and J. Carletta (2006). "Incorporating speaker and discourse features into speech summarization", Proc. HLT-NAACL 2006. [pdf]
S. J. Wrigley, G. J. Brown, V. Wan, and S. Renals (2005). "Speech and crosstalk detection in multi-channel audio", IEEE Trans. on Speech and Audio Processing, 13:84-91. [pdf]
Steve Renals and Dan Ellis (2003). "Audio information access from meeting rooms", Proc. IEEE ICASSP'03' [pdf]

Multimodal dialogue

As well as human-human communication, we are looking at multimodal human-computer dialogues. I previously did some work on using reinforcement learning for optimising spoken dialogues, and am currently working with Qiang Huang on multimodal dialogue in the uDialogue project.

Selected Publications

Heriberto Cuayáhuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira (2007). "Hierarchical dialogue optimization using semi-Markov decision processes", in Proc. Interspeech '07. [.pdf]
Olivier Pietquin and Steve Renals (2002). "ASR system modeling for automatic evaluation and optimization of dialogue systems", Proc IEEE ICASSP. [pdf]

Information Access from Speech

In addition to work on meetings and multimodal interaction, when at Sheffield we constructed systems for spoken document retrieval, named entity identification, summarization and automatic segmentation of speech such as broadcast news and voicemail. In the late 1990s we put in a good deal of effort to develop systems for NIST evaluations in these areas.

Selected Publications

Heidi Christensen, Yoshi Gotoh and Steve Renals (2008). "A cascaded broadcast news highlighter", IEEE Trans. Audio, Speech and Language Processing, 16:151-161. [pdf]
Konstantinos Koumpis and Steve Renals (2005). "Content-based access to spoken audio", IEEE Signal Processing Magazine, 22(5):61-69. [pdf]
Konstantinos Koumpis and Steve Renals (2005). "Automatic summarization of voicemail messages using lexical and prosodic features", ACM Transactions on Speech and Language Processing, 2(1):1-24. [pdf]
Jerry Goldman, Steve Renals, Steven Bird, Franciska de Jong, Marcello Federico, Carl Fleischhauer, Mark Kornbluh, Lori Lamel, Doug Oard, Clare Stewart, and Richard Wright (2005). "Accessing the spoken word", International Journal of Digital Libraries, 5(4):287-298. [pdf]
Heidi Christensen, Yoshi Gotoh, and Steve Renals (2001). "Punctuation annotation using statistical prosody models", Proc. ISCA Workshop on Prosody in Speech Recognition and Understanding [pdf]
Steve Renals, Dave Abberley, David Kirby, and Tony Robinson (2000). "Indexing and retrieval of broadcast news", Speech Communication, 32:5-20 [pdf]
Yoshi Gotoh and Steve Renals (2000). "Information extraction from broadcast news", Philosophical Transactions of the Royal Society of London, Series A, 358:1295-1310 [pdf]