Research
My research concerns the development of interactive systems that can understand human communication. Much of this work is grounded in speech recognition, and is based on building and applying statistical models to interpret communication signals.
Over the past decade this work has grown beyond speech transcription to include approaches to interpreting and accessing information from speech, and multimodal interaction. Since 2002, much of our work has focussed on the recognition and interpretation of multiparty meetings, as part of the M4, AMI and AMIDA projects.
Speech Recognition and Synthesis
I am interested in developing better models for speech recognition, trainable from large amounts of data. I have worked on most aspects of speech recognition, including discriminative acoustic modelling, language models that work significantly better than smoothed n-grams (an ongoing effort), efficient search, and HMM-based synthesis and other models that are useful for both recognition and synthesis.
Acoustic modelling
State-of-the-art generative models of speech acoustics, based on HMMs, scale to huge amounts of training data and are surprisingly accurate in practice. However, standard HMM approaches have many drawbacks, which may be addressed by discriminative training, richer spectral representations, and better dynamic models of speech. In the past I did a lot of work on connectionist/HMM hybrids, and such models still attract plenty of interest, especially given recent work on models such as conditional random fields. Recent work has included alternative discriminative approaches, the use of pitch-synchronous acoustic representations, and trajectory HMMs.
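As a rough illustration of the connectionist/HMM hybrid idea mentioned above: a network estimates state posteriors P(q|x) for each frame, and dividing by the state priors P(q) gives scaled likelihoods that can stand in for HMM emission scores. The sketch below is a minimal Python version under that assumption; the function name, the `floor` parameter and the toy numbers are hypothetical and are not taken from ABBOT or any other system.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-frame state posteriors P(q|x) into scaled log-likelihoods.

    By Bayes' rule, p(x|q) / p(x) = P(q|x) / P(q), so dividing each posterior
    by the state prior gives a score that differs from the emission
    likelihood only by a per-frame constant.
    """
    result = []
    for frame in posteriors:
        if len(frame) != len(priors):
            raise ValueError("each frame needs one posterior per state")
        result.append([
            # Floor both terms to avoid log(0) on hard zeros.
            math.log(max(post, floor)) - math.log(max(prior, floor))
            for post, prior in zip(frame, priors)
        ])
    return result

# Hypothetical toy values: two frames, three states.
priors = [0.5, 0.25, 0.25]
posteriors = [
    [0.5, 0.25, 0.25],  # posteriors equal the priors: no evidence either way
    [0.1, 0.8, 0.1],    # the frame strongly favours the second state
]
scores = scaled_log_likelihoods(posteriors, priors)
```

In the first frame every score is zero, because the posteriors carry no information beyond the priors; in the second, the score for the second state is positive and the others are negative. A decoder would add these scores to the transition log-probabilities during search.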
Selected Publications
- Yasser Hifny and Steve Renals (2009). "Speech Recognition Using Augmented Conditional Random Fields", IEEE Trans. Audio, Speech and Language Processing, 17:354-365. [web | bib | pdf | abstract]
- Giulia Garau and Steve Renals (2008). "Combining Spectral Representations for Large Vocabulary Continuous Speech Recognition", IEEE Trans. Audio, Speech and Language Processing, 16:508-518. [web | bib | pdf | abstract]
- Le Zhang and Steve Renals (2006). "Phone recognition analysis for trajectory HMM", Proc. Interspeech 2006 [ bib | pdf | Abstract ]
- Vincent Wan and Steve Renals (2005). "Speaker verification using sequence discriminant support vector machines", IEEE Trans. on Speech and Audio Processing, 13:203-210. [web | bib | pdf | Abstract ]
- Gethin Williams and Steve Renals (1999). "Confidence measures from local posterior probability estimates", Computer Speech and Language, 13:395-411. [web | bib | pdf | Abstract ]
- Jean Hennebert, Christophe Ris, Hervé Bourlard, Steve Renals, and Nelson Morgan (1997). "Estimation of global posteriors and forward-backward training of hybrid HMM/ANN systems", Proc. Eurospeech, pages 1951-1954. [ bib | pdf | Abstract ]
- Tony Robinson, Mike Hochberg, and Steve Renals (1996). "The use of recurrent networks in continuous speech recognition", In C.-H. Lee, K. K. Paliwal, and F. K. Soong, editors, Automatic Speech and Speaker Recognition - Advanced Topics, pages 233-258, Kluwer Academic Publishers. [ bib | pdf | Abstract ]
- Steve Renals, Nelson Morgan, Hervé Bourlard, Mike Cohen, and Horacio Franco (1994). "Connectionist probability estimators in HMM speech recognition", IEEE Trans. on Speech and Audio Processing, 2:161-175. [web | bib | pdf | Abstract ]
Language modelling
Fred Jelinek's keynote at Eurospeech '91 was entitled "Up from trigrams! The struggle for improved language models". Over 15 years later, most state-of-the-art large vocabulary speech recognition systems still use smoothed trigram or 4-gram language models. The struggle continues, however, and we are interested in hierarchical Bayesian approaches, which can provide a framework for including additional variables in language modelling.
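For readers unfamiliar with the baseline, a smoothed trigram model can be sketched in a few lines. The example below uses simple linear interpolation of trigram, bigram and unigram relative frequencies; this is one textbook smoothing scheme chosen for brevity, not the method used in any particular system (production recognizers typically use more refined schemes such as Kneser-Ney). The function names, the interpolation weights and the toy corpus are all hypothetical.

```python
from collections import Counter

def train(sentences):
    """Collect n-gram counts and context counts from tokenised sentences."""
    counts = {name: Counter() for name in ("uni", "bi", "tri", "ctx1", "ctx2")}
    for words in sentences:
        # Pad with two start symbols so the first word has a full context.
        tokens = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(tokens)):
            u, v, w = tokens[i - 2], tokens[i - 1], tokens[i]
            counts["uni"][w] += 1
            counts["bi"][(v, w)] += 1
            counts["tri"][(u, v, w)] += 1
            counts["ctx1"][v] += 1        # times v was seen as a bigram context
            counts["ctx2"][(u, v)] += 1   # times (u, v) was seen as a trigram context
    return counts

def prob(counts, u, v, w, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated estimate of P(w | u, v).

    Orders whose context was never observed are dropped and the remaining
    weights are renormalised, so the estimates still sum to one over the
    training vocabulary. Words never seen in training get probability zero.
    """
    l3, l2, l1 = lambdas
    total = sum(counts["uni"].values())
    parts = [(l1, counts["uni"][w] / total)]
    c1 = counts["ctx1"][v]
    if c1:
        parts.append((l2, counts["bi"][(v, w)] / c1))
    c2 = counts["ctx2"][(u, v)]
    if c2:
        parts.append((l3, counts["tri"][(u, v, w)] / c2))
    norm = sum(weight for weight, _ in parts)
    return sum(weight * p for weight, p in parts) / norm

# Hypothetical toy corpus.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = train(corpus)
p_sat = prob(counts, "the", "cat", "sat")
p_dog = prob(counts, "the", "cat", "dog")
```

Here `p_sat` is much larger than `p_dog`, since "sat" always followed "the cat" in the toy corpus while "dog" receives only its unigram share. The fixed weights are a simplification; in practice they would be estimated on held-out data.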
Selected Publications
- Songfang Huang and Steve Renals (2008). "Using Prosodic Features in Language Models for Meetings", in Proc. MLMI-07, Springer LNCS 4892, 192-203. [ pdf ]
- Songfang Huang and Steve Renals (2007). "Hierarchical Pitman-Yor Language Models for ASR in Meetings", Proc. IEEE ASRU'07 [ pdf ]
- Yoshi Gotoh and Steve Renals (2003). "Language modelling", in S. Renals and G. Grefenstette, editors, Text and Speech Triggered Information Access, number 2705 in Lecture Notes in Computer Science, pages 78-105. Springer-Verlag. [ bib | Abstract ]
- Yoshi Gotoh and Steve Renals (2000). "Variable word rate n-grams", Proc. IEEE ICASSP-2000 [ bib | pdf | Abstract ]
- Yoshi Gotoh and Steve Renals (1999). "Topic-based mixture language modelling", Journal of Natural Language Engineering, 5:355-375. [ bib | pdf | Abstract ]
LVCSR Search and Systems
Building speech recognition systems is fun, and during the 1990s we worked very hard on a connectionist/HMM hybrid system, ABBOT. I enjoyed writing decoders then (and still would, if I had the time...) - the NOWAY decoder was designed to decode sentences from the 20,000-word-vocabulary WSJ task in real time (on a 120 MHz Pentium with 64-96 MB of RAM!).
Selected Publications
- Steve Renals, Thomas Hain and Hervé Bourlard (2007). "Recognition and understanding of meetings: The AMI and AMIDA projects", in Proc. IEEE ASRU'07. [bib | pdf | abstract]
- AJ Robinson, GD Cook, DPW Ellis, E Fosler-Lussier, SJ Renals, and DAG Williams (2002). "Connectionist speech recognition of broadcast news", Speech Communication, 37:27-45. [ bib | pdf | Abstract ]
- Steve Renals and Michael Hochberg (1999). "Start-synchronous search for large vocabulary continuous speech recognition", IEEE Trans. on Speech and Audio Processing, 7:542-553. [ bib | pdf | Abstract ]
- The NOWAY decoder was packaged up as part of the SPRACHcore distribution. The code was written in the mid-90s - pre-STL! - but thanks to people at ICSI, it compiles under modern gcc (noway.tar.gz).
Speech synthesis
The biggest innovation in speech technology over the past decade has been the development of the trajectory HMM, and of HTS, the HMM-based speech synthesis system, by Tokuda and co-workers at NITech.
Selected Publications
- J. Yamagishi, T. Nose, H. Zen, Z. Ling, T. Toda, K. Tokuda, S. King and S. Renals (2009). "Robust Speaker-Adaptive HMM-based Text-to-Speech Synthesis", IEEE Trans. Audio, Speech and Language Processing, 17:1208-1230. [web | bib | pdf | abstract]
- Joao P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi, "Towards an improved modeling of the glottal source in statistical parametric speech synthesis", Proc. Sixth ISCA Speech Synthesis Workshop (SSW6) [ bib | pdf | Abstract ]
Acoustic-articulatory models
What can we infer about the state of the articulatory system from the acoustic signal? This is an intriguing machine learning problem - and solutions are likely to benefit recognition and synthesis.
Selected Publications
- L. Zhang and S. Renals (2008). "Acoustic-Articulatory Modelling with the Trajectory HMM", IEEE Signal Processing Letters, 15:245-248. [web | bib | pdf | abstract]
- Miguel Carreira-Perpiñán and Steve Renals (1999). "A latent-variable modelling approach to the acoustic-to-articulatory mapping problem", Proc. 14th Int. Congress of Phonetic Sciences, pages 2013-2016. [ bib | pdf | Abstract ]
- Miguel Carreira-Perpiñán and Steve Renals (1998). "Dimensionality reduction of electropalatographic data using latent variable models", Speech Communication, 26:259-282. [ bib | pdf | Abstract ]
Multimodal Interaction
Interaction and communication are multimodal. We have developed an instrumented meeting room to capture human communication in meetings across multiple modalities, and are working on automatic approaches to recognize, interpret and structure meetings. The annual MLMI workshops (Machine Learning for Multimodal Interfaces) are an effort to advance progress in this area.
Meetings
In the AMI and AMIDA projects we are interested in recognizing, interpreting, summarizing and structuring multiparty meetings. Summarization, dialogue act recognition and meeting phase segmentation are examples of the problems we are pursuing, along with meeting speech recognition.
Selected Publications
- The AMI Corpus
- Steve Renals, Thomas Hain and Hervé Bourlard (2007). "Recognition and understanding of meetings: The AMI and AMIDA projects", in Proc. IEEE ASRU'07. [ pdf ]
- A. Dielmann and S. Renals (2008). "Recognition of Dialogue Acts in Multiparty Meetings using a Switching DBN", IEEE Trans. Audio, Speech and Language Processing, in press. [ pdf ]
- Gabriel Murray and Steve Renals (2008). "Term-Weighting for Summarization of Multi-party Spoken Dialogues", in Proc. MLMI-07, Springer LNCS 4892, 156-167. [ pdf ]
- Alfred Dielmann and Steve Renals (2007). "DBN based joint dialogue act recognition of multiparty meetings", Proc. IEEE ICASSP'07. [ bib | pdf | Abstract ]
- Alfred Dielmann and Steve Renals (2007). "Automatic meeting segmentation using dynamic Bayesian networks", IEEE Trans. on Multimedia, 9(1):25-36. [ bib | Abstract ]
- M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and D. Zhang (2006). "Multimodal integration for meeting group action segmentation and recognition", In S. Renals and S. Bengio, editors, Proc. Multimodal Interaction and Related Machine Learning Algorithms Workshop (MLMI-05) pages 52-63. Springer. [ bib | Abstract ]
- G. Murray, S. Renals, J. Moore, and J. Carletta (2006). "Incorporating speaker and discourse features into speech summarization", Proc. HLT-NAACL 2006. [ bib | pdf | Abstract ]
- S. J. Wrigley, G. J. Brown, V. Wan, and S. Renals (2005). "Speech and crosstalk detection in multi-channel audio", IEEE Trans. on Speech and Audio Processing, 13:84-91. [ bib | pdf | Abstract ]
- Steve Renals and Dan Ellis (2003). "Audio information access from meeting rooms", Proc. IEEE ICASSP'03. [ bib | pdf | Abstract ]
Information Access from Speech
In addition to work on meetings and multimodal interaction, while at Sheffield we built systems for spoken document retrieval, named entity identification, summarization and automatic segmentation of speech such as broadcast news and voicemail. In the late 1990s we put considerable effort into developing systems for NIST evaluations in these areas.
Selected Publications
- Heidi Christensen, Yoshi Gotoh and Steve Renals (2008). "A cascaded broadcast news highlighter", IEEE Trans. Audio, Speech and Language Processing, 16:151-161. [ pdf ]
- Konstantinos Koumpis and Steve Renals (2005). "Content-based access to spoken audio", IEEE Signal Processing Magazine, 22(5):61-69. [ bib | pdf | Abstract ]
- Konstantinos Koumpis and Steve Renals (2005). "Automatic summarization of voicemail messages using lexical and prosodic features", ACM Transactions on Speech and Language Processing, 2(1):1-24. [ bib | pdf | Abstract ]
- Jerry Goldman, Steve Renals, Steven Bird, Franciska de Jong, Marcello Federico, Carl Fleischhauer, Mark Kornbluh, Lori Lamel, Doug Oard, Clare Stewart, and Richard Wright (2005). "Accessing the spoken word", International Journal of Digital Libraries, 5(4):287-298. [ bib | pdf | Abstract ]
- Heidi Christensen, Yoshi Gotoh, and Steve Renals (2001). "Punctuation annotation using statistical prosody models", Proc. ISCA Workshop on Prosody in Speech Recognition and Understanding [ bib | pdf | Abstract ]
- Steve Renals, Dave Abberley, David Kirby, and Tony Robinson (2000). "Indexing and retrieval of broadcast news", Speech Communication, 32:5-20 [ bib | pdf | Abstract ]
- Yoshi Gotoh and Steve Renals (2000). "Information extraction from broadcast news", Philosophical Transactions of the Royal Society of London, Series A, 358:1295-1310 [ bib | pdf | Abstract ]