Simon King

Professor of Speech Processing
Linguistics and English Language
School of Philosophy, Psychology and Language Sciences
University of Edinburgh
and director of the
Centre for Speech Technology Research
Room 3.11, Informatics Forum
10 Crichton Street
Edinburgh EH8 9AB
United Kingdom
Tel: +44 131 651 1725
Fax: +44 131 650 4587
blog: speech.zone


A fundamental question is: What are the basic building blocks of speech? To answer this question, I am working in a number of areas.

In speech recognition, I am looking at new acoustic models, such as Linear Dynamical Models, factorial HMMs and other graphical models that can represent speech not as 'beads on a string' but as streams of interacting factors. I've investigated ways to automatically find an inventory of suitable units to model, and have worked on other alternatives to phonetic units, such as graphemes. One long-standing interest is the use of phonological/acoustic/articulatory features and articulatory measurement data as a tool to develop models of speech.

In speech synthesis, I work on both unit selection methods and HMM-based speech synthesis. In both of these areas, the definition of the unit of speech is crucial. Both typically use context-dependent phonemes or diphones so, in this context, we can gain some insight into the basic building blocks of speech by asking "What contextual features must we model?" In unit selection, this means learning the target cost; in HMM-based speech synthesis, it relates to the clustering of acoustically similar units. Neither of these processes is entirely satisfactory, but improving them requires a better understanding of how we can construct speech from basic units.

I am increasingly interested in perceptual measures in speech synthesis, not just for evaluation of the final output, but within the synthesis process itself. In unit selection, perceptual measures should be used to determine equivalent units or contexts, because acoustic similarity and perceptual interchangeability are not the same thing. In HMM-based speech synthesis, the training criterion should be perceptual: perhaps minimum generation error gives us a way to use such a criterion? How can the requirements of acoustic modelling fit with this idea of perceptual equivalence?

In both recognition and synthesis, I have recently started work on multilingual systems as an additional way to look at the basic units of speech. Is there a universal set of building blocks for speech, and can we build systems that use common models or unit inventories for multiple languages?


See my publications page

If you are interested in studying for a PhD at CSTR, you can find more information here or here

Travel plans / busy periods

This calendar displays all dates for the next few months when I will be away from the office. Click an event to get details and location.


Office hours

By appointment - you can arrange a time via Doodle.



Why not look at some nice bike rides and walks, or learn Spanish?