Simon King: CSTR Personal Homepage

Simon King
Professor of Speech Processing
Linguistics and English Language
School of Philosophy, Psychology and Language Sciences
University of Edinburgh

and director of the

Centre for Speech Technology Research

Room 3.11, Informatics Forum
10 Crichton Street
Edinburgh EH8 9AB
United Kingdom

email:
teaching website: speech.zone

Research

A fundamental question is: What are the basic building blocks of speech? To answer this question, I am working in a number of areas.

In speech recognition, I am looking at new acoustic models, such as Linear Dynamical Models, factorial-HMMs and other graphical models that can represent speech not as 'beads on a string' but as streams of interacting factors. I've investigated ways to automatically find an inventory of suitable units to model, and have worked on other alternatives to phonetic units, such as graphemes. One long-standing interest is the use of phonological/acoustic/articulatory features and articulatory measurement data as a tool to develop models of speech.

In speech synthesis, I work on both unit selection methods and HMM-based speech synthesis. In both of these areas, the definition of the unit of speech is crucial. Both typically use context-dependent phonemes or diphones, so here we can gain some insight into the basic building blocks of speech by asking "What contextual features must we model?" In unit selection, this means learning the target cost, and in HMM-based speech synthesis, it relates to the clustering of acoustically similar units. Neither of these processes is entirely satisfactory, but improving them requires a better understanding of how we can construct speech from basic units.

I am increasingly interested in perceptual measures in speech synthesis, not just for evaluation of the final output, but within the synthesis process itself. In unit selection, perceptual measures should be used to determine equivalent units or contexts, because acoustic similarity and perceptual interchangeability are not the same thing. In HMM-based speech synthesis, the training criterion should be perceptual: perhaps minimum generation error gives us a way to use such a criterion? How can the requirements of acoustic modelling fit with this idea of perceptual equivalence?

In both recognition and synthesis, I have recently started work on multilingual systems as an additional way to look at the basic units of speech. Is there a universal set of building blocks for speech, and can we build systems that use common models or unit inventories for multiple languages?

Current research funding
- NST - Natural Speech Technology (EPSRC 2011 - 2016)
- Simple4All - Speech synthesis that improves through adaptive learning (EC FP7 2011 - 2014)
- INSPIRE - Marie Curie Initial Training Network (EC FP7 2012 - )
Recently completed research grants
- Effective Multilingual Interaction in Mobile Environments - EMIME (EC FP7 March 2008 - Feb 2011)
- Study of Source Features for Speech Synthesis and Speaker Recognition (UKIERI April 2007 - March 2011)
- LISTA - The Listening Talker (EC FP7 2010 - 2013)

Publications

See my publications page

Positions I hold
- Member of Editorial Board of Computer Speech and Language
- Co-organiser of the Blizzard series of speech synthesis evaluations
- Member of the IEEE Speech and Language Processing Technical Committee
and positions I have recently held
- Associate editor from 2006 to 2009 of IEEE Transactions on Audio, Speech and Language Processing
- Secretary and Treasurer of ISCA Speech Synthesis Special Interest Group (SynSIG)
- Board member of the European Masters in Language and Speech

Research fellows I currently work with
- Mirjam Wester - LISTA and NST projects
- Christophe Veaux - voice reconstruction
- Oliver Watts - NST project
- Cássia Valentini Botinhão - speech synthesis
- Gustav Henter - NST project
- Zhizheng Wu - NST project
PhD students (in chronological order)
- As principal or co-supervisor
  - Rasmus Dall - speech synthesis
  - Tom Merritt - speech synthesis
  - Srikanth Ronanki - prosody for speech synthesis
  - Felipe Espic - waveform generation for speech synthesis
- As second supervisor or advisor
Former students

If you are interested in studying for a PhD at CSTR, you can find more information here or here

Travel plans / busy periods

This calendar displays all dates for the next few months when I will be away from the office. Click an event to get details and location.

Teaching

Office hours

By appointment - you can arrange a time via Doodle.

Courses

Speech Processing
Speech Synthesis
Programme Director for the M.Sc. in Speech and Language Processing

Personal

Why not look at some nice bike rides and walks, or learn Spanish?