My research interests lie in speech information processing, and I have mainly concentrated on statistical speech synthesis. However, my research bridges the gaps between speech synthesis and other fields, including machine learning, speech production, and linguistics. My goals are to place speech synthesis on a more scientific basis and to create commercially useful technologies.

Educational History

2002 BEng, computer science, Tokyo Institute of Technology
2003 MEng, information processing, Tokyo Institute of Technology
2006 Ph.D., information processing, Tokyo Institute of Technology
I pioneered the use of speaker adaptation techniques in HMM-based speech synthesis in my doctoral dissertation ‘Average-voice-based speech synthesis’, which won the Tejima Doctoral Dissertation Award 2007.


Intern Researcher,
Advanced Telecommunications Research Institute International (ATR), Japan
October 2003 – March 2006

In parallel with my Master's and Ph.D. studies, I was a member of the research staff at ATR, Japan for two and a half years. ATR is widely recognised as one of the leading research labs in this field. Under the supervision of Dr. Kawai, I co-developed XIMERA (press release), a novel hybrid text-to-speech synthesis system combining HMMs and unit selection. In XIMERA, acoustic and prosodic parameters generated from HMMs serve as target parameters in the objective functions for unit selection; I was in charge of the HMM component. This hybrid approach to speech synthesis, pioneered in XIMERA, has since become widely popular.
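The core idea of the hybrid approach can be sketched as a Viterbi search over candidate units, where the target cost measures the distance to HMM-generated acoustic and prosodic trajectories. This is a minimal illustrative sketch, not XIMERA's actual implementation; all names, features, and weights are assumptions for the example.

```python
def target_cost(candidate, hmm_target, weights=(1.0, 0.5)):
    """Weighted distance between a candidate unit's features and the
    HMM-generated target (here a toy spectrum vector and an F0 value)."""
    w_spec, w_f0 = weights
    spec = sum((c - t) ** 2 for c, t in zip(candidate["spec"], hmm_target["spec"]))
    f0 = (candidate["f0"] - hmm_target["f0"]) ** 2
    return w_spec * spec + w_f0 * f0

def join_cost(prev, cur):
    """Spectral discontinuity at the concatenation point."""
    return sum((a - b) ** 2 for a, b in zip(prev["spec"], cur["spec"]))

def select_units(candidates_per_pos, hmm_targets):
    """Viterbi search for the candidate-index sequence minimising
    total target cost plus join cost."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j at position i
    best = [[(target_cost(c, hmm_targets[0]), None)
             for c in candidates_per_pos[0]]]
    for i in range(1, len(candidates_per_pos)):
        row = []
        for c in candidates_per_pos[i]:
            tc = target_cost(c, hmm_targets[i])
            cost, back = min(
                (best[i - 1][j][0] + join_cost(p, c) + tc, j)
                for j, p in enumerate(candidates_per_pos[i - 1]))
            row.append((cost, back))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(best) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))
```

In the real system the features, costs, and weights are far richer, but the structure, HMM output as the target and dynamic programming over the unit inventory, is the same.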

Research Fellow,
Japan Society for the Promotion of Science (JSPS), Japan
April 2004 – March 2007

From 2004 to 2007 I held a research fellowship from JSPS. Under Dr. Kobayashi's supervision, I mainly worked on HMM-based speech synthesis using average voice models. Specifically, I proposed several core techniques for the speaker-adaptive approach to HMM-based speech synthesis: a shared decision tree clustering technique for the average voice models, the integration of speaker-adaptive training into HMM-based speech synthesis, simultaneous speaker adaptation algorithms for the spectrum, fundamental frequency, and duration using hidden semi-Markov models, and an advanced speaker adaptation algorithm called CSMAPLR. Almost all of these techniques are implemented in ‘HTS’ (version 2.1 or later), a free research software toolkit for statistical speech synthesis. The HTS toolkit is now used all over the world by both academic and commercial organizations, and has been downloaded over 6000 times. From 2006 to 2007, I was a visiting researcher at CSTR, University of Edinburgh, where I integrated speaker-adaptive HMM-based speech synthesis into CSTR's open-source Festival system (a very widely used text-to-speech synthesis system, included in almost all major Linux distributions) and built UK English HTS systems. My work on speaker-adaptive synthesis led directly to the EC FP7 project EMIME.
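The adaptation algorithms above belong to the linear-regression family, in which an affine transform maps the average-voice Gaussian means towards a target speaker's data. The sketch below shows the simplest member of that family, a single least-squares transform under an identity-covariance simplification; CSMAPLR adds constrained transforms and structural MAP priors on top of this idea. All names here are illustrative, not HTS code.

```python
import numpy as np

def estimate_transform(means, frames, assign):
    """Estimate one affine transform W = [A b] mapping average-voice
    Gaussian means towards adaptation data, by least squares over
    frames aligned to components (identity covariances assumed)."""
    d = means.shape[1]
    G = np.zeros((d + 1, d + 1))   # accumulates xi xi^T
    K = np.zeros((d, d + 1))       # accumulates x xi^T
    for x, k in zip(frames, assign):
        xi = np.append(means[k], 1.0)   # extended mean vector [mu; 1]
        G += np.outer(xi, xi)
        K += np.outer(x, xi)
    return K @ np.linalg.pinv(G)       # W minimising sum ||x - W xi||^2

def adapt_means(means, W):
    """Apply the transform: mu_adapted = A mu + b for every component."""
    ext = np.hstack([means, np.ones((means.shape[0], 1))])
    return ext @ W.T
```

Because one transform is shared across many Gaussians, a small amount of target-speaker speech is enough to move the whole average voice model towards the new speaker, which is what makes the speaker-adaptive approach so data-efficient.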


Senior Research Fellow,

The Centre for Speech Technology Research (CSTR),
University of Edinburgh, U.K.

April 2007 – November 2011

Since 2007 I have been a senior research fellow at CSTR, University of Edinburgh. I mainly work on HMM-based speech synthesis, especially model adaptation, in the EMIME project, as well as applications such as visual speech synthesis and articulatory-controllable HMM-based speech synthesis. I also co-supervise several PhD students. Model adaptation has become one of the most compelling reasons for choosing the HMM-based speech synthesis framework over alternatives such as unit selection. Using model adaptation, we have created thousands of different voices, including various emotional voices, different dialects and accents (e.g. Scottish accents), several languages, child voices, and voices created from recordings of well-known people. I am currently extending this technique to operate across languages; this cross-lingual adaptation is the basis of the EMIME project.
Each year, I collaborate with colleagues to enter a system into the common evaluation known as the Blizzard Challenge. In 2008, our system was found to be as intelligible as human speech, a landmark achievement for speech synthesis.


EPSRC Career Acceleration Fellow,

The Centre for Speech Technology Research (CSTR),
University of Edinburgh, U.K.

November 2011 – Present


Associate, CereProc, U.K.
April 2008 – December 2010


Visiting Associate Professor,
Speech Processing Laboratory,
Nagoya Institute of Technology, Japan
April 2010 – Present

The Euan MacDonald Centre for Motor Neurone Disease Research
October 2010 – Present

Other Professional Activities

As well as publishing my work in conference proceedings and journal articles, I disseminate my ideas in the form of software and high-quality HTS voices. I am a key member of the teams developing two major free, open-source research software packages for speech synthesis: HTS and SPTK. I am also actively involved in the development of Festival.
I am on the review committees of several major international conferences in speech information processing (e.g., ICASSP, Interspeech, and SSW) and of several leading journals (e.g., Speech Communication and IEEE Transactions on Audio, Speech, and Language Processing).