Simple4All kick off

The EC FP7 project Simple4All has started.
This is a three-year EC collaborative project.
The partners are Aalto University, the University of Helsinki, Universidad Politécnica de Madrid and the Technical University of Cluj-Napoca.



uDialogue kick off meeting

The kick-off meeting of the JST CREST project "uDialogue" was held in Nagoya, Japan.
This is a five-year project supported by JST CREST.


A new journal paper

A new journal paper by Adriana (Technical University of Cluj-Napoca) has been published in Speech Communication!

This paper introduces a new speech corpus named "RSS" and HMM-based speech synthesis systems using higher sampling rates such as 48 kHz. The abstract follows.

This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called “RSS”, along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given.
Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.

CSTR HTS Voice Library (updated to version 0.96)

This is a library of advanced HTS voices for CSTR's Festival text-to-speech synthesis system. It includes more than 40 high-quality English voices. Currently, General American, British Received Pronunciation and Scottish English voices, as well as several Spanish voices, are included; at this moment they are licensed for research use only. For commercial use, please contact us.

Download URL

Open Source Initiatives for Speech Synthesis

Prof. Tokuda and I organised a special session named "Open Source Initiatives for Speech Synthesis" at SSW7 in Kyoto. Thank you to everyone who attended the meeting!


CereProc's HTS demo

Please try CereProc's new American HTS voice in their live demo:
From the options list, choose 'HTS (American male)' and click the play button.
You can hear very smooth, high-quality synthetic speech!

Odyssey 2010: The Speaker and Language Recognition Workshop

I gave a presentation at Odyssey 2010, the speaker and language recognition workshop, held in Brno, Czech Republic. This work is a collaboration on speaker verification systems with Phillip from NMSU and Michael from ftw. We reported that synthesised speech generated by speaker-adaptive HMM-based speech synthesis systems is of sufficiently high quality to allow these synthesised voices to pass for true human claimants, despite the good performance of the speaker verification systems. We emphasised that new features or strategies are needed to discriminate synthetic speech from real speech, because the conventional methods for detecting synthetic speech are no longer robust enough.

Slides (pdf format)

2010 OCG-Förderpreis

The 2010 OCG-Förderpreis of the Austrian Computer Society has been awarded to Dietmar Schabus (from ftw) for his thesis titled "Interpolation of Austrian German and Viennese Dialect/Sociolect in HMM-based Speech Synthesis". Congratulations!!

The work in this nice thesis has also been published as a journal paper in Speech Communication:
M. Pucher, D. Schabus, J. Yamagishi, and F. Neubarth,
"Modeling and Interpolation of Austrian German and Viennese Dialect in HMM-based Speech Synthesis,"
Speech Communication,
vol. 52, no. 2, pp. 164-179, February 2010.
(See the paper on ScienceDirect.)

Two new journal papers!

Two new journal papers have been published in IEEE Transactions on Audio, Speech, and Language Processing!


The first paper describes the thousands of voices which you can try in the 'Voices of the World' demos. The second paper describes child speech created using HMM adaptation and voice conversion techniques.

Thousands of Voices for HMM-Based Speech Synthesis – Analysis and Application of TTS Systems Built on Various ASR Corpora
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

Synthesis of Child Speech With HMM Adaptation and Voice Conversion
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.

A new FP7 EC project 'LISTA' kicked off!

We have had the kick-off meeting of a new EC FP7 project, 'LISTA' (2010-2013), where we introduced recent work on HMM-based speech synthesis, such as articulatory controllable HMM-based speech synthesis and the use of hyper-articulation under noisy conditions. This is a collaborative project with the University of the Basque Country (Spain), KTH (Sweden) and ICS-FORTH (Greece).

The following is quoted from the project summary:

Speech is efficient and robust, and remains the method of choice for human communication. Consequently, speech output is used increasingly to deliver information in automated systems such as talking GPS and live-but-remote forms such as public address systems. However, these systems are essentially one-way, output-oriented technologies that lack an essential ingredient of human interaction: communication. When people speak, they also listen. When machines speak, they do not listen. As a result, there is no guarantee that the intended message is intelligible, appropriate or well-timed. The current generation of speech output technology is deaf, incapable of adapting to the listener's context, inefficient in use and lacking the naturalness that comes from rapid appreciation of the speaker-listener environment. Crucially, when speech output is employed in safety-critical environments such as vehicles and factories, inappropriate interventions can increase the chance of accidents through divided attention, while similar problems can result from the fatiguing effect of unnatural speech. In less critical environments, crude solutions involve setting the gain level of the output signal to a level that is unpleasant, repetitive and at times distorted. All of these applications of speech output will, in the future, be subject to more sophisticated treatments based at least in part on understanding how humans communicate.


The purpose of the EU-funded LISTA project (the Listening Talker) is to develop the scientific foundations needed to enable the next generation of spoken output technologies. LISTA will target all forms of generated speech -- synthetic, recorded and live -- by observing how listeners modify their production patterns in realistic environments that are characterised by noise and natural, rapid interactions. Parties to a communication are both listeners and talkers. By listening while talking, speakers can reduce the impact of noise and reverberation at the ears of their interlocutor. And by talking while listening, speakers can indicate understanding, agreement and a range of other signals that make natural dialogs fluid and not the sequence of monologues that characterise current human-computer interaction. Both noise and natural interactions demand rapid adjustments, including shifts in spectral balance, pauses, expansion of the vowel space, and changes in speech rate and hence should be considered as part of the wider LISTA vision. LISTA will build a unified framework for treating all forms of generated speech output to take advantage of commonalities in the levels at which interventions can be made (e.g., signal, vocoder parameters, statistical model, prosodic hierarchy).

CSTR leads WP3 on synthetic speech modification.

Joint workshop with Toshiba and Phonetic Arts

EMIME organised a small joint workshop on HMM-based speech synthesis in Cambridge, where we had pleasant and in-depth discussions with people from Toshiba and Phonetic Arts. The meeting was held in conjunction with the EMIME board and annual review meetings.


HTS version 2.1.1

A new version of HTS (version 2.1.1) has been released by the HTS working group. Since this version is based on HTK-3.4.1, speaker adaptation using linear transforms and speaker adaptive training (SAT) are faster and demand less memory, even for large regression-class trees.

Festival ver 2.0.95

The Festival Speech Synthesis System version 2.0.95 and
Edinburgh Speech Tools Library version 2.0.95

April 2010

Surprisingly we have a new release.  Please give feedback on installation issues so they can be fixed in a 2.1 release.

Festival offers a general framework for building speech synthesis systems, as well as including examples of various modules.  As a whole it offers full text-to-speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface.  Festival is multi-lingual (currently English (British and American) and Spanish), though English is the most advanced.  Other groups release new languages for the system.  Full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project.  This version also supports voices built with the latest version of Nagoya Institute of Technology's HTS system.
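As a small illustration of the shell and Scheme APIs mentioned above, here is a hedged sketch of driving Festival from the command line. It assumes Festival is installed and on the PATH; the voice function name corresponds to the `cmu_us_slt_arctic_hts` voice listed in this distribution, but available voices vary by installation.

```shell
# Text-to-speech mode: speak the contents of a text file with the default voice.
echo "Hello from Festival." > /tmp/hello.txt
festival --tts /tmp/hello.txt

# Batch mode: evaluate Scheme commands directly. Here we select the
# HTS voice shipped with this release, then synthesise a sentence.
festival -b '(voice_cmu_us_slt_arctic_hts)' '(SayText "Hello from Festival.")'

# Synthesise to a waveform file instead of the audio device.
festival -b '(utt.save.wave (utt.synth (Utterance Text "Hello")) "/tmp/hello.wav" "riff")'
```

The same Scheme commands can be typed interactively at the `festival>` prompt, which is convenient when experimenting with different voices.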

The system is written in C++ and uses the Edinburgh Speech Tools Library for its low-level architecture, with a Scheme (SIOD) based command interpreter for control.  Documentation is given in the FSF texinfo format, from which a printed manual, info files and HTML can be generated.

Festival is free software.  Festival and the speech tools are distributed under an X11-type licence allowing unrestricted commercial and non-commercial use alike.

This distribution includes:
 * Full English (British and American English) text to speech
 * Full C++ source for modules, SIOD interpreter, and Scheme library
 * Lexicon based on CMULEX and OALD (OALD is restricted to non-commercial use only)
 * Edinburgh Speech Tools, low level C++ library
 * rab_diphone: British English male, residual LPC diphone
 * kal_diphone: American English male, residual LPC diphone
 * cmu_us_slt_arctic_hts: American English female, HTS
 * cmu_us_rms_cg: American English male, Clustergen
 * cmu_us_awb_cg: Scottish English male (with US frontend), Clustergen
 * Full documentation (html, postscript and GNU info format)

Note there are some licence restrictions on the voices themselves. The US English voices have the same restrictions as Festival.  The UK lexicon (OALD) is restricted to non-commercial use.

Additional voices are also available.

Festival version 2.0.95 sources, voices

In Europe:
In North America:


To run Festival you need:
 * A Unix-like environment, e.g. Linux, FreeBSD, OS X, or Cygwin under Windows
 * A C++ compiler: we have used GCC versions 2.x up to 4.1
 * GNU Make (any recent version)

New in 2.0.95
 * Support for the new versions of C++ that have been released
 * Integrated and updated support for HTS, Clustergen, Multisyn and Clunits voices
 * "Building Voices in Festival" document describing process of building new voices in the system

Alan W Black (CMU)
Rob Clark (Edinburgh)
Junichi Yamagishi (Edinburgh)
Keiichiro Oura (Nagoya)


HTS voices used at CERN

HTS voices which I made are used by various organisations, including CERN (the European Organisation for Nuclear Research). The following is a CERN movie about their recent memorable experiment with the Large Hadron Collider, the largest particle accelerator in the world. You can hear a British English male HTS voice say, e.g., "Wire scanners, fly the wire" at the beginning. In addition, you can hear the HTS voice announce the situation several times during the movie (05:26, 05:49, 13:30, 13:52, 14:45, 26:05, 32:04, etc.).

Rob, Korin and I are very glad we were of help to CERN. According to CERN staff, Festival and the HTS voices are in constant use there.

Blizzard Challenge

Oliver and I participated in the 2010 Blizzard Challenge (an open evaluation of speech synthesis using a common database). The following are audio samples of synthetic speech generated by the new HMM-based speech synthesis system which we built for the 2010 challenge:


The evaluation results, comparing over 20 TTS systems, will be published this summer.

Prof. Tokuda's interview

Prof. Tokuda has been interviewed about HMM-based speech synthesis and kindly introduced some celebrity voices which I made using speaker adaptation. You can watch the news on YouTube; the part of the interview about HTS starts at around 3:00.

Similar samples were introduced in the following NHK radio programme.

University of Sheffield's voice reconstruction project

A book on computer-synthesised speech technologies for aiding impairment has been published by IGI Global. The book includes the following chapter, which describes the outcomes and future directions of the voice reconstruction project carried out by the University of Sheffield. HTS and adaptation frameworks are used in this clinical application.

S. Creer, P. Green, S. Cunningham, and J. Yamagishi,
"Building personalised synthesised voices for individuals with dysarthria using the HTS toolkit,"
Computer Synthesized Speech Technologies: Tools for Aiding Impairment,
John W. Mullennix and Steven E. Stern (Eds.), IGI Global Press, Jan. 2010.
ISBN: 978-1-61520-725-1

The University of Sheffield's voice reconstruction project was featured in several news articles last year.
Press release 1 Press release 2
Telegraph 1 Telegraph 2
Yorkshire Post
Times India