This is a library of advanced HTS voices for CSTR's Festival text-to-speech synthesis system. It includes more than 40 high-quality English voices, some of which are available from http://www.cstr.ed.ac.uk/projects/festival/morevoices.html Currently, General American, British Received Pronunciation and Scottish English voices, along with several Spanish voices, are included. At the moment they are licensed for research use only. For commercial use, please contact us.
The following is the title and abstract
Measuring the Gap Between HMM-Based ASR and TTS
The EMIME European project is conducting research in the development of technologies for mobile, personalized speech-to-speech translation systems. The hidden Markov model (HMM) is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components; thus, the investigation of unified statistical modeling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper, we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality, HMM topology, and speaker adaptation. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modeling approaches.
The following is the title and abstract
HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering
This paper describes a hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed method, speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and thus parametrized into excitation and spectral features. The source and filter features are modeled individually in the framework of HMM and generated in the synthesis stage according to the text input. The glottal excitation is synthesized through interpolating and concatenating natural glottal flow pulses, and the excitation signal is further modified according to the spectrum of the desired voice source characteristics. Speech is synthesized by filtering the reconstructed source signal with the vocal tract filter. Experiments show that the proposed system is capable of generating natural sounding speech, and the quality is clearly better compared to two HMM-based speech synthesis systems based on widely used vocoder techniques.
The Romanian speech synthesis (RSS) corpus is a free large-scale Romanian speech corpus that includes about 3000 sentences uttered by a native female speaker. The RSS corpus was designed mainly for text-to-speech synthesis and was recorded in a hemianechoic chamber (anechoic walls and ceiling; floor partially anechoic) at the University of Edinburgh. We used three high quality studio microphones: a Neumann u89i (large diaphragm condenser), a Sennheiser MKH 800 (small diaphragm condenser with very wide bandwidth) and a DPA 4035 (headset-mounted condenser). Although the current release includes only speech data recorded via Sennheiser MKH 800, we may release speech data recorded via other microphones in the future. All recordings were made at 96 kHz sampling frequency and 24 bits per sample, then downsampled to 48 kHz sampling frequency. For recording, downsampling and bit rate conversion, we used ProTools HD hardware and software. We conducted 8 sessions over the course of a month, recording about 500 sentences in each session. At the start of each session, the speaker listened to a previously recorded sample, in order to attain a similar voice quality and intonation.
- 1st Sept, Talk at Aholab, University of Basque Country, Bilbao, Spain
- 2nd and 3rd Sept, LISTA project meeting, Vitoria, Spain
- 17th, 19th (National working day!), and 20th Sept, Talk at Nokia research center, Beijing, China
- 22nd, 23rd, and 24th Sept, Speech synthesis workshop 7, Presentation for EMIME work
- 24th Sept, Open Source Initiatives for Speech Synthesis, Presentation for 'open-source/creative commons' speech database
- 25th Sept, The 2010 Blizzard Challenge, Presentation for 'CSTR/EMIME entry for the 2010 Blizzard Challenge'
- 27th to 30th Sept, Interspeech 2010, Presentation for 'Roles of the average voice in speaker-adaptive HMM-based speech synthesis'
An Open Position for Postdoctoral Research Associate
The Centre for Speech Technology Research (CSTR)
University of Edinburgh
The School of Informatics at the University of Edinburgh invites applications for the post of Postdoctoral Research Associate on a project concerning voice reconstruction and personalised voice communication aids. The project will develop clinical applications of speaker-adaptive statistical text-to-speech synthesis in collaboration with the Euan MacDonald Centre, who are funding this project. Applications include the reconstruction of voices of patients who have disordered speech as a consequence of Motor Neurone Disease, by using statistical parametric model adaptation. The project will also investigate better voice reconstruction methods.
You will be part of a dynamic and creative research team within the Centre for Speech Technology Research, at the forefront of developments in statistical speech synthesis. The application of statistical parametric speech synthesis to clinical applications such as voice banking, voice reconstruction and assistive devices, is an exciting new development and an area in which we expect to have increased research activity in the coming years. We are seeking additional long-term funding for this work and there may be the possibility of extending this Research Associate position.
You have (or will be near completion of) a PhD in speech processing, computer science, cognitive science, linguistics, engineering, mathematics, or a related discipline.
You will have the necessary programming ability to conduct research in this area, a background in statistical modelling using Hidden Markov Models and strong experimental planning and execution skills.
A background in one or more of the following areas is also desirable: statistical parametric text-to-speech synthesis using HMMs and HSMMs; speaker adaptation using the MLLR or MAP family of techniques; familiarity with software tools including HTK, HTS and Festival; and the ability to implement web applications. Familiarity with the issues surrounding degenerative diseases which affect speech, such as Motor Neurone Disease, Parkinson's disease, Cerebral Palsy or Multiple Sclerosis, is also desirable.
For further information, see http://www.jobs.ed.ac.uk/vacancies/index.cfm?fuseaction=vacancies.detail&vacancy_ref=3013390
New and emerging applications of speech synthesis
Until recently, text-to-speech was often just an 'optional extra' which allowed text to be read out loud. But now, thanks to statistical and machine learning approaches, speech synthesis can mean more than just the reading out of text in a predefined voice. New research areas and more interesting applications are emerging.
In this tutorial, after a quick overview of the basic approaches to statistical speech synthesis including speaker adaptation, we consider some of these new applications of speech synthesis. We look behind each application at the underlying techniques used and describe the scientific advances that have made them possible. The applications we will examine include personalised speech-to-speech translation, 'robust speech synthesis' (making thousands of different voices automatically from imperfect data), clinical applications such as voice reconstruction of patients who have disordered speech, and articulatory-controllable statistical speech synthesis.
The really interesting problems still to be solved in speech synthesis go beyond simply improving 'quality' or 'naturalness' (typically measured using Mean Opinion Scores). The key problem of personalised speech-to-speech translation is to reproduce or transfer speaker characteristics across languages. The aim of robust speech synthesis is to create good quality synthetic speech from noisy and imperfect data. The core problems in voice reconstruction centre around retaining or reconstructing the original characteristics of patients, given only samples of their disordered speech.
We illustrate our multidisciplinary approach to speech synthesis, bringing in techniques and knowledge from ASR, speech enhancement and speech production in order to develop the techniques required for these new applications. We will conclude by attempting to predict some future directions of speech synthesis.
See you in Taiwan!
From the option list, please choose 'HTS (American male)' and click the play button.
You can hear very good quality and smooth synthetic speech!
This paper carefully analyses how we can use HMMs to predict articulatory movements from given speech and/or text. The following is the abstract.
This paper presents an investigation into predicting the movement of a speaker’s mouth from text input using hidden Markov models (HMM). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average root mean square (RMS) error of 1.076 mm and an average correlation coefficient of 0.812 are achieved.
Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
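For readers unfamiliar with the two figures quoted in the abstract, here is a minimal sketch of how an RMS error and a correlation coefficient between a predicted and a measured trajectory could be computed. The function names and the sample values are hypothetical, and the paper may average over channels and utterances in a way this sketch does not reproduce.

```python
import math


def rms_error(predicted, measured):
    """Root mean square error between two equal-length trajectories.

    The result is in the same unit as the inputs (e.g. mm for EMA data).
    """
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(predicted, measured):
    """Pearson correlation coefficient between two equal-length trajectories."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)


# Hypothetical positions of one articulator channel, in mm (not data from the paper).
measured = [10.0, 11.5, 13.0, 12.0, 10.5]
predicted = [10.4, 11.0, 12.2, 12.5, 11.0]
print(rms_error(predicted, measured), correlation(predicted, measured))
```

A low RMS error says the predicted positions are close to the measured ones, while a high correlation says the shape of the movement is tracked, so the two numbers are usually reported together.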
According to Google Translate:
Have you wondered what it would sound like if you could speak Japanese or Finnish as effortlessly as your mother tongue? Within a few years, a translation function with voice mimicry be available in the mobile phone.
It is the EU-funded research project that developed EMIME translation function with voice imitation. Speech technology expert Mikko Kurimo at the Helsinki University of Technology, explains that one can conduct an entire conversation using his mobile was where it sounds like you speak and understand one another's language without having had to devote years to the flexing and rattling study your words.
There is no easy task researchers have assumed when they tried to create a voice imitating translators. First, the understanding of what you say, then make an accurate translation, and so create a sound file where it sounds like you're saying the same thing, yet at the Japanese example.
It will require further around five years of development before we can have the function of our cell phones, think Mikko Kurimo, but considering that there is a high point in getting away from the erased and emotionally liberated computer generated voices that exist in today's translation software.
- The default is cast as the translation software today is very boring and always the same. If you want to express something, it's much better to have your own voice there.
Task EH1 (4 hours of speech data,
According to Wilcoxon signed rank tests with Bonferroni correction of alpha (1% level), system V is not as good as systems M, J and T. There is no significant difference between systems V and B.
Task EH2 (1 hour of speech data, Arctic sentences, Speaker Roger)
Likewise, according to Wilcoxon signed rank tests with Bonferroni correction of alpha (1% level), system V is the second best and is significantly better than B.
Task ES1 (100 Arctic sentences, Speaker Roger)
Likewise, according to Wilcoxon signed rank tests with Bonferroni correction of alpha (1% level), systems M and V are the equal best.
In summary, the new HTS system performs very well on the small dataset and on the 1-hour dataset. Even on the 4-hour dataset, it is as good as the Festival unit-selection system.
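As a rough illustration of the kind of test quoted above, here is a minimal sketch of a two-sided Wilcoxon signed rank test on paired listener scores, followed by a Bonferroni correction of the 1% level. Everything in it is an assumption for illustration: the function names, the exact (enumeration-based) p-value, and the scores themselves are hypothetical and are not the Blizzard Challenge data or the organisers' actual procedure.

```python
def signed_rank_statistic(a, b):
    """Return (doubled W, doubled ranks) for paired scores a and b.

    Zero differences are dropped. Ranks are stored multiplied by 2 so that
    the average ranks given to tied differences stay integers.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        # Positions i..j (0-based) share the average rank (i + j + 2) / 2.
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    w_minus2 = sum(r for r, d in zip(ranks2, diffs) if d < 0)
    return min(w_plus2, w_minus2), ranks2


def wilcoxon_exact_p(a, b):
    """Two-sided exact p-value, counting all sign assignments of the ranks."""
    w2, ranks2 = signed_rank_statistic(a, b)
    total = sum(ranks2)
    counts = [0] * (total + 1)  # counts[s] = number of sign patterns with doubled W+ == s
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    lower_tail = sum(counts[: w2 + 1])
    return min(1.0, 2.0 * lower_tail / 2 ** len(ranks2))


def bonferroni_significant(p_values, alpha=0.01):
    """Flag each p-value as significant at level alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical 1-5 opinion scores from the same ten listeners for three systems.
scores = {
    "V": [4, 5, 4, 5, 4, 5, 4, 5, 5, 4],
    "B": [3, 3, 2, 4, 3, 3, 2, 4, 4, 3],
    "M": [4, 4, 5, 5, 4, 5, 5, 4, 5, 4],
}
pairs = [("V", "B"), ("V", "M"), ("M", "B")]
p_values = [wilcoxon_exact_p(scores[x], scores[y]) for x, y in pairs]
for pair, p, sig in zip(pairs, p_values, bonferroni_significant(p_values)):
    print(pair, p, "significant" if sig else "not significant")
```

The exact enumeration is only practical for small listener counts; with many listeners a normal approximation or a statistics package would be the more usual choice.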
Slides (pdf format)
A journal paper based on this nice thesis has also been published in Speech Communication:
M. Pucher, D. Schabus, J. Yamagishi, F. Neubarth
“Modeling and Interpolation of Austrian German and Viennese Dialect in HMM-based Speech Synthesis,”
Volume 52, Issue 2, Pages 164-179, February 2010
(See paper on Science Direct)
The first paper describes the thousands of voices that you can see in the 'voices of the world' demos. The second paper describes child speech created using HMM adaptation and voice conversion techniques.
Thousands of Voices for HMM-Based Speech Synthesis – Analysis and Application of TTS Systems Built on Various ASR Corpora
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
Synthesis of Child Speech With HMM Adaptation and Voice Conversion
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.
The following is copied from http://listening-talker.org/
Speech is efficient and robust, and remains the method of choice for human communication. Consequently, speech output is used increasingly to deliver information in automated systems such as talking GPS and live-but-remote forms such as public address systems. However, these systems are essentially one-way, output-oriented technologies that lack an essential ingredient of human interaction: communication. When people speak, they also listen. When machines speak, they do not listen. As a result, there is no guarantee that the intended message is intelligible, appropriate or well-timed. The current generation of speech output technology is deaf, incapable of adapting to the listener's context, inefficient in use and lacking the naturalness that comes from rapid appreciation of the speaker-listener environment. Crucially, when speech output is employed in safety-critical environments such as vehicles and factories, inappropriate interventions can increase the chance of accidents through divided attention, while similar problems can result from the fatiguing effect of unnatural speech. In less critical environments, crude solutions involve setting the gain level of the output signal to a level that is unpleasant, repetitive and at times distorted. All of these applications of speech output will, in the future, be subject to more sophisticated treatments based at least in part on understanding how humans communicate.
The purpose of the EU-funded LISTA project (the Listening Talker) is to develop the scientific foundations needed to enable the next generation of spoken output technologies. LISTA will target all forms of generated speech -- synthetic, recorded and live -- by observing how listeners modify their production patterns in realistic environments that are characterised by noise and natural, rapid interactions. Parties to a communication are both listeners and talkers. By listening while talking, speakers can reduce the impact of noise and reverberation at the ears of their interlocutor. And by talking while listening, speakers can indicate understanding, agreement and a range of other signals that make natural dialogs fluid and not the sequence of monologues that characterise current human-computer interaction. Both noise and natural interactions demand rapid adjustments, including shifts in spectral balance, pauses, expansion of the vowel space, and changes in speech rate and hence should be considered as part of the wider LISTA vision. LISTA will build a unified framework for treating all forms of generated speech output to take advantage of commonalities in the levels at which interventions can be made (e.g., signal, vocoder parameters, statistical model, prosodic hierarchy).
CSTR leads WP3, on synthetic speech modifications.
The Festival Speech Synthesis System version 2.0.95 and Edinburgh Speech Tools Library version 2.0.95
Surprisingly, we have a new release. Please give feedback on installation issues so they can be fixed in a 2.1 release.
Festival offers a general framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text to speech through a number of APIs: from shell level, through a Scheme command interpreter, as a C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently English (British and American), and Spanish) though English is the most advanced. Other groups release new languages for the system. Full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project (http://festvox.org). This version also supports voices built with the latest version of Nagoya Institute of Technology's HTS system (http://hts.sp.nitech.ac.jp).
The system is written in C++ and uses the Edinburgh Speech Tools Library for low level architecture and has a Scheme (SIOD) based command interpreter for control. Documentation is given in the FSF texinfo format which can generate a printed manual, info files and HTML.
Festival is free software. Festival and the speech tools are distributed under an X11-type licence allowing unrestricted commercial and non-commercial use alike.
This distribution includes:
* Full English (British and American English) text to speech
* Full C++ source for modules, SIOD interpreter, and Scheme library
* Lexicon based on CMULEX and OALD (OALD is restricted to non-commercial use only)
* Edinburgh Speech Tools, low level C++ library
* rab_diphone: British English Male residual LPC, diphone
* kal_diphone: American English Male residual LPC diphone
* cmu_us_slt_arctic_hts: American Female, HTS
* cmu_us_rms_cg: American Male using clustergen
* cmu_us_awb_cg: Scottish English Male (with US frontend) clustergen
* Full documentation (html, postscript and GNU info format)
Note there are some licence restrictions on the voices themselves. The US English voices have the same restrictions as Festival. The UK lexicon (OALD) is restricted to non-commercial use.
Additional voices are also available.
Festival version 2.0.95 sources, voices
In North America:
To run Festival you need:
* A Unix-like environment, e.g. Linux, FreeBSD, OSX, cygwin under Windows.
* A C++ compiler: we have used GCC versions 2.x up to 4.1
* GNU Make: any recent version
New in 2.0.95
* Support for the new versions of C++ that have been released
* Integrated and updated support for HTS, Clustergen, Multisyn and Clunits voices
* "Building Voices in Festival" document describing the process of building new voices in the system
Alan W Black (CMU)
Rob Clark (Edinburgh)
Junichi Yamagishi (Edinburgh)
Keiichiro Oura (Nagoya)
Rob, Korin and I are very glad we were of help to CERN. According to CERN staff, Festival and HTS voices are constantly used there.
The evaluation results, covering over 20 TTS systems, will be presented this summer.
Similar samples were introduced in the following NHK radio program
This paper carefully analyses how unit selection and HTS behave for emotional speech. The following is the abstract.
We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates and emotional strength were used for the six emotions which we recorded – happiness, sadness, anger, surprise, fear, disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.
His online demonstration is available from here or here
Full list of the news article on
Roger’s voice can be seen from here.
S. Creer, P. Green, S. Cunningham, and J. Yamagishi
“Building personalised synthesised voices for individuals with dysarthria using the HTS toolkit,”
Computer Synthesized Speech Technologies: Tools for Aiding Impairment
John W. Mullennix and Steven E. Stern (Eds), IGI Global press, Jan. 2010.
This voice reconstruction project of the University of Sheffield was introduced in several news articles last year.
Press release 1 Press release 2
Telegraph 1 Telegraph 2