This is a 3-year EC collaborative project.
The partners are Aalto University, the University of Helsinki, the Universidad Politécnica de Madrid and the Technical University of Cluj-Napoca.
This is a 5-year project supported by JST CREST.
This paper introduces a new speech corpus named "RSS" and HMM-based speech synthesis systems using higher sampling rates such as 48 kHz. The abstract follows:
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called “RSS”, along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given.
Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.
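As a rough illustration of how sampling frequency interacts with auditory frequency warping, the sketch below compares the mel-scale extent of the analysis band at 16 kHz and 48 kHz sampling (Nyquist frequencies of 8 kHz and 24 kHz). This uses the standard mel formula as an example; it is not necessarily the exact warping scale evaluated in the paper.

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale warping (HTK convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for fs in (16000, 48000):
    nyquist = fs // 2
    print(f"fs = {fs} Hz: band 0..{nyquist} Hz spans "
          f"{hz_to_mel(nyquist):.0f} mel")
    # fs = 16000 Hz: band 0..8000 Hz spans 2840 mel
    # fs = 48000 Hz: band 0..24000 Hz spans 4016 mel
```

Because the mel scale compresses high frequencies, tripling the Nyquist frequency adds only about 40% more mel range, so mel-spaced analysis continues to devote most of its resolution to the perceptually dense low-frequency region while still gaining high-frequency detail.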
This is a library of advanced HTS voices for CSTR's Festival text-to-speech synthesis system. It includes more than 40 high-quality English voices, some of which are available from http://www.cstr.ed.ac.uk/projects/festival/morevoices.html. Currently, General American, British Received Pronunciation and Scottish English voices, as well as several Spanish voices, are included; at the moment they are licensed for research use only. For commercial use, please contact us.
From the option list, please choose 'HTS (American male)' and click the play button.
You can hear smooth, very high-quality synthetic speech!
Slides (pdf format)
The abstract of this nice thesis has also been published in Speech Communication as a journal paper below:
M. Pucher, D. Schabus, J. Yamagishi, F. Neubarth
“Modeling and Interpolation of Austrian German and Viennese Dialect in HMM-based Speech Synthesis,”
Speech Communication, Volume 52, Issue 2, pp. 164–179, February 2010
(See paper on Science Direct)
The first paper describes the thousands of voices featured in the 'Voices of the World' demos. The second paper discusses child speech created using HMM adaptation and voice conversion techniques.
Thousands of Voices for HMM-Based Speech Synthesis – Analysis and Application of TTS Systems Built on Various ASR Corpora
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
Synthesis of Child Speech With HMM Adaptation and Voice Conversion
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.
The following is copied from http://listening-talker.org/:
Speech is efficient and robust, and remains the method of choice for human communication. Consequently, speech output is used increasingly to deliver information in automated systems such as talking GPS and live-but-remote forms such as public address systems. However, these systems are essentially one-way, output-oriented technologies that lack an essential ingredient of human interaction: communication. When people speak, they also listen. When machines speak, they do not listen. As a result, there is no guarantee that the intended message is intelligible, appropriate or well-timed. The current generation of speech output technology is deaf, incapable of adapting to the listener's context, inefficient in use and lacking the naturalness that comes from rapid appreciation of the speaker-listener environment. Crucially, when speech output is employed in safety-critical environments such as vehicles and factories, inappropriate interventions can increase the chance of accidents through divided attention, while similar problems can result from the fatiguing effect of unnatural speech. In less critical environments, crude solutions involve setting the gain level of the output signal to a level that is unpleasant, repetitive and at times distorted. All of these applications of speech output will, in the future, be subject to more sophisticated treatments based at least in part on understanding how humans communicate.
The purpose of the EU-funded LISTA project (the Listening Talker) is to develop the scientific foundations needed to enable the next generation of spoken output technologies. LISTA will target all forms of generated speech -- synthetic, recorded and live -- by observing how listeners modify their production patterns in realistic environments that are characterised by noise and natural, rapid interactions. Parties to a communication are both listeners and talkers. By listening while talking, speakers can reduce the impact of noise and reverberation at the ears of their interlocutor. And by talking while listening, speakers can indicate understanding, agreement and a range of other signals that make natural dialogs fluid and not the sequence of monologues that characterise current human-computer interaction. Both noise and natural interactions demand rapid adjustments, including shifts in spectral balance, pauses, expansion of the vowel space, and changes in speech rate and hence should be considered as part of the wider LISTA vision. LISTA will build a unified framework for treating all forms of generated speech output to take advantage of commonalities in the levels at which interventions can be made (e.g., signal, vocoder parameters, statistical model, prosodic hierarchy).
CSTR takes a lead for WP3 on synthetic speech modifications.
The Festival Speech Synthesis
System version 2.0.95 and
Edinburgh Speech Tools Library version 2.0.95
Surprisingly, we have a new release. Please give feedback on installation issues so they can be fixed in a 2.1 release.
Festival offers a general framework for building speech synthesis systems, as well as examples of various modules. As a whole it offers full text-to-speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface. Festival is multilingual (currently English (British and American) and Spanish), though English is the most advanced. Other groups release new languages for the system, and full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project (http://festvox.org). This version also supports voices built with the latest version of the Nagoya Institute of Technology's HTS system (http://hts.sp.nitech.ac.jp).
The system is written in C++, uses the Edinburgh Speech Tools Library for its low-level architecture, and has a Scheme (SIOD) based command interpreter for control. Documentation is given in the FSF texinfo format, which can generate a printed manual, info files and HTML.
Festival is free software. Festival and the speech tools are distributed under an X11-type licence allowing unrestricted commercial and non-commercial use alike.
This distribution includes:
* Full English (British and American English) text to speech
* Full C++ source for modules, SIOD interpreter, and Scheme library
* Lexicon based on CMULEX and OALD (OALD is restricted to non-commercial use only)
* Edinburgh Speech Tools, low level C++ library
* rab_diphone: British English Male residual LPC, diphone
* kal_diphone: American English Male residual LPC diphone
* cmu_us_slt_arctic_hts: American Female, HTS
* cmu_us_rms_cg: American Male using clustergen
* cmu_us_awb_cg: Scottish English Male (with US frontend) clustergen
* Full documentation (html, postscript and GNU info format)
Note that there are some licence restrictions on the voices themselves. The US English voices have the same restrictions as Festival. The UK lexicon (OALD) is restricted to non-commercial use.
Additional voices are also available.
Festival version 2.0.95 sources, voices
In North America:
To run Festival you need:
* A Unix-like environment, e.g. Linux, FreeBSD, OS X, or Cygwin under Windows
* A C++ compiler: we have used GCC versions 2.x up to 4.1
* GNU Make (any recent version)
New in 2.0.95
* Support for the new versions of C++ that have been released
* Integrated and updated support for HTS, Clustergen, Multisyn and Clunits voices
* "Building Voices in Festival" document describing process of building new voices in the system
Alan W Black (CMU)
Rob Clark (Edinburgh)
Junichi Yamagishi (Edinburgh)
Keiichiro Oura (Nagoya)
Rob, Korin and I are very glad we were of help to CERN. According to CERN staff, Festival and HTS voices are in constant use there.
The evaluation results, covering over 20 TTS systems, will be presented this summer.
Similar samples were introduced in the following NHK radio program
S. Creer, P. Green, S. Cunningham, and J. Yamagishi
“Building personalised synthesised voices for individuals with dysarthria using the HTS toolkit,”
Computer Synthesized Speech Technologies: Tools for Aiding Impairment
John W. Mullennix and Steven E. Stern (Eds), IGI Global press, Jan. 2010.
This voice reconstruction project at the University of Sheffield was covered in several news articles last year.
Press release 1 Press release 2
Telegraph 1 Telegraph 2