2010

CSTR HTS voice library (updated to ver 0.96)

CSTR HTS Voice Library (version 0.96)

This is a library of advanced HTS voices for CSTR's Festival text-to-speech synthesis system. It includes more than 40 high-quality English voices, some of which are available from http://www.cstr.ed.ac.uk/projects/festival/morevoices.html. Currently General American, British Received Pronunciation and Scottish English voices, plus several Spanish voices, are included; at the moment they are licensed for research use only. For commercial use, please contact us.

Download URL
http://homepages.inf.ed.ac.uk/jyamagis/library/
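
Once one of the library voices is installed in your Festival setup, it can be selected and used like any other Festival voice. Below is a minimal sketch that drives Festival's text2wave script from Python; the voice name voice_cstr_hts_example is a placeholder for whichever voice you actually install.

import subprocess

# Synthesize a sentence with a specific (hypothetical) voice via text2wave.
# Assumes Festival and its text2wave script are installed and on the PATH.
text = "Hello from a CSTR HTS voice."
subprocess.run(
    ["text2wave", "-o", "hello.wav", "-eval", "(voice_cstr_hts_example)"],
    input=text.encode(),
    check=True,
)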

Tutorial slides

I gave a tutorial at ISCSLP 2010 in Taiwan last week. Our slides are available at
http://homepages.inf.ed.ac.uk/jyamagis/ISCSLP2010/ISCLSP-Tutorial.pdf

New journal paper on the gap between HMM-based ASR and TTS

A new journal paper was published in the IEEE Journal of Selected Topics in Signal Processing.

10.1109/JSTSP.2010.2079315

The title and abstract follow:
Measuring the Gap Between HMM-Based ASR and TTS
The EMIME European project is conducting research in the development of technologies for mobile, personalized speech-to-speech translation systems. The hidden Markov model (HMM) is being used as the underlying technology in both automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components; thus, the investigation of unified statistical modeling approaches has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper, we present results and analysis of a series of experiments that have been conducted on English ASR and TTS systems measuring their performance with respect to phone set and lexicon, acoustic feature type and dimensionality, HMM topology, and speaker adaptation. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance often demands diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of such unified modeling approaches.

Congratulations to Dr. Joao

Joao Cabral passed his PhD viva with minor corrections.
Congratulations!

New journal paper on glottal source modeling

A new journal paper was published in IEEE Transactions on Audio, Speech, and Language Processing!

10.1109/TASL.2010.2045239

The title and abstract follow:
HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering
This paper describes a hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed method, speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and thus parametrized into excitation and spectral features. The source and filter features are modeled individually in the framework of HMM and generated in the synthesis stage according to the text input. The glottal excitation is synthesized through interpolating and concatenating natural glottal flow pulses, and the excitation signal is further modified according to the spectrum of the desired voice source characteristics. Speech is synthesized by filtering the reconstructed source signal with the vocal tract filter. Experiments show that the proposed system is capable of generating natural sounding speech, and the quality is clearly better compared to two HMM-based speech synthesis systems based on widely used vocoder techniques.
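
As a rough illustration of the source-filter idea described above (not the paper's actual vocoder), the sketch below excites a stable all-pole "vocal tract" filter with a simple impulse-train stand-in for the glottal source; the formant frequencies and pole radii are arbitrary choices.

import numpy as np
from scipy.signal import lfilter

fs = 16000                          # sampling rate (Hz)
f0 = 120                            # fundamental frequency (Hz)

# Crude periodic source: one impulse per pitch period.
source = np.zeros(fs)               # one second of samples
source[::fs // f0] = 1.0

# Build a stable all-pole filter from two formant-like pole pairs.
poles = []
for freq in (500, 1500):            # assumed formant frequencies (Hz)
    w = 2 * np.pi * freq / fs
    poles += [0.97 * np.exp(1j * w), 0.97 * np.exp(-1j * w)]
a = np.real(np.poly(poles))         # LPC-style denominator coefficients

speech = lfilter([1.0], a, source)  # filter the source through the "tract"
speech /= np.abs(speech).max()      # normalise amplitude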

New Romanian speech database: RSS

We have released a new free Romanian speech database for speech synthesis named "RSS".

The Romanian speech synthesis (RSS) corpus is a free large-scale Romanian speech corpus that includes about 3000 sentences uttered by a native female speaker. The RSS corpus was designed mainly for text-to-speech synthesis and was recorded in a hemianechoic chamber (anechoic walls and ceiling; floor partially anechoic) at the University of Edinburgh. We used three high-quality studio microphones: a Neumann u89i (large diaphragm condenser), a Sennheiser MKH 800 (small diaphragm condenser with very wide bandwidth) and a DPA 4035 (headset-mounted condenser). Although the current release includes only the speech data recorded with the Sennheiser MKH 800, we may release the data recorded with the other microphones in the future. All recordings were made at a 96 kHz sampling frequency and 24 bits per sample, then downsampled to a 48 kHz sampling frequency. For recording, downsampling and bit-rate conversion, we used ProTools HD hardware and software. We conducted eight sessions over the course of a month, recording about 500 sentences in each session. At the start of each session, the speaker listened to a previously recorded sample in order to attain a similar voice quality and intonation.


http://octopus.utcluj.ro:56337/RORelease/
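
For readers who want to reproduce the 96 kHz to 48 kHz conversion step with open-source tools (the corpus itself was processed with ProTools HD), a minimal sketch using the soundfile and scipy packages might look like this; the file names are placeholders.

import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("recording_96k.wav")       # 96 kHz source file
assert rate == 96000
audio_48k = resample_poly(audio, up=1, down=2)   # polyphase decimation by 2
sf.write("recording_48k.wav", audio_48k, 48000, subtype="PCM_24")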

Presentations and Talks

I have a series of talks, presentations, and meetings this summer:

- 1st Sept, Talk at Aholab, University of Basque Country, Bilbao, Spain
- 2nd and 3rd Sept, LISTA project meeting, Vitoria, Spain
- 17th, 19th (National working day!), and 20th Sept, Talk at Nokia research center, Beijing, China
- 22nd, 23rd, and 24th Sept, 7th Speech Synthesis Workshop (SSW7), Presentation on EMIME work
- 24th Sept, Open Source Initiatives for Speech Synthesis, Presentation on the 'open-source/creative commons' speech database
- 25th Sept, The 2010 Blizzard Challenge, Presentation of the 'CSTR/EMIME entry for the 2010 Blizzard Challenge'
- 27th to 30th Sept, Interspeech 2010, Presentation of 'Roles of the average voice in speaker-adaptive HMM-based speech synthesis'

Open Source Initiatives for Speech Synthesis

Prof Tokuda and I organised a meeting named "Open Source Initiatives for Speech Synthesis" as a special session of SSW7 in Kyoto. Thank you for attending the meeting!


Vacancy

We would like to announce an open postdoc position in speech synthesis, voice reconstruction and personalised voice communication aids at the University of Edinburgh. (This position is now closed!)

=============================================================

An Open Position for Postdoctoral Research Associate

The Centre for Speech Technology Research (CSTR)
University of Edinburgh

Job Description
The School of Informatics at the University of Edinburgh invites applications for the post of Postdoctoral Research Associate on a project concerning voice reconstruction and personalised voice communication aids. The project will develop clinical applications of speaker-adaptive statistical text-to-speech synthesis in collaboration with the Euan MacDonald Centre, who are funding this project. Applications include the reconstruction of voices of patients who have disordered speech as a consequence of Motor Neurone Disease, by using statistical parametric model adaptation. The project will also investigate better voice reconstruction methods.

You will be part of a dynamic and creative research team within the Centre for Speech Technology Research, at the forefront of developments in statistical speech synthesis. The application of statistical parametric speech synthesis to clinical applications such as voice banking, voice reconstruction and assistive devices, is an exciting new development and an area in which we expect to have increased research activity in the coming years. We are seeking additional long-term funding for this work and there may be the possibility of extending this Research Associate position.


Person Specification
You have (or will be near completion of) a PhD in speech processing, computer science, cognitive science, linguistics, engineering, mathematics, or a related discipline.

You will have the necessary programming ability to conduct research in this area, a background in statistical modelling using Hidden Markov Models and strong experimental planning and execution skills.

A background in one or more of the following areas is also desirable: statistical parametric text-to-speech synthesis using HMMs and HSMMs; speaker adaptation using the MLLR or MAP family of techniques; familiarity with software tools including HTK, HTS and Festival; the ability to implement web applications. Familiarity with the issues surrounding degenerative diseases which affect speech, such as Motor Neurone Disease, Parkinson's disease, Cerebral Palsy or Multiple Sclerosis, is also desirable.

For further information, see http://www.jobs.ed.ac.uk/vacancies/index.cfm?fuseaction=vacancies.detail&vacancy_ref=3013390

==============================================================

Tutorial at ISCSLP 2010

Simon and I will give a tutorial at ISCSLP 2010 held in Taiwan on 29th November.

http://conf.ncku.edu.tw/iscslp2010/Tutorial.htm

----------------------------------------------------------------------------------------------------
New and emerging applications of speech synthesis

Until recently, text-to-speech was often just an 'optional extra' which allowed text to be read out loud. But now, thanks to statistical and machine learning approaches, speech synthesis can mean more than just the reading out of text in a predefined voice. New research areas and more interesting applications are emerging.

In this tutorial, after a quick overview of the basic approaches to statistical speech synthesis including speaker adaptation, we consider some of these new applications of speech synthesis. We look behind each application at the underlying techniques used and describe the scientific advances that have made them possible. The applications we will examine include personalised speech-to-speech translation, 'robust speech synthesis' (making thousands of different voices automatically from imperfect data), clinical applications such as voice reconstruction for patients who have disordered speech, and articulatory-controllable statistical speech synthesis.

The really interesting problems still to be solved in speech synthesis go beyond simply improving 'quality' or 'naturalness' (typically measured using Mean Opinion Scores). The key problem of personalised speech-to-speech translation is to reproduce or transfer speaker characteristics across languages. The aim of robust speech synthesis is to create good quality synthetic speech from noisy and imperfect data. The core problems in voice reconstruction centre around retaining or reconstructing the original characteristics of patients, given only samples of their disordered speech.

We illustrate our multidisciplinary approach to speech synthesis, bringing in techniques and knowledge from ASR, speech enhancement and speech production in order to develop the techniques required for these new applications. We will conclude by attempting to predict some future directions of speech synthesis.

----------------------------------------------------------------------------------------------------
See you in Taiwan!

Cereproc's HTS demo

Please try Cereproc's new American HTS voice in their live demo:
From the option list, choose 'HTS (American male)' and click the play button.
You can hear very good-quality, smooth synthetic speech!
http://www.cereproc.com/

Zhenhua's new journal paper

Zhenhua (USTC, iFlytek)’s new journal paper was published in Speech Communication
http://dx.doi.org/10.1016/j.specom.2010.06.006

This paper carefully analyses how HMMs can be used to predict articulatory movements from given speech and/or text. The abstract follows:

This paper presents an investigation into predicting the movement of a speaker’s mouth from text input using hidden Markov models (HMM). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average root mean square (RMS) error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
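
The two evaluation measures quoted in the abstract (RMS error in millimetres and the correlation coefficient) can be computed from predicted and measured articulator trajectories as in the minimal sketch below; the data here are made up purely for illustration.

import numpy as np

def rms_error(predicted, measured):
    """Root mean square error, in the same units as the trajectories (mm)."""
    return float(np.sqrt(np.mean((np.asarray(predicted) - np.asarray(measured)) ** 2)))

def correlation(predicted, measured):
    """Pearson correlation coefficient between the two trajectories."""
    return float(np.corrcoef(predicted, measured)[0, 1])

# Example with synthetic data:
pred = np.sin(np.linspace(0, 10, 200))
meas = pred + np.random.normal(scale=0.1, size=pred.shape)
print(rms_error(pred, meas), correlation(pred, meas))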

EMIME on Swedish radio

The EMIME project and the cross-lingual synthetic speech samples created by Oura-kun have been featured on Swedish national radio, Sveriges Radio SR (P1):
http://sverigesradio.se/sida/artikel.aspx?programid=406&artikel=3859170

A rough translation (via Google Translate):

Have you ever wondered what it would sound like if you could speak Japanese or Finnish as effortlessly as your mother tongue? Within a few years, a translation function with voice mimicry may be available on mobile phones. The EU-funded research project EMIME has developed a translation function with voice imitation. Speech technology expert Mikko Kurimo at the Helsinki University of Technology explains that you could conduct an entire conversation through your mobile phone in which it sounds as if you speak and understand each other's language, without having spent years studying vocabulary and grammar. It is no easy task the researchers have taken on in trying to create a voice-imitating translator: first understand what is said, then produce an accurate translation, and then create a sound file in which it sounds as if you are saying the same thing, only in Japanese, for example. It will require around five more years of development before the function reaches our mobile phones, Mikko Kurimo thinks, but he sees great value in getting away from the flat, emotionless computer-generated voices found in today's translation software. "The default voice in today's translation software is very boring and always the same. If you want to express something, it's much better to have your own voice there."

Results of the 2010 Blizzard Challenge

The evaluation results of the 2010 Blizzard Challenge have come out! The following are the mean opinion scores (MOS) for naturalness on tasks EH1, EH2 and ES1. My new HTS system, which I introduced in earlier news, is system V. System A is natural speech included for reference, and system B is a standard Festival unit-selection system.

[Figure: naturalness MOS scores, task EH1]

Task EH1 (4 hours of speech data, Speaker RJS)

According to Wilcoxon signed-rank tests with a Bonferroni-corrected significance level (1% level), system V is not as good as systems M, J and T. There is no significant difference between systems V and B.

[Figure: naturalness MOS scores, task EH2]

Task EH2 (1 hour of speech data, Arctic sentences, Speaker Roger)

Likewise, according to Wilcoxon signed-rank tests with a Bonferroni-corrected significance level (1% level), system V is the second best and is significantly better than system B.

[Figure: naturalness MOS scores, task ES1]

Task ES1 (100 Arctic sentences, Speaker Roger)

Likewise, according to Wilcoxon signed-rank tests with a Bonferroni-corrected significance level (1% level), systems M and V are the joint best.

In summary, the new HTS system performs very well on the small dataset and on the one-hour dataset. Even on the four-hour dataset, it is as good as the Festival unit-selection system.
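
The comparisons above rely on pairwise Wilcoxon signed-rank tests with a Bonferroni-corrected significance level. A minimal sketch of that kind of test, using hypothetical per-listener scores, might look like this:

from itertools import combinations
from scipy.stats import wilcoxon

def significant_differences(scores, alpha=0.01):
    """scores: dict mapping system name -> list of paired listener MOS values."""
    pairs = list(combinations(sorted(scores), 2))
    corrected_alpha = alpha / len(pairs)          # Bonferroni correction
    results = {}
    for sys_a, sys_b in pairs:
        statistic, p_value = wilcoxon(scores[sys_a], scores[sys_b])
        results[(sys_a, sys_b)] = p_value < corrected_alpha
    return results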

Odyssey 2010: The Speaker and Language Recognition Workshop

I gave a presentation at the speaker and language recognition workshop 'Odyssey 2010', held in Brno, Czech Republic. This is a collaboration with Phillip from NMSU and Michael from ftw on speaker verification systems. We reported that synthesised speech generated from speaker-adaptive HMM-based speech synthesis systems is of high enough quality to allow these synthesised voices to pass as true human claimants, despite the good performance of the speaker verification systems, and emphasised that we need to develop new features or strategies to discriminate synthetic speech from real speech, because the conventional methods for detecting synthetic speech are no longer robust enough.

Slides (pdf format)
odyssey10-ver3

2010 OCG-Förderpreis

The 2010 OCG-Förderpreis of the Austrian Computer Society has been awarded to Dietmar Schabus (ftw) for his thesis "Interpolation of Austrian German and Viennese Dialect/Sociolect in HMM-based Speech Synthesis". Congratulations!!

http://www.ocg.at/presse/2010/100707-fp.html
http://www.ftw.at/news/ocg-foerderpreis-2010-geht-an-di-dietmar-schabus?set_language=en

The work in this excellent thesis has also been published in Speech Communication as the journal paper below:
M. Pucher, D. Schabus, J. Yamagishi and F. Neubarth,
"Modeling and Interpolation of Austrian German and Viennese Dialect in HMM-based Speech Synthesis,"
Speech Communication,
Volume 52, Issue 2, Pages 164-179, February 2010.
(See the paper on ScienceDirect)

Two new journal papers!

Two new journal papers were published in IEEE transactions on Audio, Speech, and Language Processing!

10.1109/TASL.2010.2045237
10.1109/TASL.2009.2035029

The first paper describes the thousands of voices featured in the 'Voices of the World' demos. The second paper describes child speech created using HMM adaptation and voice conversion techniques.

Thousands of Voices for HMM-Based Speech Synthesis – Analysis and Application of TTS Systems Built on Various ASR Corpora
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

Synthesis of Child Speech With HMM Adaptation and Voice Conversion
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.

A new FP7 EC project 'LISTA' kicked off!

We have had the kick-off meeting of a new EC FP7 project, 'LISTA' (2010--2013), and introduced our recent work on HMM-based speech synthesis, such as articulatory-controllable HMM-based speech synthesis and the use of hyper-articulation under noisy conditions. This is a collaborative project with the University of the Basque Country (Spain), KTH (Sweden) and ICS-FORTH (Greece).

The following is cut & paste from http://listening-talker.org/

------------------------------------------------------------------
Speech is efficient and robust, and remains the method of choice for human communication. Consequently, speech output is used increasingly to deliver information in automated systems such as talking GPS and live-but-remote forms such as public address systems. However, these systems are essentially one-way, output-oriented technologies that lack an essential ingredient of human interaction: communication. When people speak, they also listen. When machines speak, they do not listen. As a result, there is no guarantee that the intended message is intelligible, appropriate or well-timed. The current generation of speech output technology is deaf, incapable of adapting to the listener's context, inefficient in use and lacking the naturalness that comes from rapid appreciation of the speaker-listener environment. Crucially, when speech output is employed in safety-critical environments such as vehicles and factories, inappropriate interventions can increase the chance of accidents through divided attention, while similar problems can result from the fatiguing effect of unnatural speech. In less critical environments, crude solutions involve setting the gain level of the output signal to a level that is unpleasant, repetitive and at times distorted. All of these applications of speech output will, in the future, be subject to more sophisticated treatments based at least in part on understanding how humans communicate.



The purpose of the EU-funded LISTA project (the Listening Talker) is to develop the scientific foundations needed to enable the next generation of spoken output technologies. LISTA will target all forms of generated speech -- synthetic, recorded and live -- by observing how listeners modify their production patterns in realistic environments that are characterised by noise and natural, rapid interactions. Parties to a communication are both listeners and talkers. By listening while talking, speakers can reduce the impact of noise and reverberation at the ears of their interlocutor. And by talking while listening, speakers can indicate understanding, agreement and a range of other signals that make natural dialogs fluid and not the sequence of monologues that characterise current human-computer interaction. Both noise and natural interactions demand rapid adjustments, including shifts in spectral balance, pauses, expansion of the vowel space, and changes in speech rate and hence should be considered as part of the wider LISTA vision. LISTA will build a unified framework for treating all forms of generated speech output to take advantage of commonalities in the levels at which interventions can be made (e.g., signal, vocoder parameters, statistical model, prosodic hierarchy).
------------------------------------------------------------------

CSTR leads WP3 on synthetic speech modification.


Joint workshop with Toshiba and Phonetic Arts

EMIME organised a mini joint workshop on HMM-based speech synthesis in Cambridge, and we had productive, in-depth discussions with people from Toshiba and Phonetic Arts. The meeting was held in conjunction with the EMIME board and annual review meetings.


HTS version 2.1.1

A new version of HTS (version 2.1.1) has been released by the HTS working group. For details of the new features, please see http://hts.sp.nitech.ac.jp/. Since this version is based on HTK-3.4.1, speaker adaptation using linear transforms and speaker-adaptive training (SAT) become faster and demand less memory, even for large regression-class trees.

Festival ver 2.0.95

The Festival Speech Synthesis System version 2.0.95 and
Edinburgh Speech Tools Library version 2.0.95

April 2010


Surprisingly we have a new release.  Please give feedback for installation issues so they can be fixed in a 2.1 release.

Festival offers a general framework for building speech synthesis systems as well as including examples of various modules.  As a whole it offers full text to speech through a number of APIs: from the shell level, through a Scheme command interpreter, as a C++ library, from Java, and via an Emacs interface.  Festival is multi-lingual (currently English (British and American) and Spanish), though English is the most advanced.  Other groups release new languages for the system.  Full tools and documentation for building new voices are available through Carnegie Mellon's FestVox project (http://festvox.org).  This version also supports voices built with the latest version of Nagoya Institute of Technology's HTS system (http://hts.sp.nitech.ac.jp).

The system is written in C++ and uses the Edinburgh Speech Tools Library for low-level architecture, and has a Scheme (SIOD) based command interpreter for control.  Documentation is given in the FSF texinfo format, which can generate a printed manual, info files and HTML.
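
As a quick illustration of the shell-level API mentioned above, the following sketch pipes text into the festival binary from Python; it assumes festival (with a default voice) is installed and on the PATH.

import subprocess

# Speak a sentence using Festival's shell-level text-to-speech mode.
subprocess.run(
    ["festival", "--tts"],
    input=b"Hello from the Festival shell interface.",
    check=True,
)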

Festival is free software.  Festival and the speech tools are distributed under an X11-type licence allowing unrestricted commercial and non-commercial use alike.

This distribution includes:
 * Full English (British and American English) text to speech
 * Full C++ source for modules, SIOD interpreter, and Scheme library
 * Lexicon based on CMULEX and OALD (OALD is restricted to non-commercial use only)
 * Edinburgh Speech Tools, low level C++ library
 * rab_diphone: British English Male residual LPC, diphone
 * kal_diphone: American English Male residual LPC diphone
 * cmu_us_slt_arctic_hts: American Female, HTS
 * cmu_us_rms_cg: American Male using  clustergen
 * cmu_us_awb_cg: Scottish English Male (with US frontend) clustergen
 * Full documentation (html, postscript and GNU info format)

Note there are some licence restrictions on the voices themselves. The US English voices have the same restrictions as Festival.  The UK lexicon (OALD) is restricted to non-commercial use.

Additional voices are also available.

Festival version 2.0.95 sources, voices

In Europe:
  http://www.cstr.ed.ac.uk/downloads/festival/2.0.95/
In North America:
  http://festvox.org/festival

Requirements

To run Festival you need:
 * A Unix-like environment, e.g. Linux, FreeBSD, OSX, or cygwin under Windows.
 * A C++ compiler: we have used GCC versions 2.x up to 4.1
 * GNU Make (any recent version)

New in 2.0.95
 * Support for the new versions of C++ that have been released
 * Integrated and updated support for HTS, Clustergen, Multisyn and Clunits voices
 * "Building Voices in Festival" document describing process of building new voices in the system
     http://festvox.org/

Alan W Black (CMU)
Rob Clark (Edinburgh)
Junichi Yamagishi (Edinburgh)
Keiichiro Oura (Nagoya)

CERN

HTS voices which I made are used by various organisations, including CERN (the European Organisation for Nuclear Research). The following is a CERN movie about their recent memorable experiment with the Large Hadron Collider, the largest particle accelerator in the world. You can hear a British English male HTS voice say, for example, "Wire scanners, fly the wire" at the beginning. In addition, you can hear the HTS voice announce the situation several times (05:26, 05:49, 13:30, 13:52, 14:45, 26:05, 32:04, etc.) in this movie.
http://cdsweb.cern.ch/record/1256279

Rob, Korin and I are very glad we were of help to CERN. According to CERN staff, Festival and the HTS voices are in constant use there.

Blizzard Challenge

Oliver and I participated in the 2010 Blizzard Challenge (an open evaluation of speech synthesis using a common database). The following are audio samples of synthetic speech generated from the new HMM-based speech synthesis system which we built for the 2010 challenge:

broadcast_2010_0001
news_2010_0001
novel_2010_0001

The evaluation results, covering over 20 TTS systems, will be published this summer.

Homepage updated

I have renewed my website. The demonstration pages are still under construction.

Prof. Tokuda's interview

Prof. Tokuda has been interviewed about HMM-based speech synthesis and kindly introduced some celebrity voices which I made using speaker adaptation. You can watch the news on YouTube. The part about HTS starts at around 3:00.



Similar samples were featured in the following NHK radio program:
http://www.nhk.or.jp/shibuz-blog/070/36206.html

Roberto's journal paper

Roberto’s paper was published in Speech Communication
doi:10.1016/j.specom.2009.12.007

This paper carefully analyses how unit selection and HTS behave for emotional speech. The abstract follows:

We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates and emotional strength were used for the six emotions which we recorded – happiness, sadness, anger, surprise, fear, disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.


His online demonstration is available from here or here

Cereproc's Roger Ebert voice

News articles on Cereproc's Roger Ebert voice. Although I was not involved in this project, it is a great achievement.


A full list of the news articles on Roger's voice can be seen here:
http://www.cereproc.com/en/node/314

Itakura Prize!

The 2010 Itakura Prize for Innovative Young Researchers, given by the Acoustical Society of Japan, has been awarded to me for "Speaker adaptation techniques for speech synthesis". This is great news. Thank you very much!


University of Sheffield's voice reconstruction project

A book on computer-synthesised speech technologies for aiding impairment has been published by IGI Global. The book includes the following chapter, which describes the outcomes and future directions of the voice reconstruction project carried out by the University of Sheffield. HTS and adaptation frameworks are used for this clinical application.

S. Creer, P. Green, S. Cunningham, and J. Yamagishi,
"Building personalised synthesised voices for individuals with dysarthria using the HTS toolkit,"
in Computer Synthesized Speech Technologies: Tools for Aiding Impairment,
John W. Mullennix and Steven E. Stern (Eds.), IGI Global Press, Jan. 2010.
ISBN: 978-1-61520-725-1

This voice reconstruction project at the University of Sheffield was covered in several news articles last year.
Press release 1 Press release 2
Times
Telegraph 1 Telegraph 2
Yorkshire Post
Times India
Engineer
MedGaget