EMIME on Swedish radio

The EMIME project and the cross-lingual synthetic speech samples created by Oura-kun have been featured on Swedish national radio, Sveriges Radio SR (P1).

According to Google Translate:

Have you ever wondered what it would sound like if you could speak Japanese or Finnish as effortlessly as your mother tongue? Within a few years, a translation function with voice mimicry may be available on mobile phones.

It is the EU-funded research project EMIME that has developed this translation function with voice imitation. Speech technology expert Mikko Kurimo at the Helsinki University of Technology explains that one could conduct an entire conversation on a mobile phone where it sounds as if you both speak and understand one another's language, without having had to devote years to studying its inflections and pronunciation.

It is no easy task the researchers have taken on in trying to create a voice-imitating translator: first, understand what is said; then produce an accurate translation; and finally create a sound file where it sounds as if you are saying the same thing, in Japanese for example.

It will require around five more years of development before we can have the function on our mobile phones, Mikko Kurimo thinks, but there is great value in getting away from the flat, emotionless computer-generated voices found in today's translation software.

"The default voice in today's translation software is very boring and always the same. If you want to express something, it's much better to have your own voice there."

Two new journal papers!

Two new journal papers have been published in IEEE Transactions on Audio, Speech, and Language Processing!


The first paper describes the thousands of voices featured in the 'Voices of the World' demos. The second paper covers child speech synthesis using HMM adaptation and voice conversion techniques.

Thousands of Voices for HMM-Based Speech Synthesis – Analysis and Application of TTS Systems Built on Various ASR Corpora
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

Synthesis of Child Speech With HMM Adaptation and Voice Conversion
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.

Joint workshop with Toshiba and Phonetic Arts

EMIME organised a mini joint workshop on HMM-based speech synthesis in Cambridge, where we had productive, in-depth discussions with people from Toshiba and Phonetic Arts. The meeting was held in conjunction with the EMIME board and annual review meetings.