1000s of voices for HMM-based speech synthesis

In conventional speech synthesis, large amounts of phonetically balanced speech data, recorded in highly controlled
studio environments, are typically required to build a voice. Although using such data is a straightforward route to
high-quality synthesis, the number of voices available will always be limited because recording costs are high. On the other
hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based
speech synthesis (which uses an average voice model plus model adaptation) is robust to non-ideal speech data: data recorded
under various conditions and with varying microphones, data that are not perfectly clean, and/or data that lack phonetic balance.
This allows us to build high-quality voices from non-TTS corpora such as ASR corpora. Since ASR corpora generally
include a large number of speakers, this opens up the possibility of producing an enormous number of voices automatically.
Here we demonstrate the thousands of voices for HMM-based speech synthesis that we have built from the pre-defined
training sets of several popular ASR corpora, including the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0),
Resource Management, and SPEECON databases.
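The key idea above is that a single average voice model can be turned into a new speaker's voice from a small amount of adaptation data, because one global transform estimated on the observed states is shared by all states of the model. The following is a minimal, self-contained sketch of that idea in the MLLR style, using a toy one-Gaussian-per-state model with hypothetical dimensions and data; it is an illustration of the principle, not the actual system.

```python
import numpy as np

# Toy MLLR-style mean adaptation (a hedged sketch, not the real system):
# the average-voice model stores one Gaussian mean per state; a single
# global affine transform [A | b] is estimated by least squares from a
# few observed states, then applied to ALL states, including states for
# which the target speaker provided no adaptation data at all.

rng = np.random.default_rng(0)
dim, n_states = 3, 50

# Average-voice model: one mean vector per state (hypothetical values).
avg_means = rng.normal(size=(n_states, dim))

# Simulated target speaker: a fixed affine mismatch from the average voice.
true_A = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
true_b = rng.normal(size=dim)

# Only a handful of states appear in the adaptation data.
seen = rng.choice(n_states, size=10, replace=False)
adapt_obs = avg_means[seen] @ true_A.T + true_b

# Estimate the global transform by least squares on the seen states only.
X = np.hstack([avg_means[seen], np.ones((len(seen), 1))])  # extended means
W, *_ = np.linalg.lstsq(X, adapt_obs, rcond=None)          # (dim+1, dim)

# Apply the one global transform to every state of the average voice.
adapted_means = np.hstack([avg_means, np.ones((n_states, 1))]) @ W

# Unseen states are also moved toward the target speaker.
target_means = avg_means @ true_A.T + true_b
err = np.abs(adapted_means - target_means).max()
print(f"max abs error over all {n_states} states: {err:.2e}")
```

Because the simulated mismatch here is exactly affine and noiseless, the least-squares estimate recovers it from the ten observed states, and all fifty states land on the target speaker's means; with real adaptation data the transform is estimated by maximum likelihood and the fit is, of course, approximate.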

Audio Examples

WSJ0 database

Cambridge version of WSJ0 database

WSJ1 database

Resource Management

Finnish Speecon

Mandarin Speecon: clean data

Mandarin Speecon: clean and noisy data


These voices have been built through collaboration and cooperation with many EMIME members and other external researchers.
References:
J. Yamagishi et al.,
"1000s of voices for HMM-based speech synthesis,"
Proc. Interspeech 2009.

J. Yamagishi et al.,
"1000s of voices for HMM-based speech synthesis
-- Analysis and applications of TTS systems built on various ASR corpora,"
IEEE Trans. Audio, Speech, and Language Processing (under review).