1000s of voices for HMM-based speech synthesis
In conventional speech synthesis, large amounts of
phonetically-balanced speech data recorded in highly-controlled
recording studio environments are typically required to build a
voice. Although using such data is a straightforward solution
for high quality synthesis, the number of voices available will
always be limited, because recording costs are high. On the other
hand, our recent experiments with HMM-based speech synthesis
systems have demonstrated that speaker-adaptive HMM-based
speech synthesis (which uses an average voice model plus model
adaptation) is robust to non-ideal speech data that are recorded
under various conditions and with varying microphones, that
are not perfectly clean, and/or that lack phonetic balance. This
enables us to consider building high-quality voices from non-TTS
corpora such as ASR corpora. Since ASR corpora generally
include a large number of speakers, this leads to the possibility
of producing an enormous number of voices automatically.
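The average-voice-plus-adaptation idea can be illustrated with a toy sketch. In real systems such as HTS, adaptation applies tied linear transforms (e.g. MLLR/CSMAPLR) to full HMM state distributions; the hypothetical code below shows only the core intuition in one dimension: a single affine transform is estimated from a few observed average-voice/target-speaker mean pairs, then applied to every state mean, including states never seen in the adaptation data.

```python
# Toy illustration (not the actual HTS pipeline): a single global
# affine transform y = a*x + b per feature dimension maps
# average-voice Gaussian means toward a target speaker. Because the
# transform is tied across states, a small amount of adaptation data
# can adapt states that were never observed for the target speaker.

def fit_affine(avg_means, target_means):
    """Least-squares fit of y = a*x + b from paired scalar means."""
    n = len(avg_means)
    mx = sum(avg_means) / n
    my = sum(target_means) / n
    sxx = sum((x - mx) ** 2 for x in avg_means)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(avg_means, target_means))
    a = sxy / sxx
    b = my - a * mx
    return a, b

def adapt(all_avg_means, a, b):
    """Apply the tied transform to every state mean, including
    states with no adaptation data of their own."""
    return [a * x + b for x in all_avg_means]

# State means observed in the adaptation data (average voice vs. target):
seen_avg = [1.0, 2.0, 3.0]
seen_target = [2.1, 4.0, 6.1]        # roughly y = 2x
a, b = fit_affine(seen_avg, seen_target)

# The same transform then adapts an unseen state mean (x = 10.0):
adapted = adapt([1.0, 2.0, 3.0, 10.0], a, b)
```

The tying of one transform across all states is what makes adaptation data-efficient compared with retraining a full speaker-dependent model.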
Here we demonstrate the thousands of voices for
HMM-based speech synthesis that we have made from pre-defined
training sets of several popular ASR corpora such as
the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0),
Resource Management, and SPEECON databases.
Cambridge version of WSJ0 database
Mandarin Speecon: clean data
Mandarin Speecon: clean and noisy data
These voices have been built in collaboration with many EMIME project members and other external researchers.
- J. Yamagishi, et al.,
``1000s of voices for HMM-based speech synthesis
-- Analysis and applications of TTS systems built on various ASR corpora,''
IEEE Trans. Audio, Speech, and Language Processing (under review)