Analysis of speaker similarity of HMM-based speech synthesis

We have revisited some basic configuration choices made in HMM-based speech synthesis, such as the sampling rate, the auditory
scale and the logarithmic scale of F0, which are typically based on experience from other fields. Contrary to what is generally accepted
in ASR, higher sampling rates (above 16 kHz) lead to enhanced feature extraction and improved speaker similarity for speech synthesis.
A generalized logarithmic transform of F0 results in a wider intra-utterance variance of F0 trajectories and more dynamic prosody.
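
To make the F0 transform concrete, the following is a minimal sketch of one common form of generalized logarithm, (x^gamma - 1) / gamma, which reduces to the natural logarithm as gamma approaches 0 and to a shifted linear scale at gamma = 1. This page does not state the exact formula or the gamma value used, so the functional form, the function names and the F0 values below are all assumptions for illustration only.

```python
import math


def generalized_log(f0_hz, gamma):
    """Assumed form of the generalized logarithm: (x**gamma - 1) / gamma.

    gamma = 0 gives the ordinary natural log; gamma = 1 gives x - 1 (linear).
    """
    if f0_hz <= 0:
        raise ValueError("F0 must be positive (apply to voiced frames only)")
    if gamma == 0:
        return math.log(f0_hz)
    return (f0_hz ** gamma - 1.0) / gamma


def inverse_generalized_log(value, gamma):
    """Map a transformed value back to Hz."""
    if gamma == 0:
        return math.exp(value)
    return (gamma * value + 1.0) ** (1.0 / gamma)


# Made-up F0 values (Hz) for a few voiced frames; not taken from the paper.
f0_track = [110.0, 125.0, 140.0, 118.0]

for gamma in (0.0, 0.5, 1.0):
    transformed = [generalized_log(f, gamma) for f in f0_track]
    restored = [inverse_generalized_log(v, gamma) for v in transformed]
    print(gamma, [round(v, 3) for v in transformed], [round(v, 1) for v in restored])
```

In a synthesis system the F0 stream would be modelled in the transformed domain and mapped back to Hz with the inverse function at generation time; gamma then acts as a tunable parameter between a purely logarithmic and a purely linear F0 scale.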

Audio Examples

These voices were built in collaboration with the artist James Coupe.
Reference:
J. Yamagishi and S. King,
``Simple methods for improving speaker-similarity of HMM-based speech synthesis,''
ICASSP 2010 (under review).