Analysis of speaker similarity of HMM-based speech synthesis
We have revisited some basic configuration choices made in
HMM-based speech synthesis, such as the sampling rate, the auditory
scale and the logarithmic scale of F0, which are typically based on experience
from other fields. Contrary to what is generally accepted
in automatic speech recognition (ASR), higher sampling rates (above 16 kHz) lead to enhanced feature
extraction and improved speaker similarity for speech synthesis.
A generalized logarithmic transform of F0 results in a wider intra-utterance
variance of F0 trajectories and more dynamic prosody.
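
The exact form of the transform is not given here. As a rough illustration only, the sketch below assumes the common definition of the generalized logarithm, (x^gamma - 1) / gamma, which reduces to the natural logarithm as gamma approaches 0 and to a shifted linear scale at gamma = 1. The function names and the parameter name gamma are hypothetical and are not taken from the paper.

```python
import math


def generalized_log(f0_hz, gamma):
    """Generalized logarithm of a positive F0 value (assumed form).

    gamma == 0 gives the ordinary natural log; gamma == 1 gives f0_hz - 1,
    i.e. a linear scale. Values in between interpolate the two.
    """
    if f0_hz <= 0:
        raise ValueError("F0 must be positive (apply to voiced frames only)")
    if gamma == 0:
        return math.log(f0_hz)
    return (f0_hz ** gamma - 1.0) / gamma


def inverse_generalized_log(value, gamma):
    """Map a transformed value back to F0 in Hz."""
    if gamma == 0:
        return math.exp(value)
    return (gamma * value + 1.0) ** (1.0 / gamma)
```

Under this assumption, statistical modelling would be carried out on the transformed values, and generated trajectories would be mapped back to Hz with the inverse function.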
These voices were built in collaboration with the artist James Coupe.
J. Yamagishi, S. King,
``Simple methods for improving speaker-similarity of HMM-based speech synthesis,''
ICASSP 2010 (under review).