Romanian speech synthesis (RSS) corpus and HMM/unit-selection voices built using a high sampling frequency
This paper first introduces a newly recorded, high-quality Romanian speech corpus designed for speech synthesis, called 'RSS', along with unit-selection and HMM-based synthetic voices built from the corpus. All of these will be made freely available for academic use to promote Romanian speech research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker, and was recorded using three microphones at a 96 kHz sampling frequency in a hemi-anechoic chamber.
Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As the audio samples below demonstrate, these choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.
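To make the interplay between sampling frequency and auditory warping scale concrete, the following minimal sketch prints the Nyquist frequency for a few sampling rates and maps it onto two auditory scales. The abstract does not say which scales or rates were compared, so the choices here are assumptions for illustration only: the mel scale (O'Shaughnessy formula), the Bark scale (Zwicker and Terhardt approximation), and the rates 16, 48 and 96 kHz. The function names are hypothetical.

```python
import math


def nyquist_hz(sampling_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2.0


def hz_to_mel(f_hz: float) -> float:
    """Mel scale, O'Shaughnessy formula (assumed; not taken from the paper)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz: float) -> float:
    """Bark scale, Zwicker and Terhardt approximation (assumed)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


if __name__ == "__main__":
    # Illustrative sampling rates; only 96 kHz is stated in the abstract.
    for fs in (16000, 48000, 96000):
        f_max = nyquist_hz(fs)
        print(
            f"fs = {fs / 1000:.0f} kHz: "
            f"Nyquist = {f_max / 1000:.0f} kHz, "
            f"mel = {hz_to_mel(f_max):.0f}, "
            f"Bark = {hz_to_bark(f_max):.1f}"
        )
```

Both scales compress high frequencies, so raising the sampling rate extends the analysis band while adding comparatively little range on the warped axis. How a given feature extractor distributes its resolution over that range depends on the warping scale chosen, which is one reason the two settings are worth revisiting together.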