P.L. De Leon, M. Pucher, J. Yamagishi, I. Hernaez, I. Saratxaga
"Evaluation of Speaker Verification Security and Detection of HMM-based Synthetic Speech"
IEEE Audio, Speech, & Language Processing, vol. 20, no. 8 Oct 2012
In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
K. Oura, J. Yamagishi, M. Wester, S. King, K. Tokuda
"Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping"
Speech Communication, Vol 54, Issue 6, pp.704-714, July 2012
In the EMIME project, we
developed a mobile device that performs personalized
speech-to-speech translation such that a user’s spoken input in one
language is used to produce spoken output in another language,
while continuing to sound like the user’s voice. We integrated two
techniques into a single architecture: unsupervised adaptation for
HMM-based TTS using word-based large-vocabulary continuous speech
recognition, and cross-lingual speaker adaptation (CLSA) for
HMM-based TTS. The CLSA is based on a state-level transform mapping
learned using minimum Kullback–Leibler divergence between pairs of
HMM states in the input and output languages. Thus, an unsupervised
cross-lingual speaker adaptation system was developed. End-to-end
speech-to-speech translation systems for four languages (English,
Finnish, Mandarin, and Japanese) were constructed within this
framework. In this paper, the English-to-Japanese adaptation is
evaluated. Listening tests demonstrate that adapted voices sound
more similar to a target speaker than average voices and that
differences between supervised and unsupervised cross-lingual
speaker adaptation are small. Calculating the KLD state-mapping on
only the first 10 mel-cepstral coefficients leads to huge savings
in computational costs, without any detrimental effect on the
quality of the synthetic speech.
Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping
Keiichiro Oura, Junichi Yamagishi, Mirjam Wester, Simon King, Keiichi Tokuda,
Speech Communication, Available online 5 January 2012
Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction
Junichi Yamagishi, Christophe Veaux, Simon King and Steve Renals
Acoustical Science and Technology
Vol. 33 (2012) , No. 1 pp.1-5
山岸順一, C. Veaux, S. King, S. Renals,
（解説）音声の障害患者のための音声合成技術 – Voice banking and reconstruction
日本音響学会誌67巻12号, pp587-592, 2011
Spontaneous conversational speech has many characteristics that are currently not modelled well by HMM-based speech synthesis and in order to build synthetic voices that can give an impression of someone partaking in a conversation, we need to utilise data that exhibits more of the speech phenomena associated with conversations than the more generally used carefully read aloud sentences. In this paper we show that synthetic voices built with HMM-based speech synthesis techniques from conversational speech data, preserved segmental and prosodic characteristics of frequent conversational speech phenomena. An analysis of an evaluation investigating the perception of quality and speaking style of HMM-based voices confirms that speech with conversational characteristics are instrumental for listeners to perceive successful integration of conversational speech phenomena in synthetic speech. The achieved synthetic speech quality provides an encouraging start for the continued use of conversational speech in HMM-based speech synthesis.
This paper introduces a new speech corpus named "RSS" and HMM-based speech synthesis systems using higher sampling rates such as 48kHz. The following is abstract.
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called “RSS”, along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given.
Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.
The following is the title and abstract
HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering
This paper describes an hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering for generating natural sounding synthetic speech. In the proposed method, speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and thus parametrized into excitation and spectral features. The source and filter features are modeled individually in the framework of HMM and generated in the synthesis stage according to the text input. The glottal excitation is synthesized through interpolating and concatenating natural glottal flow pulses, and the excitation signal is further modified according to the spectrum of the desired voice source characteristics. Speech is synthesized by filtering the reconstructed source signal with the vocal tract filter. Experiments show that the proposed system is capable of generating natural sounding speech, and the quality is clearly better compared to two HMM-based speech synthesis systems based on widely used vocoder techniques.
This paper carefully analyses how we can use HMMs for prediction of articulatory movements from given speech and/or texts. The following is abstract.
This paper presents an investigation into predicting the movement of a speaker’s mouth from text input using hidden Markov models (HMM). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchro- nous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average root mean square (RMS) error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used.
The first paper describes on 1000s voices which you can see 'voices of the world' demos. The second paper mentions on child speech created using HMM adaptation and voice conversion techniques.
Thousands of Voices for HMM-Based Speech Synthesis – Analysis and Application of TTS Systems Built on Various ASR Corpora
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
Synthesis of Child Speech With HMM Adaptation and Voice Conversion
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesizer from that data. We chose to build a statistical parametric synthesizer using the hidden Markov model (HMM)-based system HTS, as this technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. Six different configurations of the synthesizer were compared, using both speaker-dependent and speaker-adaptive modeling techniques, and using varying amounts of data. For comparison with HMM adaptation, techniques from voice conversion were used to transform existing synthesizers to the characteristics of the target speaker. Speaker-adaptive voices generally outperformed child speaker-dependent voices in the evaluation. HMM adaptation outperformed voice conversion style techniques when using the full target speaker corpus; with fewer adaptation data, however, no significant listener preference for either HMM adaptation or voice conversion methods was found.
This paper carefully analyses how the unit selection and HTS behave for emotional speech. The following is abstract.
We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates and emotional strength were used for the six emotions which we recorded – happiness, sadness, anger, surprise, fear, disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion. Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method.
His online demonstration is available from here or here
S. Creer, P. Green, S. Cunningham, and J. Yamagishi
“Building personalised synthesised voices for individuals with dysarthria using the HTS toolkit,”
Computer Synthesized Speech Technologies: Tools for Aiding Impairment
John W. Mullennix and Steven E. Stern (Eds), IGI Global press, Jan. 2010.
This voice reconstruction project of the University Sheffield have been introduced in several news articles last year.
Press release 1 Press release 2
Telegraph 1 Telegraph 2