Three selected recent journal publications

Z.-H. Ling, K. Richmond, J. Yamagishi, R.-H. Wang
“Integrating Articulatory Features into HMM-based Parametric Speech Synthesis,”
IEEE Audio, Speech, & Language Processing,
vol.17, no.6, pp.1171-1185, August 2009

This paper presents a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into a Hidden Markov Model (HMM)-based parametric speech synthesis system. In contrast to model adaptation and interpolation approaches for speaking style control, this method is driven by phonetic knowledge, and target speech samples are not required. The joint distribution of parallel acoustic and articulatory features considering cross-stream feature dependency is estimated. At synthesis time, acoustic and articulatory features are generated simultaneously based on the maximum-likelihood criterion. The synthetic speech can be controlled flexibly by modifying the generated articulatory features according to arbitrary phonetic rules in the parameter generation process. Our experiments show that the proposed method is effective in both changing the overall character of synthesised speech and in controlling the quality of a specific vowel.

J. Yamagishi, T. Nose, H. Zen, Z. Ling, T. Toda, K. Tokuda, S. King, S. Renals
“Robust Speaker-Adaptive HMM-based Text-to-Speech Synthesis,”
IEEE Audio, Speech, & Language Processing,
vol.17, no.6, pp.1208-1230, August 2009

As speech synthesis techniques become more advanced, we are able to consider building high-quality voices from data collected outside the usual highly-controlled recording studio environment. This presents new challenges that are not present in conventional text-to-speech synthesis: the available speech data are not perfectly clean, the recording conditions are not consistent, and/or the phonetic balance of the material is not ideal. Although a clear picture of the performance of various speech synthesis techniques (e.g., concatenative, HMM-based or hybrid) under good conditions is provided by the Blizzard Challenge, it is not well understood how robust these algorithms are to less favourable conditions. In this paper, we analyse the performance of several speech synthesis methods under such conditions. This is, as far as we know, a new research topic: “Robust speech synthesis.” As a consequence of our investigations, we propose a new robust training method for speaker-adaptive HMM-based speech synthesis, for use with speech data collected in unfavourable conditions. In addition to robust speech synthesis, this paper reports various improvements made for the 2007 and 2008 Blizzard Challenges, together with their results. In the 2008 Challenge, the proposed systems achieved the best naturalness on the smaller data set and the best intelligibility on both data sets. In addition, the two English systems were found to be as intelligible as human speech. We believe that this is a landmark achievement for speech synthesis research.

S. Creer, P. Green, S. Cunningham, and J. Yamagishi
“Building personalised synthesised voices for individuals with dysarthria using the HTS toolkit,”
Computer Synthesized Speech Technologies: Tools for Aiding Impairment
John W. Mullennix and Steven E. Stern (Eds), IGI Global press, Jan. 2010.

When the speech of an individual becomes unintelligible due to a neurological disorder, a synthesised voice can replace that of the individual. To fully replace all the functions of human speech communication (communicating information, maintaining social relationships and displaying identity), the voice must be intelligible and natural-sounding, and must retain the vocal identity of the speaker. For speakers with dysarthria, achieving this output with minimal data recordings and deteriorating speech is difficult. An alternative is to use Hidden Markov Models (HMMs), which require much less speech data than concatenative methods, to adapt a robust statistical model of speech towards the speaker characteristics captured in the data recorded by the individual. This paper implements this technique using the HTS toolkit to build personalised synthetic voices for two individuals with dysarthria. An evaluation of the voices by the participants themselves suggests that this technique shows promise for building and reconstructing personalised voices for individuals with dysarthria once deterioration has begun.

There are over 100 publications in total. The full lists are available from the menus on the left.