A bit more about voice cloning

Creating an individual's synthetic voice would normally involve recording thousands of sentences, chosen for their phonetic coverage, so that all the speech sounds are captured and the synthesised speech sounds as natural as possible. The main idea behind voice cloning is that much of the recorded speech can come from other speakers (voice donors) rather than all of it being collected from one speaker (the patient).

In the first instance, a large amount of speech data is collected from several voice donors and analysed in order to create an "average voice" model. This average voice model is then used as a template to extrapolate the speech sounds that are missing from the short recording of the target speaker (patient). This extrapolation process, known as speaker adaptation, allows the creation of a synthetic version of the patient's voice from a very limited amount of the patient's speech. For example, it has been shown that recording as little as 20 minutes of the patient's speech is sufficient to generate a synthetic voice that sounds similar to their own voice.
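The idea of speaker adaptation can be illustrated with a small toy sketch. This is not the method used in the actual voice cloning system; it assumes, purely for illustration, that a voice is summarised by a single vector of acoustic feature means, and it uses a simple MAP-style interpolation between the average voice and the patient's recordings. The function name `adapt_mean` and the parameter `tau` are hypothetical.

```python
from typing import List


def adapt_mean(average_mean: List[float],
               patient_frames: List[List[float]],
               tau: float = 10.0) -> List[float]:
    """Blend the average-voice mean with the patient's sample mean.

    Hypothetical toy model: `tau` controls how much patient data is
    needed before the result moves away from the average voice.
    """
    n = len(patient_frames)
    if n == 0:
        # No patient data at all: fall back to the average voice.
        return list(average_mean)

    dim = len(average_mean)
    sample_mean = [sum(frame[d] for frame in patient_frames) / n
                   for d in range(dim)]

    # The more patient frames there are, the more weight they receive.
    weight = n / (n + tau)
    return [weight * s + (1.0 - weight) * a
            for s, a in zip(sample_mean, average_mean)]


# Made-up numbers: 10 patient frames, each equal to [1.0, 2.0].
average = [0.0, 0.0]
frames = [[1.0, 2.0] for _ in range(10)]
adapted = adapt_mean(average, frames, tau=10.0)
print(adapted)
```

With only a few patient frames the result stays close to the average voice, and it moves towards the patient's own characteristics as more speech is recorded. This mirrors, in a very simplified way, why a short recording can be enough when a good average voice model is available.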

The voice cloning process is most successful when the average voice model is already close to the voice characteristics of the target speaker (patient). Therefore, we aim to record a large catalogue of healthy voices from which we can derive a set of average voice models corresponding to different combinations of age, gender and regional accent.
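As a rough illustration of how a suitable average voice model might be chosen from such a catalogue, the sketch below picks the model whose age, gender and accent labels match the patient's most closely. The class `VoiceProfile`, the function `pick_average_model`, the model names and the label values are all hypothetical; a real system would more likely compare acoustic characteristics than count matching labels.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VoiceProfile:
    age_band: str   # hypothetical labels, e.g. "40-60"
    gender: str
    accent: str


def match_score(a: VoiceProfile, b: VoiceProfile) -> int:
    """Count how many of the three attributes agree."""
    return sum([a.age_band == b.age_band,
                a.gender == b.gender,
                a.accent == b.accent])


def pick_average_model(patient: VoiceProfile,
                       catalogue: Dict[str, VoiceProfile]) -> str:
    """Return the name of the catalogue entry closest to the patient."""
    if not catalogue:
        raise ValueError("catalogue is empty")
    # On a tie, max() keeps the first entry in insertion order.
    return max(catalogue, key=lambda name: match_score(patient, catalogue[name]))


# Made-up catalogue of average voice models.
catalogue = {
    "avg_f_young_north": VoiceProfile("20-40", "female", "north"),
    "avg_m_mid_north": VoiceProfile("40-60", "male", "north"),
    "avg_m_mid_south": VoiceProfile("40-60", "male", "south"),
}
patient = VoiceProfile("40-60", "male", "south")
chosen = pick_average_model(patient, catalogue)
print(chosen)
```

The larger and more varied the catalogue of donor voices, the more likely it is that some average voice model is a close match for any given patient.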

[Figure: Speaker adaptation]
[Figure: Voice cloning process]
