Analysis of Individual Domain Factors in Self-Supervised Pre-Training


Ramon Sanabria, Wei-Ning Hsu, Alexei Baevski, Michael Auli

In Preprint [Paper]


Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment. Previous work has explored the effect of domain mismatch between pre-training and fine-tuning on automatic speech recognition as a whole, but has not dissected the contribution of individual factors. In this paper, we present a controlled study to better understand the effect of such factors on the performance of pre-trained representations. To do so, we pre-train models on modified natural speech or on synthesized audio in which a single domain factor is changed, and then measure automatic speech recognition performance after fine-tuning. Results show that phonetic domain factors play an important role during pre-training, while grammatical and syntactic factors are far less important. To our knowledge, this is the first study to better understand the domain characteristics of self-supervised pre-training for speech.
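
As an illustration of how a single domain factor can be isolated, a minimal sketch of the word-order and phone-order shuffling manipulations applied to a transcript before re-synthesis (an assumption about the general shape of such a step, not the authors' exact pipeline; function names are hypothetical):

```python
import random


def shuffle_words(transcript: str, seed: int = 0) -> str:
    """Shuffle the word order of a transcript before re-synthesis.

    This removes syntactic and semantic structure while keeping the
    vocabulary (and hence the phonetic inventory) intact.
    """
    words = transcript.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def shuffle_phones(phone_seq: list, seed: int = 0) -> list:
    """Shuffle phone order, additionally destroying phonotactic structure."""
    phones = list(phone_seq)
    random.Random(seed).shuffle(phones)
    return phones
```

Seeding the shuffles keeps the manipulation reproducible across synthesis runs, so pre-training corpora differ only in the targeted factor.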



LJSpeech:

Original (synthesized):

Synthesizing 2 words at a time (removing prosody):

Synthesizing 3 words at a time (removing prosody):

Synthesizing 5 words at a time (removing prosody):

Shuffling word order:

Shuffling phone order:

Decipherment:
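
The "synthesizing N words at a time" samples above remove sentence-level prosody by synthesizing short word chunks independently and concatenating the resulting audio. A minimal sketch of the chunking step (an illustration under that assumption; the TTS call itself is omitted and `chunk_words` is a hypothetical helper):

```python
def chunk_words(transcript: str, n: int) -> list:
    """Split a transcript into groups of n words.

    Each chunk is then synthesized on its own, so no prosodic contour
    spans chunk boundaries in the concatenated audio.
    """
    words = transcript.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
```

Smaller chunk sizes remove more prosodic context, which is why samples are provided for 2, 3, and 5 words at a time.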

VCTK:

VCTK (synthesizing 6 words at a time):

Audio modification:

Shuffling word order:

Word random segmentation:

Shuffling phone order:

Phone random segmentation:

Synthetic Language:

Synthetic language composed of 44 (phone-like) tones and noises:
Synthetic language composed of (word-like) noise sequences:
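
To give a sense of how a tone-based synthetic language could be constructed, a minimal sketch that draws "phones" from a fixed inventory of pure tones and concatenates them into an utterance (an assumption for illustration; the inventory spacing, durations, and sample rate are hypothetical, not the authors' generator):

```python
import math
import random

SAMPLE_RATE = 16000  # assumed sample rate


def make_tone(freq_hz: float, dur_s: float = 0.08) -> list:
    """One 'phone': a short pure tone at the given frequency."""
    n = int(SAMPLE_RATE * dur_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def synthetic_utterance(n_phones: int = 20, inventory_size: int = 44,
                        seed: int = 0) -> list:
    """Concatenate randomly drawn tones from a 44-symbol inventory,
    mimicking a phone-like synthetic language with no linguistic content."""
    rng = random.Random(seed)
    # assumed inventory: tones spaced evenly between 200 Hz and 2 kHz
    inventory = [200 + i * (2000 - 200) / (inventory_size - 1)
                 for i in range(inventory_size)]
    samples = []
    for _ in range(n_phones):
        samples.extend(make_tone(rng.choice(inventory)))
    return samples
```

Such audio preserves a discrete, phone-like unit structure while discarding all lexical and syntactic content, which is the point of the synthetic-language condition.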