Comparison of unsupervised adaptation for speaker-adaptive HMM-based speech synthesis


Strategy      ASR system                                     ASR WER (%)  Sample 1  Sample 2
Supervised    --                                             0.0
Unsupervised  SI84 (1st pass, 5k lexicon/bigram)             41.0
Unsupervised  SI84 (2nd pass, 5k lexicon/bigram)             37.9
Unsupervised  SI84 (1st pass, 20k lexicon/bigram)            16.7
Unsupervised  SI84 (2nd pass, 20k lexicon/bigram)            11.7
Unsupervised  AMI (1st pass, 50k lexicon/bigram)             25.1
Unsupervised  AMI (1st pass, 50k lexicon, 4-gram rescoring)  17.0
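
For readers unfamiliar with the WER column, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognised transcript, divided by the number of reference words. The function name word_error_rate is hypothetical, and the scoring tool actually used for the figures above is not stated in the source.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between two whitespace-tokenised transcripts.

    Illustrative only: real scoring tools usually also normalise case,
    punctuation and so on, which this sketch does not attempt.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Under this definition the supervised row is 0.0 because the adaptation labels are taken from the correct transcripts rather than from a recogniser.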


System configuration
Training data for average voice model: SI-84 set of the WSJ corpora
Adaptation data: 40 'block adaptation' sentences included in the November 1993 CSR H2 task
Model: state-tied context-dependent MSD-HSMMs
Adaptation: CSMAPLR+MAP
Acoustic features: STRAIGHT mel-cepstrum (40-dim), logF0 and aperiodicity, plus their delta and delta-delta features
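
As an illustration of the delta and delta-delta features mentioned above, here is a minimal sketch that appends dynamic features to a sequence of static feature vectors. The regression windows used here (delta = 0.5 * (c[t+1] - c[t-1]), delta-delta = c[t-1] - 2*c[t] + c[t+1]) are an assumption based on a common convention in HMM-based synthesis; the source does not state which windows were used. The function name append_dynamic_features is hypothetical, and the sketch ignores the unvoiced frames in which logF0 is undefined.

```python
from typing import List, Sequence


def append_dynamic_features(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return [static, delta, delta-delta] for each frame.

    Assumed windows (not taken from the source):
      delta[t]       = 0.5 * (c[t+1] - c[t-1])
      delta-delta[t] = c[t-1] - 2 * c[t] + c[t+1]
    Edge frames reuse the nearest available neighbour.
    """
    n = len(frames)
    out: List[List[float]] = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        curr = frames[t]
        nxt = frames[min(t + 1, n - 1)]
        delta = [0.5 * (b - a) for a, b in zip(prev, nxt)]
        delta2 = [a - 2.0 * c + b for a, c, b in zip(prev, curr, nxt)]
        out.append(list(curr) + delta + delta2)
    return out
```

With these windows, each D-dimensional static stream becomes a 3D-dimensional observation per frame.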