Comparison of the number of states and frame-shifts for speaker-adaptive HMM-based speech synthesis


3 states with 10 ms frame-shift
5 states with 5 ms frame-shift
12 states with 2 ms frame-shift
Sample 1
Sample 2
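To make the trade-off between the three configurations concrete, the following minimal sketch computes the frame rate and the shortest duration one model can represent in each case. It assumes that every state emits at least one frame, so the minimum is the number of states multiplied by the frame-shift. The function and variable names are illustrative and are not taken from the systems compared here.

```python
# (number of states, frame-shift in ms) for the three compared configurations
CONFIGS = [(3, 10), (5, 5), (12, 2)]


def frames_per_second(frame_shift_ms):
    """Number of analysis frames generated per second of speech."""
    return 1000 // frame_shift_ms


def min_model_duration_ms(num_states, frame_shift_ms):
    """Shortest duration one model can represent.

    Assumption: each state emits at least one frame.
    """
    return num_states * frame_shift_ms


summary = [
    (states, shift, frames_per_second(shift), min_model_duration_ms(states, shift))
    for states, shift in CONFIGS
]

for states, shift, fps, min_ms in summary:
    print(f"{states} states, {shift} ms shift: {fps} frames/s, minimum {min_ms} ms per model")
```

Under this assumption the minimum model duration stays in a similar range (30, 25 and 24 ms) across the three configurations, while the frame rate rises from 100 to 500 frames per second.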


System configuration
Training data for average voice model: SI-84 set of the WSJ corpora
Adaptation data: 40 'block adaptation' sentences from the November 1993 CSR H2 task
Model: state-tied context-dependent MSD-HSMMs
Adaptation: CSMAPLR+MAP
Acoustic features: STRAIGHT mel-cepstrum (40-dim), log F0 and aperiodicity, plus their delta and delta-delta features
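As an illustration of how delta and delta-delta features can be appended to a static feature track, here is a minimal sketch. The window coefficients (-0.5, 0, 0.5) and (1, -2, 1) are a common choice and are an assumption here, since the page does not state which windows were used. The edge handling (repeating the first and last frame) and all function names are also illustrative. The sketch treats a single continuous track and ignores the unvoiced frames of log F0.

```python
# Assumed regression windows; the page does not specify the ones actually used.
DELTA_WIN = [-0.5, 0.0, 0.5]
DELTA2_WIN = [1.0, -2.0, 1.0]


def apply_window(track, window):
    """Apply a centred window to a 1-D track, repeating the edge values."""
    half = len(window) // 2
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        acc = 0.0
        for k, w in enumerate(window):
            # Clamp the index so frames beyond the ends reuse the edge frame.
            idx = min(max(t + k - half, 0), last)
            acc += w * track[idx]
        out.append(acc)
    return out


def add_dynamic_features(track):
    """Return one [static, delta, delta-delta] triple per frame."""
    delta = apply_window(track, DELTA_WIN)
    delta2 = apply_window(track, DELTA2_WIN)
    return [list(frame) for frame in zip(track, delta, delta2)]


if __name__ == "__main__":
    for frame in add_dynamic_features([0.0, 1.0, 2.0, 3.0]):
        print(frame)
```

Applied to each dimension of a feature stream, this triples the size of that stream's observation vector.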