Bridging the gap between HMM-based ASR and TTS systems

Both TTS and ASR use HMMs to represent subword units of speech.
However, ASR and HTS differ in many practical respects. In order to unify
them, we need to bridge the gap while keeping the performance of both systems
as good as possible.

Gap between ASR and HTS
Spectral representation and its analysis order
ASR typically uses 12+C0 spectral features,
whereas HTS uses higher-order spectral features.
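This order difference can be illustrated with a toy numpy sketch (not a full front-end): cepstral coefficients are the DCT of a log-power spectrum, ASR truncates to a low order (C0..C12), and an HTS-style system keeps a higher order to preserve spectral detail. The function name, spectrum size, and orders here are illustrative assumptions.

```python
import numpy as np

def cepstrum(log_spectrum, order):
    """DCT-II of one log-spectral frame, truncated to `order` coefficients."""
    n = len(log_spectrum)
    k = np.arange(order)[:, None]          # coefficient indices
    m = np.arange(n)[None, :]              # spectral bin indices
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return basis @ log_spectrum

rng = np.random.default_rng(0)
frame = np.log(rng.random(64) + 1.0)       # fake log-power spectrum

asr_feats = cepstrum(frame, order=13)      # "12 + C0" style ASR features
hts_feats = cepstrum(frame, order=40)      # higher-order HTS-style features

print(asr_feats.shape, hts_feats.shape)    # (13,) (40,)
# Each coefficient is computed independently, so the low-order part is
# identical; HTS simply keeps more of the spectral detail.
print(np.allclose(asr_feats, hts_feats[:13]))  # True
```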

Frame-shift and number of states
ASR typically uses a 10 ms frame shift with 3-state HMMs,
whereas HTS uses a smaller frame shift with HMMs having more states.
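A quick back-of-the-envelope sketch shows how these choices change the temporal resolution each HMM state must model. The 5 ms shift, 5-state topology, and 250 ms phone duration are illustrative assumptions, not fixed values from the text.

```python
def frames_per_second(frame_shift_ms):
    """Number of feature frames produced per second of audio."""
    return 1000 // frame_shift_ms

asr_fps = frames_per_second(10)   # typical ASR: 10 ms shift -> 100 frames/s
hts_fps = frames_per_second(5)    # HTS-style: e.g. 5 ms shift -> 200 frames/s
print(asr_fps, hts_fps)

# For a hypothetical 250 ms phone, frames available per HMM state:
phone_ms = 250
asr_frames_per_state = (phone_ms // 10) / 3   # 3-state ASR HMM, ~8.3
hts_frames_per_state = (phone_ms // 5) / 5    # 5-state HTS HMM, 10.0
print(asr_frames_per_state, hts_frames_per_state)
```

The smaller shift and larger state count give HTS finer control over trajectories, at the cost of more parameters and computation.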

Lexicon and phonesets
CSTR's precise TTS lexicon, 'Unilex',
vs. the popular CMU lexicon

Decision tree-based clustering
ASR typically uses phonetic decision trees for parameter tying of HMMs,
whereas HTS uses a single shared decision tree for this purpose.
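The idea behind phonetic decision-tree tying can be sketched with a toy example: states whose phones answer the same yes/no phonetic questions share parameters. The question set and phone labels below are illustrative assumptions; real toolkits grow the trees greedily from likelihood gains.

```python
# Hypothetical phonetic question set: name -> set of phones answering "yes".
QUESTIONS = {
    "Is_Vowel": {"aa", "iy", "eh"},
    "Is_Nasal": {"m", "n", "ng"},
}

def tie(phones):
    """Group phones by their yes/no answers to the question set."""
    clusters = {}
    for p in phones:
        key = tuple(p in yes_set for yes_set in QUESTIONS.values())
        clusters.setdefault(key, []).append(p)
    return list(clusters.values())

clusters = tie(["aa", "iy", "m", "n", "t"])
print(clusters)
# Vowels, nasals, and 't' end up in separate tied clusters, so the
# Gaussians of 'aa' and 'iy' states would share parameters.
```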

Speaker adaptation
Any adaptation algorithm can be used for both ASR and HTS.
See how effective these adaptation algorithms are in terms of
ASR's WER and TTS's MOS.
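One widely used example is MLLR-style adaptation, which transforms the Gaussian means of the HMMs with a shared affine transform, mu' = A mu + b. The sketch below only applies a given transform to a set of means; estimating A and b from adaptation data is omitted, and all sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
means = rng.standard_normal((10, dim))     # 10 tied-state Gaussian means

# A hypothetical, already-estimated affine transform (near identity).
A = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
b = 0.05 * rng.standard_normal(dim)

adapted = means @ A.T + b                  # mu' = A mu + b for every mean
print(adapted.shape)                       # (10, 4)
```

Because the same affine transform applies to any set of Gaussian means, a transform estimated on ASR models can in principle also adapt TTS models with a compatible parameterisation.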

Unsupervised adaptation
(Finnish samples)
Unsupervised adaptation techniques for ASR are likewise available for HTS.
However, ASR typically transcribes only word information, whereas HTS requires
prosodic contexts such as stress and accent information.


Reference:
John Dines, Junichi Yamagishi, and EMIME project members
coming soon!