Year one PhD review



Sameer Bansal

Supervisors:
Sharon Goldwater, Adam Lopez,
(unofficial: Herman Kamper)

Haiti earthquake, 2010

Ti ekipman Lopital General genyen yo paka minm fè 24 è
General Hospital has less than 24 hrs. supplies


Moun kwense nan Sakre Kè nan Pòtoprens
People trapped in Sacred Heart Church, PauP

  • >200,000 death toll
  • International rescue teams face language barrier
  • Crowdsourced translations through network of volunteers
  • Haitian Creole not supported by Google Translate at the time


text messages source: Munro 2010

image + voice messages

mobile-driven economy



What happens to people when apps which link them to customers do not speak their language?

Endangered languages

473 of the estimated 7000 spoken languages categorized as endangered

few remaining experts or native speakers

how can we collect data efficiently while there is still time?



Language resource classification

  • zero - only audio
  • high - audio + transcription
  • low - audio + some side information:
    • aligned + unaligned text
    • images, videos, gestures
    • translations in a high resource language

Lack of training data

Send people to collect data?

view expressed by leading research scientist

Explore models to learn from weakly labeled data?

Key idea


Foreign speech

+

Cross-lingual side information (English translation)

=

Useful?

Learning from weakly labeled speech data

Why English translations?

Unsupervised Spoken Term Discovery



Aim: discover patterns in speech data

Hypothesis: repeated segments in speech are word-like units

Applications



  • Query-by-example (audio search)
  • Topic modeling


  • Speech translation without ASR (coming soon ...)

Audio data comparison

  • hard problem
  • high dimensional data
  • variation in the acoustic signal across speakers (e.g. male and female voices)
  • background noise, filler words


CALLHOME Spanish dataset

  • ~18 hours of Spanish audio
  • 104 telephone conversations between native speakers
  • access to transcriptions (for evaluation)
  • English translations

  • FISHER Spanish dataset gives us an extra ~100 hours of speech, with transcripts + translations


Telephone call excerpt

spk 1: me mande mi acta de bautismo
spk 2: una nueva
spk 1: sí oye mamá
spk 3: sí mi hijo
spk 1: mi acta de bautismo
spk 3:


Raw audio data


Spoken Term Discovery - methods

  • Segmental DTW (SDTW) - Park and Glass 2008

  • Zero Resource Toolkit (ZRTools) - Jansen et al. 2011
    code released Aug 2015



In our work we use ZRTools

    Why?
  • Open source and available
  • Fast: 1 hr of speech takes 6 mins of processing on a single CPU
  • By comparison, SDTW takes 33 hours over a cluster

Dynamic Time Warping (DTW)

  • ASR roots - template based spoken word recognition
  • Comparing two time series of varying lengths
  • Computationally expensive
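
A minimal sketch of textbook DTW, to make the "two time series of varying lengths" point concrete. It assumes 1-D numeric sequences and an absolute-difference frame cost; real speech systems compare sequences of acoustic feature vectors and usually normalize the score, so the numbers here are not comparable to the distances shown on the following slides. The function name `dtw_distance` is illustrative only.

```python
def dtw_distance(a, b):
    """Unnormalized DTW cost between two 1-D sequences (illustrative sketch)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = abs(a[i - 1] - b[j - 1])
            # Extend the best of: step in a only, step in b only, step in both
            cost[i][j] = frame_cost + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# The same shape spoken "slower" aligns at zero cost
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
# A constant offset cannot be warped away
print(dtw_distance([0, 0, 0], [1, 1, 1]))
```

The nested loop fills an n × m table, which is why DTW is computationally expensive when every pair of segments in a corpus must be compared.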

Dynamic Time Warping (DTW)

DTW distance: 0.0215

Dynamic Time Warping (DTW)

DTW distance: 0.611

Dynamic Time Warping (DTW)

DTW distance: 0.012

DTW on telephone audio

Looks good! (☞゚∀゚)☞

DTW on telephone audio

Incorrect match (ಥ﹏ಥ)

DTW on telephone audio

Correct match scored poorly (╯°□°)╯︵ ┻┻

Can we improve spoken term discovery?

  • Acoustic information is not sufficient to discriminate between segments
  • We need extra information

English translations

translation similarity is a (noisy) signal of the semantic similarity between the discovered acoustic units



Jaccard similarity (J)

$$J = \frac{|E_1 \cap E_2|}{|E_1 \cup E_2|}$$



where $E_1$ and $E_2$ are the sets of English content words in the two translations
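
A small sketch of the Jaccard computation. The stopword list below is a hypothetical stand-in (the slides do not say which function words were filtered), and the example strings are modeled on the translation pairs shown on the following slides, with the split between the two translations assumed.

```python
# Hypothetical stopword list, for illustration only
STOPWORDS = {"from", "to", "of", "the", "this", "my", "me", "them"}


def content_words(translation):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in translation.lower().split() if w not in STOPWORDS}


def jaccard(translation_1, translation_2):
    """Jaccard similarity between the content-word sets of two translations."""
    e1 = content_words(translation_1)
    e2 = content_words(translation_2)
    union = e1 | e2
    if not union:
        return 0.0  # no content words on either side
    return len(e1 & e2) / len(union)


print(jaccard("from twelfth to eighteenth of september hmm",
              "from the twelfth to the eighteenth"))   # 2 shared / 4 total
print(jaccard("okay this the address two thousand two hundred",
              "two thousand two hundred"))             # 3 shared / 5 total
```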

ZRTools output example



| translation 1 | translation 2 | Acous. sim. | Trans. sim. |
|---|---|---|---|
| (from) twelfth (to) eighteenth (of) september hmm | (from) (the) twelfth (to) (the) eighteenth | 0.90 | 0.5 |

ZRTools output + translations example



| translation 1 | translation 2 | Acous. sim. | Trans. sim. |
|---|---|---|---|
| (to) tell (them) (to) send (me) (my) baptism act | (my) sacrament (of) baptism | 0.93 | 0.125 |

English translations



| translation 1 | translation 2 | Acous. sim. | Trans. sim. |
|---|---|---|---|
| (not) (now) (now) (then) (he) cant anymore | yes well (its) good well yeah | 0.88 | 0 |

English translations



| translation 1 | translation 2 | Acous. sim. | Trans. sim. |
|---|---|---|---|
| okay (this) (the) address two thousand two hundred | two thousand two hundred | 0.86 | 0.6 |

Using translations

$score_i~=~(1-\alpha)~\times~dtw_i~+~\alpha~\times~J_i$



$dtw_i$: acoustic similarity score computed by ZRTools
$J_i$: translation similarity score
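
The interpolation can be sketched as below. The (acoustic, translation) similarity pairs are the four examples from the earlier slides, and $\alpha = 0.4$ is the value used in the experiments; the function name `rescore` is illustrative.

```python
def rescore(dtw_sim, jaccard_sim, alpha=0.4):
    """Interpolate acoustic and translation similarity: (1 - alpha) * dtw + alpha * J."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1 - alpha) * dtw_sim + alpha * jaccard_sim


# (acoustic similarity, translation similarity) for the four example pairs
pairs = [(0.90, 0.5), (0.93, 0.125), (0.88, 0.0), (0.86, 0.6)]

# Rank pairs by the combined score, best first
ranked = sorted(pairs, key=lambda p: rescore(*p), reverse=True)
for dtw_sim, jaccard_sim in ranked:
    print(dtw_sim, jaccard_sim, round(rescore(dtw_sim, jaccard_sim), 3))
```

With $\alpha = 0$ the ranking is purely acoustic; raising $\alpha$ moves pairs with matching translations up the list even when their acoustic score is lower.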

Experiments

Input configurations

| # calls | speech (hrs) |
|---|---|
| 104 | 13.55 |


ZRTools recommends using output pairs with:

  • DTW similarity >= 0.88
  • duration >= 500ms
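
A sketch of applying these two thresholds. The record layout (dicts with `dtw_sim` and `duration_ms` keys) is hypothetical and is not the ZRTools output format; only the threshold values come from the slide.

```python
MIN_DTW_SIM = 0.88      # recommended DTW similarity threshold
MIN_DURATION_MS = 500   # recommended minimum duration


def keep_pair(pair):
    """True if a discovered pair passes both recommended thresholds."""
    return (pair["dtw_sim"] >= MIN_DTW_SIM
            and pair["duration_ms"] >= MIN_DURATION_MS)


# Made-up example records in the hypothetical layout
candidates = [
    {"dtw_sim": 0.93, "duration_ms": 620},
    {"dtw_sim": 0.86, "duration_ms": 700},  # fails the similarity threshold
    {"dtw_sim": 0.90, "duration_ms": 300},  # fails the duration threshold
]
kept = [p for p in candidates if keep_pair(p)]
print(len(kept))
```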

Precision / Recall curve

ROC curve

Precision vs # predictions

Input configurations

| # calls | speech (hrs) |
|---|---|
| 20 | 2.80 |
| 50 | 6.92 |
| 104 | 13.55 |

Rescoring using translation

| # calls | avg. prec. ($\alpha=0$) | avg. prec. ($\alpha=0.4$) | % improv. |
|---|---|---|---|
| 20 | 0.341 | 0.583 | 71% |
| 50 | 0.185 | 0.454 | 145% |
| 104 | 0.121 | 0.354 | 190% |


Rescoring with translations ($\alpha=0.4$) improves average precision in every input configuration, by 71% to 190%
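
For reference, one common way to compute average precision over a ranked list of discovered pairs. This is a sketch under an assumption: it normalizes by the number of correct pairs in the list, and the slides do not state the exact variant used in the experiments.

```python
def average_precision(ranked_labels):
    """Average precision for a ranked list of booleans (best-scored pair first).

    Precision is sampled at the rank of each correct pair, then averaged.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Correct pairs ranked 1st and 3rd: (1/1 + 2/3) / 2
print(average_precision([True, False, True]))
# The same pairs pushed down the ranking score lower: (1/2 + 2/3) / 2
print(average_precision([False, True, True]))
```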

What next?

speech to text translation

  • cluster UTD output
  • use cluster ids to symbolically represent audio utterances
  • train translation model pairing with English translations
  • predict bag of English words for new utterances
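
The second step above (cluster ids as a symbolic representation) can be sketched as follows. The input layout, tuples of `(utterance_id, start_sec, cluster_id)`, and the start times are hypothetical; the cluster ids are taken from the first speech example.

```python
from collections import defaultdict


def build_pseudotext(segments):
    """Map each utterance to its sequence of cluster ids, ordered by start time."""
    by_utterance = defaultdict(list)
    for utterance_id, start_sec, cluster_id in segments:
        by_utterance[utterance_id].append((start_sec, cluster_id))
    return {
        utterance_id: [cluster_id for _, cluster_id in sorted(items)]
        for utterance_id, items in by_utterance.items()
    }


# Hypothetical discovered segments for one utterance, listed out of order
segments = [
    ("utt1", 1.40, 1737),
    ("utt1", 0.20, 1576),
    ("utt1", 2.75, 11816),
]
print(build_pseudotext(segments))
```

Each utterance's pseudotext can then be paired with its English translation as training data for a translation model.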

speech example

our pred. gold pred. transcript pseudotext
WELL COMPUTERS EVERYTHING AHA EVERYTHING COMPUTERS CO AJÁ COMPUTADORAS PUES 1576 1737 11816

(☞゚∀゚)☞

speech example

our pred. gold pred. transcript pseudotext
TELL MAMY SAD FUN SAD THINKING DAY BIRTHDAY TRISTE PENSANDO DE QUE EL DÍA DE TU CUMPLEAÑOS 293 348 379 2514

(ಥ﹏ಥ)

speech example

our pred. gold pred. transcript pseudotext
OH WELL 'S PROBLEM HEY WELL AH BUENO ESA ES LA CUESTIÓN OYE BUENO Y 1354

(╯°□°)╯︵ ┻┻

clustering patterns

| audio words | purity |
|---|---|
| doscientos, doscientos, ... | 1 |
| claro, claro, claro, ... | 1 |
| okey, okey, pueda, ... | 0.82 |
| 😆 | 0.95 |
| 🤔 | 1 |
| mande, tantas, maneras, bien | 0.125 |
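
A sketch of one common definition of cluster purity: the share of cluster members that carry the most frequent label. The slides do not define the measure, so this is an assumption; the member lists below are made up for illustration.

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of cluster members that share the most frequent label."""
    if not labels:
        raise ValueError("cluster must have at least one member")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# A cluster whose members are all the same word
print(cluster_purity(["claro", "claro", "claro"]))
# Eight members, all different words (w5..w8 are placeholder labels)
print(cluster_purity(["mande", "tantas", "maneras", "bien",
                      "w5", "w6", "w7", "w8"]))
```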

Research plan

  • improve pattern discovery by using more data
    • use more speech data - Fisher corpora, 100 hours
  • speech translation
    • explore better alignment models
  • semi-supervised learning
  • probabilistic model for jointly discovering terms and alignment with translation



Finish PhD by Feb 2019
(▀̿Ĺ̯▀̿ ̿)

Thanks

  • Sharon, Adam and Herman for their continuous encouragement, guidance, patience and critical feedback
  • David Chiang, Antonios, Aren Jansen, Marco, Federico, Sorcha, Ida, Clara and Nik
  • Sponsors - James S McDonnell Foundation Scholar Award and a Google faculty research award

Submissions

  • S. Bansal, H. Kamper, S. Goldwater, and A. Lopez, “Weakly supervised spoken term discovery using cross-lingual side information,” arXiv:1609.06530, cs.CL, 2016
  • Speech translation paper submission for EACL

“... Of the 200 billion stars in the Milky Way galaxy, some – perhaps many – may have inhabited planets ... This is a present from a small distant world, a token of our sounds, our science, our images, our music, our thoughts, and our feelings ... This record represents our hope and our determination and our goodwill in a vast and awesome universe.”

source: Wikipedia, SoundCloud

Thank you!

Questions?

(°o°)☝