ISCA Special Interest Group on Machine Learning for Speech and Language Technology (SIGML)

ISCA SIGML Seminar Series

This seminar series focuses on speech processing, providing a place for speech researchers to present, discuss, learn, and exchange ideas. Presentations can cover (but are not limited to) published work, ongoing work, a review, or even an interesting interpretation of a known approach.

The date and time depend on the speakers' time zone and preference, but we will mostly meet biweekly. The meeting will start ten minutes early for people to hang out and chat, and the presentation is usually about an hour long.

To facilitate discussion, we will follow the Chatham House Rule. The talks will be live-streamed and recorded, but please let us know if you don't feel comfortable being recorded.

Please subscribe to our mailing list. Links to the talks will be distributed a day before each talk.

13 Dec, 2021 5pm (UTC+0) Saurabhchand Bhati Johns Hopkins University
Segmental Contrastive Predictive Coding for Unsupervised Acoustic Segmentation


Automatic discovery of phone or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ contrastive predictive coding (CPC), where the model learns representations by predicting the next frame given past context. However, CPC only looks at the audio signal's structure at the frame level. The speech structure exists beyond frame-level, i.e., at phone level or even higher. We propose a segmental contrastive predictive coding (SCPC) framework to learn from the signal structure at both the frame and phone levels.
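As a rough illustration of the frame-level contrastive objective mentioned above, the sketch below scores a predicted frame against one true next frame and a few distractors. The cosine similarity, the vector sizes, and the function name `info_nce` are assumptions made for this example and are not taken from the talk.

```python
import math


def info_nce(prediction, positive, negatives):
    """Contrastive loss: negative log-probability of the positive candidate.

    prediction, positive: feature vectors (lists of floats).
    negatives: list of distractor vectors. Cosine similarity is an assumption.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # The positive candidate is placed first, followed by the distractors.
    logits = [cosine(prediction, positive)]
    logits += [cosine(prediction, n) for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[0]


# Toy 2-D vectors: the loss is low when the prediction matches the true frame.
good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce([1.0, 0.0], [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
print(good < bad)
```

Minimising this quantity pushes the prediction towards the true next frame and away from the distractors.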

SCPC is a hierarchical model with three stages trained in an end-to-end manner. In the first stage, the model predicts future feature frames and extracts frame-level representation from the raw waveform. In the second stage, a differentiable boundary detector finds variable-length segments. In the last stage, the model predicts future segments to learn segment representations. Experiments show that our model outperforms existing phone and word segmentation methods on TIMIT and Buckeye datasets.

In the end, we use SCPC to extract speech features at the segment level rather than at uniformly spaced frame level (e.g., 10 ms) and produce variable rate representations that change according to the contents of the utterance. We can lower the feature extraction rate from the typical 100 Hz to as low as 14.5 Hz on average while still outperforming the hand-crafted features such as MFCC on the linear phone classification task.
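To make the feature-rate figures concrete, the sketch below computes the average rate obtained when one feature vector is emitted per segment. The boundary list format and the example boundaries are made up for illustration; only the 10 ms / 100 Hz relationship comes from the abstract.

```python
def average_feature_rate(boundaries_s):
    """Return features per second when one feature is emitted per segment.

    boundaries_s: sorted boundary times in seconds, including the start and
    end of the utterance (a hypothetical input format).
    """
    if len(boundaries_s) < 2:
        raise ValueError("need at least a start and an end boundary")
    duration = boundaries_s[-1] - boundaries_s[0]
    if duration <= 0:
        raise ValueError("boundaries must span a positive duration")
    num_segments = len(boundaries_s) - 1
    return num_segments / duration


# Uniform 10 ms frames over one second give the usual 100 Hz rate.
uniform = [i / 100 for i in range(101)]
# Made-up variable-length segments over the same second (illustrative only).
variable = [0.0, 0.05, 0.12, 0.2, 0.31, 0.38, 0.5, 0.57,
            0.66, 0.71, 0.8, 0.86, 0.93, 1.0]

print(average_feature_rate(uniform))
print(average_feature_rate(variable))
```

With these toy boundaries the second utterance is represented by 13 vectors per second instead of 100.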

Bio: Saurabhchand Bhati is a fourth-year Ph.D. student working with the Center for Language and Speech Processing (CLSP) at Johns Hopkins University. His research interests are unsupervised spoken term discovery and unsupervised representation learning. Recently he has been working on jointly learning the segmentation and feature representations. He received his degrees in Electrical Engineering from IIT Hyderabad, India, in 2017. When the models are training in the background, he likes to watch random clips on YouTube.

Past Meetings

29 Nov, 2021 4pm (UTC+0) Marcely Zanon Boito Avignon Université
Attention-based Unsupervised Word Segmentation

video (Due to technical difficulty, the audio starts at 37:10. Sincere apologies to the speaker and the people who missed the talk.)

Documenting languages helps to prevent the extinction of endangered dialects - many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, for which no written form is available, Unsupervised Word Segmentation from speech is a useful, yet challenging, task. It consists of producing time-stamps for slicing utterances into smaller segments corresponding to words.

In this seminar, I will present our speech processing pipeline, which produces word segmentation in a documentation setting. This setting corresponds to leveraging minimal amounts of data: the unsupervised word segmentation task is tackled using only 4 hours of speech data. To cope with the lack of data, we use an attention-based approach that takes advantage of aligned translations in order to ground the discovered word segments.

Bio: Marcely Zanon Boito is a computer scientist and a postdoctoral researcher at Avignon University (France), where she is currently working on speech translation approaches that focus on low-resource languages. She got her PhD from the University Grenoble Alpes in 2021. Her research interests include low-resource approaches for speech processing, computational linguistics and neural machine translation.

16 Nov, 2021 5pm (UTC+0) Kushal Lakhotia Outreach
Generative Spoken Language Modeling from Raw Audio


We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Bio: Kushal is a Senior Staff Applied Scientist at Outreach working on Conversation AI. Previously he was at Facebook AI Research working on speech and natural language processing problems with an interest in self-supervised representation learning and evaluation for different domains. He holds an M.S. from Purdue University and a B.Tech. from NIT Durgapur, and has previously worked at Facebook Reality Labs and Bing Search.

2 Nov, 2021 12pm (UTC+0) Shammur Absar Chowdhury Qatar Computing Research Institute
Fine-Grained Interpretation of End-to-End Speech Models


The end-to-end deep neural network provides a simplified, elegant, and flexible training mechanism with improved performance. These improvements came at the cost of transparency, hindering the practitioners from understanding what features are learned within these deep models, how they are encoded, and how they inter-operate. Such information is important for a better understanding of the models, for debugging, and to ensure fairness in ethical decision-making.

In this talk, we will focus on post-hoc functional interpretability of pretrained speech models using layer- and neuron-level analysis. First, we will introduce a method to analyze an individual neuron or a group of neurons with respect to a pre-defined property (e.g., voice identity) using a simple diagnostic classifier on the captured utterance-level representations. Further, we will study the following questions: (i) is the said property captured in the learned representations? (ii) how is it encoded and distributed? and (iii) can we identify a minimal subset of the network that possesses this information? Finally, we will show how such an analysis can reflect on the task complexity, robustness of the models, presence of task-specific redundancy, or possible bias towards a certain property.

Bio: Shammur Absar Chowdhury is currently a research scientist at Qatar Computing Research Institute (QCRI). Before joining QCRI, Shammur was a postdoc in the Center for Mind/Brain Sciences at the University of Trento. She received her PhD (2017) from the same university, working under the supervision of Prof. Giuseppe Riccardi, on modeling turn-taking dynamics in the context of spontaneous and in-vivo human conversations. Her research interests include: code-switching and dialectal speech recognition systems; explainable speech modules; speech overlaps and turn-taking models. Shammur regularly serves as a reviewer for top conferences and journals dealing with Speech (Interspeech, ICASSP among others) and Natural Language Processing (ACL, EMNLP, COLING).

19 Oct, 2021 4pm (UTC+0) Kaizhi Qian MIT-IBM Watson AI Lab
Unsupervised Speech Disentanglement for Speech Style Transfer


Speech information can be roughly decomposed into four components: linguistic content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is useful in speech analysis and generation applications. Among them, non-parallel many-to-many voice conversion can convert between many speakers without training on parallel data, which is the most challenging speech style transfer paradigm. We did a series of three works to solve the challenges progressively.

First, we proposed AutoVC, the first zero-shot non-parallel timbre conversion framework that solves the over-smoothness problem of the VAE-based methods and the unstable training problem of the GAN-based methods using a simple autoencoder with a carefully designed bottleneck. However, AutoVC and most existing voice conversion methods can only disentangle timbre, while information about pitch, rhythm, and content is still mixed. Further disentangling the remaining speech components is vital for prosody style transfer, but it is an under-determined problem in the absence of explicit annotations for each component, which are difficult and expensive to obtain.

The second work proposed SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed bottlenecks. SpeechSplit is among the first algorithms to separately perform style transfer on timbre, pitch, and rhythm without text transcriptions. Although SpeechSplit has made sizeable progress towards unsupervised prosody style transfer, it cannot extract high-level global prosody style in an unsupervised manner.

The third work proposed AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions. AutoPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by self-expressive representation learning. AutoPST is among the first algorithms to effectively convert prosody style in an unsupervised manner.

Bio: Kaizhi Qian is currently doing research at the MIT-IBM Watson AI Lab. He received his Ph.D. in Electrical and Computer Engineering from UIUC under the supervision of Prof. Mark Hasegawa-Johnson. His work focuses specifically on applications of deep generative models for speech and time-series processing. He has recently been working on unsupervised speech disentanglement for low-resource language processing.

11 Aug, 2021 4pm (UTC+0) Lea Schönherr Ruhr-Universität Bochum
Adversarially Robust Speech Recognition


Speech and speaker recognition systems are integrated into our everyday life; voice assistants answer questions, set timers, or play music, but also send personal messages, control smart homes, and place orders online. They constantly capture and analyze their surrounding environment, making them a gateway for potential attacks. Many security and privacy concerns arise from their built-in automatic speech recognition (ASR) system, including adversarial examples, which sound like benign audio for human listeners but are interpreted by the ASR system as an attacker-chosen, malicious speech command.

In the first part, we analyze adversarial examples for hybrid speech recognition systems that exploit psychoacoustic hearing thresholds. We show that it is possible to calculate inconspicuous adversarial examples by leveraging psychoacoustic principles to limit the audibility of adversarial perturbations. The attack is further extended to be viable when the adversarial example is played via a loudspeaker. For this purpose, we also consider varying room characteristics for the optimization of adversarial examples. The resulting adversarial examples remain viable across different rooms and recording setups.

In addition, we developed a detection mechanism for adversarial examples. We use a one-class classifier trained on uncertainty measures to detect potential adversarial examples as outliers. Furthermore, we propose a different perspective: We accept the presence of adversarial examples against ASR systems, but we require them to be perceivable by human listeners. By applying the principles of psychoacoustics, we can remove semantically irrelevant information from the ASR input and train a model that resembles human perception more closely.

Bio: Lea Schönherr is a postdoctoral researcher at the Cognitive Signal Processing group at Ruhr University Bochum, Germany, in the Cluster of Excellence Cyber Security in the Age of Large-Scale Adversaries (CASA). Her research focuses on adversarial machine learning, the robustness of neural networks, and the security of speech and speaker recognition systems. She finished her M.Sc. degree in Electrical Engineering and Information Technology at Ruhr University Bochum in 2015 and obtained her PhD from Ruhr University Bochum (2021).

28 Jul, 2021 4pm (UTC+0) Felix Kreuk Bar-Ilan University
From Supervised to Self-Supervised: Phoneme Boundary Detection using Novel Neural Architectures


Phoneme Boundary Detection is an essential first step for a variety of speech processing applications such as Speaker Diarization, Speech Science, Keyword Spotting, etc. In the first half of my talk I will present the problem of supervised Phoneme Boundary Detection. Most previous work considered the problem as a binary classification task per time-step. This approach ignores the internal structure of the output by considering boundaries to be conditionally independent. I will present our model called SegFeat that learns segmental representations using a neural network coupled with a structured loss function. We demonstrate the effectiveness of our model on the TIMIT and Buckeye corpora, achieving SOTA performance on both benchmarks.

In the second half I will talk about the problem of Self-Supervised Phoneme Segmentation. Unsupervised Phoneme Segmentation was traditionally done by marking areas of spectral change as possible boundaries. By coupling a CNN with a contrastive loss function we learn a speech representation from the raw waveform for the purpose of phoneme segmentation. During inference, we use a peak detection algorithm over the network outputs to produce the phoneme boundaries. We evaluate our method on the TIMIT and Buckeye datasets, reaching SOTA performance on both corpora. We also show that leveraging additional unlabeled data (from LibriSpeech) leads to better overall performance for unseen data.
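The peak-detection step at inference time can be pictured with a generic peak picker such as the one sketched below. This is not the speaker's algorithm: the threshold, the minimum spacing, the 10 ms frame shift, and the toy scores are all assumptions chosen for illustration.

```python
def pick_peaks(scores, threshold=0.5, min_distance=2):
    """Return indices of local maxima in `scores` that exceed `threshold`.

    Peaks closer than `min_distance` frames to an already accepted, stronger
    peak are dropped. All parameter values are illustrative assumptions.
    """
    candidates = [
        i for i in range(1, len(scores) - 1)
        if scores[i] > threshold
        and scores[i] > scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]
    # Accept the strongest candidates first, then enforce the minimum spacing.
    accepted = []
    for i in sorted(candidates, key=lambda i: scores[i], reverse=True):
        if all(abs(i - j) >= min_distance for j in accepted):
            accepted.append(i)
    return sorted(accepted)


# Hypothetical per-frame boundary scores, e.g. dissimilarity of adjacent frames.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.6, 0.7, 0.2, 0.1, 0.8, 0.4]
frame_shift_s = 0.01  # assumed 10 ms frame shift
boundaries = [round(i * frame_shift_s, 2) for i in pick_peaks(scores)]
print(boundaries)
```

Each accepted peak index is converted to a time stamp by multiplying it with the frame shift.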

Bio: Felix Kreuk is currently a PhD candidate supervised by Professor Joseph Keshet at Bar-Ilan University. His research is focused on speech synthesis and analysis. He was previously a research intern at NVIDIA Research (2019-2021). His MSc work was focused on vulnerabilities of deep-learning models in the speech domain.

14 Jul, 2021 6pm (UTC+0) Weiran Wang Google
Improving ASR for Small Data with Self-Training and Pre-Training


I will talk about two of my (not too) recent projects on end-to-end speech recognition with small data, and provide discussions on recent trends.

In the first part, we propose a self-training method with an end-to-end system for semi-supervised ASR. Starting from a Connectionist Temporal Classification (CTC) system trained on the supervised data, we iteratively generate pseudo-labels on a mini-batch of unsupervised utterances with the current model, and use the pseudo-labels to augment the supervised data for immediate model update. Our method retains the simplicity of end-to-end ASR systems, and can be seen as performing alternating optimization over a well-defined learning objective. We also perform empirical investigations of our method, regarding the effect of data augmentation, decoding beamsize for pseudo-label generation, and freshness of pseudo-labels. On a commonly used semi-supervised ASR setting with the Wall Street Journal (WSJ) corpus, our method gives significant relative WER improvement over a carefully-trained base system with data augmentation.

In the second part, we propose an approach for pre-training speech representations via a masked reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be used directly in typical bidirectional speech recognition models. The pre-trained networks can then be fine-tuned on a smaller amount of supervised data for speech recognition. Experiments with this approach on the LibriSpeech and Wall Street Journal corpora show promising results. We find that the main factors that lead to speech recognition improvements are: masking segments of sufficient width in both time and frequency, pre-training on a much larger amount of unlabeled data than the labeled data, and domain adaptation when the unlabeled and labeled data come from different domains.
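The kind of input corruption described above, masking a span in time and a band in frequency, can be sketched as follows. The mask widths, the zero fill value, and the nested-list spectrogram format are assumptions for this example; the talk's actual masking policy may differ.

```python
import random


def mask_spectrogram(spec, time_width, freq_width, rng):
    """Return a copy of `spec` with one time span and one frequency band zeroed.

    spec: list of frames, each a list of frequency-bin values.
    time_width, freq_width: mask sizes in frames and bins (assumed values).
    rng: a random.Random instance used to draw the mask positions.
    """
    num_frames, num_bins = len(spec), len(spec[0])
    t0 = rng.randrange(0, num_frames - time_width + 1)
    f0 = rng.randrange(0, num_bins - freq_width + 1)
    masked = [list(frame) for frame in spec]  # leave the input untouched
    for t in range(num_frames):
        for f in range(num_bins):
            in_time_mask = t0 <= t < t0 + time_width
            in_freq_mask = f0 <= f < f0 + freq_width
            if in_time_mask or in_freq_mask:
                masked[t][f] = 0.0
    return masked, (t0, f0)


# Toy "spectrogram": 20 frames with 8 frequency bins each, all ones.
spec = [[1.0] * 8 for _ in range(20)]
masked, (t0, f0) = mask_spectrogram(spec, time_width=4, freq_width=2,
                                    rng=random.Random(0))
num_masked = sum(v == 0.0 for frame in masked for v in frame)
print(num_masked)
```

A pre-training setup would then ask the encoder to reconstruct the original values from the masked input.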

Bio: Weiran Wang is currently a research scientist at Google. He was previously a research scientist at Salesforce Research (2019 to 2020), and an applied scientist at Amazon Alexa (2017 to 2019). From 2014 to 2017, he was a postdoc at Toyota Technological Institute at Chicago working with Professor Karen Livescu and Professor Nathan Srebro. He obtained his Ph.D. from the EECS Department at UC Merced in 2013. Weiran has a wide interest in machine learning, and has worked in areas including multi-view learning, representation learning, optimization for machine learning, and speech processing.

30 Jun, 2021 4pm (UTC+0) Zack Hodari The University of Edinburgh
Controlling Prosody with Insufficient Context


Prosody is a key component in human communication: it conveys humour, emotion, personality, connotation, as well as more local effects like emphasis, contrast, and focus. However, state-of-the-art text-to-speech (TTS) methods are lacking when it comes to long-form speech. While individual sentences can be very natural, the prosody generated does not convey additional information. This is due to a lack of prosodic context: the information normally used by humans to decide what to convey with their prosody. Additionally, there is no orthography for prosody, meaning that, without contextual information, prosody appears to be unwanted variation. Together, this means we cannot synthesise prosody based on a transcription, and we cannot predict prosody based on the available context.

Instead, I focus on learning to control prosody, allowing multiple prosodic renditions to be synthesised for a single sentence. This is achieved by explicitly modelling prosodic variation that was previously unaccounted for. This solution does not rely on prosody labels or prosodic context features; however, it does rely on outside input to specify which rendition to use. I found that the renditions are perceptually distinct from one another and represent different prosodic functions. This system also resolves the issue of average prosody, present in typical TTS voices.

Finally, I propose a framework that can incorporate additional contextual information to directly predict appropriate prosodic choices, removing the need for outside input. Additional prosodic context can include new context (i.e. new features) and/or wider context (i.e. neighbouring sentences/speech). This final system was able to reduce the remaining gap between state-of-the-art TTS and natural speech by 26%.

Bio: Zack Hodari is a PhD candidate supervised by Professor Simon King at the Centre for Speech Technology Research (CSTR) in the University of Edinburgh. He obtained an MSc by research from the University of Edinburgh, working on emotion recognition and emotive speech synthesis. His PhD research focuses on speech synthesis and prosody, specifically on producing multiple prosodic renditions of individual sentences.

16 Jun, 2021 5pm (UTC+0) Yuan Gong Massachusetts Institute of Technology
Audio Spectrogram Transformer for Audio Scene Analysis


Audio scene analysis is an active research area and has a wide range of applications. Since the release of AudioSet, great progress has been made in advancing model performance, which mostly comes from the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models, but have not received the attention they deserve. In the first part of the talk, I will present PSLA, a collection of training techniques that can noticeably boost the model accuracy.

On the other hand, in the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In the second part of the talk, I will answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification.

Bio: Yuan Gong is a postdoctoral associate at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received his Ph.D. degree in Computer Science from the University of Notre Dame, and his B.S. degree in Biomedical Engineering from Fudan University. He won the 2017 AVEC depression detection challenge, and one of his papers was nominated for the best student paper award at Interspeech 2019. Currently, his research interests include audio scene analysis, speech-based health systems, and voice anti-spoofing.

2 Jun, 2021 4pm (UTC+0) Benjamin Milde Universität Hamburg
Self-Supervised Representation Learning in Speech Processing


In self-supervised machine learning, we create an artificial supervised task to learn from data without (human) annotations, for example by predicting future data points from current ones. It is considered to be a subset of unsupervised learning, where we are able to learn from unlabeled data, but where we are also able to apply powerful supervised machine learning "machinery". Typically, we are not interested in how well a deep neural network is able to solve an artificial task, but hope to learn useful representations on raw data that can then be used in other downstream applications. Applied to speech processing, self-supervised speech models can serve as a building block in other downstream tasks such as speech classification or augment automatic speech recognition (ASR) models. To train self-supervised speech models, we can make use of large amounts of cheaply acquired untranscribed speech data. In my talk, I will present two speech models that make use of the self-supervised training paradigm. The Unspeech model [1] learns speech context embeddings on speech data without transcriptions or labels. The model and learned embedding representations can be applied to adaptation in ASR models, speech clustering, downstream tasks such as emotion recognition or simple isolated command recognition. In the second model, dubbed Sparsespeech [2,3], we construct a task to learn meaningful discrete units and unsupervised posteriorgrams from speech data. This task is closely related to acoustic unit discovery.

[1] Unspeech: Unsupervised Speech Context Embeddings. Benjamin Milde and Chris Biemann. In: Proceedings of Interspeech 2018, Hyderabad, India.
[2] Sparsespeech: Unsupervised Acoustic Unit Discovery with Memory-Augmented Sequence Autoencoders. Benjamin Milde and Chris Biemann. In: Proceedings of Interspeech 2019, Graz, Austria.
[3] Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization. Benjamin Milde and Chris Biemann. In: Proceedings of Interspeech 2020, Virtual Shanghai, China.

Bio: Benjamin Milde is a PhD student at Universität Hamburg, Germany. He received his degrees from Technische Universität Darmstadt, with a master thesis on unsupervised acoustic unit discovery. His research interests are representation learning in speech signals, unsupervised learning, neural networks and automatic speech recognition.

19 May, 2021 6pm (UTC+0) Trang Tran University of Southern California
Integrating Prosody in Spoken Language Understanding


Prosody comprises aspects of speech that communicate information beyond written words related to syntax, sentiment, intent, discourse, and comprehension. Decades of linguistics research have confirmed the importance of prosody in human speech perception and production, but spoken language understanding systems have struggled to use prosody effectively. In this talk, I will present my thesis work that aimed to integrate prosody in language systems slightly more effectively. My approach was tested on two language understanding tasks: (a) constituency parsing (identifying the syntactic structure of a sentence), and (b) dialog act recognition (identifying the segmentation and intents of utterances in discourse). I will also discuss how these approaches and results can be applied to my current work in clinical psychology—analysis of language patterns in motivational interviewing.

Bio: Trang (pronounced as "Chong") Tran (pronounced as "Chun") is a postdoctoral researcher at the Institute for Creative Technologies at the University of Southern California. She obtained her PhD from the University of Washington, where she was advised by Mari Ostendorf. Her research interests lie in the intersection of speech and natural language processing, particularly applications in education and health. Trang still can't think of noteworthy achievements or cool facts to use as the last sentence of this bio.

5 May, 2021 5pm (UTC+0) Desh Raj Johns Hopkins University
Diarization of Overlapping Speech: Methods and Ensembles


Speaker diarization, or "who spoke when," is an important component of many applications such as meeting transcription. Conventional diarization systems formulated using a clustering approach ignored overlapping speech almost completely, resulting in high error rates. In this talk, I will first describe a reformulation of multi-class spectral clustering which allows overlapping speaker assignment. I will also mention some recent work on another paradigm for handling the problem, which can be summarized as: "separate, then diarize". Finally, since machine learning tasks benefit from ensembles, I will introduce DOVER-Lap, a simple and robust method that combines the outputs of diarization systems. In the freshly-concluded DIHARD-3 challenge, four teams, including the top 2, used DOVER-Lap for system fusion.

Bio: Desh Raj is a third-year PhD student at the Johns Hopkins University, working in the Center for Language and Speech Processing (CLSP). His research interests are in machine learning applied to speech and language, and he currently works on multi-speaker ASR and diarization, with a special focus on handling overlapping speech. He was a part of the JHU team which placed 2nd in the CHiME-6 and DIHARD-3 challenges. Before starting his PhD, he graduated with a B.Tech. in Computer Science from IIT Guwahati (India) in 2017, where he worked on NLP. When he is not doing ML, he likes to climb boulders and read fiction.

21 Apr, 2021 5pm (UTC+0) Leda Sarı University of Illinois Urbana-Champaign
Speaker Adaptation and Fairness


Speech signals are known to be highly variable for reasons such as physical and idiolectal differences between individuals. Therefore, it is usually hard to handle all types of speech signals using a single machine learning model. To solve this problem, several speaker adaptation techniques have been proposed in the automatic speech recognition (ASR) literature. In this talk, I will first present an auxiliary-network-based speaker adaptation model that can jointly detect speaker changes and adapt accordingly. I will then provide a fairness perspective on the speaker adaptation problem and elaborate on our ASR experiments that aim at reducing the performance gap between speakers of different genders and age groups, as well as speakers using different dialects.

Bio: Leda Sari is currently a PhD candidate in the department of Electrical and Computer Engineering at the University of Illinois Urbana-Champaign. She received her B.Sc. and M.Sc. degrees in Electrical and Electronics Engineering at Bogazici University, Istanbul, Turkey. Her research interests include speaker adaptation for automatic speech recognition and spoken language understanding, and fairness in speech recognition. She is a student member of ISCA and IEEE. She was selected as one of the Rising Stars in EECS in 2019.

7 Apr, 2021 4pm (UTC+0) Lucas Ondel Laboratoire Interdisciplinaire des Sciences du Numérique
Subspace Models for Acoustic Unit Discovery


From an early age, infants show an innate ability to infer linguistic structures from the speech signal long before they learn to read and write. In contrast, modern speech recognition systems require large collections of transcribed data to achieve a low error rate. The relatively recent field of Unsupervised Speech Learning (USL) is dedicated to endowing machines with a similar ability. In this talk, I will focus on one of the key problems of USL: Acoustic Unit Discovery (AUD). I will present my latest work: the Subspace Model and Hierarchical Subspace Model for AUD, inspired by how infants learn speech. I will show how these models provide a theoretical framework for knowledge transfer and adaptation, as well as promising results on real low-resource corpora.

Bio: Lucas Ondel is currently doing a post-doc at LISN (formerly known as LIMSI) in Orsay, France. He received a bachelor's degree in embedded systems from Claude Bernard University in Lyon, a master's degree in software engineering from the University of Avignon, and a Ph.D. in speech recognition/machine learning from Brno University of Technology. His research interests include Bayesian models for unsupervised learning of speech, novel acoustic modeling techniques to better capture uncertainty, and general machine learning methods for low-resource languages. He served as a member of the scientific committee for the Zero Resource Challenge 2019 and 2020.