MLSLP is a recurring workshop, often held jointly with machine learning or speech/NLP conferences. Prior workshops were held in 2011, 2012, 2016, 2017, and 2018. While research in speech and language processing has always involved machine learning (ML), current research is benefiting from even closer interaction between these fields. Speech and language processing continually mines new ideas from ML, and ML, in turn, is devoting growing attention to speech and language applications.
This workshop aims to be a venue for identifying and incubating the next waves of research directions at this intersection. It will discuss emerging research ideas with potential for impact in speech and language processing, and bring together relevant researchers from ML and speech/language who may not regularly interact at conferences.
MLSLP is a workshop of SIGML, the SIG on machine learning in speech and language processing of ISCA (the International Speech Communication Association).
National Taiwan University
Is self-supervised learning universal in speech processing tasks?
Self-supervised learning (SSL) has been shown to be vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art results on various tasks with minimal adaptation. However, existing work on SSL has explored only a limited number of speech processing tasks, leaving the generalizability and reusability of SSL models across speech processing tasks unclear. This talk will first introduce the Speech processing Universal PERformance Benchmark (SUPERB), a leaderboard for benchmarking the performance of SSL models across a wide range of speech processing tasks. The results on SUPERB demonstrate that SSL representations generalize competitively across speech processing tasks. I will also share some ongoing research based on SUPERB.
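SUPERB-style evaluation keeps the pretrained upstream model frozen and trains only a lightweight downstream head, commonly on a learned weighted sum of the upstream's per-layer outputs. The NumPy sketch below illustrates just that layer-weighting step; the function name and shapes are illustrative assumptions, not SUPERB's actual code.

```python
import numpy as np

def layer_weighted_features(hidden_states, layer_logits):
    """Collapse per-layer SSL features into one representation using
    learnable softmax weights over layers (a SUPERB-style probe sketch)."""
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()                                 # softmax over layers
    # hidden_states: (num_layers, num_frames, dim)
    return np.tensordot(w, hidden_states, axes=1)   # -> (num_frames, dim)

# Toy example: 3 layers of 5 frames with 4-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 5, 4))
pooled = layer_weighted_features(feats, np.zeros(3))  # zero logits = uniform
print(pooled.shape)  # (5, 4)
```

With zero logits the weights are uniform, so the pooled output equals the mean over layers; in a real probe the logits would be trained jointly with the downstream head.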
Building streamable Transformer models with supervised and semi-supervised learning
Self-supervised speech representation learning has attracted a lot of interest in both industry and academia. It remains a challenge to leverage self-supervision on unlabeled data to improve the performance of streaming ASR models optimized on large-scale labeled data. In this talk, we will introduce our recent work on building streamable Transformer models with supervised and semi-supervised learning. In the first part, we will discuss how to balance accuracy, latency, and runtime cost for streamable Transformer models in real-world scenarios, and describe our streaming method, which significantly outperforms a traditional RNN-T baseline in accuracy with large-scale training data. In the second part, we introduce UniSpeech, which combines a supervised loss and a contrastive loss in a multitask learning manner. UniSpeech outperforms supervised transfer learning and unsupervised pre-training in both low-resource and high-resource scenarios. We further extend the idea to a reconstruction-loss-based model, showing better results on a public evaluation set while significantly decreasing the GPU cost of training.
1:30pm (UTC+0): 15-minute Break
University of Wroclaw
Unsupervised learning for unit discovery in sequential data
Learning representations of data in an unsupervised way is an open and actively researched problem in machine learning. We consider representations of speech and handwriting learned using autoencoders and contrastive self-supervised methods. These representations capture high-level semantic content of the signal, e.g., phoneme or character identities, while being invariant to confounding low-level details such as the underlying pitch contour or background noise. We investigate how to enforce desired properties of the latent representation, concentrating on its temporal segmentation. Finally, we demonstrate the usefulness of such representations for unsupervised acoustic unit discovery on the ZeroSpeech task.
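One common way to turn continuous latent representations into discrete acoustic units is vector quantization: each frame's latent vector is assigned to its nearest codebook entry. The NumPy sketch below is a generic illustration of that assignment step, not the talk's specific model; the function name and toy codebook are assumptions.

```python
import numpy as np

def assign_units(frames, codebook):
    """Map each frame-level latent vector to its nearest codebook entry
    (squared Euclidean distance), yielding one discrete unit per frame."""
    # frames: (T, D) latent vectors; codebook: (K, D) unit prototypes
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return d2.argmin(axis=1)  # unit index per frame

# Toy 2-unit codebook and three frames.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [0.2, 0.1]])
print(assign_units(frames, codebook))  # [0 1 0]
```

In a full unit-discovery system the codebook itself is learned (e.g., via VQ-VAE-style training) and runs of identical units induce a temporal segmentation of the signal.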
Unsupervised subword modeling using cross-lingual phone-aware modeling
There are around 7000 spoken languages in the world. For most of these languages, not enough transcribed speech data is available to train automatic speech recognition (ASR) systems. This leads to the current, unwanted situation in which high-performance ASR systems are only available for a small number of major languages. To bring ASR technology to low-resource languages, investigation of unsupervised acoustic modeling (UAM) methods is necessary; these methods aim to find and model a set of basic speech units that represents all the sounds in the language of interest, i.e., the low-resource target language.
This talk addresses the topic of unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish the subword (phoneme) units of a target language and are robust to linguistically irrelevant factors such as speaker information. Unsupervised subword modeling thus focuses on learning an intermediate representation toward the ultimate goal of UAM. I will present a recently proposed, state-of-the-art method for unsupervised subword modeling, a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer, together with several recent experiments that investigate the method's effectiveness and robustness. Specifically, I will present experiments on how robust the approach is to different amounts of training data, as well as comprehensive and systematic analyses at the phoneme and articulatory feature (AF) level that probe the method's ability to capture specific linguistic information.
3:15pm (UTC+0): 15-minute Break
3:30pm (UTC+0): Poster Session in Gather Town
4:45pm (UTC+0): 15-minute Break
Facebook AI Research
Speech representation learning with wav2vec
Despite rapid progress in the recent past, current speech recognition systems rely heavily on labeled training data. This limits the technology to a small fraction of the languages and accents spoken around the globe. In this talk, I will give an overview of the wav2vec project, which recently enabled speech recognition models trained with no labeled data. The key ingredient is self-supervised pre-training to learn powerful representations of speech audio solely from unlabeled data. Unsupervised speech recognition now rivals some of the best systems published only two years ago that were trained on 960 hours of labeled data, while using no labeled data itself. This is an important step toward systems that can learn to solve tasks without explicit supervision.
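wav2vec-style pre-training is contrastive: the context representation at a masked position must identify the true (quantized) target among sampled distractors, typically via an InfoNCE loss over cosine similarities. Below is a simplified NumPy sketch of such a loss; the function name, temperature, and distractor sampling are illustrative assumptions, not the released implementation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: softmax cross-entropy over cosine
    similarities, with the true target placed at index 0."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # -log P(true target)

rng = np.random.default_rng(1)
anchor = rng.normal(size=16)
# Easy case: positive matches the anchor, distractors are random.
loss_easy = info_nce(anchor, anchor, [rng.normal(size=16) for _ in range(5)])
# Hard case: positive is anti-correlated, distractors match the anchor.
loss_hard = info_nce(anchor, -anchor, [anchor] * 5)
print(loss_easy < loss_hard)  # True
```

Minimizing this loss pushes the context network to produce representations that discriminate the correct future/masked content from distractors, which is what makes the learned features useful for downstream ASR.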
DeCoAR: Deep contextualized acoustic representation and its applications to speech recognition and speech processing
Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition models. The essence of speech representation learning is to extract acoustically meaningful representations from large quantities of unlabeled data via self-supervised learning. There are many successful examples of speech representation learning, including but not limited to wav2vec/wav2vec 2.0, APC/VQ-APC, Mockingjay, and, more recently, HuBERT. In this talk, I am going to describe a collection of our work and its applications to speech recognition and other speech processing tasks. First, I will describe the original DeCoAR (Deep Contextualized Acoustic Representation) work, which encodes input audio into a contextualized representation using bidirectional sequence models. I will then discuss a few modifications that improve DeCoAR, inspired by other speech representation learning work, and demonstrate that DeCoAR and its improved version attain competitive ASR performance using a limited amount of labeled data. Finally, I will present BERTphone, a variation of DeCoAR that combines a self-supervision objective with weak supervision, and its application to speaker and language identification tasks.
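DeCoAR's training objective reconstructs a slice of acoustic frames from the bidirectional context surrounding it. As a rough illustration only, the snippet below scores a hypothetical prediction of such a slice with an L1-style reconstruction loss; it omits the bidirectional encoder entirely, and all names and shapes are assumptions.

```python
import numpy as np

def slice_reconstruction_loss(features, predicted, start, width):
    """Score a predicted slice of frames [start, start+width) against the
    true acoustic features with a mean L1 (absolute) reconstruction loss."""
    target = features[start:start + width]
    return np.abs(target - predicted).mean()

feats = np.arange(12.0).reshape(6, 2)   # 6 frames of 2-dim toy features
pred = feats[2:4] + 0.5                 # a prediction off by 0.5 everywhere
print(slice_reconstruction_loss(feats, pred, start=2, width=2))  # 0.5
```

In the actual model, the prediction comes from forward and backward RNN states flanking the slice, so minimizing this loss forces those states to encode the surrounding acoustic context.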
BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition
Semi-supervised learning, which uses unlabeled data to enhance the performance of labeled tasks, has recently played a crucial part in improving public automatic speech recognition (ASR) benchmarks. In this talk, we summarize the results of a host of efforts using giant ASR models pre-trained on large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training, and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset size, including state-of-the-art (SoTA) performance on many public benchmarks. We also utilize the learned representations of pre-trained networks to achieve SoTA results on non-ASR tasks. In addition to scaling up, we propose w2v-BERT, a framework that combines contrastive learning and masked language modeling (MLM): the former trains the model to discretize continuous input speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations by solving a masked prediction task over the discretized tokens.
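The masked-prediction half of a w2v-BERT-style model predicts the discretized speech token at each masked position. The NumPy sketch below shows that objective in isolation, as cross-entropy at masked positions over a toy token inventory; it is an illustration of the loss shape, not the paper's implementation, and the names are assumptions.

```python
import numpy as np

def masked_token_loss(logits, tokens, mask):
    """Cross-entropy over discretized speech tokens, computed only at
    masked positions (an MLM-style masked-prediction sketch)."""
    # logits: (T, K) scores over K speech tokens; tokens: (T,) targets
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pos = -log_probs[np.arange(len(tokens)), tokens]
    return per_pos[mask].mean()

tokens = np.array([2, 0, 1, 2])                  # toy 3-token inventory
mask = np.array([True, False, True, False])      # positions hidden from model
confident = np.eye(3)[tokens] * 10.0             # logits peaked on true token
print(masked_token_loss(confident, tokens, mask) < 0.01)  # True
```

In the full framework, the contrastive module supplies the target token IDs, so the two objectives are trained jointly rather than with fixed targets as in this toy.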
Michael Auli (Facebook): Michael Auli is a scientist at Facebook AI Research in Menlo Park, California. During his PhD, he worked on natural language processing and parsing at the University of Edinburgh, where he was advised by Adam Lopez and Philipp Koehn. Michael led the team that developed convolutional sequence-to-sequence models, the first non-recurrent models to outperform RNNs for neural machine translation. He also led and co-led the teams that ranked first in several WMT news translation tasks in 2018 and 2019. Currently, Michael works on semi-supervised and self-supervised learning applied to natural language processing and speech recognition.
Jan Chorowski (University of Wroclaw): Jan Chorowski is an Associate Professor at the Faculty of Mathematics and Computer Science at the University of Wrocław and Head of AI at NavAlgo. He received his M.Sc. degree in electrical engineering from the Wrocław University of Technology, Poland, and his EE Ph.D. from the University of Louisville, Kentucky, in 2012. He has worked with several research teams, including Google Brain, Microsoft Research, and Yoshua Bengio's lab at the University of Montreal. He led a research topic during the JSALT 2019 workshop. His research interests are applications of neural networks to problems which are intuitive and easy for humans yet difficult for machines, such as speech and natural language processing.
Hung-yi Lee (National Taiwan University): Hung-yi Lee is currently an associate professor in the Department of Electrical Engineering at National Taiwan University (NTU), with a joint appointment in the Department of Computer Science & Information Engineering. He received his Ph.D. degree from NTU in 2012. From 2012 to 2013, he was a postdoctoral fellow at the Research Center for Information Technology Innovation, Academia Sinica. From 2013 to 2014, he was a visiting scientist in the Spoken Language Systems Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He gave tutorials at ICASSP 2018, APSIPA 2018, ISCSLP 2018, Interspeech 2019, SIPS 2019, and Interspeech 2020. He co-organized the special session "New Trends in Self-Supervised Speech Processing" at Interspeech 2020 and the workshop "Self-Supervised Learning for Speech and Audio Processing" at NeurIPS 2020. He is a member of the IEEE Speech and Language Processing Technical Committee (SLTC).
Yuzong Liu (Amazon): Yuzong Liu is an Applied Science Manager at AWS AI. His team develops speech recognition technology that supports a variety of cloud-based speech-to-text applications, including Amazon Transcribe. Currently, Yuzong and his team work on acoustic modeling research and development, including self-trained and label-free approaches to speech recognition. Before joining AWS, Yuzong spent three years on the Alexa ASR team, working on acoustic modeling and on-device speech recognition, and was the lead scientist in designing the language ID system and multilingual Alexa mode. Yuzong obtained his Ph.D. from the University of Washington in 2016; his dissertation focused on semi-supervised learning for acoustic modeling.
Odette Scharenborg (TU Delft): Odette Scharenborg is an Associate Professor and Delft Technology Fellow in the Multimedia Computing Group at Delft University of Technology, the Netherlands, working on automatic speech processing inspired by human speech processing. She has an interdisciplinary background in automatic speech recognition and psycholinguistics, and uses knowledge of how humans process speech to improve automatic speech recognition systems, with a focus on low-resource languages (including unwritten languages) and low-resource types of speech (oral cancer speech and dysarthric speech), as well as visual information for multi-modal speech learning and processing. In 2017, she co-organized a six-week Frederick Jelinek Memorial Summer Workshop. Since 2017, she has served on the Board of the International Speech Communication Association (ISCA). Since 2018, she has been a member of the IEEE Speech and Language Processing Technical Committee, and since 2019 an Associate Editor of IEEE Signal Processing Letters.
Chengyi Wang (Microsoft): Chengyi Wang is a third-year joint Ph.D. student of Microsoft Research and Nankai University. Her research interests lie in speech recognition and speech translation. She has been interning at Microsoft since 2017 and, during the internship, has published several papers at AAAI, ACL, ICML, ICASSP, and Interspeech. One paper was nominated for the Best Student Paper Award at Interspeech 2020. In 2020, she won the National Scholarship as an outstanding Ph.D. student.
Yu Wu (Microsoft): Yu Wu is a researcher in the Natural Language Computing Group at Microsoft Research Asia. He obtained his B.S. and Ph.D. degrees from Beihang University in 2014 and 2019, respectively, supervised by Prof. Ming Zhou and Prof. Zhoujun Li. Since then, he has worked at MSRA, studying core problems in speech recognition, chit-chat dialogue, and machine translation. He has publications at top conferences and in leading journals, including AAAI, ACL, COLING, CL, EMNLP, ICASSP, ICML, and Interspeech.
Yu Zhang (Google): Yu Zhang is currently a research scientist at Google Brain. He received his Ph.D. degree in computer science from the Massachusetts Institute of Technology in 2017, where he worked on improving speech recognition performance. He is a fan of open-source projects and has contributed to the development of CNTK, MXNet, and ESPnet to facilitate ASR research. Currently, his research interests are improving ML model performance for various speech processing applications, with a focus on sequence-to-sequence modeling. Yu is a main contributor to Google's next-generation RNN-T ASR model and Tacotron-based text-to-speech system.