Gene-Ping Yang

Research Scientist at Meta.

I am currently a Research Scientist at Meta, working on speech post-training for controllable and expressive TTS.

I completed my Ph.D. in Informatics at the University of Edinburgh, where I spent a wonderful time at the Centre for Speech Technology Research (CSTR) working with Prof. Hao Tang and Prof. Peter Bell. My research focuses on self-supervised pre-training, speech tokenization, and speech-text alignment to uncover the underlying patterns and geometry of speech representations.

I received my M.S. in Computer Science and B.S. in Electrical Engineering from National Taiwan University, where I built my foundation in speech processing and had the pleasure of working with Prof. Lin-shan Lee and Prof. Hung-yi Lee on speech separation and enhancement.

Research Interests

  • Representation Learning: Adaptive pre-training methods for learning segment-based speech units beyond fixed-frame representations.
  • Neural Speech Tokenization: Joint segmentation and discretization methods that turn continuous audio into high-fidelity tokens for LLM integration.
  • LLM Post-Training: SFT and GRPO methods for controllable TTS, natural prosody, conversational flow, and non-verbal speech.
  • Automatic Speech Recognition: Robust ASR systems that optimize implicit speech-text alignment.

Selected Publications

Speech Representation Learning & Tokenization

  • A Simple HMM with Self-Supervised Representations for Phone Segmentation
    Gene-Ping Yang, Hao Tang. SLT 2024.
  • Towards Matching Phones and Speech Representations
    Gene-Ping Yang and Hao Tang. ASRU 2023.
  • Autoregressive Predictive Coding: A Comprehensive Study
    Gene-Ping Yang, Sung-Lin Yeh, Yu-An Chung, James Glass and Hao Tang. JSTSP 2022.
  • On-Device Constrained Self-Supervised Learning for Keyword Spotting via Quantization Aware Pre-Training and Fine-tuning
    Gene-Ping Yang, Yue Gu, Sashank Macha, Qingming Tang, Yuzong Liu. ICASSP 2024.
  • On-device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation
    Gene-Ping Yang, Yue Gu, Qingming Tang, Dongsu Du, Yuzong Liu. Interspeech 2023.

Automatic Speech Recognition

  • Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in Automatic Speech Recognition
    Gene-Ping Yang, Haibin Wu, Peng Su, Ruizhe Huang, Suwon Shon, ..., Yuzong Liu. Under Review.
  • Supervised Attention in Sequence-to-Sequence Models for Speech Recognition
    Gene-Ping Yang and Hao Tang. ICASSP 2022.

Speech Separation & Enhancement

  • Distributed Asynchronous Device Speech Enhancement via Windowed Cross-Attention
    Gene-Ping Yang, Sebastian Braun. WASPAA 2025.
  • Stabilizing Label Assignment for Speech Separation by Self-supervised Pre-training
    Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-yi Lee. Interspeech 2021.
  • Interrupted and Cascaded Permutation Invariant Training for Speech Separation
    Gene-Ping Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-yi Lee, Lin-shan Lee. ICASSP 2020.
  • Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering
    Gene-Ping Yang, Chao-I Tuan, Hung-yi Lee, Lin-shan Lee. Interspeech 2019 (Oral).

Text-to-Speech

  • T-Mimi: A Transformer-Based Mimi Decoder for Real-Time On-Phone TTS
    Haibin Wu, Bach Viet Do, Naveen Suda, Julian Chan, Madhavan C R, Gene-Ping Yang, Yi-Chiao Wu, Naoyuki Kanda, Yossef Adi, Xin Lei, Yue Liu, Florian Metze, Yuzong Liu. ICASSP 2026.