The Automatic Sound Engineer

The Automatic Sound Engineer (ASE) is a Near End Listening Enhancement (NELE) algorithm based on sound engineering knowledge. For a brief video introduction, you can watch our ICASSP 2020 Show and Tell; the first 3 minutes are about ASE.
Click here for ASE beta and here for ASE applied to synthetic speech over the telephone.

Background

When you listen to a modern music production, you can usually make out what the singer is saying quite clearly, even when the vocals compete with a wall of sound - guitars, bass, drums, keyboards and possibly backing vocals. Moreover, reverberation is typically added to the voice for aesthetic purposes, yet the lyrics remain highly intelligible. The same happens when you go to a concert. How is this possible?
There is a person behind the scenes who makes sure the sound is delivered to the listeners in the best possible way: the sound engineer. Wouldn't it be nice if there were such an engineer in your kitchen when you listen to the radio, to adjust the sound so that you can understand the speaker even when the fan is on? Or at the train station, when you are far from the loudspeaker and there is a lot of reverberation? Unfortunately there is no sound engineer in those situations, but technology may be able to help.
The Automatic Sound Engineer (ASE) is a NELE algorithm inspired by common knowledge in sound engineering. We can use it to process speech signals before we play them on a device, in order to make them more intelligible.
The main objective of ASE is to find the sweet spot between intelligibility and quality: speech needs to be clear, but also pleasurable and easy to listen to. Defining what "quality" actually means is much harder than one would expect. For this reason, we are running tests to understand what listeners prefer, while assessing the strategies that improve intelligibility the most.

Intelligibility

ASE beta was entered into the Hurricane Challenge 2.0, an international competition for NELE algorithms. The goal of the Challenge - which was run in three languages - was to improve speech intelligibility in noise and reverberation, without raising the volume. ASE beta won the challenge by a significant margin, achieving intelligibility gains of over +7 dB (corresponding to +58% words understood on average by normal hearing listeners). Results from the Challenge were presented at Interspeech 2020 in the special session "Intelligibility-Enhancing Speech Modifications", during which ASE was chosen as the baseline for the next Challenge.
We tested the latest version of ASE in adverse listening conditions: over a low-quality telephone line in background noise. We found a surprising improvement of 8.3 dB for natural speech and 7.5 dB for synthetic speech. Click here to learn more about this experiment and listen to the samples.
More intelligibility tests are on the way.

Sound Quality

ASE has been substantially improved since the Hurricane Challenge 2.0 and now offers much better audio quality. In a recent study, listeners preferred ASE over unmodified speech across listening conditions and speakers. In perfect silence and full audibility, the median quality rating for ASE was 80% (on a scale from 0% = bad to 100% = excellent), while the original, unmodified speech scored under 70% (N=40 normal hearing listeners). This reflects ASE's novel premise: we do not take for granted that the unmodified recording is the most natural-sounding version; perhaps it can actually be improved (and the results suggest it can).
The most distinctive novelty in ASE, however, is its approach to Dynamic Range Compression (DRC): parameters are chosen automatically based on the input signal, and no compression is applied if the signal appears to have been compressed already.
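ASE's actual detection method is not public (see the patenting note below), but as a purely generic illustration, one common way to tell whether material already looks compressed is its crest factor: the ratio of peak level to RMS level. Heavily compressed audio has a low crest factor. A minimal sketch, with a hypothetical `should_compress` decision and an assumed 10 dB threshold:

```python
import numpy as np

def crest_factor_db(x):
    """Crest factor: peak level over RMS level, in dB.
    Heavily compressed material has a low crest factor."""
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    return 20 * np.log10(peak / rms)

def should_compress(x, threshold_db=10.0):
    """Skip compression when the signal already looks dense
    (crest factor below an assumed threshold)."""
    return crest_factor_db(x) > threshold_db

# toy example: a sine wave (crest factor ~3 dB) vs. sparse clicks
fs = 16000
t = np.arange(fs) / fs
sine = 0.5 * np.sin(2 * np.pi * 440 * t)        # dense, "compressed-looking"
speechy = np.zeros(fs)
speechy[::2000] = 0.5                            # peaky, lots of headroom
print(should_compress(sine))     # False: already dense
print(should_compress(speechy))  # True: room for compression
```

This is only one plausible heuristic; ASE's own criterion may be entirely different.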
ASE is noise-unaware: the algorithm needs no information about the noise conditions or the SNR, and the processing depends only on the input signal. Previous studies on SSDRC showed this approach to be the most effective across noise conditions.

Key features

  • Voice activity detection
  • Denoising
  • Equalization
  • Multiband and broadband DRC

You don't have to set any parameters in ASE: the algorithm decides what's best. Our goal is a one-click solution for the non-expert user.
Unfortunately we cannot disclose too many details at the moment, as the algorithm is under review for patenting.
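While ASE's internals are confidential, the broadband DRC stage listed above is a standard building block, and a generic version can be sketched. The following is a textbook feed-forward compressor (envelope follower plus static gain curve) with assumed threshold, ratio, attack and release values; it is not ASE's implementation:

```python
import numpy as np

def broadband_drc(x, fs, threshold_db=-20.0, ratio=4.0,
                  attack_ms=5.0, release_ms=50.0):
    """Toy feed-forward broadband compressor: smooth the rectified
    signal into an envelope, then attenuate whatever exceeds the
    threshold according to the ratio."""
    a_att = np.exp(-1.0 / (fs * attack_ms / 1000))
    a_rel = np.exp(-1.0 / (fs * release_ms / 1000))
    env = 0.0
    out = np.empty_like(x)
    for n, s in enumerate(x):
        level = abs(s)
        a = a_att if level > env else a_rel     # asymmetric smoothing
        env = a * env + (1 - a) * level
        level_db = 20 * np.log10(max(env, 1e-9))
        over = level_db - threshold_db
        gain_db = -over * (1 - 1 / ratio) if over > 0 else 0.0
        out[n] = s * 10 ** (gain_db / 20)
    return out

fs = 16000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)   # steady tone well above threshold
y = broadband_drc(x, fs)                # steady-state output is attenuated
```

A real system (and presumably ASE) would adapt these parameters to the material rather than fixing them.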

Listen to some samples

Play the following samples over headphones and/or over loudspeakers:

SAMPLES (each available for a male and a female voice):

  • unmodified speech recording
  • compression only
  • ASE (optimised for headphones)
  • ASE (optimised for loudspeakers)
  • compare with SSDRC

If you listen over headphones, you may hear the difference between "compression only" and ASE. This shows how important denoising is for achieving a high-quality sound.
We compare ASE to SSDRC, a benchmark NELE algorithm which offers similar intelligibility gains.

Listen to samples in noise

Please listen to the following samples over headphones (for a correct representation of spatial cues). Unmodified speech and ASE are played in the two realistic environments of NELE RE, at three different SNRs.
SNR (Signal-to-Noise Ratio) is the ratio of speech power to noise power. The values we used to create these audio files are the same as in NELE RE. When unmodified speech is played at the high SNR, normal hearing listeners understand on average 75% of the words, 50% at the medium SNR and 25% at the low one. We expect higher percentages with ASE.
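As a small illustration of how such stimuli are typically constructed (the actual NELE RE material is not reproduced here), noise can be scaled so that the mix hits an exact target SNR in dB; the signals and the `mix_at_snr` helper below are assumptions for the sketch:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals
    snr_db, then add it to the speech. Assumes equal lengths."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16000)   # stand-ins for real recordings
noise = 0.3 * rng.standard_normal(16000)
mixed = mix_at_snr(speech, noise, snr_db=-3.0)  # noise 3 dB above speech
```

A negative SNR, as in the example, means the noise is more powerful than the speech.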

CAFETERIA (unmodified / ASE):
  • high SNR
  • mid SNR
  • low SNR

LIVING ROOM (unmodified / ASE):
  • high SNR
  • mid SNR
  • low SNR

ASE in audio/video production

ASE was used to produce the voiceover of the ICASSP 2020 Show and Tell "Enriched Speech for Effortless Listening". In the table below, you can listen to a snippet of the raw speech recording and to the same snippet after ASE processing optimised for headphones (a milder setting). Please listen over headphones to hear the difference more clearly. In the picture below, you can see how the recording looks before and after; both have the same RMS.

raw recording after ASE

As the recording processed with ASE has a compressed dynamic range, it allows a louder overall output when we normalise the file (maximise the volume). It also allows a louder music background, as the softer parts of the speech are enhanced.
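This effect can be demonstrated numerically: given two signals with the same RMS but different crest factors, peak normalisation leaves the denser one louder. A toy sketch with synthetic signals (not the actual recordings):

```python
import numpy as np

def peak_normalise(x, peak=1.0):
    """Scale so the maximum absolute sample hits the target peak."""
    return x * (peak / np.max(np.abs(x)))

rms = lambda x: np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(1)
steady = rng.uniform(-1, 1, 16000)     # dense signal, low crest factor
peaky = steady.copy()
peaky[0] = 8.0                          # one large transient
peaky *= rms(steady) / rms(peaky)       # match the RMS of the two signals

loud_steady = peak_normalise(steady)
loud_peaky = peak_normalise(peaky)
# the low-crest-factor signal ends up louder after peak normalisation
print(rms(loud_steady), rms(loud_peaky))
```

The single transient in `peaky` forces the whole file down during normalisation, which is exactly why taming peaks with compression buys loudness headroom.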

raw + music (normalised) ASE + music (normalised)

You can hear how ASE transformed an amateur recording (made in a bedroom with a makeshift setup) into something that sounds much more like a professional entertainment product.



Get in touch: c.chermaz@ed.ac.uk


This project has received funding from the EU's H2020 research and innovation programme under the MSCA GA 675324