Near End Listening Enhancement

NELE (Near End Listening Enhancement) refers to a class of algorithms that modify speech signals before they are played back on a device, in order to make them more intelligible for the listener. For example, when you listen to the news on the radio in the kitchen, the loudspeaker is the far end and you (the listener) are the near end. It might be difficult to hear what the speaker says because of the noise from the extractor fan, the dishwasher or other appliances. A NELE algorithm can be used to modify the speaker's voice before it is played back on your loudspeaker, so that you can understand what is being said despite the noise.

There are two problems to tackle in the real world: additive noise and reverberation. Additive noise is any source of sound other than the speech you want to listen to: a ventilation fan, another person speaking, a car passing by. Reverberation, instead, is created by sound bouncing off surfaces: even if you listen to speech in a silent place, reverberation can cause problems (think of someone speaking in a church, for example), because delayed copies of the sound add up and you hear a "smeared" version of the original signal.
Over the phone there may be additional problems, such as electrical or magnetic disturbances (over analog lines) and narrow acoustic bandwidth, among many other factors that lead to poor audio quality in general.

Over the last two decades a lot of work has been done in the field of NELE, following different approaches. Some common modifications performed on speech signals are: reducing the spectral tilt, enhancing the formants, slowing down the utterances, and changing the pitch. These modifications are inspired by the natural ones humans make when they speak in noise, a phenomenon known as Lombard speech, which is more than just "shouting". When speech comes out of a device, though, we can induce changes that are even more effective than those a human could make, and - in principle - we want to do this without raising the volume. In fact, the typical goal of NELE algorithms is to improve intelligibility (compared to unmodified speech) without changing the RMS of the original signal.
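The equal-RMS constraint mentioned above is easy to enforce in practice: after any modification, rescale the processed signal so its RMS matches that of the original. A minimal sketch (the function names `rms` and `normalize_rms` are illustrative, not from any specific NELE implementation):

```python
import math

def rms(signal):
    """Root-mean-square level of a signal (list of samples)."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def normalize_rms(modified, original):
    """Rescale `modified` so its RMS matches that of `original`,
    keeping the overall energy budget unchanged."""
    gain = rms(original) / rms(modified)
    return [x * gain for x in modified]
```

This way any NELE modification only *redistributes* energy (over time or across frequency) rather than adding it, which makes the comparison against unmodified speech fair.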
In the image below, two common NELE processes are illustrated: dynamic range compression and reduction of spectral tilt. You can see the original speech signal (black) on the left, and the processed one (blue) on the right. Dynamic range compression means that soft sounds are amplified and loud sounds are attenuated, making the volume more homogeneous over time. This lets you hear the consonants better; consonants are extremely important for understanding speech, yet usually carry little power compared to vowels.
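As a toy illustration of how soft samples can be boosted relative to loud ones, here is a static power-law compressor (a simplification; real compressors work on short-time envelopes with attack/release smoothing, and the exponent value here is arbitrary):

```python
import math

def compress(signal, exponent=0.5):
    """Static range compression: mapping |x| to |x|**p with p < 1
    boosts quiet samples relative to loud ones.
    Samples are assumed to lie in [-1, 1]."""
    return [math.copysign(abs(x) ** exponent, x) for x in signal]
```

For instance, two samples with a 16:1 level ratio (0.64 vs 0.04) end up only 4:1 apart (0.8 vs 0.2) after compression with `exponent=0.5`, so the dynamic range shrinks.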
Reduction of spectral tilt means that power is taken away from the low frequencies (80-1000 Hz, where most of the power in speech is found) and reallocated to the higher frequencies that are more useful for understanding speech (1000-4000 Hz). Overall, this is an optimization process: resources (power, in this case) are reallocated where they are needed most.
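A simple way to tilt the spectrum towards the high frequencies is a first-order pre-emphasis filter, y[n] = x[n] - a·x[n-1], a classic high-frequency emphasis stage. This is a minimal sketch, not the specific filter design any particular NELE algorithm uses (those are typically frequency-selective and noise-adaptive):

```python
def pre_emphasis(signal, alpha=0.95):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].
    Attenuates low frequencies and boosts high ones, flattening
    the natural downward spectral tilt of speech."""
    out = [signal[0]]  # first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out
```

A constant (0 Hz) input is almost completely suppressed, while rapidly alternating (high-frequency) content passes through amplified; in a full system the output would then be rescaled to the original RMS.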
These are only two examples of the various modifications that can be done to improve the intelligibility of speech signals.

Suggested reads:

This project has received funding from the EU's H2020 research and innovation programme under the MSCA GA 675324