Informed source separation

If you wear hearing devices, wouldn't it be nice if you could switch the world off and listen only to the signal you are interested in? As explained on the page on HA (Hearing Aids), the biggest challenge faced by hearing prosthetics is separating voice signals from noise in order to make speech more intelligible for the listener. This process is known as speech enhancement. HAs run several algorithms to do this, but they all have to operate "blind": they have no prior information about the signal they are meant to enhance, so they have to make an educated guess. In the picture below, you can see a speech signal and a noise signal, separately. In the real world they add together, and the purple signal, the mixture, is what a hearing aid is presented with. The device has to guess what the clean speech signal looks like and suppress the rest. The guessing is based on expert knowledge and statistics, but it does not always work perfectly, especially when the "noise" consists of other speakers, because then all the signals look the same to the device: it cannot possibly know which speaker the user wants to listen to. Although there is currently a lot of research on detecting which speaker is the focus of attention (by analysing brain waves), this feature is not yet available in commercial devices, as the field is still in its infancy.
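The additive mixing described above can be sketched in a few lines of Python. This is a toy illustration, not a model of any real device: the "speech" is a stand-in tone and the "noise" is white noise, both chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16_000                     # sample rate in Hz (illustrative)
t = np.arange(fs) / fs          # one second of audio

speech = 0.5 * np.sin(2 * np.pi * 220 * t)   # stand-in for a clean voice
noise = 0.1 * rng.standard_normal(fs)        # stand-in for background noise
mixture = speech + noise                     # what the hearing aid receives

# A blind enhancer must estimate `speech` from `mixture` alone;
# simply subtracting the noise is impossible, as it is never observed.
snr_db = 10 * np.log10(np.sum(speech**2) / np.sum(noise**2))
```

The point is that only `mixture` ever reaches the device, so any enhancement has to be inferred rather than computed exactly.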
Speech enhancement can also be framed as source separation, where speech and noise are different sources of sound. As we mentioned earlier, the process is blind. Our research question is: can we modify the speech signal in a way that makes it clearly recognisable by the hearing device? Of course, this can only be done when speech comes out of a device (like a radio or TV), as we cannot change the way humans speak.
We can modify the signal in two different ways: explicitly or implicitly. When we make an explicit modification, the changes in the signal are audible, so listeners can hear them. This type of modification includes dynamic range compression and equalisation; the algorithms tested in this study and ASE perform this type of processing.
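As an example of an explicit modification, here is a minimal static dynamic range compressor. It is a sketch only, with illustrative threshold and ratio values, and it is not the processing used in this study; real compressors also add attack/release smoothing.

```python
import numpy as np

def compress(x, threshold_db=-20.0, ratio=4.0):
    """Toy static dynamic range compressor (an explicit, audible change).

    Samples whose level exceeds `threshold_db` have the excess level
    (in dB) divided by `ratio`; quieter samples pass through unchanged.
    """
    eps = 1e-12                                   # avoid log10(0)
    level_db = 20 * np.log10(np.abs(x) + eps)     # per-sample level in dB
    over = np.maximum(level_db - threshold_db, 0) # dB above threshold
    gain_db = -over * (1.0 - 1.0 / ratio)         # attenuation to apply
    return x * 10 ** (gain_db / 20)
```

With these settings, a full-scale sample (0 dB) sits 20 dB over the threshold and is attenuated by 15 dB, while a quiet sample at -26 dB is left untouched: loud passages are pulled down, quiet ones are preserved.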

Implicit modifications, instead, are still largely uncharted territory. Implicit means that we modify the signal by adding data that is inaudible to human ears but can be decoded by devices (such as HAs). This process is known as acoustic watermarking, and it is typically implemented by means of auditory masking. There is a long tradition of acoustic watermarking in the field of copyright protection (e.g. audio tracks on CDs), but this strategy has never been used for speech enhancement, as it turns out to be extremely challenging to implement for this purpose. The core of the problem is that we want to deliver the data acoustically, along with the signal itself, and we want it to be compatible with current devices (no ultrasound). Here are some of the challenges:

  • We need to define which information about the speech signal would be useful for the hearing device to have (the data set has to be as informative and as small as possible).
  • If we embed a lot of data into the speech signal, the modification becomes audible.
  • When we propagate the signal acoustically, noise and reverberation can destroy the watermark; the more robust we make it, the more audible it becomes.
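The robustness/audibility trade-off can be illustrated with a toy spread-spectrum watermark. This is a classic textbook scheme, not the one developed in this project, and all parameters (carrier, embedding strength `alpha`) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16_000
host = 0.3 * np.sin(2 * np.pi * np.arange(n) * 330 / n)  # host "speech"

# One payload bit is spread over a pseudo-random carrier known to the
# receiver. `alpha` sets the embedding strength: a larger alpha survives
# more channel noise but is also more audible -- the trade-off above.
carrier = rng.choice([-1.0, 1.0], size=n)
bit = 1                      # payload bit to embed (+1 or -1)
alpha = 0.01                 # small strength, kept below audibility
watermarked = host + alpha * bit * carrier

# Detection: correlate the received signal with the carrier.
# The host and the channel noise average out; the watermark term adds up.
received = watermarked + 0.05 * rng.standard_normal(n)   # acoustic noise
decoded = 1 if np.dot(received, carrier) > 0 else -1
```

The correlation accumulates `alpha * n` from the watermark while the host contributes only a zero-mean residual, which is why even a very weak (inaudible) mark can be decoded, up to the point where noise and reverberation overwhelm it.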

A device's ability to tell signals apart, given some prior information about the signal of interest, is what qualifies as informed source separation.

This project has received funding from the EU's H2020 research and innovation programme under the MSCA GA 675324