Generative Adversarial Networks for Audio-Visual Speech Separation
Abstract
The objective of this work is to isolate the speech signal of a target speaker from a video in which multiple people are speaking simultaneously. This work introduces generative adversarial networks (GANs) for speech separation using audio and visual cues. GANs have proven effective at enhancing the quality of noisy speech because of their ability to learn realistic distributions. We demonstrate the efficacy of the adversarial training approach for audio-visual speech separation (AVSS-GAN) over the existing audio-visual baseline (AVSS), which operates in the time domain and utilizes visual features for target speech extraction. We evaluate our approach on two-speaker and three-speaker mixtures synthesized from the LRS2 dataset in a speaker-independent setting and show the effectiveness of AVSS-GAN over the baseline. We train AVSS-GAN with either L1 loss or Si-SNR loss as an additional component to the generator's adversarial loss and show that the Si-SNR loss performs better than the widely used L1 loss. On the generated mixtures, the proposed approach improves Si-SNR over the baseline by 0.46 dB and 0.24 dB for two-speaker and three-speaker mixtures, respectively.
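For reference, the sketch below shows one way the Si-SNR term used alongside the generator's adversarial loss could be computed in PyTorch. It is a minimal illustration, not the paper's implementation: the function names, the least-squares adversarial formulation, and the weighting factor `lam` are assumptions made for the example.

```python
import torch


def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR, averaged over the batch.

    estimate, target: (batch, samples) time-domain waveforms.
    """
    # Remove the mean so the measure is invariant to DC offsets.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)

    # Project the estimate onto the target to obtain the scaled target component.
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    target_energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / target_energy * target

    # Everything not explained by the target is treated as noise.
    e_noise = estimate - s_target

    si_snr = 10 * torch.log10(
        (torch.sum(s_target ** 2, dim=-1) + eps)
        / (torch.sum(e_noise ** 2, dim=-1) + eps)
    )
    # Negate so that maximising Si-SNR corresponds to minimising the loss.
    return -si_snr.mean()


def generator_loss(disc_fake_logits: torch.Tensor,
                   estimate: torch.Tensor,
                   target: torch.Tensor,
                   lam: float = 100.0) -> torch.Tensor:
    # Adversarial term (a least-squares GAN form is assumed here) plus the
    # Si-SNR reconstruction term, weighted by an assumed factor `lam`.
    adv = torch.mean((disc_fake_logits - 1.0) ** 2)
    return adv + lam * si_snr_loss(estimate, target)
```

Replacing `si_snr_loss` with an L1 term on the separated waveform gives the L1 variant compared against in the abstract.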
Demo 1

Both speakers talking simultaneously

Separated audio of the first speaker

Separated audio of the second speaker
Demo 2

Both speakers talking simultaneously

Separated audio of the first speaker
