Single-channel speech separation remains one of the most challenging tasks in speech signal processing. In many situations, such as during epidemics involving respiratory diseases (e.g., COVID-19 or influenza A), people must wear masks while communicating. Is it possible to address the challenge of speech separation when the target speaker is wearing a mask? Can audio-visual approaches achieve better speech separation performance than audio-only approaches in scenarios where speakers wear masks? To answer these questions, we first construct a large-scale multimodal dataset, termed Speech Separation while Wearing a Mask (SSWM), which includes both an audio modality and a visual modality with masked faces. We explore two strategies for handling facial occlusion. The first uses the occluded faces, which lack critical visual cues such as mouth movements, directly as supervisory information for self-supervised speech separation; the second first applies Wav2Lip to generate the missing visual information, which is then used as supervisory guidance for self-supervised speech separation. Building on these two strategies, we propose the SSWM network (SSWMNet), which can flexibly either use occluded facial images directly or employ Wav2Lip to generate visual information. Experimental results show that the proposed method using Wav2Lip to generate visual information outperforms the approach that uses occluded faces directly for self-supervised speech separation, and that both proposed audio-visual methods outperform the audio-only speech separation approach, which operates without the aid of visual information. Availability: SSWMNet is available at https://github.com/fanmanqian/SSWMNetwork.
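
To make the two strategies concrete, the sketch below shows one way they could be wired into a single audio-visual separator: the visual stream either encodes the occluded face frames as-is (strategy 1) or first passes them through a Wav2Lip-style generator to synthesize mouth movements (strategy 2), and the resulting embeddings condition a mask-based separation network. This is a minimal illustration, not the authors' SSWMNet; all layer sizes, the fusion scheme, and the `lip_generator` / `visual_encoder` hooks are assumptions made for clarity.

```python
# Minimal sketch (not the authors' SSWMNet) of an audio-visual separator that
# can consume either occluded-face embeddings or Wav2Lip-generated embeddings.
import torch
import torch.nn as nn


class ToyAudioVisualSeparator(nn.Module):
    """Fuses a mixture spectrogram with per-frame visual embeddings and
    predicts a soft time-frequency mask for the target speaker."""

    def __init__(self, n_freq: int = 257, vis_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.audio_enc = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
        self.vis_proj = nn.Linear(vis_dim, 2 * hidden)
        self.mask_head = nn.Sequential(nn.Linear(4 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        # mix_spec: (B, T, n_freq) magnitude spectrogram of the mixture
        # vis_emb:  (B, T, vis_dim) visual embeddings aligned to the audio frames
        audio_feat, _ = self.audio_enc(mix_spec)
        fused = torch.cat([audio_feat, self.vis_proj(vis_emb)], dim=-1)
        mask = self.mask_head(fused)
        return mask * mix_spec  # estimated target-speaker spectrogram


def build_visual_embeddings(face_frames, visual_encoder, use_wav2lip=False,
                            lip_generator=None, mixture_audio=None):
    """Strategy 1 (use_wav2lip=False): encode the occluded face frames directly.
    Strategy 2 (use_wav2lip=True): first synthesize mouth movements with a
    pretrained Wav2Lip-style generator, then encode the generated frames.
    `visual_encoder` and `lip_generator` are hypothetical callables standing in
    for pretrained models."""
    if use_wav2lip:
        face_frames = lip_generator(face_frames, mixture_audio)
    return visual_encoder(face_frames)


if __name__ == "__main__":
    B, T, n_freq, vis_dim = 2, 100, 257, 512
    model = ToyAudioVisualSeparator(n_freq=n_freq, vis_dim=vis_dim)
    mix_spec = torch.rand(B, T, n_freq)
    vis_emb = torch.rand(B, T, vis_dim)  # stand-in for encoded face frames
    print(model(mix_spec, vis_emb).shape)  # torch.Size([2, 100, 257])
```

Under these assumptions, switching between the two strategies is a single flag on the visual stream, which mirrors the abstract's claim that SSWMNet can flexibly choose between occluded faces and Wav2Lip-generated visual information.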