Reverberant speech separation based on audio-visual dictionary learning and binaural cues

Qingju Liu, Wenwu Wang, Philip Jackson, Mark Barnard

    Research output: Contribution to conference › Paper › peer-review

    Abstract

    Probabilistic models of binaural cues, such as the interaural phase difference (IPD) and the interaural level difference (ILD), can be used to derive a time-frequency (TF) audio mask for source separation of binaural mixtures. Such models are, however, often degraded by acoustic noise. In contrast, the video stream contains information relevant to the synchronous audio stream that is unaffected by acoustic noise. In this paper, we present a novel method for modelling audio-visual (AV) coherence based on dictionary learning. A visual mask is constructed from the video signal using the learnt AV dictionary, and combined with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation. We tested our algorithm on the XM2VTS database and observed considerable performance improvements for noise-corrupted signals.
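    As a rough illustration of the audio-cue stage described above, the sketch below computes IPD and ILD cues from the STFT of a two-channel signal and scores them under a Gaussian cue model to form a soft TF mask. This is a minimal sketch, not the paper's implementation: the synthetic binaural mixture and the Gaussian parameters (mu_ipd, sig_ipd, mu_ild, sig_ild) are placeholders standing in for cue statistics that would in practice be estimated from data.

```python
# Illustrative sketch (not the authors' code): IPD/ILD cues -> soft TF mask.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)

# Synthetic two-channel mixture stands in for a real binaural recording:
# the right channel is attenuated and phase-shifted relative to the left.
left = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(fs)
right = 0.7 * np.sin(2 * np.pi * 440 * t + 0.3) + 0.1 * rng.standard_normal(fs)

f, frames, L = stft(left, fs=fs, nperseg=512)
_, _, R = stft(right, fs=fs, nperseg=512)

eps = 1e-12
ipd = np.angle(L * np.conj(R))                                # interaural phase difference (rad)
ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))    # interaural level difference (dB)

# Placeholder target-source cue statistics; a model-based separator would
# estimate these from the mixture (e.g. via EM) rather than fix them.
mu_ipd, sig_ipd = 0.3, 0.5
mu_ild, sig_ild = 3.0, 4.0

# Score each TF bin under independent Gaussians on IPD and ILD,
# then squash the likelihood into a soft mask in [0, 1).
likelihood = (np.exp(-0.5 * ((ipd - mu_ipd) / sig_ipd) ** 2)
              * np.exp(-0.5 * ((ild - mu_ild) / sig_ild) ** 2))
mask = likelihood / (likelihood + 0.1)

# Apply the mask to one channel and resynthesise the target estimate.
_, estimate = istft(mask * L, fs=fs, nperseg=512)
```

    The paper's noise-robustness comes from combining such an audio mask with a visual mask built from the learnt AV dictionary; that stage is omitted here since it depends on the learnt dictionary itself.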
    Original language: English
    Publication status: Published - Aug 2012
    Event: IEEE Statistical Signal Processing Workshop (SSP), Michigan, U.S.
    Duration: 5 Aug 2012 – 8 Aug 2012


    Keywords

    • Computer science and informatics
