Self-Supervised MultiModal Versatile Networks


Videos are a rich source of multimodal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a \emph{multimodal versatile network}: a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of \emph{deflation}, so that the networks can be effortlessly applied to visual data in the form of either \emph{video} or a \emph{static image}. We demonstrate how such networks, trained on large collections of unlabelled video data, can be applied to \emph{video}, \emph{video-text}, \emph{image} and \emph{audio} tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks, including UCF101, HMDB51 and ESC-50, when compared to previous self-supervised work.