Generating Images with Sparse Representations


One of the key challenges in generative modelling of images is the high dimensionality of the data. Previous likelihood-based approaches use neural lossy compression to obtain compact latent codes that are more practical for generative modelling. In this paper we instead take inspiration from ubiquitous image compression methods like JPEG, and represent images using quantized discrete cosine transform (DCT) blocks, that are converted into sparse lists of non-zero components. We propose a Transformer-based autoregressive generative model of these representations, that sequentially predicts DCT channels, spatial positions, and quantized DCT values. On a range of image datasets, we demonstrate that our approach can generate high quality, diverse images, with sample metric scores competitive with state of the art GANs. We additionally show that two natural choices of ordering for the sparse image representation yield effective image colorization or super-resolution models.