PARTS: Unsupervised segmentation with slots, attention and independence maximization


Interest in object-centric generative models has grown in recent years. Such models attempt to extract meaningful objects, segments and representations from images or image sequences, usually in an unsupervised fashion. While significant progress has been demonstrated, few models, if any, work properly on inputs originating from complex 3D scenes with clutter, occlusions, interactions and camera motion. We present a model that, in an unsupervised manner, extracts meaningful segments, learns disentangled representations of individual objects and forms consistent and coherent predictions of future frames in complex 3D environments. Our model builds on recently proposed models that employ iterative amortized inference and transition models. We make several contributions that dramatically improve its performance. First, we introduce a recurrent slot-attention based encoder which allows for top-down influence on the inference procedure. Second, we show that using an auto-regressive prior when modeling image sequences is sub-optimal if one is interested in learning useful representations. Finally, we introduce several architectural changes that push performance even further. We demonstrate the success of the model on several datasets and provide an analysis of the factors contributing to its success, as well as of the learned representations.