SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition


Inferring the structure of an environment from observations alone, while keeping track of an agent's location or viewpoint in it, is fundamentally difficult. The two variables are entangled and can generally only be estimated approximately or to a local optimum. We consider a similar problem in 3D multi-object settings. The challenge is to extract the (static) compositional structure of a scene from changing views of it. Furthermore, we want to extract shared structure across different scene instances. We present a variational approach that does both, learning from RGB videos without supervision. Relying on a transformer-based inference network to integrate information jointly across space and time, we infer two sets of latent representations from each video: a set of "object" latents, corresponding to the time-invariant, object-level contents of the scene, and a set of "frame" latents, corresponding to global time-varying elements such as viewpoint. This factorization of latents allows our model, SIMONe, to represent object attributes in an allocentric manner that does not depend on viewpoint. Moreover, it allows us to disentangle object dynamics and summarize object trajectories as time-abstracted, view-invariant, per-object properties. We demonstrate these capabilities, as well as the model's performance in terms of view interpolation and instance segmentation, across a range of video datasets.
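
To make the factorization concrete, the following is a minimal sketch of how one set of time-invariant object latents and one set of per-frame latents could be paired so that every (object, frame) combination is available to a decoder. It is an illustration only, not the model's implementation: the function name pair_latents, the latent sizes, the random values, and the use of concatenation are all assumptions made for this example.

```python
import numpy as np


def pair_latents(object_latents, frame_latents):
    """Pair every object latent with every frame latent (hypothetical helper).

    object_latents: array of shape (K, d_obj), one time-invariant vector per object.
    frame_latents:  array of shape (T, d_frame), one vector per video frame.
    Returns an array of shape (K, T, d_obj + d_frame).
    """
    num_objects, d_obj = object_latents.shape
    num_frames, d_frame = frame_latents.shape
    # Repeat each object latent along the time axis: it does not change per frame.
    obj = np.broadcast_to(object_latents[:, None, :], (num_objects, num_frames, d_obj))
    # Repeat each frame latent along the object axis: it is shared by all objects.
    frm = np.broadcast_to(frame_latents[None, :, :], (num_objects, num_frames, d_frame))
    return np.concatenate([obj, frm], axis=-1)


# Assumed sizes for illustration: 4 objects, 6 frames, latent widths 8 and 3.
rng = np.random.default_rng(0)
object_latents = rng.normal(size=(4, 8))
frame_latents = rng.normal(size=(6, 3))
paired = pair_latents(object_latents, frame_latents)
```

In this sketch the object part of each paired vector is identical across frames and the frame part is identical across objects, which is the sense in which viewpoint and scene content are held in separate variables.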