
Building architectures that can handle the world’s data

Authors

Drew Jaegle, Joao Carreira, Carl Doersch, David Ding, Catalin Ionescu

Perceiver and Perceiver IO work as multi-purpose tools for AI

Most architectures used by AI systems today are specialists. A 2D residual network may be a good choice for processing images, but at best it's a loose fit for other kinds of data, such as the Lidar signals used in self-driving cars or the torques used in robotics. What's more, standard architectures are often designed with only one task in mind, leading engineers to bend over backwards to reshape, distort, or otherwise modify their inputs and outputs in the hope that a standard architecture can learn to handle their problem correctly. Dealing with more than one kind of data, like the sounds and images that make up videos, is even more complicated and usually involves complex, hand-tuned systems built from many different parts, even for simple tasks. As part of DeepMind's mission of solving intelligence to advance science and humanity, we want to build systems that can solve problems involving many types of inputs and outputs, so we began to explore a more general and versatile architecture that can handle all types of data.

Figure 1. The Perceiver IO architecture maps input arrays to output arrays by means of a small latent array, which lets it scale gracefully even for very large inputs and outputs. Perceiver IO uses a global attention mechanism that generalizes across many different kinds of data.

In a paper presented at ICML 2021 (the International Conference on Machine Learning) and published as a preprint on arXiv, we introduced the Perceiver, a general-purpose architecture that can process data including images, point clouds, audio, video, and their combinations. While the Perceiver could handle many varieties of input data, it was limited to tasks with simple outputs, like classification. A new preprint on arXiv describes Perceiver IO, a more general version of the Perceiver architecture. Perceiver IO can produce a wide variety of outputs from many different inputs, making it applicable to real-world domains like language, vision, and multimodal understanding as well as challenging games like StarCraft II. To help researchers and the machine learning community at large, we've now open-sourced the code.

The joke "a bear walks into a restaurant" is repeated seven times on the page, each with different parts of the joke highlighted. The first two versions are under the heading "local" each with one word highlighted, followed by two more under the heading "periodic" with highlights at regular intervals. The last three are under the heading "syntactic elements" and show highlighted punctuation.

Figure 2. Perceiver IO processes language by first choosing which characters to attend to. The model learns to use several different strategies: some parts of the network attend to specific places in the input, while others attend to specific characters like punctuation marks.

Perceivers build on the Transformer, an architecture that uses an operation called “attention” to map inputs into outputs. By comparing all elements of the input, Transformers process inputs based on their relationships with each other and the task. Attention is simple and widely applicable, but because a standard Transformer compares every input element with every other, its cost grows quadratically with the number of inputs. This means Transformers work well for inputs with at most a few thousand elements, but common forms of data like images, videos, and books can easily contain millions of elements. With the original Perceiver, we solved a major problem for a generalist architecture: scaling the Transformer’s attention operation to very large inputs without introducing domain-specific assumptions. The Perceiver does this by using attention to first encode the inputs into a small latent array. This latent array can then be processed further at a cost independent of the input’s size, enabling the Perceiver’s memory and computational needs to grow gracefully as the input grows larger, even for especially deep models.
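To make the encoding step concrete, here is a minimal sketch of cross-attention onto a latent array, written in JAX. It assumes single-head attention and omits the layer norms, MLP blocks, and multi-head machinery of the real model; all names, shapes, and initialisations are illustrative, not taken from the open-sourced code.

```python
import jax
import jax.numpy as jnp

def cross_attend(latents, inputs, w_q, w_k, w_v):
    """Cross-attention: a small latent array queries a large input array."""
    q = latents @ w_q                          # (num_latents, d_attn)
    k = inputs @ w_k                           # (num_inputs, d_attn)
    v = inputs @ w_v                           # (num_inputs, d_attn)
    scores = q @ k.T / jnp.sqrt(q.shape[-1])   # (num_latents, num_inputs)
    weights = jax.nn.softmax(scores, axis=-1)  # each latent attends to all inputs
    return weights @ v                         # (num_latents, d_attn)

num_latents, d_latent = 256, 512   # small and fixed: chosen as hyperparameters
num_inputs, d_input = 50176, 64    # large: e.g. one element per pixel of a 224x224 image
d_attn = 512

key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
latents = jax.random.normal(k1, (num_latents, d_latent))  # learned in the real model
inputs = jax.random.normal(k2, (num_inputs, d_input))     # stand-in for featurised pixels
w_q = 0.02 * jax.random.normal(k3, (d_latent, d_attn))
w_k = 0.02 * jax.random.normal(k4, (d_input, d_attn))
w_v = 0.02 * jax.random.normal(k5, (d_input, d_attn))

z = cross_attend(latents, inputs, w_q, w_k, w_v)
print(z.shape)  # (256, 512): everything downstream works on this small array
```

This cross-attention costs O(M × N) for M latents and N inputs, linear rather than quadratic in N, and the self-attention layers that follow operate on the latent array alone at O(M²), independent of the input's size.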

Figure 3. Perceiver IO produces state-of-the-art results on the challenging task of optical flow estimation, or tracking the motion of all pixels in an image. The colour of each pixel shows the direction and speed of motion estimated by Perceiver IO, as indicated in the legend above.

This “graceful growth” allows the Perceiver to achieve an unprecedented level of generality: it’s competitive with domain-specific models on benchmarks based on images, 3D point clouds, and audio and images together. But because the original Perceiver produced only one output per input, it wasn’t as versatile as researchers needed. Perceiver IO fixes this problem by using attention not only to encode to a latent array but also to decode from it: an output query array, with one element for each desired output, attends to the latent array, which gives the network great flexibility. Perceiver IO now scales to large and diverse inputs and outputs, and can even deal with many tasks or types of data at once. This opens the door for all sorts of applications, like understanding the meaning of a text from each of its characters, tracking the movement of all points in an image, processing the sound, images, and labels that make up a video, and even playing games, all while using a single architecture that’s simpler than the alternatives.
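Continuing the sketch above (and reusing cross_attend and the latent array z from it), decoding is the same cross-attention with the roles reversed: a query array with one row per desired output element attends to the latent array. How the queries are built in practice (learned vectors, positional features, task embeddings) is simplified away here, so this illustrates the idea rather than the released implementation.

```python
# Decoding sketch: the queries now come from the output side, and the
# keys/values come from the processed latent array z computed above.
num_outputs, d_query = 50176, 512    # e.g. one query per pixel for optical flow

q1, q2, q3, q4 = jax.random.split(jax.random.PRNGKey(1), 4)
queries = jax.random.normal(q1, (num_outputs, d_query))  # in practice built from
                                                         # positions / task features
w_q2 = 0.02 * jax.random.normal(q2, (d_query, d_attn))
w_k2 = 0.02 * jax.random.normal(q3, (d_attn, d_attn))    # z has d_attn columns
w_v2 = 0.02 * jax.random.normal(q4, (d_attn, d_attn))

outputs = cross_attend(queries, z, w_q2, w_k2, w_v2)
print(outputs.shape)  # (50176, 512): project each row to 2 channels for per-pixel flow
```

Because the output shape is set entirely by the query array, the same trunk can produce per-pixel predictions, per-character logits, or a single classification vector simply by swapping the queries.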

Videos: three paired examples, each comparing original footage with Perceiver IO’s output.

In our experiments, we’ve seen Perceiver IO work across a wide range of benchmark domains, such as language, vision, multimodal data, and games, providing an off-the-shelf way to handle many kinds of data. We hope our latest preprint and the code available on GitHub help researchers and practitioners tackle problems without needing to invest the time and effort to build custom solutions using specialised systems. As we continue to learn from exploring new kinds of data, we look forward to further improving upon this general-purpose architecture and making it faster and easier to solve problems throughout science and machine learning.