One in a series of posts explaining the theories underpinning our research.
Over the last decade, machine learning has made unprecedented progress in areas as diverse as image recognition, self-driving cars and playing complex games like Go. These successes have been largely realised by training deep neural networks with one of two learning paradigms—supervised learning and reinforcement learning. Both paradigms require training signals to be designed by a human and passed to the computer. In the case of supervised learning, these are the “targets” (such as the correct label for an image); in the case of reinforcement learning, they are the “rewards” for successful behaviour (such as getting a high score in an Atari game). The limits of learning are therefore defined by the human trainers.
While some scientists contend that a sufficiently inclusive training regime—for example, the ability to complete a very wide variety of tasks—should be enough to give rise to general intelligence, others believe that true intelligence will require more independent learning strategies. Consider how a toddler learns, for instance. Her grandmother might sit with her and patiently point out examples of ducks (acting as the instructive signal in supervised learning), or reward her with applause for solving a woodblock puzzle (as in reinforcement learning). But the vast majority of a toddler’s time is spent naively exploring the world, making sense of her surroundings through curiosity, play, and observation. Unsupervised learning is a paradigm designed to create autonomous intelligence by rewarding agents (that is, computer programs) for learning about the data they observe without a particular task in mind. In other words, the agent learns for the sake of learning.
A key motivation for unsupervised learning is that, while the data passed to learning algorithms is extremely rich in internal structure (e.g., images, videos and text), the targets and rewards used for training are typically very sparse (e.g., the label ‘dog’ referring to that particularly protean species, or a single one or zero to denote success or failure in a game). This suggests that the bulk of what is learned by an algorithm must consist of understanding the data itself, rather than applying that understanding to particular tasks.
Decoding the elements of vision
2012 was a landmark year for deep learning, when AlexNet (named after its lead architect Alex Krizhevsky) swept the ImageNet classification competition. AlexNet’s ability to recognise images was unprecedented, but even more striking was what was happening under the hood. When researchers analysed what AlexNet was doing, they discovered that it interprets images by building increasingly complex internal representations of its inputs. Low-level features, such as textures and edges, are represented in the bottom layers, and these are then combined to form high-level concepts such as wheels and dogs in higher layers.
This is remarkably similar to how information is processed in our brains, where simple edges and textures in primary sensory processing areas are assembled into complex objects like faces in higher areas. The representation of a complex scene can therefore be built out of visual primitives, in much the same way that meaning emerges from the individual words comprising a sentence. Without explicit guidance to do so, the layers of AlexNet had discovered a fundamental ‘vocabulary’ of vision in order to solve its task. In a sense, it had learned to play what Wittgenstein called a ‘language game’ that iteratively translates from pixels to labels.
From the perspective of general intelligence, the most interesting thing about AlexNet’s vocabulary is that it can be reused, or transferred, to visual tasks other than the one it was trained on, such as recognising whole scenes rather than individual objects. Transfer is essential in an ever-changing world, and humans excel at it: we are able to rapidly adapt the skills and understanding we’ve gleaned from our experiences (our ‘world model’) to whatever situation is at hand. For example, a classically-trained pianist can pick up jazz piano with relative ease. Artificial agents that form the right internal representations of the world, the reasoning goes, should be able to do similarly.
Nonetheless, the representations learned by classifiers such as AlexNet have limitations. In particular, as the network was only trained to label images with a single class (cat, dog, car, volcano), any information not required to infer the label—no matter how useful it might be for other tasks—is liable to be ignored. For example, the representations may fail to capture the background of the image if the label always refers to the foreground. A possible solution is to provide more comprehensive training signals, like detailed captions describing the images: not just “dog,” but “A Corgi catching a frisbee in a sunny park.” However, such targets are laborious to provide, especially at scale, and still may be insufficient to capture all the information needed to complete a task. The basic premise of unsupervised learning is that the best way to learn rich, broadly transferable representations is to attempt to learn everything that can be learned about the data.
If the notion of transfer through representation learning seems too abstract, consider a child who has learned to draw people as stick figures. She has discovered a representation of the human form that is both highly compact and rapidly adaptable. By augmenting each stick figure with specifics, she can create portraits of all her classmates: glasses for her best friend, her deskmate in his favorite red tee-shirt. And she has developed this skill not in order to complete a specific task or receive a reward, but rather in response to her basic urge to reflect the world around her.
Learning by creating: generative models
Perhaps the simplest objective for unsupervised learning is to train an algorithm to generate its own instances of data. So-called generative models should not simply reproduce the data they are trained on (an uninteresting act of memorisation), but rather build a model of the underlying class from which that data was drawn: not a particular photograph of a horse or a rainbow, but the set of all photographs of horses and rainbows; not a specific utterance from a specific speaker, but the general distribution of spoken utterances. The guiding principle of generative models is that being able to construct a convincing example of the data is the strongest evidence of having understood it: as Richard Feynman put it, "what I cannot create, I do not understand."
For images, the most successful generative model so far has been the Generative Adversarial Network (GAN for short), in which two networks—a generator and a discriminator—engage in a contest of discernment akin to that of an artistic forger and a detective. The generator produces images with the goal of tricking the discriminator into believing they are real; the discriminator, meanwhile, is rewarded for spotting the fakes. The generated images, first messy and random, are refined over many iterations, and the ongoing dynamic between the networks leads to ever-more realistic images that are in many cases indistinguishable from real photographs. Generative adversarial networks can also dream details of landscapes defined by the rough sketches of users.
A glance at the images below is enough to convince us that the network has learned to represent many of the key features of the photographs it was trained on, such as the structure of animals’ bodies, the texture of grass, and detailed effects of light and shade (even when refracted through a soap bubble). Close inspection reveals slight anomalies, such as the white dog’s apparent extra leg and the oddly right-angled flow of one of the jets in the fountain. While the creators of generative models strive to avoid such imperfections, their visibility highlights one of the benefits of recreating familiar data such as images: by inspecting the samples, researchers can infer what the model has and hasn’t learned.
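For readers who prefer symbols, the contest between generator and discriminator described above is commonly written as a two-player minimax objective. The version below is the one from the original GAN formulation; later variants often use different losses. The notation is not from this post and is introduced here only for illustration: D(x) is the probability the discriminator assigns to x being a real image, G(z) is the image the generator produces from random noise z, p_data is the distribution of real images, and p_z is the noise distribution.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator raises this quantity by scoring real images close to one and generated images close to zero, while the generator lowers it by producing images that the discriminator scores as real.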
Creating by predicting
Another notable family within unsupervised learning is that of autoregressive models, in which the data is split into a sequence of small pieces, each of which is predicted in turn. Such models can be used to generate data by successively guessing what will come next, feeding in a guess as input and guessing again. Language models, where each word is predicted from the words before it, are perhaps the best-known example: these models power the text predictions that pop up on some email and messaging apps. Recent advances in language modelling have enabled the generation of strikingly plausible passages, such as the one shown below from OpenAI’s GPT-2.
One interesting inconsistency in the text is that the unicorns are described as “four-horned”: again, it is fascinating to probe the limitations of the network’s understanding.
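The loop of guessing the next piece, feeding the guess back in and guessing again can be illustrated with a toy example. The sketch below is a deliberately tiny stand-in, not how GPT-2 works: it predicts each word from only the single word before it, using counts from a made-up corpus. The corpus and the function names `train_bigram_model` and `generate` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(words):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng):
    """Generate words one at a time, feeding each guess back in as input."""
    sequence = [start]
    for _ in range(length - 1):
        followers = counts.get(sequence[-1])
        if not followers:  # no observed continuation: stop early
            break
        candidates = list(followers.keys())
        weights = list(followers.values())
        # Sample the next word in proportion to how often it followed the last one.
        sequence.append(rng.choices(candidates, weights=weights, k=1)[0])
    return sequence


# Made-up training text (an assumption for this sketch).
corpus = "the duck swims on the pond and the duck sleeps by the pond".split()
model = train_bigram_model(corpus)
sample = generate(model, "the", 8, random.Random(0))
print(" ".join(sample))
```

Every adjacent pair of words in the output was seen in the training text, yet the sequence as a whole need not appear there. Real language models condition on far more context than one word, which is what makes their output coherent over whole passages.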
By controlling the input sequence used to condition the output predictions, autoregressive models can also be used to transform one sequence into another. This demo uses a conditional autoregressive model to transform text into realistic handwriting. WaveNet transforms text into natural-sounding speech, and is now used to generate voices for Google Assistant. A similar process of conditioning and autoregressive generation can be used to translate from one language to another.
Autoregressive models learn about data by attempting to predict each piece of it in a particular order. A more general class of unsupervised learning algorithms can be built by predicting any part of the data from any other. For example, this could mean removing a word from a sentence, and attempting to predict it from whatever remains. By learning to make lots of localised predictions, the system is forced to learn about the data as a whole.
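The idea of removing a word and predicting it from whatever remains can also be sketched in a few lines. This is a toy under strong assumptions, not a description of any real system: it looks only at the immediate left and right neighbours of the gap and relies on raw counts from a made-up set of sentences. The names `train_fill_in_model` and `predict_missing` are hypothetical.

```python
from collections import Counter, defaultdict


def train_fill_in_model(sentences):
    """Count which word appears between each (left, right) pair of neighbours."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words) - 1):
            counts[(words[i - 1], words[i + 1])][words[i]] += 1
    return counts


def predict_missing(counts, left, right):
    """Return the word most often seen between left and right, or None."""
    candidates = counts.get((left, right))
    if not candidates:  # this pair of neighbours was never observed
        return None
    return candidates.most_common(1)[0][0]


# Made-up training sentences (an assumption for this sketch).
sentences = [
    "the duck swims on the pond",
    "the duck sleeps by the pond",
    "the swan swims on the lake",
    "the duck swims in the pond",
]
model = train_fill_in_model(sentences)

# "the ___ swims": which word most often fills the gap?
print(predict_missing(model, "the", "swims"))
```

Even this crude version has to pick up regularities that span the sentence, such as which animals tend to swim, in order to fill the gap. Practical systems replace the lookup table with a neural network that conditions on all of the remaining words rather than just two.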
One concern around generative models is their potential for misuse. While manipulating evidence with photo, video, and audio editing has been possible for a long time, generative models could make it even easier to edit media with malicious intent. We have already seen demonstrations of so-called ‘deepfakes’—for instance, this fabricated video footage of President Obama. It’s encouraging to see that several major efforts to address these challenges are already underway, including using statistical techniques to help detect synthetic media and verify authentic media, raising public awareness, and discussions around limiting the availability of trained generative models. Furthermore, generative models can themselves be used to detect synthetic media and anomalous data—for example when detecting fake speech or identifying payment abnormalities to protect customers against fraud. Researchers need to work on generative models in order to better understand them and mitigate downstream risks.
Generative models are fascinating in their own right, but our principal interest in them at DeepMind is as a stepping stone towards general intelligence. Endowing an agent with the ability to generate data is a way of giving it an imagination, and hence the ability to plan and reason about the future. Even without explicit generation, our studies show that learning to predict different aspects of the environment enriches the agent’s world model, and thereby improves its ability to solve problems.
These results resonate with our intuitions about the human mind. Our ability to learn about the world without explicit supervision is fundamental to what we regard as intelligence. On a train ride we might listlessly gaze through the window, drag our fingers over the velvet of the seat, regard the passengers sitting across from us. We have no agenda in these studies: we almost can’t help but gather information, our brains ceaselessly working to understand the world around us, and our place within it.