Imitating Interactive Intelligence


A common vision from science fiction is that AI will one day sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. This setting nevertheless integrates a number of the central challenges of artificial intelligence (AI) research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour.

Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioural tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behaviour beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalise beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioural imitation is a promising tool to create intelligent, interactive agents, and the challenge of reliably evaluating such agents is possible to surmount. See videos for an overview of the manuscript, training time-lapse, and human-agent interactions.

Authors' Notes

Two questions must be answered at the outset of any artificial intelligence research. What do we want AI systems to do? And how will we evaluate when we are making progress toward this goal? Alan Turing, in his seminal paper describing the Turing Test, which he more modestly named the imitation game, argued that for a certain kind of AI, these questions may be one and the same. Roughly, if an AI’s behaviour resembles human-like intelligence when a person interacts with it, then the AI has passed the test and can be called intelligent. An AI that is designed to interact with humans should be tested via interaction with humans.

At the same time, interaction is not just a test of intelligence but also the point. For AI agents to be generally helpful, they should assist us in diverse activities and communicate with us naturally. In science fiction, the vision of robots that we can speak to is commonplace. And intelligent digital agents that can help accomplish large numbers of tasks would be eminently useful. To bring these devices into reality, we therefore must study the problem of how to create agents that can capably interact with humans and produce actions in a rich world.

Building agents that can interact with humans and the world poses a number of important challenges. How can we provide appropriate learning signals to teach artificial agents such abilities? How can we evaluate the performance of the agents we develop, when language itself is ambiguous and abstract? As the wind tunnel is to the design of the airplane, we have created a virtual environment for researching how to make interacting agents.

We first create a simulated environment, the Playroom, in which virtual robots can engage in a variety of interesting interactions by moving around, manipulating objects, and speaking to each other. The Playroom’s dimensions can be randomised as can its allocation of shelves, furniture, landmarks like windows and doors, and an assortment of children's toys and domestic objects. The diversity of the environment enables interactions involving reasoning about space and object relations, ambiguity of references, containment, construction, support, occlusion, partial observability. We embedded two agents in the Playroom to provide a social dimension for studying joint intentionality, cooperation, communication of private knowledge, and so on.

Agents interacting in the Playroom. The blue agent instructs the yellow agent to “Put the helicopter into the box.”
The configuration of the Playroom is randomised to create diversity in data collection.

We harness a range of learning paradigms to build agents that can interact with humans, including imitation learning, reinforcement learning, supervised, and unsupervised learning. As Turing may have anticipated in naming “the imitation game,” perhaps the most direct route to create agents that can interact with humans is through imitation of human behaviour. Large datasets of human behaviour along with algorithms for imitation learning from those data have been instrumental for making agents that can interact with textual language or play games. For grounded language interactions, we have no readily available, pre-existing data source of behaviour, so we created a system for eliciting interactions from human participants interacting with each other. These interactions were elicited primarily by prompting one of the players with a cue to improvise an instruction about, e.g., “Ask the other player to position something relative to something else.” Some of the interaction prompts involve questions as well as instructions, like “Ask the other player to describe where something is.” In total, we collected more than a year of real-time human interactions in this setting.

Our agents each consume images and language as inputs and produce physical actions and language actions as outputs. We built reward models with the same input specifications.
Left: Over the course of a 2 minute interaction, the two players (setter & solver) move around, look around, grab and drop objects, and speak. Right: The setter is prompted to “Ask the other player to lift something.” The setter instructs the solver agent to “Lift the plane which is in front of the dining table”. The solver agent finds the correct object and completes the task.

Imitation learning, reinforcement learning, and auxiliary learning (consisting of supervised and unsupervised representation learning) are integrated into a form of interactive self-play that is crucial to create our best agents. Such agents can follow commands and answer questions. We call these agents “solvers.” But our agents can also provide commands and ask questions. We call these agents “setters.” Setters interactively pose problems to solvers to produce better solvers. However, once the agents are trained, humans can play as setters and interact with solver agents.

From human demonstrations we train policies using a combination of supervised learning (behavioural cloning), inverse RL to infer reward models, and forward RL to optimise policies using the inferred reward model. We use semi-supervised auxiliary tasks to help shape the representations of both the policy and reward models.
The setter agent asks the solver agent to “Take the white robot and place it on the bed.” The solver agent finds the robot and accomplishes the task. The reward function learned from demonstrations captures key aspects of the task (blue), and gives less reward (grey) when the same observations are coupled with the counterfactual instruction, “Take the red robot and place it on the bed.”

Our interactions cannot be evaluated in the same way that most simple reinforcement learning problems can. There is no notion of winning or losing, for example. Indeed, communicating with language while sharing a physical environment introduces a surprising number of abstract and ambiguous notions. For example, if a setter asks a solver to put something near something else, what exactly is “near”? But accurate evaluation of trained models in standardised settings is a linchpin of modern machine learning and artificial intelligence. To cope with this setting, we have developed a variety of evaluation methods to help diagnose problems in and score agents, including simply having humans interact with agents in large trials.

Humans evaluated the performance of agents and other humans in completing instructions in the Playroom on both instruction-following and question-answering tasks. Randomly initialised agents were successful ~0% of the time. An agent trained with supervised behavioural cloning alone (B) performed somewhat better, at ~10-20% of the time. Agents trained with semi-supervised auxiliary tasks as well (B·A) performed better. Those trained with supervised, semi-supervised, and reinforcement learning using interactive self-play were judged to perform best (BG·A & BGR·A).

A distinct advantage of our setting is that human operators can set a virtually infinite set of new tasks via language, and quickly understand the competencies of our agents. There are many tasks that they cannot cope with, but our approach to building AIs offers a clear path for improvement across a growing set of competencies. Our methods are general and can be applied wherever we need agents that interact with complex environments and people.