Updated 30/5/19. Read about our new work below, in “Human Comparable Agents” and “Going Further”.
Mastering the strategy, tactical understanding, and team play involved in multiplayer video games represents a critical challenge for AI research. In our latest paper, now published in the journal Science, we present new developments in reinforcement learning that result in human-level performance in Quake III Arena Capture the Flag. This is a complex, multi-agent environment and one of the canonical 3D first-person multiplayer games. The agents successfully cooperate with both artificial and human teammates, and demonstrate high performance even when trained with reaction times comparable to those of human players. Furthermore, we show how these methods have scaled beyond research Capture the Flag environments to the full game of Quake III Arena.
Billions of people inhabit the planet, each with their own individual goals and actions, but still capable of coming together through teams, organisations and societies in impressive displays of collective intelligence. This is a setting we call multi-agent learning: many individual agents must act independently, yet learn to interact and cooperate with other agents. This is an immensely difficult problem, because with co-adapting agents the world is constantly changing.
To investigate this problem, we look at 3D first-person multiplayer video games. These games represent the most popular genre of video game, and have captured the imagination of millions of gamers because of their immersive game play, as well as the challenges they pose in terms of strategy, tactics, hand-eye coordination, and team play. The challenge for our agents is to learn directly from raw pixels to produce actions. This complexity makes first-person multiplayer games a fruitful and active area of research within the AI community.
The game we focused on in this work is Quake III Arena (which we aesthetically modified, though all game mechanics remain the same). Quake III Arena has laid the foundations for many modern first-person video games, and has attracted a long-standing competitive e-sports scene. We train agents that learn and act as individuals, but which must be able to play on teams with and against any other agents, artificial or human.
The rules of Capture the Flag (CTF) are simple, but the dynamics are complex. Two teams of individual players compete on a given map with the goal of capturing the opposing team’s flag while protecting their own. To gain tactical advantage they can tag the opposing team’s members to send them back to their spawn points. The team with the most flag captures after five minutes wins.
From a multi-agent perspective, CTF requires players both to cooperate successfully with their teammates and to compete with the opposing team, while remaining robust to any playing style they might encounter.
To make things even more interesting, we consider a variant of CTF in which the map layout changes from match to match. As a consequence, our agents are forced to acquire general strategies rather than memorising the map layout. Additionally, to level the playing field, our learning agents experience the world of CTF in a similar way to humans: they observe a stream of pixel images and issue actions through an emulated game controller.
Our agents must learn from scratch how to see, act, cooperate, and compete in unseen environments, all from a single reinforcement signal per match: whether their team won or not. This is a challenging learning problem, and its solution is based on three general ideas for reinforcement learning:
- Rather than training a single agent, we train a population of agents, which learn by playing with each other, providing a diversity of teammates and opponents.
- Each agent in the population learns its own internal reward signal, which allows agents to generate their own internal goals, such as capturing a flag. A two-tier optimisation process optimises agents’ internal rewards directly for winning, and uses reinforcement learning on the internal rewards to learn the agents’ policies.
- Agents operate at two timescales, fast and slow, which improves their ability to use memory and generate consistent action sequences.
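The first two ideas can be sketched in a few lines. This is a toy illustration and not the FTW training code: the event names, the `Agent` class, the stand-in `play_match` function, and the mutation scale are all assumptions made for the example. Each agent carries its own internal reward weights. An inner loop would run reinforcement learning on those internal rewards (omitted here), while an outer loop selects on win rate and perturbs the weights of the strongest agent.

```python
import random
from dataclasses import dataclass, field

# Hypothetical game events an internal reward could be attached to.
EVENTS = ("flag_pickup", "flag_capture", "tagged_opponent")


@dataclass
class Agent:
    name: str
    # Internal reward weights: one value per game event.
    reward_weights: dict = field(default_factory=dict)
    wins: int = 0
    games: int = 0

    def internal_reward(self, event_counts):
        """Dense reward the inner RL loop would optimise (policy update omitted)."""
        return sum(self.reward_weights[e] * event_counts.get(e, 0) for e in EVENTS)

    def win_rate(self):
        return self.wins / self.games if self.games else 0.0


def make_population(size, rng):
    return [
        Agent(f"agent{i}", {e: rng.uniform(0.0, 1.0) for e in EVENTS})
        for i in range(size)
    ]


def play_match(team_a, team_b, rng):
    """Stand-in for a real CTF match: returns True if team_a wins."""
    return rng.random() < 0.5


def record_match(population, rng, team_size=2):
    """Sample teammates and opponents from the population itself."""
    players = rng.sample(population, 2 * team_size)
    team_a, team_b = players[:team_size], players[team_size:]
    a_won = play_match(team_a, team_b, rng)
    for agent in team_a:
        agent.games += 1
        agent.wins += a_won
    for agent in team_b:
        agent.games += 1
        agent.wins += not a_won


def evolve(population, rng, mutation=0.2):
    """Outer tier: the weakest agent inherits perturbed weights from the strongest."""
    ranked = sorted(population, key=Agent.win_rate)
    worst, best = ranked[0], ranked[-1]
    worst.reward_weights = {
        e: w * rng.uniform(1 - mutation, 1 + mutation)
        for e, w in best.reward_weights.items()
    }
    worst.wins = worst.games = 0  # fresh statistics for the new variant


rng = random.Random(0)
population = make_population(6, rng)
for _ in range(200):
    record_match(population, rng)
evolve(population, rng)
```

Because the only signal the outer loop uses is the win rate, the internal reward weights are shaped by winning itself, while the policy (not modelled here) would only ever see the denser internal reward.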
The resulting agent, dubbed the For The Win (FTW) agent, learns to play CTF to a very high standard. Crucially, the learned agent policies are robust to the size of the maps, the number of teammates, and the other players on their team. Below, you can explore games on the outdoor procedural environments, where FTW agents play against each other, as well as games on the indoor procedural environments, in which humans and agents play together.
We ran a tournament including 40 human players, in which humans and agents were randomly matched up in games, both as opponents and as teammates.
The FTW agents learn to become much stronger than the strong baseline methods, and exceed the win-rate of the human players. In fact, in a survey among participants the agents were rated more collaborative than human participants.
Going beyond mere performance evaluation, it is important to understand the emergent complexity in the behaviours and internal representations of these agents.
To understand how agents represent game state, we look at activation patterns of the agents’ neural networks plotted on a plane. Clusters of dots in the figure below represent situations during play with nearby dots representing similar activation patterns. These dots are coloured according to the high-level CTF game state in which the agent finds itself: In which room is the agent? What is the status of the flags? What teammates and opponents can be seen? We observe clusters of the same colour, indicating that the agent represents similar high-level game states in a similar manner.
The agents are never told anything about the rules of the game, yet learn about fundamental game concepts and effectively develop an intuition for CTF. In fact, we can find particular neurons that code directly for some of the most important game states, such as a neuron that activates when the agent’s flag is taken, or a neuron that activates when an agent’s teammate is holding a flag. The paper provides further analysis covering the agents’ use of memory and visual attention.
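One simple way to look for such a neuron, shown here as a hedged sketch and not as the analysis used in the paper, is to correlate each unit's activation with a binary game-state label. The data below is synthetic, and the function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def most_selective_unit(activations, labels):
    """Return (unit index, correlation) of the unit that best tracks a binary label.

    activations: list of time steps, each a list of unit activations.
    labels: list of 0/1 flags, e.g. "my flag is taken" at each time step.
    """
    n_units = len(activations[0])
    scores = [
        pearson([step[u] for step in activations], labels) for u in range(n_units)
    ]
    best = max(range(n_units), key=lambda u: abs(scores[u]))
    return best, scores[best]


# Synthetic example: unit 2 follows the label, the other units are noise.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(500)]
activations = [
    [rng.gauss(0, 1), rng.gauss(0, 1), label + rng.gauss(0, 0.1), rng.gauss(0, 1)]
    for label in labels
]
unit, score = most_selective_unit(activations, labels)
```

On real recordings, a high correlation would only suggest that a unit tracks the game state; it would not by itself show how the agent uses that unit.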
Human Comparable Agents
How did our agents perform as well as they did? First, we noticed that the agents had very fast reaction times and were very accurate taggers, which might explain their performance (tagging is a tactical action that sends opponents back to their starting point). Humans are comparatively slow to process and act on sensory input, due to our slower biological signalling. Thus, our agents’ superior performance might be a result of their faster visual processing and motor control. However, by artificially reducing this accuracy and reaction time, we saw that this was only one factor in their success. In a further study, we trained agents with an inbuilt delay of roughly a quarter of a second (267 ms), meaning that they observe the world with a 267 ms lag, comparable with reported reaction times of human video game players. These response-delayed agents still outperformed human participants, with strong humans winning only 21% of the time.
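A reaction delay of this kind can be modelled as a fixed-length queue between the environment and the agent. The sketch below is illustrative only: the `DelayedObservations` class and the rate of 15 agent steps per second are assumptions for the example, and the 267 ms lag is rounded to a whole number of steps.

```python
from collections import deque


class DelayedObservations:
    """Feeds an agent observations that are `delay_steps` steps old."""

    def __init__(self, delay_steps, initial_observation):
        # Pre-fill so the agent sees the initial observation until real
        # frames have aged enough to be released.
        self._buffer = deque([initial_observation] * delay_steps)

    def push(self, observation):
        """Add the newest frame and return the frame the agent may see now."""
        self._buffer.append(observation)
        return self._buffer.popleft()


def delay_in_steps(delay_ms, steps_per_second):
    """Convert a delay in milliseconds to the nearest whole number of steps."""
    return round(delay_ms * steps_per_second / 1000)


# Assumed rate of 15 agent steps per second: 267 ms is about 4 steps.
steps = delay_in_steps(267, 15)
delayed = DelayedObservations(steps, initial_observation="blank")
seen = [delayed.push(f"frame{t}") for t in range(6)]
```

With these assumed numbers the agent receives the placeholder observation for its first four steps and only then starts seeing real frames, each four steps old.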
Through unsupervised learning we established the prototypical behaviours of agents and humans to discover that agents in fact learn human-like behaviours, such as following teammates and camping in the opponent’s base.
These behaviours emerge in the course of training, through reinforcement learning and population-level evolution, with some behaviours, such as teammate following, falling out of favour as agents learn to cooperate in a more complementary manner.
The training progression of a population of FTW agents. Top left: the 30 agents’ Elo ratings as they train and evolve from each other. Top right: the genetic tree of these evolution events. The lower graph shows the progression of knowledge, some of the internal rewards, and behaviour probability throughout the training of the agents.
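For readers unfamiliar with Elo ratings, the standard two-player update is sketched below. How ratings are computed for teams of agents in this work is not described here, so the head-to-head setting and the K-factor of 32 are assumptions for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b); score_a is 1 win, 0.5 draw, 0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated players; A wins.
new_a, new_b = elo_update(1500, 1500, score_a=1)
```

The update is zero-sum: whatever one player gains, the other loses, and an upset against a higher-rated opponent moves the ratings further than an expected win.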
Going Further
While this paper focuses on Capture the Flag, the research contributions are general and we are excited to see how others build upon our techniques in different complex environments. Since initially publishing these results, we have found success in extending these methods to the full game of Quake III Arena, which includes professionally played maps, more multiplayer game modes in addition to Capture the Flag, and more gadgets and pickups. Initial results indicate that agents can play multiple game modes and multiple maps competitively, and are starting to challenge the skills of our human researchers in test matches. Indeed, ideas introduced in this work, such as population-based multi-agent RL, form a foundation of the AlphaStar agent in our work on StarCraft II.
Agents playing two other Quake III Arena multiplayer game modes on full-scale tournament maps: Harvester on the Future Crossings map and One Flag Capture the Flag on the Ironwood map.
In general, this work highlights the potential of multi-agent training to advance the development of artificial intelligence: exploiting the natural curriculum provided by multi-agent training, and forcing the development of robust agents that can even team up with humans.
This work was done by Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Brendan Tracey, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil Rabinowitz, Ari Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel.
Visualisations were created by Adam Cain, Damien Boudot, Doug Fritz, Jaume Sanchez Elias, Paul Lewis, Max Jaderberg, Wojciech M. Czarnecki, and Luke Marris.
We would like to thank Patrick Howard and Dan “Scancode” Gold for allowing us to use the Quake III Arena maps they designed.