Capture the Flag: the emergence of complex cooperative agents

Mastering the strategy, tactical understanding, and team play involved in multiplayer video games represents a critical challenge for AI research. Now, through new developments in reinforcement learning, our agents have achieved human-level performance in Quake III Arena Capture the Flag, a complex multi-agent environment and one of the canonical 3D first-person multiplayer games. These agents demonstrate the ability to team up with both artificial agents and human players.

Video: Capture the Flag: from pixels to actions

Above: four of our trained agents play together on an indoor and outdoor procedurally generated Capture the Flag level.

Billions of people inhabit the planet, each with their own individual goals and actions, but still capable of coming together through teams, organisations and societies in impressive displays of collective intelligence. This is a setting we call multi-agent learning: many individual agents must act independently, yet learn to interact and cooperate with other agents. This is an immensely difficult problem, because with co-adapting agents the world is constantly changing.

To investigate this problem we look at 3D first-person multiplayer video games. These games represent the most popular genre of video game, and have captured the imagination of millions of gamers because of their immersive game play, as well as the challenges they pose in terms of strategy, tactics, hand-eye coordination, and team play. The challenge for our agents is to learn directly from raw pixels to produce actions. This complexity makes first-person multiplayer games a fruitful and active area of research within the AI community.

The game we focus on in this work is Quake III Arena (which we aesthetically modified, though all game mechanics remain the same). Quake III Arena has laid the foundations for many modern first-person video games, and has attracted a long-standing competitive esports scene. We train agents that learn and act as individuals, but which must be able to play on teams with and against any other agents, artificial or human.

The rules of Capture the Flag (CTF) are simple, but the dynamics are complex. Two teams of individual players compete on a given map with the goal of capturing the opposing team’s flag while protecting their own. To gain tactical advantage, they can tag opposing team members to send them back to their spawn points. The team with the most flag captures after five minutes wins.

Video: Capture the Flag: tutorial

From a multi-agent perspective, CTF requires players both to cooperate successfully with their teammates and to compete with the opposing team, while remaining robust to any playing style they might encounter.

To make things even more interesting, we consider a variant of CTF in which the map layout changes from match to match. As a consequence, our agents are forced to acquire general strategies rather than memorising the map layout. Additionally, to level the playing field, our learning agents experience the world of CTF in a similar way to humans: they observe a stream of pixel images and issue actions through an emulated game controller.

CTF is played on procedurally generated environments, such that agents must generalise to unseen maps.

Our agents must learn from scratch how to see, act, cooperate, and compete in unseen environments, all from a single reinforcement signal per match: whether their team won or not. This is a challenging learning problem, and its solution is based on three general ideas for reinforcement learning:

  • Rather than training a single agent, we train a population of agents, which learn by playing with each other, providing a diversity of teammates and opponents.
  • Each agent in the population learns its own internal reward signal, which allows agents to generate their own internal goals, such as capturing a flag. A two-tier optimisation process optimises agents’ internal rewards directly for winning, and uses reinforcement learning on the internal rewards to learn the agents’ policies. 
  • Agents operate at two timescales, fast and slow, which improves their ability to use memory and generate consistent action sequences.
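The first two ideas above can be illustrated with a toy sketch. This is not the FTW training code: the event names, the `TRUE_VALUE` table, the scalar `skill` standing in for a policy, and every hyperparameter are assumptions made up for illustration, and the two-timescale recurrent architecture is not modelled at all. The sketch only shows the shape of the two-tier loop: an inner step that improves each agent according to its own internal reward, and an outer, population-level step that selects and perturbs internal rewards based on win rate.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical game-point events that an internal reward could weight.
EVENTS = ("flag_capture", "flag_pickup", "tag_opponent")
# Toy ground truth of how much each event helps winning; hidden from the agents.
TRUE_VALUE = {"flag_capture": 1.0, "flag_pickup": 0.3, "tag_opponent": 0.1}


@dataclass
class Agent:
    reward_weights: dict   # internal reward: event name -> scalar weight
    skill: float = 0.0     # stand-in for the agent's policy parameters
    wins: int = 0
    games: int = 0

    @property
    def win_rate(self) -> float:
        return self.wins / self.games if self.games else 0.0


def alignment(weights: dict) -> float:
    """Cosine similarity between an internal reward and the toy ground truth."""
    dot = sum(weights[e] * TRUE_VALUE[e] for e in EVENTS)
    norm_w = math.sqrt(sum(w * w for w in weights.values()))
    norm_t = math.sqrt(sum(v * v for v in TRUE_VALUE.values()))
    return dot / (norm_w * norm_t) if norm_w > 0 else 0.0


def inner_rl_step(agent: Agent) -> None:
    """Inner tier (stand-in for RL): skill changes according to the internal reward."""
    agent.skill += 0.1 * alignment(agent.reward_weights)


def play_match(population: list, rng: random.Random) -> None:
    """Sample two teams of two from the population and record a win or loss."""
    a1, a2, b1, b2 = rng.sample(population, 4)
    diff = (a1.skill + a2.skill) - (b1.skill + b2.skill)
    team_a_wins = rng.random() < 1.0 / (1.0 + math.exp(-diff))
    for agent in (a1, a2, b1, b2):
        agent.games += 1
    for agent in ((a1, a2) if team_a_wins else (b1, b2)):
        agent.wins += 1


def evolve(population: list, rng: random.Random, fraction: float = 0.25) -> None:
    """Outer tier: weak agents copy a strong agent and perturb its internal reward."""
    ranked = sorted(population, key=lambda a: a.win_rate, reverse=True)
    n = max(1, int(len(ranked) * fraction))
    for loser in ranked[-n:]:
        parent = rng.choice(ranked[:n])
        loser.skill = parent.skill
        loser.reward_weights = {
            e: w * rng.choice((0.8, 1.2)) for e, w in parent.reward_weights.items()
        }
    for agent in population:
        agent.wins = agent.games = 0


def train(pop_size: int = 8, generations: int = 20, matches: int = 50, seed: int = 0) -> list:
    rng = random.Random(seed)
    population = [
        Agent({e: rng.uniform(-1.0, 1.0) for e in EVENTS}) for _ in range(pop_size)
    ]
    for _ in range(generations):
        for _ in range(matches):
            play_match(population, rng)
        for agent in population:
            inner_rl_step(agent)
        evolve(population, rng)
    return population
```

The only signal the outer tier uses is whether teams won, which mirrors the single win/loss reinforcement signal described above; the internal reward weights are never supervised directly.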

A schematic of the For The Win (FTW) agent architecture. The agent combines recurrent neural networks (RNNs) on fast and slow timescales, includes a shared memory module, and learns a conversion from game points to internal reward.

The resulting agent, dubbed the For The Win (FTW) agent, learns to play CTF to a very high standard. Crucially, the learned agent policies are robust to the size of the maps, the number of teammates, and the other players on their team. Below, you can explore some games on the outdoor procedural environments, where FTW agents play against each other, as well as games in which humans and agents play together on indoor procedural environments.

Interactive CTF game explorer, with games on indoor and outdoor procedurally generated environments. Games on outdoor maps are between FTW agents, while those on indoor maps are mixed human and FTW agent games (see icons).

We ran a tournament including 40 human players, in which humans and agents were randomly matched up in games, both as opponents and as teammates.

An early test tournament with humans playing CTF with and against trained agents and other humans.

The FTW agents learn to become much stronger than the strong baseline methods, and exceed the win-rate of the human players. In fact, in a survey among participants they were rated more collaborative than human participants.

The performance of our agents during training. Our new agent, the FTW agent, obtains a much higher Elo rating (which corresponds to the probability of winning) than the human players and the baseline methods of Self-play + RS and Self-play.
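The link between Elo rating and probability of winning can be made concrete with the standard Elo formula. The following is a minimal sketch, assuming the conventional logistic form with a 400-point scale and a K-factor of 32; the rating procedure actually used for the agents is described in the paper and may differ, and the function names and ratings are illustrative only.

```python
def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a draw.
    """
    expected_a = elo_win_probability(rating_a, rating_b, scale=400.0)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Under this convention, two equally rated players each have an expected score of 0.5, and a 400-point rating gap corresponds to an expected score of about 0.91 for the stronger player.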

Going beyond mere performance evaluation, it is important to understand the emergent complexity in the behaviours and internal representations of these agents.

To understand how agents represent game state, we look at activation patterns of the agents’ neural networks plotted on a plane. Dots in the figure below represent situations during play, with nearby dots representing similar activation patterns. These dots are coloured according to the high-level CTF game state in which the agent finds itself: In which room is the agent? What is the status of the flags? What teammates and opponents can be seen? We observe clusters of the same colour, indicating that the agent represents similar high-level game states in a similar manner.

A look into how our agents represent the game world. Different situations corresponding conceptually to the same game situation are represented similarly by the agent. The trained agents even exhibit some artificial neurons which code directly for particular situations.

The agents are never told anything about the rules of the game, yet learn about fundamental game concepts and effectively develop an intuition for CTF. In fact, we can find particular neurons that code directly for some of the most important game states, such as a neuron that activates when the agent’s flag is taken, or a neuron that activates when an agent’s teammate is holding a flag. The paper provides further analysis covering the agents’ use of memory and visual attention.

Aside from this rich representation, how do the agents actually behave? First, we noticed that the agents had very fast reaction times and were very accurate taggers, which could explain their performance. However, by artificially reducing this accuracy and reaction time we saw that this was only one factor in their success.

The effect of artificially reducing the agent’s tagging accuracy and tagging reaction time after training. Even with human-comparable accuracy and reaction time the performance of our agents is higher than that of humans.
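One hypothetical way to handicap a trained agent in this manner is to wrap its tag action so that it is delayed by a fixed number of frames and then only succeeds with some probability. The sketch below is an assumption about how such a handicap could be implemented, not a description of the procedure used in the paper; the class name `TagHandicap` and its parameters are invented for illustration.

```python
import random
from collections import deque


class TagHandicap:
    """Delays and randomly suppresses an agent's tag actions (illustrative only)."""

    def __init__(self, delay_frames: int, hit_probability: float, seed=None):
        # Pre-fill the queue so the first `delay_frames` outputs are "no tag".
        self._queue = deque([False] * delay_frames)
        self._hit_probability = hit_probability
        self._rng = random.Random(seed)

    def step(self, wants_to_tag: bool) -> bool:
        """Take the agent's tag decision for this frame and return what is executed."""
        self._queue.append(wants_to_tag)
        delayed = self._queue.popleft()
        # A delayed tag only goes through with probability `hit_probability`.
        return delayed and self._rng.random() < self._hit_probability
```

For example, with `delay_frames=2` and `hit_probability=1.0`, a tag requested on the first frame is executed on the third; lowering `hit_probability` then reduces accuracy independently of the delay.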

Through unsupervised learning we established the prototypical behaviours of agents and humans to discover that agents in fact learn human-like behaviours, such as following teammates and camping in the opponent’s base.

Three examples of the automatically discovered behaviours that the trained agents exhibit.

These behaviours emerge in the course of training, through reinforcement learning and population-level evolution, with some behaviours, such as teammate following, falling out of favour as agents learn to cooperate in a more complementary manner.

Video: FTW agents: training progression

Above: the training progression of a population of FTW agents. Top right shows the 30 agents’ Elo ratings as they train and evolve from each other. Top left shows the genetic tree of these evolution events. The lower graph shows the progression of knowledge, some of the internal rewards, and behaviour probability throughout the training of the agents.

The research community has recently done very impressive work in complex games like StarCraft II and Dota 2, and while this paper focuses on Capture the Flag, the research contributions are general and we are excited to see how others build upon our techniques in different complex environments. In the future, we also want to further improve on our current reinforcement learning and population-based training methods. In general, we think this work highlights the potential of multi-agent training to advance the development of artificial intelligence: exploiting the natural curriculum provided by multi-agent training, and forcing the development of robust agents that can even team up with humans.

For more details, please see the paper and the full supplementary video.

This work was done by Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil Rabinowitz, Ari Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel.

Visualisations by Adam Cain, Damien Boudot, Doug Fritz, Jaume Sanchez Elias, Paul Lewis, and Rich Green.