AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning

TL;DR: AlphaStar is the first AI to reach the top league of a widely popular esport without any game restrictions. This January, a preliminary version of AlphaStar challenged two of the world's top players in StarCraft II, one of the most enduring and popular real-time strategy video games of all time. Since then, we have taken on a much greater challenge: playing the full game at a Grandmaster level under professionally approved conditions.
Our new research differs from prior work in several key regards:
  1. AlphaStar now has the same kind of constraints that humans play under – including viewing the world through a camera, and stronger limits on the frequency of its actions* (in collaboration with StarCraft professional Dario “TLO” Wünsch).
  2. AlphaStar can now play in one-on-one matches as and against Protoss, Terran, and Zerg – the three races present in StarCraft II. Each of the Protoss, Terran, and Zerg agents is a single neural network.
  3. The League training is fully automated, and starts only with agents trained by supervised learning, rather than from previously trained agents from past experiments.
  4. AlphaStar played on the official game server, Battle.net, using the same maps and conditions as human players. All game replays are available here.

We chose to use general-purpose machine learning techniques – including neural networks, self-play via reinforcement learning, multi-agent learning, and imitation learning – to learn directly from game data. Using the advances described in our Nature paper, AlphaStar was ranked above 99.8% of active players on Battle.net, and achieved a Grandmaster level for all three StarCraft II races: Protoss, Terran, and Zerg. We expect these methods could be applied to many other domains.

Learning-based systems and self-play are elegant research concepts which have facilitated remarkable advances in artificial intelligence. In 1992, researchers at IBM developed TD-Gammon, combining a learning-based system with a neural network to play the game of backgammon. Instead of playing according to hard-coded rules or heuristics, TD-Gammon was designed to use reinforcement learning to figure out, through trial-and-error, how to play the game in a way that maximises its probability of winning. Its developers used the notion of self-play to make the system more robust: by playing against versions of itself, the system grew increasingly proficient at the game. When combined, the notions of learning-based systems and self-play provide a powerful paradigm of open-ended learning.

Many advances since then have demonstrated that these approaches can be scaled to progressively challenging domains. For example, AlphaGo and AlphaZero established that it was possible for a system to learn to achieve superhuman performance at Go, chess, and shogi, and OpenAI Five and DeepMind’s FTW demonstrated the power of self-play in the modern games of Dota 2 and Quake III. 

At DeepMind, we’re interested in understanding the potential – and limitations – of open-ended learning, which enables us to develop robust and flexible agents that can cope with complex, real-world domains. Games like StarCraft are an excellent training ground to advance these approaches, as players must use limited information to make dynamic and difficult decisions that have ramifications on multiple levels and timescales. 

I’ve found AlphaStar’s gameplay incredibly impressive – the system is very skilled at assessing its strategic position, and knows exactly when to engage or disengage with its opponent. And while AlphaStar has excellent and precise control, it doesn’t feel superhuman – certainly not on a level that a human couldn’t theoretically achieve. Overall, it feels very fair – like it is playing a ‘real’ game of StarCraft.

Dario “TLO” Wünsch, professional StarCraft II player

Despite its successes, self-play suffers from well-known drawbacks. The most salient one is forgetting: an agent playing against itself may keep improving, but it also may forget how to win against a previous version of itself. Forgetting can create a cycle of an agent “chasing its tail”, and never converging or making real progress. For example, in the game rock-paper-scissors, an agent may currently prefer to play rock over other options. As self-play progresses, a new agent will then choose to switch to paper, as it wins against rock. Later, the agent will switch to scissors, and eventually back to rock, creating a cycle. Fictitious self-play – playing against a mixture of all previous strategies – is one solution to cope with this challenge.
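The rock-paper-scissors cycle can be reproduced in a few lines. The sketch below is a toy illustration and not AlphaStar code: the function names and payoff table are our own, and each "agent" is simply a best response. Naive self-play responds only to the latest agent, while fictitious self-play responds to the accumulated mixture of all previous agents.

```python
# Toy illustration only: names and structure are hypothetical, not from AlphaStar.
ROCK, PAPER, SCISSORS = 0, 1, 2

# PAYOFF[a][b] is +1 if action a beats action b, -1 if it loses, 0 on a tie.
PAYOFF = [
    [0, -1, 1],   # rock loses to paper, beats scissors
    [1, 0, -1],   # paper beats rock, loses to scissors
    [-1, 1, 0],   # scissors loses to rock, beats paper
]


def best_response(opponent_weights):
    """Pure action with the highest expected payoff against a weighted mixture.

    Weights need not be normalised; ties go to the lowest-numbered action.
    """
    scores = [
        sum(PAYOFF[a][b] * opponent_weights[b] for b in range(3))
        for a in range(3)
    ]
    return max(range(3), key=lambda a: scores[a])


def naive_self_play(steps, start=ROCK):
    """Each new agent best-responds only to the most recent agent."""
    history = [start]
    for _ in range(steps):
        latest = [0, 0, 0]
        latest[history[-1]] = 1
        history.append(best_response(latest))
    return history


def fictitious_self_play(steps, start=ROCK):
    """Each new agent best-responds to the mixture of all previous agents.

    Returns the empirical frequency of each action over the whole run.
    """
    counts = [0, 0, 0]
    counts[start] = 1
    for _ in range(steps):
        counts[best_response(counts)] += 1
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    print(naive_self_play(6))          # [0, 1, 2, 0, 1, 2, 0]: rock, paper, scissors, repeat
    print(fictitious_self_play(3000))  # frequencies drift towards one third each
```

In the naive loop the agent never settles, whereas the empirical mixture under fictitious self-play approaches the balanced strategy, which no single counter-strategy can exploit.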

After first open-sourcing StarCraft II as a research environment, we found that even fictitious self-play techniques were insufficient to produce strong agents, so we set out to develop a better, general-purpose solution. A central idea of our recently published Nature paper extends the notion of fictitious self-play to a group of agents – the League. Normally in self-play, every agent maximises its probability of winning against its opponents; however, this was only part of the solution. In the real world, a player trying to improve at StarCraft may choose to do so by partnering with friends so that they can train particular strategies. As such, their training partners are not playing to win against every possible opponent, but are instead exposing the flaws of their friend, to help them become a better and more robust player. The key insight of the League is that playing to win is insufficient: instead, we need both main agents whose goal is to win versus everyone, and also exploiter agents that focus on helping the main agent grow stronger by exposing its flaws, rather than maximising their own win rate against all players. Using this training method, the League learns all its complex StarCraft II strategy in an end-to-end, fully automated fashion.
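One way to picture the difference between the two roles is in how each chooses its training opponents. The following is a minimal sketch under our own assumptions: the `Player` type, the role labels, and uniform sampling are hypothetical simplifications, and the actual matchmaking is described in the Nature paper rather than here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Player:
    name: str
    role: str  # "main" or "exploiter" (hypothetical labels for this sketch)


def candidate_opponents(learner, league):
    """Return the players a learner trains against, depending on its role."""
    if learner.role == "main":
        # Main agents aim to win versus everyone else in the League.
        return [p for p in league if p != learner]
    if learner.role == "exploiter":
        # Exploiters focus only on main agents, to expose their flaws.
        return [p for p in league if p.role == "main"]
    raise ValueError(f"unknown role: {learner.role!r}")


def sample_opponent(learner, league, rng):
    """Pick one training opponent uniformly (an assumed simplification)."""
    return rng.choice(candidate_opponents(learner, league))


if __name__ == "__main__":
    league = [
        Player("main_0", "main"),
        Player("main_1", "main"),
        Player("exploiter_0", "exploiter"),
    ]
    rng = random.Random(0)
    for learner in league:
        print(learner.name, "vs", sample_opponent(learner, league, rng).name)
```

The point of the design is that an exploiter's objective is deliberately narrow: it is rewarded for beating main agents, not for being a strong all-round player.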

Figure 1 depicts some of the challenges in complex domains such as StarCraft. (Top row) Players can create a variety of ‘units’ (e.g. workers, fighters, or transporters) to deploy in different strategic moves. Units have balanced strengths and weaknesses in a manner analogous to the game rock-paper-scissors. Thanks to imitation learning, our initial agent can already execute a diverse set of strategies, depicted here as a composition of units created in the game (in this example: Void rays, Stalkers and Immortals). However, because some strategies are easier to improve on, naive reinforcement learning would narrowly focus on these. Other strategies may require more learning, or have particular nuances that make them harder for the agent to perfect. This creates a vicious cycle in which some valid strategies appear less and less effective because the agent abandons them in favour of a dominant strategy. (Bottom row) We added agents to the League whose sole purpose is to expose weaknesses of the main agent. This means that more valid strategies will be discovered and developed, making the main agent far more robust against its opponents. At the same time, we employed imitation learning techniques (including distillation) to prevent AlphaStar from forgetting throughout training, and used latent variables to represent a diverse set of opening moves.

Exploration is another key challenge in complex environments such as StarCraft. There are up to 10^26 possible actions available to one of our agents at each time step, and the agent must make thousands of actions before learning if it has won or lost the game. Finding winning strategies is challenging in such a massive solution space. Even with a strong self-play system and a diverse league of main and exploiter agents, there would be almost no chance of a system developing successful strategies in such a complex environment without some prior knowledge. Learning human strategies, and ensuring that the agents keep exploring those strategies throughout self-play, was key to unlocking AlphaStar’s performance. To do this, we used imitation learning – combined with advanced neural network architectures and techniques used for language modelling – to create an initial policy which played the game better than 84% of active players. We also used a latent variable which conditions the policy and encodes the distribution of opening moves from human games, which helped to preserve high-level strategies. AlphaStar then used a form of distillation throughout self-play to bias exploration towards human strategies. This approach enabled AlphaStar to represent many strategies within a single neural network (one for each race). During evaluation, the neural network was not conditioned on any specific opening moves.
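To make the idea of biasing exploration towards human strategies more concrete, a common way to express this kind of distillation is a penalty on how far the agent's action distribution drifts from a reference policy learned by imitation. The sketch below is a generic illustration under our own assumptions: the function names, the direction of the KL term, and the weight of 0.1 are hypothetical and are not taken from this post.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i <= 0.0:
                # q assigns zero probability to an action that p uses.
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total


def regularised_loss(rl_loss, human_policy, agent_policy, weight=0.1):
    """Add a distillation penalty to a reinforcement learning loss.

    `human_policy` stands in for the imitation-learned reference policy;
    `weight` is an assumed hyperparameter for this sketch.
    """
    return rl_loss + weight * kl_divergence(human_policy, agent_policy)


if __name__ == "__main__":
    human = [0.5, 0.3, 0.2]    # reference distribution over three actions
    drifted = [0.9, 0.05, 0.05]
    print(regularised_loss(1.0, human, human))    # no penalty when policies match
    print(regularised_loss(1.0, human, drifted))  # larger loss once the agent drifts
```

With a penalty of this shape, strategies that humans play keep some probability mass during self-play instead of being abandoned early.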

League exploiter discovery

AlphaStar is an intriguing and unorthodox player – one with the reflexes and speed of the best pros but strategies and a style that are entirely its own. The way AlphaStar was trained, with agents competing against each other in a league, has resulted in gameplay that’s unimaginably unusual; it really makes you question how much of StarCraft’s diverse possibilities pro players have really explored.

Diego "Kelazhur" Schwimer, professional StarCraft II player

In addition, we found that many prior approaches to reinforcement learning are ineffective in StarCraft, due to its enormous action space. In particular, AlphaStar uses a new algorithm for off-policy reinforcement learning, which allows it to efficiently update its policy from games played by an older policy.
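This post does not describe the algorithm itself, so the following is only a generic illustration of the off-policy idea, not AlphaStar's method. When a game was played by an older policy, a standard correction is to reweight each action by the ratio of its probability under the new and old policies, and to truncate that ratio to keep updates stable. The function names and the clip value of 1.0 are our own assumptions.

```python
def truncated_importance_weight(new_prob, old_prob, clip=1.0):
    """Ratio of new to old action probability, capped at `clip`."""
    if old_prob <= 0.0:
        raise ValueError("the old policy must have taken the action with non-zero probability")
    return min(clip, new_prob / old_prob)


def weighted_advantages(new_probs, old_probs, advantages, clip=1.0):
    """Scale per-step advantages from old-policy games for use by the new policy."""
    return [
        truncated_importance_weight(n, o, clip) * adv
        for n, o, adv in zip(new_probs, old_probs, advantages)
    ]


if __name__ == "__main__":
    # Probabilities of the actions actually taken, under each policy.
    old = [0.4, 0.4, 0.5]
    new = [0.2, 0.8, 0.5]
    advantages = [1.0, 1.0, -2.0]
    print(weighted_advantages(new, old, advantages))
```

Actions the new policy has become less likely to take are down-weighted, while the cap stops actions it now strongly prefers from dominating the update.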


AlphaStar results

Open-ended learning systems that utilise learning-based agents and self-play have achieved impressive results in increasingly challenging domains. Thanks to advances in imitation learning, reinforcement learning, and the League, we were able to train AlphaStar Final, an agent that reached Grandmaster level at the full game of StarCraft II without any modifications, as shown in the above video. This agent played online anonymously, using the gaming platform Battle.net, and achieved a Grandmaster level using all three StarCraft II races. AlphaStar played using a camera interface, with similar information to what human players would have, and with restrictions on its action rate to make it comparable with human players. The interface and restrictions were approved by a professional player. Ultimately, these results provide strong evidence that general-purpose learning techniques can scale AI systems to work in complex, dynamic environments involving multiple actors. The techniques we used to develop AlphaStar will help further the safety and robustness of AI systems in general, and, we hope, may serve to advance our research in real-world domains.

It was exciting to see the agent develop its own strategies differently from the human players [...]. The caps on the actions it can take and the camera view restrictions now make for compelling games – even though, as a pro, I can still spot some of the system’s weaknesses.

Grzegorz "MaNa" Komincz

  • Publicly available paper here

AlphaStar team:

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, David Silver

Acknowledgements:

We’re grateful to Dario Wünsch (TLO), Grzegorz Komincz (MaNa), and Diego Schwimer (Kelazhur) for their advice, guidance, and immense skill. We are also grateful for the continued support of Blizzard and the StarCraft gaming and AI community for making this work possible – especially those who played against AlphaStar on Battle.net. Thanks to Ali Razavi, Daniel Toyama, David Balduzzi, Doug Fritz, Eser Aygün, Florian Strub, Guillaume Alain, Haoran Tang, Jaume Sanchez, Jonathan Fildes, Julian Schrittwieser, Justin Novosad, Karen Simonyan, Karol Kurach, Philippe Hamel, Ricardo Barreira, Scott Reed, Sergey Bartunov, Shibl Mourad, Steve Gaffney, Thomas Hubert, the team that created PySC2, and the whole DeepMind team, with special thanks to the research platform team, comms and events teams.


*Agents were capped at a max of 22 agent actions per 5 seconds, where one agent action corresponds to a selection, an ability and a target unit or point, which counts as up to 3 actions towards the in-game APM counter. Moving the camera also counts as an agent action, despite not being counted towards APM.
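
As a rough illustration of the cap in this footnote, the sketch below enforces at most 22 agent actions in any 5-second span and works out the implied upper bound on the in-game APM counter. It rests on our own assumptions: the class name is hypothetical, and whether the real cap uses a sliding window or fixed 5-second blocks is not stated here.

```python
from collections import deque

MAX_AGENT_ACTIONS = 22      # from the footnote
WINDOW_SECONDS = 5          # from the footnote
APM_PER_AGENT_ACTION = 3    # "up to 3 actions" per agent action

# Upper bound on the in-game APM counter implied by the cap:
# 22 actions * 12 windows per minute * 3 = 792.
PEAK_APM = MAX_AGENT_ACTIONS * (60 // WINDOW_SECONDS) * APM_PER_AGENT_ACTION


class ActionRateLimiter:
    """Sliding-window limiter (an assumed interpretation of the cap)."""

    def __init__(self, max_actions=MAX_AGENT_ACTIONS, window_seconds=WINDOW_SECONDS):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def try_act(self, now):
        """Record an action at time `now` (seconds) if the cap allows it."""
        # Forget actions that have left the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = ActionRateLimiter()
    allowed = sum(limiter.try_act(0.1 * i) for i in range(30))
    print(allowed)               # 22 of the 30 attempts within 3 seconds succeed
    print(limiter.try_act(5.0))  # True: the action at t=0 has left the window
    print(PEAK_APM)              # 792
```

Because camera moves count as agent actions but not towards APM, the in-game counter will typically read below this bound.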