Alchemy: A structured task distribution for meta-reinforcement learning


There has been rapidly growing interest in meta-learning as a method for increasing the flexibility and sample efficiency of reinforcement learning. One problem in this area of research, however, has been a scarcity of adequate benchmark tasks. In general, the structure underlying past benchmarks has either been too simple to be inherently interesting, or too ill-defined to support principled analysis. In the present work, we introduce a new benchmark for meta-RL research, which combines structural richness with structural transparency. Alchemy is a 3D video game, implemented in Unity, which involves a latent causal structure that is resampled procedurally from episode to episode, affording structure learning, online inference, hypothesis testing and action sequencing based on abstract domain knowledge. We evaluate a pair of powerful RL agents on Alchemy and present an in-depth analysis of one of these agents. Results clearly indicate a frank and specific failure of meta-learning, providing validation for Alchemy as a challenging benchmark for meta-RL. Concurrent with this report, we are releasing Alchemy as a public resource, together with a suite of analysis tools and sample agent trajectories.

Authors' Notes

When humans are faced with a new task, we are typically able to tackle it with admirable speed, requiring very little experience to get going. This kind of efficiency and flexibility is something we would also like to see in artificial agents. However, although there has recently been dramatic progress in building deep reinforcement learning (RL) agents that can perform complex tasks after extensive training, getting deep RL agents to rapidly master new tasks remains an open problem.

One promising approach is meta-learning, or learning to learn. The idea here is that the learner gains repurposable knowledge across a large set of experiences, and as this knowledge accumulates, it allows the learner to adapt more and more quickly to each new task it encounters. There has been rapidly growing interest in developing methods for meta-learning within deep RL. Although there has been substantive progress toward such ‘meta-reinforcement learning,’ research in this area has been held back by a shortage of benchmark tasks. In the present work, we aim to ease this problem by introducing (and open-sourcing) Alchemy, a useful new benchmark environment for meta-RL, along with a suite of analysis tools.

For meta-learning to occur, the environment must present the learner not with a single task, but with a series or distribution of tasks, all of which have some high-level features in common. Although such interrelated task settings are common in the real world (think of board games, or kitchen tasks, or subway systems), they are notoriously difficult to design for artificial agents operating in simulated environments. Ideally, we would like task distributions that are both interesting and accessible: interesting in the sense that they involve the rich kinds of shared structure that one sees in real-world tasks; and accessible in the sense that we have complete knowledge of the full task distribution, allowing us to say precisely what shared structure a good meta-learner would pick up on. Previous work on meta-RL has generally relied on task distributions that are either accessible without being interesting (such as bandit tasks), or else interesting without being accessible (such as Atari games). Alchemy is designed to offer the best of both worlds.

Alchemy is a single-player video game, implemented in Unity. The player sees a first-person view of a table with a number of objects on it, including a set of colored stones, a set of dishes containing colored potions, and a central cauldron. Stones have different point values, and points are collected when stones are added to the cauldron. By dipping stones into the potions, the player can transform the stones’ appearance, and thus their value, increasing the number of points that can be won.

Alchemy video demonstration

However, Alchemy also involves a crucially important catch: The ‘chemistry’ that governs how potions affect stones changes every time the game is played. A skillful player must perform a set of targeted experiments to discover how the current chemistry works, and use the results of those experiments to guide strategic action sequences. Learning to do that, over the course of many rounds of Alchemy, is precisely the meta-RL challenge.
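To make the idea of a chemistry that is resampled on every play concrete, here is a minimal toy sketch in Python. It is not the Alchemy implementation or its API: the colors, point values, potion names, and the assumption that each potion acts as a random permutation of stone colors are all hypothetical simplifications chosen for illustration.

```python
import random

# Hypothetical toy setup; the real game uses richer stone and potion attributes.
COLORS = ["red", "green", "blue"]
VALUES = {"red": -1, "green": 1, "blue": 3}
POTIONS = ["p0", "p1", "p2"]


def sample_chemistry(rng):
    """Sample a toy chemistry: each potion maps every color to a color.

    Assumption: each potion is a random permutation of the colors.
    """
    chemistry = {}
    for potion in POTIONS:
        targets = COLORS[:]
        rng.shuffle(targets)
        chemistry[potion] = dict(zip(COLORS, targets))
    return chemistry


def dip(chemistry, stone, potion):
    """Return the stone's new color after it is dipped into a potion."""
    return chemistry[potion][stone]


def random_dipping_episode(rng, n_dips=3):
    """Play one episode by dipping at random, then cash in the stone.

    A fresh chemistry is sampled at the start of every episode, so
    nothing learned about specific potions carries over directly.
    """
    chemistry = sample_chemistry(rng)
    stone = rng.choice(COLORS)
    for _ in range(n_dips):
        stone = dip(chemistry, stone, rng.choice(POTIONS))
    return VALUES[stone]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [random_dipping_episode(rng) for _ in range(1000)]
    print("mean score of random dipping:", sum(scores) / len(scores))
```

In this toy version, a player who dips at random gains nothing from experience within an episode, whereas a player who records what each potion did to each stone can plan the remaining dips. That contrast is the kind of within-episode learning the game is meant to probe.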

Alchemy has an ‘interesting’ structure, in the sense that it involves a compositional set of latent causal relationships, and requires strategic experimentation and action sequencing. But Alchemy’s structure is also ‘accessible,’ since game levels are created based on an explicit generative process.

This accessibility allows us to identify optimal meta-learning performance in Alchemy, by building a Bayes-optimal solver with access to the generative process. This optimal agent offers a valuable gold standard against which to compare any deep RL agent.
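The sketch below illustrates, under strong simplifying assumptions, why knowing the generative process makes exact inference possible. It is not the Bayes-optimal solver released with Alchemy. It assumes a single hypothetical potion whose effect is one of the six permutations of three colors, a uniform prior, and noise-free observations, so that Bayes' rule reduces to discarding inconsistent hypotheses and renormalizing.

```python
from itertools import permutations

# Hypothetical toy setup, not the real Alchemy attributes or rewards.
COLORS = ("red", "green", "blue")
VALUES = {"red": -1, "green": 1, "blue": 3}


def all_hypotheses():
    """Enumerate every candidate effect of one potion (permutations only)."""
    return [dict(zip(COLORS, perm)) for perm in permutations(COLORS)]


def uniform_prior(hypotheses):
    """Pair each hypothesis with an equal prior probability."""
    return [(h, 1.0 / len(hypotheses)) for h in hypotheses]


def update(posterior, stone, outcome):
    """Condition on one experiment: dipping `stone` produced `outcome`.

    With a deterministic likelihood, Bayes' rule keeps the consistent
    hypotheses and renormalizes their probabilities.
    """
    kept = [(h, p) for h, p in posterior if h[stone] == outcome]
    total = sum(p for _, p in kept)
    if total == 0:
        raise ValueError("observation is inconsistent with every hypothesis")
    return [(h, p / total) for h, p in kept]


def expected_value_after_dip(posterior, stone):
    """Expected point value of `stone` after dipping, under the posterior."""
    return sum(p * VALUES[h[stone]] for h, p in posterior)


if __name__ == "__main__":
    belief = uniform_prior(all_hypotheses())
    belief = update(belief, "red", "green")  # one targeted experiment
    print("hypotheses remaining:", len(belief))
    print("expected value of dipping green:",
          expected_value_after_dip(belief, "green"))
```

A full solver would also have to choose which experiments to run and when to stop experimenting and collect points; this sketch covers only the belief update that such planning would build on.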

As a first application of Alchemy, we presented it to two powerful deep RL agents (IMPALA and V-MPO). As detailed in our paper, although these agents have been shown to do well in many single-task RL environments, in Alchemy both of them displayed very poor meta-learning performance. Even after extensive training, both agents showed behavior reflecting only a superficial ‘understanding’ of the task: they essentially dipped stones into potions at random until a high stone value happened to result. Through a series of detailed analyses, we were able to establish that this failure of meta-learning was due not simply to the visuo-motor challenges of the 3D environment, nor to the difficulty of sequencing actions to achieve goals. Instead, the agents’ poor performance specifically reflected a failure of structure learning and latent-state inference, the core functions involved in meta-learning. Overall, the initial experiments presented in our report suggest that Alchemy may be a useful benchmark task for meta-RL research.

In tandem with our paper, we are releasing Alchemy as a public resource. The release includes multiple versions of the game (including a simplified, symbolic version, and a human-playable version), along with the Bayes-optimal benchmark agent described above and numerous other resources and analysis tools.

By Jane X. Wang, Michael King, Nicolas Porcel, Zeb Kurth-Nelson, Tina Zhu, Charlie Deck, Peter Choy, Mary Cassin, Malcolm Reynolds, Francis Song, Gavin Buttimore, David P. Reichert, Neil Rabinowitz, Loic Matthey, Demis Hassabis, Alex Lerchner, and Matthew Botvinick.