# Going beyond average for reinforcement learning

Consider the commuter who toils backwards and forwards each day on a train. Most mornings, her train runs on time and she reaches her first meeting relaxed and ready. But she knows that once in a while the unexpected happens: a mechanical problem, a signal failure, or even just a particularly rainy day. Invariably these hiccups disrupt her pattern, leaving her late and flustered.

Randomness is something we encounter every day, and it has a profound effect on how we experience the world. The same is true in reinforcement learning (RL) applications, systems that learn by trial and error and are motivated by rewards. Typically, an RL algorithm predicts the average reward it receives from multiple attempts at a task, and uses this prediction to decide how to act. But random perturbations in the environment can alter its behaviour by changing the exact amount of reward the system receives.

In a new paper, we show it is possible to model not only the average but also the full variation of this reward, what we call the value distribution. This results in RL systems that are more accurate and faster to train than previous models, and more importantly opens up the possibility of rethinking the whole of reinforcement learning.

Returning to the example of our commuter, let’s consider a journey composed of three segments of 5 minutes each, except that once a week the train breaks down, adding another 15 minutes to the trip. A simple calculation shows that the average commute time is (3 × 5) + (15 / 5) = 18 minutes: the 15-minute scheduled trip plus the 15-minute delay spread over a five-day week.
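The same average can be checked by weighting each possible commute time by how often it occurs. The snippet below is a minimal sketch, assuming that "once a week" means one day out of a five-day working week; the names `commute_times` and `expected_value` are illustrative.

```python
from fractions import Fraction

# Possible commute times (minutes) mapped to their probabilities.
# Assumption: "once a week" means 1 day out of a 5-day working week.
commute_times = {
    15: Fraction(4, 5),  # three 5-minute segments, no breakdown
    30: Fraction(1, 5),  # the same trip plus a 15-minute delay
}

def expected_value(distribution):
    """Probability-weighted average of the outcomes."""
    return sum(outcome * prob for outcome, prob in distribution.items())

average_commute = expected_value(commute_times)
print(average_commute)  # 18
```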

In reinforcement learning, we use Bellman’s equation to predict this average commute time. Specifically, Bellman’s equation relates our current average prediction to the average prediction we make in the immediate future. From the first station, we predict an 18-minute journey (the average total duration); from the second, we predict a 13-minute journey (average duration minus the first segment’s length). Finally, assuming the train hasn’t yet broken down, from the third station we predict there are 8 minutes (13 - 5) left to our commute, until finally we arrive at our destination. Bellman’s equation makes each prediction sequentially, and updates these predictions on the basis of new information.
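To make the recursion concrete, here is a small sketch that computes the three predictions back to front, each one being the expected cost of the next segment plus the prediction at the following station. It assumes, for illustration only, that the delay can only be realised on the final segment (this is what makes the prediction from the third station 8 minutes); the function name and constants are hypothetical.

```python
from fractions import Fraction

SEGMENT_MINUTES = 5
DELAY_MINUTES = 15
BREAKDOWN_PROB = Fraction(1, 5)  # assumed: one day in five

def expected_remaining_minutes(num_segments=3):
    """Expected minutes left from each station, computed back to front.

    Bellman's equation for this simple chain reads
        V(station) = expected cost of next segment + V(next station),
    with V(destination) = 0.
    """
    values = [Fraction(0)] * (num_segments + 1)  # last entry is the destination
    for station in reversed(range(num_segments)):
        expected_cost = Fraction(SEGMENT_MINUTES)
        if station == num_segments - 1:
            # Assumption: the delay is only ever realised on the final segment.
            expected_cost += BREAKDOWN_PROB * DELAY_MINUTES
        values[station] = expected_cost + values[station + 1]
    return values[:num_segments]

predictions = expected_remaining_minutes()
print([int(v) for v in predictions])  # [18, 13, 8]
```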

What's a little counterintuitive about Bellman’s equation is that we never actually observe these predicted averages: either the train takes 15 minutes (4 days out of 5), or it takes 30 minutes – never 18! From a purely mathematical standpoint, this isn’t a problem, because decision theory tells us we only need averages to make the best choice. As a result, this issue has been mostly ignored in practice. Yet, there is now plenty of empirical evidence that predicting averages is a complicated business.


In our new paper, we show that there is in fact a variant of Bellman’s equation which predicts all possible outcomes, without averaging them. In our example, we maintain two predictions – a distribution – at each station: if the journey goes well, then the times are 15, 10, then 5 minutes, respectively; but if the train breaks down, then the times are 30, 25, and finally 20 minutes.
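A toy version of this idea for the commute can be sketched as follows: instead of backing up a single number, each station keeps a small table of outcomes and probabilities, and a step backwards simply shifts every outcome by the segment length. This is an illustration of the example only, not the learning rule from the paper; the helper names are hypothetical and the 1-in-5 breakdown probability is an assumption.

```python
from fractions import Fraction

SEGMENT_MINUTES = 5
DELAY_MINUTES = 15
BREAKDOWN_PROB = Fraction(1, 5)  # assumed: one day in five

def shift(distribution, minutes):
    """Back up a distribution over one deterministic segment:
    every possible outcome grows by the segment's duration."""
    return {outcome + minutes: prob for outcome, prob in distribution.items()}

def mean(distribution):
    """Collapse a distribution back to the usual average prediction."""
    return sum(outcome * prob for outcome, prob in distribution.items())

# From the third station, the last leg either runs on time or is delayed.
from_third = {
    SEGMENT_MINUTES: 1 - BREAKDOWN_PROB,
    SEGMENT_MINUTES + DELAY_MINUTES: BREAKDOWN_PROB,
}
from_second = shift(from_third, SEGMENT_MINUTES)
from_first = shift(from_second, SEGMENT_MINUTES)

for name, dist in [("first", from_first), ("second", from_second), ("third", from_third)]:
    print(name, sorted(dist), "mean:", mean(dist))
```

Taking the mean of each table recovers the 18, 13 and 8 minutes of the averaged predictions, so nothing is lost by keeping the full distribution.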

All of reinforcement learning can be reinterpreted under this new perspective, and its application is already leading to surprising new theoretical results. Predicting the distribution over outcomes also opens up all kinds of algorithmic possibilities, such as:

• Disentangling the causes of randomness: once we observe that commute times are bimodal, i.e. take on two possible values, we can act on this information, for example checking for train updates before leaving home;
• Telling safe and risky choices apart: when two choices have the same average outcome (e.g., walking or taking the train), we may favour the one which varies the least (walking);
• Natural auxiliary predictions: predicting a multitude of outcomes, such as the distribution of commute times, has been shown to be beneficial for training deep networks faster.
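As a rough illustration of the second point, the sketch below compares two options with the same average and breaks the tie using variance. The walking time of exactly 18 minutes is a made-up figure chosen so that the averages match, and the tie-breaking rule is just one simple possibility, not a method taken from the paper.

```python
from fractions import Fraction

def mean(distribution):
    return sum(outcome * prob for outcome, prob in distribution.items())

def variance(distribution):
    mu = mean(distribution)
    return sum(prob * (outcome - mu) ** 2 for outcome, prob in distribution.items())

# Commute-time distributions in minutes (outcome -> probability).
options = {
    "train": {15: Fraction(4, 5), 30: Fraction(1, 5)},
    "walk": {18: Fraction(1)},  # hypothetical: walking always takes 18 minutes
}

def safest_choice(choices):
    """Prefer the lowest average time; break ties by the lowest variance."""
    return min(choices, key=lambda name: (mean(choices[name]), variance(choices[name])))

print(safest_choice(options))  # walk
```

An agent that only tracks averages cannot tell these two options apart, since both come out at 18 minutes.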

We took our new ideas and implemented them within the Deep Q-Network agent, replacing its single average reward output with a distribution with 51 possible values. The only other change was a new learning rule, reflecting the transition from Bellman’s (average) equation to its distributional counterpart. Incredibly, it turns out that going from averages to distributions was all we needed to surpass the performance of all other comparable approaches, and by a wide margin. The graph below shows how we get 75% of a trained Deep Q-Network’s performance in 25% of the time, and achieve significantly better human-normalised performance:
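To give a feel for what a learning target over 51 fixed values might look like, here is a sketch of one backup step: each value is shifted by the reward, shrunk by a discount, and its probability is then split between the two nearest fixed values. Only the number 51 comes from the text above. The bounds `V_MIN` and `V_MAX`, the discount, and the nearest-neighbour splitting are illustrative assumptions, and this is not the agent's actual implementation.

```python
import math

NUM_ATOMS = 51               # from the text: 51 possible values
V_MIN, V_MAX = -10.0, 10.0   # assumed bounds on the return
DELTA = (V_MAX - V_MIN) / (NUM_ATOMS - 1)
ATOMS = [V_MIN + i * DELTA for i in range(NUM_ATOMS)]

def project_backup(next_probs, reward, discount=0.99):
    """Shift and shrink every atom by (reward, discount), then redistribute
    its probability onto the two nearest atoms of the fixed support."""
    target = [0.0] * NUM_ATOMS
    for atom, prob in zip(ATOMS, next_probs):
        shifted = min(max(reward + discount * atom, V_MIN), V_MAX)
        # Fractional index of the shifted value on the fixed support.
        position = (shifted - V_MIN) / DELTA
        position = min(max(position, 0.0), NUM_ATOMS - 1.0)
        lower, upper = math.floor(position), math.ceil(position)
        if lower == upper:
            # Landed exactly on an atom: keep all of the mass there.
            target[lower] += prob
        else:
            target[lower] += prob * (upper - position)
            target[upper] += prob * (position - lower)
    return target

def mean(probs):
    return sum(a * p for a, p in zip(ATOMS, probs))

# Usage: a uniform prediction for the next state, backed up through reward 1.
uniform = [1.0 / NUM_ATOMS] * NUM_ATOMS
target = project_backup(uniform, reward=1.0)
print(round(sum(target), 6), round(mean(target), 3))
```

Splitting each shifted value between its two neighbours keeps the total probability at one, and preserves the mean whenever no clipping at the bounds occurs.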

One surprising result is that we observe some randomness in Atari 2600 games, even though Stella, the underlying game emulator, is itself fully predictable. This randomness arises in part because of what’s called partial observability: due to the internal programming of the emulator, our agents playing the game of Pong cannot predict the exact time at which their score increases. Visualising the agent’s prediction over successive frames (graphs below) we see two separate outcomes (low and high), reflecting the possible timings. Although this intrinsic randomness doesn’t directly impact performance, our results highlight the limits of our agents’ understanding.

Randomness also occurs because the agent’s own behaviour is uncertain. In Space Invaders, our agent learns to predict the future probability that it might make a mistake and lose the game (zero reward).

Just like in our train journey example, it makes sense to keep separate predictions for these vastly different outcomes, rather than aggregate them into an unrealisable average. In fact, we think that our improved results are in great part due to the agent’s ability to model its own randomness.

It’s already evident from our empirical results that the distributional perspective leads to better, more stable reinforcement learning. With the possibility that every reinforcement learning concept could now have a distributional counterpart, this might be just the beginning for this approach.

This work was done by Marc G. Bellemare*, Will Dabney*, and Rémi Munos.