Emphatic Algorithms for Deep Reinforcement Learning
Off-policy learning allows us to learn about target policies from experience generated by a different policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling, a combination known as the “deadly triad”. Emphatic temporal difference (ETD) algorithms ensure convergence in the linear case by reweighting the updates on each time step. In this paper, we extend the use of emphatic methods to deep reinforcement learning (RL) agents. We show that naively adapting ETD(λ) to popular deep RL algorithms results in poor performance. We then derive new emphatic algorithms for use in the context of deep RL, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we show that these algorithms can work at scale on classic Atari games from the Arcade Learning Environment, where they improve the median human-normalized score of a strong baseline from 403% to 497% in the 200-million-frame regime.
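To make the per-step reweighting concrete, the following is a minimal sketch of one update of linear ETD(λ) as it is usually stated in the prior literature on emphatic TD. It is not one of the deep RL variants derived in this paper. The function name `etd_lambda_update`, its argument names, the constant discount and trace parameters, and the default interest of 1 are all assumptions chosen for illustration.

```python
import numpy as np


def etd_lambda_update(theta, e, F, rho_prev, rho, x, r, x_next,
                      alpha=0.01, gamma=0.9, lam=0.0, interest=1.0):
    """One step of linear ETD(lambda); illustrative sketch, hypothetical names.

    theta    : weight vector of the linear value estimate theta @ x
    e        : eligibility trace from the previous step
    F        : follow-on trace from the previous step (use 0.0 before the first step)
    rho_prev : importance-sampling ratio of the previous step
    rho      : importance-sampling ratio pi(a|s) / mu(a|s) of the current step
    x, x_next: feature vectors of the current and next state
    r        : reward observed on the transition
    """
    # Follow-on trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: the per-step weight that rescales this update.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, with the current features weighted by the emphasis.
    e = rho * (gamma * lam * e + M * x)
    # Ordinary TD error under the current weights.
    delta = r + gamma * (theta @ x_next) - theta @ x
    theta = theta + alpha * delta * e
    return theta, e, F


# Usage: two transitions in a two-state problem with one-hot features.
theta, e, F = np.zeros(2), np.zeros(2), 0.0
s0, s1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta, e, F = etd_lambda_update(theta, e, F, rho_prev=1.0, rho=1.0,
                                x=s0, r=1.0, x_next=s1, alpha=0.1)
theta, e, F = etd_lambda_update(theta, e, F, rho_prev=1.0, rho=2.0,
                                x=s1, r=0.0, x_next=s0, alpha=0.1)
```

With λ = 0 the emphasis equals the follow-on trace, so the second update is scaled by 1.9 rather than by 1 as in plain off-policy TD(0). This growing, trajectory-dependent weight is the mechanism the abstract refers to as reweighting the updates.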