Temporal Difference and Return Optimism in Cooperative Multi-Agent Reinforcement Learning
We describe a framework for decentralized multi-agent reinforcement learning that encompasses existing approaches based on optimistic updates, as well as ideas from distributional reinforcement learning. Both families of methods can be interpreted as augmenting single-agent learning algorithms with optimism: the former at the level of temporal difference (TD) errors, and the latter at the level of returns. This perspective allows a detailed examination of the fundamental differences between the two families across a range of environments, and we identify several environment properties that exemplify the performance differences that may arise in practice. Further, the unifying framework highlights many possible variants of these basic approaches, and we introduce several new families of algorithms that can be seen as interpolating between TD-optimism and return-optimism.
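To make the distinction concrete, the following is a minimal illustrative sketch (not the paper's algorithms) of the two levels at which optimism can be injected into a tabular independent Q-learner: TD-level optimism in the style of hysteretic Q-learning, where positive TD errors use a larger step size than negative ones, and return-level optimism in a distributional style, where the learner tracks an upper quantile of the return via a quantile-regression step. All function and variable names here are assumptions for illustration.

```python
import numpy as np

def hysteretic_update(q, s, a, r, s_next, alpha=0.1, beta=0.01, gamma=0.99):
    """TD-level optimism (hysteretic-style): positive TD errors are applied
    with step size alpha, negative ones with a smaller step size beta."""
    td = r + gamma * q[s_next].max() - q[s, a]
    q[s, a] += (alpha if td >= 0 else beta) * td
    return q

def optimistic_quantile_update(z, s, a, r, s_next, tau=0.75, lr=0.1, gamma=0.99):
    """Return-level optimism (distributional flavour): track an upper
    quantile (tau > 0.5) of the return with a quantile-regression step,
    so the learned value statistic is itself optimistic."""
    g = r + gamma * z[s_next].max()             # sampled return target
    z[s, a] += lr * (tau - float(g < z[s, a]))  # pinball-loss gradient step
    return z
```

In both cases the agent's estimate moves further upward after a favourable outcome than it moves downward after an unfavourable one, but the optimism enters at different points: in the update rule itself versus in the statistic of the return distribution being estimated.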