Quirks of Offline Reinforcement Learning: Addressing Estimation and Extrapolation


Offline reinforcement learning algorithms promise efficient learning, without interacting with an environment, from large amounts of logged data that tend to be readily available. This could enable several real-world applications, from robotics and self-driving cars to healthcare and recommendation systems, where direct interaction with the environment, and in particular unstructured exploration, can be prohibitively expensive. In practice, however, such algorithms tend to perform poorly, failing to match the behaviour of their online counterparts. A crucial difference between the two modalities, offline and online, can be described through the \emph{staleness} of the data. Simply stated, compared to the online regime, in the offline regime the agent cannot explore actions or states that were not collected during the construction of the dataset, even if those actions seem to lie along the optimal path according to the policy being learned. This tends to manifest as an \emph{overestimation} of the value of actions not present in the dataset, which is reinforced by the bootstrapping mechanism of learning, leading to policies that drift away from the region of the state space covered by the training dataset, where the learned policy is unpredictable. We propose a few simple modifications to mitigate some of this behaviour, including (a) relying on SARSA updates, which tend to reduce overestimation, (b) regularizing the Q-network to rank actions that are not in the dataset lower than those that are, and (c) providing a different parametrization of the output layer of the Q-network. We provide experimental results on standard RL benchmarks such as Atari, and on partially observable domains from DMLab, highlighting the efficiency of our approach in the offline regime.
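To make modifications (a) and (b) concrete, the following is a minimal sketch of how a SARSA-style target differs from the standard Q-learning target, and of a simple hinge-style ranking regularizer on the Q-values. The function names, the hinge formulation, and the margin parameter are illustrative assumptions, not the paper's exact losses.

```python
import numpy as np


def sarsa_target(reward, q_next, a_next, gamma=0.99):
    """SARSA bootstraps on the action actually logged in the dataset,
    avoiding the max over possibly-overestimated unseen actions."""
    return reward + gamma * q_next[a_next]


def q_learning_target(reward, q_next, gamma=0.99):
    """Q-learning bootstraps on max_a Q(s', a), which can propagate
    overestimated values of actions absent from the dataset."""
    return reward + gamma * np.max(q_next)


def ranking_regularizer(q_values, dataset_action, margin=1.0):
    """Hinge penalty (an illustrative choice) pushing the Q-values of
    out-of-dataset actions at least `margin` below that of the logged
    action, so unseen actions are ranked lower."""
    q_logged = q_values[dataset_action]
    q_others = np.delete(q_values, dataset_action)
    return np.sum(np.maximum(0.0, q_others - q_logged + margin))
```

When the logged next action is not the argmax of the (possibly overestimated) Q-values, the SARSA target is strictly smaller than the Q-learning target, which is the mechanism by which it reduces overestimation.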