NeurIPS workshop - Semi-supervised reward learning for offline reinforcement learning


Offline reinforcement learning (RL) promises to make learning control policies practical in real-life applications by eliminating the need for online data collection and exploration. However, offline RL still needs a reward signal, which is often not readily available in practice. One solution to this problem is to learn the reward function. In this paper we ask a) what type and amount of supervision is needed to learn reward functions efficiently, and b) how to determine whether a reward function is accurate enough to enable fast learning of RL policies. To study these questions, we discuss different types of supervision for reward learning (i.e., timestep annotations or demonstrations) and ways to use them efficiently. In particular, we propose a semi-supervised learning algorithm for inferring reward functions. It combines ideas from self-training, co-training and multiple-instance learning to improve the agent’s performance and stability. We further investigate how reward model quality translates into the quality of the final policies. Our experiments with a simulated robotic arm demonstrate that good control policies can be learnt even with a limited amount of noisy supervision when semi-supervised learning techniques are used.
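
To make the self-training idea mentioned above concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes a binary per-timestep reward, a logistic-regression reward model standing in for whatever model is actually used, and synthetic 2-D features in place of real observations. It covers self-training only; co-training and multiple-instance learning are not shown. All names (`fit_logistic`, `predict_reward`, `self_train`, the 0.9 confidence threshold) are hypothetical and chosen for illustration.

```python
import numpy as np

def fit_logistic(x, y, steps=500, lr=0.5):
    """Fit a logistic-regression reward model with plain gradient descent."""
    xb = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-xb @ w))      # predicted reward probability
        w -= lr * xb.T @ (p - y) / len(y)      # gradient of the mean log-loss
    return w

def predict_reward(w, x):
    """Return the predicted probability that each timestep is rewarding."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    return 1.0 / (1.0 + np.exp(-xb @ w))

def self_train(x_lab, y_lab, x_unlab, rounds=3, threshold=0.9):
    """Self-training: pseudo-label confident unlabelled timesteps, then refit."""
    w = fit_logistic(x_lab, y_lab)
    for _ in range(rounds):
        p = predict_reward(w, x_unlab)
        confident = (p >= threshold) | (p <= 1.0 - threshold)
        if not confident.any():
            break
        pseudo = (p[confident] >= 0.5).astype(float)
        # Pseudo-labels are recomputed every round from the current model
        # and combined with the original human-labelled timesteps.
        x_train = np.vstack([x_lab, x_unlab[confident]])
        y_train = np.concatenate([y_lab, pseudo])
        w = fit_logistic(x_train, y_train)
    return w

# Synthetic stand-in data (an assumption): reward is 1 when x0 + x1 > 0.
rng = np.random.default_rng(0)
x_all = rng.normal(size=(400, 2))
y_all = (x_all.sum(axis=1) > 0).astype(float)

# Only ten timesteps carry annotations: five rewarding, five not.
lab_idx = np.concatenate([np.flatnonzero(y_all == 1)[:5],
                          np.flatnonzero(y_all == 0)[:5]])
is_lab = np.zeros(len(x_all), dtype=bool)
is_lab[lab_idx] = True

x_lab, y_lab = x_all[is_lab], y_all[is_lab]
x_unlab, y_unlab = x_all[~is_lab], y_all[~is_lab]

w = self_train(x_lab, y_lab, x_unlab)
accuracy = float(np.mean((predict_reward(w, x_unlab) >= 0.5) == (y_unlab == 1.0)))
print(f"reward-model accuracy on unlabelled timesteps: {accuracy:.2f}")
```

In this sketch the confidence threshold controls the trade-off between adding more pseudo-labelled timesteps and admitting noisier ones; the held-out accuracy is one simple proxy for reward model quality, which the paper relates to the quality of the final policy.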