Active offline policy selection


When agents are trained with offline reinforcement learning (ORL), off-policy policy evaluation (OPE) can be used to select the best agent. However, OPE is challenging and its estimates are not always precise. In many applications it is realistic to assume that interactions with the real environment are too expensive for training a policy, yet it is still feasible to evaluate a few selected policies in that environment. Given such an opportunity, we can hope to improve on the OPE estimates while keeping the budget of environment interactions small. This setting is relevant, for example, in robotics and language. We refer to this problem as active offline policy selection (active-ops). To use the limited interactions wisely, we employ a Bayesian optimisation approach that starts from the OPE values and models the dependency between policies through the actions that they take. We test this approach on several environments with diverse ORL policies.
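
To make the idea concrete, the following is a minimal sketch of one way such a selection loop could look; it is not the implementation evaluated in this work. It assumes a Gaussian process over policy values whose prior mean is the OPE estimate, an RBF kernel on the mean squared difference between the actions that policies take on a shared set of states, and an upper-confidence-bound rule for choosing which policy to run next. The names `policy_kernel`, `gp_posterior`, `active_policy_selection`, the callback `run_episode`, and the parameters `length_scale`, `noise_var`, and `beta` are all hypothetical and chosen for illustration.

```python
import numpy as np


def policy_kernel(actions, length_scale=1.0):
    """RBF kernel between policies, built from the actions they take.

    actions: array of shape (n_policies, n_states, action_dim), the action
    of each policy on a shared set of states (an assumption of this sketch).
    """
    flat = actions.reshape(actions.shape[0], -1)
    # Mean squared action difference between every pair of policies.
    sq_dist = ((flat[:, None, :] - flat[None, :, :]) ** 2).mean(axis=-1)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))


def gp_posterior(prior_mean, kernel, observed_idx, observed_returns, noise_var):
    """Posterior mean and variance of every policy's value."""
    if len(observed_idx) == 0:
        return prior_mean.copy(), np.diag(kernel).copy()
    idx = np.asarray(observed_idx)
    y = np.asarray(observed_returns, dtype=float)
    # Covariance of the noisy observations (repeated policies are allowed).
    k_oo = kernel[np.ix_(idx, idx)] + noise_var * np.eye(len(idx))
    k_ao = kernel[:, idx]
    mean = prior_mean + k_ao @ np.linalg.solve(k_oo, y - prior_mean[idx])
    reduction = np.einsum("ij,ji->i", k_ao, np.linalg.solve(k_oo, k_ao.T))
    var = np.diag(kernel) - reduction
    return mean, np.maximum(var, 0.0)


def active_policy_selection(ope_estimates, actions, run_episode, budget,
                            noise_var=1.0, beta=2.0):
    """Spend `budget` episodes, then return the index of the best policy.

    run_episode(k) is a hypothetical callback that executes policy k in the
    environment once and returns its episodic return.
    """
    kernel = policy_kernel(actions)
    prior_mean = np.asarray(ope_estimates, dtype=float)  # start from OPE
    observed_idx, observed_returns = [], []
    for _ in range(budget):
        mean, var = gp_posterior(prior_mean, kernel, observed_idx,
                                 observed_returns, noise_var)
        # Upper confidence bound: favour promising or uncertain policies.
        k = int(np.argmax(mean + beta * np.sqrt(var)))
        observed_idx.append(k)
        observed_returns.append(run_episode(k))
    mean, _ = gp_posterior(prior_mean, kernel, observed_idx,
                           observed_returns, noise_var)
    return int(np.argmax(mean)), mean
```

Because the kernel ties together policies that act similarly, one episode with a given policy also updates the value estimates of its neighbours, which is what lets a small interaction budget cover many candidates. The kernel, noise model, and acquisition rule here are placeholders and can be swapped independently of one another.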