Bootstrap Statistical Inference for Off-Policy Evaluation


Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are less well understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), focusing on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE estimator for inferring the distribution of the OPE error. We show that this method is asymptotically efficient and consistent for OPE statistical inference. To overcome the computational bottleneck faced by most bootstrap methods, our method adopts a subsampling procedure that improves the runtime by an order of magnitude. We evaluate the method in classical RL environments on confidence interval estimation, variance estimation, and correlation estimation.
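To make the idea concrete, the following is a minimal illustrative sketch, not the algorithm analyzed in the paper. It assumes a tabular MDP, resamples whole episodes with replacement, reruns FQE on each resample, and forms a basic bootstrap confidence interval from the resulting errors. The function names (`simulate_episodes`, `fqe_value`, `bootstrap_fqe_ci`), the toy data generator, and the sqrt(m/n) rescaling used when subsampling m of n episodes (the usual m-out-of-n bootstrap convention) are all assumptions introduced here for illustration.

```python
import numpy as np

def simulate_episodes(rng, P, R, behavior, n_episodes, horizon):
    """Roll out a behavior policy in a toy tabular MDP (illustrative data only)."""
    n_states, n_actions = R.shape
    episodes = []
    for _ in range(n_episodes):
        s = rng.integers(n_states)
        ep = []
        for _ in range(horizon):
            a = rng.choice(n_actions, p=behavior[s])
            s2 = rng.choice(n_states, p=P[s, a])
            ep.append((s, a, R[s, a], s2))
            s = s2
        episodes.append(ep)
    return episodes

def fqe_value(episodes, pi, n_states, n_actions, gamma=0.9, n_iters=100):
    """Tabular FQE: iterate the empirical Bellman operator for target policy pi."""
    trans = np.array([t for ep in episodes for t in ep], dtype=float)
    s, a = trans[:, 0].astype(int), trans[:, 1].astype(int)
    r, s2 = trans[:, 2], trans[:, 3].astype(int)
    counts = np.zeros((n_states, n_actions))
    np.add.at(counts, (s, a), 1.0)
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Regression target: reward plus discounted value of pi at the next state.
        target = r + gamma * (pi[s2] * q[s2]).sum(axis=1)
        sums = np.zeros((n_states, n_actions))
        np.add.at(sums, (s, a), target)
        # Tabular least squares is the per-(s, a) mean; unvisited pairs stay at 0.
        q = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Average the value of pi over the observed initial states.
    s0 = np.array([ep[0][0] for ep in episodes], dtype=int)
    return float((pi[s0] * q[s0]).sum(axis=1).mean())

def bootstrap_fqe_ci(episodes, pi, n_states, n_actions, n_boot=200,
                     subsample=None, alpha=0.1, gamma=0.9, seed=0):
    """Episode-level bootstrap of FQE with optional m-out-of-n subsampling."""
    rng = np.random.default_rng(seed)
    n = len(episodes)
    m = n if subsample is None else subsample
    v_hat = fqe_value(episodes, pi, n_states, n_actions, gamma)
    errors = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(n, size=m)  # draw m episodes with replacement
        resampled = [episodes[i] for i in idx]
        v_b = fqe_value(resampled, pi, n_states, n_actions, gamma)
        # Assumed rescaling so that errors from m < n episodes match scale n.
        errors[b] = np.sqrt(m / n) * (v_b - v_hat)
    lo_q, hi_q = np.quantile(errors, [alpha / 2, 1 - alpha / 2])
    # Basic bootstrap interval: subtract the error quantiles from the estimate.
    return v_hat, (v_hat - hi_q, v_hat - lo_q), errors
```

The returned `errors` array is an empirical stand-in for the distribution of the OPE error, so its sample variance gives a variance estimate. Because each FQE run costs time roughly proportional to the number of transitions, choosing `subsample` smaller than the number of episodes shortens every bootstrap replicate in this sketch.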