Unifying gradient estimations for meta RL via off-policy evaluation


Model-agnostic meta reinforcement learning requires estimating the Hessian matrix of value functions. This is challenging from an implementation perspective, as repeatedly differentiating policy gradient estimates generally leads to biased Hessian estimates. In this work, we provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation. Our framework interprets a number of prior approaches as special cases and elucidates the bias-variance trade-off of Hessian estimates. This framework also opens the door to a new family of estimators, which can be easily implemented with auto-differentiation libraries, and can lead to performance gains in practice.
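To make the bias concrete, here is a minimal, hypothetical sketch (not from the paper): a one-step two-armed bandit with a sigmoid policy, where all expectations and derivatives can be written out analytically. Naively taking the second derivative of the score-function surrogate E[r log pi(a)] keeps only the E[r d²log pi] term and drops the squared-score term, so its expectation disagrees with the true Hessian of the value.

```python
import math

# Hypothetical toy setup (not from the paper): a one-step bandit with two
# actions and a sigmoid policy pi(a=0 | theta) = p, pi(a=1 | theta) = 1 - p.
# Rewards: r(0) = 1, r(1) = 0, so the true value is J(theta) = p.

def p(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def true_hessian(theta):
    # J(theta) = p, so J''(theta) = p(1-p)(1-2p), computed analytically.
    q = p(theta)
    return q * (1 - q) * (1 - 2 * q)

def naive_hessian(theta):
    # Expectation of the "differentiate the surrogate twice" estimate:
    # E[r * d^2 log pi / dtheta^2]. The (d log pi / dtheta)^2 term is
    # missing, which is the source of the bias the abstract refers to.
    q = p(theta)
    d2_logp_a0 = -q * (1 - q)      # d^2/dtheta^2 of log pi(a=0)
    return q * 1.0 * d2_logp_a0    # only a=0 has nonzero reward

def corrected_hessian(theta):
    # Unbiased expectation: E[r * ((d log pi)^2 + d^2 log pi)];
    # the added squared-score term restores the missing piece.
    q = p(theta)
    d_logp_a0 = 1 - q
    d2_logp_a0 = -q * (1 - q)
    return q * 1.0 * (d_logp_a0 ** 2 + d2_logp_a0)

theta = 0.0
print(true_hessian(theta))       # 0.0
print(naive_hessian(theta))      # -0.125 (biased)
print(corrected_hessian(theta))  # 0.0 (matches the true Hessian)
```

At theta = 0 the true Hessian of J is exactly zero, yet the naive estimate's expectation is -0.125, illustrating why repeated differentiation of gradient estimates needs the kind of correction the framework provides.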