Reward is enough for convex MDPs
We show that the maximization of a non-stationary reward by a Reinforcement Learning agent is enough to solve any convex MDP -- a sequential decision making problem that is convex in the state occupancy of the policy. Convex MDPs include the standard RL problem, where the objective function is the linear product between the extrinsic reward and the state occupancy, but also cover many other situations including apprenticeship learning, variational intrinsic control, constrained mdps, and pure exploration. We reformulate the convex MDP as a min-max game between the policy and cost (negative reward) players using Fenchel duality. The average of the policies produced by an RL agent that maximizes the non-stationary reward produced by the cost player converges to an optimal solution to the convex MDP.