Learning in two-player zero-sum partially observable Markov games with perfect recall


We consider the problem of exploration in two-player zero-sum games. More precisely, we aim at finding a Nash equilibrium of a tabular, episodic, partially observable Markov game (POMG) under the perfect-recall assumption, through self-play, where the only feedback available to the agents consists of realizations of the game. In particular, the dynamics of the Markov game are not known: they can only be accessed by sampling, i.e., by interacting with a game simulator. For this learning setting, we provide an optimistic model-based algorithm with a regret bound of order $O(\sqrt{T})$, where $T$ is the number of played games.