Policy-Guided Heuristic Search with Guarantees


The employment of a policy and a value function for guiding search can be quite effective in adversarial problems, as demonstrated by AlphaGo and its successors, which used MCTS with PUCT as search algorithm.

While MCTS with PUCT can also be used to solve single-agent deterministic problems while guided by both a policy and a value function, it lacks guarantees on its search effort and it can be computationally inefficient in practice.
Approaches that combine the A* algorithm with a learned value function (known as a heuristic function) tend to work better in single-agent deterministic problems, but A* and its variants do not make use of a policy. Levin Tree Search (LTS) is guided by a policy and provides guarantees on the number of search steps that relate to the quality of the policy, but it does not make use of a heuristic function. In this work we introduce Policy-Guided Heuristic Search (PHS), a novel search algorithm that uses both a heuristic function and a policy and has theoretical guarantees on the number of search steps before reaching a solution that relate to both the quality of the heuristic and of the policy.

We show empirically on the Sliding-Tile puzzle and Sokoban domains that PHS is able to learn both a policy and a heuristic function and outperform A*, LTS, and other baselines in terms of number of problems solved and solution quality.