Recently, Approximate Policy Iteration (API) algorithms have achieved superhuman proficiency in two-player zero-sum games such as Go, Chess, and Shogi without human data. These API algorithms iterate between two policies: a slow policy (tree search) and a fast policy (a neural network). In these two-player games, a reward is always received at the end of the game. However, the Rubik's Cube has only a single solved state, and episodes are not guaranteed to terminate. This poses a major problem for these API algorithms, since they rely on the reward received at the end of the game. We introduce Autodidactic Iteration: an API algorithm that overcomes the problem of sparse rewards by training on a distribution of states that allows the reward to propagate from the goal state to states farther away. Autodidactic Iteration is able to learn how to solve the Rubik's Cube without relying on human data. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves, less than or equal to solvers that employ human domain knowledge.

Forest Agostinelli, Stephen McAleer, Alexander Shmakov, Pierre Baldi. Solving the Rubik's cube with deep reinforcement learning and search. Nature Machine Intelligence, 2019; DOI: 10.1038/s42256-019-0070-z
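The key idea in the abstract is the training-state distribution: instead of waiting for a sparse terminal reward, states are generated by scrambling backwards from the solved state, so every training sample lies a known number of moves from the goal and value information can propagate outward. The sketch below illustrates that sampling scheme only; the move set, the `apply_move` callback, and the 1/depth sample weighting are simplified assumptions for illustration, not the authors' exact implementation.

```python
import random

# Placeholder for the 12 face turns of a 3x3x3 cube (assumption: the
# real state/move encoding is more involved).
MOVES = list(range(12))


def scramble_from_goal(apply_move, solved_state, max_depth):
    """Walk one random scramble trajectory away from the solved state.

    Returns a list of (state, depth) pairs, where depth is the number
    of scramble moves applied so far. Because states are generated by
    moving *away* from the goal, each sample has a known upper bound
    on its distance to the solved state.
    """
    state = solved_state
    samples = []
    for depth in range(1, max_depth + 1):
        state = apply_move(state, random.choice(MOVES))
        samples.append((state, depth))
    return samples


def training_batch(apply_move, solved_state, trajectories, max_depth):
    """Build a batch of (state, weight) samples.

    Weighting each sample by 1/depth (an assumption in the spirit of
    the paper) makes states near the goal dominate early training, so
    accurate values close to the solved state can propagate outward.
    """
    batch = []
    for _ in range(trajectories):
        for state, depth in scramble_from_goal(
            apply_move, solved_state, max_depth
        ):
            batch.append((state, 1.0 / depth))
    return batch
```

With a toy `apply_move` that just records the move sequence, `training_batch(lambda s, m: s + (m,), (), trajectories=3, max_depth=5)` yields 15 weighted samples whose weights decrease with scramble depth.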
7/24/2019
Solving the Rubik's Cube with Approximate Policy Iteration
Stephen McAleer, Forest Agostinelli, Alexander Shmakov, Pierre Baldi, Solving the Rubik's Cube with Approximate Policy Iteration, ICLR, 2019.