Report

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

Science, 07 Dec 2018:
Vol. 362, Issue 6419, pp. 1140-1144
DOI: 10.1126/science.aar6404
  • Fig. 1 Training AlphaZero for 700,000 steps.

    Elo ratings were computed from games between the different players, with each player given 1 s per move; a minimal sketch of the standard Elo model appears after the figure captions. (A) Performance of AlphaZero in chess compared with the 2016 TCEC world champion program Stockfish. (B) Performance of AlphaZero in shogi compared with the 2017 CSA world champion program Elmo. (C) Performance of AlphaZero in Go compared with AlphaGo Lee and AlphaGo Zero (20 blocks, trained for 3 days).

  • Fig. 2 Comparison with specialized programs.

    (A) Tournament evaluation of AlphaZero in chess, shogi, and Go in matches against, respectively, Stockfish, Elmo, and the previously published version of AlphaGo Zero (AG0) that was trained for 3 days. In the top bar, AlphaZero plays white; in the bottom bar, AlphaZero plays black. Each bar shows the results from AlphaZero’s perspective: win (W; green), draw (D; gray), or loss (L; red). (B) Scalability of AlphaZero with thinking time compared with Stockfish and Elmo. Stockfish and Elmo always receive full time (3 hours per game plus 15 s per move); time for AlphaZero is scaled down as indicated. (C) Extra evaluations of AlphaZero in chess against the most recent version of Stockfish at the time of writing (27) and against Stockfish with a strong opening book (28). Extra evaluations of AlphaZero in shogi were carried out against another strong shogi program, Aperyqhapaq (29), at full time controls and against Elmo under 2017 CSA world championship time controls (10 min per game and 10 s per move). (D) Average result of chess matches starting from different opening positions, either common human positions (see also Fig. 3) or the 2016 TCEC world championship opening positions (see also fig. S4), and average result of shogi matches starting from common human positions (see also Fig. 3). CSA world championship games start from the initial board position. Match conditions are summarized in tables S8 and S9.

  • Fig. 3 Matches starting from the most popular human openings.

    AlphaZero plays against (A) Stockfish in chess and (B) Elmo in shogi. In the left bar, AlphaZero plays white, starting from the given position; in the right bar, AlphaZero plays black. Each bar shows the results from AlphaZero’s perspective: win (green), draw (gray), or loss (red). The percentage frequency of self-play training games in which this opening was selected by AlphaZero is plotted against the duration of training, in hours.

  • Fig. 4 AlphaZero’s search procedure.

    The search is illustrated for a position (inset) from game 1 (table S6) between AlphaZero (white) and Stockfish (black) after 29...Qf8. The internal state of AlphaZero's MCTS is summarized after 10^2, ..., 10^6 simulations. Each summary shows the 10 most visited states. The estimated value is shown in each state, from white's perspective, scaled to the range [0, 100]. The visit count of each state, relative to the root state of that tree, is proportional to the thickness of the circle's border. AlphaZero considers 30. c6 but eventually plays 30. d5. (A minimal sketch of the selection rule that drives such a search appears below.)
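
Fig. 4's summaries are built from two per-node search statistics: the visit count N(s, a) and the mean value estimate Q(s, a). The sketch below is a minimal, self-contained Python illustration of a PUCT-style child-selection step of the kind used in AlphaZero-style MCTS; the Node layout, the c_puct constant, and all function names are illustrative assumptions, not the paper's implementation.

    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        prior: float                 # P(s, a) from the policy network (assumed given)
        visit_count: int = 0         # N(s, a)
        value_sum: float = 0.0       # sum of values backed up through this node
        children: dict = field(default_factory=dict)

        def q(self):
            # Mean action value Q(s, a); defined as 0 for unvisited nodes.
            return self.value_sum / self.visit_count if self.visit_count else 0.0

    def select_child(node, c_puct=1.25):
        # PUCT selection: argmax_a [ Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a)) ].
        # c_puct is an illustrative constant balancing exploitation and exploration.
        parent_visits = sum(child.visit_count for child in node.children.values())
        best_move, best_child, best_score = None, None, -math.inf
        for move, child in node.children.items():
            u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visit_count)
            score = child.q() + u
            if score > best_score:
                best_move, best_child, best_score = move, child, score
        return best_move, best_child

Repeatedly descending with select_child from the root, expanding a leaf with network priors, and backing the evaluated value up the visited path produces the kind of visit counts and value estimates that Fig. 4 visualizes.

Figures 1 and 2 report playing strength as Elo ratings estimated from game outcomes. As a point of reference only, the following minimal Python sketch shows the standard logistic Elo expected-score formula and a single-game rating update; the K-factor, function names, and incremental update scheme are illustrative assumptions, not the estimator actually used in the paper.

    def elo_expected_score(r_a, r_b):
        # Expected score of player A vs. player B under the standard Elo model:
        # E_A = 1 / (1 + 10^((R_B - R_A) / 400)).
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, score_a, k=16.0):
        # Incremental update after one game; score_a is 1.0 (win), 0.5 (draw),
        # or 0.0 (loss) from A's perspective. k is an assumed K-factor.
        e_a = elo_expected_score(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

    # Example: a 100-point rating advantage corresponds to an expected score of ~0.64.
    print(round(elo_expected_score(2900.0, 2800.0), 2))  # 0.64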

Supplementary Materials

  • A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis

    Materials/Methods, Supplementary Text, Tables, Figures, and/or References

    • Materials and Methods
    • Figs. S1 to S4
    • Tables S1 to S9
    • References
    • Data S1
