Research Article

Human-level performance in 3D multiplayer games with population-based reinforcement learning

Science  31 May 2019:
Vol. 364, Issue 6443, pp. 859-865
DOI: 10.1126/science.aau6249
  • Fig. 1 CTF task and computational training framework.

    (A and B) Two example maps that have been sampled from the distribution of (A) outdoor maps and (B) indoor maps. Each agent in the game sees only its own first-person pixel view of the environment. (C) Training data are generated by playing thousands of CTF games in parallel on a diverse distribution of procedurally generated maps and (D) used to train the agents that played in each game with RL. (E) We trained a population of 30 different agents together, which provided a diverse set of teammates and opponents to play with and was also used to evolve the internal rewards and hyperparameters of the agents and the learning process. Each circle represents an agent in the population, with the size of the inner circle representing strength. Agents undergo computational evolution (represented as splitting), with descendants inheriting and mutating hyperparameters (represented as color). Gameplay footage and further exposition of the environment variability can be found in movie S1.
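
    A minimal sketch of the population-based training loop illustrated in (E), assuming a simplified population in which each agent is just a record of network weights, hyperparameters (including the 13 internal reward weights), and an Elo rating; the selection rule, mutation factor, and names such as pbt_step and mutate are illustrative assumptions, not the paper's implementation.

        import copy
        import random

        # Sketch of the evolution step described in Fig. 1E: 30 agents train
        # together; weaker agents periodically inherit the weights of stronger
        # ones and mutate the inherited hyperparameters and internal reward
        # weights. All numbers and names here are illustrative assumptions.

        POPULATION_SIZE = 30
        MUTATION_SCALE = 1.2  # multiplicative perturbation applied on inheritance


        def sample_hyperparameters(rng):
            """Sample an initial hyperparameter set for one agent."""
            return {
                "learning_rate": 10 ** rng.uniform(-5, -3),
                "kl_weight": 10 ** rng.uniform(-2, 0),
                "tau": rng.choice([2, 4, 8, 16]),  # slow/fast time-scale ratio
                "reward_weights": [rng.uniform(-1, 1) for _ in range(13)],  # one per game event
            }


        def mutate(hypers, rng):
            """Perturb inherited hyperparameters (the 'mutation' in the caption)."""
            child = copy.deepcopy(hypers)
            child["learning_rate"] *= rng.choice([1 / MUTATION_SCALE, MUTATION_SCALE])
            child["kl_weight"] *= rng.choice([1 / MUTATION_SCALE, MUTATION_SCALE])
            child["reward_weights"] = [
                w * rng.choice([1 / MUTATION_SCALE, MUTATION_SCALE])
                for w in child["reward_weights"]
            ]
            return child


        def pbt_step(population, rng):
            """One evolution step: the weakest agents copy a stronger agent."""
            ranked = sorted(population, key=lambda a: a["elo"], reverse=True)
            cutoff = len(ranked) // 5
            for loser in ranked[-cutoff:]:
                winner = rng.choice(ranked[:cutoff])
                loser["weights"] = copy.deepcopy(winner["weights"])  # inherit weights
                loser["hypers"] = mutate(winner["hypers"], rng)      # inherit + mutate hypers


        rng = random.Random(0)
        population = [
            {"weights": {}, "hypers": sample_hyperparameters(rng), "elo": rng.gauss(0, 100)}
            for _ in range(POPULATION_SIZE)
        ]
        pbt_step(population, rng)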

  • Fig. 2 Agent architecture and benchmarking.

    (A) How the agent processes a temporal sequence of observations xt from the environment. The model operates at two different time scales, faster at the bottom and slower by a factor of τ at the top. A stochastic vector-valued latent variable is sampled at the fast time scale from distribution Qt on the basis of observations xt. The action distribution πt is sampled conditional on the latent variable at each time step t. The latent variable is regularized by the slow-moving prior Pt, which helps capture long-range temporal correlations and promotes memory. The network parameters are updated by using RL according to the agent’s own internal reward signal rt, which is obtained from a learned transformation w of game points ρt. w is optimized for winning probability through PBT, another level of training performed at yet a slower time scale than that of RL. Detailed network architectures are described in fig. S11. (B) (Top) The Elo skill ratings of the FTW agent population throughout training (blue) together with those of the best baseline agents by using hand-tuned reward shaping (RS) (red) and game-winning reward signal only (black), compared with human and random agent reference points (violet, shaded region shows strength between 10th and 90th percentile). The FTW agent achieves a skill level considerably beyond strong human subjects, whereas the baseline agent’s skill plateaus below this level, and the baseline does not learn at all without reward shaping [evaluation procedure is provided in (28)]. (Bottom) The evolution of three hyperparameters of the FTW agent population: learning rate, Kullback-Leibler divergence (KL) weighting, and internal time scale τ, plotted as mean and standard deviation across the population.
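
    A minimal sketch of the two-time-scale latent structure described in (A), assuming a diagonal-Gaussian posterior Qt and prior Pt and replacing the recurrent cores of fig. S11 with placeholder linear maps on synthetic observations; the sizes, the slow-core update, and the value of τ are illustrative assumptions, not the paper's architecture.

        import numpy as np

        # Sketch: a fast core proposes a diagonal-Gaussian posterior Qt over the
        # latent at every step, a slow core (updated every TAU steps) provides the
        # prior Pt, and a KL(Qt || Pt) term regularizes the latent. The linear
        # "cores" below stand in for the recurrent networks of fig. S11.

        rng = np.random.default_rng(0)
        LATENT, OBS, TAU = 16, 32, 4  # latent size, observation size, slow/fast ratio

        W_fast = rng.normal(scale=0.1, size=(2 * LATENT, OBS + LATENT))  # posterior head
        W_slow = rng.normal(scale=0.1, size=(2 * LATENT, LATENT))        # prior head


        def diag_gauss_kl(mu_q, logvar_q, mu_p, logvar_p):
            """KL divergence between two diagonal Gaussians, summed over dimensions."""
            return 0.5 * np.sum(
                logvar_p - logvar_q
                + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
                - 1.0
            )


        slow_state = np.zeros(LATENT)
        z = np.zeros(LATENT)
        kl_terms = []

        for t in range(20):
            x_t = rng.normal(size=OBS)          # stand-in for the pixel observation
            if t % TAU == 0:                    # the slow core ticks every TAU steps
                slow_state = np.tanh(z)         # placeholder slow-core update
            mu_p, logvar_p = np.split(W_slow @ slow_state, 2)                # prior Pt
            mu_q, logvar_q = np.split(W_fast @ np.concatenate([x_t, z]), 2)  # posterior Qt
            z = mu_q + np.exp(0.5 * logvar_q) * rng.normal(size=LATENT)      # sample latent
            kl_terms.append(diag_gauss_kl(mu_q, logvar_q, mu_p, logvar_p))

        print("mean KL(Qt || Pt) over the rollout:", np.mean(kl_terms))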

  • Fig. 3 Knowledge representation and behavioral analysis.

    (A) The 2D t-SNE embedding of an FTW agent’s internal states during gameplay. Each point represents the internal state (hp, hq) at a particular point in the game and is colored according to the high-level game state at this time: the conjunction of (B) four basic CTF situations, each of which is colored distinctly. Color clusters form, showing that nearby regions in the internal representation of the agent correspond to the same high-level game state. (C) A visualization of the expected internal state arranged in a similarity-preserving topological embedding and colored according to activation (fig. S5). (D) Distributions of situation-conditional activations (each conditional distribution is colored gray and green) for particular single neurons that are distinctly selective for these CTF situations, together with the predictive accuracy of each neuron. (E) The true return of the agent’s internal reward signal and (F) the agent’s prediction, its value function (orange denotes high value, and purple denotes low value). (G) Regions (red) where the agent’s internal two-time scale representation diverges: the agent’s surprise, measured as the KL divergence between the agent’s slow- and fast-time scale representations (28). (H) The four-step temporal sequence of the high-level strategy “opponent base camping.” (I) Three automatically discovered high-level behaviors of agents and corresponding regions in the t-SNE embedding. (Right) Average occurrence per game of each behavior for the FTW agent, the FTW agent without temporal hierarchy (TH), the self-play with reward shaping agent, and human subjects (fig. S9).
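
    A minimal sketch, on random placeholder data, of two analyses referenced in this caption: the 2D t-SNE embedding of logged internal states in (A) and a single neuron's predictive accuracy for one binary CTF situation in (D), scored here as ROC AUC; the array shapes, the chosen neuron, and the situation label are illustrative assumptions rather than the paper's logged data.

        import numpy as np
        from sklearn.manifold import TSNE
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(0)
        T, HIDDEN = 500, 64                              # time steps, internal-state size

        internal_states = rng.normal(size=(T, HIDDEN))   # stand-in for (hp, hq) per step
        flag_is_held = rng.integers(0, 2, size=T)        # stand-in binary CTF situation

        # (A) 2D embedding of the internal states, colored offline by game state.
        embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
            internal_states
        )

        # (D) How well does one neuron's activation separate the two situations?
        neuron = internal_states[:, 7]                   # an arbitrary single unit
        auc = roc_auc_score(flag_is_held, neuron)
        print("embedding shape:", embedding.shape, "| single-neuron ROC AUC:", round(auc, 3))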

  • Fig. 4 Progression of agent during training.

    Shown is the development of knowledge representation and behaviors of the FTW agent over the training period of 450,000 games, segmented into three phases (movie S2). “Knowledge” indicates the percentage of game knowledge that is linearly decodable from the agent’s representation, measured by average scaled AUCROC across 200 features of game state. Some knowledge is compressed to single-neuron responses (Fig. 3A), whose emergence in training is shown at the top. “Relative internal reward magnitude” indicates the relative magnitude of the agent’s internal reward weights for 3 of the 13 events corresponding to game points ρ. Early in training, the agent puts large reward weight on picking up the opponent’s flag, whereas later, this weight is reduced, and the reward for tagging an opponent and the penalty when opponents capture a flag are increased by a factor of two. “Behavior probability” indicates the frequencies of occurrence for 3 of the 32 automatically discovered behavior clusters through training. Opponent base camping (red) is discovered early on, whereas teammate following (blue) becomes very prominent midway through training before mostly disappearing. The “home base defense” behavior (green) resurges in occurrence toward the end of training, which is in line with the agent’s increased internal penalty for opponent flag captures. “Memory usage” comprises heat maps of visitation frequencies for (left) locations in a particular map and (right) locations of the agent at which the top-10 most frequently read memories were written to memory, normalized by random reads from memory, indicating which locations the agent learned to recall. Recalled locations change considerably throughout training, eventually showing the agent recalling the entrances to both bases, presumably in order to perform more efficient navigation in unseen maps (fig. S7).
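
    A minimal sketch of the “knowledge” probe described above, assuming each game-state feature is binary, decoding uses a logistic-regression probe evaluated on held-out time steps, and the reported percentage rescales ROC AUC from chance (0.5) to perfect (1.0); the data, feature count, and scaling are illustrative assumptions, not the paper's exact procedure.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        T, HIDDEN, N_FEATURES = 2000, 64, 10   # the paper probes 200 features; 10 here for brevity

        states = rng.normal(size=(T, HIDDEN))                 # stand-in agent internal states
        features = rng.integers(0, 2, size=(T, N_FEATURES))   # stand-in binary game-state features

        aucs = []
        for j in range(N_FEATURES):
            X_tr, X_te, y_tr, y_te = train_test_split(
                states, features[:, j], test_size=0.25, random_state=j
            )
            probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # linear probe
            aucs.append(roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))

        # Scale mean AUC from [0.5, 1.0] (chance to perfect) onto [0, 100]%.
        knowledge = 100.0 * np.clip((np.mean(aucs) - 0.5) / 0.5, 0.0, 1.0)
        print(f"linearly decodable knowledge: {knowledge:.1f}%")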

Supplementary Materials

  • Human-level performance in 3D multiplayer games with population-based reinforcement learning

    Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, Thore Graepel

    Materials/Methods, Supplementary Text, Tables, Figures, and/or References

    • Supplementary Text
    • Figs. S1 to S12
    • References
    Pseudocode and Supplementary Data

    Images, Video, and Other Media

    Movie S1
    Movie S2
    Movie S3
    Movie S4
