Research Article

# Superhuman AI for multiplayer poker

See allHide authors and affiliations

Science  30 Aug 2019:
Vol. 365, Issue 6456, pp. 885-890
DOI: 10.1126/science.aay2400

## AI now masters six-player poker

Computer programs have shown superiority over humans in two-player games such as chess, Go, and heads-up, no-limit Texas hold'em poker. However, poker games usually include six players—a much trickier challenge for artificial intelligence than the two-player variant. Brown and Sandholm developed a program, dubbed Pluribus, that learned how to play six-player no-limit Texas hold'em by playing against five copies of itself (see the Perspective by Blair and Saffidine). When pitted against five elite professional poker players, or with five copies of Pluribus playing against one professional, the computer performed significantly better than humans over the course of 10,000 hands of poker.

Science, this issue p. 885; see also p. 864

## Abstract

In recent years there have been great strides in artificial intelligence (AI), with games often serving as challenge problems, benchmarks, and milestones for progress. Poker has served for decades as such a challenge problem. Past successes in such benchmarks, including poker, have been limited to two-player games. However, poker in particular is traditionally played with more than two players. Multiplayer games present fundamental additional issues beyond those in two-player games, and multiplayer poker is a recognized AI milestone. In this paper we present Pluribus, an AI that we show is stronger than top human professionals in six-player no-limit Texas hold’em poker, the most popular form of poker played by humans.

Poker has served as a challenge problem for the fields of artificial intelligence (AI) and game theory for decades (1). In fact, the foundational papers on game theory used poker to illustrate their concepts (2, 3). The reason for this choice is simple: No other popular recreational game captures the challenges of hidden information as effectively and as elegantly as poker. Although poker has been useful as a benchmark for new AI and game-theoretic techniques, the challenge of hidden information in strategic settings is not limited to recreational games. The equilibrium concepts of von Neumann and Nash have been applied to many real-world challenges such as auctions, cybersecurity, and pricing.

The past two decades have witnessed rapid progress in the ability of AI systems to play increasingly complex forms of poker (46). However, all prior breakthroughs have been limited to settings involving only two players. Developing a superhuman AI for multiplayer poker was the widely recognized main remaining milestone. In this paper we describe Pluribus, an AI capable of defeating elite human professionals in six-player no-limit Texas hold’em poker, the most commonly played poker format in the world.

## Theoretical and practical challenges of multiplayer games

AI systems have reached superhuman performance in games such as checkers (7), chess (8), two-player limit poker (4), Go (9), and two-player no-limit poker (6). All of these involve only two players and are zero-sum games (meaning that whatever one player wins, the other player loses). Every one of those superhuman AI systems was generated by attempting to approximate a Nash equilibrium strategy rather than by, for example, trying to detect and exploit weaknesses in the opponent. A Nash equilibrium is a list of strategies, one for each player, in which no player can improve by deviating to a different strategy. Nash equilibria have been proven to exist in all finite games—and many infinite games—though finding an equilibrium may be difficult.

Two-player zero-sum games are a special class of games in which Nash equilibria also have an extremely useful additional property: Any player who chooses to use a Nash equilibrium is guaranteed to not lose in expectation no matter what the opponent does (as long as one side does not have an intrinsic advantage under the game rules, or the players alternate sides). In other words, a Nash equilibrium strategy is unbeatable in two-player zero-sum games that satisfy the above criteria. For this reason, to “solve” a two-player zero-sum game means to find an exact Nash equilibrium. For example, the Nash equilibrium strategy for Rock-Paper-Scissors is to randomly pick Rock, Paper, or Scissors with equal probability. Against such a strategy, the best that an opponent can do in expectation is tie (10). In this simple case, playing the Nash equilibrium also guarantees that the player will not win in expectation. However, in more complex games, even determining how to tie against a Nash equilibrium may be difficult; if the opponent ever chooses suboptimal actions, then playing the Nash equilibrium will indeed result in victory in expectation.

In principle, playing the Nash equilibrium can be combined with opponent exploitation by initially playing the equilibrium strategy and then over time shifting to a strategy that exploits the opponent’s observed weaknesses (for example, by switching to always playing Paper against an opponent that always plays Rock) (11). However, except in certain restricted ways (12), shifting to an exploitative nonequilibrium strategy opens oneself up to exploitation because the opponent could also change strategies at any moment. Additionally, existing techniques for opponent exploitation require too many samples to be competitive with human ability outside of small games. Pluribus plays a fixed strategy that does not adapt to the observed tendencies of the opponents.

Although a Nash equilibrium strategy is guaranteed to exist in any finite game, efficient algorithms for finding one are only proven to exist for special classes of games, among which two-player zero-sum games are the most prominent. No polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have sweeping surprising implications in computational complexity theory (13, 14). Finding a Nash equilibrium in zero-sum games with three or more players is at least as hard (because a dummy player can be added to the two-player game to make it a three-player zero-sum game). Even approximating a Nash equilibrium is hard (except in special cases) in theory (15), and in games with more than two players, even the best complete algorithm can only address games with a handful of possible strategies per player (16). Moreover, even if a Nash equilibrium could be computed efficiently in a game with more than two players, it is not clear that playing such an equilibrium strategy would be wise. If each player in such a game independently computes and plays a Nash equilibrium, the list of strategies that they play (one strategy per player) may not be a Nash equilibrium and players might have an incentive to deviate to a different strategy. One example of this is the Lemonade Stand Game (17), illustrated in Fig. 1, in which each player simultaneously picks a point on a ring and wants to be as far away as possible from any other player. The Nash equilibrium is for all players to be spaced uniformly along the ring, but there are infinitely many ways this can be accomplished and therefore infinitely many Nash equilibria. If each player independently computes one of those equilibria, the joint strategy is unlikely to result in all players being spaced uniformly along the ring. Two-player zero-sum games are a special case where even if the players independently compute and select Nash equilibria, the list of strategies is still a Nash equilibrium.

The shortcomings of Nash equilibria outside of two-player zero-sum games, and the failure of any other game-theoretic solution concept to convincingly overcome them, have raised the question of what the right goal should even be in such games. In the case of six-player poker, we take the viewpoint that our goal should not be a specific game-theoretic solution concept but rather to create an AI that empirically consistently defeats human opponents, including elite human professionals.

The algorithms that we used to construct Pluribus, discussed in the next two sections, are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, we observe that Pluribus plays a strong strategy in multiplayer poker that is capable of consistently defeating elite human professionals. This shows that even though the techniques do not have known strong theoretical guarantees on performance outside of the two-player zero-sum setting, they are nevertheless capable of producing superhuman strategies in a wider class of strategic settings.

## Description of Pluribus

The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data of human or prior AI play used as input. The AI starts from scratch by playing randomly and gradually improves as it determines which actions, and which probability distribution over those actions, lead to better outcomes against earlier versions of its strategy. Forms of self-play have previously been used to generate powerful AIs in two-player zero-sum games such as backgammon (18), Go (9, 19), Dota 2 (20), StarCraft 2 (21), and two-player poker (46), though the precise algorithms that were used have varied widely. Although it is easy to construct toy games with more than two players in which commonly used self-play algorithms fail to converge to a meaningful solution (22), in practice self-play has nevertheless been shown to do reasonably well in some games with more than two players (23).

Pluribus’s self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. Then during actual play against opponents, Pluribus improves upon the blueprint strategy by searching for a better strategy in real time for the situations in which it finds itself during the game. In subsections below, we discuss both of those phases in detail, but first we discuss abstraction, forms of which are used in both phases to make them scalable.

### Abstraction for large imperfect-information games

There are far too many decision points in no-limit Texas hold’em to reason about individually. To reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction (24, 25). After abstraction, the bucketed decision points are treated as identical. We use two kinds of abstraction in Pluribus: action abstraction and information abstraction.

### Depth-limited search in imperfect-information games

The blueprint strategy for the entire game is necessarily coarse-grained owing to the size and complexity of no-limit Texas hold’em. Pluribus only plays according to this blueprint strategy in the first betting round (of four), where the number of decision points is small enough that the blueprint strategy can afford to not use information abstraction and have a lot of actions in the action abstraction. After the first round (and even in the first round if an opponent chooses a bet size that is sufficiently different from the sizes in the blueprint action abstraction), Pluribus instead conducts real-time search to determine a better, finer-grained strategy for the current situation it is in. For opponent bets on the first round that are slightly off the tree, Pluribus rounds the bet to a nearby on-tree size [using the pseudoharmonic mapping (39)] and proceeds to play according to the blueprint as if the opponent had used the latter bet size.

Real-time search has been necessary for achieving superhuman performance in many perfect-information games, including backgammon (18), chess (8), and Go (9, 19). For example, when determining their next move, chess AIs commonly look some number of moves ahead until a leaf node is reached at the depth limit of the algorithm’s lookahead. An evaluation function then estimates the value of the board configuration at the leaf node if both players were to play a Nash equilibrium from that point forward. In principle, if an AI could accurately calculate the value of every leaf node (e.g., win, draw, or loss), this algorithm would choose the optimal next move.

However, search as has been done in perfect-information games is fundamentally broken when applied to imperfect-information games. For example, consider a sequential form of Rock-Paper-Scissors, illustrated in Fig. 3, in which player 1 acts first but does not reveal her action to player 2, followed by player 2 acting. If player 1 were to conduct search that looks just one move ahead, every one of her actions would appear to lead to a leaf node with zero value. After all, if player 2 plays the Nash equilibrium strategy of choosing each action with 13 probability, the value to player 1 of choosing Rock is zero, as is the value of choosing Scissors. So player 1’s search algorithm could choose to always play Rock because, given the values of the leaf nodes, this appears to be equally good as any other strategy.

Indeed, if player 2’s strategy were fixed to always playing the Nash equilibrium, always playing Rock would be an optimal player 1 strategy. However, in reality, player 2 could adjust to a strategy of always playing Paper. In that case, the value of always playing Rock would actually be 1.

This example illustrates that in imperfect-information subgames (the part of the game in which search is being conducted) (40), leaf nodes do not have fixed values. Instead, their values depend on the strategy that the searcher chooses in the subgame (that is, the probabilities that the searcher assigns to his actions in the subgame). In principle, this could be addressed by having the value of a subgame leaf node be a function of the searcher’s strategy in the subgame, but this is impractical in large games. One alternative is to make the value of a leaf node conditional only on the belief distribution of both players at that point in the game. This was used to generate the two-player poker AI DeepStack (5). However, this option is extremely expensive because it requires one to solve huge numbers of subgames that are conditional on beliefs. It becomes even more expensive as the amount of hidden information or the number of players grows. The two-player poker AI Libratus sidestepped this issue by only doing real-time search when the remaining game was short enough that the depth limit would extend to the end of the game (6). However, as the number of players grows, always solving to the end of the game also becomes computationally prohibitive.

Pluribus instead uses a modified form of an approach that we recently designed—previously only for two-player zero-sum games (41)—in which the searcher explicitly considers that any or all players may shift to different strategies beyond the leaf nodes of a subgame. Specifically, rather than assuming that all players play according to a single fixed strategy beyond the leaf nodes (which results in the leaf nodes having a single fixed value), we instead assume that each player may choose between k different strategies, specialized to each player, to play for the remainder of the game when a leaf node is reached. In the experiments in this paper, k=4. One of the four continuation strategies that we use in the experiments is the precomputed blueprint strategy; another is a modified form of the blueprint strategy in which the strategy is biased toward folding; another is the blueprint strategy biased toward calling; and the final option is the blueprint strategy biased toward raising. This technique results in the searcher finding a strategy that is more balanced because choosing an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).

Another major challenge of search in imperfect-information games is that a player’s optimal strategy for a particular situation depends on what the player’s strategy is for every situation the player could be in from the perspective of her opponents. For example, suppose the player is holding the best possible hand. Betting in this situation could be a good action. But if the player bets in this situation only when holding the best possible hand, then the opponents would know that they should always fold in response.

To cope with this challenge, Pluribus keeps track of the probability it would have reached the current situation with each possible hand according to its strategy. Regardless of which hand Pluribus is actually holding, it will first calculate how it would act with every possible hand, being careful to balance its strategy across all the hands so as to remain unpredictable to the opponent. Once this balanced strategy across all hands is computed, Pluribus then executes an action for the hand it is actually holding. The structure of a depth-limited imperfect-information subgame as used in Pluribus is shown in Fig. 4.

Pluribus used one of two different forms of CFR to compute a strategy in the subgame, depending on the size of the subgame and the part of the game. If the subgame is relatively large or it is early in the game, then Monte Carlo Linear CFR is used just as it was for the blueprint strategy computation. Otherwise, Pluribus uses an optimized vector-based form of Linear CFR (38) that samples only chance events (such as board cards) (42).

When playing, Pluribus runs on two Intel Haswell E5-2695 v3 CPUs and uses less than 128 GB of memory. For comparison, AlphaGo used 1920 CPUs and 280 GPUs for real-time search in its 2016 matches against top Go professional Lee Sedol (43), Deep Blue used 480 custom-designed chips in its 1997 matches against top chess professional Garry Kasparov (8), and Libratus used 100 CPUs in its 2017 matches against top professionals in two-player poker (6). The amount of time that Pluribus takes to conduct search on a single subgame varies between 1 and 33 s, depending on the particular situation. On average, Pluribus plays at a rate of 20 s per hand when playing against copies of itself in six-player poker. This is roughly twice as fast as professional humans tend to play.

## Experimental evaluation

We evaluated Pluribus against elite human professionals in two formats: five human professionals playing with one copy of Pluribus (5H+1AI), and one human professional playing with five copies of Pluribus (1H+5AI). Each human participant has won more than $1 million playing poker professionally. Performance was measured by using the standard metric in this field of AI, milli big blinds per game (mbb/game). This measures how many big blinds (the initial money the second player must put into the pot) were won on average per thousand hands of poker. In all experiments, we used the variance-reduction technique AIVAT (44) to reduce the luck factor in the game (45) and measured statistical significance at the 95% confidence level using a one-tailed t test to determine whether Pluribus is profitable. The human participants in the 5H+1AI experiment were Jimmy Chou, Seth Davies, Michael Gagliano, Anthony Gregg, Dong Kim, Jason Les, Linus Loeliger, Daniel McAulay, Greg Merson, Nicholas Petrangelo, Sean Ruane, Trevor Savage, and Jacob Toole. In this experiment, 10,000 hands of poker were played over 12 days. Each day, five volunteers from the pool of professionals were selected to participate on the basis of availability. The participants were not told who else was participating in the experiment. Instead, each participant was assigned an alias that remained constant throughout the experiment. The alias of each player in each game was known, so that players could track the tendencies of each player throughout the experiment.$50,000 was divided among the human participants on the basis of their performance to incentivize them to play their best. Each player was guaranteed a minimum of $0.40 per hand for participating, but this could increase to as much as$1.60 per hand on the basis of performance. After applying AIVAT, Pluribus won an average of 48 mbb/game (with a standard error of 25 mbb/game). This is considered a very high win rate in six-player no-limit Texas hold’em poker, especially against a collection of elite professionals, and implies that Pluribus is stronger than the human opponents. Pluribus was determined to be profitable with a p value of 0.028. The performance of Pluribus over the course of the experiment is shown in Fig. 5. (Owing to the extremely high variance in no-limit poker and the impossibility of applying AIVAT to human players, the win rate of individual human participants could not be determined with statistical significance).

The human participants in the 1H+5AI experiment were Chris “Jesus” Ferguson and Darren Elias. Each of the two humans separately played 5000 hands of poker against five copies of Pluribus. Pluribus does not adapt its strategy to its opponents and does not know the identity of its opponents, so the copies of Pluribus could not intentionally collude against the human player. To incentivize strong play, we offered each human $2000 for participation and an additional$2000 if he performed better against the AI than the other human player did. The players did not know who the other participant was and were not told how the other human was performing during the experiment. For the 10,000 hands played, Pluribus beat the humans by an average of 32 mbb/game (with a standard error of 15 mbb/game). Pluribus was determined to be profitable with a p value of 0.014. (Darren Elias was behind Pluribus by 40 mbb/game with a standard error of 22 mbb/game and a p value of 0.033, and Chris Ferguson was behind Pluribus by 25 mbb/game with a standard error of 20 mbb/game and a p value of 0.107. Ferguson’s lower loss rate may be a consequence of variance, skill, and/or his use of a more conservative strategy that was biased toward folding in unfamiliar difficult situations.)

Because Pluribus’s strategy was determined entirely from self-play without any human data, it also provides an outside perspective on what optimal play should look like in multiplayer no-limit Texas hold’em. Pluribus confirms the conventional human wisdom that limping (calling the “big blind” rather than folding or raising) is suboptimal for any player except the “small blind” player who already has half the big blind in the pot by the rules, and thus has to invest only half as much as the other players to call. Although Pluribus initially experimented with limping when computing its blueprint strategy offline through self-play, it gradually discarded this action from its strategy as self-play continued. However, Pluribus disagrees with the folk wisdom that “donk betting” (starting a round by betting when one ended the previous betting round with a call) is a mistake; Pluribus does this far more often than professional humans do.

## Conclusions

Forms of self-play combined with forms of search have led to a number of high-profile successes in perfect-information two-player zero-sum games. However, most real-world strategic interactions involve hidden information and more than two players. This makes the problem very different and considerably more difficult both theoretically and practically. Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker. In this paper we described Pluribus, an AI capable of defeating elite human professionals in six-player no-limit Texas hold’em poker, the most commonly played poker format in the world. Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect-information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.

## Supplementary Materials

science.sciencemag.org/content/365/6456/885/suppl/DC1

Supplementary Text

Table S1

References (4652)

Data File S1

## References and Notes

1. Recently, in the real-time strategy games Dota 2 (20) and StarCraft 2 (21), AIs have beaten top humans, but as humans have gained more experience against the AIs, humans have learned to beat them. This may be because for those two-player zero-sum games, the AIs were generated by techniques not guaranteed to converge to a Nash equilibrium, so they do not have the unbeatability property that Nash equilibrium strategies have in two-player zero-sum games. (Dota 2 involves two teams of five players each. However, because the players on the same team have the same objective and are not limited in their communication, the game is two-player zero-sum from an AI and game-theoretic perspective.)
2. Here we use the term “subgame” the way it is usually used in AI. In game theory, that word is used differently by requiring a subgame to start with a node where the player whose turn it is to move has no uncertainty about state—in particular, no uncertainty about the opponents’ private information.
3. Owing to the presence of AIVAT and because the players did not know each others’ scores during the experiment, there was no incentive for the players to play a risk-averse or risk-seeking strategy to outperform the other human.
Acknowledgments: We thank P. Ringshia for building a graphical user interface and thank J. Chintagunta, B. Clayman, A. Du, C. Gao, S. Gross, T. Liao, C. Kroer, J. Langas, A. Lerer, V. Raj, and S. Wu for playing against Pluribus as early testing. Funding: This material is based on Carnegie Mellon University research supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, IIS-1901403, and CCF-1733556 and by the ARO under award W911NF-17-1-0082, as well as XSEDE computing resources provided by the Pittsburgh Supercomputing Center. Facebook funded the player payments. Author contributions: N.B. and T.S. designed the algorithms. N.B. wrote the code. N.B. and T.S. designed the experiments and wrote the paper. Competing interests: The authors have ownership interest in Strategic Machine, Inc., and Strategy Robot, Inc., which have exclusively licensed prior game-solving code from the Carnegie Mellon University laboratory of T.S., which constitutes the bulk of the code in Pluribus. Data and materials availability: The data presented in this paper are shown in the main text and supplementary materials. Because poker is played commercially, the risk associated with releasing the code outweighs the benefits. To aid reproducibility, we have included the pseudocode for the major components of our program in the supplementary materials.