Research Article

# Superhuman AI for heads-up no-limit poker: Libratus beats top professionals


Science  26 Jan 2018:
Vol. 359, Issue 6374, pp. 418-424
DOI: 10.1126/science.aao1733

## Libratus versus humans

Pitting artificial intelligence (AI) against top human players demonstrates just how far AI has come. Brown and Sandholm built a poker-playing AI called Libratus that decisively beat four leading human professionals in the two-player variant of poker called heads-up no-limit Texas hold'em (HUNL). Over nearly 3 weeks, Libratus played 120,000 hands of HUNL against the human professionals, using a three-pronged approach that included precomputing an overall strategy, adapting the strategy to actual gameplay, and learning from its opponent.

Science, this issue p. 418

## Abstract

No-limit Texas hold’em is the most popular form of poker. Despite artificial intelligence (AI) successes in perfect-information games, the private information and massive game tree have made no-limit poker difficult to tackle. We present Libratus, an AI that, in a 120,000-hand competition, defeated four top human specialist professionals in heads-up no-limit Texas hold’em, the leading benchmark and long-standing challenge problem in imperfect-information game solving. Our game-theoretic approach features application-independent techniques: an algorithm for computing a blueprint for the overall strategy, an algorithm that fleshes out the details of the strategy for subgames that are reached during play, and a self-improver algorithm that fixes potential weaknesses that opponents have identified in the blueprint strategy.

In recent years, the field of artificial intelligence (AI) has advanced considerably. The measure of this progress has, in many cases, been marked by performance against humans in benchmark games. AI programs have defeated top humans in checkers (1), chess (2), and Go (3). In these perfect-information games, both players know the exact state of the game at every point. By contrast, in imperfect-information games, some information about the state of the game is hidden from a player—for example, the opponent may hold hidden cards. Hidden information is ubiquitous in real-world strategic interactions—such as business strategy, negotiation, strategic pricing, finance, cybersecurity, and military applications—which makes research on general-purpose techniques for imperfect-information games particularly important.

Hidden information makes a game far more complex for a number of reasons. Rather than simply searching for an optimal sequence of actions, an AI for imperfect-information games must determine how to balance actions appropriately, so that the opponent never finds out too much about the AI's private information. For example, bluffing is a necessary feature of any competitive poker strategy, but bluffing all the time would be a bad strategy. That is, the value of an action depends on the probability with which it is played.

Another key challenge is that different parts of the game cannot be considered in isolation; the optimal strategy for a given situation may depend on the strategy that would be played in situations that have not occurred (4). As a consequence, a competitive AI must always consider the strategy for the game as a whole.

Poker has a long history as a challenge problem for developing AIs that can address hidden information (5–11). No-limit Texas hold’em is the most popular form of poker in the world. The heads-up (that is, two-player) variant prevents opponent collusion and kingmaker scenarios, in which a bad player causes a mediocre player to shine, and therefore allows a clear winner to be determined. Because of its large size and strategic complexity, heads-up no-limit Texas hold’em (HUNL) has been the primary benchmark and challenge problem for imperfect-information game solving for several years. No prior AI has defeated top human players in this game.

In this paper we introduce Libratus (12), an AI that takes a distinct approach to addressing imperfect-information games. In a 20-day, 120,000-hand competition featuring a $200,000 prize pool, it defeated top human professionals in HUNL. The techniques in Libratus do not use expert domain knowledge or human data and are not specific to poker; thus, they apply to a host of imperfect-information games.

## Game-solving approach in Libratus

Libratus features three main modules:

1) The first module computes an abstraction of the game, which is smaller and easier to solve, and then computes game-theoretic strategies for the abstraction. The solution to this abstraction provides a detailed strategy for the early rounds of the game, but only an approximation for how to play in the more numerous later parts of the game. We refer to the solution of the abstraction as the blueprint strategy.

2) When a later part of the game is reached during play, the second module of Libratus constructs a finer-grained abstraction for that subgame and solves it in real time (13). Unlike subgame-solving techniques in perfect-information games, Libratus does not solve the subgame abstraction in isolation; instead, it ensures that the fine-grained solution to the subgame fits within the larger blueprint strategy of the whole game. The subgame solver has several key advantages over prior subgame-solving techniques (14–16). Whenever the opponent makes a move that is not in the abstraction, a subgame is solved with that action included. We call this nested subgame solving. This technique comes with a provable safety guarantee.

3) The third module of Libratus—the self-improver—enhances the blueprint strategy. It fills in missing branches in the blueprint abstraction and computes a game-theoretic strategy for those branches. In principle, one could conduct all such computations in advance, but the game tree is far too large for that to be feasible.
To tame this complexity, Libratus uses the opponents’ actual moves to suggest where in the game tree such filling is worthwhile. In the following three subsections, we present these three modules in more detail.

## Abstraction and equilibrium finding: Building a blueprint strategy

One solution to the problem of imperfect information is to simply reason about the entire game as a whole, rather than just pieces of it. In this approach, a solution is precomputed for the entire game, possibly using a linear program (10) or an iterative algorithm (17–21). For example, an iterative algorithm called counterfactual regret minimization plus (CFR+) was used to near-optimally solve heads-up limit Texas hold’em, a relatively simple version of poker, which has about 10^13 unique decision points (11, 22). By contrast, HUNL (23) has 10^161 decision points (24), so traversing the entire game tree even once is impossible. Precomputing a strategy for every decision point is infeasible for such a large game. Fortunately, many of those decision points are very similar. For example, there is little difference between a bet of $100 and a bet of $101. Rather than consider every possible bet between $100 and $20,000, we could instead just consider increments of $100. This is referred to as action abstraction. An abstraction is a smaller, simplified game that retains as much as possible the strategic aspects of the original game. This drastically reduces the complexity of solving the game. If an opponent bets $101 during an actual match, then the AI may simply round this to a bet of $100 and respond accordingly (25–27). Most of the bet sizes included in Libratus’s action abstraction were nice fractions or multiples of the pot [roughly determined by analyzing the most common bet sizes at various points in the game taken by prior top AIs in the Annual Computer Poker Competition (ACPC) (28)].
However, certain bet sizes early in the game tree were determined by an application-independent parameter-optimization algorithm that converged to a locally optimal set of bet sizes (29).
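As a toy sketch of action abstraction and rounding (Libratus's actual mapping of off-tree bets is more sophisticated; the bet sizes below are invented for illustration):

```python
# Toy sketch of action abstraction: off-tree bets are rounded to the nearest
# size in a small, hypothetical set of abstract bet sizes. This is NOT
# Libratus's actual translation rule, just the simplest nearest-size version.
def round_to_abstraction(bet, abstract_sizes):
    """Map an observed bet to the closest bet size in the abstraction."""
    return min(abstract_sizes, key=lambda size: abs(size - bet))

pot = 200  # hypothetical pot size
abstract_sizes = [pot // 2, pot, 2 * pot, 20000]  # half pot, pot, 2x pot, all-in

print(round_to_abstraction(101, abstract_sizes))  # -> 100 (a $101 bet is treated as $100)
```

The payoff of the abstraction is that the solver only ever reasons about a handful of bet sizes per decision point instead of thousands.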

An additional form of abstraction is abstraction of actions taken by chance, that is, card abstraction, in the case of poker. Similar hands are grouped together and treated identically. Intuitively, there is little difference between a king-high flush and a queen-high flush. Treating those hands as identical reduces the complexity of the game and thus makes it computationally easier. Nevertheless, there are still differences even between a king-high flush and a queen-high flush. At the highest levels of play, those distinctions may be the difference between winning and losing. Libratus does not use any card abstraction on the first and second betting rounds. The last two betting rounds, which have a considerably larger number of states, are abstracted only in the blueprint strategy. The 55 million different hand possibilities on the third round were algorithmically grouped into 2.5 million abstract buckets, and the 2.4 billion different possibilities on the fourth round were algorithmically grouped into 1.25 million abstract buckets. However, the AI does not follow the blueprint strategy in these rounds and instead applies nested subgame solving, described in the next section, which does not use any card abstraction. Thus, each poker hand is considered individually during actual play. The card abstraction algorithm that we used was similar to that used in our prior AIs Baby Tartanian8 (30), which won the 2016 ACPC, and Tartanian7 (31–33), which won the 2014 ACPC. (There was no ACPC in 2015.)
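The bucketing idea can be sketched in a few lines (real card abstraction clusters hands on much richer features, such as equity distributions; the one-dimensional scores below are invented):

```python
# Toy sketch of card abstraction: bucket hands by a crude equity score so that
# strategically similar hands share one strategy. Real systems cluster on
# equity *distributions*; this one-dimensional version is only illustrative.
def bucket_hands(hand_equities, n_buckets):
    """hand_equities: dict hand -> estimated win probability in [0, 1]."""
    return {hand: min(int(eq * n_buckets), n_buckets - 1)
            for hand, eq in hand_equities.items()}

# Invented equity estimates for three example hands.
equities = {"K-high flush": 0.91, "Q-high flush": 0.90, "2-7 offsuit": 0.05}
buckets = bucket_hands(equities, 10)
print(buckets)  # the two flushes land in the same bucket; 2-7 offsuit does not
```

Because every hand in a bucket shares one strategy, the solver's memory and compute scale with the number of buckets rather than the number of raw hands.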

Once the abstraction was constructed, we computed the blueprint strategy for Libratus by having the AI play simulated games of poker against itself (while still exploring the hypothetical outcomes of actions not chosen) using an improved version of an algorithm called Monte Carlo counterfactual regret minimization (MCCFR) (17, 34, 35), which has a long history of use in successful poker AIs (30, 31, 36, 37). MCCFR maintains a regret value for each action. Intuitively, regret represents how much the AI regrets having not chosen that action in the past. When a decision point is encountered during self-play, the AI chooses actions with higher regret with higher probability (38). As more and more games are simulated, MCCFR guarantees that, with high probability, a player’s average regret for any action (total regret divided by the number of iterations played) approaches zero. Thus, the AI’s average strategy over all simulated games gradually improves. We will now describe the equilibrium-finding algorithm (4).
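The regret-proportional action choice described above is commonly implemented as regret matching (one of the options discussed in the notes); a minimal sketch:

```python
# Minimal sketch of regret matching: choose actions in proportion to their
# positive regret; if no action has positive regret, play uniformly at random.
def regret_matching(regrets):
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0.0:
        return [1.0 / len(regrets)] * len(regrets)  # e.g., the first iteration
    return [p / total for p in positive]

print(regret_matching([10.0, 30.0, -5.0]))  # -> [0.25, 0.75, 0.0]
```

Actions with negative regret get zero probability in the current strategy, but their regrets keep being tracked, so they can recover if they later prove useful.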

For each simulated game, MCCFR chooses one player (whom we refer to as the traverser) that will explore every possible action and update his regrets, while the opponent simply plays according to the strategy determined by the current regrets. The algorithm switches the roles of the two players after each game, that is, a single hand of poker. Every time either player is faced with a decision point in a simulated game, the player will choose a probability distribution over actions on the basis of regrets for those actions (which are determined by what he had learned in earlier games when he had been in that situation). For the first game, the AI has not learned anything yet and therefore uses a uniform random distribution over actions. At traverser decision points, MCCFR explores every action in a depth-first manner. At opponent decision points, MCCFR samples an action on the basis of the probability distribution. This process repeats at every decision point until the game is over and a reward is received, which is passed up. When a reward is returned by every action at a traverser decision point, MCCFR calculates the weighted average reward for that decision point on the basis of the probability distribution over actions. The regret for each action is then updated by adding the value returned by that action and subtracting the weighted average reward for the decision point. The weighted average reward is then passed up to the preceding decision point, and so on.
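The traversal just described can be sketched on a toy game tree. Note the simplifications: a perfect-information tree with explicit node dictionaries instead of information sets, no chance events, and payoffs stated from the traverser's point of view.

```python
import random

# Toy external-sampling-style traversal illustrating the mechanics described
# above: the traverser explores every action depth-first and updates regrets;
# the opponent samples a single action from its current strategy.
regrets = {}  # node id -> list of per-action regrets

def strategy(node_id, n_actions):
    r = regrets.setdefault(node_id, [0.0] * n_actions)
    positive = [max(x, 0.0) for x in r]
    total = sum(positive)
    return ([p / total for p in positive] if total > 0
            else [1.0 / n_actions] * n_actions)

def traverse(node, traverser):
    if node["player"] == "terminal":
        return node["payoff"]  # payoff from the traverser's point of view
    children = node["children"]
    probs = strategy(node["id"], len(children))
    if node["player"] == traverser:
        # Traverser: explore every action, then update regrets.
        values = [traverse(c, traverser) for c in children]
        expected = sum(p * v for p, v in zip(probs, values))
        r = regrets[node["id"]]
        for i, v in enumerate(values):
            r[i] += v - expected  # regret of not having taken action i
        return expected
    # Opponent: sample one action according to the current strategy.
    i = random.choices(range(len(children)), weights=probs)[0]
    return traverse(children[i], traverser)

# Trivial one-decision tree: the traverser picks between payoffs +1 and -1.
tree = {"id": "root", "player": 0,
        "children": [{"player": "terminal", "payoff": 1.0},
                     {"player": "terminal", "payoff": -1.0}]}
for _ in range(100):
    traverse(tree, traverser=0)
print(strategy("root", 2))  # -> [1.0, 0.0]: mass shifts to the +1 action
```

Even on this one-node game, the regret updates drive the current strategy toward the better action after a single iteration.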

Our improved version of MCCFR traverses a smaller portion of the game tree on each iteration. Intuitively, there are many clearly suboptimal actions in the game and repeatedly exploring them wastes computational resources that could be better used to improve the strategy elsewhere. Rather than explore every hypothetical alternative action to see what its reward would have been, our algorithm probabilistically skips over unpromising actions that have very negative regret as it proceeds deeper into the tree during a game (30, 39). In practice, this led to a speedup of MCCFR by a factor of three and allowed us to solve larger abstractions than were otherwise possible.
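One plausible shape for this probabilistic skipping, loosely following the sampling rule sketched in the notes (the constants and the probability floor below are invented for illustration):

```python
# Illustrative regret-based skipping: actions whose regret is below a
# (negative) threshold C are explored only with a small probability.
# K, C, and the floor are assumed values, not Libratus's actual parameters.
def explore_probability(regret, C=-1000.0, K=100.0, floor=0.01):
    if regret >= C:
        return 1.0  # promising enough: always explore
    return max(K / (K + C - regret), floor)

print(explore_probability(-500.0))    # -> 1.0 (above the threshold)
print(explore_probability(-10000.0))  # small: deeply regretted actions are rarely visited
```

The more negative an action's regret, the less often it is explored, while the floor ensures every action is still occasionally revisited.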

This skipping also mitigates the problems that stem from imperfect recall. The state-of-the-art practical abstractions in the field, including ours, are imperfect-recall abstractions where some aspects of the cards that are on the path of play so far are intentionally forgotten in order to be able to computationally afford to have a more detailed abstraction of the present state of cards (30–32, 40). Because all decision points in a single abstract card bucket share the same strategy, updating the strategy for one of them leads to updating the strategy for all of them. This is not an issue if all of the decision points share the same optimal strategy at the solution reached, but, in practice, there are differences between their optimal strategies, and they effectively “fight” to push the bucket’s strategy toward their own optimal strategy. Skipping negative-regret actions means that decision points that will never be reached in actual play will no longer have their strategies updated, thereby allowing the decision points that will actually occur during play to move the bucket’s strategy closer to their optimal strategies.

We ran our algorithm on an abstraction that is very detailed in the first two rounds of HUNL, but relatively coarse in the final two rounds. However, Libratus never plays according to the abstraction solution in the final two rounds. Rather, it uses the abstract blueprint strategy in those rounds only to estimate what reward a player should expect to receive with a particular hand in a subgame. This estimate is used to determine a more precise strategy during actual play, as described in the next section.

## Nested safe subgame solving

Although purely abstraction-based approaches have produced strong AIs for poker (25, 30, 32, 41), abstraction alone has not been enough to reach superhuman performance in HUNL. In addition to abstraction, Libratus builds upon previous research into subgame solving (14–16, 42), in which a more detailed strategy is calculated for a particular part of the game that is reached during play. Libratus features many advances in subgame solving that proved critical to achieving superhuman performance (43).

Libratus plays according to the abstract blueprint strategy only in the early parts of HUNL, where the number of possible states is relatively small and we can afford the abstraction to be extremely detailed. Upon reaching the third betting round, or any earlier point in the game where the remaining game tree is sufficiently small (44), Libratus constructs a new, more detailed abstraction for the remaining subgame and solves it in real time.

However, there is a major challenge with subgame solving in imperfect-information games: A subgame cannot be solved in isolation because its optimal strategy may depend on other, unreached subgames (4). Prior AIs that used real-time subgame solving addressed this problem by assuming that the opponent plays according to the blueprint strategy. However, the opponent can exploit this assumption by simply switching to a different strategy. For this reason, the technique may produce strategies that are far worse than the blueprint strategy and is referred to as unsafe subgame solving (42, 45). Safe subgame-solving techniques, on the other hand, guarantee that the subgame’s new strategy makes the opponent no better off no matter what strategy the opponent might use (14). They accomplish this by ensuring that the new strategy for the subgame fits within the overarching blueprint strategy of the original abstraction. Ensuring the opponent is no better off relative to the blueprint strategy is trivially possible because we could just reuse the blueprint strategy. However, now that the abstraction is more detailed in the subgame and we can better distinguish the strategic nuances of the subgame, it may be possible to find an improvement over the previous strategy that makes the opponent worse off no matter what cards she is holding.

We now describe Libratus’s core technique for determining an improved strategy in a subgame. For exposition, we assume player 2 (P2) is determining an improved strategy against player 1 (P1). Given that P2’s strategy outside the subgame is σ2, there exists some optimal strategy σ2* that P2 could play in the subgame. We would like to find or approximate σ2* in real time. We assume that, for each poker hand P1 might have, we have a good estimate of the value P1 receives in the subgame with that hand by playing optimally against σ2*, even though we do not know σ2* itself. Although we do not know these values exactly, we can approximate them with the values P1 receives in the subgame in the blueprint strategy. We later prove that if these estimates are approximately accurate, we can closely approximate σ2*.

To find a strategy close to σ2* in the subgame using only the values from the blueprint, we create an augmented subgame (Fig. 1) that contains the subgame and additional structures. At the start of the augmented subgame, P1 is privately dealt a random poker hand. Given that P2 plays according to σ2 before the subgame, and given P1’s dealt hand, there is a particular probability distribution over what hands P2 might have in this situation. P2 is dealt a poker hand according to this probability distribution. P1 then has the choice of either entering the subgame (which is now far more detailed than in the blueprint strategy) or taking an alternative payoff that ends the augmented subgame immediately. The value of the alternative payoff is our estimate, according to the blueprint strategy, of P1’s value for that poker hand in that subgame. If P1 chooses to enter the subgame, then play proceeds normally until the end of the game is reached. We can solve this augmented subgame just as we did for the blueprint strategy (46).

For any hand P1 might have, P1 can do no worse in the augmented subgame than just choosing the alternative payoff (which awards our estimate of the expected value P1 could receive against σ2*). At the same time, P2 can ensure that for every poker hand P1 might have, he does no better than what he could receive against σ2*, because P2 can simply play σ2* itself. Thus, any solution to the augmented subgame must do approximately as well as σ2*—where the approximation error depends on how far off our estimates of P1’s values are. P2 then uses the solution to the augmented subgame as P2’s strategy going forward.

All of this relied on the assumption that we have accurate estimates of P1’s values against σ2*. Although we do not know these values exactly, we can approximate them with values from the blueprint strategy. We now prove that if these estimates are approximately accurate, subgame solving will produce a strategy that is close to the quality of σ2*. Specifically, we define the exploitability of a strategy σ2 as how much more σ2 would lose, in expectation, against a worst-case opponent than what P2 would lose, in expectation, in an exact solution of the full game.
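In symbols (with u_2 denoting P2's expected payoff and σ₂* an exact-solution strategy for P2; this notation is ours, not the paper's), the definition reads:

```latex
\mathrm{exploitability}(\sigma_2)
  \;=\; \min_{\sigma_1} u_2(\sigma_1, \sigma_2^{*})
  \;-\; \min_{\sigma_1} u_2(\sigma_1, \sigma_2)
```

The first term is P2's worst-case payoff under an exact solution (the game value for P2), so exploitability is always nonnegative and is zero exactly when σ₂ is itself unexploitable.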

Theorem 1 uses a form of safe subgame solving that we coin “Estimated-Maxmargin.” We define a margin for every P1 hand in a subgame as the expected value of that hand according to the blueprint minus what P1 could earn with that hand, in expectation, by entering the more detailed subgame. Estimated-Maxmargin finds a strategy that maximizes the minimum margin among all P1 hands. It is similar to a previous technique called Maxmargin (15), except that technique conservatively used as the margin what P1 could earn in the subgame, in expectation, by playing a best response to P2’s blueprint strategy minus what P1 could earn, in expectation, by entering the more detailed subgame.
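The objective can be sketched numerically (the hand values below are invented; a real solver optimizes P2's subgame strategy to push this minimum as high as possible):

```python
# Sketch of Estimated-Maxmargin's objective. For each P1 hand, the margin is
# the blueprint estimate of that hand's value minus what the hand earns by
# entering the refined subgame against a candidate P2 strategy. The solver
# seeks the P2 strategy maximizing the worst (minimum) margin.
def min_margin(blueprint_values, subgame_values):
    """Both arguments: dict P1 hand -> expected value for P1 (invented units)."""
    margins = {hand: blueprint_values[hand] - subgame_values[hand]
               for hand in blueprint_values}
    return min(margins.values())

blueprint = {"AA": 5.0, "KQ": 1.0, "72": -2.0}  # hypothetical blueprint values
candidate = {"AA": 4.0, "KQ": 0.5, "72": -2.5}  # values under one candidate P2 strategy
print(min_margin(blueprint, candidate))  # -> 0.5
```

A positive minimum margin means every P1 hand does at least slightly worse by entering the refined subgame than its blueprint estimate, which is what the safety guarantee rests on.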

Theorem 1. Let σi be a strategy for a two-player zero-sum perfect-recall game, let S be a set of nonoverlapping subgames in the game, and let σi* be the least-exploitable strategy that differs from σi only in S. Assume that for any opponent decision point (a hand in the case of poker) and any subgame in S, our estimate of the opponent’s value in a best response to σi* for that decision point in that subgame is off by at most Δ. Applying Estimated-Maxmargin subgame solving to any subgame in S reached during play results in overall exploitability at most 2Δ higher than that of σi* (47).

Although safe subgame-solving techniques have been known for 3 years (14, 15), they were not used in practice because empirically they performed substantially worse than unsafe subgame solving (42) head to head (48). Libratus features a number of advances to subgame solving that greatly improve effectiveness:

1) Although we describe safe subgame solving as using estimates of P1 values, past techniques used upper bounds on those values (14, 15). Using upper bounds guarantees that the subgame solution has exploitability no higher than the blueprint strategy. However, it tends to lead to overly conservative strategies in practice. Using estimates can, in theory, result in strategies with higher exploitability than the blueprint strategy, but Theorem 1 bounds how much higher this exploitability can be.

2) Libratus arrives at better strategies in subgames than was previously thought possible. Past techniques ensured that the new strategy for the subgame made P1 no better off in that subgame for every situation. It turns out that this is an unnecessarily strong constraint. For example, 2♠7♥ is considered the worst hand in HUNL and should be folded immediately, which ends the game. Choosing any other action would result in an even bigger loss in expectation. Nevertheless, past subgame-solving techniques would be concerned about P1 having 2♠7♥ in a subgame, which is unrealistic. Even if subgame solving resulted in a strategy that increased the value of 2♠7♥ a small amount in one subgame, that increase would not outweigh the cost of reaching the subgame (that is, the cost of not folding with 2♠7♥). Thus, P2 can allow the value of some “unimportant” P1 hands to increase in subgames, so long as the increase is small enough that it is still a mistake for P1 to reach the subgame with that hand. We accomplish this by increasing the alternative reward of P1 hands in the augmented subgame by the extra cost to P1 of reaching the subgame, that is, the size of the mistake P1 would have to make to reach that subgame with that hand. By increasing the alternative reward in the augmented subgame of these unimportant hands, P2 develops a strategy in the subgame that better defends against hands P1 might actually have (4).
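A sketch of this adjustment (all values invented): each P1 hand's alternative payoff in the augmented subgame is raised by that hand's cost of having reached the subgame at all.

```python
# Sketch of the reach adjustment to alternative payoffs in the augmented
# subgame. Values are invented; the point is only the shape of the update.
def adjusted_alternatives(blueprint_values, reach_costs):
    """Raise each P1 hand's alternative payoff by its cost of reaching here,
    so P2 need not over-defend against hands P1 should never arrive with."""
    return {hand: value + reach_costs.get(hand, 0.0)
            for hand, value in blueprint_values.items()}

blueprint = {"AA": 4.0, "2s7h": -3.0}  # hypothetical subgame values for P1
reach_costs = {"2s7h": 1.5}            # 2s7h should have folded earlier, at this cost
print(adjusted_alternatives(blueprint, reach_costs))  # 2s7h's alternative rises to -1.5
```

Raising the alternative payoff of an implausible hand lets the subgame solution concede a little value to it, freeing P2 to defend better against the hands P1 would actually play to this point.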

## Conclusions

Libratus presents an approach that effectively addresses the challenge of game-theoretic reasoning under hidden information in a large state space. The techniques that we developed are largely domain independent and can thus be applied to other strategic imperfect-information interactions, including nonrecreational applications. Owing to the ubiquity of hidden information in real-world strategic interactions, we believe the paradigm introduced in Libratus will be important for the future growth and widespread application of AI.

## Supplementary Materials

www.sciencemag.org/content/359/6374/418/suppl/DC1

Supplementary Text

Figs. S1 and S2

Table S1

References (62–68)

## References and Notes

1. See supplementary materials for more details.
2. Libratus is Latin and means balanced (for approximating Nash equilibrium) and forceful (for its powerful play style and strength).
3. An imperfect-information subgame (which we refer to simply as a subgame) is defined differently than how a subgame is usually defined in game theory. The usual definition requires that a subgame starts with the players knowing the exact state of the game, that is, no information is hidden from any player. Here, an imperfect-information subgame is determined by information that is common knowledge to the players. For example, in poker, a subgame is defined by the sequence of visible board cards and actions the players have taken so far. Every possible combination of private cards—that is, every node in the game tree which is consistent with the common knowledge—is a root of this subgame. Any node that descends from a root node is also included in the subgame. A formal definition is provided in the supplementary materials.
4. The version of HUNL that we refer to, which is used in the Annual Computer Poker Competition, allows bets in increments of $1, with each player having $20,000 at the beginning of a hand.
5. There are a number of theoretically correct ways to choose actions on the basis of their regrets. The most common is regret matching, in which an action is chosen in proportion to its positive regret (58). Another common choice is hedge (59, 60).
6. An action a with regret R(a) that is below a threshold C (where C is negative) is sampled with probability K/[K + C − R(a)], where K is a positive constant. There is additionally a floor on the sample probability. This sampling is only done for about the last half of iterations to be run; the first half is conducted using traditional external-sampling MCCFR. Other formulas can also be used.
7. In Libratus, we considered “sufficiently small” to be situations where no additional bets or raises could be made.
8. Despite lacking theoretical guarantees, unsafe subgame solving empirically performs well in certain situations and requires less information to be precomputed. For this reason, Libratus uses it once upon first reaching the third betting round, but uses safe subgame solving in all subsequent situations.
9. We solved augmented subgames using a heavily optimized form of the CFR+ algorithm (22, 61) because of the better performance of CFR+ in small games where a precise solution is desired. The optimizations we use keep track of all possible P1 hands, rather than dealing out a single one at random.
10. Note that the theorem only assumes perfect recall in the actual game, not in the abstraction that is used for computing a blueprint strategy. Furthermore, applying Estimated-Maxmargin assumes that the subroutine maximizes the minimum margin; a sufficient condition for doing so is that there is no abstraction in the subgame.
11. Indeed, the original purpose of safe subgame solving was merely to reduce space usage by reconstructing subgame strategies rather than storing them.
12. Specifically, Libratus increased or decreased all its bet sizes by a percentage chosen uniformly at random between 0 and 8%.
13. Based on the available computing resources, we chose k = 3 so that the algorithm could typically fix three holes to reasonable accuracy in 24 hours.
14. Baby Tartanian8 and all other AIs in the ACPC are available to ACPC participants for benchmarking.
15. Baby Tartanian8 uses action translation in response to bet sizes that are not in its action abstraction. Our experiments above demonstrated that action translation performs poorly compared to subgame solving. Using only bet sizes in Baby Tartanian8’s abstraction disentangles the effects of action translation from the improvement of nested subgame solving. Baby Tartanian8 still used actions that were not in Libratus’s abstraction, and therefore, the experiments can be considered conservative.
16. Because both the humans and the AI adapted over the course of the competition, treating the hands as independent is not entirely appropriate. We include confidence figures to provide some intuition for the variance in HUNL. In any case, 147 mbb/game over 120,000 hands is considered a massive and unambiguous victory in HUNL.
Acknowledgments: This material is based on research supported by the NSF under grants IIS-1718457, IIS-1617590, and CCF-1733556, and by the U.S. Army Research Office under award W911NF-17-1-0082, as well as the Extreme Science and Engineering Discovery Environment (XSEDE) computing resources and advanced user support services provided by the Pittsburgh Supercomputing Center. The Brains vs. Artificial Intelligence: Upping the Ante competition was sponsored by Carnegie Mellon University, Rivers Casino, GreatPoint Ventures, Avenue4Analytics, TNG Technology Consulting, Artificial Intelligence, Intel, and Optimized Markets, Inc. We thank B. Clayman for computing statistics of the play of our AIs against humans. The data presented in this paper are shown in the main text and supplementary materials. Additional data can be obtained from the corresponding author upon request. Because HUNL poker is played commercially, the risk associated with releasing the code outweighs the benefits. To aid reproducibility, we have included the pseudocode for the major components of our program in the supplementary materials. The technology has been exclusively licensed to Strategic Machine, Inc., and the authors have ownership interest in the company.