{"title": "Depth-Limited Solving for Imperfect-Information Games", "book": "Advances in Neural Information Processing Systems", "page_first": 7663, "page_last": 7674, "abstract": "A fundamental challenge in imperfect-information games is that states do not have well-defined values. As a result, depth-limited search algorithms used in single-agent settings and perfect-information games do not apply. This paper introduces a principled way to conduct depth-limited solving in imperfect-information games by allowing the opponent to choose among a number of strategies for the remainder of the game at the depth limit. Each one of these strategies results in a different set of values for leaf nodes. This forces an agent to be robust to the different strategies an opponent may employ. We demonstrate the effectiveness of this approach by building a master-level heads-up no-limit Texas hold'em poker AI that defeats two prior top agents using only a 4-core CPU and 16 GB of memory. Developing such a powerful agent would have previously required a supercomputer.", "full_text": "Depth-Limited Solving for\n\nImperfect-Information Games\n\nNoam Brown, Tuomas Sandholm, Brandon Amos\n\nComputer Science Department\nCarnegie Mellon University\n\nnoamb@cs.cmu.edu, sandholm@cs.cmu.edu, bamos@cs.cmu.edu\n\nAbstract\n\nA fundamental challenge in imperfect-information games is that states do not have\nwell-de\ufb01ned values. As a result, depth-limited search algorithms used in single-\nagent settings and perfect-information games do not apply. This paper introduces a\nprincipled way to conduct depth-limited solving in imperfect-information games\nby allowing the opponent to choose among a number of strategies for the remainder\nof the game at the depth limit. Each one of these strategies results in a different set\nof values for leaf nodes. This forces an agent to be robust to the different strategies\nan opponent may employ. 
We demonstrate the effectiveness of this approach by building a master-level heads-up no-limit Texas hold'em poker AI that defeats two prior top agents using only a 4-core CPU and 16 GB of memory. Developing such a powerful agent would have previously required a supercomputer.

1 Introduction

Imperfect-information games model strategic interactions between agents with hidden information. The primary benchmark for this class of games is poker, specifically heads-up no-limit Texas hold'em (HUNL), in which Libratus defeated top humans in 2017 [6]. The key breakthrough that led to superhuman performance was nested solving, in which the agent repeatedly calculates a finer-grained strategy in real time (for just a portion of the full game) as play proceeds down the game tree [5, 27, 6]. However, real-time subgame solving was too expensive for Libratus in the first half of the game because the portion of the game tree Libratus solved in real time, known as the subgame, always extended to the end of the game. Instead, for the first half of the game Libratus pre-computed a fine-grained strategy that was used as a lookup table. While this pre-computed strategy was successful, it required millions of core hours and terabytes of memory to calculate. Moreover, in deeper sequential games the computational cost of this approach would be even higher because either longer subgames or a larger pre-computed strategy would need to be solved. A more general approach would be to solve depth-limited subgames, which may not extend to the end of the game. These could be solved even in the early portions of a game.

The poker AI DeepStack does this using a technique similar to nested solving that was developed independently [27].
However, while DeepStack defeated a set of non-elite human professionals in HUNL, it never defeated prior top AIs despite using over one million core hours to train the agent, suggesting its approach may not be sufficiently efficient in domains like poker. We discuss this in more detail in Section 7. This paper introduces a different approach to depth-limited solving that defeats prior top AIs and is computationally orders of magnitude less expensive.

When conducting depth-limited solving, a primary challenge is determining what values to substitute at the leaf nodes of the depth-limited subgame. In perfect-information depth-limited subgames, the value substituted at leaf nodes is simply an estimate of the state's value when all players play an equilibrium [35, 33]. For example, this approach was used to achieve superhuman performance in backgammon [39], chess [9], and Go [36, 37]. The same approach is also widely used in single-agent settings such as heuristic search [30, 24, 31, 15]. Indeed, in single-agent and perfect-information multi-agent settings, knowing the values of states when all agents play an equilibrium is sufficient to reconstruct an equilibrium. However, this does not work in imperfect-information games, as we demonstrate in the next section.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 The Challenge of Depth-Limited Solving in Imperfect-Information Games

In imperfect-information games (also referred to as partially-observable games), an optimal strategy cannot be determined in a subgame simply by knowing the values of states (i.e., game-tree nodes) when all players play an equilibrium strategy. A simple demonstration is in Figure 1a, which shows a sequential game we call Rock-Paper-Scissors+ (RPS+).
RPS+ is identical to traditional Rock-Paper-Scissors, except that if either player plays Scissors, the winner receives 2 points instead of 1 (and the loser loses 2 points). Figure 1a shows RPS+ as a sequential game in which P1 acts first but does not reveal the action to P2 [7, 13]. The optimal strategy (the minimax strategy, which is also a Nash equilibrium in two-player zero-sum games) for both players in this game is to choose Rock and Paper each with 40% probability, and Scissors with 20% probability. In this equilibrium, the expected value to P1 of choosing Rock is 0, as is the value of choosing Scissors or Paper. In other words, all the red states in Figure 1a have value 0 in the equilibrium. Now suppose P1 conducts a depth-limited search with a depth of one in which the equilibrium values are substituted at that depth limit. This depth-limited subgame is shown in Figure 1b. Clearly, there is not enough information in this subgame to arrive at the optimal strategy of 40%, 40%, and 20% for Rock, Paper, and Scissors, respectively.

(a) Rock-Paper-Scissors+ shown with the optimal P1 strategy. The terminal values are shown first for P1, then P2. The red lines between the P2 nodes mean they are indistinguishable to P2.

(b) A depth-limited subgame of Rock-Paper-Scissors+ with state values determined from the equilibrium.

In the RPS+ example, the core problem is that we incorrectly assumed P2 would always play a fixed strategy. If indeed P2 were to always play Rock, Paper, and Scissors with probability ⟨0.4, 0.4, 0.2⟩, then P1 could choose any arbitrary strategy and receive an expected value of 0. However, by assuming P2 is playing a fixed strategy, P1 may not find a strategy that is robust to P2 adapting. In reality, P2's optimal strategy depends on the probability that P1 chooses Rock, Paper, and Scissors.
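To make the failure concrete, the expected values in the RPS+ example can be checked directly. This is an illustrative sketch (the payoff table and strategies below are transcribed from the example, not code from the paper):

```python
# Payoff to P1 for each (P1 action, P2 action) pair in RPS+.
PAYOFF = {
    ("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 2,
    ("P", "R"): 1, ("P", "P"): 0, ("P", "S"): -2,
    ("S", "R"): -2, ("S", "P"): 2, ("S", "S"): 0,
}
EQ = {"R": 0.4, "P": 0.4, "S": 0.2}  # the equilibrium described above

def value_vs(p1_action, p2_strategy):
    """Expected value to P1 of an action against a fixed P2 strategy."""
    return sum(p * PAYOFF[(p1_action, a2)] for a2, p in p2_strategy.items())

# Every "red state" has value 0 against the fixed equilibrium strategy:
print([value_vs(a, EQ) for a in "RPS"])  # -> [0.0, 0.0, 0.0]

# So a strategy recovered from those values alone, e.g. always Rock, looks
# fine, yet an adapting P2 exploits it (by switching to Paper):
print(min(value_vs("R", {a: float(a == b) for a in "RPS"}) for b in "RPS"))  # -> -1.0
```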
In general, in imperfect-information games a player's optimal strategy at a decision point depends on the player's belief distribution over states as well as the strategy of all other agents beyond that decision point.

In this paper we introduce a method for depth-limited solving that ensures a player is robust to such opponent adaptations. Rather than simply substitute a single state value at a depth limit, we instead allow the opponent one final choice of action at the depth limit, where each action corresponds to a strategy the opponent will play in the remainder of the game. The choice of strategy determines the value of the state. The opponent does not make this choice in a way that is specific to the state (in which case he would trivially choose the maximum value for himself). Instead, naturally, the opponent must make the same choice at all states that are indistinguishable to him. We prove that if the opponent is given a choice between a sufficient number of strategies at the depth limit, then any solution to the depth-limited subgame is part of a Nash equilibrium strategy in the full game. We also show experimentally that when only a few choices are offered (for computational speed), performance of the method is extremely strong.

3 Notation and Background

In an imperfect-information extensive-form game there is a finite set of players, P. A state (also called a node) is defined by all information of the current situation, including private knowledge known to only one player. A unique player P(h) acts at state h. H is the set of all states in the game tree.
The state h′ reached after an action is taken in h is a child of h, represented by h · a = h′, while h is the parent of h′. If there exists a sequence of actions from h to h′, then h is an ancestor of h′ (and h′ is a descendant of h), represented as h ⊏ h′. Z ⊆ H are terminal states for which no actions are available. For each player i ∈ P, there is a payoff function ui : Z → R. If P = {1, 2} and u1 = −u2, the game is two-player zero-sum. In this paper we assume the game is two-player zero-sum, though many of the ideas extend to general sum and more than two players.

Imperfect information is represented by information sets (infosets) for each player i ∈ P. For any infoset I belonging to player i, all states h, h′ ∈ I are indistinguishable to player i. Moreover, every non-terminal state h ∈ H belongs to exactly one infoset for each player i.

A strategy σ_i(I) (also known as a policy) is a probability vector over actions for player i in infoset I. The probability of a particular action a is denoted by σ_i(I, a). Since all states in an infoset belonging to player i are indistinguishable, the strategies in each of them must be identical. We define σ_i to be a strategy for player i in every infoset in the game where player i acts. A strategy is pure if all probabilities in it are 0 or 1. All strategies are a linear combination of pure strategies. A strategy profile σ is a tuple of strategies, one for each player. The strategy of every player other than i is represented as σ_−i. u_i(σ_i, σ_−i) is the expected payoff for player i if all players play according to the strategy profile ⟨σ_i, σ_−i⟩.
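The payoff notation can be instantiated on the RPS+ game from Section 2. A small sketch for concreteness (the matrix encodes RPS+; nothing here is from the paper's implementation):

```python
# u_1 for each pure strategy profile; rows are P1's action (R, P, S),
# columns are P2's action (R, P, S).
A = [[0, -1, 2],
     [1, 0, -2],
     [-2, 2, 0]]

def u1(s1, s2):
    """u_1(sigma_1, sigma_2): expected payoff to P1, bilinear in the mixed strategies."""
    return sum(s1[i] * A[i][j] * s2[j] for i in range(3) for j in range(3))

def u2(s1, s2):
    return -u1(s1, s2)  # two-player zero-sum: u_2 = -u_1

eq = [0.4, 0.4, 0.2]
print(u1(eq, eq))  # approximately 0, the game value of RPS+
```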
The value to player i at state h given that all players play according to strategy profile σ is defined as v^σ_i(h), and the value to player i at infoset I is defined as v^σ_i(I) = Σ_{h∈I} p(h) v^σ_i(h), where p(h) is player i's believed probability that they are in state h, conditional on being in infoset I, based on the other players' strategies and chance's probabilities.

A best response to σ_−i is a strategy BR(σ_−i) such that u_i(BR(σ_−i), σ_−i) = max_{σ′_i} u_i(σ′_i, σ_−i). A Nash equilibrium σ* is a strategy profile where every player plays a best response: for all i, u_i(σ*_i, σ*_−i) = max_{σ′_i} u_i(σ′_i, σ*_−i) [29]. A Nash equilibrium strategy for player i is a strategy σ*_i that is part of any Nash equilibrium. In two-player zero-sum games, if σ_i and σ_−i are both Nash equilibrium strategies, then ⟨σ_i, σ_−i⟩ is a Nash equilibrium.

A depth-limited imperfect-information subgame, which we refer to simply as a subgame, is a contiguous portion of the game tree that does not divide infosets. Formally, a subgame S is a set of states such that for all h ∈ S, if h ∈ I_i and h′ ∈ I_i for some player i, then h′ ∈ S. Moreover, if x ∈ S and z ∈ S and x ⊏ y ⊏ z, then y ∈ S. If h ∈ S but no descendant of h is in S, then h is a leaf node. Additionally, the infosets containing h are leaf infosets.
Finally, if h ∈ S but no ancestor of h is in S, then h is a root node and the infosets containing h are root infosets.

4 Multi-Valued States in Imperfect-Information Games

In this section we describe our new method for depth-limited solving in imperfect-information games, which we refer to as multi-valued states. Our general approach is to first precompute an approximate Nash equilibrium for the entire game. We refer to this precomputed strategy profile as a blueprint strategy. Since the blueprint is precomputed for the entire game, it is likely just a coarse approximation of a true Nash equilibrium. Our goal is to compute a better approximation in real time for just a depth-limited subgame S that we find ourselves in during play. For the remainder of this paper, we assume that player P1 is attempting to approximate a Nash equilibrium strategy in S.

Let σ* be an exact Nash equilibrium. To present the intuition for our approach, we begin by considering what information about σ* would, in theory, be sufficient in order to compute a P1 Nash equilibrium strategy in S. For ease of understanding, when considering the intuition for multi-valued states we suggest the reader first focus on the case where S is rooted at the start of the game (that is, no prior actions have occurred).

As explained in Section 2, knowing the values of leaf nodes in S when both players play according to σ* (that is, v^{σ*}_i(h) for leaf node h and player Pi) is insufficient to compute a Nash equilibrium in S (even though this is sufficient in perfect-information games), because it assumes P2 would not adapt their strategy outside S. But what if P2 could adapt?
Specifically, suppose hypothetically that P2 could choose any strategy in the entire game, while P1 could only play according to σ*_1 outside of S. In this case, what strategy should P1 choose in S? Since σ*_1 is a Nash equilibrium strategy and P2 can choose any strategy in the game (including a best response to P1's strategy), by definition P1 cannot do better than playing σ*_1 in S. Thus, P1 should play σ*_1 (or some equally good Nash equilibrium) in S.

Another way to describe this setup is that upon reaching a leaf node h in infoset I in subgame S, rather than simply substituting v^{σ*}_2(h) (which assumes P2 plays according to σ*_2 for the remainder of the game), P2 could instead choose any mixture of pure strategies for the remainder of the game. So if there are N possible pure strategies following I, P2 would choose among N actions upon reaching I, where action n would correspond to playing pure strategy σ^n_2 for the remainder of the game. Since this choice is made separately at each infoset I and since P2 may mix between pure strategies, this allows P2 to choose any strategy below S.

Since the choice of action would define a P2 strategy for the remainder of the game and since P1 is known to play according to σ*_1 outside S, the chosen action could immediately reward the expected value v^{⟨σ*_1, σ^n_2⟩}_i(h) to Pi. Therefore, in order to reconstruct a P1 Nash equilibrium in S, it is sufficient to know for every leaf node the expected value of every pure P2 strategy against σ*_1 (stated formally in Proposition 1). This is in contrast to perfect-information games, in which it is sufficient to know for every leaf node just the expected value of σ*_2 against σ*_1. Critically, it is not necessary to know the strategy σ*_1 itself, just the values of σ*_1 played against every pure opponent strategy in each leaf node.

Proposition 1 adds the condition that we know v^{⟨σ*_1, BR(σ*_1)⟩}_2(I) for every root infoset I ∈ S. This condition is used if S does not begin at the start of the game. Knowledge of v^{⟨σ*_1, BR(σ*_1)⟩}_2(I) is needed to ensure that any strategy σ_1 that P1 computes in S cannot be exploited by P2 changing their strategy earlier in the game. Specifically, we add a constraint that v^{⟨σ_1, BR(σ*_1)⟩}_2(I) ≤ v^{⟨σ*_1, BR(σ*_1)⟩}_2(I) for all P2 root infosets I. This makes our technique safe:

Proposition 1. Assume P1 has played according to Nash equilibrium strategy σ*_1 prior to reaching a depth-limited subgame S of a two-player zero-sum game. In order to calculate the portion of a P1 Nash equilibrium strategy that is in S, it is sufficient to know v^{⟨σ*_1, BR(σ*_1)⟩}_2(I) for every root P2 infoset I ∈ S and v^{⟨σ*_1, σ_2⟩}_2(h) for every pure undominated P2 strategy σ_2 and every leaf node h ∈ S.

Other safe subgame solving techniques have been developed in recent papers, but those techniques require solving to the end of the full game [7, 17, 28, 5, 6] (except one [27], which we will compare to in Section 7).

Of course, it is impractical to know the expected value in every state of every pure P2 strategy against σ*_1, especially since we do not know σ*_1 itself. To deal with this, we first compute a blueprint strategy σ̂* (that is, a precomputed approximate Nash equilibrium for the full game).
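The RPS+ example from Section 2 illustrates why this works. If P2 may choose among the pure continuation strategies at the depth limit, the one-step subgame again carries enough information to recover the robust 40/40/20 strategy. The sketch below solves that augmented subgame with regret matching; it is a toy stand-in for the equilibrium solvers used in the paper:

```python
# Leaf value to P1 when P2's depth-limit choice is pure continuation j.
A = [[0, -1, 2], [1, 0, -2], [-2, 2, 0]]

def match(regrets):
    """Regret matching: play in proportion to positive regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / len(regrets)] * len(regrets)

r1, r2, avg1 = [0.0] * 3, [0.0] * 3, [0.0] * 3
T = 200000
for _ in range(T):
    s1, s2 = match(r1), match(r2)
    avg1 = [a + p for a, p in zip(avg1, s1)]
    # Expected value of each action against the opponent's current strategy.
    util1 = [sum(A[i][j] * s2[j] for j in range(3)) for i in range(3)]
    util2 = [-sum(s1[i] * A[i][j] for i in range(3)) for j in range(3)]
    v1 = sum(p * u for p, u in zip(s1, util1))
    v2 = sum(p * u for p, u in zip(s2, util2))
    r1 = [r + u - v1 for r, u in zip(r1, util1)]
    r2 = [r + u - v2 for r, u in zip(r2, util2)]

strategy = [a / T for a in avg1]
print([round(p, 2) for p in strategy])  # approaches [0.4, 0.4, 0.2]
```

The average strategy of regret matching converges to a Nash equilibrium in two-player zero-sum games, so unlike the single-value subgame of Figure 1b, this multi-valued subgame pins down the robust strategy.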
Next, rather than consider every pure P2 strategy, we instead consider just a small number of different P2 strategies (that may or may not be pure). Indeed, in many complex games, the possible opponent strategies at a decision point can be approximately grouped into just a few "meta-strategies", such as which highway lane a car will choose in a driving simulation. In our experiments, we find that excellent performance is obtained in poker with fewer than ten opponent strategies. In part, excellent performance is possible with a small number of strategies because the choice of strategy beyond the depth limit is made separately at each leaf infoset. Thus, if the opponent chooses between ten strategies at the depth limit, but makes this choice independently in each of 100 leaf infosets, then the opponent is actually choosing between 10^100 different strategies. We now consider two questions. First, how do we compute the blueprint strategy σ̂*_1? Second, how do we determine the set of P2 strategies? We answer each of these in turn.

There exist several methods for constructing a blueprint. One option, which achieves the best empirical results and is what we use, involves first abstracting the game by bucketing together similar situations [19, 12] and then applying the iterative algorithm Monte Carlo Counterfactual Regret Minimization [22]. Several alternatives exist that do not use a distinct abstraction step [3, 16, 10]. The agent will never actually play according to the blueprint σ̂*. It is only used to estimate v^{⟨σ̂*_1, σ_2⟩}(h).

We now discuss two different ways to select a set of P2 strategies. Ultimately we would like the set of P2 strategies to contain a diverse set of intelligent strategies the opponent might play, so that P1's solution in a subgame is robust to possible P2 adaptation.
One option is to bias the P2 blueprint strategy σ̂*_2 in a few different ways. For example, in poker the blueprint strategy should be a mixed strategy involving some probability of folding, calling, or raising. We could define a new strategy σ′_2 in which the probability of folding is multiplied by 10 (and then all the probabilities renormalized). If the blueprint strategy σ̂* were an exact Nash equilibrium, then any such "biased" strategy σ′_2 in which the probabilities are arbitrarily multiplied would still be a best response to σ̂*_1. In our experiments, we use this biasing of the blueprint strategy to construct a set of four opponent strategies on the second betting round. We refer to this as the bias approach.

Another option is to construct the set of P2 strategies via self-play. The set begins with just one P2 strategy: the blueprint strategy σ̂*_2. We then solve a depth-limited subgame rooted at the start of the game and going to whatever depth is feasible to solve, giving P2 only the choice of this P2 strategy at leaf infosets. That is, at leaf node h we simply substitute v^{σ̂*}_i(h) for Pi. Let the P1 solution to this depth-limited subgame be σ_1. We then approximate a P2 best response assuming P1 plays according to σ_1 in the depth-limited subgame and according to σ̂*_1 in the remainder of the game. Since P1 plays according to this fixed strategy, approximating a P2 best response is equivalent to solving a Markov Decision Process, which is far easier to solve than an imperfect-information game. This P2 approximate best response is added to the set of strategies that P2 may choose at the depth limit, and the depth-limited subgame is solved again.
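On RPS+, the generation loop just described can be sketched end to end. The code below is a toy analogue (regret matching on the matrix game stands in for depth-limited solving, and "always Rock" stands in for the blueprint); on a game this small the loop reduces to something resembling the double oracle algorithm:

```python
A = [[0, -1, 2], [1, 0, -2], [-2, 2, 0]]  # payoff to P1 in RPS+

def match(regrets):
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / len(regrets)] * len(regrets)

def solve_restricted(cols, T=50000):
    """P1's average strategy when P2 may only mix over the continuation
    strategies in `cols`, via regret matching for both players."""
    r1, r2, avg1 = [0.0] * 3, [0.0] * len(cols), [0.0] * 3
    for _ in range(T):
        s1, s2 = match(r1), match(r2)
        avg1 = [a + p for a, p in zip(avg1, s1)]
        util1 = [sum(A[i][c] * s2[k] for k, c in enumerate(cols)) for i in range(3)]
        util2 = [-sum(s1[i] * A[i][c] for i in range(3)) for c in cols]
        v1 = sum(p * u for p, u in zip(s1, util1))
        v2 = sum(p * u for p, u in zip(s2, util2))
        r1 = [r + u - v1 for r, u in zip(r1, util1)]
        r2 = [r + u - v2 for r, u in zip(r2, util2)]
    return [a / T for a in avg1]

cols = [0]  # start with a single "blueprint" continuation: always Rock
for _ in range(3):
    s1 = solve_restricted(cols)
    # P2's best response over ALL continuations; easy because P1 is fixed.
    br = min(range(3), key=lambda j: sum(s1[i] * A[i][j] for i in range(3)))
    if br not in cols:
        cols.append(br)

print(sorted(cols))               # the generated strategy set grows to all three
print([round(p, 2) for p in s1])  # P1's solution approaches [0.4, 0.4, 0.2]
```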
This process repeats until the set of P2 strategies grows to the desired size. This self-generative approach bears some resemblance to the double oracle algorithm [26] and recent work on generation of opponent strategies in multi-agent RL [23]. In our experiments, we use this self-generative method to construct a set of ten opponent strategies on the first betting round. We refer to this as the self-generative approach.

One practical consideration is that since σ̂*_1 is not an exact Nash equilibrium, a generated P2 strategy σ_2 may do better than σ̂*_2 against σ̂*_1. In that case, P1 may play more conservatively than σ*_1 in a depth-limited subgame. To correct for this, one can balance the players by also giving P1 a choice between multiple strategies for the remainder of the game at the depth limit. Alternatively, one can "weaken" the generated P2 strategies so that they do no better than σ̂*_2 against σ̂*_1. Formally, if v^{⟨σ̂*_1, σ_2⟩}_2(I) > v^{⟨σ̂*_1, σ̂*_2⟩}_2(I), we uniformly lower v^{⟨σ̂*_1, σ_2⟩}_2(h) for h ∈ I by v^{⟨σ̂*_1, σ_2⟩}_2(I) − v^{⟨σ̂*_1, σ̂*_2⟩}_2(I). Another alternative (or additional) solution would be to simply reduce v^{⟨σ̂*_1, σ_2⟩}_2(h) for σ_2 ≠ σ̂*_2 by some heuristic amount, such as a small percentage of the pot in poker.

Once a P1 strategy σ̂*_1 and a set of P2 strategies have been generated, we need some way to calculate and store v^{⟨σ̂*_1, σ_2⟩}(h). Calculating the state values can be done by traversing the entire game tree once. However, that may not be feasible in large games. Instead, one can use Monte Carlo simulations to approximate the values. For storage, if the number of states is small (such as in the early part of the game tree), one could simply store the values in a table. More generally, one could train a function to predict the values corresponding to a state, taking as input a description of the state and outputting a value for each P2 strategy. Alternatively, one could simply store σ̂*_1 and the set of P2 strategies. Then, in real time, the value of a state could be estimated via Monte Carlo rollouts. We present results for both of these approaches in Section 6.

5 Nested Solving of Imperfect-Information Games

We use the new idea discussed in the previous section in the context of nested solving, which is a way to repeatedly solve subgames as play descends down the game tree [5]. Whenever an opponent chooses an action, a subgame is generated following that action. This subgame is solved, and its solution determines the strategy to play until the next opponent action is taken.

Nested solving is particularly useful in dealing with large or continuous action spaces, such as an auction that allows any bid in dollar increments up to $10,000. To make these games feasible to solve, it is common to apply action abstraction, in which the game is simplified by considering only a few actions (both for ourselves and for the opponent) in the full action space. For example, an action abstraction might only consider bid increments of $100. However, if the opponent chooses an action that is not in the action abstraction (called an off-tree action), the optimal response to that opponent action is undefined.

Prior to the introduction of nested solving, it was standard to simply round off-tree actions to a nearby in-abstraction action (such as treating an opponent bid of $150 as a bid of $200) [14, 34, 11].
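For context on what that rounding looks like, the randomized pseudoharmonic mapping of Ganzfried and Sandholm [11] rounds an off-tree bet x (in units of the pot) to one of the two surrounding abstraction sizes A < x < B at random. A sketch based on the formula from that cited work (treat the details as illustrative, not as any agent's exact implementation):

```python
import random

def prob_round_down(A, B, x):
    """Pseudoharmonic probability of mapping off-tree bet x to the smaller
    abstraction bet A; the remaining probability mass goes to B."""
    return ((B - x) * (1 + A)) / ((B - A) * (1 + x))

def translate(A, B, x, rng=random):
    return A if rng.random() < prob_round_down(A, B, x) else B

# A 0.75-pot bet in a {0.5x, 1x} abstraction maps to 0.5x with
# probability 3/7, i.e. about 0.43.
print(prob_round_down(0.5, 1.0, 0.75))
```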
Nested solving allows a response to be calculated for off-tree actions by constructing and solving a subgame that immediately follows that action. The goal is to find a strategy in the subgame that makes the opponent no better off for having chosen the off-tree action than an action already in the abstraction.

Depth-limited solving makes nested solving feasible even in the early game, so it is possible to play without acting according to a precomputed strategy or using action translation. At the start of the game, we solve a depth-limited subgame (using action abstraction) to whatever depth is feasible. This determines our first action. After every opponent action, we solve a new depth-limited subgame that attempts to make the opponent no better off for having chosen that action than an action that was in our previous subgame's action abstraction. This new subgame determines our next action, and so on.

6 Experiments

We conducted experiments on the games of heads-up no-limit Texas hold'em poker (HUNL) and heads-up no-limit flop hold'em poker (NLFH). Appendix B reminds the reader of the rules of these games. HUNL is the main large-scale benchmark for imperfect-information game AIs. NLFH is similar to HUNL, except the game ends immediately after the second betting round, which makes it small enough to precisely calculate best responses and Nash equilibria. Performance is measured in terms of mbb/g, which is a standard win rate measure in the literature. It stands for milli-big blinds per game and represents how many thousandths of a big blind (the initial money a player must commit to the pot) a player wins on average per hand of poker played.

6.1 Exploitability Experiments in No-Limit Flop Hold'em (NLFH)

Our first experiment measured the exploitability of our technique in NLFH. Exploitability of a strategy in a two-player zero-sum game is how much worse the strategy would do against a best response than a Nash equilibrium strategy would do against a best response. Formally, the exploitability of σ_1 is min_{σ_2} u_1(σ*_1, σ_2) − min_{σ_2} u_1(σ_1, σ_2), where σ*_1 is a Nash equilibrium strategy.

We considered the case of P1 betting 0.75× the pot at the start of the game, when the action abstraction only contains bets of 0.5× and 1× the pot. We compared our depth-limited solving technique to the randomized pseudoharmonic action translation (RPAT) [11], in which the bet of 0.75× is simply treated as either a bet of 0.5× or 1×. RPAT is the lowest-exploitability known technique for responding to off-tree actions that does not involve real-time computation.

We began by calculating an approximate Nash equilibrium in an action abstraction that does not include the 0.75× bet. This was done by running the CFR+ equilibrium-approximation algorithm [38] for 1,000 iterations, which resulted in less than 1 mbb/g of exploitability within the action abstraction. Next, values for the states at the end of the first betting round within the action abstraction were determined using the self-generative method discussed in Section 4. Since the first betting round is a small portion of the entire game, storing a value for each state in a table required just 42 MB.

To determine a P2 strategy in response to the 0.75× bet, we constructed a depth-limited subgame rooted after the 0.75× bet with leaf nodes at the end of the first betting round. The values of a leaf node in this subgame were set by first determining the in-abstraction leaf nodes corresponding to the exact same sequence of actions, except P1 initially bets 0.5× or 1× the pot. The leaf node values in the 0.75× subgame were set to the average of those two corresponding value vectors.
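The exploitability measure used throughout this section can be made concrete on the small RPS+ matrix game from Section 2 (the paper computes it in NLFH; this sketch only illustrates the formula, where a pure best response suffices for the inner minimization):

```python
A = [[0, -1, 2], [1, 0, -2], [-2, 2, 0]]  # payoff to P1 in RPS+
EQ = [0.4, 0.4, 0.2]                      # a Nash equilibrium strategy

def worst_case(s1):
    """min over sigma_2 of u_1(s1, sigma_2)."""
    return min(sum(s1[i] * A[i][j] for i in range(3)) for j in range(3))

def exploitability(s1):
    return worst_case(EQ) - worst_case(s1)

print(exploitability(EQ))               # 0.0: an equilibrium is unexploitable
print(exploitability([1.0, 0.0, 0.0]))  # 1.0: "always Rock" loses 1 per game to Paper
```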
When the end of the first betting round was reached and the board cards were dealt, the remaining game was solved using safe subgame solving.

Figure 2 shows how exploitability decreases as we add state values (that is, as we give P1 more best responses to choose from at the depth limit). When using only one state value at the depth limit (that is, assuming P1 would always play according to the blueprint strategy for the remainder of the game), it is actually better to use RPAT. However, after that our technique becomes significantly better and at 16 values its performance is close to having had the 0.75× action in the abstraction in the first place. While one could have calculated a (slightly better) P2 strategy in response to the 0.75× bet by solving to the end of the game, that subgame would have been about 10,000× larger than the subgames solved in this experiment. Thus, depth-limited solving dramatically reduces the computational cost of nested subgame solving while giving up very little solution quality.

Figure 2: Exploitability of depth-limited solving in response to an opponent off-tree action as a function of number of state values. We compare to action translation and to having had the off-tree action included in the action abstraction (which is a lower bound on the exploitability achievable with 1,000 iterations of CFR+).

6.2 Experiments Against Top AIs in Heads-Up No-Limit Texas Hold'em (HUNL)

Our main experiment uses depth-limited solving to produce a master-level HUNL poker AI called Modicum using computing resources found in a typical laptop. We test Modicum against Baby Tartanian8 [4], the winner of the 2016 Annual Computer Poker Competition, and against Slumbot [18], the winner of the 2018 Annual Computer Poker Competition. Neither Baby Tartanian8 nor Slumbot uses real-time computation; their strategies are precomputed lookup tables.
Baby Tartanian8 used about 2 million core hours and 18 TB of RAM to compute its strategy. Slumbot used about 250,000 core hours and 2 TB of RAM to compute its strategy. In contrast, Modicum used just 700 core hours and 16 GB of RAM to compute its strategy and can play in real time at the speed of human professionals (an average of 20 seconds for an entire hand of poker) using just a 4-core CPU. We now describe Modicum and provide details of its construction in Appendix A.
The blueprint strategy for Modicum was constructed by first generating an abstraction of HUNL using state-of-the-art abstraction techniques [12, 20]. Storing a strategy for this abstraction as 4-byte floats requires just 5 GB. This abstraction was approximately solved by running Monte Carlo Counterfactual Regret Minimization (MCCFR) for 700 core hours [22].
HUNL consists of four betting rounds. We conduct depth-limited solving on the first two rounds by solving to the end of that round using MCCFR. Once the third betting round is reached, the remaining game is small enough that we solve to the end of the game using an enhanced form of CFR+ described in the appendix.
We generated 10 values for each state at the end of the first betting round using the self-generative approach. The first betting round was small enough to store all of these state values in a table using 240 MB. For the second betting round, we used the bias approach to generate four opponent best responses. The first best response is simply the opponent's blueprint strategy. For the second, we biased the opponent's blueprint strategy toward folding by multiplying the probability of fold actions by 10 and then renormalizing. For the third, we biased the opponent's blueprint strategy toward checking and calling. Finally, for the fourth, we biased the opponent's blueprint strategy toward betting and raising.
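The bias-and-renormalize step can be sketched as follows. The ×10 factor comes from the text above; the action-probability dictionary and helper name are hypothetical stand-ins for the blueprint strategy at one decision point:

```python
def bias_strategy(action_probs, biased_actions, factor=10.0):
    """Multiply the probability of the chosen actions by `factor`,
    then renormalize so the probabilities sum to 1."""
    scaled = {a: p * (factor if a in biased_actions else 1.0)
              for a, p in action_probs.items()}
    total = sum(scaled.values())
    return {a: p / total for a, p in scaled.items()}

# Toy blueprint at one decision point, biased toward folding.
blueprint = {"fold": 0.1, "call": 0.5, "raise": 0.4}
fold_biased = bias_strategy(blueprint, {"fold"})
print(fold_biased)  # fold becomes 1.0/1.9 ≈ 0.526; call and raise shrink proportionally
```

The same helper would produce the check/call-biased and bet/raise-biased responses by passing a different `biased_actions` set.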
To estimate the values of a state when the depth limit is reached on the second round, we sample rollouts of each of the stored best-response strategies.
The performance of Modicum is shown in Table 1. For the evaluation, we used AIVAT to reduce variance [8]. Our new agent defeats both Baby Tartanian8 and Slumbot with statistical significance. For comparison, Baby Tartanian8 defeated Slumbot by 36 ± 12 mbb/g, Libratus defeated Baby Tartanian8 by 63 ± 28 mbb/g, and Libratus defeated top human professionals by 147 ± 77 mbb/g.
In addition to head-to-head performance against prior top AIs, we also tested Modicum against two versions of Local Best Response (LBR) [25]. An LBR agent is given full access to its opponent's full-game strategy and uses that knowledge to exactly calculate the probability that the LBR agent is in each possible state. Given that probability distribution and a heuristic for how the opposing agent will play thereafter, the LBR agent chooses a best response action. LBR is a way to calculate a lower bound on exploitability and has been shown to be effective in exploiting agents that do not use real-time solving.

                                    Baby Tartanian8    Slumbot
  Blueprint (No real-time solving)  −57 ± 13           −11 ± 8
  Naïve depth-limited solving       −10 ± 8            −1 ± 15
  Depth-limited solving             6 ± 5              11 ± 9

Table 1: Head-to-head performance of our new agent against Baby Tartanian8 and Slumbot with 95% confidence intervals shown. Our new agent defeats both opponents with statistical significance.
Na\u00efve depth-limited solving\nmeans states are assumed to have just a single value, which is determined by the blueprint strategy.\n\nIn the \ufb01rst version of LBR we tested against, the LBR agent was limited to either folding or betting\n0.75\u00d7 the pot on the \ufb01rst action, and thereafter was limited to either folding or calling. Modicum\nbeat this version of LBR by 570 \u00b1 42 mbb/g. The second version of LBR we tested against could bet\n10 different amounts on the \ufb02op that Modicum did not include in its blueprint strategy. Much like the\nexperiment in Section 6.1, this was intended to measure how vulnerable Modicum is to unanticipated\nbet sizes. The LBR agent was limited to betting 0.75\u00d7 the pot for the \ufb01rst action of the game and\ncalling for the remaining actions on the pre\ufb02op. On the \ufb02op, the LBR agent could either fold, call,\nor bet 0.33 \u00d7 2x times the pot for x \u2208 {0, 1, ..., 10}. On the remaining rounds the LBR agent could\neither fold or call. Modicum beat this version of LBR by 1377 \u00b1 115 mbb/g. In contrast, similar\nforms of LBR have been shown to defeat prior top poker AIs that do not use real-time solving by\nhundreds or thousands of mbb/g [25].\nWhile our new agent is probably not as strong as Libratus, it was produced with less than 0.1% of the\ncomputing resources and memory, and is never vulnerable to off-tree opponent actions.\nWhile the rollout method used on the second betting round worked well, rollouts may be signi\ufb01cantly\nmore expensive in deeper games. To demonstrate the generality of our approach, we also trained a\ndeep neural network (DNN) to predict the values of states at the end of the second betting round as an\nalternative to using rollouts. 
The DNN takes as input a 34-\ufb02oat vector of features describing the state,\nand outputs four \ufb02oats representing the values of the state for the four possible opponent strategies\n(represented as a fraction of the size of the pot). The DNN was trained using 180 million examples\nper player by optimizing the Huber loss with Adam [21], which we implemented using PyTorch [32].\nIn order for the network to run suf\ufb01ciently fast on just a 4-core CPU, the DNN has just 4 hidden\nlayers with 256 nodes in the \ufb01rst hidden layer and 128 nodes in the remaining hidden layers. This\nachieved a Huber loss of 0.02. Using a DNN rather than rollouts resulted in the agent beating Baby\nTartanian8 by 2 \u00b1 9 mbb/g. However, the average time taken using a 4-core CPU increased from 20\nseconds to 31 seconds per hand. Still, these results demonstrate the generality of our approach.\n\n7 Comparison to Prior Work\n\nSection 2 demonstrated that in imperfect-information games, states do not have unique values and\ntherefore the techniques common in perfect-information games and single-agent settings do not\napply. This paper introduced a way to overcome this challenge by assigning multiple values to\nstates. A different approach is to modify the de\ufb01nition of a \u201cstate\u201d to instead be all players\u2019 belief\nprobability distributions over states, which we refer to as a joint belief state. This technique was\npreviously used to develop the poker AI DeepStack [27]. While DeepStack defeated non-elite human\nprofessionals in HUNL, it was never shown to defeat prior top AIs even though it used over 1,000,000\ncore hours of computation. In contrast, Modicum defeated two prior top AIs with less than 1,000\ncore hours of computation. Still, there are bene\ufb01ts and drawbacks to both approaches, which we now\ndescribe in detail. 
The right choice may depend on the domain, and future research may change the competitiveness of either approach.
A joint belief state is defined by a probability (belief) distribution for each player over states that are indistinguishable to the player. In poker, for example, a joint belief state is defined by each player's belief about what cards the other players are holding. Joint belief states maintain some of the properties that regular states have in perfect-information games. In particular, it is possible to determine an optimal strategy in a subgame rooted at a joint belief state independently from the rest of the game. Therefore, joint belief states have unique, well-defined values that are not influenced by the strategies played in disjoint portions of the game tree. Given a joint belief state, it is also possible to define the value of each root infoset for each player. In the example of poker, this would be the value of a player holding a particular poker hand given the joint belief state.
One way to do depth-limited subgame solving, other than the method we describe in this paper, is to learn a function that maps joint belief states to infoset values. When conducting depth-limited solving, one could then set the value of a leaf infoset based on the joint belief state at that leaf infoset. One drawback is that because a player's belief distribution partly defines a joint belief state, the values of the leaf infosets must be recalculated each time the strategy in the subgame changes. With the best domain-specific iterative algorithms, this would require recalculating the leaf infosets about 500 times. Monte Carlo algorithms, which are the preferred domain-independent method of solving imperfect-information games, may change the strategy millions of times in a subgame, which poses a problem for the joint belief state approach.
In contrast, our multi-valued state approach requires only a single function call for each leaf node regardless of the number of iterations conducted.
Moreover, evaluating multi-valued states with a function approximator is cheaper and more scalable to large games than joint belief states. The input to a function that predicts the value of a multi-valued state is simply the state description (for example, the sequence of actions), and the output is several values. In our experiments, the input was 34 floats and the output was 4 floats. In contrast, the input to a function that predicts the values of a joint belief state is a probability vector for each player over the possible states they may be in. For example, in HUNL, the input is more than 2,000 floats and the output is more than 1,000 floats. The input would be even larger in games with more states per infoset.
Another drawback is that learning a mapping from joint belief states to infoset values is computationally more expensive than learning a mapping from states to a set of values. For example, Modicum required less than 1,000 core hours to create this mapping. In contrast, DeepStack required over 1,000,000 core hours to create its mapping. The increased cost is partly because computing training data for a joint belief state value mapping is inherently more expensive. The multi-valued states approach is learning the values of best responses to a particular strategy (namely, the approximate Nash equilibrium strategy σ̂*_1). In contrast, a joint belief state value mapping is learning the value of all players playing an equilibrium strategy given that joint belief state. As a rough guideline, computing an equilibrium is about 1,000× more expensive than computing a best response in large games [1].
On the other hand, the multi-valued state approach requires knowledge of a blueprint strategy that is already an approximate Nash equilibrium.
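The scaling comparison above can be made concrete with a back-of-the-envelope calculation, assuming each player's belief distribution ranges over the 1,326 possible two-card hold'em hands (the paper's exact feature layout may differ slightly):

```python
from math import comb

hands = comb(52, 2)      # possible private two-card hands per player
print(hands)             # → 1326

# Multi-valued state approach (this paper): state features in,
# one value per stored opponent strategy out.
multi_in, multi_out = 34, 4

# Joint belief state approach: a belief vector per player in,
# one value per root infoset (hand) out.
joint_in = 2 * hands     # 2652 floats, consistent with "more than 2,000"
joint_out = hands        # 1326 floats, consistent with "more than 1,000"
print(joint_in, joint_out)  # → 2652 1326
```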
A bene\ufb01t of the joint belief state approach is that rather\nthan simply learning best responses to a particular strategy, it is learning best responses against every\npossible strategy. This may be particularly useful in self-play settings where the blueprint strategy is\nunknown, because it may lead to increasingly more sophisticated strategies.\nAnother bene\ufb01t of the joint belief state approach is that in many games (but not all) it obviates the\nneed to keep track of the sequence of actions played. For example, in poker if there are two different\nsequences of actions that result in the same amount of money in the pot and all players having the\nsame belief distribution over what their opponents\u2019 cards are, then the optimal strategy in both of\nthose situations is the same. This is similar to how in Go it is not necessary to know the exact\nsequence of actions that were played. Rather, it is only necessary to know the current con\ufb01guration\nof the board (and, in certain situations, also the last few actions played).\nA further bene\ufb01t of the joint belief state approach is that its run-time complexity does not increase\nwith the degree of precision other than needing a better (possibly more computationally expensive)\nfunction approximator. In contrast, for our algorithm the computational complexity of \ufb01nding a\nsolution to a depth-limited subgame grows linearly with the number of values per state.\n\n8 Conclusions\n\nWe introduced a principled method for conducting depth-limited solving in imperfect-information\ngames. Experimental results show that this leads to stronger performance than the best precomputed-\nstrategy AIs in HUNL while using orders of magnitude less computational resources, and is also\norders of magnitude more ef\ufb01cient than past approaches that use real-time solving. Additionally, the\nmethod exhibits low exploitability. 
In addition to using fewer resources, this approach broadens the applicability of nested real-time solving to longer games.

9 Acknowledgments

This material is based on work supported by the National Science Foundation under grants IIS-1718457, IIS-1617590, and CCF-1733556, and the ARO under award W911NF-17-1-0082, as well as XSEDE computing resources provided by the Pittsburgh Supercomputing Center. We thank Thore Graepel, Marc Lanctot, David Silver, Ariel Procaccia, Fei Fang, and our anonymous reviewers for helpful inspiration, feedback, suggestions, and support.

References

[1] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, January 2015.

[2] Noam Brown, Sam Ganzfried, and Tuomas Sandholm. Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit Texas hold'em agent. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 7–15. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

[3] Noam Brown and Tuomas Sandholm. Simultaneous abstraction and equilibrium finding in games. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2015.

[4] Noam Brown and Tuomas Sandholm. Baby Tartanian8: Winning agent from the 2016 Annual Computer Poker Competition. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), pages 4238–4239, 2016.

[5] Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems, pages 689–699, 2017.

[6] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals.
Science, page eaao1733, 2017.

[7] Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In AAAI Conference on Artificial Intelligence (AAAI), pages 602–608, 2014.

[8] Neil Burch, Martin Schmid, Matej Moravčík, and Michael Bowling. AIVAT: A new variance reduction technique for agent evaluation in imperfect information games. 2016.

[9] Murray Campbell, A. Joseph Hoane, and Feng-Hsiung Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.

[10] Jiri Cermak, Viliam Lisy, and Branislav Bosansky. Constructing imperfect recall abstractions to solve large extensive-form games. arXiv preprint arXiv:1803.05392, 2018.

[11] Sam Ganzfried and Tuomas Sandholm. Action translation in extensive-form games with large action spaces: axioms, paradoxes, and the pseudo-harmonic mapping. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 120–128. AAAI Press, 2013.

[12] Sam Ganzfried and Tuomas Sandholm. Potential-aware imperfect-recall abstraction with earth mover's distance in imperfect-information games. In AAAI Conference on Artificial Intelligence (AAAI), 2014.

[13] Sam Ganzfried and Tuomas Sandholm. Endgame solving in large imperfect-information games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 37–45, 2015.

[14] Andrew Gilpin, Tuomas Sandholm, and Troels Bjerre Sørensen. A heads-up no-limit Texas hold'em poker player: discretized betting models and automatically generated equilibrium-finding programs. In Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 911–918. International Foundation for Autonomous Agents and Multiagent Systems, 2008.

[15] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael.
Correction to \"a formal basis for the heuristic\n\ndetermination of minimum cost paths\". ACM SIGART Bulletin, (37):28\u201329, 1972.\n\n[16] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-\n\ninformation games. arXiv preprint arXiv:1603.01121, 2016.\n\n[17] Eric Jackson. A time and space ef\ufb01cient algorithm for approximately solving large imperfect\ninformation games. In AAAI Workshop on Computer Poker and Imperfect Information, 2014.\n\n[18] Eric Jackson. Targeted CFR. In AAAI Workshop on Computer Poker and Imperfect Information,\n\n2017.\n\n[19] Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract\nstrategies in extensive-form games. In Proceedings of the Twenty-Sixth AAAI Conference on\nArti\ufb01cial Intelligence, pages 1371\u20131379. AAAI Press, 2012.\n\n[20] Michael Johanson, Neil Burch, Richard Valenzano, and Michael Bowling. Evaluating state-space\nabstractions in extensive-form games. In Proceedings of the 2013 International Conference\non Autonomous Agents and Multiagent Systems, pages 271\u2013278. International Foundation for\nAutonomous Agents and Multiagent Systems, 2013.\n\n[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[22] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling\nfor regret minimization in extensive games. In Proceedings of the Annual Conference on Neural\nInformation Processing Systems (NIPS), pages 1078\u20131086, 2009.\n\n[23] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Julien Perolat, David\nSilver, Thore Graepel, et al. A uni\ufb01ed game-theoretic approach to multiagent reinforcement\nlearning. In Advances in Neural Information Processing Systems, pages 4193\u20134206, 2017.\n\n[24] Shen Lin. Computer solutions of the traveling salesman problem. 
The Bell System Technical Journal, 44(10):2245–2269, 1965.

[25] Viliam Lisy and Michael Bowling. Equilibrium approximation quality of current no-limit poker bots. arXiv preprint arXiv:1612.07547, 2016.

[26] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 536–543, 2003.

[27] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 2017.

[28] Matej Moravcik, Martin Schmid, Karel Ha, Milan Hladik, and Stephen Gaukrodger. Refining subgames in large imperfect information games. In AAAI Conference on Artificial Intelligence (AAAI), 2016.

[29] John Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36:48–49, 1950.

[30] Allen Newell and George Ernst. The search for generality. In Proc. IFIP Congress, volume 65, pages 17–24, 1965.

[31] Nils Nilsson. Problem-Solving Methods in Artificial Intelligence. McGraw-Hill, 1971.

[32] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[33] Arthur L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.

[34] David Schnizlein, Michael Bowling, and Duane Szafron. Probabilistic state translation in extensive games with large action sets. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, pages 278–284, 2009.

[35] Claude E. Shannon. Programming a computer for playing chess.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, 1950.

[36] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[37] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[38] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 645–652, 2015.

[39] Gerald Tesauro. Programming backgammon using self-teaching neural nets. Artificial Intelligence, 134(1-2):181–199, 2002.