{"title": "Regret-Based Pruning in Extensive-Form Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1972, "page_last": 1980, "abstract": "Counterfactual Regret Minimization (CFR) is a leading algorithm for finding a Nash equilibrium in large zero-sum imperfect-information games. CFR is an iterative algorithm that repeatedly traverses the game tree, updating regrets at each information set.We introduce an improvement to CFR that prunes any path of play in the tree, and its descendants, that has negative regret. It revisits that sequence at the earliest subsequent CFR iteration where the regret could have become positive, had that path been explored on every iteration. The new algorithm maintains CFR's convergence guarantees while making iterations significantly faster---even if previously known pruning techniques are used in the comparison. This improvement carries over to CFR+, a recent variant of CFR. Experiments show an order of magnitude speed improvement, and the relative speed improvement increases with the size of the game.", "full_text": "Regret-Based Pruning in Extensive-Form Games\n\nNoam Brown\n\nComputer Science Department\nCarnegie Mellon University\n\nPittsburgh, PA 15217\nnoamb@cmu.edu\n\nTuomas Sandholm\n\nComputer Science Department\nCarnegie Mellon University\n\nPittsburgh, PA 15217\n\nsandholm@cs.cmu.edu\n\nAbstract\n\nCounterfactual Regret Minimization (CFR) is a leading algorithm for \ufb01nding a\nNash equilibrium in large zero-sum imperfect-information games. CFR is an it-\nerative algorithm that repeatedly traverses the game tree, updating regrets at each\ninformation set. We introduce an improvement to CFR that prunes any path of play\nin the tree, and its descendants, that has negative regret. It revisits that sequence\nat the earliest subsequent CFR iteration where the regret could have become posi-\ntive, had that path been explored on every iteration. 
The new algorithm maintains CFR's convergence guarantees while making iterations significantly faster—even if previously known pruning techniques are used in the comparison. This improvement carries over to CFR+, a recent variant of CFR. Experiments show an order of magnitude speed improvement, and the relative speed improvement increases with the size of the game.\n\n1 Introduction\n\nExtensive-form imperfect-information games are a general model for strategic interaction. The last ten years have witnessed a leap of several orders of magnitude in the size of two-player zero-sum extensive-form imperfect-information games that can be solved to (near-)equilibrium [11][2][6]. This is the game class that this paper focuses on. For small games, a linear program (LP) can find a solution (that is, a Nash equilibrium) to the game in polynomial time, even in the presence of imperfect information. However, today's leading LP solvers only scale to games with around 10^8 nodes in the game tree [4]. Instead, iterative algorithms are used to approximate solutions for larger games. There are a variety of such iterative algorithms that are guaranteed to converge to a solution [5, 3, 10]. Among these, Counterfactual Regret Minimization (CFR) [16] has emerged as the most popular, and CFR+ as the state-of-the-art variant thereof [13, 14].\nCFR begins by exploring the entire game tree (though sampling variants exist as well [9]) and calculating regret for every hypothetical situation in which the player could be. A key improvement that makes CFR practical in large games is pruning. At a high level, pruning allows the algorithm to avoid traversing the entire game tree while still maintaining the same convergence guarantees. 
The classic version of pruning, which we will refer to as partial pruning, allows the algorithm to skip updates for a player in a sequence if the other player's current strategy does not reach the sequence with positive probability. This dramatically reduces the cost of each iteration. The magnitude of this reduction varies considerably depending on the game, but can easily be higher than 90% [9], which improves the convergence speed of the algorithm by a factor of 10. Moreover, the benefit of partial pruning empirically seems to be more significant as the size of the game increases.\nWhile partial pruning leads to a large gain in speed, we observe that there is still room for much larger speed improvement. Partial pruning only skips updates for a player if an opponent's action in the path leading to that point has zero probability. This can fail to prune paths that are actually prunable. Consider a game where the first player to act (Player 1) has hundreds of actions to choose from, and where, over several iterations, the reward received from many of them is extremely poor. Intuitively, we should be able to spend less time updating the strategy for Player 1 following these poor actions, and more time on the actions that proved worthwhile so far. However, here, partial pruning will continue to update Player 1's strategy following each action in every iteration.\nIn this paper we introduce a better version of pruning, regret-based pruning (RBP), in which CFR can avoid traversing a path in the game tree if either player takes actions leading to that path with zero probability. This pruning needs to be temporary, because the probabilities may change later in the CFR iterations, so the reach probability may turn positive later on. The number of CFR iterations during which a sequence can be skipped depends on how poorly the sequence has performed in previous CFR iterations. 
More specifically, the number of iterations that an action can be pruned is proportional to how negative the regret is for that action. We will detail these topics in this paper.\nRBP can lead to a dramatic improvement depending on the game. As a rough example, consider a game in which each player has very negative regret for actions leading to 90% of nodes. Partial pruning, which skips updates for a player when the opponent does not reach the node, would traverse 10% of the game tree per iteration. In contrast, regret-based pruning, which skips updates when either player does not reach the node, would traverse only 0.1 · 0.1 = 1% of the game tree. In general, RBP roughly squares the performance gain of partial pruning.\nWe test RBP with CFR and CFR+. Experiments show that it leads to more than an order of magnitude speed improvement over partial pruning. The benefit increases with the size of the game.\n\n2 Background\n\nIn this section we present the notation used in the rest of the paper. In an imperfect-information extensive-form game there is a finite set of players, P. H is the set of all possible histories (nodes) in the game tree, represented as a sequence of actions, and includes the empty history. A(h) is the actions available in a history and P(h) ∈ P ∪ c is the player who acts at that history, where c denotes chance. Chance plays an action a ∈ A(h) with a fixed probability σ_c(h, a) that is known to all players. The history h′ reached after an action is taken in h is a child of h, represented by h · a = h′, while h is the parent of h′. More generally, h′ is an ancestor of h (and h is a descendant of h′), represented by h′ ⊏ h, if there exists a sequence of actions from h′ to h. Z ⊆ H are terminal histories for which no actions are available. For each player i ∈ P, there is a payoff function u_i : Z → ℝ. 
If P = {1, 2} and u_1 = −u_2, the game is two-player zero-sum. We define Δ_i = max_{z∈Z} u_i(z) − min_{z∈Z} u_i(z) and Δ = max_i Δ_i.\nImperfect information is represented by information sets for each player i ∈ P by a partition I_i of {h ∈ H : P(h) = i}. For any information set I ∈ I_i, all histories h, h′ ∈ I are indistinguishable to player i, so A(h) = A(h′). I(h) is the information set I where h ∈ I. P(I) is the player i such that I ∈ I_i. A(I) is the set of actions such that for all h ∈ I, A(I) = A(h). |A_i| = max_{I∈I_i} |A(I)| and |A| = max_i |A_i|. We define U(I) to be the maximum payoff reachable from a history in I, and L(I) to be the minimum. That is, U(I) = max_{z∈Z, h∈I: h⊑z} u_{P(I)}(z) and L(I) = min_{z∈Z, h∈I: h⊑z} u_{P(I)}(z). We define Δ(I) = U(I) − L(I) to be the range of payoffs reachable from a history in I. We similarly define U(I, a), L(I, a), and Δ(I, a) as the maximum, minimum, and range of payoffs (respectively) reachable from a history in I after taking action a. We define D(I, a) to be the set of information sets reachable by player P(I) after taking action a. Formally, I′ ∈ D(I, a) if for some history h ∈ I and h′ ∈ I′, h · a ⊑ h′ and P(I) = P(I′).\nA strategy σ_i(I) is a probability vector over A(I) for player i in information set I. The probability of a particular action a is denoted by σ_i(I, a). Since all histories in an information set belonging to player i are indistinguishable, the strategies in each of them must be identical. That is, for all h ∈ I, σ_i(h) = σ_i(I) and σ_i(h, a) = σ_i(I, a). We define σ_i to be a probability vector for player i over all available strategies Σ_i in the game. 
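As an implementation-oriented aside (ours, not the paper's), the per-information-set state that this notation describes can be mirrored by a minimal Python sketch; the class name and field names are illustrative assumptions:

```python
class InfoSet:
    """Hypothetical per-information-set state for CFR: cumulative regrets
    R^T(I, a) and the reach-weighted strategy sums that the average
    strategy is computed from. Names are illustrative, not from the paper."""

    def __init__(self, actions):
        self.actions = list(actions)                   # A(I)
        self.regret = {a: 0.0 for a in actions}        # R^T(I, a)
        self.strategy_sum = {a: 0.0 for a in actions}  # sum_t pi_i(I) * sigma_i^t(I, a)

    def average_strategy(self):
        # Normalize the reach-weighted sums; fall back to uniform if
        # nothing has been accumulated yet.
        total = sum(self.strategy_sum.values())
        if total <= 0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: s / total for a, s in self.strategy_sum.items()}
```

This is the only per-node state CFR needs, which is why its memory footprint scales with the number of information sets rather than histories.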
A strategy profile σ is a tuple of strategies, one for each player. u_i(σ_i, σ_{−i}) is the expected payoff for player i if all players play according to the strategy profile ⟨σ_i, σ_{−i}⟩. If a series of strategies are played over T iterations, then σ̄_i^T = (Σ_{t=1}^T σ_i^t) / T.\nπ^σ(h) = Π_{h′·a⊑h} σ_{P(h′)}(h′, a) is the joint probability of reaching h if all players play according to σ. π_i^σ(h) is the contribution of player i to this probability (that is, the probability of reaching h if all players other than i, and chance, always chose actions leading to h). π_{−i}^σ(h) is the contribution of all players other than i, and chance. π^σ(h, h′) is the probability of reaching h′ given that h has been reached, and 0 if h is not an ancestor of h′. In a perfect-recall game, ∀h, h′ ∈ I ∈ I_i, π_i(h) = π_i(h′). In this paper we focus on perfect-recall games. Therefore, for i = P(I) we define π_i(I) = π_i(h) for h ∈ I. We define the average strategy σ̄_i^T(I) for an information set I to be\n\nσ̄_i^T(I) = ( Σ_{t=1}^T π_i^{σ^t}(I) σ_i^t(I) ) / ( Σ_{t=1}^T π_i^{σ^t}(I) )   (1)\n\n2.1 Nash Equilibrium\n\nA best response to σ_{−i} is a strategy σ_i^* such that u_i(σ_i^*, σ_{−i}) = max_{σ_i′∈Σ_i} u_i(σ_i′, σ_{−i}). A Nash equilibrium is a strategy profile where every player plays a best response. Formally, it is a strategy profile σ^* such that ∀i, u_i(σ_i^*, σ_{−i}^*) = max_{σ_i′∈Σ_i} u_i(σ_i′, σ_{−i}^*). We define a Nash equilibrium strategy for player i as a strategy σ_i that is part of any Nash equilibrium. In two-player zero-sum games, if σ_i and σ_{−i} are both Nash equilibrium strategies, then ⟨σ_i, σ_{−i}⟩ is a Nash equilibrium. An ϵ-equilibrium is a strategy profile σ^* such that ∀i, u_i(σ_i^*, σ_{−i}^*) + ϵ ≥ max_{σ_i′∈Σ_i} u_i(σ_i′, σ_{−i}^*).\n\n2.2 Counterfactual Regret Minimization\n\nCounterfactual Regret Minimization (CFR) is a popular regret-minimization algorithm for extensive-form games [16]. Our analysis of CFR makes frequent use of counterfactual value. Informally, this is the expected utility of an information set given that player i tries to reach it. For player i at information set I given a strategy profile σ, this is defined as\n\nv_i^σ(I) = Σ_{h∈I} ( π_{−i}^σ(h) Σ_{z∈Z} ( π^σ(h, z) u_i(z) ) )   (2)\n\nThe counterfactual value of an action a is\n\nv_i^σ(I, a) = Σ_{h∈I} ( π_{−i}^σ(h) Σ_{z∈Z} ( π^σ(h · a, z) u_i(z) ) )   (3)\n\nLet σ^t be the strategy profile used on iteration t. The instantaneous regret on iteration t for action a in information set I is\n\nr^t(I, a) = v_{P(I)}^{σ^t}(I, a) − v_{P(I)}^{σ^t}(I)   (4)\n\nand the regret for action a in I on iteration T is\n\nR^T(I, a) = Σ_{t=1}^T r^t(I, a)   (5)\n\nAdditionally, R_+^T(I, a) = max{R^T(I, a), 0} and R^T(I) = max_a {R_+^T(I, a)}. Regret for player i in the entire game is\n\nR_i^T = max_{σ_i′∈Σ_i} Σ_{t=1}^T ( u_i(σ_i′, σ_{−i}^t) − u_i(σ_i^t, σ_{−i}^t) )   (6)\n\nIn CFR, a player in an information set picks an action among the actions with positive regret in proportion to his positive regret on that action. Formally, on each iteration T + 1, player i selects actions a ∈ A(I) according to probabilities\n\nσ_i^{T+1}(I, a) = R_+^T(I, a) / Σ_{a′∈A(I)} R_+^T(I, a′)   if Σ_{a′∈A(I)} R_+^T(I, a′) > 0;   otherwise σ_i^{T+1}(I, a) = 1 / |A(I)|   (7)\n\nIf a player plays according to CFR in every iteration, then on iteration T, R^T(I) ≤ Δ_i √|A(I)| √T. Moreover,\n\nR_i^T ≤ Σ_{I∈I_i} R^T(I) ≤ |I_i| Δ_i √|A_i| √T   (8)\n\nSo, as T → ∞, R_i^T / T → 0. In two-player zero-sum games, if both players' average regret R_i^T / T ≤ ϵ, their average strategies ⟨σ̄_1^T, σ̄_2^T⟩ form a 2ϵ-equilibrium [15]. Thus, CFR constitutes an anytime algorithm for finding an ϵ-Nash equilibrium in zero-sum games.\n\n3 Applying Best Response to Zero-Reach Sequences\n\nIn Section 2 it was explained that if both players' average regret approaches zero, then their average strategies approach a Nash equilibrium. CFR provides one way to compute strategies that have bounded regret, but it is not the only way. CFR-BR [7] is a variant of CFR in which one player plays CFR and the other player plays a best response to the opponent's strategy in every iteration. Calculating a best response to a fixed strategy is computationally cheap (in games of perfect recall), costing only a single traversal of the game tree. By playing a best response in every iteration, the best-responder is guaranteed to have at most zero regret. Moreover, the CFR player's regret is still bounded according to (8). 
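CFR's regret-matching rule, equation (7), reduces to a few lines of code; a sketch of ours, with illustrative names:

```python
def regret_matching(regrets):
    """Regret matching, equation (7): play each action in proportion to its
    positive cumulative regret; fall back to the uniform distribution when
    no action has positive regret. `regrets` maps action -> R^T(I, a)."""
    positive = {a: max(r, 0.0) for a, r in regrets.items()}
    norm = sum(positive.values())
    if norm > 0:
        return {a: p / norm for a, p in positive.items()}
    return {a: 1.0 / len(regrets) for a in regrets}
```

Note that an action with negative regret gets probability exactly zero whenever some other action has positive regret, which is the property RBP exploits.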
However, in practice the CFR player's regret in CFR-BR tends to be higher than when both players play vanilla CFR (since the opponent is clairvoyantly maximizing the CFR player's regret). For this reason, empirical results show that CFR-BR converges slower than CFR, even though the best-responder's regret is always at most zero.\nWe now discuss a modification of CFR that will motivate the main contribution of this paper, which, in turn, is described in Section 4. The idea is that by applying a best response only in certain situations (and CFR in others), we can lower regret for one player without increasing it for the opponent. Without loss of generality, we discuss how to reduce regret for Player 1. Specifically, consider an information set I ∈ I_1 and action a where σ_1^t(I, a) = 0 and any history h ∈ I. Then for any ancestor history h′ such that h′ ⊏ h · a, we know π_1^{σ^t}(h′, h · a) = 0. Likewise, for any descendant history h′ such that h · a ⊑ h′, we know π_1^{σ^t}(h′) = 0. Thus, from (4) we see that Player 1's strategy on iteration t in any information set following action a has no effect on Player 2's regret for that iteration. Moreover, it also has no effect on Player 1's regret for any information set except R(I, a) and information sets that follow action a. Therefore, by playing a best response only in information sets following action a (and playing vanilla CFR elsewhere), Player 1 guarantees zero regret for himself in all information sets following action a, without the practical cost of increasing his regret in information sets before I or of increasing Player 2's regret. This may increase regret for action a itself, but if we only do this when R(I, a) ≤ −Δ(I), we can guarantee R(I, a) ≤ 0 even after the iteration. 
Similarly, Player 2 can simultaneously play a best response in information sets following an action a′ where σ_2^t(I′, a′) = 0 for I′ ∈ I_2. This approach leads to lower regret for both players.\n(In situations where both players' sequences of reaching an information set have zero probability (π_1(h) = π_2(h) = 0), the strategies chosen have no impact on the regret or average strategy for either player, so there is no need to compute what strategies should be played from then on.)\nOur experiments showed that this technique leads to a dramatic improvement over CFR in terms of the number of iterations needed—though the theoretical convergence bound remains the same. However, each iteration touches more nodes—because negative-regret actions more quickly become positive and are not skipped with partial pruning—and thus takes longer. It depends on the game whether CFR or this technique is faster overall; see experiments in Appendix A. Regret-based pruning, introduced in the next section, outperforms both of these approaches significantly.\n\n4 Regret-Based Pruning (RBP)\n\nIn this section we present the main contribution of this paper, a technique for soundly pruning—on a temporary basis—negative-regret actions from the tree traversal in order to speed it up significantly. In Section 3 we proposed a variant of CFR where a player plays a best response in information sets that the player reaches with zero probability. In this section, we show that these information sets and their descendants need not be traversed in every iteration. Rather, the frequency with which they must be traversed is proportional to how negative regret is for the action leading to them. This less-frequent traversal does not hurt the regret bound (8). 
Consider an information set I \u2208 I1 and action a where\nRt(I, a) = \u22121000 and regret for at least one other action in I is positive, and assume \u2206(I) = 1.\nFrom (7), we see that \u03c3t+1\n(I, a) = 0. As described in Section 3, the strategy played by Player 1\non iteration t + 1 in any information set following action a has no effect on Player 2. Moreover, it\nhas no immediate effect on what Player 1 will do in the next iteration (other than in information sets\nfollowing action a), because we know regret for action a will still be at most -999 on iteration t + 2\n(since \u2206(I) = 1) and will continue to not be played. So rather than traverse the game tree following\naction a, we could \u201cprocrastinate\u201d in deciding what Player 1 did on iteration t + 1, t + 2, ..., t + 1000\n\n1\n\n4\n\n\fin that branch until after iteration t + 1000 (at which point regret for that action may be positive).\nThat is, we could (in principle) store Player 2\u2019s strategy for each iteration between t+1 and t+1000,\nand on iteration t+1000 calculate a best response to each of them and announce that Player 1 played\nthose best responses following action a on iterations t + 1 to t + 1000 (and update the regrets to\nmatch this). Obviously this itself would not be an improvement, but performance would be identical\nto the algorithm described in Section 3.\nHowever, rather than have Player 1 calculate and play a best response for each iteration between\nt + 1 and t + 1000 separately, we could simply calculate a best response against the average strategy\nthat Player 2 played in those iterations. This can be accomplished in a single traversal of the game\ntree. We can then announce that Player 1 played this best response on each iteration between t + 1\nand t + 1000. This provides bene\ufb01ts similar to the algorithm described in Section 3, but allows us\nto do the work of 1000 iterations in a single traversal! 
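The arithmetic behind the example above is a one-line bound; a sketch of ours with illustrative names, computing the worst-case number of iterations an action with nonpositive regret can be skipped (Theorem 1 below makes this precise):

```python
import math

def prune_horizon(regret, upper, lower):
    """Worst-case number of iterations an action with cumulative regret <= 0
    can be skipped: instantaneous regret gains at most
    (upper - lower) = U(I, a) - L(I) per iteration, so the cumulative regret
    cannot turn positive any sooner. Names are illustrative."""
    if regret > 0:
        return 0
    return math.floor(-regret / (upper - lower))
```

With regret −1000 and payoff range Δ(I) = 1, as in the example, the horizon is 1000 iterations.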
We coin this regret-based pruning (RBP).\nWe now present a theorem that guarantees that when R(I, a) ≤ 0, we can prune D(I, a) through regret-based pruning for ⌊|R(I, a)| / (U(I, a) − L(I))⌋ iterations.\n\nTheorem 1. Consider a two-player zero-sum game. Let a ∈ A(I) be an action such that on iteration T_0, R^{T_0}(I, a) ≤ 0. Let I′ be an information set for any player such that I′ ∉ D(I, a) and let a′ ∈ A(I′). Let m = ⌊|R(I, a)| / (U(I, a) − L(I))⌋. If σ(I, a) = 0 when R(I, a) ≤ 0, then regardless of what is played in D(I, a) during {T_0, ..., T_0 + m}, R_+^T(I′, a′) is identical for T ≤ T_0 + m.\n\nProof. Since v_i^σ(I) ≥ L(I) and v_i^σ(I, a) ≤ U(I, a), from (4) we get r^t(I, a) ≤ U(I, a) − L(I). Thus, for T_0 ≤ T ≤ T_0 + m, R^T(I, a) ≤ 0. Clearly the theorem is true for T < T_0. We prove the theorem continues to hold inductively for T ≤ T_0 + m. Assume the theorem holds for iteration T and consider iteration T + 1. Suppose I′ ∈ I_{P(I)} and either I′ ≠ I or a′ ≠ a. Then for any h′ ∈ I′, there is no ancestor of h′ in an information set in D(I, a). Thus, π_{−i}^{σ^{T+1}}(h′) does not depend on the strategy in D(I, a). Moreover, for any z ∈ Z, if h′ ⊏ h ⊏ z for some h ∈ I^* ∈ D(I, a), then π^{σ^{T+1}}(h′, z) = 0 because σ^{T+1}(I, a) = 0. Since I′ ≠ I or a′ ≠ a, it similarly holds that π^{σ^{T+1}}(h′ · a′, z) = 0. Then from (4), r^{T+1}(I′, a′) does not depend on the strategy in D(I, a).\nNow suppose I′ ∈ I_i for i ≠ P(I). Consider some h′ ∈ I′ and some h ∈ I. First suppose that h · a ⊑ h′. Since π_i^{σ^{T+1}}(h · a) = 0, we have π_i^{σ^{T+1}}(h′) = 0 and h′ contributes nothing to the regret of I′. Now suppose h′ ⊏ h. Then for any z ∈ Z, if h′ ⊏ h ⊏ z then π^{σ^{T+1}}(h′, z) = 0 and does not depend on the strategy in D(I, a). Finally, suppose h′ is not an ancestor of h and h · a is not a prefix of h′. Then for any z ∈ Z such that h′ ⊏ z, we know h · a is not a prefix of z, and therefore π^{σ^{T+1}}(h′, z) does not depend on the strategy in D(I, a).\nNow suppose I′ = I and a′ = a. We proved R^T(I, a) ≤ 0 for T_0 ≤ T ≤ T_0 + m, so R_+^T(I, a) = 0. Thus, for all T ≤ T_0 + m, R_+^T(I′, a′) is identical regardless of what is played in D(I, a). ∎\n\nWe can improve this approach significantly by not requiring knowledge beforehand of exactly how many iterations can be skipped. Rather, we will decide in light of what happens during the intervening CFR iterations when an action needs to be revisited. From (4) we know that r^T(I, a) ∝ π_{−i}^{σ^T}(I). Moreover, v_{P(I)}^{σ^T}(I) does not depend on D(I, a). Thus, we can prune D(I, a) from iteration T_0 until iteration T_1 so long as\n\nΣ_{t=1}^{T_0} v_{P(I)}^{σ^t}(I, a) + Σ_{t=T_0+1}^{T_1} π_{−i}^{σ^t}(I) U(I, a) ≤ Σ_{t=1}^{T_1} v_{P(I)}^{σ^t}(I)   (9)\n\nIn the worst case, this allows us to skip only ⌊|R(I, a)| / (U(I, a) − L(I))⌋ iterations. However, in practice it performs significantly better, though we cannot know on iteration T_0 how many iterations it will skip because it depends on what is played in T_0 ≤ t ≤ T_1. Our exploratory experiments showed that in practice performance also improves by replacing U(I, a) with a more accurate upper bound on reward in (9). 
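The revisit test in (9) amounts to comparing running sums; a sketch of ours, with hypothetical argument names:

```python
def can_stay_pruned(cum_v_action_T0, cum_v_infoset_T1, skipped_reach_sum, upper_bound):
    """Condition (9): keep D(I, a) pruned through iteration T1 as long as
    crediting the pruned action with the upper bound U(I, a) on every
    skipped iteration (weighted by the opponent's reach pi_{-i}(I)) still
    leaves its cumulative counterfactual value no larger than the
    information set's. Argument names are illustrative assumptions."""
    optimistic_action_value = cum_v_action_T0 + skipped_reach_sum * upper_bound
    return optimistic_action_value <= cum_v_infoset_T1
```

The check is constant time per information set per iteration, which is what makes skipping the traversal of D(I, a) a net win.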
CFR will still converge if D(I, a) is pruned for too many iterations; however, that hurts convergence speed. In the experiments included in this paper, we conservatively use U(I, a) as the upper bound.\n\n4.1 Best Response Calculation for Regret-Based Pruning\n\nIn this section we discuss how one can efficiently compute the best responses as called for in regret-based pruning. The advantage of Theorem 1 is that we can wait until after pruning has finished—that is, until we revisit an action—to decide what strategies were played in D(I, a) during the intervening iterations. We can then calculate a single best response to the average strategy that the opponent played, and say that that best response was played in D(I, a) in each of the intervening iterations. This results in zero regret over those iterations for information sets in D(I, a). We now describe how this best response can be calculated efficiently.\nTypically, when playing CFR one stores Σ_{t=1}^T π_i^t(I) σ_i^t(I) for each information set I. This allows one to immediately calculate the average strategy defined in (1) in any particular iteration. If we start pruning on iteration T_0 and revisit on iteration T_1, we wish to calculate a best response to σ̄_i^{T_1−T_0}, where σ̄_i^{T_1−T_0}(I) = ( Σ_{t=T_0}^{T_1} π_i^t(I) σ_i^t(I) ) / ( Σ_{t=T_0}^{T_1} π_i^t(I) ). An easy approach would be to store the opponent's cumulative strategy before pruning begins and subtract it from the current cumulative strategy when pruning ends. In fact, we only need to store the opponent's strategy in information sets that follow action a. However, this could potentially use O(H) memory because the same information set I belonging to Player 2 may be reached from multiple information sets belonging to Player 1. In contrast, CFR only requires O(|I||A|) memory, and we want to maintain this desirable property. We accomplish that as follows.\nTo calculate a best response against σ̄_2^T, we traverse the game tree and calculate the counterfactual value, defined in (3), for every action for every information set belonging to Player 1 that does not lead to any further Player 1 information sets. Specifically, we calculate v_1^{σ̄^{T_0−1}}(I, a) for every action a in I such that D(I, a) = ∅. Since we calculate this only for actions where D(I, a) = ∅, v_1^{σ̄^{T_0−1}}(I, a) does not depend on σ̄_1. Then, starting from the bottom information sets, we set the best-response strategy σ_1^{BR}(I) to always play the action with the highest counterfactual value (ties can be broken arbitrarily), and pass this value up as the payoff for reaching I, repeating the process up the tree. In order to calculate a best response to σ̄_2^{T_1−T_0}, we first store, before pruning begins, the counterfactual values for Player 1 against Player 2's average strategy for every action a in each information set I where D(I, a) = ∅. When we revisit the action on iteration T_1, we calculate a best response to σ̄_2^{T_1} except that we set the counterfactual value for every action a in information set I where D(I, a) = ∅ to be T_1 v_1^{σ̄^{T_1}}(I, a) − (T_0 − 1) v_1^{σ̄^{T_0−1}}(I, a). The latter term was stored, and the former term can be calculated from the current average strategy profile. As before, we set σ_1^{BR}(I) to always play whichever action has the highest counterfactual value, and pass this term up.\nA slight complication arises when we are pruning an action a in information set I and wish to start pruning an earlier action a′ from information set I′ such that I ∈ D(I′, a′). In this case, it is necessary to explore action a in order to calculate the best response in D(I′, a′). However, if such traversals happen frequently, then this would defeat the purpose of pruning action a. One way to address this is to only prune an action a′ when the number of iterations guaranteed (or estimated) to be skipped exceeds some threshold. This ensures that the overhead is worthwhile, and that we are not frequently traversing an action a farther down the tree that is already being pruned. Another option is to add some upper bound to how long we will prune an action. If the lower bound for how long we will prune a exceeds the upper bound for how long we will prune a′, then we need not traverse a in the best response calculation for a′ because a will still be pruned when we are finished with pruning a′. In our experiments, we use the former approach. Experiments to determine a good parameter for this are presented in Appendix B.\n\n4.2 Regret-Based Pruning with CFR+\n\nCFR+ [13] is a variant of CFR where the regret is never allowed to go below 0. Formally, R^T(I, a) = max{R^{T−1}(I, a) + r^T(I, a), 0} for T ≥ 1 and R^T(I, a) = 0 for T = 0. Although this change appears small, and does not improve the bound on regret, it leads to faster empirical convergence. CFR+ was a key advancement that allowed Limit Texas Hold'em poker to be essentially solved [1]. At first glance, it would seem that CFR+ and RBP are incompatible. 
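The bottom-up best-response pass just described can be sketched as follows (a simplified rendering of ours: each information set is a hypothetical dict holding stored counterfactual values for actions with D(I, a) = ∅, and child information sets otherwise):

```python
def best_response_value(infoset):
    """Sketch of the bottom-up best-response pass: an action's value is
    either a stored counterfactual value (for actions that lead to no
    further Player 1 information sets) or the sum of the values passed up
    from the information sets it leads to. The chosen action is recorded
    as the best-response strategy. The dict structure and keys are
    illustrative assumptions, not the paper's data layout."""
    values = dict(infoset.get('leaf_values', {}))
    for action, children in infoset.get('children', {}).items():
        values[action] = sum(best_response_value(c) for c in children)
    best = max(values, key=values.get)   # ties broken arbitrarily
    infoset['br_action'] = best          # sigma_1^BR(I)
    return values[best]
```

A single such traversal settles the best response for every pruned iteration at once, which is the source of RBP's speedup.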
RBP allows actions to be traversed with decreasing frequency as regret decreases below zero. However, CFR+ sets a floor for regret at zero. Nevertheless, it is possible to combine the two, as we now show. We modify the definition of regret in CFR+ so that it can drop below zero, but immediately returns to being positive as soon as regret begins increasing. Formally, we modify the definition of regret in CFR+ for T > 0 to be as follows: R^T(I, a) = r^T(I, a) if r^T(I, a) > 0 and R^{T−1}(I, a) ≤ 0, and R^T(I, a) = R^{T−1}(I, a) + r^T(I, a) otherwise. This leads to identical behavior in CFR+, and also allows regret to drop below zero so actions can be pruned.\nWhen using RBP with CFR+, regret does not strictly follow the rules for CFR+. CFR+ calls for an action to be played with positive probability whenever instantaneous regret for it is positive in the previous iteration. Since RBP only checks the regret for an action after potentially several iterations have been skipped, there may be a delay between the iteration when an action would return to play in CFR+ and the iteration when it returns to play in RBP. This does not pose a theoretical problem: CFR's convergence rate still applies.\nHowever, this difference is noticeable when combined with linear averaging. Linear averaging weighs each iteration σ^t in the average strategy by t. It does not affect regret or influence the selection of strategies on an iteration. That is, with linear averaging the new definition for average strategy becomes σ̄_i^T(I) = ( Σ_{t=1}^T t π_i^{σ^t}(I) σ_i^t(I) ) / ( Σ_{t=1}^T t π_i^{σ^t}(I) ). Linear averaging still maintains the asymptotic convergence rate of constant averaging (where each iteration is weighed equally) in CFR+ [14]. Empirically it causes CFR+ to converge to a Nash equilibrium much faster. 
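The modified CFR+ accumulation rule described above is a one-line change; a sketch of ours:

```python
def cfr_plus_rbp_regret(prev_regret, inst_regret):
    """Modified CFR+ regret update from Section 4.2: regret may drift below
    zero (so RBP can prune), but resets to the instantaneous regret as soon
    as instantaneous regret turns positive while the running total is
    nonpositive. For nonnegative running totals it behaves exactly like
    CFR+."""
    if inst_regret > 0 and prev_regret <= 0:
        return inst_regret
    return prev_regret + inst_regret
```

Resetting to `inst_regret` reproduces what CFR+ would have computed had the regret been floored at zero, so the two rules select identical strategies.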
However, in vanilla CFR it results in worse performance, and there is no proof guaranteeing convergence. Since RBP with CFR+ results in behavior that does not strictly conform to CFR+, linear averaging results in somewhat noisier convergence. This can be mitigated by reporting the strategy profile found so far that is closest to a Nash equilibrium rather than the current average strategy profile, and we do this in the experiments.

5 Experiments

We tested regret-based pruning in both CFR and CFR+ against partial pruning, as well as against CFR with no pruning. Our implementation traverses the game tree once each iteration.¹ We tested our algorithm on standard Leduc Hold'em [12] and a scaled-up variant of it featuring more actions. Leduc Hold'em is a popular benchmark problem for imperfect-information game solving due to its size (large enough to be highly nontrivial but small enough to be solvable) and strategic complexity. In Leduc Hold'em, there is a deck consisting of six cards: two each of Jack, Queen, and King. There are two rounds. In the first round, each player places an ante of 1 chip in the pot and receives a single private card. A round of betting then takes place with a two-bet maximum, with Player 1 going first. A public shared card is then dealt face up and another round of betting takes place. Again, Player 1 goes first, and there is a two-bet maximum. If one of the players has a pair with the public card, that player wins. Otherwise, the player with the higher card wins. In standard Leduc Hold'em, the bet size in the first round is 2 chips, and 4 chips in the second round.
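The showdown rule just described can be sketched as a small helper. This is a hypothetical function for illustration, not part of the paper's implementation; ranks 11, 12, 13 stand for Jack, Queen, and King:

```python
# Sketch of the Leduc Hold'em showdown rule described above.
# Illustrative helper, not the paper's code. Ranks: 11=Jack, 12=Queen, 13=King.

def leduc_winner(card1, card2, public):
    """Return 1 or 2 for the winning player, or 0 on a tie."""
    if card1 == public and card2 != public:
        return 1  # player 1 pairs the public card
    if card2 == public and card1 != public:
        return 2  # player 2 pairs the public card
    if card1 > card2:
        return 1  # otherwise the higher private card wins
    if card2 > card1:
        return 2
    return 0      # both players hold the same rank: split the pot
```

Since the deck holds only two copies of each rank and one is the public card, at most one player can pair it; the tie case arises only when both players hold the same rank.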
In our scaled-up variant, which we call Leduc-5, there are 5 bet sizes to choose from: in the first round a player may bet 0.5, 1, 2, 4, or 8 chips, while in the second round a player may bet 1, 2, 4, 8, or 16 chips.
We measure the quality of a strategy profile by its exploitability, which is the summed ε distance of both players from a Nash equilibrium strategy. Formally, the exploitability of a strategy profile σ is max_{σ*_1 ∈ Σ_1} u_1(σ*_1, σ_2) + max_{σ*_2 ∈ Σ_2} u_2(σ_1, σ*_2). We measure exploitability against the number of nodes touched over all CFR traversals. As shown in Figure 1, RBP leads to a substantial improvement over vanilla CFR with partial pruning in Leduc Hold'em, increasing the speed of convergence by more than a factor of 8. This is partially due to the game tree being traversed twice as fast, and partially due to the use of a best response in sequences that are pruned (the benefit of which was described in Section 3). The improvement when added on top of CFR+ is smaller, increasing the speed of convergence by about a factor of 2. This matches the reduction in game tree traversal size. The benefit from RBP is more substantial in the larger benchmark game, Leduc-5. RBP increases the convergence speed of CFR by a factor of 12, and reduces the per-iteration game tree traversal cost by about a factor of 7. In CFR+, RBP improves the rate of convergence by about an order of magnitude. RBP also decreases the number of nodes touched per iteration in CFR+ by about a factor of 40.

¹Canonical CFR+ traverses the game tree twice each iteration, updating the regrets for each player in separate traversals [13]. This difference does not, however, affect the error measure (y-axis) in the experiments.

Figure 1: (a) Leduc Hold'em; (b) Leduc-5 Hold'em. Top: Exploitability.
Bottom: Nodes touched per iteration.

The results imply that larger games benefit more from RBP than smaller games. This is not universally true, since it is possible to have a large game where every action is part of the Nash equilibrium. Nevertheless, there are many games with very large action spaces where the vast majority of those actions are suboptimal, but players do not know beforehand which are suboptimal. In such games, RBP would improve convergence tremendously.

6 Conclusions and Future Research

In this paper we introduced a new method of pruning that allows CFR to avoid traversing negative-regret actions on every iteration. Our regret-based pruning (RBP) temporarily ceases their traversal in a sound way, without compromising the overall convergence rate. Experiments show an order of magnitude speed improvement over partial pruning, and suggest that the benefit of RBP increases with game size. Thus RBP is particularly useful in large games where many actions are suboptimal, but where it is not known beforehand which actions those are.
In future research, it would be worth examining whether similar forms of pruning can be applied to other equilibrium-finding algorithms as well. RBP, as presented in this paper, is for CFR using regret matching to determine what strategies to use on each iteration based on the regrets. RBP does not directly apply to other strategy-selection techniques that could be used within CFR, such as exponential weights, because the latter always puts positive probability on actions. Also, it would be interesting to see whether RBP-like pruning could be applied to first-order methods for equilibrium-finding [5, 3, 10, 8].
The results in this paper suggest that for any equilibrium-finding algorithm to be efficient in large games, effective pruning is essential.

6.1 Acknowledgement

This material is based on work supported by the National Science Foundation under grants IIS-1320620 and IIS-1546752, as well as XSEDE computing resources provided by the Pittsburgh Supercomputing Center.

References

[1] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.

[2] Noam Brown, Sam Ganzfried, and Tuomas Sandholm. Hierarchical abstraction, distributed equilibrium computation, and post-processing, with application to a champion no-limit Texas hold'em agent. In Proceedings of the 2015 International Conference on Autonomous Agents and Multi-Agent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

[3] Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with O(ln(1/ε)) convergence for ε-equilibrium in two-person zero-sum games. Mathematical Programming, 133(1–2):279–298, 2012. Conference version appeared in AAAI-08.

[4] Andrew Gilpin and Tuomas Sandholm. Lossless abstraction of imperfect information games. Journal of the ACM, 54(5), 2007. Early version 'Finding equilibria in large sequential games of imperfect information' appeared in the Proceedings of the ACM Conference on Electronic Commerce (EC), pages 160–169, 2006.

[5] Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010. Conference version appeared in WINE-07.

[6] Eric Griffin Jackson. A time and space efficient algorithm for approximately solving large imperfect information games.
In AAAI Workshop on Computer Poker and Imperfect Information, 2014.

[7] Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract strategies in extensive-form games. In AAAI Conference on Artificial Intelligence (AAAI), 2012.

[8] Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In Proceedings of the ACM Conference on Economics and Computation (EC), 2015.

[9] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1078–1086, 2009.

[10] François Pays. An interior point approach to large games of incomplete information. In AAAI Computer Poker Workshop, 2014.

[11] Tuomas Sandholm. The state of solving large incomplete-information games, and application to poker. AI Magazine, pages 13–32, Winter 2010. Special issue on Algorithmic Game Theory.

[12] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 550–558, July 2005.

[13] Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.

[14] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In IJCAI, 2015.

[15] Kevin Waugh, David Schnizlein, Michael Bowling, and Duane Szafron. Abstraction pathologies in extensive games. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2009.

[16] Martin Zinkevich, Michael Bowling, Michael Johanson, and Carmelo Piccione.
Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.