{"title": "Long-term Causal Effects via Behavioral Game Theory", "book": "Advances in Neural Information Processing Systems", "page_first": 2604, "page_last": 2612, "abstract": "Planned experiments are the gold standard in reliably comparing the causal effect of switching from a baseline policy to a new policy. One critical shortcoming of classical experimental methods, however, is that they typically do not take into account the dynamic nature of response to policy changes. For instance, in an experiment where we seek to understand the effects of a new ad pricing policy on auction revenue, agents may adapt their bidding in response to the experimental pricing changes. Thus, causal effects of the new pricing policy after such an adaptation period, the long-term causal effects, are not captured by the classical methodology even though they clearly are more indicative of the value of the new policy. Here, we formalize a framework to define and estimate long-term causal effects of policy changes in multiagent economies. Central to our approach is behavioral game theory, which we leverage to formulate the ignorability assumptions that are necessary for causal inference. Under such assumptions we estimate long-term causal effects through a latent space approach, where a behavioral model of how agents act conditional on their latent behaviors is combined with a temporal model of how behaviors evolve over time.", "full_text": "Long-term causal effects via behavioral game theory\n\nPanagiotis (Panos) Toulis\nEconometrics & Statistics, Booth School\nUniversity of Chicago\nChicago, IL, 60637\npanos.toulis@chicagobooth.edu\n\nDavid C. Parkes\nDepartment of Computer Science\nHarvard University\nCambridge, MA, 02138\nparkes@eecs.harvard.edu\n\nAbstract\n\nPlanned experiments are the gold standard in reliably comparing the causal effect\nof switching from a baseline policy to a new policy. 
One critical shortcoming of\nclassical experimental methods, however, is that they typically do not take into\naccount the dynamic nature of response to policy changes. For instance, in an\nexperiment where we seek to understand the effects of a new ad pricing policy on\nauction revenue, agents may adapt their bidding in response to the experimental\npricing changes. Thus, causal effects of the new pricing policy after such an adaptation period, the long-term causal effects, are not captured by the classical methodology even though they clearly are more indicative of the value of the new policy.\nHere, we formalize a framework to define and estimate long-term causal effects\nof policy changes in multiagent economies. Central to our approach is behavioral\ngame theory, which we leverage to formulate the ignorability assumptions that are\nnecessary for causal inference. Under such assumptions we estimate long-term\ncausal effects through a latent space approach, where a behavioral model of how\nagents act conditional on their latent behaviors is combined with a temporal model\nof how behaviors evolve over time.\n\n1 Introduction\n\nA multiagent economy is comprised of agents interacting under specific economic rules. A common\nproblem of interest is to experimentally evaluate changes to such rules, also known as treatments, on\nan objective of interest. For example, an online ad auction platform is a multiagent economy, where\none problem is to estimate the effect of raising the reserve price on the platform\u2019s revenue. 
Assessing\ncausality of such effects is a challenging problem because there is a conceptual discrepancy between\nwhat needs to be estimated and what is available in the data, as illustrated in Figure 1.\nWhat needs to be estimated is the causal effect of a policy change, which is de\ufb01ned as the difference\nbetween the objective value when the economy is treated, i.e., when all agents interact under the\nnew rules, relative to when the same economy is in control, i.e., when all agents interact under the\nbaseline rules. Such de\ufb01nition of causal effects is logically necessitated from the designer\u2019s task,\nwhich is to select either the treatment or the control policy based on their estimated revenues, and\nthen apply such policy to all agents in the economy. The long-term causal effect is the causal effect\nde\ufb01ned after the system has stabilized, and is more representative of the value of policy changes\nin dynamical systems. Thus, in Figure 1 the long-term causal effect is the difference between the\nobjective values at the top and bottom endpoints, marked as the \u201ctargets of inference\u201d.\nWhat is available in the experimental data, however, typically comes from designs such as the so-\ncalled A/B test, where we randomly assign some agents to the treated economy (new rules B) and\nthe others to the control economy (baseline rules A), and then compare the outcomes. In Figure 1\nthe experimental data are depicted as the solid time-series in the middle of the plot, marked as the\n\u201cobserved data\u201d.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: The two inferential tasks for causal inference in multiagent economies. First, infer agent actions\nacross treatment assignments (y-axis), particularly, the assignment where all agents are in the treated economy\n(top assignment, Z = 1), and the assignment where all agents are in the control economy (bottom assignment,\nZ = 0). 
Second, infer across time, from t0 (last observation time) to long-term T . What we seek in order to\nevaluate the causal effect of the new treatment is the difference between the objectives (e.g., revenue) at the two\ninferential target endpoints.\n\nTherefore the challenge in estimating long-term causal effects is that we generally need to perform\ntwo inferential tasks simultaneously, namely,\n\n(i) infer outcomes across possible experimental policy assignments (y-axis in Figure 1), and\n(ii) infer long-term outcomes from short-term experimental data (x-axis in Figure 1).\n\nThe \ufb01rst task is commonly known as the \u201cfundamental problem of causal inference\u201d [14, 19] be-\ncause it underscores the impossibility of observing in the same experiment the outcomes for both\npolicy assignments that de\ufb01ne the causal effect; i.e., that we cannot observe in the same experiment\nboth the outcomes when all agents are treated and the outcomes when all agents are in control, the\nassignments of which are denoted by Z = 1 and Z = 0, respectively, in Figure 1. In fact the\nrole of experimental design, as conceived by R.A. Fisher [8], is exactly to quantify the uncertainty\nabout such causal effects that cannot be observed due to the aforementioned fundamental problem,\nby using standard errors that can be observed in a carefully designed experiment.\nThe second task, however, is unique to causal inference in dynamical systems, such as the multiagent\neconomies that we study in this paper, and has received limited attention so far. Here, we argue that\nit is crucial to study long-term causal effects, i.e., effects measured after the system has stabilized,\nbecause such effects are more representative of the value of policy changes. If our analysis focused\nonly on the observed data part depicted in Figure 1, then policy evaluation would re\ufb02ect transient\neffects that might differ substantially from the long-term effects. 
For instance, raising the reserve\nprice in an auction might increase revenue in the short-term but as agents adapt their bids, or switch\nto another platform altogether, the long-term effect could be a net decrease in revenue [13].\n\n1.1 Related work and our contributions\n\nThere have been several important projects related to causal inference in multiagent economies. For\ninstance, Ostrovsky and Schwartz [16] evaluated the effects of an increase in the reserve price of\nYahoo! ad auctions on revenue. Auctions were randomly assigned to an increased reserve price\ntreatment, and the effect was estimated using difference-in-differences (DID), which is a popular\neconometric method [6, 7, 16]. In relation to Figure 1, DID extrapolates across assignments (y-axis)\nand across time (x-axis) by making a strong additivity assumption [1, 3, Section 5.2], specifically,\nby assuming that the dependence of revenue on reserve price and time is additive.\nIn a structural approach, Athey et al. [4] studied the effects of auction format (ascending versus\nsealed bid) on competition for timber tracts. In relation to Figure 1, their approach extrapolates\nacross assignments by assuming that agent individual valuations for tracts are independent of the\ntreatment assignment, and extrapolates across time by assuming that the observed agent bids are\nalready in equilibrium. Similar approaches are followed in econometrics for estimation of general\nequilibrium effects [11, 12].\nIn a causal graph approach [17] Bottou et al. 
[5] studied effects of changes in the algorithm that\nscores Bing ads on the ad platform\u2019s revenue. In relation to Figure 1, their approach is non-experimental and extrapolates across assignments and across time by assuming a directed acyclic\ngraph (DAG) as the correct data model, which is also assumed to be stable with respect to treatment\nassignment, and by estimating counterfactuals through the fitted model.\nOur work is different from prior work because it takes into account the short-term aspect of experimental data to evaluate long-term causal effects, which is the key conceptual and practical challenge\nthat arises in empirical applications. In contrast, classical econometric methods, such as DID, assume strong linear trends from short-term to long-term, whereas structural approaches typically\nassume that the experimental data are already long-term as they are observed in equilibrium. We\nrefer the reader to Sections 2 and 3 of the supplement for more detailed comparisons.\nIn summary, our key contribution is that we develop a formal framework that (i) articulates the\ndistinction between short-term and long-term causal effects, (ii) leverages behavioral game-theoretic\nmodels for causal analysis of multiagent economies, and (iii) explicates theory that enables valid\ninference of long-term causal effects.\n\n2 Definitions\nConsider a set of agents I and a set of actions A, indexed by i and a, respectively. The experiment\ndesigner wants to run an experiment to evaluate a new policy against the baseline policy relative to\nan objective. In the experiment each agent is assigned to one policy, and the experimenter observes\nhow agents act over time. 
Formally, let Z = (Zi) be the |I| \u00d7 1 assignment vector where Zi = 1\ndenotes that agent i is assigned to the new policy, and Zi = 0 denotes that i is assigned to the\nbaseline policy; as a shorthand, Z = 1 denotes that all agents are assigned to the new policy, and\nZ = 0 denotes that all agents are assigned to the baseline policy, where 1, 0 generally denote an\nappropriately-sized vector of ones and zeroes, respectively. In the simplest case, the experiment is\nan A/B test, where Z is uniformly random on {0, 1}|I| subject to \u2211i Zi = |I|/2.\nAfter the initial assignment Z agents play actions at discrete time points from t = 0 to t = t0. Let\nAi(t; Z) \u2208 A be the random variable that denotes the action of agent i at time t under assignment\nZ. The population action \u03b1j(t; Z) \u2208 \u2206|A|, where \u2206p denotes the p-dimensional simplex, is the frequency of actions at time t under assignment Z of agents that were assigned to game j; for example,\nassuming two actions A = {a1, a2}, then \u03b11(0; Z) = [0.2, 0.8] denotes that, under assignment Z,\n20% of agents assigned to the new policy play action a1 at t = 0, while the rest play a2. We assume\nthat the objective value for the experimenter depends on the population action, in a similar way that,\nsay, auction revenue depends on agents\u2019 aggregate bidding. The objective value in policy j at time\nt under assignment Z is denoted by R(\u03b1j(t; Z)), where R : \u2206|A| \u2192 R. 
For instance, suppose in\nthe previous example that a1 and a2 produce revenue $10 and \u2212$2, respectively, each time they are\nplayed, then R is linear and R([.2, .8]) = 0.2 \u00b7 $10 \u2212 0.8 \u00b7 $2 = $0.4.\nDe\ufb01nition 1 The average causal effect on objective R at time t of the new policy relative to the\nbaseline is denoted by CE(t) and is de\ufb01ned as\n\nCE(t) = E (R(\u03b11(t; 1)) \u2212 R(\u03b10(t; 0))) .\n\n(1)\n\nSuppose that (t0, T ] is the time interval required for the economy to adapt to the experimental con-\nditions. The exact de\ufb01nition of T is important but we defer this discussion for Section 3.1. The\ndesigner concludes that the new policy is better than the baseline if CE(T ) > 0. Thus, CE(T )\nis the long-term average causal effect and is a function of two objective values, R(\u03b11(T ; 1)) and\nR(\u03b10(T ; 0)), which correspond to the two inferential target endpoints in Figure 1. Neither value is\nobserved in the experiment because agents are randomly split between policies, and their actions are\nobserved only for the short-term period [0, t0]. Thus we need to (i) extrapolate across assignments\nby pivoting from the observed assignment to the counterfactuals Z = 1 and Z = 0; (ii) extrap-\nolate across time from the short-term data [0, t0] to the long-term t = T . We perform these two\nextrapolations based on a latent space approach, which is described next.\n\n3\n\n\f2.1 Behavioral and temporal models\n\nWe assume a latent behavioral model of how agents select actions, inspired by models from be-\nhavioral game theory. The behavioral model is used to predict agent actions conditional on agent\nbehaviors, and is combined with a temporal model to predict behaviors in the long-term. The two\nmodels are ultimately used to estimate agent actions in the long-term, and thus estimate long-term\ncausal effects. 
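The running revenue example above can be checked with a small sketch (action names, fees, and function names here are ours, matching the $10 / \u2212$2 example): compute a population action from individual actions and evaluate a linear objective R.

```python
import numpy as np

def population_action(actions, action_set):
    """alpha: frequency of each action in action_set among the observed actions."""
    counts = np.array([actions.count(a) for a in action_set], dtype=float)
    return counts / len(actions)

def R_linear(alpha, fees):
    """A linear objective R: expected fee collected per play."""
    return float(np.dot(alpha, fees))

# 10 hypothetical agents: 20% play a1, 80% play a2, as in the example above.
actions = ["a1"] * 2 + ["a2"] * 8
alpha = population_action(actions, ["a1", "a2"])   # [0.2, 0.8]
revenue = R_linear(alpha, [10.0, -2.0])            # 0.2*10 - 0.8*2 = 0.4
```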
As the choice of the latent space is not unique, in Section 3.1 we discuss why we\nchose to use behavioral models from game theory.\nLet Bi(t; Z) denote the behavior that agent i adopts at time t under experimental assignment Z. The\nfollowing assumption puts a constraint on the space of possible behaviors that agents can adopt,\nwhich will simplify the subsequent analysis.\nAssumption 1 (Finite set of possible behaviors) There is a fixed and finite set of behaviors B such\nthat for every time t, assignment Z and agent i, it holds that Bi(t; Z) \u2208 B; i.e., every agent can only\nadopt a behavior from B.\nDefinition 2 (Behavioral model) The behavioral model for policy j defined by set B of behaviors\nis the collection of probabilities\n\nP (Ai(t; Z) = a|Bi(t; Z) = b, Gj),\n\n(2)\n\nfor every action a \u2208 A and every behavior b \u2208 B, where Gj denotes the characteristics of policy j.\nAs an example, a non-sophisticated behavior b0 could imply that P (Ai(t; Z) = a|b0, Gj) = 1/|A|,\ni.e., that the agent adopting b0 simply plays actions at random. Conditioning on policy j in Definition 2 allows an agent to choose its actions based on expected payoffs, which depend on the\npolicy characteristics. For instance, in the application of Section 4 we consider a behavioral model\nwhere an agent picks actions in a two-person game according to expected payoffs calculated from\nthe game-specific payoff matrix\u2014in that case Gj is simply the payoff matrix of game j.\nThe population behavior \u03b2j(t; Z) \u2208 \u2206|B| denotes the frequency at time t under assignment Z of\nthe adopted behaviors of agents assigned to policy j. Let Ft denote the entire history of population\nbehaviors in the experiment up to time t. 
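A behavioral model in the sense of Definition 2 is simply one action distribution per behavior; a minimal sketch (behavior and action names are hypothetical, with b0 the uniform behavior from the example above):

```python
import numpy as np

rng = np.random.default_rng(0)

# For each behavior b, a distribution over actions:
# P(A_i(t;Z) = a | B_i(t;Z) = b, G_j).
actions = ["a1", "a2", "a3"]
behavioral_model = {
    "b0": np.full(len(actions), 1.0 / len(actions)),  # plays uniformly at random
    "b1": np.array([0.7, 0.2, 0.1]),                  # some payoff-driven behavior
}

def sample_action(behavior):
    """Draw one action for an agent currently adopting the given behavior."""
    return rng.choice(actions, p=behavioral_model[behavior])
```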
A temporal model of behaviors is defined as follows.\n\nDefinition 3 (Temporal model) For an experimental assignment Z, a temporal model for policy j\nis a collection of parameters \u03c6j(Z), \u03c8j(Z), and densities (\u03c0, f ), such that for all t,\n\n\u03b2j(0; Z) \u223c \u03c0(\u00b7; \u03c6j(Z)),    \u03b2j(t; Z) | Ft\u22121, Gj \u223c f (\u00b7 | \u03c8j(Z), Ft\u22121).\n\n(3)\n\nA temporal model defines the distribution of population behavior as a time-series with a Markovian\nstructure. As defined, the temporal model imposes the restriction that the prior \u03c0 of population\nbehavior at t = 0 and the density f of behavioral evolution are both independent of treatment\nassignment Z. In other words, regardless of how agents are assigned to games, the population\nbehavior in the game will evolve according to a fixed model described by f and \u03c0. The model\nparameters \u03c6, \u03c8 may still depend on the treatment assignment Z.\n\n3 Estimation of long-term causal effects\n\nHere we develop the assumptions that are necessary for inference of long-term causal effects.\n\nAssumption 2 (Stability of initial behaviors) Let \u03c1Z = \u2211i\u2208I Zi/|I| be the proportion of agents\nassigned to the new policy under assignment Z. Then, for every possible assignment Z,\n\n\u03c1Z\u03b21(0; Z) + (1 \u2212 \u03c1Z)\u03b20(0; Z) = \u03b2(0),\n\n(4)\n\nwhere \u03b2(0) is a fixed population behavior invariant to Z.\n\nAssumption 3 (Behavioral ignorability) The assignment is independent of population behavior at\ntime t, conditional on policy and behavioral history up to t; i.e., for every t > 0 and policy j,\n\nZ \u22a5 \u03b2j(t; Z) | Ft\u22121, Gj.\n\nRemarks. Assumption 2 implies that the agents do not anticipate the assignment Z as they \u201chave\nmade up their minds\u201d to adopt a population behavior \u03b2(0) before the experiment. Quantities such as\nthat in Eq. 
(4) are crucial in causal inference because they can be used as a pivot for extrapolation\nacross assignments. Assumption 3 states that the treatment assignment does not add information\nabout the population behavior at time t, if we already know the full behavioral history up to t\nand the policy which agents are assigned to; hence, the treatment assignment is conditionally ignorable. This ignorability assumption precludes, for instance, an agent adopting a different behavior\ndepending on whether it was assigned with friends or foes in the experiment.\nAlgorithm 1 is the main methodological contribution of this paper. It is a Bayesian procedure as it\nputs priors on parameters \u03c6, \u03c8 of the temporal model, and then marginalizes these parameters out.\n\nAlgorithm 1 Estimation of long-term causal effects\nInput: Z, T, A, B, G1, G0, D1 = {a1(t; Z) : t = 0, . . . , t0}, D0 = {a0(t; Z) : t = 0, . . . , t0}.\nOutput: Estimate of long-term causal effect CE(T) in Eq. (1).\n1: By Assumption 3, define \u03c6j \u2261 \u03c6j(Z), \u03c8j \u2261 \u03c8j(Z).\n2: Set \u00b51 \u2190 0 and \u00b50 \u2190 0, both of size |A|; set \u03bd0 = \u03bd1 = 0.\n3: for iter = 1, 2, . . . do\n4:   For j = 0, 1, sample \u03c6j, \u03c8j from prior, and sample \u03b2j(0; Z) conditional on \u03c6j.\n5:   Calculate \u03b2(0) = \u03c1Z\u03b21(0; Z) + (1 \u2212 \u03c1Z)\u03b20(0; Z).\n6:   for j = 0, 1 do\n7:     Set \u03b2j(0; j1) = \u03b2(0).\n8:     Sample Bj = {\u03b2j(t; j1) : t = 0, . . . , T} given \u03c8j and \u03b2j(0; j1).  # temporal model\n9:     Sample \u03b1j(T; j1) conditional on \u03b2j(T; j1).  # behavioral model\n10:    Set \u00b5j \u2190 \u00b5j + P(Dj|Bj, Gj) \u00b7 R(\u03b1j(T; j1)).\n11:    Set \u03bdj \u2190 \u03bdj + P(Dj|Bj, Gj).\n12:   end for\n13: end for\n14: Return estimate \u0108E(T) = \u00b51/\u03bd1 \u2212 \u00b50/\u03bd0.\n\nTheorem 1 (Estimation of long-term causal effects) Suppose that behaviors evolve according to\na known temporal model, and actions are distributed conditionally on behaviors according to a\nknown behavioral model. Suppose that Assumptions 1, 2 and 3 hold for such models. Then, for\nevery policy j \u2208 {0, 1}, as the iterations of Algorithm 1 increase, \u00b5j/\u03bdj \u2192 E(R(\u03b1j(T; j1)) | Dj).\nThe output \u0108E(T) of Algorithm 1 asymptotically estimates the long-term causal effect, i.e.,\n\nE(\u0108E(T)) = E(R(\u03b11(T; 1)) \u2212 R(\u03b10(T; 0))) \u2261 CE(T).\n\nRemarks. Theorem 1 shows that \u0108E(T) consistently estimates the long-term causal effect in Eq. (1).\nWe note that it is also possible to derive the variance of this estimator with respect to the randomization distribution of assignment Z. To do so we first create a set of assignments Z by repeatedly\nsampling Z according to the experimental design. Then we adapt Algorithm 1 so that (i) Step 4 is\nremoved; (ii) in Step 5, \u03b2(0) is sampled from its posterior distribution conditional on observed data,\nwhich can be obtained from the original Algorithm 1. The empirical variance of the outputs over\nZ from the adapted algorithm estimates the variance of the output \u0108E(T) of the original algorithm.\nWe leave the full characterization of this variance estimation procedure for future work.\n\n3.1 Discussion\n\nMethodologically, our approach is aligned with the idea that for long-term causal effects we need a\nmodel for outcomes that leverages structural information pertaining to how outcomes are generated\nand how they evolve. In our application such structural information is the microeconomic information that dictates what agent behaviors are successful in a given policy and how these behaviors\nevolve over time.\nIn particular, Step 1 in the algorithm relies on Assumptions 2 and 3 to infer that the model parameters\n\u03c6j, \u03c8j are stable with respect to treatment assignment. Step 5 of the algorithm is the key estimation\npivot, which uses Assumption 2 to extrapolate from the experimental assignment Z to the counterfactual assignments Z = 1 and Z = 0, as required in our problem. Having pivoted to such\ncounterfactual assignments, it is then possible to use the temporal model parameters \u03c8j, which are\nunaffected by the pivot under Assumption 3, to sample population behaviors up to long-term T, and\nsubsequently sample agent actions at T (Steps 8 and 9).\nThus, a lot of burden is placed on the behavioral game-theoretic model to predict agent actions,\nand the accuracy of such models is still not settled [10]. However, it does not seem necessary\nthat such prediction is completely accurate, but rather that the behavioral models can pull relevant\ninformation from data that would otherwise be inaccessible without game theory, thereby improving\nover classical methods. A formal assessment of such improvement, e.g., using information theory,\nis open for future work. 
An empirical assessment can be supported by the extensive literature in\nbehavioral game theory [20, 15], which has been successful in predicting human actions in real-world experiments [22].\nAnother limitation of our approach is Assumption 1, which posits that there is a finite set of predefined behaviors. A nonparametric approach where behaviors are estimated on-the-fly might do\nbetter. In addition, the long-term horizon, T, also needs to be defined a priori. We should be careful\nhow T interferes with the temporal model since such a model implies a time T\u2032 at which population\nbehavior reaches stationarity. Thus if T\u2032 \u2264 T we implicitly assume that the long-term causal effect\nof interest pertains to a stationary regime (e.g., Nash equilibrium), but if T\u2032 > T we assume that the\neffect pertains to a transient regime, and therefore the policy evaluation might be misguided.\n\n4 Application: Long-term causal effects from a behavioral experiment\n\nIn this section, we apply our methodology to experimental data from Rapoport and Boebel [18],\nas reported by McKelvey and Palfrey [15]. The experiment consisted of a series of zero-sum two-agent games, and aimed at examining the hypothesis that human players play according to minimax\nsolutions of the game, the so-called minimax hypothesis initially suggested by von Neumann and\nMorgenstern [21]. Here we repurpose the data in a slightly artificial way, including how we construct\nthe designer\u2019s objective. This enables a suitable demonstration of our approach.\nEach game in the experiment was a simultaneous-move game with five discrete actions for the row\nplayer and five actions for the column player. 
The structure of the payoff matrix, given in the\nsupplement in Table 1, is parametrized by two values, namely W and L; the experiment used two\ndifferent versions of payoff matrices, corresponding to payments by the row agent to the column\nagent when the row agent won (W), or lost (L): modulo a scaling factor, Rapoport and Boebel [18]\nused (W, L) = ($10, \u2212$6) for game 0 and (W, L) = ($15, \u2212$1) for game 1.\nForty agents, I = {1, 2, . . . , 40}, were randomized to one game design (20 agents per game), and\neach agent played once as row and once as column, matched against two different agents. Every\nmatch-up between a pair of agents lasted for two periods of 60 rounds, with each round consisting\nof a selection of an action from each agent and a payment. Thus, each agent played for four periods\nand 240 rounds in total. If Z is the entire assignment vector of length 40, Zi = 1 means that agent\ni was assigned to game 1 with payoff matrix (W, L) = ($15, \u2212$1) and Zi = 0 means that i was\nassigned to game 0 with payoff matrix (W, L) = ($10, \u2212$6).\nIn adapting the data, we take advantage of the randomization in the experiment, and ask a question\nin regard to long-term causal effects. In particular, assuming that agents pay a fee for each action\ntaken, which accounts for the revenue of the game, we ask the following question:\n\u201cWhat is the long-term causal effect on revenue if we switch from payoffs (W, L) = ($10, \u2212$6) of\ngame 0 to payoffs (W, L) = ($15, \u2212$1) of game 1?\u201d\nThe games induced by the two aforementioned payoff matrices represent the two different policies\nwe wish to compare. To evaluate our method, we consider the last period as long-term, and hold out\ndata from this period. We define the causal estimand in Eq. (1) as\n\nCE = c\u22a4(\u03b11(T; 1) \u2212 \u03b10(T; 0)),\n\n(5)\n\nwhere T = 3 and c is a vector of coefficients. 
The interpretation is that, given an element ca of c, the\nagent playing action a is assumed to pay a constant fee ca. To check the robustness of our method\nwe test Algorithm 1 over multiple values of c.\n\n4.1 Implementation of Algorithm 1 and results\n\nHere we demonstrate how Algorithm 1 can be applied to estimate the long-term causal effect in\nEq. (5) on the Rapoport & Boebel dataset. To this end we clarify Algorithm 1 step by step, and give\nmore details in the supplement.\nStep 1: Model parameters. For simplicity we assume that the models in the two games share\ncommon parameters, and thus (\u03c61, \u03c81, \u03bb1) = (\u03c60, \u03c80, \u03bb0) \u2261 (\u03c6, \u03c8, \u03bb), where \u03bb are the parameters of the behavioral model to be described in Step 9. Having common parameters also acts as\nregularization and thus helps estimation.\nStep 4: Sampling parameters and initial behaviors. As explained later, we assume that there are\n3 different behaviors and thus \u03c6, \u03c8, \u03bb are vectors with 3 components. Let x \u223c U(m, M) denote\nthat every component of x is uniform on (m, M), independently. We choose diffuse priors for our\nparameters, specifically, \u03c6 \u223c U(0, 10), \u03c8 \u223c U(\u22125, 5), and \u03bb \u223c U(\u221210, 10). Given \u03c6 we sample\nthe initial behaviors as Dirichlet, i.e., \u03b21(0; Z) \u223c Dir(\u03c6) and \u03b20(0; Z) \u223c Dir(\u03c6), independently.\nSteps 5 & 7: Pivot to counterfactuals. Since we have a completely randomized experiment (A/B\ntest) it holds that \u03c1Z = 0.5 and therefore \u03b2(0) = 0.5(\u03b21(0; Z) + \u03b20(0; Z)). Now we can pivot to the\ncounterfactual population behaviors under Z = 1 and Z = 0 by setting \u03b21(0; 1) = \u03b20(0; 0) = \u03b2(0).\nStep 8: Sample counterfactual behavioral history. As the temporal model, we adopt the lag-one\nvector autoregressive model, also known as VAR(1). 
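A generic sketch of such a lag-one autoregression on log-ratio-transformed behavior frequencies may help fix ideas; the transform below follows the footnoted definition y[i] = log(x[i+1]/x[1]), but the function names and the example parameter values are ours, not the paper's fitted quantities:

```python
import numpy as np

rng = np.random.default_rng(2)

def logit(beta):
    """Additive log-ratio transform: simplex point -> unconstrained vector,
    y[i] = log(beta[i+1] / beta[0])."""
    return np.log(beta[1:] / beta[0])

def logit_inv(w):
    """Unique inverse: map an unconstrained vector back onto the simplex."""
    e = np.concatenate([[1.0], np.exp(w)])
    return e / e.sum()

def simulate_var1(beta_start, psi, T):
    """w_t = psi[0]*1 + psi[1]*w_{t-1} + psi[2]*eps_t, with eps_t ~ N(0, I);
    behaviors are recovered on the simplex via the inverse transform."""
    w = logit(beta_start)
    path = [np.asarray(beta_start, dtype=float)]
    for _ in range(T):
        w = psi[0] + psi[1] * w + psi[2] * rng.standard_normal(w.shape)
        path.append(logit_inv(w))
    return np.array(path)

path = simulate_var1(np.array([0.2, 0.3, 0.5]), psi=(0.0, 0.9, 0.1), T=10)
```

Every simulated behavior stays on the simplex by construction, which is the point of working in the transformed space.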
We transform1 the population behavior into\na new variable wt = logit(\u03b21(t; 1)) \u2208 R2 (we also do so for \u03b20(t; 0)). Such a transformation with a\nunique inverse is necessary because population behaviors are constrained on the simplex, and thus\nform so-called compositional data [2, 9]. The VAR(1) model implies that\n\nwt = \u03c8[1]1 + \u03c8[2]wt\u22121 + \u03c8[3]\u03b5t,\n\n(6)\n\nwhere \u03c8[k] is the kth component of \u03c8 and \u03b5t \u223c N(0, I) is i.i.d. standard bivariate normal. Eq. (6)\nis used to sample the behavioral history, Bj, in Step 8 of Algorithm 1.\nStep 9: Behavioral model. For the behavioral model, we adopt the quantal p-response (QLp)\nmodel [20], which has been successful in predicting human actions in real-world experiments [22].\nWe choose p = 3 behaviors, namely B = {b0, b1, b2} of increased sophistication, parametrized by\n\u03bb = (\u03bb[1], \u03bb[2], \u03bb[3]) \u2208 R3. Let Gj denote the 5 \u00d7 5 payoff matrix of game j and let the term\nstrategy denote a distribution over all actions. An agent with behavior b0 plays the uniform strategy,\n\nP (Ai(t; Z) = a|Bi(t; Z) = b0, Gj) = 1/5.\n\nAn agent of level-1 (row player) assumes it is playing only against level-0 agents and thus expects\nper-action profit u1 = (1/5)Gj1 (for the column player we use the transpose of Gj). The level-1 agent\nwill then play a strategy proportional to e\u03bb[1]u1, where ex for vector x denotes the element-wise\nexponentiation, ex = (ex[k]). The precision parameter \u03bb[1] determines how much an agent insists\non maximizing expected utility; for example, if \u03bb[1] = \u221e, the agent plays the action with maximum\nexpected payoff (best response); if \u03bb[1] = 0, the agent acts as a level-0 agent. An agent of level-2 (row player) assumes it is playing only against level-1 agents with precision \u03bb[2] and therefore\nexpects to face a strategy proportional to e\u03bb[2]u1. 
Thus its expected per-action profit is u2 \u221d Gje\u03bb[2]u1,\nand it plays a strategy \u221d e\u03bb[3]u2.\nGiven Gj and \u03bb we calculate a 5 \u00d7 3 matrix Qj where the kth column is the strategy played by an\nagent with behavior bk\u22121. The expected population action is therefore \u00af\u03b1j(t; Z) = Qj\u03b2j(t; Z). The\npopulation action \u03b1j(t; Z) is distributed as a normalized multinomial random variable with expectation \u00af\u03b1j(t; Z), and so P (\u03b1j(t; 1)|\u03b2j(t; 1), Gj) = Multi(|I| \u00b7 \u03b1j(t; 1); \u00af\u03b1j(t; 1)), where Multi(n; p)\nis the multinomial density of observations n = (n1, . . . , nK) with probabilities p = (p1, . . . , pK).\nHence, the full likelihood for observed actions in game j in Steps 10 and 11 of Algorithm 1 is given\nby the product\n\nP (Dj|Bj, Gj) = \u220f_{t=0}^{T\u22121} Multi(|I| \u00b7 \u03b1j(t; j1); \u00af\u03b1j(t; j1)).\n\nRunning Algorithm 1 on the Rapoport and Boebel dataset yields the estimates shown in Figure 2,\nfor 25 different fee vectors c, where each component ca is sampled uniformly at random from (0, 1).\n1y = logit(x) is defined as the function \u2206m \u2192 Rm\u22121, y[i] = log(x[i + 1]/x[1]), where x[1] \u2260 0 wlog.\n\nFigure 2: Estimates of long-term effects of different methods corresponding to 25 random objective\ncoefficients c in Eq. (5). For estimates of our method we ran Algorithm 1 for 100 iterations.\n\nWe also test difference-in-differences (DID), which estimates the causal effect through\n\n\u03c4\u0302^did = [R(\u03b11(2; Z)) \u2212 R(\u03b11(0; Z))] \u2212 [R(\u03b10(2; Z)) \u2212 R(\u03b10(0; Z))],\n\nand a naive method (\u201cnaive\u201d in the plot), which ignores the dynamical aspect and estimates the long-term causal effect as \u03c4\u0302^nai = [R(\u03b11(2; Z)) \u2212 R(\u03b10(2; Z))]. 
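The level construction described above can be sketched directly; function names are ours, and the 2\u00d72 payoff matrix is a hypothetical stand-in for the paper's 5\u00d75 games (where Qj would be 5\u00d73):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def ql3_strategies(G, lam):
    """Columns are the strategies of behaviors b0, b1, b2 for the row player of
    a game with payoff matrix G: b0 uniform; b1 soft best-responds to b0 with
    precision lam[0] (lambda[1] in the text); b2 best-responds to an assumed
    lam[1]-precision level-1 opponent, with its own precision lam[2]."""
    n = G.shape[0]
    s0 = np.full(n, 1.0 / n)       # level-0: uniform strategy
    u1 = G @ s0                    # expected per-action profit vs level-0
    s1 = softmax(lam[0] * u1)      # level-1 strategy, proportional to e^{lam*u1}
    u2 = G @ softmax(lam[1] * u1)  # profit vs assumed level-1 opponent
    s2 = softmax(lam[2] * u2)      # level-2 strategy
    return np.column_stack([s0, s1, s2])

G = np.array([[10.0, -6.0], [-6.0, 10.0]])  # hypothetical payoff matrix
Q = ql3_strategies(G, lam=(1.0, 1.0, 1.0))
beta = np.array([0.2, 0.5, 0.3])            # population behavior
alpha_bar = Q @ beta                        # expected population action Q_j beta_j
```

Each column of Q is a distribution over actions, so the expected population action Q @ beta also lies on the simplex whenever beta does.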
Our estimates ("LACE" in the plot) are closer to the truth (mse = 0.045) than the estimates from the naive method (mse = 0.185) and from DID (mse = 0.361). This illustrates that our method can pull game-theoretic information from the data for long-term causal inference, whereas the other methods cannot.

5 Conclusion

One critical shortcoming of statistical methods of causal inference is that they typically do not assess the long-term effect of policy changes. Here we combined causal inference and game theory to build a framework for estimating such long-term effects in multiagent economies. Central to our approach is behavioral game theory, which provides a natural latent-space model of how agents act and how their actions evolve over time. Such models make it possible to predict how agents would act under various policy assignments and at various time points, which is key for valid causal inference. Working with a real-world dataset [18], we showed how our framework can be applied to estimate the long-term effect of changing the payoff structure of a normal-form game.

Our framework could be extended in future work by incorporating learning (e.g., fictitious play, bandits, no-regret learning) to better model the dynamic response of multiagent systems to policy changes. Another interesting extension would be to use our framework for the optimal design of experiments in such systems, which needs to account for heterogeneity in agent learning capabilities and for the intrinsic dynamical properties of the systems' responses to experimental treatments.

Acknowledgements

The authors wish to thank Léon Bottou, the organizers and participants of CODE@MIT'15, GAMES'16, the Workshop on Algorithmic Game Theory and Data Science (EC'15), and the anonymous NIPS reviewers for their valuable feedback. Panos Toulis has been supported in part by the 2012 Google US/Canada Fellowship in Statistics. David C.
Parkes was supported in part by NSF grant CCF-1301976 and the SEAS TomKat fund.

References

[1] Alberto Abadie. Semiparametric difference-in-differences estimators. The Review of Economic Studies, 72(1):1–19, 2005.

[2] John Aitchison. The statistical analysis of compositional data. Springer, 1986.

[3] Joshua D. Angrist and Jörn-Steffen Pischke. Mostly harmless econometrics: An empiricist's companion. Princeton University Press, 2008.

[4] Susan Athey, Jonathan Levin, and Enrique Seira. Comparing open and sealed bid auctions: Evidence from timber auctions. The Quarterly Journal of Economics, 126(1):207–257, 2011.

[5] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems. Journal of Machine Learning Research, 14:3207–3260, 2013.

[6] David Card and Alan B. Krueger. Minimum wages and employment: A case study of the fast food industry in New Jersey and Pennsylvania. American Economic Review, 84(4):772–793, 1994.

[7] Stephen G. Donald and Kevin Lang. Inference with difference-in-differences and other panel data. The Review of Economics and Statistics, 89(2):221–233, 2007.

[8] Ronald Aylmer Fisher. The design of experiments. Oliver & Boyd, 1935.

[9] Gary K. Grunwald, Adrian E. Raftery, and Peter Guttorp. Time series of continuous proportions. Journal of the Royal Statistical Society, Series B (Methodological), pages 103–116, 1993.

[10] P. Richard Hahn, Indranil Goswami, and Carl F. Mela. A Bayesian hierarchical model for inferring player strategy types in a number guessing game. The Annals of Applied Statistics, 9(3):1459–1483, 2015.

[11] James J. Heckman, Lance Lochner, and Christopher Taber. General equilibrium treatment effects: A study of tuition policy.
American Economic Review, 88(2):381–386, 1998.

[12] James J. Heckman and Edward Vytlacil. Structural equations, treatment effects, and econometric policy evaluation. Econometrica, 73(3):669–738, 2005.

[13] John H. Holland and John H. Miller. Artificial adaptive agents in economic theory. The American Economic Review, pages 365–370, 1991.

[14] Paul W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.

[15] Richard D. McKelvey and Thomas R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, 1995.

[16] Michael Ostrovsky and Michael Schwarz. Reserve prices in internet advertising auctions: A field experiment. In Proceedings of the 12th ACM Conference on Electronic Commerce, pages 59–60. ACM, 2011.

[17] Judea Pearl. Causality: models, reasoning and inference. Cambridge University Press, 2000.

[18] Amnon Rapoport and Richard B. Boebel. Mixed strategies in strictly competitive games: A further test of the minimax hypothesis. Games and Economic Behavior, 4(2):261–283, 1992.

[19] Donald B. Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 2011.

[20] Dale O. Stahl and Paul W. Wilson. Experimental evidence on players' models of other players. Journal of Economic Behavior & Organization, 25(3):309–327, 1994.

[21] J. von Neumann and O. Morgenstern. Theory of games and economic behavior. Princeton University Press, 1944.

[22] James R. Wright and Kevin Leyton-Brown. Beyond equilibrium: Predicting human behavior in normal-form games. In Proc. 24th AAAI Conf.
on Artificial Intelligence, 2010.