{"title": "Deep Learning for Predicting Human Strategic Behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 2424, "page_last": 2432, "abstract": "Predicting the behavior of human participants in strategic settings is an important problem in many domains. Most existing work either assumes that participants are perfectly rational, or attempts to directly model each participant's cognitive processes based on insights from cognitive psychology and experimental economics. In this work, we present an alternative, a deep learning approach that automatically performs cognitive modeling without relying on such expert knowledge.  We introduce a novel architecture that allows a single network to generalize across different input and output dimensions by using matrix units rather than scalar units, and show that its performance significantly outperforms that of the previous state of the art, which relies on expert-constructed features.", "full_text": "Deep Learning for Predicting\nHuman Strategic Behavior\n\nJason Hartford, James R. Wright, Kevin Leyton-Brown\n\nDepartment of Computer Science\nUniversity of British Columbia\n\n{jasonhar, jrwright, kevinlb}@cs.ubc.ca\n\nAbstract\n\nPredicting the behavior of human participants in strategic settings is an important\nproblem in many domains. Most existing work either assumes that participants\nare perfectly rational, or attempts to directly model each participant\u2019s cognitive\nprocesses based on insights from cognitive psychology and experimental economics.\nIn this work, we present an alternative, a deep learning approach that automatically\nperforms cognitive modeling without relying on such expert knowledge. 
We\nintroduce a novel architecture that allows a single network to generalize across\ndifferent input and output dimensions by using matrix units rather than scalar units,\nand show that its performance signi\ufb01cantly outperforms that of the previous state\nof the art, which relies on expert-constructed features.\n\n1\n\nIntroduction\n\nGame theory provides a powerful framework for the design and analysis of multiagent systems\nthat involve strategic interactions [see, e.g., 16]. Prominent examples of such systems include\nsearch engines, which use advertising auctions to generate a signi\ufb01cant portion of their revenues\nand rely on game theoretic reasoning to analyze and optimize these mechanisms [6, 20]; spectrum\nauctions, which rely on game theoretic analysis to carefully design the \u201crules of the game\u201d in order to\ncoordinate the reallocation of valuable radio spectrum [13]; and security systems, which analyze the\nallocation of security personnel as a game between rational adversaries in order to optimize their use\nof scarce resources [19]. In such applications, system designers optimize their choices with respect\nto assumptions about the preferences, beliefs and capabilities of human players [14]. A standard\ngame theoretic approach is to assume that players are perfectly rational expected utility maximizers\nand indeed, that they have common knowledge of this. In some applications, such as the high-stakes\nspectrum auctions just mentioned, this assumption is probably reasonable, as participants are typically\nlarge companies that hire consultants to optimize their decision making. In other scenarios that\nallow less time for planning or involve less sophisticated participants, however, the perfect rationality\nassumption may lead to suboptimal system designs. For example, Yang et al. [24] were able to\nimprove the performance of systems that defend against adversaries in security games by relaxing the\nperfect rationality assumption. 
Of course, relaxing this assumption means finding something else to\nreplace it with: an accurate model of boundedly rational human behavior.\nThe behavioral game theory literature has developed a wide range of models for predicting human\nbehavior in strategic settings by incorporating cognitive biases and limitations derived from\nobservations of play and insights from cognitive psychology [2]. Like much previous work, we\nstudy the unrepeated, simultaneous-move setting, for two reasons. First, the setting is conceptually\nstraightforward: games can be represented in a so-called \u201cnormal form\u201d, simply by listing the utilities\nto each player for each combination of their actions (e.g., see Figure 1). Second, the setting is\nsurprisingly general: auctions, security systems, and many other interactions can be modeled naturally\nas normal form games. The most successful predictive models for this setting combine notions of\niterative reasoning and noisy best response [21] and use hand-crafted features to model the behavior\nof non-strategic players [23].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: An example 3 \u00d7 3 normal form game. The row player chooses from actions {T, M, B}\nand the column player chooses from actions {R, C, L}. If the row player played action T and the\ncolumn player played action C, their resulting payoffs would be 3 and 5 respectively. Given such a\nmatrix as input, we aim to predict a distribution over the row player\u2019s choice of actions defined by\nthe observed frequency of actions shown on the right.\n\nThe recent success of deep learning has demonstrated that predictive accuracy can often be enhanced,\nand expert feature engineering dispensed with, by fitting highly flexible models that are capable of\nlearning novel representations. 
A key feature in successful deep models is the use of careful design\nchoices to encode \u201cbasic domain knowledge of the input, in particular its topological structure. . . to\nlearn better features\" [1, emphasis original]. For example, feed-forward neural nets can, in principle,\nrepresent the same functions as convolution networks, but the latter tend to be more effective in\nvision applications because they encode the prior that low-level features should be derived from the\npixels within a small neighborhood and that predictions should be invariant to small input translations.\nAnalogously, Clark and Storkey [4] encoded the fact that a Go board is invariant to rotations. These\nmodeling choices constrain more general architectures to a subset of the solution space that is likely\nto contain good solutions. Our work seeks to do the same for the behavioral game theory setting,\nidentifying novel prior assumptions that extend deep learning to predicting behavior in strategic\nscenarios encoded as two player, normal-form games.\nA key property required of such a model is invariance to game size: a model must be able to take\nas input an m \u00d7 n bimatrix game (i.e., two m \u00d7 n matrices encoding the payoffs of players 1 and\n2 respectively) and output an m-dimensional probability distribution over player 1\u2019s actions, for\narbitrary values of n and m, including values that did not appear in training data. In contrast, existing\ndeep models typically assume either a \ufb01xed-dimensional input or an arbitrary-length sequence of\n\ufb01xed-dimensional inputs, in both cases with a \ufb01xed-dimensional output. We also have the prior belief\nthat permuting rows and columns in the input (i.e., changing the order in which actions are presented\nto the players) does not change the output beyond a corresponding permutation. 
In Section 3, we\npresent an architecture that operates on matrices using scalar weights to capture invariance to changes\nin the size of the input matrices and to permutations of their rows and columns. In Section 4 we evaluate\nour model\u2019s ability to predict distributions of play given normal form descriptions of games on a\ndataset of experimental data from a variety of experiments, and find that our feature-free deep learning\nmodel significantly exceeds the performance of the current state-of-the-art model, which has access\nto hand-tuned features based on expert knowledge [23].\n\n2 Related Work\n\nPrediction in normal form games. The task of predicting actions in normal form games has been\nstudied mostly in the behavioral game theory literature. Such models tend to have few parameters and\nto aim to describe previously identified cognitive processes. Two key ideas are the relaxation of best\nresponse to \u201cquantal response\u201d and the notion of \u201climited iterative strategic reasoning\u201d. Quantal\nresponse models assume that players select actions with probability increasing in expected\nutility instead of always selecting the action with the largest expected utility [12]. This is expressed\nformally by assuming that players select actions, $a_i$, with probability, $s_i$, given by the logistic quantal\nresponse function\n\n$$s_i(a_i) = \\frac{\\exp(\\lambda u_i(a_i, s_{-i}))}{\\sum_{a_i'} \\exp(\\lambda u_i(a_i', s_{-i}))}.$$\n\nThis function is equivalent to the familiar softmax\nfunction with an additional scalar sharpness parameter $\\lambda$ that allows the function to output the best\nresponse as $\\lambda \\to \\infty$ and the uniform distribution as $\\lambda \\to 0$. This relaxation is motivated by the\nbehavioral notion that if two actions have similar expected utility then they will also have similar\nprobability of being chosen. 
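
As a minimal sketch of the logistic quantal response function above, the following pure-Python snippet computes choice probabilities from expected utilities. The function names, the 2 x 2 payoffs and the uniform opponent strategy are illustrative assumptions and do not come from the paper. Subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math

def expected_utilities(payoffs, opponent_strategy):
    # payoffs[i][j]: utility of own action i against opponent action j.
    return [sum(p * q for p, q in zip(row, opponent_strategy)) for row in payoffs]

def quantal_response(utilities, lam):
    # Logistic quantal response: softmax of lam * expected utility.
    scaled = [lam * u for u in utilities]
    top = max(scaled)  # shift for numerical stability only
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 2 x 2 payoffs for the row player and a uniform opponent.
payoffs = [[3.0, 0.0], [1.0, 1.0]]
eu = expected_utilities(payoffs, [0.5, 0.5])

uniform_like = quantal_response(eu, 0.0)   # lam = 0: uniform distribution
sharp = quantal_response(eu, 50.0)         # large lam: close to best response
```

With lam = 0 the result is uniform, and as lam grows the distribution concentrates on the action with the highest expected utility, matching the two limits described above.
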
Iterative strategic reasoning means that players perform a bounded\nnumber of steps of reasoning in deciding on their actions, rather than always converging to fixed\npoints as in classical game theory. Models incorporating this idea typically assume that every agent\nhas an integer level. Non-strategic, \u201clevel-0\u201d players choose actions uniformly at random; level-k\nplayers best respond to the level-(k \u2212 1) players [5] or to a mixture of levels between level-0 and\nlevel-(k \u2212 1) [3]. The two ideas can be combined, allowing players to quantally respond to lower\nlevel players [18, 22]. Because iterative reasoning models are defined recursively starting from a\nbase-case of level-0 behavior, their performance can be improved by better modeling the non-strategic\nlevel-0 players. Wright and Leyton-Brown [23] combine quantal response and bounded steps of\nreasoning with a model of non-strategic behavior based on hand-crafted game theoretic features. To\nthe best of our knowledge, this is the current state-of-the-art model.\n\nDeep learning. Deep learning has demonstrated much recent success in solving supervised learning\nproblems in vision, speech and natural language processing [see, e.g., 9, 15]. By contrast, there have\nbeen relatively few applications of deep learning to multiagent settings. Notable exceptions are Clark\nand Storkey [4] and the policy network used in Silver et al. [17]\u2019s work in predicting the actions\nof human players in Go. Their approach is similar in spirit to ours: they map from a description\nof the Go board at every move to the choices made by human players, while we perform the same\nmapping from a normal form game. 
The setting differs in that Go is a single, sequential, zero-sum\ngame with a far larger, but fixed, action space, which requires an architecture tailored for pattern\nrecognition on the Go board. In contrast, we focus on constructing an architecture that generalizes\nacross general-sum, normal form games.\nWe enforce invariance to the size of the network\u2019s input. Fully convolutional networks [11] achieve\ninvariance to the image size in a similar manner by replacing all fully connected layers with\nconvolutions. In its architectural design, our model is mathematically similar to Lin et al. [10]\u2019s Network in\nNetwork model, though we derived our architecture independently using game theoretic invariances.\nWe discuss the relationships between the two models at the end of Section 3.\n\n3 Modeling Human Strategic Behavior with Deep Networks\n\nA natural starting point in applying deep networks to a new domain is testing the performance of a\nregular feed-forward neural network. To apply such a model to a normal form game, we need to flatten\nthe utility values into a single vector of length mn + nm and learn a function that maps to the\nm-simplex output via multiple hidden layers. Feed-forward networks can\u2019t handle inputs of varying size,\nbut we can temporarily set that problem aside by restricting ourselves to games with a fixed input\nsize. We experimented with that approach and found that feed-forward networks often generalized\npoorly as the network overfitted the training data (see Section 2 of the supplementary material for\nexperimental evidence). One way of combating overfitting is to encourage invariance through data\naugmentation: for example, one may augment a dataset of images by rotating, shifting and scaling\nthe images slightly. 
In games, a natural simplifying assumption is that players are indifferent to the\norder in which actions are presented, implying invariance to permutations of the payoff matrix. (We\nthus ignore salience effects that could arise from action ordering; we plan to explore this in future work.)\nIncorporating this assumption by randomly permuting rows or columns of the payoff matrix at every\nepoch of training dramatically improved the generalization performance of a feed-forward network in\nour experiments, but the network is still limited to games of the size that it was trained on.\nOur approach is to enforce this invariance in the model architecture rather than through data\naugmentation. We then add further flexibility using novel \u201cpooling units\u201d and by incorporating iterative\nresponse ideas inspired by behavioral game theory models. The result is a model that is flexible\nenough to represent all the models surveyed in Wright and Leyton-Brown [22, 23] (and a huge\nspace of novel models as well), and which can be identified automatically. The model is also\ninvariant to the size of the input payoff matrix, differentiable end to end and trainable using standard\ngradient-based optimization.\nThe model has two parts: feature layers and action response layers; see Figure 2 for a graphical\noverview. The feature layers take the row and column player\u2019s normalized utility matrices $U^{(r)}$ and\n$U^{(c)} \\in \\mathbb{R}^{m \\times n}$ as input, where the row player has m actions and the column player has n actions.\nThe feature layers consist of multiple levels of hidden matrix units, $H^{(r)}_{i,j} \\in \\mathbb{R}^{m \\times n}$, each of which\ncalculates a weighted sum of the units below and applies a non-linear activation function. Each\nlayer of hidden units is followed by pooling units, which output aggregated versions of the hidden\nmatrices to be used by the following layer. After multiple layers, the matrices are aggregated to\nvectors and normalized to a distribution over actions, $f^{(r)}_i \\in \\Delta^m$, in softmax units. We refer to these\ndistributions as features because they encode higher-level representations of the input matrices that\nmay be combined to construct the output distribution.\n\nFigure 2: A schematic representation of our architecture. The feature layers consist of hidden matrix\nunits (orange), each of which uses pooling units to output row- and column-preserving aggregates\n(blue and purple) before being reduced to distributions over actions in the softmax units (red). Iterative\nresponse is modeled using the action response layers (green) and the final output, y, is a weighted\nsum of the row player\u2019s action response layers.\n\nAs discussed earlier, iterative strategic reasoning is an important phenomenon in human decision\nmaking; we thus want to allow our models the option of incorporating such reasoning. To do so, we\ncompute features for the column player in the same manner by applying the feature layers to the\ntranspose of the input matrices, which outputs $f^{(c)}_i \\in \\Delta^n$. Each action response layer for a given\nplayer then takes the opposite player\u2019s preceding action response layers as input and uses them to\nconstruct distributions over the respective players\u2019 outputs. The final output $y \\in \\Delta^m$ is a weighted\nsum of all action response layers\u2019 outputs.\n\nInvariance-Preserving Hidden Units We build a model that ties parameters in our network by\nencoding the assumption that players reason about each action identically. This assumption implies\nthat the row player applies the same function to each row of a given game\u2019s utility matrices. 
Thus, in\na normal form game represented by the utility matrices $U^{(r)}$ and $U^{(c)}$, the weights associated with\neach row of $U^{(r)}$ and $U^{(c)}$ must be the same. Similarly, the corresponding assumption about the\ncolumn player implies that the weights associated with each column of $U^{(r)}$ and $U^{(c)}$ must also be\nthe same. We can satisfy both assumptions by applying a single scalar weight to each of the utility\nmatrices, computing $w_r U^{(r)} + w_c U^{(c)}$. This idea can be generalized as in a standard feed-forward\nnetwork to allow us to fit more complex functions. A hidden matrix unit taking all the preceding\nhidden matrix units as input can be calculated as\n\n$$H_{l,i} = \\phi\\Big(\\sum_j w_{l,i,j} H_{l-1,j} + b_{l,i}\\Big), \\quad H_{l,i} \\in \\mathbb{R}^{m \\times n},$$\n\nwhere $H_{l,i}$ is the ith hidden unit matrix for layer l, $w_{l,i,j}$ is the jth scalar weight, $b_{l,i}$ is a scalar bias\nvariable, and $\\phi$ is a non-linear activation function applied element-wise. Notice that, as in a traditional\nfeed-forward neural network, the output of each hidden unit is simply a nonlinear transformation of\nthe weighted sum of the preceding layer\u2019s hidden units. Our architecture differs by maintaining a\nmatrix at each hidden unit instead of a scalar. So while in a traditional feed-forward network each\nhidden unit maps the previous layer\u2019s vector of outputs into a scalar output, in our architecture each\nhidden unit maps a tensor of outputs from the previous layer into a matrix output.\nTying weights in this way reduces the number of parameters in our network by a factor of nm,\noffering two benefits. First, it reduces the degree to which the network is able to overfit; second and\nmore importantly, it makes the model invariant to the size of the input matrices. To see this, notice\nthat each hidden unit maps from a tensor containing the k output matrices of the preceding layer\nin $\\mathbb{R}^{k \\times m \\times n}$ to a matrix in $\\mathbb{R}^{m \\times n}$ using k weights. Thus the number of parameters in each layer\ndepends on the number of hidden units in the preceding layer, but not on the sizes of the input and\noutput matrices. This allows the model to generalize to input sizes that do not appear in training data.\n\nFigure 3: Left: Without pooling units, each element of every hidden matrix unit depends only on the\ncorresponding elements in the units from the layer below; e.g., the middle element highlighted in\nred depends only on the value of the elements of the matrices highlighted in orange. Right: With\npooling units at each layer in the network, each element of every hidden matrix unit depends both on\nthe corresponding elements in the units below and the pooled quantity from each row and column.\nE.g., the light blue and purple blocks represent the row and column-wise aggregates corresponding to\ntheir adjacent matrices. The dark blue and purple blocks show which of these values the red element\ndepends on. Thus, the red element depends on both the dark- and light-shaded orange cells.\n\nPooling units A limitation of the weight tying used in our hidden matrix units is that it forces\nindependence between the elements of their matrices, preventing the network from learning functions\nthat compare the values of related elements (see Figure 3 (left)). Recall that each element of the\nmatrices in our model corresponds to an outcome in a normal form game. 
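
The hidden matrix unit and the independence limitation just described can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the function names are hypothetical, matrices are plain lists of lists, the weights and payoffs are made up, and ReLU stands in for the activation function, which the text leaves unspecified.

```python
def relu(x):
    # Assumed activation; the text only requires an element-wise non-linearity.
    return x if x > 0.0 else 0.0

def hidden_matrix_unit(inputs, weights, bias):
    # inputs: k matrices (lists of lists) of identical shape m x n.
    # weights: k scalars, one per input matrix; bias: one scalar.
    rows, cols = len(inputs[0]), len(inputs[0][0])
    return [[relu(sum(w * mat[r][c] for w, mat in zip(weights, inputs)) + bias)
             for c in range(cols)]
            for r in range(rows)]

# Hypothetical parameters: two scalar weights and a scalar bias.
weights, bias = [0.5, -1.0], 0.1

# The same parameters apply to a 2 x 3 game ...
u_r = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
u_c = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
h_small = hidden_matrix_unit([u_r, u_c], weights, bias)

# ... and, unchanged, to a 3 x 2 game (here simply the transposes).
u_r_t = [list(col) for col in zip(*u_r)]
u_c_t = [list(col) for col in zip(*u_c)]
h_tall = hidden_matrix_unit([u_r_t, u_c_t], weights, bias)

# Changing one payoff changes only the corresponding output element.
u_r_changed = [row[:] for row in u_r]
u_r_changed[0][0] = 9.0
h_changed = hidden_matrix_unit([u_r_changed, u_c], weights, bias)
changed_cells = [(r, c) for r in range(2) for c in range(3)
                 if h_changed[r][c] != h_small[r][c]]
```

The parameter count (two weights and one bias) is the same for both game sizes, and `changed_cells` contains only the modified position, which is the element-wise independence that motivates the pooling units.
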
A natural game theoretic\nnotion of the \u201crelated elements\u201d which we\u2019d like our model to be able to compare is the set of payoffs\nassociated with each of the players\u2019 actions that led to that outcome. This corresponds to the row and\ncolumn of each matrix associated with the particular element.\nThis observation motivates our pooling units, which allow information sharing by outputting\naggregated versions of their input matrix that may be used by later layers in the network to learn to\ncompare the values of a particular cell in a matrix and its row- or column-wise aggregates.\n\n$$H \\to \\{H_c, H_r\\} = \\left\\{ \\begin{pmatrix} \\max_i h_{i,1} & \\max_i h_{i,2} & \\dots \\\\ \\max_i h_{i,1} & \\max_i h_{i,2} & \\dots \\\\ \\vdots & \\vdots & \\\\ \\max_i h_{i,1} & \\max_i h_{i,2} & \\dots \\end{pmatrix}, \\begin{pmatrix} \\max_j h_{1,j} & \\max_j h_{1,j} & \\dots \\\\ \\max_j h_{2,j} & \\max_j h_{2,j} & \\dots \\\\ \\vdots & \\vdots & \\\\ \\max_j h_{m,j} & \\max_j h_{m,j} & \\dots \\end{pmatrix} \\right\\} \\qquad (1)$$\n\nA pooling unit takes a matrix as input and outputs two matrices constructed from row- and\ncolumn-preserving pooling operations respectively. A pooling operation could be any continuous function that\nmaps $\\mathbb{R}^n \\to \\mathbb{R}$. We use the max function because it is necessary to represent known behavioral\nfunctions (see Section 4 of the supplementary material for details) and offered the best empirical\nperformance of the functions we tested. Equation (1) shows an example of a pooling layer with max\nfunctions for some arbitrary matrix H. The first of the two outputs, $H_c$, is column-preserving in that\nit selects the maximum value in each column of H and then stacks the resulting n-dimensional\nvector m times such that the dimensionality of H and $H_c$ are the same. 
Similarly, the row-preserving\noutput constructs a vector of the max elements in each row and stacks the resulting m-dimensional\nvector n times such that $H_r$ and H have the same dimensionality. We stack the vectors that result\nfrom the pooling operation in this fashion so that the hidden units from the next layer in the network\nmay take H, $H_c$ and $H_r$ as input. This allows these later hidden units to learn functions where each\nelement of their output is a function both of the corresponding element from the matrices below and\nof their row- and column-preserving maximums (see Figure 3 (right)).\n\nSoftmax output Our model predicts a distribution over the row player\u2019s actions. In order to do this,\nwe need to map from the hidden matrices in the final layer, $H_{L,i} \\in \\mathbb{R}^{m \\times n}$, of the network onto a\npoint on the m-simplex, $\\Delta^m$. We achieve this mapping by applying a row-preserving sum to each\nof the final layer hidden matrices $H_{L,i}$ (i.e., we sum uniformly over the columns of the matrix as\ndescribed above) and then applying a softmax function to convert each of the resulting vectors $h^{(i)}$\ninto normalized distributions. This produces k features $f_i$, each of which is a distribution over the\nrow player\u2019s m actions:\n\n$$f_i = \\mathrm{softmax}\\big(h^{(i)}\\big) \\quad \\text{where} \\quad h^{(i)}_j = \\sum_{k=1}^{n} h^{(i)}_{j,k} \\quad \\text{for all } j \\in \\{1, ..., m\\}, \\quad h^{(i)}_{j,k} \\in H^{(i)}, \\quad i \\in \\{1, ..., k\\}.$$\n\nWe can then produce the output of our features, $ar_0$, using a weighted sum of the individual features,\n$ar_0 = \\sum_{i=1}^{k} w_i f_i$, where we optimize $w_i$ under simplex constraints, $w_i \\geq 0$, $\\sum_i w_i = 1$. Because\neach $f_i$ is a distribution and our weights $w_i$ are points on the simplex, the output of the feature layers\nis a mixture of distributions.\n\nAction Response Layers The feature layers described above are sufficient to meet our objective\nof mapping from the input payoff matrices to a distribution over the row player\u2019s actions. However,\nthis architecture is not capable of explicitly representing iterative strategic reasoning, which the\nbehavioral game theory literature has identified as an important modeling ingredient. We incorporate\nthis ingredient using action response layers: the first player can respond to the second\u2019s beliefs,\nthe second can respond to this response by the first player, and so on to some finite depth. The\nproportion of players in the population who iterate at each depth is a parameter of the model; thus,\nour architecture is also able to learn not to perform iterative reasoning.\nMore formally, we begin by denoting the output of the feature layers as $ar^{(r)}_0 = \\sum_{i=1}^{k} w^{(r)}_{0i} f^{(r)}_i$,\nwhere we now include an index (r) to refer to the output of the row player\u2019s action response layer\n$ar^{(r)}_0 \\in \\Delta^m$. Similarly, by applying the feature layers to a transposed version of the input matrices,\nthe model also outputs a corresponding $ar^{(c)}_0 \\in \\Delta^n$ for the column player which expresses the row\nplayer\u2019s beliefs about which actions the column player will choose. Each action response layer\ncomposes its output by calculating the expected value of an internal representation of utility with\nrespect to its belief distribution over the opposition actions. For this internal representation of utility\nwe chose a weighted sum of the final layer of the hidden layers, $\\sum_i w_i H_{L,i}$, because each $H_{L,i}$ is\nalready some non-linear transformation of the original payoff matrix, and so this allows the model to\nexpress utility as a transformation of the original payoffs. Given the matrix that results from this sum,\nwe can compute expected utility with respect to the vector of beliefs about the opposition\u2019s choice of\nactions, $ar^{(c)}_j$, by simply taking the dot product of the weighted sum and beliefs. When we iterate\nthis process of responding to beliefs about one\u2019s opposition more than once, higher-level players will\nrespond to beliefs, $ar_i$, for all i less than their level and then output a weighted combination of these\nresponses using some weights, $v_{l,i}$. Putting this together, the lth action response layer for the row\nplayer (r) is defined as\n\n$$ar^{(r)}_l = \\mathrm{softmax}\\Big(\\lambda_l \\Big(\\sum_{j=0}^{l-1} v^{(r)}_{l,j} \\Big(\\sum_{i=1}^{k} w^{(r)}_{l,i} H^{(r)}_{L,i}\\Big) \\cdot ar^{(c)}_j\\Big)\\Big), \\quad ar^{(r)}_l \\in \\Delta^m, \\quad l \\in \\{1, ..., K\\},$$\n\nwhere l indexes the action response layer, $\\lambda_l$ is a scalar sharpness parameter that allows us to sharpen\nthe resulting distribution, $w^{(r)}_{l,i}$ and $v^{(r)}_{l,j}$ are scalar weights, $H_{L,i}$ are the row player\u2019s k hidden units\nfrom the final hidden layer L, $ar^{(c)}_j$ is the output of the column player\u2019s jth action response layer,\nand K is the total number of action response layers. We constrain $w^{(r)}_{l,i}$ and $v^{(r)}_{l,j}$ to the simplex and\nuse $\\lambda_l$ to sharpen the output distribution so that we can optimize the sharpness of the distribution and\nrelative weighting of its terms independently. We build up the column player\u2019s action response layer,\n$ar^{(c)}_l$, similarly, using the column player\u2019s internal utility representation, $H^{(c)}_{L,i}$, responding to the row\nplayer\u2019s action response layers, $ar^{(r)}_l$. These layers are not used in the final output directly but are\nrelied upon by subsequent action response layers of the row player.\n\nOutput Our model\u2019s final output is a weighted sum of the outputs of the action response layers.\nThis output needs to be a valid distribution over actions. Because each of the action response layers\nalso outputs a distribution over actions, we can achieve this requirement by constraining these weights\nto the simplex, thereby ensuring that the output is just a mixture of distributions. The model\u2019s output\nis thus $y = \\sum_{j=1}^{K} w_j ar^{(r)}_j$, where y and $ar^{(r)}_j \\in \\Delta^m$, and $w \\in \\Delta^K$.\n\nRelation to existing deep models Our model\u2019s functional form has interesting connections with\nexisting deep model architectures. We discuss two of these here. First, our invariance-preserving\nhidden layers can be encoded as the MLP Convolution Layers described in Lin et al. [10] with the\ntwo-channel 1 \u00d7 1 input $x_{i,j}$ corresponding to the two players\u2019 respective payoffs when actions i and j are\nplayed (using patches larger than 1 \u00d7 1 would imply the assumption that local structure is important,\nwhich is inappropriate in our domain; thus, we do not need multiple mlpconv layers). Second, our\npooling units are superficially similar to the pooling units used in convolutional networks. 
However,\nours differ both in functional form and purpose: we use pooling as a way of sharing information\nbetween cells in the matrices that are processed through our network by taking maximums across\nentire rows or columns, while in computer vision, max-pooling units are used to produce invariance\nto small translations of the input image by taking maximums in a small local neighborhood.\n\nRepresentational generality of our architecture Our work aims to extend existing models in\nbehavioral game theory via deep learning, not to propose an orthogonal approach. Thus, we must\ndemonstrate that our representation is rich enough to capture models and features that have proven\nimportant in that literature. We omit the details here for space reasons (see the supplementary\nmaterial, Section 4), but summarize our findings. Overall, our architecture can express the quantal\ncognitive hierarchy [23] and quantal level-k [18] models and, as their sharpness tends to infinity, their\nbest-response equivalents, cognitive hierarchy [3] and level-k [5]. Using feature layers we can also\nencode all the behavioral features used in Wright and Leyton-Brown [23]. However, our architecture\nis not universal; notably, it is unable to express certain features that are likely to be useful, such as\nidentification of dominated strategies. We plan to explore this in future work.\n\n4 Experiments\n\nExperimental Setup We used a dataset combining observations from 9 human-subject experimental\nstudies conducted by behavioral economists in which subjects were paid to select actions\nin normal-form games. Each subject\u2019s payment depended on their own actions and the actions of their\nunseen opposition, who chose an action simultaneously (see Section 1 of the supplementary material\nfor further details on the experiments and data). 
We are interested in the model\u2019s ability to predict the\ndistribution over the row player\u2019s actions, rather than just its accuracy in predicting the most likely\naction. As a result, we fit models to maximize the likelihood of training data P(D|\u03b8) (where \u03b8 are the\nparameters of the model and D is our dataset) and evaluate them in terms of negative log-likelihood\non the test set.\nAll the models presented in the experimental section were optimized using Adam [8] with an initial\nlearning rate of 0.0002, $\\beta_1 = 0.9$, $\\beta_2 = 0.999$ and $\\epsilon = 10^{-8}$. The models were all regularized using\nDropout with drop probability = 0.2 and L1 regularization with parameter = 0.01. They were all\ntrained until there was no training set improvement up to a maximum of 25 000 epochs and the\nparameters from the iteration with the best training set performance were returned. Our architecture\nimposes simplex constraints on the mixture weight parameters. Fortunately, simplex constraints fall\nwithin the class of simple constraints that can be efficiently optimized using the projected gradient\nalgorithm [7]. The algorithm modifies standard SGD by projecting the relevant parameters onto the\nconstraint set after each gradient update.\n\nExperimental Results Figure 4 (left) shows a performance comparison between a model built\nusing our deep learning architecture with only a single action response layer (i.e. no iterative\nreasoning; details below) and the previous state of the art, quantal cognitive hierarchy (QCH) with\nhand-crafted features (shown as a blue line); for reference we also include the best feature-free model,\nQCH with a uniform model of level-0 behavior (shown as a pink line). We refer to an instantiation\nof our model with L hidden layers and K action response layers as an L + K layer network. 
All\ninstantiations of our model with 3 or more layers significantly improved on both alternatives and thus\nrepresent a new state of the art. Notably, the magnitude of the improvement was considerably larger\nthan that of adding hand-crafted features to the original QCH model.\n\nFigure 4: Negative Log Likelihood Performance. The error bars represent 95% confidence intervals\nacross 10 rounds of 10-fold cross-validation. We compare various models built using our architecture\nto QCH Uniform (pink line) and QCH Linear4 (blue line).\n\nFigure 4 (left) considers the effect of varying the number of hidden units and layers on performance\nusing a single action response layer. Perhaps unsurprisingly, we found that a two-layer network with\nonly a single hidden layer of 50 units performed poorly on both training and test data. Adding a\nsecond hidden layer resulted in test set performance that improved on the previous state of the art.\nFor these three-layer networks (denoted (20, 20), (50, 50) and (100, 100)), performance improved\nwith more units per layer, but there were diminishing returns to increasing the number of units per\nlayer beyond 50. The four-layer networks (denoted (50, 50, 50) and (100, 100, 100)) offered further\nimprovements in training set performance but test set performance diminished as the networks were\nable to overfit the data. To test the effect of pooling units on performance, in Figure 4 (center)\nwe first removed the pooling units from two of the network configurations, keeping the rest of the\nhyper-parameters unchanged. The models that did not use pooling layers underfit on the training\ndata and performed very poorly on the test set. While we were able to improve their performance\nby turning off dropout, these unregularized networks did not match the training set performance of\nthe corresponding network configurations that had pooling units (see Section 3 of the supplementary\nmaterial). 
Thus, our final network contained two layers of 50 hidden units and pooling units.\nOur next set of experiments committed to this configuration for feature layers and investigated\nconfigurations of action-response layers, varying their number between one and four (i.e., from no\niterative reasoning up to three levels of iterative reasoning; see Figure 4 (right)). The networks with\nmore than one action-response layer showed signs of overfitting: performance on the training set\nimproved steadily as we added AR layers but test set performance suffered. Thus, our final network\nused only one action-response layer. We nevertheless remain committed to an architecture that can\ncapture iterative strategic reasoning; we intend to investigate more effective methods of regularizing\nthe parameters of action-response layers in future work.\n\n5 Discussion and Conclusions\n\nTo design systems that efficiently interact with human players, we need an accurate model of\nboundedly rational behavior. We present an architecture for learning such models that significantly\nimproves upon state-of-the-art performance without needing hand-tuned features developed by\ndomain experts. Interestingly, while the full architecture can include action response layers to\nexplicitly incorporate the iterative reasoning process modeled by level-k-style models, our best\nperforming model did not need them to set a new performance benchmark. This indicates\nthat the model is performing the mapping from payoffs to distributions over actions in a manner that\nis substantially different from previous successful models. 
Some natural future directions, besides\nthose already discussed above, are to extend our architecture beyond two-player, unrepeated games to\ngames with more than two players, as well as to richer interaction environments, such as games in\nwhich the same players interact repeatedly and games of imperfect information.\n\n[Figure 4 panels: Model Variations (# hidden units); Pooling Comparison (# units); Action Response (# layers). Vertical axes: NLL (Test Loss) and NLL (Training Loss).]\n\nReferences\n[1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2013.\n[2] C.F. Camerer. Behavioral game theory: Experiments in strategic interaction. Princeton University Press, 2003.\n[3] C.F. Camerer, T.H. Ho, and J.K. Chong. A cognitive hierarchy model of games. Quarterly Journal of Economics, 119(3), 2004.\n[4] C. Clark and A. J. Storkey. Training deep convolutional neural networks to play Go. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, 2015.\n[5] M. Costa-Gomes, V.P. Crawford, and B. Broseta. Cognition and behavior in normal-form games: An experimental study. Econometrica, 69(5), 2001.\n[6] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords. The American Economic Review, 97(1), 2007.\n[7] A. Goldstein. Convex programming in Hilbert space. Bulletin of the American Mathematical Society, 70(5), 1964.\n[8] D. P. Kingma and J. Ba. 
Adam: A method for stochastic optimization. In The International Conference on Learning Representations (ICLR), 2015.\n[9] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.\n[10] M. Lin, Q. Chen, and S. Yan. Network in network. In International Conference on Learning Representations, volume abs/1312.4400, 2014.\n[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, June 2015.\n[12] R.D. McKelvey and T.R. Palfrey. Quantal response equilibria for normal form games. GEB, 10(1), 1995.\n[13] P. Milgrom and I. Segal. Deferred-acceptance auctions and radio spectrum reallocation. In Proceedings of the Fifteenth ACM Conference on Economics and Computation. ACM, 2014.\n[14] D. C. Parkes and M. P. Wellman. Economic reasoning and artificial intelligence. Science, 349(6245), 2015.\n[15] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 2015.\n[16] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-theoretic, and Logical Foundations. Cambridge University Press, 2008.\n[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529, 2016.\n[18] D.O. Stahl and P.W. Wilson. Experimental evidence on players\u2019 models of other players. JEBO, 25(3), 1994.\n[19] M. Tambe. Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, New York, NY, USA, 1st edition, 2011.\n[20] H. R. Varian. Position auctions. International Journal of Industrial Organization, 25, 2007.\n[21] J. R. Wright and K. Leyton-Brown. Beyond equilibrium: Predicting human behavior in normal-form games. In AAAI. 
AAAI Press, 2010.\n[22] J. R. Wright and K. Leyton-Brown. Behavioral game-theoretic models: A Bayesian framework for parameter analysis. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2012), volume 2, pages 921\u2013928, 2012.\n[23] J. R. Wright and K. Leyton-Brown. Level-0 meta-models for predicting human behavior in games. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 857\u2013874, 2014.\n[24] R. Yang, C. Kiekintveld, F. Ordonez, M. Tambe, and R. John. Improving resource allocation strategies against human adversaries in security games: An extended study. Artificial Intelligence Journal (AIJ), 2013.\n", "award": [], "sourceid": 1263, "authors": [{"given_name": "Jason", "family_name": "Hartford", "institution": "University of British Columbia"}, {"given_name": "James", "family_name": "Wright", "institution": "University of British Columbia"}, {"given_name": "Kevin", "family_name": "Leyton-Brown", "institution": "University of British Columbia"}]}