{"title": "Learning Tree Structured Potential Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1552, "page_last": 1560, "abstract": "Many real phenomena, including behaviors, involve strategic interactions that can be learned from data. We focus on learning tree structured potential games where equilibria are represented by local maxima of an underlying potential function. We cast the learning problem within a max margin setting and show that the problem is NP-hard even when the strategic interactions form a tree. We develop a variant of dual decomposition to estimate the underlying game and demonstrate with synthetic and real decision/voting data that the game theoretic perspective (carving out local maxima) enables meaningful recovery.", "full_text": "Learning Tree Structured Potential Games\n\nVikas K. Garg\nCSAIL, MIT\nvgarg@csail.mit.edu\n\nTommi Jaakkola\nCSAIL, MIT\ntommi@csail.mit.edu\n\nAbstract\n\nMany real phenomena, including behaviors, involve strategic interactions that can be learned from data. We focus on learning tree structured potential games where equilibria are represented by local maxima of an underlying potential function. We cast the learning problem within a max margin setting and show that the problem is NP-hard even when the strategic interactions form a tree. We develop a variant of dual decomposition to estimate the underlying game and demonstrate with synthetic and real decision/voting data that the game theoretic perspective (carving out local maxima) enables meaningful recovery.\n\n1 Introduction\n\nStructured prediction methods [1; 2; 3; 4; 5] are widely adopted techniques for learning mappings between context descriptions x ∈ X and configurations y ∈ Y. The variables specifying each configuration y (e.g., arcs in natural language parsing) are typically mutually dependent, and it is therefore beneficial to predict them jointly rather than individually. 
The predicted y often arises as the highest scoring configuration with respect to a parameterized scoring function that decomposes into terms that couple two or more variables together to model their interactions. Structured prediction methods have been broadly useful across areas, including computational biology (e.g., molecular arrangements, alignments), natural language processing (e.g., parsing, tagging), computer vision (e.g., segmentation, matching), and many others. However, the setting is less suitable for modeling strategic interactions that are better characterized in terms of local consistency constraints.\nWe consider the problem of predicting configurations y that represent game theoretic equilibria. Such configurations are unlikely to coincide with the maximum of a global scoring function as in structured prediction. Indeed, there may be many possible equilibria in a specific context, and the particular choice may vary considerably. Each possible configuration is nevertheless characterized by local constraints that represent myopic optimizations of individual players. For example, senators can be thought to vote relative to give and take deals with other closely associated senators. Several assumptions are necessary to make the game theoretic setting feasible.\nWe abstract the setting as a potential game [6; 7; 8] among the players, and define a stochastic process to model the dynamics of the game. A game is said to be a potential game if the incentive of all players to change their strategy can be expressed using a single global potential function. Every potential game is guaranteed to have at least one pure strategy Nash equilibrium (possibly multiple) [9], and we will exploit this property in modeling and analyzing several real world scenarios. 
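To make the potential game abstraction concrete, here is a small illustrative sketch (ours, not from the paper): a chain-structured potential over binary strategies, with best-response updates converging to a pure strategy Nash equilibrium, i.e., a local maximum of the potential. All weights are invented, and for determinism the sketch sweeps players round-robin rather than sampling them at random as in the dynamics described later.

```python
# Toy tree (chain) potential over binary strategies; all weights invented.
n = 4
w = {(0, 1): 1.0, (1, 2): -0.8, (2, 3): 0.6}  # edge terms: w_ij * [y_i == y_j]
b = [0.2, -0.1, 0.3, -0.2]                    # node terms: b_i * y_i

def potential(y):
    # Global potential f(y) = sum_i b_i y_i + sum_ij w_ij [y_i == y_j].
    return (sum(b[i] * y[i] for i in range(n))
            + sum(wij * (y[i] == y[j]) for (i, j), wij in w.items()))

def local_payoff(i, y):
    # Only the terms of the potential that involve player i's strategy.
    s = b[i] * y[i]
    for (a, c), wij in w.items():
        if i in (a, c):
            s += wij * (y[a] == y[c])
    return s

def best_response_dynamics(y):
    # Round-robin sweeps; a player moves only if strictly better off.
    # Each move strictly increases the potential, so this terminates.
    changed = True
    while changed:
        changed = False
        for i in range(n):
            alt = y[:i] + [1 - y[i]] + y[i + 1:]
            if local_payoff(i, alt) > local_payoff(i, y):
                y, changed = alt, True
    return y

y = best_response_dynamics([1, 0, 1, 0])
# No unilateral flip improves the potential: y is a pure Nash equilibrium.
assert all(potential(y[:i] + [1 - y[i]] + y[i + 1:]) <= potential(y)
           for i in range(n))
```

Because each player's payoff change equals the potential change, the dynamics cannot cycle; any fixed point is simultaneously a best response for every player and a local maximum of the potential.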
Note that each pure Nash equilibrium corresponds to a local optimum of the underlying potential function rather than the global optimum as in structured prediction.\nWe further restrict the setting by permitting the payoff of each player to depend only on their own action and the actions of their neighbors (a subset of the other players). Thus, we may view our setting as a graphical game [10; 11]. In this work, we investigate potential games where the graphical structure of the interactions forms a tree. The goal is to recover the tree structured potential function that supports observed configurations of actions as locally optimal solutions. We prove that it is NP-hard to recover such games under a max-margin setting. We then propose a variant of dual decomposition (cf. [12; 13]) to learn the tree structure and the associated parameters.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 Setting\n\nWe commence with the game theoretic setting. There are n players indexed by a position in [n] ≜ {1, 2, . . . , n}. These players can be visualized as nodes of a tree-structured graph T with undirected edges E. We denote the set of neighbors of node i by Ni, i.e., (i, j) ∈ E ⇐⇒ j ∈ Ni ∧ i ∈ Nj, and abbreviate (i, j) ∈ E as ij ∈ T without introducing ambiguity. Each player i has a finite discrete set of strategies Yi. A strategy profile or label configuration is an n-dimensional vector of the form y = (y1, y2, . . . , yn) ∈ Y = ∏_{i=1}^n Yi. We denote the parametric potential function associated with the tree by f(y; x, T, θ), where y is a strategy profile, θ the set of parameters, and x ∈ X is a context [14]. We obtain an (n − 1)-dimensional vector y−i = (y1, . . . , yi−1, yi+1, . . . , yn) by considering the strategies of all players other than i. Thus, we may equivalently write y = (yi, y−i). 
Moreover, we use yNi to denote the strategy profile pertaining to the neighbors of node i. We can extract from f(y; x, T, θ) individual payoff (or cost) functions fi(yi, yNi; x, T, θ), i ∈ [n], which merely include all the terms that pertain to the strategy yi of the ith player.\nThe choice of a particular equilibrium (local optimum) in a context results from a stochastic process. Starting with an initial configuration y at time t = 0 (e.g., chosen at random), the game proceeds in an iterative fashion: during each subsequent iteration t = 1, 2, . . ., a player pt ∈ [n] is chosen uniformly at random. The player pt then computes the best response candidate set\n\nZpt = arg max_{z ∈ Ypt} fpt(z, yNpt; x, T, θ),\n\nand switches to a strategy within this set uniformly at random if their current strategy does not already belong to this set, i.e., a player changes their strategy only if a better option presents itself. The game finishes when a locally optimal configuration ŷ ∈ Y has been reached, i.e., when no player can improve their payoff unilaterally. Since many locally optimal configurations could have been reached in the given context x, the stochastic process induces a distribution over the strategy profiles.\nWe assume that our training data S = {(x1, y1), . . . , (xM, yM)} is generated by some distribution over contexts and the induced conditional distribution over strategy profiles with respect to some tree structured potential function. Our objective is to learn both the underlying tree structure T and the parameters θ using a max-margin setting. Specifically, given S, we are interested in finding T and θ such that\n\n∀ m ∈ [M], i ∈ [n], yi ∈ Yi: f(ym; xm, T, θ) ≥ f(ym−i, yi; xm, T, θ) + e((ym−i, yi), ym),\n\nwhere e(y, ym) is a non-negative loss (e.g., 
Hamming loss), which is 0 if and only if y = ym. Note that the maximum margin framework does not make explicit use of the assumed induced distribution over equilibria.\nThe setting here is superficially similar to relaxations of structured prediction tasks such as pseudo-likelihood [15] or decomposed learning [16]. These methods are, however, designed to provide computationally efficient approximations of the original structured prediction task by using fewer constraints during learning. Instead, we are specifically interested in modeling the observations as locally optimal solutions with respect to the potential function.\nWe only state the results of our theorems in the main text, and defer all the proofs to the Supplementary.\n\n3 Learning Tree Structured Potential Games\n\nWe first show that it is NP-hard to learn a tree structured potential game in a discriminative max-margin setting. Previous hardness results are available about learning structured prediction models under global constraints and arbitrary graphs [15], and under global constraints and tree structured models [17], also in a max-margin setting.\n\nTheorem 1. Given a set of training examples S = {(xm, ym)}_{m=1}^M and a family of potential functions of the form\n\nf(y; x, T, θ) = Σ_{ij∈T} θij(yi, yj) + Σ_i θi(yi) + Σ_i xi(yi),\n\nit is NP-hard to decide whether there exists a tree T and parameters θ (up to model equivalence) such that the following holds:\n\n∀ m, i, yi: f(ym; xm, T, θ) ≥ f(ym−i, yi; xm, T, θ) + e((ym−i, yi), ym).\n\n3.1 Dual decomposition algorithm\n\nThe remainder of this section concerns developing an approximate method for learning the potential function by appeal to dual decomposition. Dual decomposition methods are typically employed to solve inference tasks over combinatorial structures (e.g., [12; 13]). 
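For a given tree and parameters, verifying the constraints of Theorem 1 is straightforward; the hardness lies in deciding whether any (T, θ) satisfying them exists. A minimal checker, as an illustration only: binary strategies, and a made-up data layout in which theta_e[(i, j)] is a 2x2 table and theta_n[i], x[i] are length-2 tables.

```python
# Illustrative constraint checker for a *given* tree and parameters.
# Data layout is ours: theta_e[(i, j)] is a 2x2 table; theta_n[i] and
# x[i] are length-2 tables over binary strategies.
def f(y, x, edges, theta_e, theta_n):
    # f(y; x, T, theta) = sum_ij theta_ij(y_i, y_j)
    #                   + sum_i theta_i(y_i) + sum_i x_i(y_i)
    return (sum(theta_e[(i, j)][y[i]][y[j]] for (i, j) in edges)
            + sum(theta_n[i][y[i]] + x[i][y[i]] for i in range(len(y))))

def satisfies_margin(examples, edges, theta_e, theta_n, loss):
    # Check f(y^m) >= f(y^m_{-i}, y_i) + e(.) for every example, player,
    # and unilateral deviation.
    for x, ym in examples:
        base = f(ym, x, edges, theta_e, theta_n)
        for i in range(len(ym)):
            for yi in (0, 1):
                if yi == ym[i]:
                    continue
                dev = list(ym)
                dev[i] = yi
                if base < f(dev, x, edges, theta_e, theta_n) + loss(dev, ym):
                    return False
    return True

# Two players joined by an edge that rewards agreement by 1.
edges = [(0, 1)]
theta_e = {(0, 1): [[1.0, 0.0], [0.0, 1.0]]}
theta_n = [[0.0, 0.0], [0.0, 0.0]]
x = [[0.0, 0.0], [0.0, 0.0]]
loss = lambda y, ym: 0.5 if y != ym else 0.0   # scaled 0-1 loss, n = 2

assert satisfies_margin([(x, [0, 0]), (x, [1, 1])], edges, theta_e, theta_n, loss)
assert not satisfies_margin([(x, [0, 1])], edges, theta_e, theta_n, loss)
```

The two agreeing profiles are locally optimal with margin, while the disagreeing profile is not, matching the intuition that the margin constraints carve out local maxima.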
In contrast, we decompose the problem on two levels. On one hand, we break the problem into independent local neighborhood choices and use dual variables to reconcile these choices across the players so as to obtain a single tree-structured model. On the other hand, we ensure that the initially disjoint parameters mediating the interactions between a player and its neighbors are in agreement across the edges in the resulting structure. The two constraints ensure that there is a single tree-structured global potential function.\nFor each node i, let Ni be the set of neighbors of i represented in terms of indicator variables such that Nij = 1 if i selects j as a neighbor. Nij can be chosen independently from Nji, but the two will be enforced to agree at the solution. We will use Ni as a set of neighbors and as a set of indicator variables interchangeably. Similarly, we decompose the parameters into node potentials θi · φ(yi; x) = θi(yi; x) and edge potentials θij · φ(yi, yj; x) = θi,j(yi, yj; x), where again θij may be chosen separately from θji but will be encouraged to agree. The set of parameters associated with each player then consists of locally controllable parameters Θi = {θi, θi·} and Ni, where Ni selects the relevant subset of interaction terms:\n\nf(y; x, Ni, Θi) = θi(yi; x) + Σ_{j≠i} Nij θi,j(yi, yj; x).\n\nGiven a training set S = {(x1, y1), . . . , (xM, yM)}, the goal is to learn the set of neighbors N = {N1, . . . , Nn}, and weights Θ = {Θ1, . . .
, Θn} so as to minimize\n\n(1/2) ||Θ||² + (C/(M n)) Σ_{i=1}^n Σ_{m=1}^M max_{yi} [ f(ym−i, yi; xm, Ni, Θi) − f(ym; xm, Ni, Θi) + e(yi, ym_i) ],   (1)\n\nsubject to N forming a tree and Θ agreeing across the players; the inner maximization for example m and player i is denoted Rmi(Ni, Θi).\nLet Ri(Ni, Θi) = (C/(M n)) Σ_m Rmi(Ni, Θi). We force the neighbor choices to agree with a global tree structure represented by indicators N′. Similarly, we enforce parameters Θi to agree across neighbors. The resulting Lagrangian can be written as\n\nΣ_{i=1}^n [ (1/2) ||Θi||² + Ri(Ni, Θi) + Σ_{j≠i} (δij Nij + λij · θij) ] + [ −Σ_{i,j≠i} δij N′ij + G(N′) ],\n\nwhere the first bracketed term defines L(Θi, Ni; δ, λ), the second defines G(N′, δ), G(N′) = 0 if N′ forms a tree and ∞ otherwise, and λij = −λji. For the dual decomposition algorithm, we must be able to solve min_{Θi} L(Θi, Ni; δ, λ) to obtain Θ*i, and min_{Ni} L(Θi, Ni; δ, λ) to get N*i. The former is a QP, while the latter is more challenging, though it may permit efficient solutions via additional relaxations, by exploiting combinatorial properties in restricted cases (sub-modularity), or even by brute force for smaller problems. G(N′, δ) corresponds to a minimum weighted spanning tree, and thus can be solved efficiently using any standard algorithm such as Borůvka's, Kruskal's, or Prim's. The basic dual decomposition alternately solves for Θ*i, N*i, and N′*, resulting in updates of the dual variables based on disagreements. 
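The spanning-tree step can be sketched with a hand-rolled Kruskal's algorithm. One plausible reading of G(N′, δ) is that each undirected edge {i, j} carries weight −(δij + δji); that interpretation, and all the δ values below, are our own illustration rather than the paper's specification.

```python
# Sketch of the global step: pick the tree N' minimizing G(N', delta),
# i.e. a minimum spanning tree under weights w_ij = -(delta_ij + delta_ji)
# (one reading of the Lagrangian; delta values below are invented).
def kruskal(n, weights):
    # weights: dict mapping an undirected edge (i, j), i < j, to a weight.
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for (i, j) in sorted(weights, key=weights.get):
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

delta = {(0, 1): 0.9, (1, 0): 0.1, (0, 2): 0.2, (2, 0): 0.1,
         (1, 2): 0.3, (2, 1): 0.2}
w = {(i, j): -(delta[(i, j)] + delta[(j, i)])
     for (i, j) in delta if i < j}
tree = kruskal(3, w)
assert set(tree) == {(0, 1), (1, 2)}   # 2 edges span the 3 nodes
```

Larger dual values pull an edge's weight down, making the corresponding pair more likely to be joined in the global tree, which is exactly how the δ updates steer N′ toward the local choices.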
While the method has been successful for enforcing structural constraints (e.g., parsing), it is less appropriate for constraints involving continuous variables. To address this, we employ the alternating direction method of multipliers (ADMM) [18; 19; 20] for parameter agreement. Specifically, we encourage θi· and θ·i to agree with their mean ui· by introducing an additional term to the Lagrangian L:\n\nLA(Θi, Ni; ui·, δ, λ) = L(Θi, Ni; δ, λ) + (ρ/2) ||θi· − ui·||²,\n\nwhere ui· is updated as an independent parameter.\nThere are many ways to schedule the updates. We employ a two-phase algorithm that learns the structure of the game tree and the parameters separately. The algorithm is motivated by the following theorem. Since the result applies broadly to the dual decomposition paradigm, we state the theorem in a slightly more generic form than that required for our purpose. The theorem applies to our setting with\n\nf(N′) = −G(N′), A = [n], and gi(Θi, Ni) = Σ_{j≠i} δij Nij − L(Θi, Ni; δ, λ).\n\nWe now set up the conditions of the theorem. Consider the following combinatorial problem\n\nOpt = max_z { f(z) + Σ_{α∈A} gα(zα) },\n\nwhere f(z) specifies global constraints on admissible z, and the gα(zα) represent local terms guiding the assignment of values to different subsets of variables zα = {zj}_{j∈α}. Let the dual of this problem be minimized with respect to the dual coefficients {δi,α(zi)} by following a dual decomposition approach. 
Suppose we can find a global assignment ẑ and dual coefficients such that this assignment nearly attains the local maxima for all α ∈ A, i.e.,\n\ngα(ẑα) + Σ_{j∈α} δj,α(ẑj) ≥ max_{zα} { gα(zα) + Σ_{j∈α} δj,α(zj) } − ε.\n\nAssume further, without loss of generality,¹ that the assignment attains the max for the global constraint. Then, we have the following result.\n\nTheorem 2. If there exists an assignment ẑ and associated dual coefficients such that the assignment attains an ε-maximum of each term in the decomposition, for some ε > 0, then the objective value for ẑ lies in [Opt − |A|ε, Opt].\n\nThe theorem implies that if a global structure nearly attains the optima for the local neighborhoods, then we might as well shift our focus to finding the global structure rather than optimizing for the parameters corresponding to the exact local optima. The result guarantees that the value of such a global structure cannot be too far from that of the optimal global structure.\nWe outline our two-phase approach in Algorithm 1. The first phase is concerned only with iteratively finding a globally consistent structure. It is possible that at the conclusion of this phase the local structures do not fully agree (the relaxation is not tight). For this reason, the procedure runs for a specified maximum number of iterations and selects the global tree corresponding to an iteration that is least inconsistent with the local neighborhoods. Note that this phase does not precisely solve the original problem we posed earlier. Instead, the structure is obtained without constraining parameters to agree. 
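The agreement mechanism of the second phase (averaging paired parameters into a mean u and updating multipliers by λ ← λ + ρ(θ − u)) can be illustrated on a scalar consensus problem. The quadratic local objectives and all values below are our own toy, not the paper's model.

```python
# Scalar consensus-ADMM toy mirroring the parameter-agreement updates:
# two copies theta_a, theta_b of one parameter, with different local
# objectives (theta_a - 1)^2 / 2 and (theta_b - 3)^2 / 2, are driven to
# their mean u via multipliers lam_a, lam_b (all values illustrative).
rho = 1.0
a_target, b_target = 1.0, 3.0
theta_a = theta_b = u = lam_a = lam_b = 0.0
for _ in range(200):
    # Closed-form local minimizations of
    # (theta - t)^2/2 + lam*(theta - u) + (rho/2)*(theta - u)^2.
    theta_a = (a_target + rho * u - lam_a) / (1.0 + rho)
    theta_b = (b_target + rho * u - lam_b) / (1.0 + rho)
    u = (theta_a + theta_b) / 2.0        # mean, like u_ij in Algorithm 1
    lam_a += rho * (theta_a - u)         # multiplier updates
    lam_b += rho * (theta_b - u)

assert abs(theta_a - theta_b) < 1e-6     # the two copies agree...
assert abs(theta_a - 2.0) < 1e-6         # ...at the consensus optimum (1+3)/2
```

The averaging step keeps the sum of the multipliers at zero, and both the mean and the disagreement contract by a factor of two per iteration here, which is the smooth convergence that plain dual subgradient updates on continuous variables lack.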
In this sense, the first phase does not strictly consider potential games, as the interactions between players can remain intrinsic to the players themselves.\nThe second phase simply optimizes the parameters for the already specified global tree. This step realizes a potential game, as the parameters and the structure will be in agreement. We note that such parameters could be optimized directly for the selected tree without the need of dual decomposition. However, Algorithm 1 remains suitable in a distributed setting since each player is required to solve only local problems during the entire execution of the algorithm.\n\n3.2 Scaling the algorithm\n\nAs already noted, Algorithm 1 exhaustively enumerates all neighborhoods for each local optimization problem. This makes the algorithm computationally prohibitive in realistic settings. We now outline an approximation procedure that restricts the candidate neighborhood assignments. Specifically, for a local optimization at any node i, we may restrict the possible local neighborhoods at any iteration t to only those configurations that are at most Hamming distance h away from the best local configuration for i in iteration t−1. That is, we update each local max-structure incrementally, still guided by the overall tree within the same dual decomposition framework. Note that we recover Algorithm 1 as a special case when h = n. A small h corresponds to searching over a much smaller space compared to the brute force algorithm.\n\n¹We can adjust the bound with a term that depends on the difference between the value of the optimal global structure and the value of the global structure under consideration if these values do not coincide. 
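Concretely, the restricted candidate set can be enumerated as a Hamming ball around the previous best neighborhood indicator vector. This is an illustrative sketch; the function and variable names are ours.

```python
from itertools import combinations

# Enumerate all neighbor indicator vectors within Hamming distance h of
# the previous best one (the pruning described above; names are ours).
def hamming_ball(prev, h):
    n = len(prev)
    out = []
    for d in range(h + 1):
        for flips in combinations(range(n), d):
            cand = list(prev)
            for k in flips:
                cand[k] = 1 - cand[k]   # flip the chosen indicator bits
            out.append(tuple(cand))
    return out

prev_best = (1, 0, 0, 1, 0)
cands = hamming_ball(prev_best, 1)
assert len(cands) == 6          # the vector itself plus its 5 single flips
```

With h = 1 each player scans only O(n) candidates per iteration, which is the count behind the complexity estimate in the text; the full ball has size Σ_{d≤h} C(n, d).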
For instance, if we take h = 1, then the total complexity of the approximate algorithm reduces to O(n² · MaxIter), since in each iteration we need to solve n local problems, each having O(n) candidate neighborhoods.\n\nAlgorithm 1 Learning tree structured potential games\n1: procedure LEARNTREEPOTENTIALGAME\n2: Input: parameters ρ, β, MaxIter, and ε > 0.\n3:\n4: Phase 1: Learn Tree Structure\n5: Initialize t = 1, λij = 0, δij = 0, MinGap = ∞.\n6: repeat\n7:   for each i ∈ [n] do\n8:     for each Ni do\n9:       Compute Θ*t+1_i = min_{Θi} L(Θi, Ni; δ, 0)\n10:    Find N*i = argmin_{Ni} L(Θ*t+1_i, Ni; δ, 0)\n11:  Find N′ = argmin_N G(N, δ) using a minimum spanning tree algorithm\n12:  Compute gap: Gap = Σ_{i,j} I(N*ij ≠ N′ij)\n13:  if Gap < MinGap then\n14:    MinGap = Gap, Global = N′\n15:  Update δ ∀ i, j ≠ i: δij = δij + βt (N*ij − N′ij)\n16:  t ← t + 1\n17: until MinGap = 0 or t > MaxIter\n18: Set N′* = Global.\n\n19: Phase 2: Learn Parameters\n20: Set N = N′*\n21: Compute Θ*t+1_i = min_{Θi} L(Θi, Ni; 0, λ)\n22: repeat\n23:   Compute u ∀ i, j ≠ i: u_{ij}^{t+1} = (θ*t+1_ij + θ*t+1_ji)/2\n24:   Update λ ∀ i, j ≠ i: λij = λij + ρ (θ*t+1_ij − u_{ij}^{t+1})\n25:   Compute Θ*t+1_i = min_{Θi} LA(Θi, Ni; ui·, 0, λ)\n26:   t ← t + 1\n27: until Σ_{i,j≠i} ||θ*t+1_ij − θ*t+1_ji||² < ε\n28: Set θ*ij = θ*ji = (θ*t+1_ij + θ*t+1_ji)/2\n29: Output: N′*, Θ*\n\n4 Experimental Results\n\nWe now describe the results of our 
experiments on both synthetic and real data to demonstrate the efficacy of our algorithm. We found the algorithm to perform well for a wide range of C and β across different data. We report below the results of our experiments with the following setting of parameters: ρ = 1, βt = 0.005 (for all t), C = 10, ε = 0.1, and MaxIter = 100. For each local optimization problem, the configurations were constrained to share the slack variable in order to reduce the total number of optimization variables. Moreover, we used a scaled 0-1 loss [15], e(y, ym) = 1{y ≠ ym}/n, for each local optimization. We set h = 1 for the approximate method.\nWe conducted different sets of experiments to underscore the different aspects of our approach. Our experiments with toy synthetic data highlight recovery of an underlying true structure under controlled conditions (pertaining to the data generation process). The results on a real but small dataset, the Supreme Court rulings, vindicate the applicability of the exhaustive approach to unraveling the interactions latent in real datasets. Finally, we address the scalability issues inherent in the exhaustive search by demonstrating the approximate version on the larger Congressional Votes real dataset.\n\n4.1 Synthetic Dataset\n\nWe will now describe how the brute force method recovered the true structure on a synthetic dataset. For this, data were assumed to come from the underlying model\n\nf(y; x, θ) = Σ_{ij∈E} θij(yi, yj) + Σ_i xi θi(yi),\n\nwhere x represents the context that varies. The parameters were set as follows. We designed an n-node degenerate or pathological tree, n = 6, with edges between nodes i and i + 1, i ∈ {1, 2, . . . , n − 1}. On each edge (i, j) ∈ E, we sampled θij(yi, yj), yi, yj ∈ {0, 1}, uniformly at random from [−1, 1], independently of the other edges. 
For each node i, we also sampled θi(yi), yi ∈ {0, 1}, independently from the same range. Each training example pair (xm, ym) was sampled in two steps. First, each xmj, j ∈ [n], was set uniformly at random in the range [−10, 10], independently of each other. The associated ym was then generated according to the stochastic process described in Section 2. Briefly, starting with ym ∈ {0, 1}^n sampled uniformly at random, we successively updated the configuration by changing a randomly chosen coordinate of ym, and accepting the move only if the associated score was higher. Since there are 2^n possible configurations of binary vectors, we were guaranteed that, in finite time, this procedure ended in a locally stable configuration. Once this locally stable configuration was reached, we checked whether its score exceeded that of all the other configurations within Hamming distance one by at least 1/n. If yes, then we included the pair (xm, ym) in our synthetic data set; otherwise we discarded the pair. Starting with 100 examples, this procedure resulted in a total of 78 stable configurations that scored higher than each configuration one Hamming distance away by at least 1/n. These configurations formed our synthetic data set. We were able to exactly recover the tree structure at the end of Phase 1 of our algorithm using the training data. Fig. 1 shows the evolution of the global tree structure (i.e., N′ in the iterations that resulted in a decrease of Gap). Note how the algorithm corrects for incorrect edges, starting from a star tree until it recovers the pathological tree structure. Fig. 2 elucidates the synergy between the global tree and local neighborhoods toward recovering the correct structure.\n\nFigure 1: Recovery on synthetic data. 
Evolution of the tree structure is shown from left to right. Each incorrect edge is indicated by coloring one of the end nodes in red. After the first iteration, only the edge (1, 2) is identified correctly. At termination, all edges in the underlying structure are recovered.\n\nWe show in Fig. 3 the evolution of the tree when the observations were falsely treated as globally optimal points. Clearly, structured prediction failed to recover the underlying tree structure.\n\n4.2 Real Dataset 1: Supreme Court Rulings\n\nFor both real datasets, we assumed the following decomposition:\n\nf(y; θ) = Σ_{ij∈E} θij(yi, yj) + Σ_i θi(yi).\n\nFor our first real dataset,² we considered the rulings of a Supreme Court bench comprising Justices Alito (A), Breyer (B), Ginsburg (G), Kennedy (K), Roberts (R), Scalia (S), and Thomas (T), during\n\n²Publicly available at http://scdb.wustl.edu/.\n\nFigure 2: Global-Local Synergy. (Center & Right) Spanning trees formed from two separate local neighborhoods (in different iterations). (Left) The common global tree structure. The global tree structure reappears during the execution of the algorithm. On first occurrence, the global tree is misaligned from chain 2-3-4 of the local neighborhood tree at node 5, as indicated by the tree in the center. The algorithm takes corrective action, and on the next occurrence, node 5 moves to the desired position, as seen from the tree on the right. The algorithm proceeds to correct the positioning of node 6.\n\nFigure 3: Evolution of structured prediction. 
Structured prediction fails to recover the true structure.\n\nFigure 4: (Left) Tree recovered from Supreme Court data; nodes are labeled Scalia (S), Thomas (T), Alito (A), Roberts (R) (conservatives), Kennedy (K), Breyer (B), and Ginsburg (G) (liberals). The tree is consistent with the widely known ideology of the justices: Justice Kennedy (K) is considered largely moderate, while the others espouse a more conservative or liberal jurisprudence. The thickness of an edge indicates the strength of interaction in terms of the (scaled) l2-norm of the edge parameters. (Right) Enforcing global constraints (structured prediction) resulted in a qualitatively incorrect structure.\n\nthe year 2013. Justices Alito, Roberts, Scalia, and Thomas are known to be conservatives, while Justices Breyer and Ginsburg belong to the liberal side of the Court. Justice Kennedy generally takes a moderate stand on most issues. On every case under their jurisdiction, each Justice chose an integer from the set {1, 2, . . . , 8}. We considered all the rulings of this bench that had at least one “dissent”. For our purposes, we created a dataset from those rulings that did not register a value of 6, 7, or 8 from any of the Justices, since these values seem to have a complex interpretation instead of a simple yes/no. For all other values, we used the interpretation by [21]: dissent value 2 was treated as 0 (no), and the other values as 1 (yes). Fig. 4 shows that we were able to recover the known ideology of the Justices by correctly treating the rulings as locally optimal, whereas structured prediction failed to identify a qualitatively correct structure.\n\nFigure 5: (Congressional Votes.) 
The recovered tree is consistent with the expected voting pattern that, in general, Democrats and Republicans vote along their respective party principles.\n\n4.3 Real Dataset 2: Congressional Voting Records\n\nWe also experimented with a dataset³ obtained by compiling the votes on all the bills of the 110th United States Congress (Session 2). The US Congress records the voting proceedings of the legislative branch of the US federal government [11]. The U.S. Senate consists of 100 senators: each of the 50 U.S. states is represented by two senators. We compiled all the votes of the first 30 senators (in data order) over this period on bills without unanimity. Each vote takes one of two values, +1 or −1, denoting whether the vote was in favor of or against the proposed bill. We mapped vote value −1 to 0 to create a binary dataset. Fig. 5 shows how the approximate algorithm is able to recover a qualitatively correct structure reflecting that Democrats and Republicans typically vote along their respective party ideologies (note that there might be more than one qualitatively correct structure). Specifically, we obtain a structure where no Democrat is sandwiched between two Republicans, or vice versa.\n\nDiscussion\n\nA primary goal of this work is to argue that complex strategic interactions are better modeled as locally optimal solutions instead of globally optimal assignments (as done, for instance, in structured prediction). We believe this local versus global distinction has not been accorded due significance in the literature, and we hope our work fosters more research in that direction.\nThe work opens up several interesting avenues. All the results presented in this paper are qualitative in nature, primarily because quantitative evaluation is non-trivial in our setting, since a strategic game may have multiple equilibria (local optima). 
The incremental method proposed in this paper does not come with any certificate of optimality, unlike most dual decomposition settings. We assumed that the dynamics of the underlying game follow a stochastic process, whereas players typically take deterministic turns in real game settings. From a statistical learning perspective, it will be interesting to estimate generalization bounds in terms of the number of local equilibria samples. Learning across (repeated) games and exploring sub-modular potential functions are other directions.\n\nAcknowledgments\n\nJean Honorio provided the Congressional Votes dataset for our experiments. We would also like to thank the anonymous reviewers for their helpful comments.\n\n³Publicly available at http://www.senate.gov/.\n\nReferences\n\n[1] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables, JMLR, 6(2), pp. 1453-1484, 2005.\n\n[2] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks, NIPS, 2003.\n\n[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML, 2001.\n\n[4] J. K. Bradley and C. Guestrin. Learning tree conditional random fields, ICML, 2010.\n\n[5] S. Nowozin and C. H. Lampert. Structured Learning and Prediction in Computer Vision, Foundations and Trends in Computer Graphics and Vision, 2011.\n\n[6] P. Dubey, O. Haimanko, and A. Zapechelnyuk. Strategic complements and substitutes, and potential games, Games and Economic Behavior, 54, pp. 77-94, 2006.\n\n[7] D. Monderer and L. Shapley. Potential Games, Games and Economic Behavior, 14, pp. 124-143, 1996.\n\n[8] Y. Song, S. H. Y. Wong, and K.-W. Lee. Optimal gateway selection in multi-domain wireless networks: a potential game perspective, MobiCom, 2011.\n\n[9] T. Ui. Robust equilibria of potential games, Econometrica, 69, pp. 1373-1380, 2000.\n\n[10] M. 
Kearns, M. L. Littman, and S. P. Singh. Graphical Models for Game Theory, UAI, 2001.\n\n[11] J. Honorio and L. Ortiz. Learning the Structure and Parameters of Large-Population Graphical Games from Behavioral Data, JMLR, 16, pp. 1157-1210, 2015.\n\n[12] A. M. Rush and M. Collins. A Tutorial on Dual Decomposition and Lagrangian Relaxation for Inference in Natural Language Processing, JAIR, 45, pp. 305-362, 2012.\n\n[13] A. M. Rush, D. Sontag, M. Collins, and T. Jaakkola. On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing, EMNLP, 2010.\n\n[14] M. Hoefer and A. Skopalik. Social Context in Potential Games, Internet and Network Economics, pp. 364-377, 2012.\n\n[15] D. Sontag, O. Meshi, T. Jaakkola, and A. Globerson. More data means less inference: A pseudo-max approach to structured learning, NIPS, 2010.\n\n[16] R. Samdani and D. Roth. Efficient Decomposed Learning for Structured Prediction, ICML, 2012.\n\n[17] O. Meshi, E. Eban, G. Elidan, and A. Globerson. Learning Max-Margin Tree Predictors, UAI, 2013.\n\n[18] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Foundations and Trends in Machine Learning, 3(1), pp. 1-122, 2010.\n\n[19] A. F. T. Martins, N. A. Smith, E. P. Xing, P. M. Q. Aguiar, and M. A. T. Figueiredo. Augmenting Dual Decomposition for MAP Inference, NIPS, 2010.\n\n[20] A. F. T. Martins, N. A. Smith, P. M. Q. Aguiar, and M. A. T. Figueiredo. Dual Decomposition with Many Overlapping Components, EMNLP, 2011.\n\n[21] M. T. Irfan and L. E. Ortiz. On influence, stable behavior, and the most influential individuals in networks: A game-theoretic approach, Artificial Intelligence, 215, pp. 
79-119, 2014.\n", "award": [], "sourceid": 852, "authors": [{"given_name": "Vikas", "family_name": "Garg", "institution": "MIT"}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": "MIT"}]}