{"title": "Fully Decentralized Policies for Multi-Agent Systems: An Information Theoretic Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 2941, "page_last": 2950, "abstract": "Learning cooperative policies for multi-agent systems is often challenged by partial observability and a lack of coordination. In some settings, the structure of a problem allows a distributed solution with limited communication. Here, we consider a scenario where no communication is available, and instead we learn local policies for all agents that collectively mimic the solution to a centralized multi-agent static optimization problem. Our main contribution is an information theoretic framework based on rate distortion theory which facilitates analysis of how well the resulting fully decentralized policies are able to reconstruct the optimal solution. Moreover, this framework provides a natural extension that addresses which nodes an agent should communicate with to improve the  performance of its individual policy.", "full_text": "Fully Decentralized Policies for Multi-Agent Systems:\n\nAn Information Theoretic Approach\n\nElectrical Engineering and Computer Science\n\nElectrical Engineering and Computer Science\n\nRoel Dobbe\u2217\n\nDavid Fridovich-Keil\u2217\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\ndobbe@eecs.berkeley.edu\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\ndfk@eecs.berkeley.edu\n\nClaire Tomlin\n\nElectrical Engineering and Computer Science\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\ntomlin@eecs.berkeley.edu\n\nAbstract\n\nLearning cooperative policies for multi-agent systems is often challenged by partial\nobservability and a lack of coordination. In some settings, the structure of a problem\nallows a distributed solution with limited communication. 
Here, we consider a\nscenario where no communication is available, and instead we learn local policies\nfor all agents that collectively mimic the solution to a centralized multi-agent static\noptimization problem. Our main contribution is an information theoretic framework\nbased on rate distortion theory which facilitates analysis of how well the resulting\nfully decentralized policies are able to reconstruct the optimal solution. Moreover,\nthis framework provides a natural extension that addresses which nodes an agent\nshould communicate with to improve the performance of its individual policy.\n\n1\n\nIntroduction\n\nFinding optimal decentralized policies for multiple agents is often a hard problem hampered by\npartial observability and a lack of coordination between agents. The distributed multi-agent problem\nhas been approached from a variety of angles, including distributed optimization [Boyd et al., 2011],\ngame theory [Aumann and Dreze, 1974] and decentralized or networked partially observable Markov\ndecision processes (POMDPs) [Oliehoek and Amato, 2016, Goldman and Zilberstein, 2004, Nair\net al., 2005]. In this paper, we analyze a different approach consisting of a simple learning scheme to\ndesign fully decentralized policies for all agents that collectively mimic the solution to a common\noptimization problem, while having no access to a global reward signal and either no or restricted\naccess to other agents\u2019 local state. This algorithm is a generalization of that proposed in our prior\nwork [Sondermeijer et al., 2016] related to decentralized optimal power \ufb02ow (OPF). Indeed, the\nsuccess of regression-based decentralization in the OPF domain motivated us to understand when and\nhow well the method works in a more general decentralized optimal control setting.\nThe key contribution of this work is to view decentralization as a compression problem, and then\napply classical results from information theory to analyze performance limits. 
More speci\ufb01cally, we\ntreat the ith agent\u2019s optimal action in the centralized problem as a random variable u\u2217i , and model\nits conditional dependence on the global state variables x = (x1, . . . , xn), i.e. p(u\u2217i |x), which we\nassume to be stationary in time. We now restrict each agent i to observe only the ith state variable\nxi. Rather than solving this decentralized problem directly, we train each agent to replicate what it\nwould have done with full information in the centralized case. That is, the vector of state variables\nx is compressed, and the ith agent must decompress xi to compute some estimate \u02c6ui \u2248 u\u2217i . In our\napproach, each agent learns a parameterized Markov control policy \u02c6ui = \u02c6\u03c0i(xi) via regression. The\n\u02c6\u03c0i are learned from a data set containing local states xi taken from historical measurements of system\nstate x and corresponding optimal actions u\u2217i computed by solving an of\ufb02ine centralized optimization\nproblem for each x.\n\n\u2217Indicates equal contribution.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this context, we analyze the fundamental limits of compression. In particular, we are interested\nin unraveling the relationship between the dependence structure of u\u2217i and x and the corresponding\nability of an agent with partial information to approximate the optimal solution, i.e. the difference \u2013\nor distortion \u2013 between decentralized action \u02c6ui = \u02c6\u03c0i(xi) and u\u2217i . This type of relationship is well\nstudied within the information theory literature as an instance of rate distortion theory [Cover and\nThomas, 2012, Chapter 13]. Classical results in this \ufb01eld provide a means of \ufb01nding a lower bound on\nthe expected distortion as a function of the mutual information \u2013 or rate of communication \u2013 between\nu\u2217i and xi. 
This lower bound is valid for each speci\ufb01ed distortion metric, and for any arbitrary strategy\nof computing \u02c6ui from available data xi. Moreover, we are able to leverage a similar result to provide\na conceptually simple algorithm for choosing a communication structure \u2013 letting the regressor \u02c6\u03c0i\ndepend on some other local states xj, j \u2260 i \u2013 in such a way that the lower bound on expected distortion\nis minimized. As such, our method generalizes [Sondermeijer et al., 2016] and provides a novel\napproach for the design and analysis of regression-based decentralized optimal policies for general\nmulti-agent systems. We demonstrate these results on synthetic examples, and on a real example\ndrawn from solving OPF in electrical distribution grids.\n\n2 Related Work\n\nDecentralized control has long been studied within the system theory literature, e.g. [Lunze, 1992,\nSiljak, 2011]. Recently, various decomposition based techniques have been proposed for distributed\noptimization based on primal or dual decomposition methods, which all require iterative computation\nand some form of communication with either a central node [Boyd et al., 2011] or neighbor-to-\nneighbor on a connected graph [Pu et al., 2014, Raffard et al., 2004, Sun et al., 2013]. Distributed\nmodel predictive control (MPC) optimizes a networked system composed of subsystems over a time\nhorizon, which can be decentralized (no communication) if the dynamic interconnections between\nsubsystems are weak in order to achieve closed-loop stability as well as performance [Christo\ufb01des\net al., 2013]. The work of Zeilinger et al. [2013] extended this to systems with strong coupling by\nemploying time-varying distributed terminal set constraints, which requires neighbor-to-neighbor\ncommunication. 
Another class of methods model problems in which agents try to cooperate on\na common objective without full state information as a decentralized partially observable Markov\ndecision process (Dec-POMDP) [Oliehoek and Amato, 2016]. Nair et al. [2005] introduce networked\ndistributed POMDPs, a variant of the Dec-POMDP inspired in part by the pairwise interaction\nparadigm of distributed constraint optimization problems (DCOPs).\nAlthough the speci\ufb01c algorithms in these works differ signi\ufb01cantly from the regression-based de-\ncentralization scheme we consider in this paper, a larger difference is in problem formulation. As\ndescribed in Sec. 3, we study a static optimization problem repeatedly solved at each time step. Much\nprior work, especially in optimal control (e.g. MPC) and reinforcement learning (e.g. Dec-POMDPs),\nposes the problem in a dynamic setting where the goal is to minimize cost over some time horizon.\nIn the context of reinforcement learning (RL), the time horizon can be very long, leading to the\nwell known tradeoff between exploration and exploitation; this does not appear in the static case.\nAdditionally, many existing methods for the dynamic setting require an ongoing communication\nstrategy between agents \u2013 though not all, e.g. [Peshkin et al., 2000]. Even one-shot static problems\nsuch as DCOPs tend to require complex communication strategies, e.g. [Modi et al., 2005].\nAlthough the mathematical formulation of our approach is rather different from prior work, the\npolicies we compute are similar in spirit to other learning and robotic techniques that have been\nproposed, such as behavioral cloning [Sammut, 1996] and apprenticeship learning [Abbeel and Ng,\n2004], which aim to let an agent learn from examples. 
In addition, we see a parallel with recent\nwork on information-theoretic bounded rationality [Ortega et al., 2015] which seeks to formalize\ndecision-making with limited resources such as the time, energy, memory, and computational effort\nallocated for arriving at a decision. Our work is also related to swarm robotics [Brambilla et al.,\n2013], as it learns simple rules aimed to design robust, scalable and \ufb02exible collective behaviors for\ncoordinating a large number of agents or robots.\n\n(a) Distributed multi-agent problem.\n\n(b) Graphical model of dependency structure.\n\nFigure 1: (a) shows a connected graph corresponding to a distributed multi-agent system. The circles\ndenote the local state xi of an agent, the dashed arrow denotes its action ui, and the double arrows\ndenote the physical coupling between local state variables. (b) shows the Markov Random Field\n(MRF) graphical model of the dependency structure of all variables in the decentralized learning\nproblem. Note that the state variables xi and the optimal actions u\u2217i form a fully connected undirected\nnetwork, and the local policy \u02c6ui only depends on the local state xi.\n\n3 General Problem Formulation\n\nConsider a distributed multi-agent problem de\ufb01ned by a graph G = (N ,E), with N denoting the\nnodes in the network with cardinality |N| = N, and E representing the set of edges between\nnodes. Fig. 1a shows a prototypical graph of this sort. Each node has a real-valued state vector\n$x_i \\in \\mathbb{R}^{\\alpha_i}$, $i \\in \\mathcal{N}$. A subset of nodes C \u2282 N , with cardinality |C| = C, are controllable and\nhence are termed \u201cagents.\u201d Each of these agents has an action variable $u_i \\in \\mathbb{R}^{\\beta_i}$, $i \\in \\mathcal{C}$. Let\n$x = (x_1, \\ldots, x_N)^\\top \\in \\mathbb{R}^{\\sum_{i \\in \\mathcal{N}} \\alpha_i} = \\mathcal{X}$ denote the full network state vector and $u \\in \\mathbb{R}^{\\sum_{i \\in \\mathcal{C}} \\beta_i} = \\mathcal{U}$\nthe stacked network optimization variable. Physical constraints such as spatial coupling are captured\nthrough equality constraints g(x, u) = 0. 
In addition, the system is subject to inequality constraints\nh(x, u) \u2264 0 that incorporate limits due to capacity, safety, robustness, etc. We are interested\nin minimizing a convex scalar function fo(x, u) that encodes objectives that are to be pursued\ncooperatively by all agents in the network, i.e. we want to \ufb01nd\n\n$$u^* = \\arg\\min_{u} \\; f_o(x, u) \\quad \\text{s.t.} \\quad g(x, u) = 0, \\quad h(x, u) \\le 0. \\tag{1}$$\n\nNote that (1) is static in the sense that it does not consider the future evolution of the state x or the\ncorresponding future values of cost fo. We apply this static problem to sequential control tasks by\nrepeatedly solving (1) at each time step. Note that this simpli\ufb01cation from an explicitly dynamic\nproblem formulation (i.e. one in which the objective function incorporates future costs) is purely for\nease of exposition and for consistency with the OPF literature as in [Sondermeijer et al., 2016]. We\ncould also consider the optimal policy which solves a dynamic optimal control or RL problem and\nthe decentralized learning step in Sec. 3.1 would remain the same.\nSince (1) is static, applying the learned decentralized policies repeatedly over time may lead to\ndynamical instability. Identifying when this will and will not occur is a key challenge in verifying the\nregression-based decentralization method; however, it is beyond the scope of this work.\n\n3.1 Decentralized Learning\n\nWe interpret the process of solving (1) as applying a well-de\ufb01ned function or stationary Markov\npolicy \u03c0\u2217 : X \u2212\u2192 U that maps an input collective state x to the optimal collective control or action\nu\u2217. We presume that this solution exists and can be computed of\ufb02ine. 
Our objective is to learn C\ndecentralized policies \u02c6ui = \u02c6\u03c0i(xi), one for each agent i \u2208 C, based on T historical measurements\nof the states $\\{x[t]\\}_{t=1}^T$ and the of\ufb02ine computation of the corresponding optimal actions $\\{u^*[t]\\}_{t=1}^T$.\nAlthough each policy \u02c6\u03c0i individually aims to approximate u\u2217i based on local state xi, we are able\nto reason about how well their collective action can approximate \u03c0\u2217. Figure 2 summarizes the\ndecentralized learning setup.\n\nFigure 2: A \ufb02ow diagram explaining the key steps of the decentralized regression method, depicted\nfor the example system in Fig. 1a. We \ufb01rst collect data from a multi-agent system, and then solve\nthe centralized optimization problem using all the data. The data is then split into smaller training\nand test sets for all agents to develop individual decentralized policies \u02c6\u03c0i(xi) that approximate the\noptimal solution of the centralized problem. These policies are then implemented in the multi-agent\nsystem to collectively achieve a common global behavior.\n\nMore formally, we describe the dependency structure of the individual policies $\\hat{\\pi}_i : \\mathbb{R}^{\\alpha_i} \\to \\mathbb{R}^{\\beta_i}$\nwith a Markov Random Field (MRF) graphical model, as shown in Fig. 1b. The \u02c6ui are only allowed\nto depend on local state xi while the u\u2217i may depend on the full state x. With this model, we can\ndetermine how information is distributed among different variables and what information-theoretic\nconstraints the policies {\u02c6\u03c0i}i\u2208C are subject to when collectively trying to reconstruct the centralized\npolicy \u03c0\u2217. Note that although we may refer to \u03c0\u2217 as globally optimal, this is not actually required\nfor us to reason about how closely the \u02c6\u03c0i approximate \u03c0\u2217. That is, our analysis holds even if (1) is\nsolved using approximate methods. 
In a dynamical reformulation of (1), for example, \u03c0\u2217 could be\ngenerated using techniques from deep RL.\n\n3.2 A Rate-Distortion Framework\n\nWe approach the problem of how well the decentralized policies \u02c6\u03c0i can perform in theory from\nthe perspective of rate distortion. Rate distortion theory is a sub-\ufb01eld of information theory which\nprovides a framework for understanding and computing the minimal distortion incurred by any given\ncompression scheme. In a rate distortion context, we can interpret the fact that the output of each\nindividual policy \u02c6\u03c0i depends only on the local state xi as a compression of the full state x. For a\ndetailed overview, see [Cover and Thomas, 2012, Chapter 10]. We formulate the following variant of\nthe classical rate distortion problem\n\n$$D^* = \\min_{p(\\hat{u}|u^*)} \\; \\mathbb{E}\\left[d(\\hat{u}, u^*)\\right] \\quad \\text{s.t.} \\quad I(\\hat{u}_i; u^*_j) \\le I(x_i; u^*_j) \\triangleq \\gamma_{ij}, \\quad I(\\hat{u}_i; \\hat{u}_j) \\le I(x_i; x_j) \\triangleq \\delta_{ij}, \\quad \\forall i, j \\in \\mathcal{C}, \\tag{2}$$\n\nwhere I(\u00b7,\u00b7) denotes mutual information and d(\u00b7,\u00b7) an arbitrary non-negative distortion measure. As\nusual, the minimum distortion between random variable u\u2217 and its reconstruction \u02c6u may be found by\nminimizing over conditional distributions p(\u02c6u|u\u2217).\nThe novelty in (2) lies in the structure of the constraints. Typically, D\u2217 is written as a function D(R),\nwhere R is the maximum rate or mutual information I(\u02c6u; u\u2217). From Fig. 1b however, we know that\npairs of reconstructed and optimal actions cannot share more information than is contained in the\nintermediate nodes in the graphical model, e.g. \u02c6u1 and u\u22171 cannot share more information than x1 and\nu\u22171. This is a simple consequence of the data processing inequality [Cover and Thomas, 2012, Thm.\n2.8.1]. 
Similarly, the reconstructed optimal actions at two different nodes cannot be more closely\nrelated than the measurements xi from which they are computed. The resulting constraints are \ufb01xed\nby the joint distribution of the state x and the optimal actions u\u2217. That is, they are fully determined\nby the structure of the optimization problem (1) that we wish to solve.\n\nWe emphasize that we have made virtually no assumptions about the distortion function. For the\nremainder of this paper, we will measure distortion as the deviation between \u02c6ui and u\u2217i . However,\nwe could also de\ufb01ne it to be the suboptimality gap fo(x, \u02c6u) \u2212 fo(x, u\u2217), which may be much\nmore complicated to compute. This de\ufb01nition could allow us to reason explicitly about the cost of\ndecentralization, and it could address the valid concern that the optimal decentralized policy may\nbear no resemblance to \u03c0\u2217. We leave further investigation for future work.\n\n3.3 Example: Squared Error, Jointly Gaussian\n\nTo provide more intuition into the rate distortion framework, we consider an idealized example in\nwhich the $x_i, u_i \\in \\mathbb{R}^1$. Let $d(\\hat{u}, u^*) = \\|\\hat{u} - u^*\\|_2^2$ be the squared error distortion measure, and\nassume the state x and optimal actions u\u2217 to be jointly Gaussian. 
These assumptions allow us to\nderive an explicit formula for the optimal distortion D\u2217 and corresponding regression policies \u02c6\u03c0i.\nWe begin by stating an identity for two jointly Gaussian X, Y \u2208 R with correlation \u03c1: $I(X; Y) \\le \\gamma \\iff \\rho^2 \\le 1 - e^{-2\\gamma}$, which follows immediately from the de\ufb01nition of mutual information\nand the formula for the entropy of a Gaussian random variable. Taking $\\rho_{\\hat{u}_i, u^*_i}$ to be the correlation\nbetween \u02c6ui and u\u2217i , $\\sigma^2_{\\hat{u}_i}$ and $\\sigma^2_{u^*_i}$ to be the variances of \u02c6ui and u\u2217i respectively, and assuming that u\u2217i\nand \u02c6ui are of equal mean (unbiased policies \u02c6\u03c0i), we can show that the minimum distortion attainable\nis\n\n\\begin{align}\nD^* &= \\min_{p(\\hat{u}|u^*)} \\mathbb{E}\\left[\\|u^* - \\hat{u}\\|_2^2\\right] \\;:\\; \\rho^2_{\\hat{u}_i, u^*_i} \\le 1 - e^{-2\\gamma_{ii}} = \\rho^2_{u^*_i, x_i}, \\; \\forall i \\in \\mathcal{C}, \\tag{3} \\\\\n&= \\min_{\\{\\rho_{\\hat{u}_i, u^*_i}\\}, \\{\\sigma_{\\hat{u}_i}\\}} \\sum_i \\left(\\sigma^2_{u^*_i} + \\sigma^2_{\\hat{u}_i} - 2 \\rho_{\\hat{u}_i, u^*_i} \\sigma_{u^*_i} \\sigma_{\\hat{u}_i}\\right) \\;:\\; \\rho^2_{\\hat{u}_i, u^*_i} \\le \\rho^2_{u^*_i, x_i}, \\tag{4} \\\\\n&= \\min_{\\{\\sigma_{\\hat{u}_i}\\}} \\sum_i \\left(\\sigma^2_{u^*_i} + \\sigma^2_{\\hat{u}_i} - 2 \\rho_{u^*_i, x_i} \\sigma_{u^*_i} \\sigma_{\\hat{u}_i}\\right), \\tag{5} \\\\\n&= \\sum_i \\sigma^2_{u^*_i} \\left(1 - \\rho^2_{u^*_i, x_i}\\right). \\tag{6}\n\\end{align}\n\nIn (4), we have solved for the optimal correlations $\\rho_{\\hat{u}_i, u^*_i}$. Unsurprisingly, the optimal value turns out\nto be the maximum allowed by the mutual information constraint, i.e. \u02c6ui should be as correlated to\nu\u2217i as possible, and in particular as much as u\u2217i is correlated to xi. 
Similarly, in (5) we solve for the\noptimal $\\sigma_{\\hat{u}_i}$, with the result that at optimum, $\\sigma_{\\hat{u}_i} = \\rho_{u^*_i, x_i} \\sigma_{u^*_i}$. This means that as the correlation\nbetween the local state xi and the optimal action u\u2217i decreases, the variance of the estimated action \u02c6ui\ndecreases as well. As a result, the learned policy will increasingly \u201cbet on the mean\u201d or \u201clisten less\u201d\nto its local measurement to approximate the optimal action.\nMoreover, we may also provide a closed form expression for the regressor which achieves the\nminimum distortion D\u2217. Since we have assumed that each u\u2217i and the state x are jointly Gaussian, we\nmay write any u\u2217i as an af\ufb01ne function of xi plus independent Gaussian noise. Thus, the minimum\nmean squared estimator is given by the conditional expectation\n\n$$\\hat{u}_i = \\hat{\\pi}_i(x_i) = \\mathbb{E}\\left[u^*_i \\mid x_i\\right] = \\mathbb{E}\\left[u^*_i\\right] + \\frac{\\rho_{u^*_i, x_i} \\sigma_{u^*_i}}{\\sigma_{x_i}} \\left(x_i - \\mathbb{E}\\left[x_i\\right]\\right). \\tag{7}$$\n\nThus, we have found a closed form expression for the best regressor \u02c6\u03c0i to predict u\u2217i from only xi in\nthe joint Gaussian case with squared error distortion. This result comes as a direct consequence of\nknowing the true parameterization of the joint distribution p(u\u2217, x) (in this case, as a Gaussian).\n\n3.4 Determining Minimum Distortion in Practice\nOften in practice, we do not know the parameterization p(u\u2217|x) and hence it may be intractable to\ndetermine D\u2217 and the corresponding decentralized policies \u02c6\u03c0i. 
However, if one can assume that\np(u\u2217|x) belongs to a family of parameterized functions (for instance universal function approximators\nsuch as deep neural networks), then it is theoretically possible to attain or at least approach minimum\ndistortion for arbitrary non-negative distortion measures.\nPractically, one can compute the mutual information constraint I(u\u2217i ; xi) from (2) to understand\nhow much information a regressor \u02c6\u03c0i(xi) has available to reconstruct u\u2217i . In the Gaussian case, we\nwere able to compute this mutual information in closed form. For data from general distributions\nhowever, there is often no way to compute mutual information analytically. Instead, we rely on access\nto suf\ufb01cient data $\\{x[t], u^*[t]\\}_{t=1}^T$, in order to estimate mutual information numerically. In such\nsituations (e.g. Sec. 5), we discretize the data and then compute mutual information with a minimax\nrisk estimator, as proposed by Jiao et al. [2014].\n\n4 Allowing Restricted Communication\n\nSuppose that a decentralized policy \u02c6\u03c0i suffers from insuf\ufb01cient mutual information between its local\nmeasurement xi and the optimal action u\u2217i . In this case, we would like to quantify the potential\nbene\ufb01ts of communicating with other nodes j \u2260 i in order to reduce the distortion limit D\u2217 from (2)\nand improve its ability to reconstruct u\u2217i . In this section, we present an information-theoretic solution\nto the problem of how to choose optimally which other data to observe, and we provide a lower\nbound-achieving solution for the idealized Gaussian case introduced in Sec. 3.3. We assume that in\naddition to observing its own local state xi, each \u02c6\u03c0i is allowed to depend on at most k other xj, j \u2260 i.\nTheorem 1. 
(Restricted Communication)\nIf Si is the set of k nodes j \u2208 N , j \u2260 i, which \u02c6ui is allowed to observe in addition to xi, then setting\n\n$$S_i = \\arg\\max_{S} \\; I\\left(u^*_i; x_i, \\{x_j : j \\in S\\}\\right) \\;:\\; |S| = k, \\tag{8}$$\n\nminimizes the best-case expectation of any distortion measure. That is, this choice of Si yields the\nsmallest lower bound D\u2217 from (2) of any possible choice of S.\n\nProof. By assumption, Si maximizes the mutual information between the observed local states\n{xi, xj : j \u2208 Si} and the optimal action u\u2217i . This mutual information is equivalent to the notion\nof rate R in the classical rate distortion theorem [Cover and Thomas, 2012]. It is well-known that the\ndistortion rate function D(R) is convex and monotone decreasing in R. Thus, by maximizing mutual\ninformation R we are guaranteed to minimize distortion D(R), and hence D\u2217.\n\nTheorem 1 provides a means of choosing a subset of the state {xj : j \u2260 i} to communicate to each\ndecentralized policy \u02c6\u03c0i that minimizes the corresponding best expected distortion D\u2217. Practically\nspeaking, this result may be interpreted as formalizing the following intuition: \u201cthe best thing to do\nis to transmit the most information.\u201d In this case, \u201ctransmitting the most information\u201d corresponds\nto allowing \u02c6\u03c0i to observe the set S of nodes {xj : j \u2260 i} which contains the most information\nabout u\u2217i . Likewise, by \u201cbest\u201d we mean that Si minimizes the best-case expected distortion D\u2217,\nfor any distortion metric d. As in Sec. 
3.3, without making some assumption about the structure\nof the distribution of x and u\u2217, we cannot guarantee that any particular regressor \u02c6\u03c0i will attain D\u2217.\nNevertheless, in a practical situation where suf\ufb01cient data $\\{x[t], u^*[t]\\}_{t=1}^T$ is available, we can solve\n(8) by estimating mutual information [Jiao et al., 2014].\n\n4.1 Example: Joint Gaussian, Squared Error with Communication\n\nHere, we reexamine the joint Gaussian-distributed, mean squared error distortion case from Sec. 3.3,\nand apply Thm. 1. We will take $u^* \\in \\mathbb{R}^1$, $x \\in \\mathbb{R}^{10}$ and u\u2217, x jointly Gaussian with zero mean and\narbitrary covariance. The speci\ufb01c covariance matrix \u03a3 of the joint distribution p(u\u2217, x) is visualized\nin Fig. 3a. For simplicity, we show the squared correlation coef\ufb01cients of \u03a3 which lie in [0, 1]. The\nboxed cells in \u03a3 in Fig. 3a indicate that x9 solves (8), i.e. j = 9 maximizes I(u\u2217; x1, xj), the mutual\ninformation between the observed data and regression target u\u2217. Intuitively, this choice of j is best\nbecause x9 is highly correlated to u\u2217 and weakly correlated to x1, which is already observed by \u02c6u;\nthat is, it conveys a signi\ufb01cant amount of information about u\u2217 that is not already conveyed by x1.\nFigure 3b shows empirical results. Along the horizontal axis we increase the value of k, the number\nof additional variables xj which regressor \u02c6\u03c0i observes. The vertical axis shows the resulting average\ndistortion. We show results for a linear regressor of the form of (7) where we have chosen Si optimally\naccording to (8), as well as uniformly at random from all possible sets of unique indices. Note that\nthe optimal choice of Si yields the lowest average distortion D\u2217 for all choices of k. 
Moreover, the\nlinear regressor of (7) achieves D\u2217 for all k, since we have assumed a Gaussian joint distribution.\n\n(a) Squared correlation coef\ufb01cients.\n\n(b) Comparison of communication strategies.\n\nFigure 3: Results for optimal communication strategies on a synthetic Gaussian example. (a) shows\nsquared correlation coef\ufb01cients between u\u2217 and all xi. The boxed entries correspond to x9,\nwhich was found to be optimal for k = 1. (b) shows that the optimal communication strategy of Thm.\n1 achieves the lowest average distortion and outperforms the average over random strategies.\n\n5 Application to Optimal Power Flow\n\nIn this case study, we aim to minimize the voltage variability in an electric grid caused by intermittent\nrenewable energy sources and the increasing load caused by electric vehicle charging. We do so\nby controlling the reactive power output of distributed energy resources (DERs), while adhering\nto the physics of power \ufb02ow and constraints due to energy capacity and safety. Recently, various\napproaches have been proposed, such as [Farivar et al., 2013] or [Zhang et al., 2014]. In these\nmethods, DERs tend to rely on an extensive communication infrastructure, either with a central\nmaster node [Xu et al., 2017] or between agents leveraging local computation [Dall\u2019Anese et al.,\n2014]. We study the application of regression-based decentralization, as outlined in Sec. 3 and Fig. 2, to the optimal\npower \ufb02ow (OPF) problem [Low, 2014], as initially proposed by Sondermeijer et al. [2016]. We\napply Thm. 1 to determine the communication strategy that minimizes optimal distortion to further\nimprove the reconstruction of the optimal actions u\u2217i .\nSolving OPF requires a model of the electricity grid describing both topology and impedances; this\nis represented as a graph G = (N ,E). 
For clarity of exposition and without loss of generality, we\nintroduce the linearized power \ufb02ow equations over radial networks, also known as the LinDistFlow\nequations [Baran and Wu, 1989]:\n\n\\begin{align}\nP_{ij} &= \\sum_{(j,k) \\in \\mathcal{E},\\, k \\neq i} P_{jk} + p^c_j - p^g_j, \\tag{9a} \\\\\nQ_{ij} &= \\sum_{(j,k) \\in \\mathcal{E},\\, k \\neq i} Q_{jk} + q^c_j - q^g_j, \\tag{9b} \\\\\nv_j &= v_i - 2\\left(r_{ij} P_{ij} + \\xi_{ij} Q_{ij}\\right). \\tag{9c}\n\\end{align}\n\nIn this model, capitals Pij and Qij represent real and reactive power \ufb02ow on a branch from node i to\nnode j for all branches (i, j) \u2208 E, lower case $p^c_i$ and $q^c_i$ are the real and reactive power consumption\nat node i, and $p^g_i$ and $q^g_i$ are its real and reactive power generation. Complex line impedances\n$r_{ij} + \\sqrt{-1}\\,\\xi_{ij}$ have the same indexing as the power \ufb02ows. The LinDistFlow equations use the squared\nvoltage magnitude vi, de\ufb01ned and indexed at all nodes i \u2208 N . These equations are included as\nconstraints in the optimization problem to enforce that the solution adheres to laws of physics.\nTo formulate our decentralized learning problem, we will treat $x_i \\triangleq (p^c_i, q^c_i, p^g_i)$ to be the local state\nvariable, and, for all controllable nodes, i.e. agents i \u2208 C, we have $u_i \\triangleq q^g_i$, i.e. the reactive power\ngeneration can be controlled (vi, Pij, Qij are treated as dummy variables). We assume that for all\nnodes i \u2208 N , consumption $p^c_i$, $q^c_i$ and real power generation $p^g_i$ are predetermined respectively by\nthe demand and the power generated by a potential photovoltaic (PV) system. The action space is\nconstrained by the reactive power capacity $|u_i| = |q^g_i| \\le \\bar{q}_i$. 
In addition, voltages are maintained\nwithin \u00b15% of 120 V, which is expressed as the constraint $\\underline{v} \\le v_i \\le \\overline{v}$. The OPF problem now reads\n\n$$u^* = \\arg\\min_{q^g_i,\\, \\forall i \\in \\mathcal{C}} \\; \\sum_{i \\in \\mathcal{N}} |v_i - v_{\\text{ref}}| \\quad \\text{s.t.} \\quad (9), \\quad |q^g_i| \\le \\bar{q}_i, \\quad \\underline{v} \\le v_i \\le \\overline{v}. \\tag{10}$$\n\n(a) Voltage output with and without control.\n\n(b) Comparison of OPF communication strategies.\n\nFigure 4: Results for decentralized learning on an OPF problem. (a) shows an example result of\ndecentralized learning - the shaded region represents the range of all voltages in a network over a\nfull day. As compared to no control, the fully decentralized regression-based control reduces voltage\nvariation and prevents constraint violation (dashed line). (b) shows that the optimal communication\nstrategy Si outperforms the average for random strategies on the mean squared error distortion metric.\nThe regressors used are stepwise linear policies \u02c6\u03c0i with linear or quadratic features.\n\nFollowing Fig. 2, we employ models of real electrical distribution grids (including the IEEE Test\nFeeders [IEEE PES, 2017]), which we equip with T historical readings $\\{x[t]\\}_{t=1}^T$ of load and PV\ndata, which is composed with real smart meter measurements sourced from Pecan Street Inc. [2017].\nWe solve (10) for all data, yielding a set of minimizers $\\{u^*[t]\\}_{t=1}^T$. We then separate the overall data\nset into C smaller data sets $\\{x_i[t], u^*_i[t]\\}_{t=1}^T$, \u2200i \u2208 C and train linear policies with feature kernels\n\u03c6i(\u00b7) and parameters \u03b8i of the form $\\hat{\\pi}_i(x_i) = \\theta_i^\\top \\phi_i(x_i)$. Practically, the challenge is to select the best\nfeature kernel \u03c6i(\u00b7). 
We extend earlier work which showed that decentralized learning for OPF can\nbe done satisfactorily via a hybrid forward- and backward-stepwise selection algorithm [Friedman\net al., 2001, Chapter 3] that uses quadratic feature kernels.\nFigure 4a shows the result for an electric distribution grid model based on a real network from\nArizona. This network has 129 nodes and, in simulation, 53 nodes were equipped with a controllable\nDER (i.e. N = 129, C = 53). In Fig. 4a we show the voltage deviation from a normalized setpoint\non a simulated network with data not used during training. The improvement over the no-control\nbaseline is striking, and performance is nearly identical to the optimum achieved by the centralized\nsolution. Concretely, we observed: (i) no constraint violations, and (ii) a suboptimality deviation of\n0.15% on average, with a maximum deviation of 1.6%, as compared to the optimal policy \u03c0\u2217.\nIn addition, we applied Thm. 1 to the OPF problem for a smaller network [IEEE PES, 2017], in order\nto determine the optimal communication strategy to minimize a squared error distortion measure. Fig.\n4b shows the mean squared error distortion measure for an increasing number of observed nodes k\nand shows how the optimal strategy outperforms an average over random strategies.\n\n6 Conclusions and Future Work\n\nThis paper generalizes the approach of Sondermeijer et al. [2016] to solve multi-agent static optimal\ncontrol problems with decentralized policies that are learned offline from historical data. Our rate\ndistortion framework facilitates a principled analysis of the performance of such decentralized policies\nand the design of optimal communication strategies to improve individual policies.
These techniques\nwork well on a model of a sophisticated real-world OPF example.\nThere are still many open questions about regression-based decentralization. It is well known\nthat strong interactions between different subsystems may lead to instability and suboptimality in\ndecentralized control problems [Davison and Chang, 1990]. There are natural extensions of our work\nto address dynamic control problems more explicitly, and stability analysis is a topic of ongoing work.\nAlso, analysis of the suboptimality of regression-based decentralization should be possible within\nour rate distortion framework. Finally, it is worth investigating the use of deep neural networks to\nparameterize both the distribution p(u\u2217|x) and local policies \u03c0\u0302i in more complicated decentralized\ncontrol problems with arbitrary distortion measures.\n\n[Figure 4(b) plot: MSE (\u00d710\u22123) versus Additional Observations k; legend: linear, random; linear, optimal; quadratic, random; quadratic, optimal.]\n\nAcknowledgments\n\nThe authors would like to acknowledge Roberto Calandra for his insightful suggestions and feedback\non the manuscript. This research is supported by NSF under the CPS Frontiers VehiCal project\n(1545126), by the UC-Philippine-California Advanced Research Institute under projects IIID-2016-005\nand IIID-2015-10, and by the ONR MURI Embedded Humans (N00014-16-1-2206). David\nFridovich-Keil was also supported by the NSF GRFP.\n\nReferences\n\nP. Abbeel and A. Y. Ng. Apprenticeship Learning via Inverse Reinforcement Learning. In International Conference on Machine Learning, New York, NY, USA, 2004. ACM.\n\nR. J. Aumann and J. H. Dreze. Cooperative games with coalition structures. International Journal of Game Theory, 3(4):217\u2013237, Dec. 1974.\n\nM. Baran and F. Wu. Optimal capacitor placement on radial distribution systems. IEEE Transactions on Power Delivery, 4(1):725\u2013734, Jan. 1989.\n\nS. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein.
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends\u00ae in Machine Learning, 3(1):1\u2013122, July 2011.\n\nM. Brambilla, E. Ferrante, M. Birattari, and M. Dorigo. Swarm robotics: a review from the swarm engineering perspective. Swarm Intelligence, 7(1):1\u201341, Mar. 2013.\n\nP. D. Christofides, R. Scattolini, D. M. de la Pena, and J. Liu. Distributed model predictive control: A tutorial review and future research directions. Computers & Chemical Engineering, 51:21\u201341, 2013.\n\nT. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.\n\nE. Dall\u2019Anese, S. V. Dhople, and G. Giannakis. Optimal dispatch of photovoltaic inverters in residential distribution systems. Sustainable Energy, IEEE Transactions on, 5(2):487\u2013497, 2014. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6719562.\n\nE. J. Davison and T. N. Chang. Decentralized stabilization and pole assignment for general proper systems. IEEE Transactions on Automatic Control, 35(6):652\u2013664, 1990.\n\nM. Farivar, L. Chen, and S. Low. Equilibrium and dynamics of local voltage control in distribution systems. In 2013 IEEE 52nd Annual Conference on Decision and Control (CDC), pages 4329\u20134334, Dec. 2013. doi: 10.1109/CDC.2013.6760555.\n\nJ. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics. Springer, Berlin, 2001.\n\nC. V. Goldman and S. Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. J. Artif. Int. Res., 22(1):143\u2013174, Nov. 2004. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=1622487.1622493.\n\nIEEE PES. IEEE Distribution Test Feeders, 2017. URL http://ewh.ieee.org/soc/pes/dsacom/testfeeders/.\n\nJ. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax Estimation of Functionals of Discrete Distributions.
arXiv preprint, June 2014. arXiv:1406.6956.\n\nS. Low. Convex Relaxation of Optimal Power Flow; Part I: Formulations and Equivalence. IEEE Transactions on Control of Network Systems, 1(1):15\u201327, Mar. 2014.\n\nJ. Lunze. Feedback Control of Large Scale Systems. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1992. ISBN 013318353X.\n\nP. J. Modi, W.-M. Shen, M. Tambe, and M. Yokoo. Adopt: Asynchronous distributed constraint optimization with quality guarantees. Artif. Intell., 161(1-2):149\u2013180, Jan. 2005. ISSN 0004-3702. doi: 10.1016/j.artint.2004.09.003. URL http://dx.doi.org/10.1016/j.artint.2004.09.003.\n\nR. Nair, P. Varakantham, M. Tambe, and M. Yokoo. Networked Distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In AAAI, volume 5, pages 133\u2013139, 2005.\n\nF. A. Oliehoek and C. Amato. A Concise Introduction to Decentralized POMDPs. Springer International Publishing, 1 edition, 2016.\n\nP. A. Ortega, D. A. Braun, J. Dyer, K.-E. Kim, and N. Tishby. Information-Theoretic Bounded Rationality. arXiv preprint, 2015. arXiv:1512.06789.\n\nPecan Street Inc. Dataport, 2017. URL http://www.pecanstreet.org/.\n\nL. Peshkin, K.-E. Kim, N. Meuleau, and L. P. Kaelbling. Learning to cooperate via policy search. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI\u201900, pages 489\u2013496, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-709-9. URL http://dl.acm.org/citation.cfm?id=2073946.2074003.\n\nY. Pu, M. N. Zeilinger, and C. N. Jones. Inexact fast alternating minimization algorithm for distributed model predictive control. In Conference on Decision and Control, Los Angeles, CA, USA, 2014. IEEE.\n\nR. L. Raffard, C. J. Tomlin, and S. P. Boyd. Distributed optimization for cooperative agents: Application to formation flight. In Conference on Decision and Control, Nassau, The Bahamas, 2004. IEEE.\n\nC. Sammut.
Automatic construction of reactive control systems using symbolic machine learning. The Knowledge Engineering Review, 11(01):27\u201342, 1996.\n\nD. D. Siljak. Decentralized control of complex systems. Dover Books on Electrical Engineering. Dover, New York, NY, 2011. URL http://cds.cern.ch/record/1985961.\n\nO. Sondermeijer, R. Dobbe, D. B. Arnold, C. Tomlin, and T. Keviczky. Regression-based Inverter Control for Decentralized Optimal Power Flow and Voltage Regulation. In Power and Energy Society General Meeting, Boston, MA, USA, July 2016. IEEE.\n\nA. X. Sun, D. T. Phan, and S. Ghosh. Fully decentralized AC optimal power flow algorithms. In Power and Energy Society General Meeting, Vancouver, Canada, July 2013. IEEE.\n\nY. Xu, Z. Y. Dong, R. Zhang, and D. J. Hill. Multi-Timescale Coordinated Voltage/Var Control of High Renewable-Penetrated Distribution Systems. IEEE Transactions on Power Systems, PP(99):1\u20131, 2017. ISSN 0885-8950. doi: 10.1109/TPWRS.2017.2669343.\n\nM. N. Zeilinger, Y. Pu, S. Riverso, G. Ferrari-Trecate, and C. N. Jones. Plug and play distributed model predictive control based on distributed invariance and optimization. In Conference on Decision and Control, Florence, Italy, 2013. IEEE.\n\nB. Zhang, A. Lam, A. Dominguez-Garcia, and D. Tse. An Optimal and Distributed Method for Voltage Regulation in Power Distribution Systems. IEEE Transactions on Power Systems, PP(99):1\u201313, 2014. ISSN 0885-8950. doi: 10.1109/TPWRS.2014.2347281.", "award": [], "sourceid": 1706, "authors": [{"given_name": "Roel", "family_name": "Dobbe", "institution": "UC Berkeley"}, {"given_name": "David", "family_name": "Fridovich-Keil", "institution": "UC Berkeley"}, {"given_name": "Claire", "family_name": "Tomlin", "institution": "UC Berkeley"}]}