{"title": "Policy Gradient With Value Function Approximation For Collective Multiagent Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 4319, "page_last": 4329, "abstract": "Decentralized (PO)MDPs provide an expressive framework for sequential decision making in a multiagent system. Given their computational complexity, recent research has focused on tractable yet practical subclasses of Dec-POMDPs. We address such a subclass called CDec-POMDP where the collective behavior of a population of agents affects the joint-reward and environment dynamics. Our main contribution is an actor-critic (AC) reinforcement learning method for optimizing CDec-POMDP policies. Vanilla AC has slow convergence for larger problems. To address this, we show how a particular decomposition of the approximate action-value function over agents leads to effective updates, and also derive a new way to train the critic based on local reward signals. Comparisons on a synthetic benchmark and a real world taxi fleet optimization problem show that our new AC approach provides better quality solutions than previous best approaches.", "full_text": "Policy Gradient With Value Function Approximation\n\nFor Collective Multiagent Planning\n\nDuc Thien Nguyen Akshat Kumar Hoong Chuin Lau\n\n{dtnguyen.2014,akshatkumar,hclau}@smu.edu.sg\n\nSchool of Information Systems\n\nSingapore Management University\n80 Stamford Road, Singapore 178902\n\nAbstract\n\nDecentralized (PO)MDPs provide an expressive framework for sequential deci-\nsion making in a multiagent system. Given their computational complexity, re-\ncent research has focused on tractable yet practical subclasses of Dec-POMDPs.\nWe address such a subclass called CDec-POMDP where the collective behavior\nof a population of agents affects the joint-reward and environment dynamics. Our\nmain contribution is an actor-critic (AC) reinforcement learning method for opti-\nmizing CDec-POMDP policies. 
Vanilla AC has slow convergence for larger prob-\nlems. To address this, we show how a particular decomposition of the approximate\naction-value function over agents leads to effective updates, and also derive a new\nway to train the critic based on local reward signals. Comparisons on a synthetic\nbenchmark and a real world taxi \ufb02eet optimization problem show that our new AC\napproach provides better quality solutions than previous best approaches.\n\n1\n\nIntroduction\n\nDecentralized partially observable MDPs (Dec-POMDPs) have emerged in recent years as a promis-\ning framework for multiagent collaborative sequential decision making (Bernstein et al., 2002).\nDec-POMDPs model settings where agents act based on different partial observations about the\nenvironment and each other to maximize a global objective. Applications of Dec-POMDPs include\ncoordinating planetary rovers (Becker et al., 2004b), multi-robot coordination (Amato et al., 2015)\nand throughput optimization in wireless network (Winstein and Balakrishnan, 2013; Pajarinen et al.,\n2014). However, solving Dec-POMDPs is computationally challenging, being NEXP-Hard even for\n2-agent problems (Bernstein et al., 2002).\nTo increase scalability and application to practical problems, past research has explored restricted\ninteractions among agents such as state transition and observation independence (Nair et al., 2005;\nKumar et al., 2011, 2015), event driven interactions (Becker et al., 2004a) and weak coupling among\nagents (Witwicki and Durfee, 2010). Recently, a number of works have focused on settings where\nagent identities do not affect interactions among agents. Instead, environment dynamics are pri-\nmarily driven by the collective in\ufb02uence of agents (Varakantham et al., 2014; Sonu et al., 2015;\nRobbel et al., 2016; Nguyen et al., 2017), similar to well known congestion games (Meyers and\nSchulz, 2012). 
Several problems in urban transportation such as taxi supply-demand matching can\nbe modeled using such collective planning models (Varakantham et al., 2012; Nguyen et al., 2017).\nIn this work, we focus on the collective Dec-POMDP framework (CDec-POMDP) that formalizes\nsuch a collective multiagent sequential decision making problem under uncertainty (Nguyen et al.,\n2017). Nguyen et al. present a sampling based approach to optimize policies in the CDec-POMDP\nmodel. A key drawback of this previous approach is that policies are represented in a tabular form\nwhich scales poorly with the size of observation space of agents. Motivated by the recent suc-\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: T-step DBN for a CDec-POMDP\n\ncess of reinforcement learning (RL) approaches (Mnih et al., 2015; Schulman et al., 2015; Mnih\net al., 2016; Foerster et al., 2016; Leibo et al., 2017), our main contribution is a actor-critic (AC)\nreinforcement learning method (Konda and Tsitsiklis, 2003) for optimizing CDec-POMDP policies.\nPolicies are represented using function approxi-\nmator such as a neural network, thereby avoiding\nthe scalability issues of a tabular policy. We derive\nthe policy gradient and develop a factored action-\nvalue approximator based on collective agent in-\nteractions in CDec-POMDPs. Vanilla AC is slow\nto converge on large problems due to known issues\nof learning with global reward in large multiagent\nsystems (Bagnell and Ng, 2005). To address this,\nwe also develop a new way to train the critic, our\naction-value approximator, that effectively utilizes\nlocal value function of agents.\nWe test our approach on a synthetic multirobot grid navigation domain from (Nguyen et al., 2017),\nand a real world supply-demand taxi matching problem in a large Asian city with up to 8000 taxis (or\nagents) showing the scalability of our approach to large multiagent systems. 
Empirically, our new factored actor-critic approach outperforms previous best approaches, providing much higher solution quality. The factored AC algorithm also converges much faster than vanilla AC, validating the effectiveness of our new training approach for the critic.\nRelated work: Our work is based on the framework of policy gradient with an approximate value function, similar to Sutton et al. (1999). However, as we empirically show, directly applying the original policy gradient from Sutton et al. (1999) to the multiagent setting, and specifically to the CDec-POMDP model, results in a high variance solution. In this work, we show a suitable form of compatible value function approximation for CDec-POMDPs that results in an efficient and low variance policy gradient update. Reinforcement learning for decentralized policies has been studied earlier in Peshkin et al. (2000) and Aberdeen (2006). Guestrin et al. (2002) also proposed using REINFORCE to train a softmax policy of a factored value function from the coordination graph. However, in such previous works the policy gradient is estimated from global empirical returns instead of a decomposed critic. We show in section 4 that having a decomposed critic, along with individual value function based training of this critic, is important for sample-efficient learning. Our empirical results show that our proposed critic training has faster convergence than training with global empirical returns.\n\n2 Collective Decentralized POMDP Model\nWe first describe the CDec-POMDP model introduced in (Nguyen et al., 2017). A T-step Dynamic Bayesian Network (DBN) for this model is shown using the plate notation in figure 1. It consists of the following:\n\u2022 A finite planning horizon H.\n\u2022 The number of agents M. An agent m can be in one of the states in the state space S. The joint state space is \times_{m=1}^{M} S. We denote a single state as i \in S.\n\u2022 A set of actions A for each agent m. We denote an individual action as j \in A.\n\u2022 Let (s_{1:H}, a_{1:H})^m = (s^m_1, a^m_1, s^m_2, \ldots, s^m_H, a^m_H) denote the complete state-action trajectory of an agent m. We denote the state and action of agent m at time t using random variables s^m_t, a^m_t. Different indicator functions I_t(\cdot) are defined in table 1. We define the following count given the trajectory of each agent m:\n\nn_t(i, j, i') = \sum_{m=1}^{M} I^m_t(i, j, i') \quad \forall i, i' \in S, j \in A\n\nAs noted in table 1, the count n_t(i, j, i') denotes the number of agents in state i taking action j at time step t and transitioning to next state i'; other counts, n_t(i) and n_t(i, j), are defined analogously. Using these counts, we can define the count tables n_{s_t} and n_{s_t a_t} for the time step t as shown in table 1.\n\nI^m_t(i) \in \{0, 1\}: 1 iff agent m is at state i at time t, i.e., s^m_t = i\nI^m_t(i, j) \in \{0, 1\}: 1 iff (s^m_t, a^m_t) = (i, j)\nI^m_t(i, j, i') \in \{0, 1\}: 1 iff (s^m_t, a^m_t, s^m_{t+1}) = (i, j, i')\nn_t(i) \in [0; M]: number of agents at state i at time t\nn_t(i, j) \in [0; M]: number of agents at state i taking action j at time t\nn_t(i, j, i') \in [0; M]: number of agents at state i taking action j at time t and transitioning to state i' at time t + 1\nn_{s_t}: count table (n_t(i) \forall i \in S)\nn_{s_t a_t}: count table (n_t(i, j) \forall i \in S, j \in A)\nn_{s_t a_t s_{t+1}}: count table (n_t(i, j, i') \forall i, i' \in S, j \in A)\nTable 1: Summary of notations given the state-action trajectories, (s_{1:H}, a_{1:H})^m \forall m, for all the agents\n\n\u2022 We assume a general partially 
observable setting wherein agents can have different observations based on the collective influence of other agents. An agent observes its local state s^m_t. In addition, it also observes o^m_t at time t based on its local state s^m_t and the count table n_{s_t}. E.g., an agent m in state i at time t can observe the count of other agents also in state i (= n_t(i)) or of other agents in some neighborhood of the state i (= \{n_t(j) \forall j \in Nb(i)\}).\n\u2022 The transition function is \phi_t(s^m_{t+1} = i' | s^m_t = i, a^m_t = j, n_{s_t}). The transition function is the same for all the agents. Notice that it is affected by n_{s_t}, which depends on the collective behavior of the agent population.\n\u2022 Each agent m has a non-stationary policy \pi^m_t(j | i, o^m_t(i, n_{s_t})) denoting the probability of agent m taking action j given its observation (i, o^m_t(i, n_{s_t})) at time t. We denote the policy over the planning horizon of an agent m by \pi^m = (\pi^m_1, \ldots, \pi^m_H).\n\u2022 An agent m receives the reward r^m_t = r_t(i, j, n_{s_t}) dependent on its local state and action, and the counts n_{s_t}.\n\u2022 The initial state distribution, b_o = (P(i) \forall i \in S), is the same for all agents.\nWe present here the simplest version where all the agents are of the same type, having similar state transition, observation and reward models. The model can handle multiple agent types where agents have different dynamics based on their type. We can also incorporate an external state that is unaffected by agents' actions (such as taxi demand in the transportation domain). Our results are extendable to such settings as well.\nModels such as CDec-POMDPs are useful in settings where the agent population is large, and agent identity does not affect the reward or the transition function. 
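To make the count definitions above concrete, here is a minimal sketch (our own illustration, not code from the paper; the zone labels, actions, and three-agent population are hypothetical) that builds the tables of table 1 from complete state-action trajectories:

```python
from collections import Counter

def count_tables(trajectories, t):
    """Build the count tables n_st, n_stat, n_statst+1 at time step t
    from complete state-action trajectories, one list per agent.
    Each trajectory is a list [(s_1, a_1), (s_2, a_2), ...]."""
    n_s = Counter()    # n_t(i): number of agents at state i at time t
    n_sa = Counter()   # n_t(i, j): agents at state i taking action j
    n_sas = Counter()  # n_t(i, j, i'): ... and transitioning to i'
    for traj in trajectories:
        i, j = traj[t]
        n_s[i] += 1
        n_sa[(i, j)] += 1
        if t + 1 < len(traj):
            i_next = traj[t + 1][0]
            n_sas[(i, j, i_next)] += 1
    return n_s, n_sa, n_sas

# Hypothetical example: two agents start in zone 'z1', one in 'z2'.
trajs = [[('z1', 'stay'), ('z1', 'move')],
         [('z1', 'move'), ('z2', 'stay')],
         [('z2', 'stay'), ('z2', 'move')]]
n_s, n_sa, n_sas = count_tables(trajs, 0)
assert n_s['z1'] == 2 and sum(n_s.values()) == 3  # sum_i n_t(i) = M
assert n_sas[('z1', 'move', 'z2')] == 1
```

Note that summing n_t(i) over all states recovers the population size M, which is exactly the consistency requirement placed on allowed count tables later in the paper.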
A motivating application of this model\nis for the taxi-\ufb02eet optimization where the problem is to compute policies for taxis such that the total\npro\ufb01t of the \ufb02eet is maximized (Varakantham et al., 2012; Nguyen et al., 2017). The decision making\nfor a taxi is as follows. At time t, each taxi observes its current city zone z (different zones constitute\nthe state-space S), and also the count of other taxis in the current zone and its neighboring zones\nas well as an estimate of the current local demand. This constitutes the count-based observation\no(\u00b7) for the taxi. Based on this observation, the taxi must decide whether to stay in the current\nzone z to look for passengers or move to another zone. These decision choices depend on several\nfactors such as the ratio of demand and the count of other taxis in the current zone. Similarly, the\nenvironment is stochastic with variable taxi demand at different times. Such historical demand data\nis often available using GPS traces of the taxi \ufb02eet (Varakantham et al., 2012).\nCount-Based statistic for planning: A key property in the CDec-POMDP model is that the model\ndynamics depend on the collective interaction among agents rather than agent identities. In settings\nsuch as taxi \ufb02eet optimization, the agent population size can be quite large (\u2248 8000 for our real\nworld experiments). Given such a large population, it is not possible to compute unique policy for\neach agent. Therefore, similar to previous work (Varakantham et al., 2012; Nguyen et al., 2017),\nour goal is to compute a homogenous policy \u03c0 for all the agents. As the policy \u03c0 is dependent on\ncounts, it represents an expressive class of policies.\nFor a \ufb01xed population M, let {(s1:T , a1:T )m \u2200m} denote the state-action trajectories of different\nagents sampled from the DBN in \ufb01gure 1. 
Let n_{1:T} = \{(n_{s_t}, n_{s_t a_t}, n_{s_t a_t s_{t+1}}) \forall t = 1:T\} be the combined vector of the resulting count tables for each time step t. Nguyen et al. show that the counts n are a sufficient statistic for planning. That is, the joint-value function of a policy \pi over horizon H can be computed by the expectation over counts as (Nguyen et al., 2017):\n\nV(\pi) = \sum_{m=1}^{M} E\Big[\sum_{T=1}^{H} r^m_T\Big] = \sum_{n \in \Omega_{1:H}} P(n; \pi) \Big[\sum_{T=1}^{H} \sum_{i \in S, j \in A} n_T(i, j)\, r_T(i, j, n_T)\Big] \quad (1)\n\nThe set \Omega_{1:H} is the set of all allowed consistent count tables:\n\n\sum_{i \in S} n_T(i) = M \ \forall T; \quad \sum_{j \in A} n_T(i, j) = n_T(i) \ \forall i, \forall T; \quad \sum_{i' \in S} n_T(i, j, i') = n_T(i, j) \ \forall i \in S, j \in A, \forall T\n\nP(n; \pi) is the distribution over counts (detailed expression in the appendix). A key benefit of this result is that we can evaluate the policy \pi by sampling counts n directly from P(n; \pi) without sampling individual agent trajectories (s_{1:H}, a_{1:H})^m for different agents, resulting in significant computational savings. Our goal is to compute the optimal policy \pi that maximizes V(\pi). We assume an RL setting with centralized learning and decentralized execution. We assume a simulator is available that can provide count samples from P(n; \pi).\n\n3 Policy Gradient for CDec-POMDPs\n\nPrevious work proposed an expectation-maximization (EM) (Dempster et al., 1977) based sampling approach to optimize the policy \pi (Nguyen et al., 2017). The policy is represented as a piecewise linear tabular policy over the space of counts n, where each linear piece specifies a distribution over next actions. 
However, this tabular representation is limited in its expressive power as the number of pieces is fixed a priori, and the range of each piece has to be defined manually, which can adversely affect performance. Furthermore, exponentially many pieces are required when the observation o is multidimensional (i.e., an agent observes counts from some local neighborhood of its location). To address such issues, our goal is to optimize policies in a functional form such as a neural network.\nWe first extend the policy gradient theorem of (Sutton et al., 1999) to CDec-POMDPs. Let \theta denote the vector of policy parameters. We next show how to compute \nabla_\theta V(\pi). Let s_t, a_t denote the joint-state and joint-actions of all the agents at time t. The value function of a given policy \pi in an expanded form is given as:\n\nV_t(\pi) = \sum_{s_t, a_t} P^{\pi}(s_t, a_t | b_o)\, Q^{\pi}_t(s_t, a_t) \quad (2)\n\nwhere P^{\pi}(s_t, a_t | b_o) = \sum_{s_{1:t-1}, a_{1:t-1}} P^{\pi}(s_{1:t}, a_{1:t} | b_o) is the distribution of the joint state-action s_t, a_t under the policy \pi. The action-value function Q^{\pi}_t(s_t, a_t) is computed as:\n\nQ^{\pi}_t(s_t, a_t) = r_t(s_t, a_t) + \sum_{s_{t+1}, a_{t+1}} P^{\pi}(s_{t+1}, a_{t+1} | s_t, a_t)\, Q^{\pi}_{t+1}(s_{t+1}, a_{t+1}) \quad (3)\n\nWe next state the policy gradient theorem for CDec-POMDPs:\nTheorem 1. For any CDec-POMDP, the policy gradient is given as:\n\n\nabla_{\theta} V_1(\pi) = \sum_{t=1}^{H} E_{s_t, a_t | b_o, \pi}\Big[ Q^{\pi}_t(s_t, a_t) \sum_{i \in S, j \in A} n_t(i, j)\, \nabla_{\theta} \log \pi_t\big(j | i, o(i, n_{s_t})\big) \Big] \quad (4)\n\nThe proofs of this theorem and other subsequent results are provided in the appendix.\nNotice that computing the policy gradient using the above result is not practical for multiple reasons. The space of joint state-actions (s_t, a_t) is combinatorial. 
Given that the agent population size can be large, sampling each agent's trajectory is not computationally tractable. To remedy this, we later show how to compute the gradient by directly sampling counts n \sim P(n; \pi), similar to the policy evaluation in (1). Similarly, one can estimate the action-value function Q^{\pi}_t(s_t, a_t) using empirical returns as an approximation. This would be the analogue of the standard REINFORCE algorithm (Williams, 1992) for CDec-POMDPs. It is well known that REINFORCE may learn more slowly than methods that use a learned action-value function (Sutton et al., 1999). Therefore, we next present a function approximator for Q^{\pi}_t, and show the computation of the policy gradient by directly sampling counts n.\n\n3.1 Policy Gradient with Action-Value Approximation\n\nOne can approximate the action-value function Q^{\pi}_t(s_t, a_t) in several different ways. We consider the following special form of the approximate value function f_w:\n\nQ^{\pi}_t(s_t, a_t) \approx f_w(s_t, a_t) = \sum_{m=1}^{M} f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big) \quad (5)\n\nwhere each f^m_w is defined for each agent m and takes as input the agent's local state, action and observation. Notice that the different components f^m_w are correlated as they depend on the common count table n_{s_t}. Such a decomposable form is useful as it leads to efficient policy gradient computation. Furthermore, an important class of approximate value functions having this form for CDec-POMDPs is the compatible value function (Sutton et al., 1999), which results in an unbiased policy gradient (details in appendix).\nProposition 1. 
Compatible value function for CDec-POMDPs can be factorized as:\n\nf_w(s_t, a_t) = \sum_{m} f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big)\n\nWe can directly replace Q^{\pi}(\cdot) in the policy gradient (4) by the approximate action-value function f_w. Empirically, we found that the variance using this estimator was high. We exploit the structure of f_w and show a further factorization of the policy gradient next, which works much better empirically.\nTheorem 2. For any value function having the decomposition:\n\nf_w(s_t, a_t) = \sum_{m} f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big), \quad (6)\n\nthe policy gradient can be computed as\n\n\nabla_{\theta} V_1(\pi) = \sum_{t=1}^{H} E_{s_t, a_t}\Big[ \sum_{m} \nabla_{\theta} \log \pi\big(a^m_t | s^m_t, o(s^m_t, n_{s_t})\big)\, f^m_w\big(s^m_t, o(s^m_t, n_{s_t}), a^m_t\big) \Big] \quad (7)\n\nThe above result shows that if the approximate value function is factored, then the resulting policy gradient also becomes factored. The result also applies to agents with multiple types, as we assumed the function f^m_w is different for each agent. In the simpler case when all the agents are of the same type, we have the same function f_w for each agent, and also deduce the following:\n\nf_w(s_t, a_t) = \sum_{i,j} n_t(i, j)\, f_w\big(i, j, o(i, n_{s_t})\big) \quad (8)\n\nUsing the above result, we simplify the policy gradient as:\n\n\nabla_{\theta} V_1(\pi) = \sum_{t} E_{s_t, a_t}\Big[ \sum_{i,j} n_t(i, j)\, \nabla_{\theta} \log \pi\big(j | i, o(i, n_{s_t})\big)\, f_w\big(i, j, o(i, n_{s_t})\big) \Big] \quad (9)\n\n3.2 Count-based Policy Gradient Computation\n\nNotice that in (9), the expectation is still w.r.t. joint-states and actions (s_t, a_t), which is not efficient for large population sizes. 
To address this issue, we exploit the insight that the approximate value function in (8) and the inner expression in (9) depend only on the counts generated by the joint-state and action (s_t, a_t).\nTheorem 3. For any value function having the form f_w(s_t, a_t) = \sum_{i,j} n_t(i, j)\, f_w\big(i, j, o(i, n_{s_t})\big), the policy gradient can be computed as:\n\n\nabla_{\theta} V_1(\pi) = E_{n_{1:H} \in \Omega_{1:H}}\Big[ \sum_{t=1}^{H} \sum_{i \in S, j \in A} n_t(i, j)\, \nabla_{\theta} \log \pi\big(j | i, o(i, n_t)\big)\, f_w\big(i, j, o(i, n_t)\big) \Big] \quad (10)\n\nThe above result shows that the policy gradient can be computed by sampling count table vectors n_{1:H} from the underlying distribution P(\cdot), analogous to computing the value function of the policy in (1), which is tractable even for large population sizes.\n\n4 Training Action-Value Function\n\nIn our approach, after count samples n_{1:H} are generated to compute the policy gradient, we also need to adjust the parameters w of our critic f_w. Notice that as per (8), the action-value function f_w(s_t, a_t) depends only on the counts generated by the joint-state and action (s_t, a_t). Training f_w can be done by taking a gradient step to minimize the following loss function:\n\n\min_w \sum_{\xi=1}^{K} \sum_{t=1}^{H} \big(f_w(n^{\xi}_t) - R^{\xi}_t\big)^2 \quad (11)\n\nwhere n^{\xi}_{1:H} is a count sample generated from the distribution P(n; \pi); f_w(n^{\xi}_t) is the action-value function and R^{\xi}_t is the total empirical return for time step t computed using (1):\n\nf_w(n^{\xi}_t) = \sum_{i \in S, j \in A} n^{\xi}_t(i, j)\, f_w\big(i, j, o(i, n^{\xi}_t)\big); \quad R^{\xi}_t = \sum_{T=t}^{H} \sum_{i \in S, j \in A} n^{\xi}_T(i, j)\, r_T\big(i, j, n^{\xi}_T\big) \quad (12)\n\nHowever, we found that the loss in (11) did not work well for training the critic f_w for larger problems. 
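As a concrete illustration of the global critic objective, the following minimal sketch (our own example; the count-sample encoding, reward, observation, and constant critic f_w are all hypothetical stand-ins for the paper's neural critic) evaluates the loss (11) using the definitions in (12):

```python
def f_w_total(counts_t, f_w, obs):
    """Global critic value f_w(n_t) = sum_{i,j} n_t(i,j) * f_w(i,j,o(i,n_t))."""
    return sum(n * f_w(i, j, obs(i, counts_t)) for (i, j), n in counts_t.items())

def empirical_return(counts, reward, t):
    """R_t = sum_{T=t}^{H} sum_{i,j} n_T(i,j) * r_T(i,j,n_T)."""
    return sum(n * reward(i, j, counts[T])
               for T in range(t, len(counts))
               for (i, j), n in counts[T].items())

def global_loss(count_samples, f_w, reward, obs):
    """Loss (11): sum over samples xi and steps t of (f_w(n^xi_t) - R^xi_t)^2."""
    return sum((f_w_total(sample[t], f_w, obs) - empirical_return(sample, reward, t)) ** 2
               for sample in count_samples
               for t in range(len(sample)))

# Tiny hypothetical example: one count sample, horizon 2.
# Each step is a dict {(state, action): count}.
sample = [{('z1', 'stay'): 2, ('z2', 'move'): 1},
          {('z1', 'stay'): 3}]
reward = lambda i, j, n_t: 1.0                 # hypothetical constant reward
obs = lambda i, n_t: n_t.get((i, 'stay'), 0)   # hypothetical count observation
f_w = lambda i, j, o: 2.0                      # hypothetical constant critic
loss = global_loss([sample], f_w, reward, obs)
```

In this toy case the t = 1 term is the only nonzero one, so the loss evaluates to 9.0; the point of the sketch is only the shape of (11)-(12), not the learned critic itself.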
Several count samples were required to reliably train f_w, which adversely affects scalability for large problems with many agents. It is already known in multiagent RL that algorithms relying solely on the global reward signal (e.g. R^{\xi}_t in our case) may require many more samples than approaches that take advantage of local reward signals (Bagnell and Ng, 2005). Motivated by this observation, we next develop a local reward signal based strategy to train the critic f_w.\nIndividual Value Function: Let n^{\xi}_{1:H} be a count sample. Given the count sample n^{\xi}_{1:H}, let V^{\xi}_t(i, j) = E\big[\sum_{t'=t}^{H} r^m_{t'} \,|\, s^m_t = i, a^m_t = j, n^{\xi}_{1:H}\big] denote the total expected reward obtained by an agent that is in state i and takes action j at time t. This individual value function can be computed using dynamic programming as shown in (Nguyen et al., 2017). Based on this value function, we next show an alternative reparameterization of the global empirical return R^{\xi}_t in (12):\nLemma 1. The empirical return R^{\xi}_t for the time step t given the count sample n^{\xi}_{1:H} can be reparameterized as: R^{\xi}_t = \sum_{i \in S, j \in A} n^{\xi}_t(i, j)\, V^{\xi}_t(i, j).\nIndividual Value Function Based Loss: Given lemma 1, we next derive an upper bound on the true loss (11) which effectively utilizes individual value functions:\n\n\sum_{\xi} \sum_{t} \big(f_w(n^{\xi}_t) - R^{\xi}_t\big)^2 = \sum_{\xi} \sum_{t} \Big(\sum_{i,j} n^{\xi}_t(i, j)\, f_w\big(i, j, o(i, n^{\xi}_t)\big) - \sum_{i,j} n^{\xi}_t(i, j)\, V^{\xi}_t(i, j)\Big)^2 \quad (13)\n\n= \sum_{\xi} \sum_{t} \Big(\sum_{i,j} n^{\xi}_t(i, j) \big(f_w(i, j, o(i, n^{\xi}_t)) - V^{\xi}_t(i, j)\big)\Big)^2 \leq M \sum_{\xi} \sum_{t,i,j} n^{\xi}_t(i, j) \big(f_w(i, j, o(i, n^{\xi}_t)) - V^{\xi}_t(i, j)\big)^2 \quad (14)\n\nwhere the last relation is derived using the Cauchy-Schwarz inequality. We train the critic using the modified loss function in (14). Empirically, we observed that for larger problems, this new loss function in (14) resulted in much faster convergence than the original loss function in (13). Intuitively, this is because the new loss (14) tries to adjust each critic component f_w(i, j, o(i, n^{\xi}_t)) closer to its counterpart empirical return V^{\xi}_t(i, j). However, in the original loss function (13), the focus is on minimizing the global loss, rather than adjusting each individual critic factor f_w(\cdot) towards the corresponding empirical return.\nAlgorithm 1 shows the outline of our AC approach for CDec-POMDPs. Lines 7 and 8 show two different options to train the critic. Line 7 represents the critic update based on local value functions, also referred to as the factored critic update (fC). Line 8 shows the update based on the global reward, or global critic update (C). 
Line 10 shows the policy gradient computed using theorem 2 (fA). Line 11 shows how the gradient is computed by directly using f_w from eq. (5) in eq. (4).\n\nAlgorithm 1: Actor-Critic RL for CDec-POMDPs\n1 Initialize network parameters \theta for the actor \pi and w for the critic f_w\n2 \alpha \leftarrow actor learning rate\n3 \beta \leftarrow critic learning rate\n4 repeat\n5   Sample count vectors n^{\xi}_{1:H} \sim P(n_{1:H}; \pi) \ \forall \xi = 1 to K\n6   Update critic as:\n7   fC: w = w - \beta \frac{1}{K} \nabla_w \Big[\sum_{\xi} \sum_{t,i,j} n^{\xi}_t(i, j) \big(f_w(i, j, o(i, n^{\xi}_t)) - V^{\xi}_t(i, j)\big)^2\Big]\n8   C: w = w - \beta \frac{1}{K} \nabla_w \Big[\sum_{\xi} \sum_{t} \big(\sum_{i,j} n^{\xi}_t(i, j) f_w(i, j, o(i, n^{\xi}_t)) - \sum_{i,j} n^{\xi}_t(i, j) V^{\xi}_t(i, j)\big)^2\Big]\n9   Update actor as:\n10  fA: \theta = \theta + \alpha \frac{1}{K} \nabla_\theta \Big[\sum_{\xi} \sum_{t} \sum_{i,j} n^{\xi}_t(i, j) \log \pi\big(j | i, o(i, n^{\xi}_t)\big) f_w(i, j, o(i, n^{\xi}_t))\Big]\n11  A: \theta = \theta + \alpha \frac{1}{K} \nabla_\theta \sum_{\xi} \sum_{t} \Big[\sum_{i,j} n^{\xi}_t(i, j) \log \pi\big(j | i, o(i, n^{\xi}_t)\big)\Big] \Big[\sum_{i,j} n^{\xi}_t(i, j) f_w(i, j, o(i, n^{\xi}_t))\Big]\n12 until convergence\n13 return \theta, w\n\n5 Experiments\n\nThis section compares the performance of our AC approach with two other approaches for solving CDec-POMDPs\u2014the Soft-Max based flow update (SMFU) (Varakantham et al., 2012), and the Expectation-Maximization (EM) approach (Nguyen et al., 2017). 
SMFU can only optimize policies where an agent's action depends just on its local state, \u03c0(a^m_t | s^m_t), as it approximates the effect of counts n by computing the single most likely count vector during the planning phase. The EM approach can optimize count-based piecewise linear policies where \u03c0_t(a^m_t, \cdot) is a piecewise function over the space of all possible count observations o_t.\nAlgorithm 1 shows two ways of updating the critic (lines 7, 8) and two ways of updating the actor (lines 10, 11), leading to 4 possible settings for our actor-critic approach\u2014fAfC, AC, AfC, fAC. We also investigate the properties of these different actor-critic approaches. The neural network structure and other experimental settings are provided in the appendix.\nFor fair comparisons with previous approaches, we use three different models for the count-based observation o_t. In the \u2018o0\u2019 setting, policies depend only on an agent's local state s^m_t and not on counts. In the \u2018o1\u2019 setting, policies depend on the local state s^m_t and the single count observation n_t(s^m_t). That is, the agent can only observe the count of other agents in its current state s^m_t. In the \u2018oN\u2019 setting, the agent observes its local state s^m_t and also the count of other agents from a local neighborhood (defined later) of the state s^m_t. The \u2018oN\u2019 observation model provides the most information to an agent. However, it is also much more difficult to optimize as policies have more parameters. SMFU only works with the \u2018o0\u2019 setting; EM and our actor-critic approach work for all the settings.\nTaxi Supply-Demand Matching: We test our approach on this real-world domain described in section 2, and introduced in (Varakantham et al., 2012). In this problem, the goal is to compute taxi policies for optimizing the total revenue of the fleet. The data contains GPS traces of taxi movement in a large Asian city over 1 year. 
We use the observed demand information extracted from this\ndataset. On an average, there are around 8000 taxis per day (data is not exhaustive over all taxi\noperators). The city is divided into 81 zones and the plan horizon is 48 half hour intervals over 24\nhours. For details about the environment dynamics, we refer to (Varakantham et al., 2012).\nFigure 2(a) shows the quality comparisons among different approaches with different observation\nmodels (\u2018o0\u2019, \u2018o1\u2019 and \u2018oN\u2019). We test with total number of taxis as 4000 and 8000 to see if taxi pop-\nulation size affects the relative performance of different approaches. The y-axis shows the average\nper day pro\ufb01t for the entire \ufb02eet. For the \u2018o0\u2019 case, all approaches (fAfC-\u2018o0\u2019, SMFU, EM-\u2018o0\u2019)\ngive similar quality with fAfC-\u2018o0\u2019 and EM-\u2018o0\u2019 performing slightly better than SMFU for the 8000\ntaxis. For the \u2018o1\u2019 case, there is sharp improvement in quality by fAfC-\u2018o1\u2019 over fAfC-\u2018o0\u2019 con-\n\ufb01rming that taking count based observation into account results in better policies. 
Our approach fAfC-\u2018o1\u2019 is also significantly better than the policies optimized by EM-\u2018o1\u2019 for both the 4000 and 8000 taxi settings.\n\n(a) Solution quality with varying taxi population\n\n(b) Solution quality in grid navigation problem\n\nFigure 2: Solution quality comparisons on the taxi problem and the grid navigation\n\n(a) AC convergence with \u2018o0\u2019\n\n(b) AC convergence with \u2018o1\u2019\n\n(c) AC convergence with \u2018oN\u2019\n\nFigure 3: Convergence of different actor-critic variants on the taxi problem with 8000 taxis\n\nTo further test the scalability and the ability of our approach to optimize complex policies in the \u2018oN\u2019 setting, we define the neighborhood of each state (which is a zone in the city) to be the set of its geographically connected zones based on the zonal decomposition shown in (Nguyen et al., 2017). On average, there are about 8 neighboring zones for a given zone, resulting in 9 count based observations available to the agent for taking decisions. Each agent observes both the taxi count and the demand information from such neighboring zones. In figure 2(a), the fAfC-\u2018oN\u2019 result clearly shows that taking multiple observations into account significantly increases solution quality\u2014fAfC-\u2018oN\u2019 provides an increase of 64% in quality over fAfC-\u2018o0\u2019 and 20% over fAfC-\u2018o1\u2019 for the 8000 taxi case. For EM-\u2018oN\u2019, we used a bare minimum of 2 pieces per observation dimension (resulting in 2^9 pieces per time step). We observed that EM was unable to converge within 30K iterations and provided even worse quality than EM-\u2018o1\u2019 at the end. These results show that despite the larger search space, our fAfC approach can effectively optimize complex policies whereas the tabular policy based EM approach was ineffective for this case.\nFigures 3(a-c) show the solution quality vs. 
iterations for the different variations of our actor-critic approach—fAfC, AC, AfC, fAC—for the ‘o0’, ‘o1’ and ‘oN’ observation models. These figures clearly show that using the factored actor and the factored critic update in fAfC is the most reliable strategy across all the other variations and all the observation models. Variations such as AC and fAC were not able to converge at all despite having exactly the same parameters as fAfC. These results validate the different strategies we have developed in this work to make vanilla AC converge faster for large problems.
Robot navigation in a congested environment: We also tested on a synthetic benchmark introduced in (Nguyen et al., 2017). The goal is for a population of 20 robots to move from a set of initial locations to a goal state in a 5x5 grid. If there is congestion on an edge, then each agent attempting to cross the edge has a higher chance of action failure. Similarly, agents also receive a negative reward if there is edge congestion. On successfully reaching the goal state, agents receive a positive reward and transition back to one of the initial states. We set the horizon to 100 steps.
Figure 2(b) shows the solution quality comparisons among the different approaches. In the ‘oN’ observation model, each agent observes the count information of its 4 immediate neighbor nodes. In this problem, SMFU performed worst; fAfC and EM both performed much better.
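The congestion dynamics of the grid benchmark can be sketched as a minimal simulator. The specific failure probabilities, congestion penalty, goal reward, and initial-state set below are illustrative assumptions, not the benchmark's actual parameters.

```python
import random

# Minimal sketch of the grid-navigation benchmark: congestion on an edge
# raises each crossing agent's failure chance and incurs a penalty;
# reaching the goal pays a reward and resets the agent to a start state.
# All numeric constants are assumed for illustration.

GOAL = (4, 4)                       # goal cell in the 5x5 grid
INITIAL_STATES = [(0, 0), (0, 4)]   # assumed set of initial locations

def step(positions, moves, rng=random):
    """Advance all robots one step; returns (new_positions, total_reward)."""
    # Count how many robots attempt each edge, to measure congestion.
    edge_counts = {}
    for pos, move in zip(positions, moves):
        edge_counts[(pos, move)] = edge_counts.get((pos, move), 0) + 1

    new_positions, reward = [], 0.0
    for pos, move in zip(positions, moves):
        congestion = edge_counts[(pos, move)] - 1  # other robots on this edge
        fail_prob = min(0.9, 0.1 * congestion)     # more congestion -> more failure
        reward -= 0.5 * congestion                 # congestion penalty (assumed)
        nxt = pos if rng.random() < fail_prob else move
        if nxt == GOAL:
            reward += 10.0                         # goal reward (assumed)
            nxt = rng.choice(INITIAL_STATES)       # transition back to a start state
        new_positions.append(nxt)
    return new_positions, reward
```

A lone robot crossing an uncongested edge to the goal succeeds with certainty under this sketch, collects the goal reward, and is reset to one of the initial states.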
As expected, fAfC-‘oN’ provides the best solution quality among all the approaches. In this domain, EM is competitive with fAfC because, for this relatively smaller problem with 25 agents, the space of counts is much smaller than in the taxi domain. Therefore, EM’s piecewise policy is able to provide a fine-grained approximation over the count range.

6 Summary

We addressed the problem of collective multiagent planning where the collective behavior of a population of agents affects the model dynamics. We developed a new actor-critic method for solving such collective planning problems within the CDec-POMDP framework. We derived several new results for CDec-POMDPs, such as the policy gradient derivation and the structure of the compatible value function. To overcome the slow convergence of the vanilla actor-critic method, we developed multiple techniques based on value function factorization and on training the critic using the individual value functions of agents. Using these techniques, our approach provided significantly better quality than previous approaches, and proved scalable and effective for optimizing policies in a real world taxi supply-demand problem and a synthetic grid navigation problem.

7 Acknowledgments

This research project is supported by the National Research Foundation Singapore under its Corp Lab @ University scheme and by Fujitsu Limited. The first author is also supported by an A*STAR graduate scholarship.

References

Aberdeen, D.
(2006). Policy-gradient methods for planning. In Advances in Neural Information Processing Systems, pages 9–16.

Amato, C., Konidaris, G., Cruz, G., Maynor, C. A., How, J. P., and Kaelbling, L. P. (2015). Planning for decentralized control of multiple robots under uncertainty. In IEEE International Conference on Robotics and Automation, ICRA, pages 1241–1248.

Bagnell, J. A. and Ng, A. Y. (2005). On local rewards and scaling distributed reinforcement learning. In International Conference on Neural Information Processing Systems, pages 91–98.

Becker, R., Zilberstein, S., and Lesser, V. (2004a). Decentralized Markov decision processes with event-driven interactions. In Proceedings of the 3rd International Conference on Autonomous Agents and Multiagent Systems, pages 302–309.

Becker, R., Zilberstein, S., Lesser, V., and Goldman, C. V. (2004b). Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22:423–455.

Bernstein, D. S., Givan, R., Immerman, N., and Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27:819–840.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145.

Guestrin, C., Lagoudakis, M., and Parr, R. (2002). Coordinated reinforcement learning. In ICML, volume 2, pages 227–234.

Konda, V. R. and Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166.

Kumar, A., Zilberstein, S., and Toussaint, M. (2011).
Scalable multiagent planning using probabilistic inference. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pages 2140–2146, Barcelona, Spain.

Kumar, A., Zilberstein, S., and Toussaint, M. (2015). Probabilistic inference techniques for scalable multiagent decision making. Journal of Artificial Intelligence Research, 53(1):223–270.

Leibo, J. Z., Zambaldi, V. F., Lanctot, M., Marecki, J., and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In International Conference on Autonomous Agents and Multiagent Systems.

Meyers, C. A. and Schulz, A. S. (2012). The complexity of congestion games. Networks, 59:252–260.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Nair, R., Varakantham, P., Tambe, M., and Yokoo, M. (2005). Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In AAAI Conference on Artificial Intelligence, pages 133–139.

Nguyen, D. T., Kumar, A., and Lau, H. C. (2017). Collective multiagent sequential decision making under uncertainty. In AAAI Conference on Artificial Intelligence, pages 3036–3043.

Pajarinen, J., Hottinen, A., and Peltonen, J. (2014). Optimizing spatial and temporal reuse in wireless networks by decentralized partially observable Markov decision processes. IEEE Trans.
on Mobile Computing, 13(4):866–879.

Peshkin, L., Kim, K.-E., Meuleau, N., and Kaelbling, L. P. (2000). Learning to cooperate via policy search. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 489–496. Morgan Kaufmann Publishers Inc.

Robbel, P., Oliehoek, F. A., and Kochenderfer, M. J. (2016). Exploiting anonymity in approximate linear programming: Scaling to large multiagent MDPs. In AAAI Conference on Artificial Intelligence, pages 2537–2543.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.

Sonu, E., Chen, Y., and Doshi, P. (2015). Individual planning in agent populations: Exploiting anonymity and frame-action hypergraphs. In International Conference on Automated Planning and Scheduling, pages 202–210.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In International Conference on Neural Information Processing Systems, pages 1057–1063.

Varakantham, P., Adulyasak, Y., and Jaillet, P. (2014). Decentralized stochastic planning with anonymity in interactions. In AAAI Conference on Artificial Intelligence, pages 2505–2511.

Varakantham, P. R., Cheng, S.-F., Gordon, G., and Ahmed, A. (2012). Decision support for agent populations in uncertain and congested environments. In AAAI Conference on Artificial Intelligence, pages 1471–1477.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

Winstein, K. and Balakrishnan, H. (2013). TCP ex machina: Computer-generated congestion control. In Proceedings of the ACM SIGCOMM 2013 Conference, SIGCOMM '13, pages 123–134.

Witwicki, S. J.
and Durfee, E. H. (2010). Influence-based policy abstraction for weakly-coupled Dec-POMDPs. In International Conference on Automated Planning and Scheduling, pages 185–192.