{"title": "Multi-Agent Filtering with Infinitely Nested Beliefs", "book": "Advances in Neural Information Processing Systems", "page_first": 1905, "page_last": 1912, "abstract": "In partially observable worlds with many agents, nested beliefs are formed when agents simultaneously reason about the unknown state of the world and the beliefs of the other agents. The multi-agent filtering problem is to efficiently represent and update these beliefs through time as the agents act in the world. In this paper, we formally define an infinite sequence of nested beliefs about the state of the world at the current time $t$ and present a filtering algorithm that maintains a finite representation which can be used to generate these beliefs. In some cases, this representation can be updated exactly in constant time; we also present a simple approximation scheme to compact beliefs if they become too complex. In experiments, we demonstrate efficient filtering in a range of multi-agent domains.", "full_text": "Multi-Agent Filtering with In\ufb01nitely Nested Beliefs\n\nLuke S. Zettlemoyer\n\nMIT CSAIL\n\nCambridge, MA 02139\nlsz@csai.mit.edu\n\nBrian Milch\u2217\nGoogle Inc.\n\nMountain View, CA 94043\nbrian@google.com\n\nAbstract\n\nLeslie Pack Kaelbling\n\nMIT CSAIL\n\nCambridge, MA 02139\nlpk@csail.mit.edu\n\nIn partially observable worlds with many agents, nested beliefs are formed when\nagents simultaneously reason about the unknown state of the world and the beliefs\nof the other agents. The multi-agent \ufb01ltering problem is to ef\ufb01ciently represent\nand update these beliefs through time as the agents act in the world. In this pa-\nper, we formally de\ufb01ne an in\ufb01nite sequence of nested beliefs about the state of\nthe world at the current time t, and present a \ufb01ltering algorithm that maintains a\n\ufb01nite representation which can be used to generate these beliefs. 
In some cases,\nthis representation can be updated exactly in constant time; we also present a sim-\nple approximation scheme to compact beliefs if they become too complex.\nIn\nexperiments, we demonstrate ef\ufb01cient \ufb01ltering in a range of multi-agent domains.\n\n1 Introduction\n\nThe existence of nested beliefs is one of the de\ufb01ning characteristics of a multi-agent world. As an\nagent acts, it often needs to reason about what other agents believe. For instance, a teacher must\nconsider what a student knows to decide how to explain important concepts. A poker agent must\nthink about what cards other players might have \u2014 and what cards they might think it has \u2014 in\norder to bet effectively. In this paper, we assume a cooperative setting where all the agents have\npredetermined, commonly-known policies expressed as functions of their beliefs; we focus on the\nproblem of ef\ufb01cient belief update, or \ufb01ltering.\nWe consider the nested \ufb01ltering problem in multi-agent, partially-observable worlds [6, 1, 9]. In\nthis setting, agents receive separate observations and independently execute actions, which jointly\nchange the hidden state of the world. Since each agent does not get to see the others\u2019 observations\nand actions, there is a natural notion of nested beliefs. Given its observations and actions, an agent\ncan reason not only about the state of the external world, but also about the other agents\u2019 observations\nand actions. It can also condition on what others might have seen and done to compute their beliefs\nat the next level of nesting. This pattern can be repeated to arbitrary depth.\nThe multi-agent \ufb01ltering problem is to ef\ufb01ciently represent and update these nested beliefs through\ntime. In general, an agent\u2019s beliefs depend on its entire history of actions and observations. 
One\napproach to computing these beliefs would be to remember the entire history, and perform inference\nto compute whatever probabilities are needed at each time step. But the time required for this\ncomputation would grow with the history length. Instead, we maintain a belief state that is suf\ufb01cient\nfor predicting future beliefs and can be approximated to achieve constant-time belief updates.\nWe begin by de\ufb01ning an in\ufb01nite sequence of nested beliefs about the current state st, and showing\nthat it is suf\ufb01cient for predicting future beliefs. We then present a multi-agent \ufb01ltering algorithm that\nmaintains a compact representation suf\ufb01cient for generating this sequence. Although in the worst\ncase this representation grows exponentially in the history length, we show that its size remains\nconstant for several interesting problems. We also describe an approximate algorithm that always\n\n\u2217This work was done while the second author was at MIT CSAIL.\n\n\fmaintains a constant representation size (and constant-time updates), possibly at the cost of accuracy.\nIn experiments, we demonstrate ef\ufb01cient and accurate \ufb01ltering in a range of multi-agent domains.\n\n2 Related Work\n\nIn existing research on partially observable stochastic games (POSGs) and Decentralized POMDPs\n(DEC-POMDPs) [6, 1, 9], policies are represented as direct mappings from observation histories to\nactions. That approach removes the need for the agents to perform any kind of \ufb01ltering, but requires\nthe speci\ufb01cation of some particular class of policies that return actions for arbitrarily long histories.\nIn contrast, many successful algorithms for single-agent POMDPs represent policies as functions on\nbelief states [7], which abstract over the speci\ufb01cs of particular observation histories. Gmytrasiewicz\nand Doshi [5] consider \ufb01ltering in interactive POMDPs. 
Their approach maintains \ufb01nitely nested\nbeliefs that are derived from a world model as well as hand-speci\ufb01ed models of how each agent\nreasons about the other agents. In this paper, all of the nested reasoning is derived from a single\nworld model, which eliminates the need for any agent-speci\ufb01c models.\nTo the best of our knowledge, our work is the \ufb01rst to focus on \ufb01ltering of in\ufb01nitely nested beliefs.\nThere has been signi\ufb01cant work on in\ufb01nitely nested beliefs in game theory, where Brandenburger\nand Dekel [2] introduced the notion of an in\ufb01nite sequence of \ufb01nitely nested beliefs. However,\nthey do not describe any method for computing these beliefs from a world model or updating them\nover time. Another long-standing line of related work is in the epistemic logic community. Fagin\nand Halpern [3] de\ufb01ne labeled graphs called probabilistic Kripke structures, and show how a graph\nwith \ufb01nitely many nodes can de\ufb01ne an in\ufb01nite sequence of nested beliefs. Building on this idea,\nalgorithms have been proposed for answering queries on probabilistic Kripke structures [10] and on\nin\ufb02uence diagrams that de\ufb01ne such structures [8]. However, these algorithms have not addressed\nthe fact that as agents interact with the world over time, the set of observation sequences they could\nhave received (and possibly the set of beliefs they could arrive at) grows exponentially.\n\n3 Nested Filtering\n\nIn this section, we describe the world model and de\ufb01ne the multi-agent \ufb01ltering problem. We then\npresent a detailed example where a simple problem leads to a complex pattern of nested reasoning.\n\n3.1 Partially observable worlds with many agents\n\nWe will perform \ufb01ltering given a multi-agent, decision-theoretic model for acting in a partially\nobservable world.1 Agents receive separate observations and independently execute actions, which\njointly change the state of the world. 
There is a finite set of states $S$, but the current state $s \in S$ cannot be observed directly by any of the agents. Each agent $j$ has a finite set of observations $O^j$ that it can receive and a finite set of actions $A^j$ that it can execute. Throughout this paper, we will use superscripts and vector notation to name agents and subscripts to indicate time. For example, $a^j_t \in A^j$ is the action for agent $j$ at time $t$; $\vec{a}_t = \langle a^1_t, \ldots, a^n_t \rangle$ is a vector with actions for each of the agents; and $a^j_{0:t} = (a^j_0, \ldots, a^j_t)$ is a sequence of actions for agent $j$ at time steps $0 \ldots t$.

The state dynamics is defined by a distribution $p_0(s)$ over initial states and a transition distribution $p(s_t|s_{t-1}, \vec{a}_{t-1})$ that is conditioned on the previous state $s_{t-1}$ and the action vector $\vec{a}_{t-1}$. For each agent $j$, observations are generated from a distribution $p(o^j_t|s_t, \vec{a}_{t-1})$ conditioned on the current state and the previous joint action. Each agent $j$ sees only its own actions and observations. To record this information, it is useful to define a history $h^j_{0:t} = (a^j_{0:t-1}, o^j_{1:t})$ for agent $j$ at time $t$. A policy is a distribution $\pi^j(a^j_t|h^j_{0:t})$ over the actions agent $j$ will take given this history. Together, these distributions define the joint world model:

$p(s_{0:t}, \vec{h}_{0:t}) = p_0(s_0) \prod_{i=0}^{t-1} \vec{\pi}(\vec{a}_i|\vec{h}_{0:i}) \, p(s_{i+1}|s_i, \vec{a}_i) \, p(\vec{o}_{i+1}|s_{i+1}, \vec{a}_i) \qquad (1)$

where $\vec{\pi}(\vec{a}_t|\vec{h}_{0:t}) = \prod_j \pi^j(a^j_t|h^j_{0:t})$ and $p(\vec{o}_{t+1}|s_{t+1}, \vec{a}_t) = \prod_j p(o^j_{t+1}|s_{t+1}, \vec{a}_t)$.

1This is the same type of world model that is used to define POSGs and DEC-POMDPs. Since we focus on filtering instead of planning, we do not need to define reward functions for the agents.

3.2 The nested filtering problem

In this section, we describe how to compute infinitely nested beliefs about the state at time $t$. 
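The generative model of Eq. (1) in Sec. 3.1 can be sketched as a forward sampler. This is an illustrative toy, not the paper's code: the dict-based distribution encoding and names such as `sample_trajectory` are our assumptions.

```python
import random

def draw(dist, rng):
    """Draw a sample from a {value: probability} distribution."""
    r, acc = rng.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # guard against floating-point rounding

def sample_trajectory(p0, trans, obs, policies, T, seed=0):
    """Forward-sample states, joint actions, and per-agent observations from
    the joint model of Eq. (1): each agent acts on its own private history,
    the actions jointly drive the state transition, and each observation
    depends on the new state and the previous joint action."""
    rng = random.Random(seed)
    s = draw(p0, rng)
    states, joint_actions = [s], []
    hists = [[] for _ in policies]  # each agent's private (action, observation) history
    for _ in range(T):
        a = tuple(draw(pi(h), rng) for pi, h in zip(policies, hists))
        s = draw(trans(s, a), rng)
        o = tuple(draw(obs(j, s, a), rng) for j in range(len(policies)))
        for j, h in enumerate(hists):
            h.append((a[j], o[j]))
        states.append(s)
        joint_actions.append(a)
    return states, joint_actions, hists
```

With deterministic toy distributions this reproduces the obvious trajectory; with stochastic ones it samples from the joint model.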
We then define a class of policies that are functions of these beliefs. Finally, we show that the current nested belief for an agent $i$ contains all of the information required to compute future beliefs. Throughout the rest of this paper, we use a minus notation to define tuples indexed by all but one agent. For example, $h^{-i}_{0:t}$ and $\pi^{-i}$ are tuples of histories and policies for all agents $k \neq i$.

We define infinitely nested beliefs by presenting an infinite sequence of finitely nested beliefs. For each agent $i$ and nesting level $n$, the belief function $B^{i,n} : h^i_{0:t} \to b^{i,n}_t$ maps the agent's history to its $n$th-level beliefs at time $t$. The agent's zeroth-level belief function $B^{i,0}(h^i_{0:t})$ returns the posterior distribution $b^{i,0}_t = p(s_t|h^i_{0:t})$ over states given the input history, which can be computed from Eq. 1:

$B^{i,0}(h^i_{0:t}) = p(s_t|h^i_{0:t}) \propto \sum_{s_{0:t-1}, h^{-i}_{0:t}} p(s_{0:t}, \vec{h}_{0:t}).$

Agent $i$'s first-level belief function $B^{i,1}(h^i_{0:t})$ returns a joint distribution on $s_t$ and the zeroth-level beliefs of all the other agents (what the other agents believe about the state of the world). We can compute the tuple of zeroth-level beliefs $b^{-i,0}_t$ for all agents $k \neq i$ by summing the probabilities of all histories $h^{-i}_{0:t}$ that lead to these beliefs (that is, such that $b^{-i,0}_t = B^{-i,0}(h^{-i}_{0:t})$):

$B^{i,1}(h^i_{0:t}) = p(s_t, b^{-i,0}_t|h^i_{0:t}) \propto \sum_{s_{0:t-1}, h^{-i}_{0:t}} p(s_{0:t}, \vec{h}_{0:t}) \, \delta(b^{-i,0}_t, B^{-i,0}(h^{-i}_{0:t})).$

The delta function $\delta(\cdot,\cdot)$ returns one when its arguments are equal and zero otherwise.

For level $n$, $B^{i,n}(h^i_{0:t})$ returns a distribution over states and level $n-1$ beliefs for the other agents. For example, at level 2, the function returns a joint distribution over: the state, what the other agents believe about the state, and what they believe others believe. 
Again, these beliefs are computed by summing over histories for the other agents that lead to the appropriate level $n-1$ beliefs:

$B^{i,n}(h^i_{0:t}) = p(s_t, b^{-i,n-1}_t|h^i_{0:t}) \propto \sum_{s_{0:t-1}, h^{-i}_{0:t}} p(s_{0:t}, \vec{h}_{0:t}) \, \delta(b^{-i,n-1}_t, B^{-i,n-1}(h^{-i}_{0:t})).$

Note that for all nesting levels $n$, $B^{i,n}(h^i_{0:t})$ is a discrete distribution. There are only finitely many beliefs each agent $k$ could hold at time $t$, each arising from one of the possible histories $h^k_{0:t}$.

Define $b^{i,*}_t = B^{i,*}(h^i_{0:t})$ to be the infinite sequence of nested beliefs generated by computing $B^{i,n}(h^i_{0:t})$ for $n = 0, 1, \ldots$. We can think of $b^{i,*}_t$ as a belief state for agent $i$, although not one that can be used directly by a filtering algorithm. We will assume that the policies $\pi^i$ are represented as functions of these belief states: that is, $\pi^i(a^i_t|b^{i,*}_t)$ can be thought of as a procedure that looks at arbitrary parts of the infinite sequence $b^{i,*}_t$ and returns a distribution over actions. We will see examples of this type of policy in the next section. Under this assumption, $b^{i,*}_t$ is a sufficient statistic for predicting future beliefs in the following sense:

Proposition 1 In a model with policies $\pi^j(a^j_t|b^{j,*}_t)$ for each agent $j$, there exists a belief estimation function $BE$ s.t. $\forall a^i_{0:t}, o^i_{1:t+1}$: $B^{i,*}(a^i_{0:t}, o^i_{1:t+1}) = BE(B^{i,*}(a^i_{0:t-1}, o^i_{1:t}), a^i_t, o^i_{t+1})$.

To prove this result, we need to demonstrate a procedure that correctly computes the new belief given only the old belief and the new action and observation. The filtering algorithm we will present in Sec. 
4 achieves this goal by representing the nested belief with a finite structure that can be used to generate the infinite sequence, and showing how these structures are updated over time.

3.3 Extended Example: The Tiger Communication World

We now describe a simple two-agent "tiger world" where the optimal policies require the agents to coordinate their actions. In this world there are two doors: behind one randomly chosen door is a hungry tiger, and behind the other is a pile of gold. Each agent has unique abilities. Agent $l$ (the tiger listener) can hear the tiger roar, which is a noisy indication of its current location, but cannot open the doors. Agent $d$ (the door opener) can open doors but cannot hear the roars. To facilitate communication, agent $l$ has two actions, signal left and signal right, which each produce a unique observation for agent $d$. When a door is opened, the world resets and the tiger is placed behind a randomly chosen door. To act optimally, agent $l$ must listen to the tiger's roars until it is confident about the tiger's location and then send the appropriate signal to agent $d$. Agent $d$ must wait for this signal and then open the appropriate door.

$b^{l,*}$              | $a^l$ | $\pi^l(a^l|b^{l,*})$
$b^{l,0}(TL) > 0.8$    | SL    | 1.0
$b^{l,0}(TR) > 0.8$    | SR    | 1.0
otherwise              | L     | 1.0

$b^{d,*}$              | $a^d$ | $\pi^d(a^d|b^{d,*})$
$b^{d,0}(TL) > 0.8$    | OR    | 1.0
$b^{d,0}(TR) > 0.8$    | OL    | 1.0
otherwise              | L     | 1.0

Figure 1: Deterministic policies for the tiger world that depend on each agent's beliefs about the physical state, where the tiger can be on the left (TL) or the right (TR). The tiger listener, agent $l$, will signal left (SL) or right (SR) if it is confident of the tiger's location. The door opener, agent $d$, will open the appropriate door when it is confident about the tiger's location. Otherwise both agents listen (to the tiger or for a signal).

Fig. 1 shows a pair of policies that achieve this desired interaction and depend only on each agent's level-zero beliefs about the state of the world. However, as we will see, the agents cannot maintain their level-zero beliefs in isolation. To correctly update these beliefs, each agent must reason about the unseen actions and observations of the other agent.

Consider the beliefs that each agent must maintain to execute its policies during a typical scenario. Assume the tiger starts behind the left door. Initially, both agents have uniform beliefs about the location of the tiger. As agent $d$ waits for a signal, it does not gain any information about the tiger's location. However, it maintains a representation of the possible beliefs for agent $l$ and knows that $l$ is receiving observations that correlate with the state of the tiger. In this case, the most likely outcome is that agent $l$ will hear enough roars on the left to do a "signal left" action. This action produces an observation for agent $d$ which allows it to gain information about $l$'s beliefs. Because agent $d$ has maintained the correspondence between the true state and agent $l$'s beliefs, it can now infer that the tiger is more likely to be on the left (it is unlikely that $l$ could have come to believe the tiger was on the left if that were not true). This inference makes agent $d$ confident enough about the tiger's location to open the right door and reset the world. Agent $l$ must also represent agent $d$'s beliefs, because it never receives any observations that indicate what actions agent $d$ is taking. 
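For concreteness, the deterministic policies of Fig. 1 can be written directly as functions of each agent's zeroth-level belief, here a dict over the states TL and TR. This is our illustrative rendering; the function names are not from the paper.

```python
def listener_policy(b0):
    """Tiger listener (Fig. 1): signal left/right once confident, else listen."""
    if b0.get('TL', 0.0) > 0.8:
        return 'SL'   # signal left
    if b0.get('TR', 0.0) > 0.8:
        return 'SR'   # signal right
    return 'L'        # listen for roars

def opener_policy(b0):
    """Door opener (Fig. 1): open the door opposite the tiger once confident."""
    if b0.get('TL', 0.0) > 0.8:
        return 'OR'   # tiger on the left -> open the right door
    if b0.get('TR', 0.0) > 0.8:
        return 'OL'   # tiger on the right -> open the left door
    return 'L'        # listen for a signal
```

Both policies consult only the level-zero entry of the infinite belief sequence, which is exactly what makes the example deceptively simple: the level-zero beliefs themselves cannot be updated without deeper nesting.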
It must track agent $d$'s belief updates to know that $d$ will wait for a signal and then immediately open a door. Without this information, $l$ cannot predict when the world will be reset, and thus when it should disregard past observations about the location of the tiger.

Even in this simple tiger world, we see a complicated reasoning pattern: the agents must track each other's beliefs. To update its belief about the external world, each agent must infer what actions the other agent has taken, which requires maintaining that agent's beliefs about the world. Moreover, updating the other agent's beliefs requires maintaining what it believes you believe. Continuing this reasoning to deeper levels leads to the infinitely nested beliefs defined in Sec. 3.2. However, we will never explicitly construct these infinite beliefs. Instead, we maintain a finite structure that is sufficient to recreate them to arbitrary depth, and only expand as necessary to compute action probabilities.

4 Efficient Filtering

In this section, we present an algorithm for performing belief updates $b^{i,*}_t = BE(b^{i,*}_{t-1}, a^i_{t-1}, o^i_t)$ on nested beliefs. This algorithm is applicable in the cooperative setting where there are commonly known policies $\pi^j(a^j_t|b^{j,*}_t)$ for each agent $j$. The approach, which we call the SDS filter, maintains a set of Sparse Distributions over Sequences of past states, actions, and observations.

Sequence distributions. The SDS filter deals with two kinds of sequences: histories $h^j_{0:t} = (a^j_{0:t-1}, o^j_{1:t})$ and trajectories $x_{0:t} = (s_{0:t}, \vec{a}_{0:t-1})$. A history represents what agent $j$ knows before acting at time $t$; a trajectory is a trace of the states and joint actions through time $t$. 
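The two sequence types can be encoded as plain tuples; this is a hypothetical encoding of ours, not the paper's data structure.

```python
from collections import namedtuple

# A history h^j_{0:t} = (a^j_{0:t-1}, o^j_{1:t}): what agent j has done and seen.
History = namedtuple('History', ['actions', 'observations'])

# A trajectory x_{0:t} = (s_{0:t}, a_{0:t-1}): states and joint actions through time t.
Trajectory = namedtuple('Trajectory', ['states', 'joint_actions'])

# Example: after one step of the tiger world, the listener has listened once
# and heard a roar on the left, while the world stayed in state TL.
h = History(actions=('listen',), observations=('roar_left',))
x = Trajectory(states=('TL', 'TL'), joint_actions=(('listen', 'listen'),))
```

Note that actions appear in both structures; a history and a trajectory are mutually consistent only when they agree on the agent's actions, which is exactly the consistency condition used by the sequence distributions below.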
The filter for agent $i$ maintains the following sequence sets: a set $X$ of trajectories that might have occurred so far, and for each agent $j$ (including $i$ itself), a set $H^j$ of possible histories. One of the elements of $H^i$ is marked as being the history that $i$ has actually experienced. The SDS filter maintains belief information in the form of sequence distributions $\alpha^j(x_{0:t}|h^j_{0:t}) = p(x_{0:t}|h^j_{0:t})$ and $\beta^j(h^j_{0:t}|x_{0:t}) = p(h^j_{0:t}|x_{0:t})$ for all agents $j$, histories $h^j_{0:t} \in H^j$, and trajectories $x_{0:t} \in X$.2 The $\alpha^j$ distributions represent what agent $j$ would believe about the possible sequences of states and other agents' actions given $h^j_{0:t}$. The $\beta^j$ distributions represent the probability of $j$ receiving the observations in $h^j_{0:t}$ if the trajectory $x_{0:t}$ had actually happened.

2Actions are included in both histories and trajectories; when $x_{0:t}$ and $h^j_{0:t}$ specify different actions, both $\alpha^j(x_{0:t}|h^j_{0:t})$ and $\beta^j(h^j_{0:t}|x_{0:t})$ are zero.

The insight behind the SDS filter is that these sequence distributions can be used to compute the nested belief functions $B^{i,n}(h^i_{0:t})$ from Sec. 3.2 to arbitrary depth. The main challenge is that the sets of possible histories and trajectories grow exponentially with the time $t$. To avoid this blow-up, the SDS filter does not maintain the complete set of possible sequences. We will see that some sequences can be discarded without affecting the results of the belief computations. If this pruning is insufficient, the SDS filter can drop low-probability sequences and perform approximate filtering. A second challenge is that if we represent each sequence explicitly, the space required grows linearly with $t$. However, the belief computations do not require the details of each trajectory and history. 
To compute beliefs about current and future states, it suffices to maintain the sequence distributions $\alpha^j$ and $\beta^j$ defined above, along with the final state $s_t$ in each trajectory. The SDS filter maintains only this information.3 For clarity, we will continue to use full sequence notation in the paper.

In the rest of this section, we first show how the sequence distributions can be used to compute nested beliefs of arbitrary depth. Then, we show how to maintain the sequence distributions. Finally, we present an algorithm that computes these distributions while maintaining small sequence sets.

The nested beliefs from Sec. 3.2 can be written in terms of the sequence distributions as follows:

$B^{j,0}(h^j_{0:t})(s) = \sum_{x_{0:t} \in X : x_t = s} \alpha^j(x_{0:t}|h^j_{0:t}) \qquad (2)$

$B^{j,n}(h^j_{0:t})(s, b^{-j,n-1}) = \sum_{x_{0:t} \in X : x_t = s} \alpha^j(x_{0:t}|h^j_{0:t}) \prod_{k \neq j} \sum_{h^k_{0:t} \in H^k} \beta^k(h^k_{0:t}|x_{0:t}) \, \delta(b^{k,n-1}, B^{k,n-1}(h^k_{0:t})) \qquad (3)$

At level zero, we sum the probabilities according to agent $j$ of all trajectories with the correct final state. At level $n$, we perform the same outer sum, but for each trajectory we sum the probabilities of the histories for agents $k \neq j$ that would lead to the beliefs we are interested in. Thus, the sequence distributions at time $t$ are sufficient for computing any desired element of the infinite belief sequence $B^{j,*}(h^j_{0:t})$ for any agent $j$ and history $h^j_{0:t}$.

Updating the distributions. 
The sequence distributions are updated at each time step $t$ as follows. For each agent $j$, trajectory $x_{0:t} = (s_{0:t}, \vec{a}_{0:t-1})$ and history $h^j_{0:t} = (a^j_{0:t-1}, o^j_{1:t})$:

$\beta^j(h^j_{0:t}|x_{0:t}) = \beta^j(h^j_{0:t-1}|x_{0:t-1}) \, p(o^j_t|s_t, \vec{a}_{t-1}) \qquad (4)$

$\alpha^j(x_{0:t}|h^j_{0:t}) \propto \alpha^j(x_{0:t-1}|h^j_{0:t-1}) \, p(\vec{a}_{t-1}|x_{0:t-1}) \, p(s_t|s_{t-1}, \vec{a}_{t-1}) \, p(o^j_t|s_t, \vec{a}_{t-1}) \qquad (5)$

The values of $\beta^j$ on length-$t$ histories are computed from existing $\beta^j$ values by multiplying in the probability of the most recent observation. To extend $\alpha^j$ to length-$t$ trajectories, we multiply in the probability of the agents' actions given the past trajectory, the probability of the state transition, and the likelihood of the new observation, where the action probabilities are:

$p(\vec{a}_{t-1}|x_{0:t-1}) = \prod_k \sum_{h^k_{0:t-1}} \beta^k(h^k_{0:t-1}|x_{0:t-1}) \, \pi^k(a^k_{t-1}|B^{k,*}(h^k_{0:t-1})) \qquad (6)$

Here, to predict the actions for agent $k$, we take an expectation over its possible histories $h^k_{0:t-1}$ (according to the $\beta^k$ distribution from the previous time step) of the probability of each action $a^k_{t-1}$ given the beliefs $B^{k,*}(h^k_{0:t-1})$ induced by the history. In practice, only some of the entries in $B^{k,*}(h^k_{0:t-1})$ will be needed to compute $k$'s action; for example, in the tiger world, the policies are functions of the zero-level beliefs. The necessary entries are computed from the previous $\alpha$ and $\beta$ distributions as described in Eqs. 2 and 3. This computation is not prohibitive because, as we will see later, we only consider a small subset of the possible histories.

Returning to the example tiger world, we can see that maintaining these sequence distributions will allow us to achieve the desired interactions described in Sec. 3.3. For example, when the door opener receives a "signal left" observation, it will infer that the tiger is on the left because it has done the reasoning in Eq. 
6 and determined that, with high probability, the trajectories that would have led the tiger listener to take this action are the ones where the tiger is actually on the left.

3This data structure is closely related to probabilistic Kripke structures [3], which are known to be sufficient for recreating nested beliefs. We are not aware of previous work that guarantees compactness through time.

Initialization. Input: Distribution $p(s)$ over states.

1. Initialize trajectories and histories: $X = \{((s), ()) \mid s \in S\}$, $H^j = \{((), ())\}$.
2. Initialize distributions: $\forall x = ((s), ()) \in X$, $j$, $h^j \in H^j$: $\alpha^j(x|h^j) = p(s)$ and $\beta^j(h^j|x) = 1$.

Filtering. Input: Action $a^i_{t-1}$ and observation $o^i_t$.

1. Compute new sequence sets $X$ and $H^j$, for all agents $j$, by adding all possible states, actions, and observations to sequences in the previous sets. Compute new sequence distributions $\alpha^j$ and $\beta^j$, for all agents $j$, as described in Eqs. 5, 4, and 6. Mark the observed history $h^i_{0:t} \in H^i$.
2. Merge and drop sequences:
(a) Drop trajectories and histories that are commonly known to be impossible:
  • $\forall x_{0:t} \in X$ s.t. $\forall j, h^j_{0:t} \in H^j$: $\alpha^j(x_{0:t}|h^j_{0:t}) = 0$: Set $X = X \setminus \{x_{0:t}\}$.
  • $\forall j, h^j_{0:t} \in H^j$ s.t. $\forall x_{0:t} \in X$: $\beta^j(h^j_{0:t}|x_{0:t}) = 0$: Set $H^j = H^j \setminus \{h^j_{0:t}\}$.
(b) Merge histories that lead to the same beliefs:
  • $\forall j, h^j_{0:t}, h'^j_{0:t} \in H^j$ s.t. $\forall x_{0:t} \in X$: $\alpha^j(x_{0:t}|h^j_{0:t}) = \alpha^j(x_{0:t}|h'^j_{0:t})$:
    Set $H^j = H^j \setminus \{h'^j_{0:t}\}$ and $\beta^j(h^j_{0:t}|x_{0:t}) = \beta^j(h^j_{0:t}|x_{0:t}) + \beta^j(h'^j_{0:t}|x_{0:t})$ for all $x_{0:t}$.
(c) Reset when the marginal of $s_t$ is common knowledge:
  • If $\forall j, k$, $h^j_{0:t} \in H^j$, $h^k_{0:t} \in H^k$, $s_t$: $\alpha^j(s_t|h^j_{0:t}) = \alpha^k(s_t|h^k_{0:t})$:
    Reinitialize the filter using the distribution $\alpha^j(s_t|h^j_{0:t})$ instead of the prior $p_0(s)$.

3. 
Prune: For all $\alpha^j$ or $\beta^j$ with $m \geq N$ non-zero entries: remove the $m - N$ lowest-probability sequences and renormalize.

Figure 2: The SDS filter for agent $i$. At all times $t$, the filter maintains sequence sets $X$ and $H^j$, for all agents $j$, along with the sequence distributions $\alpha^j$ and $\beta^j$ for all agents $j$. Agent $i$'s actual observed history is marked as a distinguished element $h^i_{0:t} \in H^i$ and used to compute its beliefs $B^{i,*}(h^i_{0:t})$.

Filtering algorithm. We now consider the challenge of maintaining small sequence sets. Fig. 2 provides a detailed description of the SDS filtering algorithm for agent $i$. The filter is initialized with empty histories for each agent and trajectories with single states that are distributed according to the prior. At each time $t$, Step 1 extends the sequence sets, computes the sequence distributions, and records agent $i$'s history. Running a filter with only this step would generate all possible sequences. Step 2 introduces three operations that reduce the size of the sequence sets while guaranteeing that Eqs. 2 and 3 still produce the correct nested beliefs at time $t$. Step 2(a) removes trajectories and histories when all the agents agree that they are impossible; there is no reason to track them. For example, in the tiger communication world, the policies are such that for the first few time steps each agent will always listen (to the tiger or for signals). During this period all the trajectories where other actions are taken are known to be impossible and can be ignored. Step 2(b) merges histories for an agent $j$ that lead to the same beliefs. This is achieved by arbitrarily selecting one history to be deleted and adding its $\beta^j$ probability to the other's $\beta^j$. 
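The core per-step operations of the SDS filter (the belief readout of Eq. 2, the β update of Eq. 4, the action prediction of Eq. 6, the history merge of Step 2(b), and the pruning of Step 3) can be sketched with dict-based distributions. This is our simplified rendering under assumed encodings, not the authors' implementation.

```python
def level0_belief(alpha, final_state):
    """Eq. 2: B^{j,0}(h)(s) sums alpha^j(x|h) over trajectories x ending in s."""
    b = {}
    for x, p in alpha.items():
        s = final_state(x)
        b[s] = b.get(s, 0.0) + p
    return b

def update_beta(beta_prev, p_obs):
    """Eq. 4: multiply in the likelihood of the newest observation."""
    return beta_prev * p_obs

def joint_action_prob(a_vec, betas, policies):
    """Eq. 6: for each agent k, an expectation over its possible histories
    (weighted by beta^k) of the probability its policy assigns to a_k."""
    prob = 1.0
    for k, a_k in enumerate(a_vec):
        prob *= sum(w * policies[k](h).get(a_k, 0.0) for h, w in betas[k].items())
    return prob

def merge_histories(histories, alpha, beta):
    """Step 2(b): if two histories induce identical alpha columns (and hence
    the same beliefs), delete one and fold its beta mass into the other."""
    kept = []
    for h in histories:
        match = next((g for g in kept if alpha[g] == alpha[h]), None)
        if match is None:
            kept.append(h)
        else:
            for x, w in beta[h].items():
                beta[match][x] = beta[match].get(x, 0.0) + w
    return kept

def prune(dist, N):
    """Step 3: if a distribution has m >= N non-zero entries, keep only the
    N highest-probability ones and renormalize."""
    items = {k: v for k, v in dist.items() if v > 0.0}
    if len(items) >= N:
        keep = sorted(items, key=items.get, reverse=True)[:N]
        items = {k: items[k] for k in keep}
    z = sum(items.values())
    return {k: v / z for k, v in items.items()}
```

Merging is exact because Eqs. 2 and 3 only ever consult histories through the beliefs they induce; pruning is the one lossy operation.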
For example, as the tiger listener hears roars,\nany two observation sequences with the same numbers of roars on the left and right provide the same\ninformation about the tiger and can be merged. Step 2(c) resets the \ufb01lter if the marginal over states\nat time t has become commonly known to all the agents. For example, when both agents know that a\ndoor has been opened, this implies that the world has reset and all previous trajectories and histories\ncan be discarded. This type of agreement is not limited to cases where the state of the world is reset.\nIt occurs with any distribution over states that the agents agree on, for example when they localize\nand both know the true state, even if they disagree about the trajectory of past states.\nTogether, these three operators can signi\ufb01cantly reduce the size of the sequence sets. We will see\nin the experiments (Sec. 5) that they enable the SDS \ufb01lter to exactly track the tiger communication\nworld extremely ef\ufb01ciently. However, in general, there is no guarantee that these operators will be\nenough to maintain small sets of trajectories and histories. Step 3 introduces an approximation by\nremoving low-probability sequences and normalizing the belief distributions. This does guarantee\nthat we will maintain small sequence sets, possibly at the cost of accuracy. In many domains we can\nignore unlikely histories and trajectories without signi\ufb01cantly changing the current beliefs.\n\n5 Evaluation\n\nIn this section, we describe the performance of the SDS algorithm on three nested \ufb01ltering problems.\n\n\f(a) Tiger world: time.\n\n(b) Box pushing: time.\n\n(c) Box pushing: error.\n\nFigure 3: Time per \ufb01ltering step, and error, for the SDS algorithm on two domains.\n\nTiger Communication World. The tiger communication world was described in detail in Sec. 3.3.\nFig. 3(a) shows the average computation time used for \ufb01ltering at each time step. 
The full algorithm\n(SDS) maintains a compact, exact representation without any pruning and takes only a fraction of a\nsecond to do each update. The graph also shows the results of disabling different parts of Step 2(a-c)\nof the algorithm (for example, SDS -a,-b,-c does not do any simpli\ufb01cations from Step 2). Without\nthese steps, the algorithm runs in exponential time. Each simpli\ufb01cation allows the algorithm to\nperform better, but all are required for constant-time performance. Since the SDS \ufb01lter runs without\nthe pruning in Step 3, we know that it computes the correct beliefs; there is no approximation error.4\nBox Pushing. The DEC-POMDP literature includes several multi-agent domains; we evaluate\nSDS on the largest of them, known as the box-pushing domain [9]. In this scenario, two agents\ninteract in a 3x4 grid world where they must coordinate their actions to move a large box and then\nindependently push two small boxes. The state encodes the positions and orientations of the robots,\nas well as the locations of the three boxes. The agents can move forward, rotate left and right, or\nstay still. These actions fail with probability 0.1, leaving the state unchanged. Each agent receives\ndeterministic observations about what is in the location in front of it (empty space, a robot, etc.).\nWe implemented policies for each agent that consist of a set of 20 rules specifying actions given its\nzeroth-level beliefs about the world state. While executing their policies, the agents \ufb01rst coordinate\nto move the large box and then independently move the two small boxes. The policies are such that,\nwith high probability, the agents will always move the boxes. There is uncertainty about when this\nwill happen, since actions can fail. We observed, in practice, that it rarely took more than 20 steps.\nFig. 3(b) shows the running time of the SDS \ufb01lter on this domain, with various pruning parameters\n(N = 10, 50, 100,\u221e in Step 3). 
Without pruning (N = ∞), the costs are too high for the filter to move beyond time step five. With pruning, however, the cost remains reasonable. Fig. 3(c) shows the error incurred with various degrees of pruning, in terms of the difference between the estimated zeroth-level beliefs for the agents and the true posterior over physical states given their observations.⁵ Note that in order to accurately maintain each agent's beliefs about the physical state (which includes the position of the other robot), the filter must assign accurate probabilities to unobserved actions by the other agent, which depend on its beliefs. This is the same reasoning pattern we saw in the tiger world, where we are required to maintain infinitely nested beliefs. As expected, we see that more pruning leads to faster running time but decreased accuracy. We also find that the problem is most challenging around time step ten and becomes easier in the limit, as the world moves towards the absorbing state where both agents have finished their tasks. With N = 100, we get high-quality estimates in an acceptable amount of time.

⁴The exact version of SDS also runs in constant time on the broadcast channel domain of Hansen et al. [6].

⁵Because the box-pushing problem is too large for beliefs to be computed exactly, we compare the filter's performance to empirical distributions obtained by generating 10,000 sequences of trajectories and histories. We group the runs by the history $h^i_{0:t}$; for all histories that appear at least ten times, we compare the empirical distribution $\hat{b}_t$ of states occurring after that history to the filter's computed beliefs $\tilde{b}^{i,0}_t$, using the variational distance $VD(\hat{b}_t, \tilde{b}^{i,0}_t) = \sum_s |\hat{b}_t(s) - \tilde{b}^{i,0}_t(s)|$.

Noisy Muddy Children. The muddy children problem is a classic puzzle often discussed by researchers in epistemic logic [4]. There are n agents and $2^n$ possible states. Each agent's forehead can be either muddy or clean, but it does not get any direct observations about this fact. Initially, it is commonly known that at least one agent has a muddy forehead. As time progresses, the agents follow a policy of raising their hand if they know that their forehead is muddy; they must come to this conclusion given only observations about the cleanliness of the other agents' foreheads and who has raised their hands (this yields $2^{2n}$ possible observations for each agent). This puzzle is represented in our framework as follows. The initial knowledge is encoded with a prior that is uniform over all states in which at least one agent is muddy. The state of the world never changes. Observations about the muddiness of the other agents are only correct with probability ν, and each agent raises its hand if it assigns probability at least 0.8 to being muddy.

When there is no noise (ν = 1.0), the agents behave as follows. With m ≤ n muddy agents, everyone waits m time steps and then all of the muddy agents simultaneously raise their hands.⁶ The SDS filter exhibits exactly this behavior and runs in reasonable time, using only a few seconds per filtering step, for problem instances with up to 10 agents without pruning. We also ran the filter on instances with noise (ν = 0.9) and up to 5 agents. This required pruning histories to cope with the extremely large number of possible but unlikely observation sequences.
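The error measure from footnote 5 is simple to compute once beliefs are represented explicitly. The sketch below is our own illustration (dictionaries from states to probabilities, with the example numbers invented): it computes the variational distance between an empirical belief and a filtered one.

```python
def variational_distance(b_hat, b_tilde):
    """VD(b_hat, b_tilde) = sum over states s of |b_hat(s) - b_tilde(s)|.

    Each belief is a dict mapping a state to its probability; a state
    missing from one dict is treated as having probability zero.
    """
    states = set(b_hat) | set(b_tilde)
    return sum(abs(b_hat.get(s, 0.0) - b_tilde.get(s, 0.0)) for s in states)

empirical = {'s1': 0.5, 's2': 0.5}            # states seen in sampled runs
filtered = {'s1': 0.6, 's2': 0.3, 's3': 0.1}  # the filter's estimate
error = variational_distance(empirical, filtered)
# |0.5 - 0.6| + |0.5 - 0.3| + |0.0 - 0.1| = 0.4
```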
The observed behavior is similar to the deterministic case: eventually, all of the m muddy agents raise their hands. In expectation, this happens at a time step greater than m, since the agents must receive multiple observations before they are confident about each other's cleanliness. If one agent raises its hand before the others, this provides more information to the uncertain agents, who usually raise their hands soon after.

6 Conclusions

We have considered the problem of efficient belief update in multi-agent scenarios. We introduced the SDS algorithm, which maintains a finite belief representation that can be used to compute an infinite sequence of nested beliefs about the physical world and the beliefs of other agents. We demonstrated that on some problems, SDS can maintain this representation exactly in constant time per filtering step. On more difficult examples, SDS maintains constant-time filtering by pruning low-probability trajectories, yielding acceptable levels of approximation error.

These results show that efficient filtering is possible in multi-agent scenarios where the agents' policies are expressed as functions of their beliefs, rather than their entire observation histories. These belief-based policies are independent of the current time step, and have the potential to be more compact than history-based policies. In the single-agent setting, many successful POMDP planning algorithms construct belief-based policies; we plan to investigate how to do similar belief-based planning in the multi-agent case.

References

[1] D. S. Bernstein, E. Hansen, and S. Zilberstein. Bounded policy iteration for decentralized POMDPs. In Proc. of the 19th International Joint Conference on Artificial Intelligence (IJCAI), 2005.

[2] A. Brandenburger and E. Dekel. Hierarchies of beliefs and common knowledge. Journal of Economic Theory, 59:189–198, 1993.

[3] R. Fagin and J. Y.
Halpern. Reasoning about knowledge and probability. Journal of the ACM, 41(2):340–367, 1994.

[4] R. Fagin, J. Y. Halpern, Y. Moses, and M. Y. Vardi. Reasoning About Knowledge. The MIT Press, 1995.

[5] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.

[6] E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. In Proc. of the 19th National Conference on Artificial Intelligence (AAAI), 2004.

[7] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[8] B. Milch and D. Koller. Probabilistic models for agents' beliefs and decisions. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence (UAI), 2000.

[9] S. Seuken and S. Zilberstein. Improved memory-bounded dynamic programming for decentralized POMDPs. In Proc. of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.

[10] A. Shirazi and E. Amir. Probabilistic modal logic. In Proc. of the 22nd National Conference on Artificial Intelligence (AAAI), 2007.

⁶This behavior can be verified by induction. If there is one muddy agent, it will see that the others are clean and raise its hand immediately. This implies that if no one raises their hand in the first round, there must be at least two muddy agents. At time two, they will both see only one other muddy agent and infer that they are muddy. The pattern follows for larger m.
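Footnote 6's induction for the noiseless case can also be checked by direct simulation. The sketch below is our own and uses a standard common-knowledge construction from epistemic logic (explicit world sets with public elimination), not the SDS filter; it reproduces the behavior that all m muddy agents raise their hands at step m.

```python
from itertools import product

def simulate_muddy_children(true_state):
    """Noiseless muddy children (nu = 1.0).

    `true_state` is a tuple of 0/1 foreheads (1 = muddy) with at least
    one 1. Returns a dict mapping each muddy agent to the (1-indexed)
    round at which it raises its hand.
    """
    n = len(true_state)
    # Common knowledge: at least one forehead is muddy.
    worlds = {w for w in product((0, 1), repeat=n) if any(w)}

    def knows_muddy(agent, world, candidates):
        # The agent sees every other forehead exactly, so its possible
        # worlds are the candidates agreeing with what it observes; it
        # knows it is muddy iff its own bit is 1 in all of them.
        possible = [w for w in candidates
                    if all(w[j] == world[j] for j in range(n) if j != agent)]
        return all(w[agent] == 1 for w in possible)

    raised = {}
    for t in range(1, n + 1):
        for i in range(n):
            if i not in raised and knows_muddy(i, true_state, worlds):
                raised[i] = t
        if len(raised) == sum(true_state):
            return raised
        # No one raised a hand this round, which is public: eliminate
        # every world in which some agent would already have known.
        worlds = {w for w in worlds
                  if not any(knows_muddy(i, w, worlds) for i in range(n))}
    return raised
```

For instance, with two muddy agents out of three, no one raises a hand at round 1, which eliminates the single-muddy worlds, and both muddy agents raise their hands at round 2, matching the induction.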