{"title": "Inference with Minimal Communication: a Decision-Theoretic Variational Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 675, "page_last": 682, "abstract": null, "full_text": "Inference with Minimal Communication: a Decision-Theoretic Variational Approach\n\nO. Patrick Kreidl and Alan S. Willsky\nDepartment of Electrical Engineering and Computer Science\nMIT Laboratory for Information and Decision Systems\nCambridge, MA 02139\n{opk,willsky}@mit.edu\n\nAbstract\n\nGiven a directed graphical model with binary-valued hidden nodes and real-valued noisy observations, consider deciding upon the maximum a-posteriori (MAP) or the maximum posterior-marginal (MPM) assignment under the restriction that each node broadcasts only to its children exactly one single-bit message. We present a variational formulation, viewing the processing rules local to all nodes as degrees-of-freedom, that minimizes the loss in expected (MAP or MPM) performance subject to such online communication constraints. The approach leads to a novel message-passing algorithm to be executed offline, or before observations are realized, which mitigates the performance loss by iteratively coupling all rules in a manner implicitly driven by global statistics. We also provide (i) illustrative examples, (ii) assumptions that guarantee convergence and efficiency and (iii) connections to active research areas.\n\n1 Introduction\n\nGiven a probabilistic model with discrete-valued hidden variables, Belief Propagation (BP) and related graph-based algorithms are commonly employed to solve for the Maximum A-Posteriori (MAP) assignment (i.e., the mode of the joint distribution of all hidden variables) and the Maximum-Posterior-Marginal (MPM) assignment (i.e., the modes of the marginal distributions of every hidden variable) [1]. 
The established \"message-passing\" interpretation of BP extends naturally to a distributed network setting: associating to each node and edge in the graph a distinct processor and communication link, respectively, the algorithm is equivalent to a sequence of purely-local computations interleaved with only nearest-neighbor communications. Specifically, each computation event corresponds to a node evaluating its local processing rule, or a function by which all messages received in the preceding communication event map to messages sent in the next communication event. Practically, the viability of BP appears to rest upon an implicit assumption that network communication resources are abundant. In a general network, because termination of the algorithm is in question, the required communication resources are a-priori unbounded. Even when termination can be guaranteed, transmission of exact messages presumes communication channels with infinite capacity (in bits per observation), or at least of sufficiently high bandwidth such that the resulting finite message precision is essentially error-free. In some distributed settings (e.g., energy-limited wireless sensor networks), it may be prohibitively costly to justify such idealized online communications. While recent evidence suggests substantial but \"small-enough\" message errors will not alter the behavior of BP [2], [3], it also suggests BP may perform poorly when communication is very constrained.\n\nAssuming communication constraints are severe, we examine the extent to which alternative processing rules can avoid a loss in (MAP or MPM) performance. Specifically, given a directed graphical model with binary-valued hidden variables and real-valued noisy observations, we assume each node may broadcast only to its children a single binary-valued message. 
We cast the problem within a variational formulation [4], seeking to minimize a decision-theoretic penalty function subject to such online communication constraints. The formulation turns out to be an extension of the optimization problem underlying the decentralized detection paradigm [5], [6], which advocates a team-theoretic [7] relaxation of the original problem to both justify a particular finite parameterization for all local processing rules and obtain an iterative algorithm to be executed offline (i.e., before observations are realized). To our knowledge, that this relaxation permits analytical progress given any directed acyclic network is new. Moreover, for MPM assignment in a tree-structured network, we discover an added convenience with respect to the envisioned distributed processor setting: the offline computation itself admits an efficient message-passing interpretation.\n\nThis paper is organized as follows. Section 2 details the decision-theoretic variational formulation for discrete-variable assignment. Section 3 summarizes the main results derived from its connection to decentralized detection, culminating in the offline message-passing algorithm and the assumptions that guarantee convergence and maximal efficiency. We omit the mathematical proofs [8] here, focusing instead on intuition and illustrative examples. Closing remarks and relations to other active research areas appear in Section 4.\n\n2 Variational Formulation\n\nIn abstraction, the basic ingredients are (i) a joint distribution $p(x, y)$ for two length-$N$ random vectors $X$ and $Y$, taking hidden and observable values in the sets $\\{0, 1\\}^N$ and $\\mathbb{R}^N$, respectively; (ii) a decision-theoretic penalty function $J : \\Gamma \\rightarrow \\mathbb{R}$, where $\\Gamma$ denotes the set of all candidate strategies $\\gamma : \\mathbb{R}^N \\rightarrow \\{0, 1\\}^N$ for posterior assignment; and (iii) the set $\\Gamma^{\\mathcal{G}} \\subset \\Gamma$ of strategies that also respect stipulated communication constraints in a given $N$-node directed acyclic network $\\mathcal{G}$. 
The ensuing optimization problem is expressed by\n\n$$J(\\gamma^*) = \\min_{\\gamma \\in \\Gamma} J(\\gamma) \\quad \\text{subject to} \\quad \\gamma \\in \\Gamma^{\\mathcal{G}}, \\tag{1}$$\n\nwhere $\\gamma^*$ then represents an optimal network-constrained strategy for discrete-variable assignment. The following subsections provide details unseen at this level of abstraction.\n\n2.1 Decision-Theoretic Penalty Function\n\nLet $U = \\gamma(Y)$ denote the decision process induced from the observation process $Y$ by any candidate assignment strategy $\\gamma \\in \\Gamma$. If we associate a numeric \"cost\" $c(u, x)$ to every possible joint realization of $(U, X)$, then the expected cost is a well-posed penalty function:\n\n$$J(\\gamma) = E[c(\\gamma(Y), X)] = E[E[c(\\gamma(Y), X) \\mid Y]]. \\tag{2}$$\n\nExpanding the inner expectation and recognizing $p(x|y)$ to be proportional to $p(x)p(y|x)$ for every $y$ such that $p(y) > 0$, it follows that $\\bar{\\gamma}$ minimizes (2) over $\\Gamma$ if and only if\n\n$$\\bar{\\gamma}(Y) = \\arg\\min_{u \\in \\{0,1\\}^N} \\sum_{x \\in \\{0,1\\}^N} p(x) c(u, x) p(Y|x) \\quad \\text{with probability one.} \\tag{3}$$\n\nOf note are (i) the likelihood function $p(Y|x)$ is a finite-dimensional sufficient statistic of $Y$, (ii) real-valued coefficients $\\bar{b}(u, x)$ provide a finite parameterization of the function space $\\Gamma$ and (iii) optimal coefficient values $\\bar{b}(u, x) = p(x)c(u, x)$ are computable offline.\n\nBefore introducing communication constraints, we illustrate by examples how the decision-theoretic penalty function relates to familiar discrete-variable assignment problems.\n\nExample 1: Let $c(u, x)$ indicate whether $u \\neq x$. Then (2) and (3) specialize to, respectively, the word error rate (viewing each $x$ as an $N$-bit word) and the MAP strategy:\n\n$$\\bar{\\gamma}(Y) = \\arg\\max_{x \\in \\{0,1\\}^N} p(x|Y) \\quad \\text{with probability one.}$$\n\nExample 2: Let $c(u, x) = \\sum_{n=1}^{N} c_n(u_n, x_n)$, where each $c_n$ indicates whether $u_n \\neq x_n$. Then (2) and (3) specialize to, respectively, the bit error rate and the MPM strategy:\n\n$$\\bar{\\gamma}(Y) = \\left( \\arg\\max_{x_1 \\in \\{0,1\\}} p(x_1|Y), \\ldots, \\arg\\max_{x_N \\in \\{0,1\\}} p(x_N|Y) \\right) \\quad \\text{with probability one.}$$\n\n2.2 Network Communication Constraints\n\nLet $\\mathcal{G}(\\mathcal{V}, \\mathcal{E})$ be any directed acyclic graph with vertex set $\\mathcal{V} = \\{1, \\ldots, N\\}$ and edge set $\\mathcal{E} = \\{(i, j) \\in \\mathcal{V} \\times \\mathcal{V} \\mid i \\in \\pi(j) \\Leftrightarrow j \\in \\chi(i)\\}$, where index sets $\\pi(n) \\subset \\mathcal{V}$ and $\\chi(n) \\subset \\mathcal{V}$ indicate, respectively, the parents and children of each node $n \\in \\mathcal{V}$. Without loss of generality, we assume the node labels respect the natural partial order implied by the graph $\\mathcal{G}$; specifically, we assume every node $n$ has parent nodes $\\pi(n) \\subset \\{1, \\ldots, n-1\\}$ and child nodes $\\chi(n) \\subset \\{n+1, \\ldots, N\\}$. Local to each node $n \\in \\mathcal{V}$ are the respective components $X_n$ and $Y_n$ of the joint process $(X, Y)$. Under best-case assumptions on $p(x, y)$ and $\\mathcal{G}$, Belief Propagation methods (e.g., max-product in Example 1, sum-product in Example 2) require at least $2|\\mathcal{E}|$ real-valued messages per observation $Y = y$, one per direction along each edge in $\\mathcal{G}$. In contrast, we insist upon a single forward-pass through $\\mathcal{G}$ where each node $n$ broadcasts to its children (if any) a single binary-valued message. This yields communication overhead of only $|\\mathcal{E}|$ bits per observation $Y = y$, but also renders the minimizing strategy of (3) infeasible. Accepting that performance-communication tradeoffs are inherent to distributed algorithms, we proceed with the goal of minimizing the loss in performance relative to $J(\\bar{\\gamma})$.\n\nSpecifically, we now translate the stipulated restrictions on communication into explicit constraints on the function space $\\Gamma$ over which to minimize (2). The simplest such translation assumes the binary-valued message produced by node $n$ also determines the respective component $u_n$ in decision vector $u = \\gamma(y)$. Recognizing that every node $n$ receives the messages $u_{\\pi(n)}$ from its parents (if any) as side information to $y_n$, any function of the form $\\gamma_n : \\mathbb{R} \\times \\{0,1\\}^{|\\pi(n)|} \\rightarrow \\{0,1\\}$ is a feasible processing rule; we denote the set of all such rules by $\\Gamma_n$. 
Then, every strategy in the set $\\Gamma^{\\mathcal{G}} = \\Gamma_1 \\times \\cdots \\times \\Gamma_N$ respects the constraints.\n\n3 Summary of Main Results\n\nAs stated in Section 1, the variational formulation presented in Section 2 can be viewed as an extension of the optimization problem underlying decentralized Bayesian detection [5], [6]. Even for specialized network structures (e.g., the $N$-node chain), it is known that exact solution to (1) is NP-hard, stemming from the absence of a guarantee that $\\gamma^* \\in \\Gamma^{\\mathcal{G}}$ possesses a finite parameterization. Also known is that analytical progress can be made for a relaxation of (1), which is based on the following intuition: if strategy $\\gamma^* = (\\gamma_1^*, \\ldots, \\gamma_N^*)$ is optimal over $\\Gamma^{\\mathcal{G}}$, then for each $n$ and assuming all components $i \\in \\mathcal{V} \\setminus n$ are fixed at rules $\\gamma_i^*$, the component rule $\\gamma_n^*$ must be optimal over $\\Gamma_n$. Decentralized detection has roots in team decision theory [7], a subset of game theory, in which the relaxation is named person-by-person (pbp) optimality. While global optimality always implies pbp-optimality, the converse is false: in general, there can be multiple pbp-optimal solutions with varying penalty. Nonetheless, pbp-optimality (along with a specialized observation process) justifies a particular finite parameterization for the function space $\\Gamma^{\\mathcal{G}}$, leading to a nonlinear fixed-point equation and an iterative algorithm with favorable convergence properties. Before presenting the general algorithm, we illustrate its application in two simple examples.\n\nExample 3: Consider the MPM assignment problem in Example 2, assuming $N = 2$ and distribution $p(x, y)$ is defined by positive-valued parameters $\\lambda$, $\\sigma_1$ and $\\sigma_2$ as follows:\n\n$$p(x) \\propto \\begin{cases} 1, & x_1 = x_2 \\\\ \\lambda, & x_1 \\neq x_2 \\end{cases} \\quad \\text{and} \\quad p(y|x) = \\prod_{n=1}^{N} \\frac{1}{\\sqrt{2\\pi}} \\exp\\left[ -\\frac{(y_n - \\sigma_n x_n)^2}{2} \\right].$$\n\nNote that $X_1$ and $X_2$ are marginally uniform and $\\lambda$ captures their correlation (positive, zero, or negative when $\\lambda$ is less than, equal to, or greater than unity, respectively), while $Y$ captures the presence of additive white Gaussian noise with signal-to-noise ratio at node $n$ equal to $\\sigma_n$. 
The (unconstrained) MPM strategy $\\bar{\\gamma}$ simplifies to a pair of threshold rules\n\n$$L_1(y_1) \\underset{u_1 = 0}{\\overset{u_1 = 1}{\\gtrless}} \\bar{\\eta}_1 = \\frac{1 + \\lambda L_2(y_2)}{\\lambda + L_2(y_2)} \\quad \\text{and} \\quad L_2(y_2) \\underset{u_2 = 0}{\\overset{u_2 = 1}{\\gtrless}} \\bar{\\eta}_2 = \\frac{1 + \\lambda L_1(y_1)}{\\lambda + L_1(y_1)},$$\n\nwhere $L_n(y_n) = \\exp[\\sigma_n(y_n - \\sigma_n/2)]$ denotes the likelihood-ratio local to node $n$. Let $\\mathcal{E} = \\{(1, 2)\\}$ and define two network-constrained strategies: myopic strategy $\\gamma^0$ employs thresholds $\\eta_1^0 = \\eta_2^0 = 1$, meaning each node $n$ acts to minimize $\\Pr[U_n \\neq X_n]$ as if in isolation, whereas heuristic strategy $\\gamma^h$ employs thresholds $\\eta_1^h = 1$ and $\\eta_2^h = \\lambda^{2u_1 - 1}$, meaning node 2 adjusts its threshold as if $X_1 = u_1$ (i.e., as if the myopic decision by node 1 is always correct). Figure 1 compares these strategies and a pbp-optimal strategy $\\gamma^k$: only $\\gamma^k$ is both feasible and consistently \"hedging\" against all uncertainty, i.e., $J(\\gamma^0) \\geq J(\\gamma^k) \\geq J(\\bar{\\gamma})$.\n\n[Figure 1 panels: (a) decision regions in likelihood-ratio space $(L_1, L_2)$, shown for $\\lambda < 1$; (b) $J(\\gamma)$ versus $\\sigma_1$, with $\\lambda = 0.1$, $\\sigma_2 = 1$; (c) $J(\\gamma)$ versus $\\lambda$, with $\\sigma_1 = \\sigma_2 = 1$.]\n\nFigure 1. Comparison of the four alternative strategies in Example 3: (a) sketch of the decision regions in likelihood-ratio space, showing that network-constrained threshold rules cannot exactly reproduce $\\bar{\\gamma}$ (unless $\\lambda = 1$); (b) bit-error-rate versus $\\sigma_1$ with $\\lambda$ and $\\sigma_2$ fixed, showing $\\gamma^h$ performs comparably to $\\gamma^k$ when $Y_1$ is accurate relative to $Y_2$ but otherwise performs worse than even $\\gamma^0$ (which requires no communication); (c) bit-error-rate versus $\\lambda$ with $\\sigma_1$ and $\\sigma_2$ fixed, showing $\\gamma^k$ uses the allotted bit of communication such that roughly 35% of the loss $J(\\gamma^0) - J(\\bar{\\gamma})$ is recovered.\n\nExample 4: Extend Example 3 to $N > 2$ nodes, but assuming $X$ is equally-likely to be all zeros or all ones (i.e., the extreme case of positive correlation) and $Y$ has identically-accurate components with $\\sigma_n = 1$ for all $n$. 
The MPM strategy employs thresholds $\\bar{\\eta}_n = \\prod_{i \\in \\mathcal{V} \\setminus n} 1/L_i(y_i)$ for all $n$, leading to $U = \\bar{\\gamma}(Y)$ also being all zeros or all ones; thus, its cost distribution, or the probability mass function for $c(\\bar{\\gamma}(Y), X)$, has mass only on the values 0 and $N$. The myopic strategy employs thresholds $\\eta_n^0 = 1$ for all $n$, leading to independent and identically-distributed (binary-valued) random variables $c_n(\\gamma_n^0(Y_n), X_n)$; thus, its cost distribution, approaching a normal shape as $N$ gets large, has mass on all values $0, 1, \\ldots, N$. Figure 2 considers a particular directed network $\\mathcal{G}$ and, initializing to $\\gamma^0$, shows the sequence of cost distributions resulting from the iterative offline algorithm; note the shape progression towards the cost distribution of the (infeasible) MPM strategy and the successive reduction in bit-error-rate $J(\\gamma^k)$. Also noteworthy is the rapid convergence and the successive reduction in word-error-rate $\\Pr[c(\\gamma^k(Y), X) \\neq 0]$.\n\n[Figure 2 panels: left, the directed network with $N = 12$ nodes, each node $n$ applying rule $\\gamma_n^k$ to produce $u_n$; right, cost distribution per iteration $k = 0, 1, \\ldots$ (probability mass function versus number of bit errors), with $J(\\gamma^0) = 3.7$, $J(\\gamma^1) = 2.9$, $J(\\gamma^2) = 2.8$, $J(\\gamma^3) = 2.8$.]\n\nFigure 2. Illustration of the iterative offline computation given $p(x, y)$ as described in Example 4 and the directed network shown ($N = 12$). A Monte-Carlo analysis of $\\bar{\\gamma}$ yields an estimate for its bit-error-rate of $J(\\bar{\\gamma}) \\approx 0.49$ (with standard deviation of 0.05); thus, with a total of just $|\\mathcal{E}| = 11$ bits of communication, the pbp-optimal strategy $\\gamma^3$ recovers roughly 28% of the loss $J(\\gamma^0) - J(\\bar{\\gamma})$.\n\n3.1 Necessary Optimality Conditions\n\nWe start by providing an explicit probabilistic interpretation of the general problem in (1). 
Lemma 1 The minimum penalty $J(\\gamma^*)$ defined in (1) is, firstly, achievable by a deterministic strategy (see Footnote 1) and, secondly, equivalently defined by\n\n$$J(\\gamma^*) = \\min_{p(u|y)} \\sum_{x \\in \\{0,1\\}^N} p(x) \\sum_{u \\in \\{0,1\\}^N} c(u, x) \\int_{y \\in \\mathbb{R}^N} p(u|y) p(y|x) \\, dy \\quad \\text{subject to} \\quad p(u|y) = \\prod_{n \\in \\mathcal{V}} p(u_n | y_n, u_{\\pi(n)}).$$\n\nFootnote 1: A randomized (or mixed) strategy, modeled as a probabilistic selection from a finite collection of deterministic strategies, takes more inputs than just the observation process $Y$. That deterministic strategies suffice, however, justifies \"post-hoc\" our initial abuse of notation for elements in the set $\\Gamma$.\n\nLemma 1 is primarily of conceptual value, establishing a correspondence between fixing a component rule $\\gamma_n \\in \\Gamma_n$ and inducing a decision process $U_n$ from the information $(Y_n, U_{\\pi(n)})$ local to node $n$. The following assumption permits analytical progress towards a finite parameterization for each function space $\\Gamma_n$ and the basis of an offline algorithm.\n\nAssumption 1 The observation process $Y$ satisfies $p(y|x) = \\prod_{n \\in \\mathcal{V}} p(y_n|x)$.\n\nLemma 2 Let Assumption 1 hold. Upon fixing a deterministic rule $\\gamma_n \\in \\Gamma_n$ local to node $n$ (in correspondence with $p(u_n|y_n, u_{\\pi(n)})$ by virtue of Lemma 1), we have the identity\n\n$$p(u_n | x, u_{\\pi(n)}) = \\int_{y_n \\in \\mathbb{R}} p(u_n | y_n, u_{\\pi(n)}) p(y_n | x) \\, dy_n. \\tag{4}$$\n\nMoreover, upon fixing a deterministic strategy $\\gamma \\in \\Gamma^{\\mathcal{G}}$, we have the identity\n\n$$p(u|x) = \\prod_{n \\in \\mathcal{V}} p(u_n | x, u_{\\pi(n)}). \\tag{5}$$\n\nLemma 2 implies fixing component rule $\\gamma_n \\in \\Gamma_n$ is in correspondence with inducing the conditional distribution $p(u_n|x, u_{\\pi(n)})$, now a probabilistic description that persists local to node $n$ no matter the rule $\\gamma_i$ at any other node $i \\in \\mathcal{V} \\setminus n$. Lemma 2 also introduces further structure in the constrained optimization expressed by Lemma 1: recognizing the integral over $\\mathbb{R}^N$ to equal $p(u|x)$, (4) and (5) together imply it can be expressed as a product of component integrals, each over $\\mathbb{R}$. We now argue that, despite these simplifications, the component rules of $\\gamma^*$ continue to be globally coupled. 
Starting with any deterministic strategy $\\gamma \\in \\Gamma^{\\mathcal{G}}$, consider optimizing the $n$th component rule $\\gamma_n$ over $\\Gamma_n$ assuming all other components stay fixed. With $\\gamma_n$ a degree-of-freedom, decision process $U_n$ is no longer well-defined so each $u_n \\in \\{0,1\\}$ merely represents a candidate decision local to node $n$. Online, each local decision will be made only upon receiving both the local observation $Y_n = y_n$ and all parents' local decisions $U_{\\pi(n)} = u_{\\pi(n)}$. It follows that node $n$, upon deciding a particular $u_n$, may assert that random vector $U$ is restricted to values in the subset $\\mathcal{U}[u_{\\pi(n)}, u_n] = \\{u' \\in \\{0,1\\}^N \\mid u'_{\\pi(n)} = u_{\\pi(n)}, u'_n = u_n\\}$. Then, viewing $(Y_n, U_{\\pi(n)})$ as a composite local observation and proceeding in the manner by which (3) is derived, the pbp-optimal relaxation of (1) reduces to the following form.\n\nProposition 1 Let Assumption 1 hold. In an optimal network-constrained strategy $\\gamma^* \\in \\Gamma^{\\mathcal{G}}$, for each $n$ and assuming all components $i \\in \\mathcal{V} \\setminus n$ are fixed at rules $\\gamma_i^*$ (each in correspondence with $p^*(u_i|x, u_{\\pi(i)})$ by virtue of Lemma 2), the rule $\\gamma_n^*$ satisfies\n\n$$\\gamma_n^*(Y_n, U_{\\pi(n)}) = \\arg\\min_{u_n \\in \\{0,1\\}} \\sum_{x \\in \\{0,1\\}^N} b_n^*(u_n, x; U_{\\pi(n)}) p(Y_n|x) \\quad \\text{with probability one} \\tag{6}$$\n\nwhere, for each $u_{\\pi(n)} \\in \\{0,1\\}^{|\\pi(n)|}$,\n\n$$b_n^*(u_n, x; u_{\\pi(n)}) = p(x) \\sum_{u \\in \\mathcal{U}[u_{\\pi(n)}, u_n]} c(u, x) \\prod_{i \\in \\mathcal{V} \\setminus n} p^*(u_i | x, u_{\\pi(i)}). \\tag{7}$$\n\nOf note are (i) the likelihood function $p(Y_n|x)$ is a finite-dimensional sufficient statistic of $Y_n$, (ii) real-valued coefficients $b_n$ provide a finite parameterization of the function space $\\Gamma_n$ and (iii) the pbp-optimal coefficient values $b_n^*$, while still computable offline, also depend on the distributions $p^*(u_i|x, u_{\\pi(i)})$ in correspondence with all fixed rules $\\gamma_i^*$.\n\n3.2 Offline Message-Passing Algorithm\n\nLet $f_n$ map from coefficients $\\{b_i; i \\in \\mathcal{V} \\setminus n\\}$ to coefficients $b_n$ by the following operations:\n\n1. for each $i \\in \\mathcal{V} \\setminus n$, compute $p(u_i|x, u_{\\pi(i)})$ via (4) and (6) given $b_i$ and $p(y_i|x)$;\n2. compute $b_n$ via (7) given $p(x)$, $c(u, x)$ and $\\{p(u_i|x, u_{\\pi(i)}); i \\in \\mathcal{V} \\setminus n\\}$.\n\n
Then, the simultaneous satisfaction of Proposition 1 at all $N$ nodes can be viewed as a system of $2^{N+1} \\sum_{n \\in \\mathcal{V}} 2^{|\\pi(n)|}$ nonlinear equations in as many unknowns,\n\n$$b_n = f_n(b_1, \\ldots, b_{n-1}, b_{n+1}, \\ldots, b_N), \\quad n = 1, \\ldots, N, \\tag{8}$$\n\nor, more concisely, $b = f(b)$. The connection between each $f_n$ and Proposition 1 affords an equivalence between solving the fixed-point equation $f$ via a Gauss-Seidel iteration and minimizing $J(\\gamma)$ via a coordinate-descent iteration [9], implying an algorithm guaranteed to terminate and achieve penalty no greater than that of an arbitrary initial strategy $\\gamma^0 \\leftrightarrow b^0$.\n\nProposition 2 Initialize to any coefficients $b^0 = (b_1^0, \\ldots, b_N^0)$ and generate the sequence $\\{b^k\\}$ using a component-wise iterative application of $f$ in (8), i.e., for $k = 1, 2, \\ldots$,\n\n$$b_n^k := f_n(b_1^{k-1}, \\ldots, b_{n-1}^{k-1}, b_{n+1}^k, \\ldots, b_N^k), \\quad n = N, N-1, \\ldots, 1. \\tag{9}$$\n\nIf Assumption 1 holds, the associated sequence $\\{J(\\gamma^k)\\}$ is non-increasing and converges:\n\n$$J(\\gamma^0) \\geq J(\\gamma^1) \\geq \\cdots \\geq J(\\gamma^k) \\rightarrow J^* \\geq J(\\gamma^*) \\geq J(\\bar{\\gamma}).$$\n\nDirect implementation of (9) is clearly imprudent from a computational perspective, because the transformation from fixed coefficients $b_n^k$ to the corresponding distribution $p^k(u_n|x, u_{\\pi(n)})$ need not be repeated within every component evaluation of $f$. In fact, assuming every node $n$ stores in memory its own likelihood function $p(y_n|x)$, this transformation can be accomplished locally (cf. (4) and (6)) and, also assuming the resulting distribution is broadcast to all other nodes before they proceed with their subsequent component evaluation of $f$, the termination guarantee of Proposition 2 is retained. Requiring every node to perform a network-wide broadcast within every iteration $k$ makes (9) a decidedly global algorithm, not to mention that each node $n$ must also store in memory $p(x, y_n)$ and $c(u, x)$ to carry forth the supporting local computations.\n\n
Assumption 2 The cost function satisfies $c(u, x) = \\sum_{n \\in \\mathcal{V}} c_n(u_n, x)$ for some collection of functions $\\{c_n : \\{0,1\\}^{N+1} \\rightarrow \\mathbb{R}\\}$ and the directed graph $\\mathcal{G}$ is tree-structured.\n\nProposition 3 Under Assumption 2, the following two-pass procedure is identical to (9):\n\n- Forward-pass at node $n$: upon receiving messages from all parents $i \\in \\pi(n)$, store them for use in the next reverse-pass and send to each child $j \\in \\chi(n)$ the following messages:\n\n$$P_{n \\rightarrow j}^k(u_n | x) := \\sum_{u_{\\pi(n)} \\in \\{0,1\\}^{|\\pi(n)|}} p^{k-1}(u_n | x, u_{\\pi(n)}) \\prod_{i \\in \\pi(n)} P_{i \\rightarrow n}^k(u_i | x). \\tag{10}$$\n\n- Reverse-pass at node $n$: upon receiving messages from all children $j \\in \\chi(n)$, update\n\n$$b_n^k(u_n, x; u_{\\pi(n)}) = p(x) \\prod_{i \\in \\pi(n)} P_{i \\rightarrow n}^k(u_i | x) \\left[ c_n(u_n, x) + \\sum_{j \\in \\chi(n)} C_{j \\rightarrow n}^k(u_n, x) \\right] \\tag{11}$$\n\nand the corresponding distribution $p^k(u_n|x, u_{\\pi(n)})$ via (4) and (6), store the distribution for use in the next forward pass and send to each parent $i \\in \\pi(n)$ the following messages:\n\n$$C_{n \\rightarrow i}^k(u_i, x) := \\sum_{u_n \\in \\{0,1\\}} p(u_n | x, u_i) \\left[ c_n(u_n, x) + \\sum_{j \\in \\chi(n)} C_{j \\rightarrow n}^k(u_n, x) \\right], \\tag{12}$$\n\n$$p(u_n | x, u_i) = \\sum_{\\{u'_{\\pi(n)} \\in \\{0,1\\}^{|\\pi(n)|} \\,\\mid\\, u'_i = u_i\\}} p^k(u_n | x, u'_{\\pi(n)}) \\prod_{\\ell \\in \\pi(n) \\setminus i} P_{\\ell \\rightarrow n}^k(u'_\\ell | x).$$\n\nAn intuitive interpretation of Proposition 3, from the perspective of node $n$, is as follows. From (10) in the forward pass, the messages received from each parent define what, during subsequent online operation, that parent's local decision means (in a likelihood sense) about its ancestors' outputs and the hidden process. From (12) in the reverse pass, the messages received from each child define what the local decision will mean (in an expected cost sense) to that child and its descendants. From (11), both types of incoming messages impact the local rule update and, in turn, the outgoing messages to both types of neighbors. 
While Proposition 3 alleviates the need for the iterative global broadcast of distributions $p^k(u_n|x, u_{\\pi(n)})$, the explicit dependence of (10)-(12) on the full vector $x$ implies the memory and computation requirements local to each node can still be exponential in $N$.\n\nAssumption 3 The hidden process $X$ is Markov on $\\mathcal{G}$, or $p(x) = \\prod_{n \\in \\mathcal{V}} p(x_n | x_{\\pi(n)})$, and all component likelihoods/costs satisfy $p(y_n|x) = p(y_n|x_n)$ and $c_n(u_n, x) = c_n(u_n, x_n)$.\n\nProposition 4 Under Assumption 3, the iterates in Proposition 3 specialize to the form of\n\n$$b_n^k(u_n, x_n; u_{\\pi(n)}), \\quad P_{n \\rightarrow j}^k(u_n | x_n) \\quad \\text{and} \\quad C_{n \\rightarrow i}^k(u_i, x_i), \\quad k = 0, 1, \\ldots$$\n\nand each node $n$ need only store in memory $p(x_{\\pi(n)}, x_n, y_n)$ and $c_n(u_n, x_n)$ to carry forth the supporting local computations. (The actual equations can be found in [8].)\n\nProposition 4 implies the convergence properties of Proposition 2 are upheld with maximal efficiency (linear in $N$) when $\\mathcal{G}$ is tree-structured and the global distribution and costs satisfy $p(x, y) = \\prod_{n \\in \\mathcal{V}} p(x_n | x_{\\pi(n)}) p(y_n | x_n)$ and $c(u, x) = \\sum_{n \\in \\mathcal{V}} c_n(u_n, x_n)$, respectively. Note that these conditions hold for the MPM assignment problems in Examples 3 & 4.\n\n4 Discussion\n\nOur decision-theoretic variational approach reflects several departures from existing methods for communication-constrained inference. Firstly, instead of imposing the constraints on an algorithm derived from an ideal model, we explicitly model the constraints and derive a different algorithm. Secondly, our penalty function drives the approximation by the desired application of inference (e.g., posterior assignment) as opposed to a generic error measure on the result of inference (e.g., divergence in true and approximate marginals). Thirdly, the necessary offline computation gives rise to a downside, namely less flexibility against time-varying statistical environments, decision objectives or network conditions.\n\nOur development also evokes principles in common with other research areas. 
Similar to the sum-product version of Belief Propagation (BP), our message-passing algorithm originates assuming a tree structure, an additive cost and a synchronous message schedule. It is thus enticing to claim that the maturation of BP (e.g., max-product, asynchronous schedule, cyclic graphs) also applies, but unique aspects to our development (e.g., directed graph, weak convergence, asymmetric messages) merit caution. That we solve for correlated equilibria and depend on probabilistic structure commensurate with cost structure for efficiency is in common with graphical games [10], which distinctly are formulated on undirected graphs and absent of hidden variables. Finally, our offline computation resembles learning a conditional random field [11], in the sense that factors of $p(u|x)$ are iteratively modified to reduce penalty $J(\\gamma)$; online computation via strategy $u = \\gamma(y)$, repeated per realization $Y = y$, is then viewed as sampling from this distribution. Along the learning thread, a special case of our formulation appears in [12], but assuming $p(x, y)$ is unknown.\n\nAcknowledgments\n\nThis work was supported by the Air Force Office of Scientific Research under contract FA955004-1 and by the Army Research Office under contract DAAD19-00-1-0466. We are grateful to Professor John Tsitsiklis for taking time to discuss the correctness of Proposition 1.\n\nReferences\n\n[1] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.\n[2] L. Chen, et al. Data association based on optimization in graphical models with application to sensor networks. Mathematical and Computer Modeling, 2005. To appear.\n[3] A. T. Ihler, et al. Message errors in belief propagation. Advances in NIPS 17, MIT Press, 2005.\n[4] M. I. Jordan, et al. An introduction to variational methods for graphical models. Learning in Graphical Models, pp. 105-161, MIT Press, 1999.\n[5] J. N. Tsitsiklis. Decentralized detection. Adv. in Stat. Sig. Proc., pp. 297-344, JAI Press, 1993.\n[6] P. K. Varshney. 
Distributed Detection and Data Fusion. Springer-Verlag, 1997.\n[7] J. Marschak and R. Radner. The Economic Theory of Teams. Yale University Press, 1972.\n[8] O. P. Kreidl and A. S. Willsky. Posterior assignment in directed graphical models with minimal online communication. Available: http://web.mit.edu/opk/www/res.html\n[9] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.\n[10] S. Kakade, et al. Correlated equilibria in graphical games. ACM-CEC, pp. 42-47, 2003.\n[11] J. Lafferty, et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001.\n[12] X. Nguyen, et al. Decentralized detection and classification using kernel methods. ICML, 2004.", "award": [], "sourceid": 2811, "authors": [{"given_name": "O.", "family_name": "Kreidl", "institution": null}, {"given_name": "Alan", "family_name": "Willsky", "institution": null}]}