{"title": "Consensus Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 899, "page_last": 906, "abstract": null, "full_text": "Consensus Propagation\n\nCiamac C. Moallemi Stanford University Stanford, CA 95014 USA ciamac@stanford.edu\n\nBenjamin Van Roy Stanford University Stanford, CA 95014 USA bvr@stanford.edu\n\nAbstract\nWe propose consensus propagation, an asynchronous distributed protocol for averaging numbers across a network. We establish convergence, characterize the convergence rate for regular graphs, and demonstrate that the protocol exhibits better scaling properties than pairwise averaging, an alternative that has received much recent attention. Consensus propagation can be viewed as a special case of belief propagation, and our results contribute to the belief propagation literature. In particular, beyond singly-connected graphs, there are very few classes of relevant problems for which belief propagation is known to converge.\n\n1 Introduction\n\nConsider a network of n nodes in which the i-th node observes a number y_i ∈ [0, 1] and aims to compute the average Σ_{i=1}^{n} y_i / n. The design of scalable distributed protocols for this purpose has received much recent attention and is motivated by a variety of potential needs. In both wireless sensor and peer-to-peer networks, for example, there is interest in simple protocols for computing aggregate statistics (see, for example, the references in [1]), and averaging enables computation of several important ones. Further, averaging serves as a primitive in the design of more sophisticated distributed information processing algorithms. For example, a maximum likelihood estimate can be produced by an averaging protocol if each node's observations are linear in variables of interest and noise is Gaussian [2]. As another example, averaging protocols are central to policy-gradient-based methods for distributed optimization of network performance [3]. 
In this paper we propose and analyze a new protocol, consensus propagation, for asynchronous distributed averaging. As a baseline for comparison, we will also discuss another asynchronous distributed protocol, pairwise averaging, which has received much recent attention. In pairwise averaging, each node maintains its current estimate of the average, and each time a pair of nodes communicates, they revise their estimates to both take on the mean of their previous estimates. Convergence of this protocol in a very general model of asynchronous computation and communication was established in [4]. Recent work [5, 6] has studied the convergence rate and its dependence on network topology and how pairs of nodes are sampled. Here, sampling is governed by a certain doubly stochastic matrix, and the convergence rate is characterized by its second-largest eigenvalue.\n\nConsensus propagation is a simple algorithm with an intuitive interpretation. It can also be viewed as an asynchronous distributed version of belief propagation as applied to approximation of conditional distributions in a Gaussian Markov random field. When the network of interest is singly-connected, prior results about belief propagation imply convergence of consensus propagation. However, in most cases of interest, the network is not singly-connected and prior results have little to say about convergence. In particular, Gaussian belief propagation on a graph with cycles is not guaranteed to converge, as demonstrated by examples in [7]. In fact, there are very few relevant cases where belief propagation on a graph with cycles is known to converge. Some fairly general sufficient conditions have been established [8, 9, 10], but these conditions are abstract and it is difficult to identify interesting classes of problems that meet them. One simple case where belief propagation is guaranteed to converge is when the graph has only a single cycle [11, 12, 13]. 
Recent work proposes the use of belief propagation to solve maximum-weight matching problems and proves convergence in that context [14]. [15] proves convergence in the application of belief propagation to a classification problem. In the Gaussian case, [7, 16] provide sufficient conditions for convergence, but these conditions are difficult to interpret and do not capture situations that correspond to consensus propagation.\n\nWith this background, let us discuss the primary contributions of this paper: (1) we propose consensus propagation, a new asynchronous distributed protocol for averaging; (2) we prove that consensus propagation converges even when executed asynchronously. Since there are so few classes of relevant problems for which belief propagation is known to converge, even with synchronous execution, this is surprising; (3) we characterize the convergence time in regular graphs of the synchronous version of consensus propagation in terms of the mixing time of a certain Markov chain over edges of the graph; (4) we explain why the convergence time of consensus propagation scales more gracefully with the number of nodes than does that of pairwise averaging, and for certain classes of graphs, we quantify the improvement.\n\n2 Algorithm\n\nConsider a connected undirected graph (V, E) with |V| = n nodes. For each node i ∈ V, let N(i) = {j : (i, j) ∈ E} be the set of neighbors of i. Each node i ∈ V is assigned a number y_i ∈ [0, 1]. The goal is for each node to obtain an estimate of ȳ = Σ_{i∈V} y_i / n through an asynchronous distributed protocol in which each node carries out simple computations and communicates parsimonious messages to its neighbors.\n\nWe propose consensus propagation as an approach to the aforementioned problem. In this protocol, if a node i communicates to a neighbor j at time t, it transmits a message consisting of two numerical values. 
Let μ_ij^t ∈ R and K_ij^t ∈ R_+ denote the values associated with the most recently transmitted message from i to j at or before time t. At each time t, node j has stored in memory the most recent message from each neighbor: {μ_ij^t, K_ij^t | i ∈ N(j)}. The initial values in memory before receiving any messages are arbitrary.\n\nConsensus propagation is parameterized by a scalar β > 0 and a non-negative matrix Q ∈ R_+^{n×n} with Q_ij > 0 if and only if i ≠ j and (i, j) ∈ E. Let E⃗ ⊆ V × V be a set consisting of two directed edges (i, j) and (j, i) per undirected edge (i, j) ∈ E. For each (i, j) ∈ E⃗, it is useful to define the following three functions:\n\nF_ij(K) = (1 + Σ_{u∈N(i)\\j} K_ui) / (1 + (1/(β Q_ij)) (1 + Σ_{u∈N(i)\\j} K_ui)),   (1)\n\nG_ij(μ, K) = (y_i + Σ_{u∈N(i)\\j} K_ui μ_ui) / (1 + Σ_{u∈N(i)\\j} K_ui),   X_i(μ, K) = (y_i + Σ_{u∈N(i)} K_ui μ_ui) / (1 + Σ_{u∈N(i)} K_ui).   (2)\n\nFor each t, denote by U_t ⊆ E⃗ the set of directed edges along which messages are transmitted at time t. Consensus propagation is presented below as Algorithm 1.\n\nAlgorithm 1 Consensus propagation.\nfor time t = 1 to ∞ do\n  for all (i, j) ∈ U_t do\n    K_ij^t ← F_ij(K^{t-1})\n    μ_ij^t ← G_ij(μ^{t-1}, K^{t-1})\n  end for\n  for all (i, j) ∉ U_t do\n    K_ij^t ← K_ij^{t-1}\n    μ_ij^t ← μ_ij^{t-1}\n  end for\n  x^t ← X(μ^t, K^t)\nend for\n\nConsensus propagation is a distributed protocol because computations at each node require only information that is locally available. In particular, the messages F_ij(K^{t-1}) and G_ij(μ^{t-1}, K^{t-1}) transmitted from node i to node j depend only on {μ_ui^{t-1}, K_ui^{t-1} | u ∈ N(i)}, which node i has stored in memory. Similarly, x_i^t, which serves as an estimate of ȳ, depends only on {μ_ui^t, K_ui^t | u ∈ N(i)}.\n\nConsensus propagation is an asynchronous protocol because only a subset of the potential messages are transmitted at each time. Our convergence analysis can also be extended to accommodate more general models of asynchronism that involve communication delays, such as those presented in [17]. 
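The update rules (1)-(2) and Algorithm 1 can be sketched in a few lines of Python. The block below is a minimal illustration of the synchronous case only (every directed edge transmits at every step), not the authors' implementation. The function name consensus_propagation, the adjacency-list input, the zero initialization of K, the initialization of each mean message to y_i, and the choice Q_ij = 1 are all assumptions made for the example.

```python
# Hypothetical sketch of synchronous consensus propagation with Q_ij = 1.
# neighbors: dict mapping node i (0..n-1) to a list of its neighbors.
# y: list of observations, beta: attenuation parameter (may be float('inf')).
def consensus_propagation(neighbors, y, beta, num_iters):
    edges = [(i, j) for i in neighbors for j in neighbors[i]]
    K = {e: 0.0 for e in edges}                 # assumed initial K values
    mu = {(i, j): y[i] for (i, j) in edges}     # assumed initial mean messages
    for _ in range(num_iters):
        new_K, new_mu = {}, {}
        for (i, j) in edges:
            # Incoming messages at i from every neighbor except j.
            others = [u for u in neighbors[i] if u != j]
            s = 1.0 + sum(K[(u, i)] for u in others)
            # F_ij: accumulated weight s, attenuated through beta.
            new_K[(i, j)] = s / (1.0 + s / beta)
            # G_ij: K-weighted average of y_i and the incoming means.
            weighted = y[i] + sum(K[(u, i)] * mu[(u, i)] for u in others)
            new_mu[(i, j)] = weighted / s
        K, mu = new_K, new_mu
    # X_i: each node combines its own value with all incoming messages.
    x = []
    for i in range(len(y)):
        num = y[i] + sum(K[(u, i)] * mu[(u, i)] for u in neighbors[i])
        den = 1.0 + sum(K[(u, i)] for u in neighbors[i])
        x.append(num / den)
    return x
```

On a tree, calling the function with beta = float('inf') and at least as many iterations as the diameter returns the exact mean at every node. On a graph with cycles a finite beta is needed, and the estimates then approximate the mean rather than match it exactly.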
In our study of convergence time, we will focus on the synchronous version of consensus propagation. This is where U_t = E⃗ for all t. Note that synchronous consensus propagation is defined by:\n\nK^t = F(K^{t-1}),   μ^t = G(μ^{t-1}, K^{t-1}),   x^t = X(μ^{t-1}, K^{t-1}).   (3)\n\n2.1 Intuitive Interpretation\n\nConsider the special case of a singly connected graph. For any (i, j) ∈ E⃗, there is a set S_ij ⊂ V of nodes that can transmit information to S_ji = V \\ S_ij only through (i, j). In order for nodes in S_ji to compute ȳ, they must at least be provided with the average μ*_ij among observations at nodes in S_ij and the cardinality K*_ij = |S_ij|. The messages μ_ij^t and K_ij^t can be viewed as estimates. In fact, when β = ∞, μ_ij^t and K_ij^t converge to μ*_ij and K*_ij, as we will now explain.\n\nSuppose the graph is singly connected, β = ∞, and transmissions are synchronous. Then,\n\nK_ij^t = 1 + Σ_{u∈N(i)\\j} K_ui^{t-1},   (4)\n\nfor all (i, j) ∈ E⃗. This is a recursive characterization of |S_ij|, and it is easy to see that it converges in a number of iterations equal to the diameter of the graph. Now consider the iteration\n\nμ_ij^t = (y_i + Σ_{u∈N(i)\\j} K_ui^{t-1} μ_ui^{t-1}) / (1 + Σ_{u∈N(i)\\j} K_ui^{t-1}),\n\nfor all (i, j) ∈ E⃗. A simple inductive argument shows that at each time t, μ_ij^t is an average among observations at K_ij^t nodes in S_ij, and after a number of iterations equal to the diameter of the graph, μ^t = μ*. Further, for any i ∈ V,\n\nȳ = (y_i + Σ_{u∈N(i)} K*_ui μ*_ui) / (1 + Σ_{u∈N(i)} K*_ui),\n\nso x_i^t converges to ȳ. This interpretation can be extended to the asynchronous case where it elucidates the fact that μ^t and K^t become μ* and K* after every pair of nodes in the graph has established bilateral communication through some sequence of transmissions among adjacent nodes.\n\nSuppose now that the graph has cycles. If β = ∞, for any (i, j) ∈ E⃗ that is part of a cycle, K_ij^t → ∞ whether transmissions are synchronous or asynchronous, so long as messages are transmitted along each edge of the cycle an infinite number of times. 
A heuristic fix might be to compose the iteration (4) with one that attenuates: K̃_ij^t ← 1 + Σ_{u∈N(i)\\j} K_ui^{t-1}, and K_ij^t ← K̃_ij^t / (1 + ε_ij K̃_ij^t). Here, ε_ij > 0 is a small constant. The message is essentially unaffected when ε_ij K̃_ij^t is small but becomes increasingly attenuated as K̃_ij^t grows. This is exactly the kind of attenuation carried out by consensus propagation when β Q_ij = 1/ε_ij < ∞. Understanding why this kind of attenuation leads to desirable results is a subject of our analysis.\n\n2.2 Relation to Belief Propagation\n\nConsensus propagation can also be viewed as a special case of belief propagation. In this context, belief propagation is used to approximate the marginal distributions of a vector x ∈ R^n conditioned on the observations y ∈ R^n. The mode of each of the marginal distributions approximates ȳ. Take the prior distribution over (x, y) to be the normalized product of potential functions {ψ_i(·) | i ∈ V} and compatibility functions {ψ_ij(·) | (i, j) ∈ E}, given by ψ_i(x_i) = exp(-(x_i - y_i)^2), and ψ_ij(x_i, x_j) = exp(-β Q_ij (x_i - x_j)^2), where Q_ij, for each (i, j) ∈ E, and β are positive constants. Note that β can be viewed as an inverse temperature parameter; as β increases, components of x associated with adjacent nodes become increasingly correlated.\n\nLet Γ be a positive semidefinite symmetric matrix such that x^T Γ x = Σ_{(i,j)∈E} Q_ij (x_i - x_j)^2. Note that when Q_ij = 1 for all (i, j) ∈ E, Γ is the graph Laplacian. Given the vector y of observations, the conditional density of x is\n\np^β(x) ∝ Π_{i∈V} ψ_i(x_i) Π_{(i,j)∈E} ψ_ij(x_i, x_j) = exp(-‖x - y‖_2^2 - β x^T Γ x).\n\nLet x^β denote the mode of p^β(·). Since the distribution is Gaussian, each component x_i^β is also the mode of the corresponding marginal distribution. Note that x^β is the unique solution to the positive definite quadratic program\n\nminimize_x  ‖x - y‖_2^2 + β x^T Γ x. 
(5)\n\nThe following theorem, whose proof can be found in [1], suggests that if β is sufficiently large each component x_i^β can be used as an estimate of the mean value ȳ.\n\nTheorem 1. Σ_i x_i^β / n = ȳ and lim_{β→∞} x_i^β = ȳ, for all i ∈ V.\n\nIn belief propagation, messages are passed along edges of a Markov random field. In our case, because of the structure of the distribution p^β(·), the relevant Markov random field has the same topology as the graph (V, E). The message M_ij(·) passed from node i to node j is a distribution on the variable x_j. Node i computes this message using incoming messages from other nodes as defined by the update equation\n\nM_ij^t(x_j) = κ ∫ ψ_ij(x_i, x_j) ψ_i(x_i) Π_{u∈N(i)\\j} M_ui^{t-1}(x_i) dx_i.   (6)\n\nHere, κ is a normalizing constant. Since our underlying distribution p^β(·) is Gaussian, it is natural to consider messages which are Gaussian distributions. In particular, let (μ_ij^t, K_ij^t) ∈ R × R_+ parameterize Gaussian message M_ij^t(·) according to M_ij^t(x_j) ∝ exp(-K_ij^t (x_j - μ_ij^t)^2). Then, (6) is equivalent to synchronous consensus propagation iterations for K^t and μ^t. The sequence of densities\n\np_j^t(x_j) ∝ ψ_j(x_j) Π_{i∈N(j)} M_ij^t(x_j) = exp(-(x_j - y_j)^2 - Σ_{i∈N(j)} K_ij^t (x_j - μ_ij^t)^2),\n\nis meant to converge to an approximation of the marginal conditional distribution of x_j. As such, an approximation to x_j^β is given by maximizing p_j^t(·). It is easy to show that the maximum is attained by x_j^t = X_j(μ^t, K^t). With this and aforementioned correspondences, we have shown that consensus propagation is a special case of belief propagation. Readers familiar with belief propagation will notice that in the derivation above we have used the sum product form of the algorithm. In this case, since the underlying distribution is Gaussian, the max product form yields equivalent iterations.\n\n3 Convergence\n\nThe following theorem is our main convergence result.\n\nTheorem 2. 
(i) There are unique vectors (μ^β, K^β) such that K^β = F(K^β), and μ^β = G(μ^β, K^β). (ii) Assume that each edge (i, j) ∈ E⃗ appears infinitely often in the sequence of communication sets {U_t}. Then, independent of the initial condition (μ^0, K^0), lim_{t→∞} K^t = K^β, and lim_{t→∞} μ^t = μ^β. (iii) Given (μ^β, K^β), if x^β = X(μ^β, K^β), then x^β is the mode of the distribution p^β(·).\n\nThe proof of this theorem can be found in [1], but it rests on two ideas. First, notice that, according to the update equation (1), K^t evolves independently of μ^t. Hence, we analyze K^t first. Following the work of [7], we prove that the functions {F_ij(·)} are monotonic. This property is used to establish convergence to a unique fixed point. Next, we analyze μ^t assuming that K^t has already converged. Given fixed K, the update equations for μ^t are linear, and we establish that they induce a contraction with respect to the maximum norm. This allows us to establish existence of a fixed point and asynchronous convergence.\n\n4 Convergence Time for Regular Graphs\n\nIn this section, we will study the convergence time of synchronous consensus propagation. For ε > 0, we will say that an estimate x̃ of ȳ is ε-accurate if ‖x̃ - ȳ1‖_{2,n} ≤ ε. Here, for integer m, ‖·‖_{2,m} is the norm on R^m defined by ‖x‖_{2,m} = ‖x‖_2 / √m. We are interested in the number of iterations required to obtain an ε-accurate estimate of the mean ȳ.\n\n4.1 The Case of Regular Graphs\n\nWe will restrict our analysis of convergence time to cases where (V, E) is a d-regular graph, for d ≥ 2. Extension of our analysis to broader classes of graphs remains an open issue. We will also make simplifying assumptions that Q_ij = 1, μ_ij^0 = y_i, and K^0 = [k_0]_ij for some scalar k_0 ≥ 0.\n\nIn this restricted setting, the subspace of constant K vectors is invariant under F. This implies that there is some scalar k^β > 0 so that K^β = [k^β]_ij. This k^β is the unique solution to the fixed point equation k^β = (1 + (d - 1)k^β) / (1 + (1 + (d - 1)k^β)/β). 
Given a uniform initial condition K^0 = [k_0]_ij, we can study the sequence of iterates {K^t} by examining the scalar sequence {k_t}, defined by k_t = (1 + (d - 1)k_{t-1}) / (1 + (1 + (d - 1)k_{t-1})/β). In particular, we have K^t = [k_t]_ij, for all t ≥ 0. Similarly, in this setting, the equations for the evolution of μ^t take the special form\n\nμ_ij^t = y_i / (1 + (d - 1)k_{t-1}) + (1 - 1/(1 + (d - 1)k_{t-1})) Σ_{u∈N(i)\\j} μ_ui^{t-1} / (d - 1).\n\nDefining γ_t = 1/(1 + (d - 1)k_t), we have, in vector form,\n\nμ^t = γ_{t-1} ŷ + (1 - γ_{t-1}) P̂ μ^{t-1},   (7)\n\nwhere ŷ ∈ R^{nd} is a vector with ŷ_ij = y_i and P̂ ∈ R_+^{nd×nd} is a doubly stochastic matrix. The matrix P̂ corresponds to a Markov chain on the set of directed edges E⃗. In this chain, an edge (i, j) transitions to an edge (u, i) with u ∈ N(i) \\ j, with equal probability assigned to each such edge. As in (3), we associate each μ^t with an estimate x^t of x^β according to x^t = y/(1 + dk^β) + dk^β A μ^t/(1 + dk^β), where A ∈ R_+^{n×nd} is a matrix defined by (Aμ)_j = Σ_{i∈N(j)} μ_ij / d.\n\nThe update equation (7) suggests that the convergence of μ^t is intimately tied to a notion of mixing time associated with P̂. Let P̂* be the Cesàro limit P̂* = lim_{t→∞} Σ_{τ=0}^{t-1} P̂^τ / t. Define the Cesàro mixing time τ* by τ* = sup_{t≥0} ‖Σ_{τ=0}^{t} (P̂^τ - P̂*)‖_{2,nd}. Here, ‖·‖_{2,nd} is the matrix norm induced by the corresponding vector norm ‖·‖_{2,nd}. Since P̂ is a stochastic matrix, P̂* is well-defined and τ* < ∞. Note that, in the case where P̂ is aperiodic, irreducible, and symmetric, τ* corresponds to the traditional definition of mixing time: the inverse of the spectral gap of P̂.\n\nA time t* is said to be an ε-convergence time if estimates x^t are ε-accurate for all t ≥ t*. The following theorem, whose proof can be found in [1], establishes a bound on the convergence time of synchronous consensus propagation given appropriately chosen β, as a function of ε and τ*.\n\nTheorem 3. Suppose k_0 ≤ k^β. 
If d = 2 there exists a β = Θ((τ*/ε)^2) and if d > 2 there exists a β = Θ(τ*/ε) such that some t* = O((τ*/ε) log(τ*/ε)) is an ε-convergence time. Alternatively, suppose k_0 = k^β. If d = 2 there exists a β = Θ((τ*/ε)^2) and if d > 2 there exists a β = Θ(τ*/ε) such that some t* = O((τ*/ε) log(1/ε)) is an ε-convergence time.\n\nIn the first part of the above theorem, k_0 is initialized arbitrarily so long as k_0 ≤ k^β. Typically, one might set k_0 = 0 to guarantee this. The second case of interest is when k_0 = k^β, so that k_t = k^β for all t ≥ 0. Theorem 3 suggests that initializing with k_0 = k^β leads to an improvement in convergence time. However, in our computational experience, we have found that an initial condition of k_0 = 0 consistently results in faster convergence than k_0 = k^β. Hence, we suspect that a convergence time bound of O((τ*/ε) log(1/ε)) also holds for the case of k_0 = 0. Proving this remains an open issue.\n\nTheorem 3 posits choices of β that require knowledge of τ*, which may be difficult to compute and requires knowledge of the graph topology. This is not a major restriction, however. It is not difficult to imagine variations of Algorithm 1 which use a doubling sequence of guesses for the Cesàro mixing time τ*. Each guess leads to a choice of β and a number of iterations t* to run with that choice of β. Such a modified algorithm would still have an ε-convergence time of O((τ*/ε) log(τ*/ε)).\n\n5 Comparison with Pairwise Averaging\n\nUsing the results of Section 4, we can compare the performance of consensus propagation to that of pairwise averaging. Pairwise averaging is usually defined in an asynchronous setting, but there is a synchronous counterpart which works as follows. Consider a doubly stochastic symmetric matrix P ∈ R^{n×n} such that P_ij = 0 if (i, j) ∉ E. Evolve estimates according to x^t = P x^{t-1}, initialized with x^0 = y. Clearly x^t = P^t y → ȳ1 as t → ∞. 
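The synchronous pairwise averaging iteration x^t = P x^{t-1} can be illustrated on a cycle. The sketch below is only an example under stated assumptions: the function name pairwise_averaging_cycle and the self_weight parameter are hypothetical, and the particular P used (weight self_weight on the diagonal, the remainder split evenly between the two cycle neighbors) is one convenient doubly stochastic symmetric choice, not the optimized P discussed in the text.

```python
# Hypothetical sketch: synchronous pairwise averaging x^t = P x^{t-1} on a cycle.
# P puts self_weight on the diagonal and splits the rest between the two
# neighbors, so P is symmetric and doubly stochastic (an assumed choice).
def pairwise_averaging_cycle(y, num_iters, self_weight=0.5):
    n = len(y)
    w = (1.0 - self_weight) / 2.0   # weight given to each of the two neighbors
    x = list(y)                     # x^0 = y
    for _ in range(num_iters):
        # One multiplication by P, written out row by row.
        x = [self_weight * x[i] + w * x[(i - 1) % n] + w * x[(i + 1) % n]
             for i in range(n)]
    return x
```

Because each row and column of this P sums to one, the sum of the estimates is preserved at every step, and a positive self_weight avoids the periodic behaviour that a cycle with an even number of nodes would otherwise show.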
In the case of a singly-connected graph, synchronous consensus propagation converges exactly in a number of iterations equal to the diameter of the graph. Moreover, when β = ∞, this convergence is to the exact mean, as discussed in Section 2.1. This is the best one can hope for under any algorithm, since the diameter is the minimum amount of time required for a message to travel between the two most distant nodes. On the other hand, for a fixed accuracy ε, the worst-case number of iterations required by synchronous pairwise averaging on a singly-connected graph scales at least quadratically in the diameter [18].\n\nThe rate of convergence of synchronous pairwise averaging is governed by the relation ‖x^t - ȳ1‖_{2,n} ≤ λ_2^t, where λ_2 is the second largest eigenvalue of P. Let τ_2 = 1/log(1/λ_2), and call it the mixing time of P. In order to guarantee ε-accuracy (independent of y), t > τ_2 log(1/ε) suffices and t = Ω(τ_2 log(1/ε)) is required [6].\n\nConsider d-regular graphs and fix a desired error tolerance ε. The number of iterations required by consensus propagation is Θ(τ* log τ*), whereas that required by pairwise averaging is Θ(τ_2). Both mixing times depend on the size and topology of the graph. τ_2 is the mixing time of a process on nodes that transitions along edges whereas τ* is the mixing time of a process on directed edges that transitions towards nodes. An important distinction is that the former process is allowed to \"backtrack\" whereas the latter is not. By this we mean that a sequence of states {i, j, i} can be observed in the vertex process, but the sequence {(i, j), (j, i)} cannot be observed in the edge process. As we will now illustrate through an example, it is this difference that makes τ_2 larger than τ* and, therefore, pairwise averaging less efficient than consensus propagation.\n\nIn the case of a cycle (d = 2) with an even number of nodes n, minimizing the mixing time over P results in τ_2 = Θ(n^2) [19]. 
For comparison, as demonstrated in the following theorem (whose proof can be found in [1]), τ* is linear in n.\n\nTheorem 4. For the cycle with n nodes, τ* ≤ n/√2.\n\nIntuitively, the improvement in mixing time arises from the fact that the edge process moves around the cycle in a single direction and therefore explores the entire graph within n iterations. The vertex process, on the other hand, randomly transitions back and forth among adjacent nodes, relying on chance to eventually explore the entire cycle.\n\nThe cycle example demonstrates a Θ(n/log n) advantage offered by consensus propagation. Comparisons of mixing times associated with other graph topologies remain an issue for future analysis. But let us close by speculating on a uniform grid of n nodes over the m-dimensional unit torus. Here, n^{1/m} is an integer, and each vertex has 2m neighbors, each a distance n^{-1/m} away. With P optimized, it can be shown that τ_2 = Θ(n^{2/m}) [20]. We put forth a conjecture on τ*.\n\nConjecture 1. For the m-dimensional torus with n nodes, τ* = Θ(n^{(2m-1)/m^2}).\n\nAcknowledgments\nThe authors wish to thank Balaji Prabhakar and Ashish Goel for their insights and comments. The first author was supported by a Benchmark Stanford Graduate Fellowship. This research was supported in part by the National Science Foundation through grant IIS-0428868 and a supplement to grant ECS-9985229 provided by the Management of Knowledge Intensive Dynamic Systems Program (MKIDS).\n\nReferences\n[1] C. C. Moallemi and B. Van Roy. Consensus propagation. Technical report, Management Science & Engineering Department, Stanford University, 2005. URL: http://www.moallemi.com/ciamac/papers/cp-2005.pdf.\n[2] L. Xiao, S. Boyd, and S. Lall. A scheme for robust distributed sensor fusion based on average consensus. To appear in the proceedings of IPSN, 2005.\n[3] C. C. Moallemi and B. Van Roy. Distributed optimization in adaptive networks. In Advances in Neural Information Processing Systems 16, 2004.\n[4] J. N. 
Tsitsiklis. Problems in Decentralized Decision-Making and Computation. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1984.\n[5] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In ACM Symposium on Theory of Computing, 2004.\n[6] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip algorithms: Design, analysis and applications. To appear in the proceedings of INFOCOM, 2005.\n[7] P. Rusmevichientong and B. Van Roy. An analysis of belief propagation on the turbo decoding graph with Gaussian densities. IEEE Transactions on Information Theory, 47(2):745-765, 2001.\n[8] S. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs measures. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 2002.\n[9] T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11):2379-2413, 2004.\n[10] A. T. Ihler, J. W. Fisher III, and A. S. Willsky. Message errors in belief propagation. In Advances in Neural Information Processing Systems, 2005.\n[11] G. Forney, F. Kschischang, and B. Marcus. Iterative decoding of tail-biting trellises. In Proceedings of the 1998 Information Theory Workshop, 1998.\n[12] S. M. Aji, G. B. Horn, and R. J. McEliece. On the convergence of iterative decoding on graphs with a single cycle. In Proceedings of CISS, 1998.\n[13] Y. Weiss and W. T. Freeman. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12:1-41, 2000.\n[14] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. preprint, 2005.\n[15] V. Saligrama, M. Alanyali, and O. Savas. Asynchronous distributed detection in sensor networks. preprint, 2005.\n[16] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13:2173-2200, 2001.\n[17] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. 
Athena Scientific, Belmont, MA, 1997.\n[18] S. Boyd, P. Diaconis, J. Sun, and L. Xiao. Fastest mixing Markov chain on a path. submitted to The American Mathematical Monthly, 2003.\n[19] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Mixing times for random walks on geometric random graphs. To appear in the proceedings of SIAM ANALCO, 2005.\n[20] S. Roch. Bounded fastest mixing. preprint, 2004.\n", "award": [], "sourceid": 2913, "authors": [{"given_name": "Benjamin", "family_name": "Roy", "institution": null}, {"given_name": "Ciamac", "family_name": "Moallemi", "institution": null}]}