{"title": "Inference with Multivariate Heavy-Tails in Linear Models", "book": "Advances in Neural Information Processing Systems", "page_first": 208, "page_last": 216, "abstract": "Heavy-tailed distributions naturally occur in many real-life problems. Unfortunately, it is typically not possible to compute inference in closed form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called the linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which generalizes the Cauchy, L\\'evy and Gaussian distributions, we show, for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions; in fact, LCMs are defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Another potential application is iterative decoding of linear channels with non-Gaussian noise.", "full_text": "Inference with Multivariate Heavy-Tails\n\nin Linear Models\n\nDanny Bickson and Carlos Guestrin\n\nMachine Learning Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\n{bickson,guestrin}@cs.cmu.edu\n\nAbstract\n\nHeavy-tailed distributions naturally occur in many real-life problems. Unfortunately, it is typically not possible to compute inference in closed form in graphical models which involve such heavy-tailed distributions.\nIn this work, we propose a novel simple linear graphical model for independent latent random variables, called the linear characteristic model (LCM), defined in the characteristic function domain. 
Using stable distributions, a heavy-tailed family of distributions which generalizes the Cauchy, Lévy and Gaussian distributions, we show, for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions; in fact, LCMs are defined for any random variables (discrete, continuous or a mixture of both).\nWe provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Another potential application is iterative decoding of linear channels with non-Gaussian noise.\n\n1\n\nIntroduction\n\nHeavy-tailed distributions naturally occur in many real-life phenomena, for example in computer networks [23, 14, 16]. Typically, a small set of machines is responsible for a large fraction of the consumed network bandwidth. Equivalently, a small set of users generates a large fraction of the network traffic. Another common property of communication networks is that network traffic tends to be linear [8, 23]. Linearity is explained by the fact that the total incoming traffic at a node is composed of the sum of distinct incoming flows.\n\nRecently, several works have proposed to use linear multivariate statistical methods for monitoring network health, performance analysis or intrusion detection [15, 13, 16, 14]. Some aspects of network traffic make the task of modeling it using probabilistic graphical models challenging. In many cases, the underlying heavy-tailed distributions are difficult to work with analytically. 
That is why existing solutions in the area of network monitoring involve various approximations of the joint probability distribution function using a variety of techniques: mixtures of distributions [8], spectral decomposition [13], histograms [14], sketches [16], entropy [14], sampled moments [23], etc.\n\nIn the current work, we propose a novel linear probabilistic graphical model called the linear characteristic model (LCM) to model linear interactions of independent heavy-tailed random variables (Section 3). Using the stable family of distributions (defined in Section 2), a family of heavy-tailed distributions, we show how to compute both exact and approximate inference (Section 4). Using real data from the domain of computer networks, we demonstrate the applicability of our proposed methods for computing inference in LCM (Section 5).\n\nWe summarize our contributions below:\n\n\u2022 We propose a new linear graphical model called LCM, defined as a product of factors in the cf domain. We show that our model is well defined for any collection of random variables, since any random variable has a matching cf.\n\n\u2022 Computing inference in closed form in linear models involving continuous variables is typically limited to the well-understood cases of Gaussians and simple regression problems in exponential families. 
In this work, we extend the applicability of belief propagation to the stable family of distributions, a generalization of the Gaussian, Cauchy and Lévy distributions. We analyze both exact and approximate inference algorithms, including convergence and accuracy of the solution.\n\n\u2022 We demonstrate the applicability of our proposed method, performing inference in real settings, using network tomography data obtained from the PlanetLab network.\n\n1.1 Related work\n\nThere are three main relevant works in the machine learning domain which are related to the current work: Convolutional Factor Graphs (CFG), Copulas and Independent Component Analysis (ICA). Below we briefly review them and explain why a new graphical model is needed.\n\nConvolutional Factor Graphs (CFG) [18, 19] are a graphical model for representing linear relations of independent latent random variables. CFGs assume that the probability distribution factorizes as a convolution of potentials, and propose to use duality to derive a product factorization in the characteristic function (cf) domain. In this work we extend CFG by defining the graphical model as a product of factors in the cf domain. Unlike CFGs, LCMs are always defined, for any probability distribution, while CFGs are not defined when the inverse Fourier transform does not exist.\n\nA closely related technique is the Copula method [22, 17]. Similar to our work, Copulas assume a linear underlying model. The main difference is that Copulas transform each marginal variable into a uniform distribution and perform inference in the cumulative distribution function (cdf) domain. In contrast, we perform inference in the cf domain. In our case of interest, when the underlying distributions are stable, Copulas cannot be used, since stable distributions are not analytically expressible in the cdf domain.\n\nA third related technique is ICA (independent component analysis) on linear models [27]. 
Assuming a linear model Y = AX,1 where the observations Y are given, the task is to estimate the linear relation matrix A, using only the fact that the latent variables X are statistically mutually independent. The two techniques (LCM and ICA) are complementary, since ICA can be used to learn the linear model, while LCM is used for computing inference in the learned model.\n\n2 Stable distribution\n\nThe stable distribution [30] is a family of heavy-tailed distributions, where the Cauchy, Lévy and Gaussian distributions are special instances of this family (see Figure 1). Stable distributions are used in different problem domains, including economics, physics, geology and astronomy [24]. Stable distributions are useful since they can model heavy-tailed distributions that naturally occur in practice. As we will soon show with our networking example, network flows exhibit an empirical distribution which can be modeled remarkably well by stable distributions.\nWe denote a stable distribution by a tuple of four parameters: S(α, β, γ, δ). We call α the characteristic exponent, β the skew parameter, γ a scale parameter and δ a shift parameter. For example (Fig. 1), a Gaussian N(μ, σ²) is a stable distribution with the parameters S(2, 0, σ/√2, μ), a Cauchy distribution Cauchy(γ, δ) is stable with S(1, 0, γ, δ) and a Lévy distribution Lévy(γ, δ) is stable with S(1/2, 1, γ, δ). In the following we formally define a stable distribution, beginning with a unit-scale, zero-centered stable random variable.\nDefinition 2.1. Here −1 ≤ β ≤ 1, a, b ∈ R, a ≠ 0, and Z is a random variable with characteristic function2 given in (1) below. [25, Def. 
1.6] A random variable X is stable if and only if X ∼ aZ + b, where 0 < α ≤ 2 and\n\nE[exp(iuZ)] = exp(−|u|^α [1 − iβ tan(πα/2) sign(u)]) for α ≠ 1, and E[exp(iuZ)] = exp(−|u| [1 + iβ (2/π) sign(u) log(|u|)]) for α = 1. (1)\n\n1The linear model is formally defined in Section 3.\n2We formally define the characteristic function in the supplementary material.\n\nA basic property of stable laws is that weighted sums of α-stable random variables are α-stable (hence the family is called stable). This property will be useful in the next section, where we compute inference in a linear graphical model with underlying stable distributions. The following proposition formulates this linearity.\nProposition 2.1. [25, Prop. 1.16]\n\na) Multiplication by a scalar. If X ∼ S(α, β, γ, δ) then for any a, b ∈ R, a ≠ 0,\n\naX + b ∼ S(α, sign(a)β, |a|γ, aδ + b).\n\nb) Summation of two stable variables. If X1 ∼ S(α, β1, γ1, δ1) and X2 ∼ S(α, β2, γ2, δ2) are independent, then X1 + X2 ∼ S(α, β, γ, δ), where\n\nβ = (β1 γ1^α + β2 γ2^α)/(γ1^α + γ2^α), γ^α = γ1^α + γ2^α, δ = δ1 + δ2 + ξ,\n\nξ = tan(πα/2) [βγ − β1γ1 − β2γ2] for α ≠ 1, and ξ = (2/π) [βγ log γ − β1γ1 log γ1 − β2γ2 log γ2] for α = 1.\n\nNote that both X1, X2 have to be distributed with the same characteristic exponent α.\n\n3 Linear characteristic models\n\nA drawback of general stable distributions is that they do not have a closed-form equation for the pdf or the cdf. This fact makes the handling of stable distributions more difficult. 
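The closure rules of Proposition 2.1 operate purely on parameter tuples, so they can be exercised directly in code. The following is a minimal illustrative sketch (our own, not code from the paper); the Gaussian sanity check relies on the identification N(μ, σ²) = S(2, 0, σ/√2, μ) given above.

```python
import math

def scale_stable(a, b, params):
    """Prop. 2.1(a), as quoted in the text: parameters of aX + b."""
    alpha, beta, gamma, delta = params
    return (alpha, math.copysign(1.0, a) * beta, abs(a) * gamma, a * delta + b)

def sum_stable(p1, p2):
    """Prop. 2.1(b): parameters of X1 + X2 for independent alpha-stable X1, X2."""
    alpha, b1, g1, d1 = p1
    alpha2, b2, g2, d2 = p2
    assert alpha == alpha2, "both summands must share the characteristic exponent"
    ga = g1 ** alpha + g2 ** alpha              # gamma^alpha of the sum
    gamma = ga ** (1.0 / alpha)
    beta = (b1 * g1 ** alpha + b2 * g2 ** alpha) / ga
    if alpha != 1:
        xi = math.tan(math.pi * alpha / 2) * (beta * gamma - b1 * g1 - b2 * g2)
    else:
        xi = (2 / math.pi) * (beta * gamma * math.log(gamma)
                              - b1 * g1 * math.log(g1) - b2 * g2 * math.log(g2))
    return (alpha, beta, gamma, d1 + d2 + xi)
```

For example, combining S(2, 0, 2/√2, 1) and S(2, 0, 3/√2, 2), i.e., N(1, 4) and N(2, 9), yields scale √6.5 and shift 3, matching N(3, 13). Note that the closure manipulates only the four parameters; it does not yield a closed-form pdf.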
This is probably one of the reasons stable distributions are rarely used in the probabilistic graphical models community.\n\nWe propose a novel approach for modeling linear interactions between random variables distributed according to stable distributions, using a new linear probabilistic graphical model called LCM. A new graphical model is needed, since previous approaches like CFG or the Copula method cannot be used for computing inference in closed form in linear models involving stable distributions, because they require computation in the pdf or cdf domains, respectively.\nNext we define a general stable random variable.\nDefinition 2.2. [25, Def. 1.7] A random variable X is S(α, β, γ, δ) if\n\nX ∼ γ(Z − β tan(πα/2)) + δ for α ≠ 1, and X ∼ γZ + δ for α = 1,\n\nwhere Z is given by (1). X has characteristic function\n\nE[exp(iuX)] = exp(−γ^α |u|^α [1 − iβ tan(πα/2) sign(u) (|γu|^(1−α) − 1)] + iδu) for α ≠ 1, and E[exp(iuX)] = exp(−γ|u| [1 + iβ (2/π) sign(u) log(γ|u|)] + iδu) for α = 1.\n\nWe start by defining a linear model:\nDefinition 3.1. (Linear model) Let X1, ···, Xn be a set of mutually independent random variables.3 Let Y1, ···, Ym be a set of observations obtained using the linear model:\n\nYi ∼ Σj Aij Xj ∀i,\n\nwhere Aij ∈ R are weighting scalars. We denote the linear model in matrix notation as Y = AX.\nLinear models are useful in many domains. For example, in linear channel decoding, X are the transmitted codewords, the matrix A is the linear channel transformation and Y is a vector of observations. When X is distributed according to a Gaussian distribution, the channel model is called an AWGN (additive white Gaussian noise) channel. Typically, the decoding task is finding the most probable X, given A and the observation Y. 
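The characteristic function in Definition 2.2 can be checked numerically against the two classical special cases. A small illustrative sketch (ours, not from the paper), using the α ≠ 1 and α = 1 branches as quoted above:

```python
import cmath
import math

def stable_cf(u, alpha, beta, gamma, delta):
    """Characteristic function of S(alpha, beta, gamma, delta), per the two
    branches of Def. 2.2 as quoted in the text."""
    if u == 0:
        return 1.0 + 0j
    s = math.copysign(1.0, u)
    if alpha != 1:
        inner = 1 - 1j * beta * math.tan(math.pi * alpha / 2) * s * (
            abs(gamma * u) ** (1 - alpha) - 1)
        return cmath.exp(-(gamma ** alpha) * abs(u) ** alpha * inner + 1j * delta * u)
    inner = 1 + 1j * beta * (2 / math.pi) * s * math.log(gamma * abs(u))
    return cmath.exp(-gamma * abs(u) * inner + 1j * delta * u)
```

With β = 0, the α = 2 branch reduces to the Gaussian cf exp(iμu − σ²u²/2) (taking γ = σ/√2, δ = μ), and the α = 1 branch reduces to the Cauchy cf exp(iδu − γ|u|).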
Despite the fact that the Xj are assumed statistically mutually independent when transmitting, given an observation Y the Xj are not independent any more, since they are correlated via the observation. Besides the network application we focus on, another potential application of our current work is linear channel decoding with stable, non-Gaussian, noise.\n\nIn the rest of this section we develop the foundations for computing inference in a linear model using underlying stable distributions. Because stable distributions do not have closed-form equations in the pdf domain, we must work in the cf domain. Hence, we define a dual linear model in the cf domain.\n\n3We do not limit the type of random variables. The variables may be discrete, continuous, or a mixture of both.\n\n3.1 Duality of LCM and CFG\n\nCFG [19] have shown that the joint probability p(x, y) of any linear model can be factorized as a convolution:\n\np(x, y) = p(x1, ···, xn, y1, ···, ym) = ∗∏i p(xi, y1, ···, ym). (2)\n\nInformally, LCM is the dual representation of (2) in the characteristic function domain. Next, we define LCM formally, and establish the duality to the factorization given in (2).\nDefinition 3.2. (LCM) Given the linear model Y = AX, we define the linear characteristic model\n\nφ(t1, ···, tn, s1, ···, sm) ≜ ∏i φ(ti, s1, ···, sm),\n\nwhere φ(ti, s1, ···, sm) is the characteristic function4 of the joint distribution p(xi, y1, ···, ym).\nThe following two theorems establish duality between the LCM and its dual representation in the pdf domain. This duality is well known (see for example [18, 19]), but important for explaining the derivation of LCM from the linear model.\nTheorem 3.3. 
Given a LCM, assuming p(x, y) as defined in (2) has a closed form and the Fourier transform F[p(x, y)] exists, then F[p(x, y)] = φ(t1, ···, tn, s1, ···, sm).\nTheorem 3.4. Given a LCM, when the inverse Fourier transform exists, then F^{−1}(φ(t1, ···, tn, s1, ···, sm)) = p(x, y) as defined in (2).\nThe proofs of all theorems are deferred to the supplementary material. Whenever the inverse Fourier transform exists, the LCM has a dual CFG model. In contrast to the CFG model, LCMs are always defined, even when the inverse Fourier transform does not exist. The duality is useful, since it allows us to compute inference in either representation, whichever is more convenient.\n\n4 Main result: exact and approximate inference in LCM\n\nThis section presents our main result. Typically, exact inference in linear models with continuous variables is limited to the well-understood cases of Gaussians and simple regression problems in exponential families. In this section we extend previous results to show how to compute inference (both exact and approximate) in linear models with underlying stable distributions.\n\n4.1 Exact inference in LCM\nThe inference task typically involves computation of the marginal distribution or a conditional distribution of a probability function. For the rest of the discussion we focus on the marginal distribution. The marginal distribution of the node xi is typically computed by integrating out all other nodes:\n\np(xi|y) ∼ ∫ p(x, y) dX\\i,\n\nwhere X\\i is the set of all nodes excluding node i. Unfortunately, when working with stable distributions, the above integral is intractable. Instead, we propose to use a dual operation called slicing, computed in the cf domain.\nDefinition 4.1.\n(a) Joint cf. Given random variables X1, X2, the joint cf is φX1,X2(t1, t2) = E[e^{i t1 x1 + i t2 x2}].\n(b) Marginal cf. 
The marginal cf is derived from the joint cf by evaluation: φX1(t1) = φX1,X2(t1, 0). This operation is called slicing or evaluation [28, p. 110]. We denote the slicing operation as φX1(t1) = φX1,X2(t1, t2)]t2=0.\nThe following theorem establishes the fact that the marginal distribution can be computed in the cf domain, by using the slicing operation.\n\n4Defined in the supplementary material.\n\nAlgorithm 1: Exact inference in LCM using LCM-Elimination.\nfor i ∈ |T| {\nEliminate ti by computing φm+i(N(ti)) = ∏φj∈N(ti) φ(tj, s1, ···, sm)]ti=0.\nRemove φ(tj, s1, ···, sm) and ti from the LCM. Add φm+i to the LCM.\n}\nFinally: if F^{−1} exists, compute p(xi) = F^{−1}(φfinal).\n\nFigure 1: The three special cases of stable distribution where a closed-form pdf exists (Cauchy, Gaussian and Lévy; plot omitted).\n\nTheorem 4.2. Given a LCM, the marginal cf of the random variable Xi can be computed using\n\nφ(ti) = ∏j φ(tj, s1, ···, sm)]T\\i=0. (3)\n\nIn case the inverse Fourier transform exists, the marginal probability of the hidden variable Xi is given by p(xi) ∼ F^{−1}{φ(ti)}.\nBased on the results of Thm. 4.2 we propose an exact inference algorithm, LCM-Elimination, for computing the marginal cf (shown in Algorithm 1). We use the notation N(k) for the set of graph neighbors of node k, excluding k5. T is the set {t1, ···, tn}.\nLCM-Elimination is dual to the CFG-Elimination algorithm [19]. LCM-Elimination operates in the cf domain, evaluating one variable at a time and updating the remaining graphical model accordingly. The order of elimination does not affect correctness (although it may affect efficiency). Once
Once\nthe marginal cf \u03d5(ti), is computed, assuming the inverse Fourier transform exists, we can compute\nthe desired marginal probability p(xi).\n\n4.2 Exact inference in stable distributions\n\nAfter defining LCM and showing that inference can be computed in the cf domain, we are finally\nready to show how to compute exact inference in a linear model with underlying stable distributions.\nWe assume that all observation nodes Yi are distributed according to a stable distribution. From\nthe linearity property of stable distribution, it is clear that the hidden variables Xi are distributed\naccording to a stable distribution as well. The following theorem is one of the the novel contributions\nof this work, since as far as we know, no closed-form solution was previously derived.\nTheorem 4.3. Given a LCM, Y = AX +Z, with n i.i.d. hidden variables Xi \u223c S(\u03b1, \u03b2xi , \u03b3xi , \u03b4xi ),\nn i.i.d. noise variables with known parameters Zi \u223c S(\u03b1, \u03b2zi , \u03b3zi , \u03b4zi ), and n observations yi \u2208 R,\nassuming the matrix An\u00d7n is invertible6, then\na) the observations Yi are distributed according to stable distribution Yi \u223c S(\u03b1, \u03b2yi , \u03b3yi , \u03b4yi ) with\nthe following parameters:\n\n\u03b3y = |A|\u03b1\u03b3x + \u03b3z, \u03b2y = \u03b3\u2212\u03b1\n\ny (cid:12) [(|A| (cid:12) sign(A))(\u03b2x (cid:12) \u03b3x) + \u03b2z (cid:12) \u03b3z],\n\n2 )[\u03b2y (cid:12) \u03b3y \u2212 A(\u03b2x (cid:12) \u03b3x) \u2212 \u03b2z (cid:12) \u03b3z]\n\n\u03bey =(tan( \u03c0\u03b1\nb) the result of exact inference for computing the marginals p(xi|y) \u223c S(\u03b1, \u03b2xi|y, \u03b3xi|y, \u03b4xi|y) is\ngiven in vector notation:\n\n\u03b1 6= 1\n\u03c0 [\u03b2y (cid:12) \u03b3y (cid:12) log(\u03b3y) \u2212 A (cid:12) log(|A|)(\u03b2x (cid:12) \u03b3x) \u2212 A(\u03b2x (cid:12) \u03b3x (cid:12) log(\u03b3x)) \u2212 \u03b2z (cid:12) \u03b3z] \u03b1 = 1\n\n,\n\n2\n\n\u03b4y = A\u03b4x + \u03bey\n\n\u03b2x|y = 
γx|y^{−α} ⊙ [(|A|^α ⊙ sign(A))^{−1}(βy ⊙ γy^α)], γx|y^α = (|A|^α)^{−1} γy^α, δx|y = A^{−1}[δy − ξx|y], (4)\n\nξx|y = tan(πα/2) [βy ⊙ γy − A(βx|y ⊙ γx|y)] for α ≠ 1, and ξx|y = (2/π) [βy ⊙ γy ⊙ log(γy) − (A ⊙ log(|A|))(βx|y ⊙ γx|y) − A(βx|y ⊙ γx|y ⊙ log(γx|y))] for α = 1, (5)\n\nwhere ⊙ is the entrywise product (of both vectors and matrices), |A| is the entrywise absolute value, log(A), A^α and sign(A) are entrywise matrix operations, and βx ≜ [βx1, ···, βxn]^T, and similarly for βy, βz, γx, γy, γz, δx, δy, δz.\n\n5A more detailed explanation of the construction of a graphical model out of the linear relation matrix A is found in [4, Chapter 2.3].\n6To simplify the discussion we assume that the lengths of both the hidden and observation vectors are |X| = |Y| = n. However, the results can be equivalently extended to the more general case where |X| = n, |Y| = m, m ≠ n. See for example [6].\n\nAlgorithm 2: Approximate inference in LCM using (a) the Characteristic-Slice-Product (CSP) algorithm and (b) the Integral-Convolution (IC) algorithm; both are exact on tree topologies. (c) The Stable-Jacobi algorithm.\n\n(a) CSP. Initialize: mij(tj) = 1 ∀Aij ≠ 0. Iterate until convergence: mij(tj) = φi(ti, s1, ···, sm) ∏k∈N(i)\\j mki(ti)]ti=0. Finally: φ(ti) = φi(ti, s1, ···, sm) ∏k∈N(i) mki(ti).\n\n(b) IC. Initialize: mij(xj) = 1 ∀Aij ≠ 0. Iterate until convergence: mij(xj) = ∫xi p(xi, y1, ···, ym) ∗ ∗∏k∈N(i)\\j mki(xi) dxi. Finally: p(xi) = p(xi, y1, ···, ym) ∗ ∗∏k∈N(i) mki(xi).\n\n(c) Stable-Jacobi. Initialize: βxi|y, γxi|y, δxi|y = S(α, 0, 0, 0) ∀i. Iterate until convergence:\n\nγxi|y^α = γyi^α − Σj≠i |Aij|^α γxj|y^α, βxi|y = βyi γyi^α − Σj≠i sign(Aij)|Aij|^α βxj|y,\n\nξxi|y = tan(πα/2) [βyi γyi − Σj Aij βxj|y γxj|y^{(1−α)/α}] for α ≠ 1, and ξxi|y = (2/π) [βyi γyi log(γyi) − Σj:Aij≠0 Aij log(|Aij|) βxj|y γxj|y^{(1−α)/α} − Σj Aij βxj|y γxj|y log(γxj|y^{(1−α)/α})] for α = 1, (6)\n\nδxi|y = δyi − Σj≠i Aij δxj|y − ξxi|y.\n\nOutput: xi|y ∼ S(α, βxi|y/γxi|y^α, γxi|y, δxi|y).\n\n4.3 Approximate Inference in LCM\nTypically, the cost of exact inference may be expensive. For example, in the related linear model of a multivariate Gaussian (a special case of the stable distribution), LCM-Elimination reduces to a Gaussian-elimination-type algorithm with a cost of O(n^3), where n is the number of variables. Approximate methods for inference like belief propagation [26] usually require less work than exact inference, but may not always converge (or may converge to an unsatisfactory solution). The cost of exact inference motivates us to devise more efficient approximations.\n\nWe propose two novel algorithms that are variants of belief propagation for computing approximate inference in LCM. The first, Characteristic-Slice-Product (CSP), is defined in LCM (shown in Algorithm 2(a)). The second, the Integral-Convolution (IC) algorithm (Algorithm 2(b)), is its dual in CFG. As in belief propagation, our algorithms are exact on tree graphical models. The following theorem establishes this fact.\nTheorem 4.4. 
Given an LCM with underlying tree topology (the matrix A is an irreducible adjacency matrix of a tree graph), the CSP and IC algorithms compute exact inference, resulting in the marginal cf and the marginal distribution, respectively.\nThe basic property which allows us to devise the CSP algorithm is that LCM is defined as a product of factors in the cf domain. Typically, belief propagation algorithms are applied to a probability distribution which factors as a product of potentials in the pdf domain. The sum-product algorithm uses the distributivity of the integral and product operations to devise an efficient recursive evaluation of the marginal probability. Equivalently, the Characteristic-Slice-Product algorithm uses the distributivity of the slicing and product operations to compute the marginal cf efficiently in the cf domain, as shown in Theorem 4.4. In a similar way, the Integral-Convolution algorithm uses the distributivity of the integral and convolution operations to perform efficient inference in the pdf domain. Note that the original CFG work [18, 19] did not consider approximate inference. Hence our proposed approximate inference algorithms further extend the CFG model.\n\n4.4 Approximate inference for stable distributions\nFor the case of stable distributions, we derive an approximation algorithm, Stable-Jacobi (Algorithm 2(c)), from the CSP update rules. The algorithm is derived by substituting the convolution and multiplication-by-scalar operations (Prop. 
2.1 b,a) into the update rules of the CSP algorithm given in Algorithm 2(a).\n\n[Figure 2 plots omitted: (a) percentage of bandwidth vs. source port, showing empirical data and a Lévy fit (5.4648e-04, 9.99e-01); (b) the core network graph; (c) L2 change vs. the previous iteration for β, γ, δ.]\n\nFigure 2: (a) Distribution of network flows on a typical PlanetLab host is fitted quite well with a Lévy distribution. (b) The core of the PlanetLab network. 1% of the flows consists of 19% of the total bandwidth. (c) Convergence of Stable-Jacobi.\n\nLike belief propagation, our approximate algorithm Stable-Jacobi is not guaranteed to converge on general graphs containing cycles. We have analyzed the evolution dynamics of the update equations of Stable-Jacobi and derived sufficient conditions for convergence. Furthermore, we have analyzed the accuracy of the approximation. Not surprisingly, the sufficient conditions for convergence relate to the properties of the linear transformation matrix A. The following theorem is one of the main novel contributions of this work. It provides both sufficient conditions for convergence of Stable-Jacobi and closed-form equations for the fixed point.\nTheorem 4.5. 
Given a LCM with n i.i.d. hidden variables Xi and n observations Yi distributed according to a stable distribution Yi ∼ S(α, βyi, γyi, δyi), assuming the linear relation matrix An×n is invertible and normalized to a unit diagonal7, Stable-Jacobi (as given in Algorithm 2(c)) converges to a unique fixed point under the following sufficient conditions for convergence (both should hold):\n\n(1) ρ(|R|^α) < 1, (2) ρ(R) < 1,\n\nwhere ρ(R) is the spectral radius (the largest absolute value of the eigenvalues of R), R ≜ I − A, |R| is the entrywise absolute value and |R|^α is the entrywise exponentiation. Furthermore, the unique fixed points of convergence are given by equations (4)-(5). The algorithm converges to the exact marginals for the linear-stable channel.8\n\n5 Application: Network flow monitoring\nIn this section we propose a novel application of inference in LCMs: modeling the network traffic flows of a large operational worldwide testbed. Additional experimental results using synthetic examples are found in the supplementary material. Network monitoring is an important problem in the monitoring and anomaly detection of communication networks [15, 16, 8]. We obtained Netflow PlanetLab network data [10] collected on 25 January 2010. The PlanetLab network [1] is a distributed networking testbed with around 1000 server nodes scattered over about 500 sites around the world. We define a network flow as a directed edge between a transmitting and a receiving host. The number of packets transmitted in this flow is the scalar edge weight.\n\nWe propose to use LCMs for modeling the distribution of network flows. Figure 2(a) plots a distribution of flows, sorted by their bandwidth, on a typical PlanetLab node. Empirically, we found that the network flow distribution on a single PlanetLab node is fitted quite well by a Lévy distribution, a stable distribution with α = 0.5, β = 1. 
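The Lévy law used for this fit is one of the three members of the stable family with a closed-form pdf (Figure 1). As a small illustrative aside (standard closed-form expressions, not code from the paper), its pdf and cdf can be written directly:

```python
import math

def levy_pdf(x, gamma, delta):
    """pdf of Levy(gamma, delta), supported on x > delta (standard closed form)."""
    if x <= delta:
        return 0.0
    z = x - delta
    return math.sqrt(gamma / (2 * math.pi)) * math.exp(-gamma / (2 * z)) / z ** 1.5

def levy_cdf(x, gamma, delta):
    """cdf of Levy(gamma, delta): erfc(sqrt(gamma / (2 (x - delta))))."""
    if x <= delta:
        return 0.0
    return math.erfc(math.sqrt(gamma / (2 * (x - delta))))
```

The closed-form cdf erfc(√(γ/(2(x − δ)))) is what makes the quality of a Lévy fit, as in Figure 2(a), easy to assess against an empirical flow distribution.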
The empirical means are mean(γ) ≈ 1e−4, mean(δ) ≈ 1. For performing the fitting, we used Mark Veillette's Matlab stable distribution package [31].\nUsing previously proposed techniques utilizing histograms [16] for tracking the flow distribution in Figure 2(a), we would need to store 40 values (the percentage of bandwidth for each source port). In contrast, by approximating the network flow distribution with stable distributions, we need only 4 parameters (α, β, γ, δ)! Thus we dramatically reduce the storage requirements. Furthermore, using the theory developed in the previous sections, we are able to linearly aggregate the distributions of flows in clusters of nodes.\n\n7When the matrix A is positive definite it is always possible to normalize it to a unit diagonal. The normalized matrix is D^{−1/2}AD^{−1/2}, where D = diag(A). Normalizing to a unit diagonal is done to simplify the convergence analysis (as done for example in [12]) but does not limit the generality of the proposed method.\n8Note that there is an interesting relation to the walk-summability convergence condition [12] of belief propagation in the Gaussian case: ρ(|R|) < 1. However, our results are more general, since they apply for any characteristic exponent 0 < α ≤ 2 and not just for α = 2 as in the Gaussian case.\n\nWe extracted a connected component of traffic flows connecting the core network (652 nodes). We fitted a stable distribution characterizing the flow behavior of each machine. A partition of 376 machines serves as the observed flows Yi (where the flow distribution is known). The task is to predict the distribution of the remaining 376 unobserved flows Xi, based on the observed traffic flows (the entries of Aij). We ran approximate inference using Stable-Jacobi and compared the results to the exact result computed by LCM-Elimination. 
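For intuition about the geometric convergence observed here, note that in the Gaussian special case (α = 2, β = 0, so ξ = 0) the shift update of Stable-Jacobi, δxi|y ← δyi − Σj≠i Aij δxj|y, is exactly the classical Jacobi iteration for solving Aδx|y = δy with a unit-diagonal A. A minimal illustrative sketch (our own, not the paper's code), together with a crude estimate of ρ(R) from Theorem 4.5's sufficient condition:

```python
import math

def spectral_radius(R, k=60):
    """Crude estimate of rho(R) via ||R^k v||^(1/k) for a small dense matrix
    (list of rows); adequate for checking the sufficient condition rho(R) < 1."""
    n = len(R)
    v = [1.0] * n
    for _ in range(k):
        v = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in v)) ** (1.0 / k)

def jacobi(A, y, iters=100):
    """Jacobi iteration x <- y - (A - I) x for a unit-diagonal matrix A,
    mirroring the delta update of Stable-Jacobi in the Gaussian case."""
    n = len(A)
    x = [0.0] * n
    for _ in range(iters):
        x = [y[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)
             for i in range(n)]
    return x
```

With R = I − A and ρ(R) < 1, the iteration contracts geometrically, which matches the ρ(R) = 0.02 behavior reported for the PlanetLab experiment in Figure 2(c).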
We emphasize again that using the related techniques (the Copula method, CFG, and ICA) it is not possible to compute exact inference for the problem at hand. In the supplementary material, we provide a detailed comparison with two previous approximation algorithms: non-parametric BP (NBP) and expectation propagation (EP).\n\nFigure 2(c) plots the convergence of the three parameters β, γ, δ as a function of the iteration number of the Stable-Jacobi algorithm. Note that the convergence speed is geometric (ρ(R) = 0.02 ≪ 1). Regarding computation overhead, the exact LCM-Elimination algorithm requires 4·376^3 operations, while Stable-Jacobi converged to an accuracy of 1e−5 in only 4·376^2·25 operations. An additional benefit of Stable-Jacobi is that it is a distributed algorithm, naturally suitable for communication networks. Source code for some of the algorithms presented here can be found at [3].\n\n6 Conclusion and future work\n\nWe have presented a novel linear graphical model called LCM, defined in the cf domain. We have shown for the first time how to perform exact and approximate inference in a linear multivariate graphical model when the underlying distributions are stable. We have discussed an application of our construction for computing inference over network flows.\n\nWe have proposed borrowing ideas from belief propagation for computing efficient inference, based on the distributivity property of the slice-product operations and the integral-convolution operations. We believe that other problem domains may benefit from this construction, and plan to pursue this as future work.\n\nWe believe there are several exciting directions for extending this work. Other families of distributions, like geometric stable distributions or the Wishart distribution, can be analyzed in our model. The Fourier transform can be replaced with a more general kernel transform, creating richer models.\n\nAcknowledgement\n\nD. 
Bickson would like to thank Andrea Pagnani (ISI) for inspiring the direction of this research, John P. Nolan (American University) for sharing parts of his excellent book about stable distributions online, Mark Veillette (Boston University) for sharing his stable distribution code online, Jason K. Johnson (LANL) for assisting in the convergence analysis, and Sapan Bhatia and Marc E. Fiuczynski (Princeton University) for providing the PlanetFlow data. This research was supported by ARO MURI W911NF0710287, ARO MURI W911NF0810242, NSF Mundo IIS-0803333 and NSF Nets-NBD CNS-0721591.

References

[1] PlanetLab Network Homepage http://www.planet-lab.org/.

[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.

[3] D. Bickson. Linear characteristic graphical models Matlab toolbox. Carnegie Mellon University. Available at http://www.cs.cmu.edu/∼bickson/stable/.

[4] D. Bickson. Gaussian Belief Propagation: Theory and Application. PhD thesis, The Hebrew University of Jerusalem, 2008.

[5] D. Bickson, D. Baron, A. T. Ihler, H. Avissar, and D. Dolev. Fault identification via non-parametric belief propagation. IEEE Trans. on Signal Processing, to appear, 2010.

[6] D. Bickson, O. Shental, P. H. Siegel, J. K. Wolf, and D. Dolev. Gaussian belief propagation based multiuser detection. In IEEE Int. Symp. on Inform. Theory (ISIT), Toronto, Canada, July 2008.

[7] M. Briers, A. Doucet, and S. S. Singh. Sequential auxiliary particle belief propagation. In International Conference on Information Fusion, pages 705–711, 2005.

[8] A. Chen, J. Cao, and T. Bu. Network tomography: Identifiability and Fourier domain estimation. In INFOCOM 2007: 26th IEEE International Conference on Computer Communications, pages 1875–1883, May 2007.

[9] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[10] M. Huang, A.
Bavier, and L. Peterson. PlanetFlow: maintaining accountability for network services. SIGOPS Oper. Syst. Rev., 40(1):89–94, 2006.

[11] A. T. Ihler, E. Sudderth, W. Freeman, and A. Willsky. Efficient multiscale sampling from products of Gaussian mixtures. In Neural Information Processing Systems (NIPS), Dec. 2003.

[12] J. Johnson, D. Malioutov, and A. Willsky. Walk-sum interpretation and analysis of Gaussian belief propagation. In Advances in Neural Information Processing Systems 18, pages 579–586, 2006.

[13] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In SIGCOMM '04: Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, pages 219–230, New York, NY, USA, October 2004.

[14] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature distributions. In SIGCOMM '05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications, pages 217–228, New York, NY, USA, 2005. ACM.

[15] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft. Structural analysis of network traffic flows. In SIGMETRICS '04/Performance '04: Proceedings of the joint international conference on Measurement and modeling of computer systems, pages 61–72, New York, NY, USA, June 2004.

[16] X. Li, F. Bian, M. Crovella, C. Diot, R. Govindan, G. Iannaccone, and A. Lakhina. Detection and identification of network anomalies using sketch subspaces. In IMC '06: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 147–152, New York, NY, USA, 2006. ACM.

[17] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, to appear, 2009.

[18] Y. Mao and F. R.
Kschischang. On factor graphs and the Fourier transform. IEEE Trans. Inform. Theory, 51(8):1635–1649, August 2005.

[19] Y. Mao, F. R. Kschischang, and B. J. Frey. Convolutional factor graphs as probabilistic models. In UAI '04: Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 374–381, Arlington, Virginia, United States, 2004. AUAI Press.

[20] R. J. Marks II. Handbook of Fourier Analysis and Its Applications. Oxford University Press, 2009.

[21] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI '01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[22] R. B. Nelsen. An Introduction to Copulas. Springer Series in Statistics, second edition, 2006.

[23] H. X. Nguyen and P. Thiran. Network loss inference with second order statistics of end-to-end flows. In IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 227–240, New York, NY, USA, 2007. ACM.

[24] J. P. Nolan. Bibliography on stable distributions, processes and related topics. Technical report, 2010.

[25] J. P. Nolan. Stable Distributions: Models for Heavy Tailed Data. Birkhäuser, Boston, 2010. In progress, Chapter 1 online at http://academic2.american.edu/∼jpnolan.

[26] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, 1988.

[27] H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57(9):3498–3511, May 2009.

[28] T. T. Soong. Fundamentals of Probability and Statistics for Engineers. Wiley, 2004.

[29] E. Sudderth, A. T. Ihler, W. Freeman, and A. Willsky. Nonparametric belief propagation.
In Conference on Computer Vision and Pattern Recognition (CVPR), June 2003.

[30] V. V. Uchaikin and V. M. Zolotarev. Chance and Stability: Stable Distributions and their Applications. VSP, Utrecht, 1999.

[31] M. Veillette. Stable distribution Matlab package. Boston University. Available at http://math.bu.edu/people/mveillet/.

[32] A. Yener, R. Yates, and S. Ulukus. CDMA multiuser detection: A nonlinear programming approach. IEEE Trans. on Communications, 50(6):1016–1024, 2002.