{"title": "Beyond Gaussian Processes: On the Distributions of Infinite Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 275, "page_last": 282, "abstract": null, "full_text": "Beyond Gaussian Processes: On the Distributions of Infinite Networks\nRicky Der Department of Mathematics University of Pennsylvania Philadelphia, PA 19104 rickyder@math.upenn.edu Daniel Lee Department of Electrical Engineering University of Pennsylvania Philadelphia, PA 19104 ddlee@seas.upenn.edu\n\nAbstract\nA general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.\n\n1 Introduction\nConsider the model fn (x) =\nn n 1j 1j vj h(x; uj ) vj hj (x) sn =1 sn =1\n\n(1)\n\nwhich can be viewed as a multi-layer perceptron with input x, hidden functions h, weights uj , output weights vj , and sn a sequence of normalizing constants. The work of Radford Neal [1] showed that, under certain assumptions on the parameter priors {vj , hj }, the distribution over the implied network functions fn converged to that of a Gaussian process, in the large network limit n . The main feature of this derivation consisted of an invocation of the classical Central Limit Theorem (CLT). While one cavalierly speaks of \"the\" central limit theorem, there are in actuality many different CLTs, of varying generality and effect. 
All are concerned with the limits of suitably normalised sums of independent random variables (or where some condition is imposed so that no one variable dominates the sum¹), but the limits themselves differ greatly: Gaussian, stable, infinitely divisible, or, discarding the infinitesimal assumption, none of these. It follows that in general, the asymptotic process for (1) may not be Gaussian. The following questions then arise: what is the relationship between choices of distributions on the model priors and the asymptotic distribution over the induced neural functions? Under what conditions does the Gaussian approximation hold? If there do exist non-Gaussian limit points, is it possible to construct analogous generalizations of Gaussian process regression?

¹ Typically called an infinitesimal condition -- see [4].

Previous work on these problems consists mainly of Neal's publication [1], which established that when the output weights v_j are of finite variance and i.i.d., the limiting distribution is a Gaussian process. Additionally, it was shown that when the weights are i.i.d. symmetric α-stable (SαS), the first-order marginal distributions of the functions are also SαS. Unfortunately, no mathematical analysis was presented to show that the higher-order distributions converged, though empirical evidence was suggestive of that hypothesis. Moreover, the exact form of the higher-dimensional distributions remained elusive. This paper conducts a further investigation of these questions, concentrating on the cases where the weight priors can be 1) of infinite variance, and 2) non-i.i.d. Such assumptions fall outside the ambit of the classical CLT, but are amenable to more general limit methods. In Section 2, we give a general classification of the possible limiting processes that may arise under an i.i.d.
assumption on output weights distributed from a certain class -- roughly speaking, those weights with tails asymptotic to a power law -- and provide explicit formulae for all the joint distribution functions. As a byproduct, Neal's preliminary analysis is completed: a full multivariate prescription is attained and the convergence of the finite-dimensional distributions proved. The subsequent section considers non-i.i.d. priors, specifically independent priors where the "identically distributed" assumption is discarded. An example where a finite-variance non-Gaussian process acts as a limit point for a nontrivial infinite network is presented, followed by an investigation of conditions under which the Gaussian approximation is valid, via the Lindeberg-Feller theorem. Finally, we raise the possibility of replacing network models with the processes themselves for learning applications: here, motivated by the foregoing limit theorems, the set of stable processes forms a natural generalization of the Gaussian case. Classes of stable stochastic processes are examined where the parameterizations are particularly simple, as well as preliminary applications to the nonlinear regression problem.

2 Neural Network Limits
Referring to (1), we make the following assumptions: h_j(x) ≡ h(x; u_j) are uniformly bounded in x (as for instance occurs if h is associated with some fixed nonlinearity), and {u_j} is an i.i.d. sequence, so that the h_j(x) are i.i.d. for fixed x, and independent of {v_j}. With these assumptions, the choice of output priors v_j will tend to dictate large-network behavior, independently of u_j. In the sequel, we restrict ourselves to functions f_n(x) : R → R, as the respective proofs for the generalizations of x and f_n to higher-dimensional spaces are routine. Finally, all random variables are assumed to be of zero mean whenever first moments exist. For brevity, we only present sketches of proofs.

2.1 Limits with i.i.d.
priors

The Gaussian distribution has the feature that if X_1 and X_2 are statistically independent copies of a Gaussian variable X, then any linear combination of them is also Gaussian, i.e. aX_1 + bX_2 has the same distribution as cX + d for some c and d. More generally, the stable distributions [5], [6, Chap. 17] are defined to be the set of all distributions satisfying this "closure" property. If one further demands symmetry of the distribution, then the characteristic function must take the form φ(t) = e^{-σ|t|^α}, for parameters σ > 0 (called the spread) and 0 < α ≤ 2, termed the index. Since these characteristic functions are not generally twice differentiable at t = 0, the variances are infinite; the Gaussian distribution, associated with index α = 2, is the only finite-variance stable distribution. The attractive feature of stable variables, by definition, is closure under the formation of linear combinations: the linear combination of any two independent stable variables of the same index is another stable variable of that index. Moreover, the stable distributions are attraction points of distributions under a linear combiner operator, and indeed the only such distributions, in the following sense: if {Y_j} are i.i.d. and a_n + (1/s_n) Σ_{j=1}^n Y_j converges in distribution to X, then X must be stable [5]. This fact already has consequences for our network model (1): under i.i.d. priors v_j, and assuming (1) converges at all, convergence can occur only to stable variables, for each x. Multivariate analogues are defined similarly: we say a random vector X is (strictly) stable if, for every a, b ∈ R, there exists a constant c such that aX_1 + bX_2 = cX, where the X_i are independent copies of X and the equality is in distribution. A symmetric stable random vector is one which is stable and for which the distribution of X is the same as that of -X.
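As an aside, symmetric α-stable variates are straightforward to simulate, which is convenient for checking the limit theorems below numerically. The following is a minimal sketch (the helper name `sample_sas` and the spread parameterization φ(t) = e^{-spread·|t|^α} are ours) using the standard Chambers-Mallows-Stuck transform; the closure property can then be verified against the empirical characteristic function.

```python
import numpy as np

def sample_sas(alpha, spread=1.0, size=1, rng=None):
    """Symmetric alpha-stable variates with characteristic function
    exp(-spread * |t|**alpha), via the Chambers-Mallows-Stuck transform.
    (Helper name and parameterization are ours, not the paper's.)"""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    W = rng.exponential(1.0, size)                 # unit-mean exponential
    if alpha == 1.0:
        X = np.tan(U)                              # the Cauchy case
    else:
        X = (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
             * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))
    return spread ** (1.0 / alpha) * X

rng = np.random.default_rng(0)
alpha = 1.5
x1 = sample_sas(alpha, size=400_000, rng=rng)
x2 = sample_sas(alpha, size=400_000, rng=rng)
# Empirical characteristic function at t = 1: should approach exp(-1).
ecf = np.mean(np.cos(x1))
# Closure: x1 + 2*x2 is again alpha-stable, with spread 1 + 2**alpha.
ecf_sum = np.mean(np.cos(x1 + 2.0 * x2))  # approaches exp(-(1 + 2**1.5))
```

Note that for α = 2 the transform reduces (in distribution) to a Gaussian, consistent with the Gaussian being the index-2 stable law.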
The following important classification theorem gives an explicit Fourier-domain description of all multivariate symmetric stable distributions:

Theorem 1 (Kuelbs [5]). X is a symmetric α-stable vector if and only if it has characteristic function

φ(t) = exp( -∫_{S^{d-1}} |⟨t, s⟩|^α dΓ(s) )    (2)

where Γ is a finite measure on the unit (d-1)-sphere S^{d-1}, and 0 < α ≤ 2.

Remark: (2) remains unchanged on replacing Γ by the symmetrized measure Γ̃(A) = (1/2)(Γ(A) + Γ(-A)), for all Borel sets A. In this case, the (unique) symmetrized measure is called the spectral measure of the stable random vector X. Finally, stable processes are defined as indexed sets of random variables whose finite-dimensional distributions are (multivariate) stable.

First we establish the following preliminary result.

Lemma 1. Let v be a symmetric stable random variable of index 0 < α ≤ 2 and spread σ > 0. Let h be independent of v with E|h|^α < ∞. If y = hv, and {y_i} are i.i.d. copies of y, then S_n = n^{-1/α} Σ_{i=1}^n y_i converges in distribution to an α-stable variable with characteristic function φ(t) = exp{-σ|t|^α E|h|^α}.

Proof. This follows by computing the characteristic function of S_n, then using standard theorems in measure theory (e.g. [4]) to obtain lim_{n→∞} log φ_{S_n}(t) = -σ|t|^α E|h|^α.

Now we can state the first network convergence theorem.

Proposition 1. Let the network (1) have symmetric α-stable i.i.d. weights v_j of index 0 < α ≤ 2 and spread σ. Then f_n(x) = n^{-1/α} Σ_{j=1}^n v_j h_j(x) converges in distribution to a symmetric α-stable process f(x) as n → ∞. The finite-dimensional stable distribution of (f(x_1), . . . , f(x_d)), where x_i ∈ R, has characteristic function

φ(t) = exp( -σ E_h |⟨t, h⟩|^α )    (3)

where h = (h(x_1), . . . , h(x_d)), and h(x) is a random variable with the common distribution (across j) of the h_j(x). Moreover, if h = (h(x_1), . . .
, h(x_d)) has joint probability density p(h) = p(rs), with s on the sphere S^{d-1} and r the radial component of h, then the finite measure corresponding to the multivariate stable distribution of (f(x_1), . . . , f(x_d)) is given by

dΓ(s) = σ ( ∫_0^∞ r^{α+d-1} p(rs) dr ) ds    (4)

where ds is Lebesgue measure on S^{d-1}.

Proof. It suffices to show that every finite-dimensional distribution of f(x) converges to a symmetric multivariate stable characteristic function. We have Σ_{i=1}^d t_i f_n(x_i) = n^{-1/α} Σ_{j=1}^n v_j Σ_{i=1}^d t_i h_j(x_i) for constants {x_1, . . . , x_d} and (t_1, . . . , t_d) ∈ R^d. An application of Lemma 1 proves the statement. The relation between the expectation in (3) and the stable spectral measure (4) is derived from a change of variable to spherical coordinates in the d-dimensional space of h.

Remark: When α = 2, the exponent in the characteristic function (3) is a quadratic form in t, and (3) becomes the usual Gaussian multivariate distribution.

The above proposition is the rigorous completion of Neal's analysis, and gives the explicit form of the asymptotic process under i.i.d. SαS weights. More generally, we can consider output weights from the normal domain of attraction of index α, which, roughly, consists of those densities whose tails are asymptotic to |x|^{-(α+1)}, 0 < α < 2 [6, pg. 547]. With a proof similar to that of the previous theorem, one establishes

Proposition 2. Let network (1) have i.i.d. weights v_j from the normal domain of attraction of an SαS variable with index α and spread σ. Then f_n(x) = n^{-1/α} Σ_{j=1}^n v_j h_j(x) converges in distribution to a symmetric α-stable process f(x), with the joint characteristic functions given as in Proposition 1.

2.1.1 Example: Distributions with step-function priors

Let h(x) = sgn(a + ux), where a and u are independent Gaussians with zero mean.
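Before examining the local behavior of this example, Proposition 1 can be checked by a quick Monte Carlo sketch (our own illustration, with our own seeds and sample sizes): taking α = 1 (Cauchy) output weights and this step-function h, we have |h| = 1, so E_h|h(x)|^α = 1 in (3) and each marginal f_n(x) should be approximately standard Cauchy.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, x = 100, 20_000, 0.3
v = rng.standard_cauchy((reps, n))   # i.i.d. Cauchy output weights (alpha = 1)
a = rng.standard_normal((reps, n))   # hidden biases
u = rng.standard_normal((reps, n))   # hidden input weights
h = np.sign(a + u * x)               # step-function hidden units, |h| = 1
f = (v * h).sum(axis=1) / n          # normalization s_n = n^{1/alpha} = n
# Marginal of f should be standard Cauchy: its empirical characteristic
# function at t = 1 approaches exp(-1) ~ 0.368.
ecf = np.mean(np.cos(f))
```

(With α = 1 and |h| = 1 the limit is in fact exact for every finite n, since a normalized sum of i.i.d. Cauchy variables is again Cauchy; for other indices the convergence is only asymptotic.)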
From (3) it is clear that the limiting network function f(x) becomes constant (in law, hence almost surely) as |x| → ∞, so that the interesting behavior occurs in some "central region" |x| < k. Neal [1] showed that when the output weights v_j are Gaussian, the choice of the signum nonlinearity for h gives rise to local Brownian motion in the central regime. There is a natural generalization of the Brownian process within the context of symmetric stable processes, called symmetric α-stable Lévy motion. It is characterised by an indexed sequence {w_t : t ∈ R} satisfying i) w_0 = 0 almost surely, ii) independent increments, and iii) w_t - w_s distributed symmetric α-stable with spread σ = |t - s|^{1/α}. As we shall now show, the choice of step-function nonlinearity for h and symmetric α-stable priors for v_j leads to locally Lévy-stable motion, which provides a theoretical explanation for the empirical observations in [1]. Fix two nearby positions x and y, and select σ = 1 for notational simplicity. From (3) the random variable f(x) - f(y) is symmetric stable with spread parameter [E_h|h(x) - h(y)|^α]^{1/α}. For step inputs, |h(x) - h(y)| is non-zero only when the step located at -a/u falls between x and y. For small |x - y|, approximate the density of this event as uniform, so that E_h|h(x) - h(y)|^α ∝ |x - y|. Hence locally, the increment f(x) - f(y) is a symmetric stable variable with spread proportional to |x - y|^{1/α}, which is condition (iii) of Lévy motion. Next let us demonstrate that the increments are independent. Consider the vector (f(x_1) - f(x_2), f(x_2) - f(x_3), . . . , f(x_{n-1}) - f(x_n)), where x_1 < x_2 < . . . < x_n. Its joint characteristic function in the variables t_1, . . . , t_{n-1} can be calculated to be φ(t_1, . . .
, t_{n-1}) = exp( -E_h |t_1(h(x_1) - h(x_2)) + ··· + t_{n-1}(h(x_{n-1}) - h(x_n))|^α )    (5)

The disjointness of the intervals (x_{i-1}, x_i) implies that the only events with nonzero probability within the range [x_1, x_n] are the events |h(x_i) - h(x_{i-1})| = 2 for some i, and zero for all other indices. Letting p_i denote the probabilities of those events, (5) reads

φ(t_1, . . . , t_{n-1}) = exp( -2^α (p_1|t_1|^α + ··· + p_{n-1}|t_{n-1}|^α) )    (6)

which describes a vector of independent α-stable random variables, as the characteristic function splits. Thus the limiting process has independent increments. The differences between sample functions arising from Cauchy priors as opposed to Gaussian priors are evident from Fig. 1, which displays sample paths from Gaussian and Cauchy i.i.d. processes w_n and their "integrated" versions Σ_{i=1}^n w_i, simulating the Lévy motions.

Figure 1: Sample functions: (a) i.i.d. Gaussian, (b) i.i.d. Cauchy, (c) Brownian motion, (d) Lévy Cauchy-stable motion.

The sudden jumps in the Cauchy motion arise from the presence of strong outliers in the respective Cauchy i.i.d. process, which would correspond, in the network, to hidden units with heavy weighting factors v_j.

2.2 Limits with non-i.i.d. priors

We begin with an interesting example, which shows that if the "identically distributed" assumption for the output weights is dispensed with, the limiting distribution of (1) can attain a non-stable (and non-Gaussian) form. Take v_j to be independent random variables with P(v_j = 2^{-j}) = P(v_j = -2^{-j}) = 1/2. The characteristic functions can easily be computed as E[e^{itv_j}] = cos(t/2^j).
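This example is easy to check numerically (a sketch under our own seeds and sample sizes): partial sums of the ±2^{-j} weights should fill [-1, 1] with the uniform density, whose characteristic function is the sinc function sin(t)/t, and the finite product of the cosines above should converge to the same value.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 40, 100_000
signs = rng.choice([-1.0, 1.0], size=(reps, n))   # P(v_j = +-2^-j) = 1/2
S = (signs * 2.0 ** -np.arange(1, n + 1)).sum(axis=1)
in_range = bool(np.all(np.abs(S) <= 1.0))  # uniform support is [-1, 1]
std = S.std()                              # should approach 1/sqrt(3) ~ 0.577
ecf = np.mean(np.cos(2.0 * S))             # should approach sin(2)/2 ~ 0.455
# The finite cosine product converges to the same sinc value sin(2)/2:
prod = np.prod(np.cos(2.0 / 2.0 ** np.arange(1, 60)))
```

The bound |S| ≤ 1 holds exactly here, since Σ_{j≥1} 2^{-j} = 1; this is the binary-expansion picture made concrete.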
Now recall the Viète formula:

∏_{j=1}^n cos(t/2^j) = sin t / (2^n sin(t/2^n))    (7)

Taking n → ∞ shows that the limiting characteristic function is a sinc function, which corresponds to the uniform density. Selecting the signum nonlinearity for h, it is not difficult to show, with estimates on the tail of the product (7), that all finite-dimensional distributions of the neural process f_n(x) = Σ_{j=1}^n v_j h_j(x) converge, so that f_n converges in distribution to a random process whose first-order distributions are uniform².

² An intuitive proof is as follows: one thinks of the v_j as giving a binary expansion of real numbers in [-1, 1]; the prescription of the probability laws for the v_j implies all such expansions are equiprobable, manifesting in the uniform distribution.

What conditions are required on independent, but not necessarily identically distributed, priors v_j for convergence to the Gaussian? This question is answered by the classical Lindeberg-Feller theorem.

Theorem 2 (Central Limit Theorem, Lindeberg-Feller [4]). Let v_j be a sequence of independent random variables, each with zero mean and finite variance; define s_n^2 = var[Σ_{j=1}^n v_j], and assume s_n ≠ 0. Then the sequence s_n^{-1} Σ_{j=1}^n v_j converges in distribution to an N(0, 1) variable if

lim_{n→∞} (1/s_n^2) Σ_{j=1}^n ∫_{|v| ≥ ε s_n} v^2 dF_{v_j}(v) = 0    (8)

for each ε > 0, where F_{v_j} is the distribution function of v_j. Condition (8) is called the Lindeberg condition, and imposes an "infinitesimal" requirement on the sequence {v_j}, in the sense that no one variable is allowed to dominate the sum. This theorem can be used to establish the following non-i.i.d. network convergence result.

Proposition 3. Let the network (1) have independent finite-variance weights v_j.
Defining s_n^2 = var[Σ_{j=1}^n v_j], if the sequence {v_j} is Lindeberg, then f_n(x) = s_n^{-1} Σ_{j=1}^n v_j h_j(x) converges in distribution to a Gaussian process f(x) of mean zero and covariance function C(f(x), f(y)) = E[h(x)h(y)] as n → ∞, where h(x) is a variable with the common distribution of the h_j(x).

Proof. Fix a finite set of points {x_1, . . . , x_k} in the input space, and look at the joint distribution of (f_n(x_1), . . . , f_n(x_k)). We want to show these variables are jointly Gaussian in the limit n → ∞, by showing that every linear combination of the components converges in distribution to a Gaussian distribution. Fixing k constants β_i, we have Σ_{i=1}^k β_i f_n(x_i) = s_n^{-1} Σ_{j=1}^n v_j Σ_{i=1}^k β_i h_j(x_i). Define η_j = Σ_{i=1}^k β_i h_j(x_i), and s̃_n^2 = var(Σ_{j=1}^n v_j η_j) = (Eη^2) s_n^2, where η is a random variable with the common distribution of the η_j. Then, since |η_j| ≤ c for some c > 0,

(1/s̃_n^2) Σ_{j=1}^n ∫_{|v_j η_j| ≥ ε s̃_n} |v_j(ω) η_j(ω)|^2 dP(ω) ≤ (c^2 / ((Eη^2) s_n^2)) Σ_{j=1}^n ∫_{|v_j| ≥ ε (Eη^2)^{1/2} s_n / c} |v_j(ω)|^2 dP(ω)

The right-hand side can be made arbitrarily small, from the Lindeberg assumption on {v_j}; hence {v_j η_j} is Lindeberg, from which the theorem follows. The covariance function is easy to calculate.

Corollary 1. If the output weights {v_j} are a uniformly bounded sequence of independent random variables, and lim_{n→∞} s_n = ∞, then f_n(x) in (1) converges in distribution to a Gaussian process.

The preceding corollary, besides giving an easily verifiable condition for Gaussian limits, demonstrates that the non-Gaussian convergence in the example opening Section 2.2 was made possible precisely because the weights v_j decayed sufficiently quickly with j, with the result that lim_{n→∞} s_n < ∞.

3 Learning with Stable Processes

One of the original reasons for focusing machine-learning interest on Gaussian processes was that they act as limit points of suitably constructed parametric models [2], [3].
The problem of learning a regression function, previously tackled by Bayesian inference on a modelling neural network, could be reconsidered by directly placing a Gaussian process prior on the fitting functions themselves. Yet already in early papers introducing the technique, reservations had been expressed concerning such wholesale replacement [2]. Gaussian processes did not seem to capture the richness of finite neural networks -- for one, the dependencies between multiple outputs of a network vanished in the Gaussian limit. Consider the simplest regression problem, that of the estimation of a state process u(x) from observations y(x_i), under the model

y(x) = u(x) + ε(x)    (9)

Figure 2: Scatter plots of bivariate symmetric α-stable distributions with discrete spectral measures. Top row: α = 1.5; bottom row: α = 0.5. Left to right: (a) H = identity, (b) H a rotation, (c) H a 2 × 3 matrix with columns (-1/16, 3/16)^T, (0, 1)^T, (1/16, 3/16)^T.

where ε(x) is noise independent of u. The obvious generalization of Gaussian process regression involves the placement of a stable process prior of index α on u, and setting ε as i.i.d. stable noise of the same index. Then the observations y also form a stable process of index α. Two advantages come with such a generalization. First, the use of a heavy-tailed distribution for ε will tend to produce more robust regression estimates, relative to the Gaussian case; this robustness can be additionally controlled by the stability parameter α. Secondly, a glance at the classification of Theorem 1 indicates that the correlation structure of stable vectors (hence processes) is significantly richer than that of the Gaussian: the space of n-dimensional stable vectors is already characterised by a whole space of measures, rather than an n × n covariance matrix.
The use of such priors on the data u affords a significant broadening in the number of interesting dependency relationships that may be assumed. An understanding of the dependency structure of multivariate stable vectors can first be broached by considering the following basic class. Let v be a vector of i.i.d. symmetric stable variables of the same index, and let H be a matrix of appropriate dimension so that x = Hv is well-defined. Then x has a symmetric stable characteristic function, where the spectral measure Γ̃ in Theorem 1 is discrete, i.e. concentrated on a finite number of points. Divergences in the correlation structure are readily apparent even within this class. In the Gaussian case, there is no advantage in the selection of non-square matrices H, since the distribution of x can always be obtained by a square mixing matrix H̃ with the same number of rows as H. Not so when α < 2, for then the characteristic function of x in general possesses n fundamental discontinuities in higher-order derivatives, where n is the number of columns of H. Furthermore, in the square case, replacement of H with HR, where R is any rotation matrix, leaves the distribution invariant when α = 2; for non-Gaussian stable vectors, the mixing matrices H and H' give rise to the same distribution only when |H^{-1}H'| is a permutation matrix, where | · | is applied component-wise. Figure 2 illustrates the variety of dependency structures which can be attained as H is changed. A number of techniques already exist in the statistical literature for the estimation of the spectral measure (and hence the mixing matrix H) of multivariate stable vectors from empirical data. The infinite-dimensional generalization of the above situation gives rise to the set of stable processes produced as time-varying filtered versions of i.i.d. stable noise; similar to the Gaussian process, these are parameterized by a centering (mean) function μ(x) and a bivariate filter function h(x, ·) encoding dependency information.
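The discrete-spectral-measure class x = Hv is simple to experiment with. The following sketch (our construction, with α = 1 for convenience, since i.i.d. Cauchy components are easy to generate) checks the resulting characteristic function exp(-Σ_k |⟨t, h_k⟩|^α), summed over the columns h_k of H, against simulation for a rotation H as in Fig. 2(b).

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.pi / 4
H = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation, as in Fig. 2(b)
v = rng.standard_cauchy((2, 300_000))             # i.i.d. Cauchy (alpha = 1)
x = H @ v                                         # bivariate stable vectors
t = np.array([1.0, 0.0])
# Predicted characteristic function at t: exp(-sum_k |<t, h_k>|), where the
# h_k are the columns of H; the spectral measure sits on the +-h_k/|h_k|.
pred = np.exp(-np.abs(t @ H).sum())
ecf = np.mean(np.cos(t @ x))                      # empirical counterpart
```

Replacing H by a non-square matrix changes the number of atoms of the discrete spectral measure, which is precisely the extra freedom unavailable in the Gaussian (α = 2) case.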
Another simple family of stable processes consists of the so-called sub-Gaussian processes. These are processes defined by u(x) = A^{1/2} G(x), where A is a totally right-skewed α/2-stable variable [5], and G a Gaussian process of mean zero and covariance K. The result is a symmetric α-stable random process with finite-dimensional characteristic functions of the form

φ(t) = exp( -(1/2) |⟨t, Kt⟩|^{α/2} )    (10)

The sub-Gaussian processes are then completely parameterized by the statistics of the subordinating Gaussian process G. Even more, they have the following linear regression property [5]: if Y_1, . . . , Y_n are jointly sub-Gaussian, then

E[Y_n | Y_1, . . . , Y_{n-1}] = a_1 Y_1 + ··· + a_{n-1} Y_{n-1}.    (11)

Unfortunately, the regression is somewhat trivial, because a calculation shows that the coefficients of regression {a_i} are the same as in the case where the Y_i are assumed jointly Gaussian! Indeed, this curious property appears any time the variables take the form Y = B·G, for a fixed scalar random variable B and Gaussian vector G. It follows that the predictive mean estimates for (9) employing sub-Gaussian priors are identical to the estimates under a Gaussian hypothesis. On the other hand, the conditional distribution of Y_n | Y_1, . . . , Y_{n-1} differs greatly from the Gaussian, and is in general neither stable nor symmetric about its conditional mean. From Fig. 2 one even sees that the conditional distribution may be multimodal, in which case the predictive mean estimates are not particularly valuable. More useful are MAP estimates, which in the Gaussian scenario coincide with the conditional mean. In any case, regression on stable processes suggests the need to compute and investigate the entire a posteriori probability law. The main thrust of our foregoing results indicates that the class of possible limit points of network functions is significantly richer than the family of Gaussian processes, even under relatively restricted (e.g. i.i.d.) hypotheses.
Gaussian processes are the appropriate models of large networks with finite-variance priors in which no one component dominates another, but when the finite-variance assumption is discarded, stable processes become the natural limit points. Non-stable processes can be obtained with an appropriate choice of non-i.i.d. parameter priors, even in an infinite network. Our discussion of the stable process regression problem has principally been confined to an exposition of the basic theoretical issues and principles involved, rather than to algorithmic procedures. Nevertheless, since simple closed-form expressions exist for the characteristic functions, the predictive probability laws can all in principle be computed with multi-dimensional Fourier transform techniques. Stable variables form mathematically natural generalisations of the Gaussian, with some fundamental, but compelling, differences which suggest additional variety and flexibility in learning applications.

References
[1] R. Neal, Bayesian Learning for Neural Networks. New York: Springer-Verlag, 1996.
[2] D. MacKay, Introduction to Gaussian Processes. Extended lecture notes, NIPS 1997.
[3] M. Seeger, Gaussian Processes for Machine Learning. International Journal of Neural Systems 14(2), 2004, 69-106.
[4] C. Burrill, Measure, Integration and Probability. New York: McGraw-Hill, 1972.
[5] G. Samorodnitsky & M. Taqqu, Stable Non-Gaussian Random Processes. New York: Chapman & Hall, 1994.
[6] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 2. New York: John Wiley & Sons, 1966.
", "award": [], "sourceid": 2869, "authors": [{"given_name": "Ricky", "family_name": "Der", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}