{"title": "Symplectic Nonlinear Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 437, "page_last": 443, "abstract": null, "full_text": "Symplectic Nonlinear Component Analysis

Lucas C. Parra
Siemens Corporate Research
755 College Road East, Princeton, NJ 08540
lucas@scr.siemens.com

Abstract

Statistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linearly correlated and Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linearly statistically dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis, finally, should find a factorial representation for nonlinearly statistically dependent signals. This paper proposes for this task a novel feed-forward, information conserving, nonlinear map: the explicit symplectic transformation. It also solves the problem of non-Gaussian output distributions by considering single coordinate higher order statistics.

1 Introduction

In previous papers Deco and Brauer (1994) and Parra, Deco, and Miesbach (1995) suggest volume conserving transformations and factorization as the key elements for a nonlinear version of Independent Component Analysis. As a general class of volume conserving transformations Parra et al. (1995) propose the symplectic transformation. It was defined by an implicit nonlinear equation, which leads to a complex relaxation procedure for the function recall. In this paper an explicit form of the symplectic map is proposed, thus overcoming the computational problems.

In order to correctly measure the factorization criterion for non-Gaussian output distributions, higher order statistics have to be considered. Comon (1994) includes in the linear case higher order cumulants of the output distribution. Deco and Brauer (1994) consider multi-variate, higher order moments and use them in the case of nonlinear volume conserving transformations. But the calculation of multi-coordinate higher moments is computationally expensive.

The factorization criterion for statistical independence can be expressed in terms of minimal mutual information. Considering only volume conserving transformations allows us to concentrate on single coordinate statistics, which leads to an important reduction of computational complexity. So far, this approach (Deco & Schurman, 1994; Parra et al., 1995) has been restricted to second order statistics. The present paper discusses the use of higher order cumulants for the estimation of the single coordinate output distributions. The single coordinate entropies measured by the proposed technique match the entropies of the sampled data more accurately. This in turn leads to better factorization results.

2 Statistical Independence

More general than the decorrelation used in PCA, the goal is to extract statistically independent features from a signal distribution p(x). We look for a deterministic transformation on R^n, y = f(x), which generates a factorial representation p(y) = \prod_{i=1}^n p(y_i), or at least a representation where the individual coordinates p(y_i) of the output variable y are \"as factorial as possible\". This can be accomplished by minimizing the mutual information MI[p(y)],

0 \le MI[p(y)] = \sum_{i=1}^n H[p(y_i)] - H[p(y)],    (1)

since MI[p(y)] = 0 holds if p(y) is factorial. The mutual information can thus be used as a measure of \"independence\".
The entropies H in definition (1) are defined as usual by H[p(y)] = -\int_{-\infty}^{\infty} p(y) \ln p(y) \, dy.

As in linear PCA we select volume conserving transformations, but now without restricting ourselves to linearity. In the noise-free case of reversible transformations, volume conservation implies conservation of entropy from the input x to the output y, i.e. H[p(y)] = H[p(x)] = const (see Papoulis, 1991). The minimization of the mutual information (1) then reduces to the minimization of the single coordinate output entropies H[p(y_i)]. This substantially simplifies the complexity of the problem, since no multi-coordinate statistics are required.

2.1 Measuring the Entropy with Cumulants

With an upper bound minimization criterion the task of measuring entropies can be avoided (Parra et al., 1995):

H[p(y_i)] \le \frac{1}{2} \ln(2 \pi e \sigma_i^2).    (2)

Figure 1: LEFT: Dotted line: exponential distribution with additive Gaussian noise (noise-variance/decay-constant = 0.2), sampled with 1000 data points. Dashed line: Gaussian approximation, equivalent to the Edgeworth approximation to second order. Solid line: Edgeworth approximation including terms up to fourth order. RIGHT: Structure of the volume conserving explicit symplectic map.

The minimization of the individual output coordinate entropies H[p(y_i)] simplifies to the minimization of the output variances \sigma_i. For the validity of this approach it is crucial that the map y = f(x) transforms the arbitrary input distribution p(x) into a Gaussian output distribution. But volume conserving and continuous maps cannot transform arbitrary distributions into Gaussians.
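As an illustration of criterion (2) (a sketch of ours, not part of the original implementation; all names are ours), the Gaussian entropy bound can be evaluated from sample variances. The example also shows the insensitivity discussed below: a unit-variance uniform sample receives the same bound as a Gaussian one, although its true entropy is lower.

```python
import numpy as np

def entropy_gaussian_bound(y):
    # Gaussian upper bound on the coordinate entropies, criterion (2):
    # H[p(y_i)] <= 1/2 ln(2 pi e sigma_i^2), with equality iff y_i is Gaussian.
    return 0.5 * np.log(2 * np.pi * np.e * np.var(y, axis=0))

rng = np.random.default_rng(0)
gauss = rng.normal(0.0, 1.0, size=(100_000, 1))
# a uniform distribution on [-sqrt(3), sqrt(3)] also has unit variance
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, 1))

# both bounds come out near 0.5 ln(2 pi e) ~ 1.419 nats, although the true
# entropy of the uniform coordinate is ln(2 sqrt(3)) ~ 1.242 nats
print(entropy_gaussian_bound(gauss), entropy_gaussian_bound(unif))
```

This is exactly the insensitivity visible in the Gaussian-upper-bound row of table 1 below.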
To overcome this problem one includes statistics higher than second order in the optimization criterion.

Comon (1994) suggests using the Edgeworth expansion of a probability distribution. This leads to an analytic expression of the entropy in terms of measurable higher order cumulants. Edgeworth expands the multiplicative correction to the best Gaussian approximation of the distribution in the orthonormal basis of Hermite polynomials h_\alpha(y). The expansion coefficients are basically given by the cumulants c_\alpha of the distribution p(y). For a zero-mean distribution with variance \sigma the Edgeworth expansion reads (see Kendall & Stuart, 1969)

p(y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-y^2/2\sigma^2} f(y).    (3)

Note that by truncating this expansion at a certain order, we obtain an approximation p_app(y) which is not strictly positive. Figure 1, left, shows a sampled exponential distribution with additive Gaussian noise.

By cutting expansion (3) at fourth order, and further expanding the logarithm in the definition of entropy up to sixth order, Comon (1994) approximates the entropy by

H[p_app(y)] \approx \frac{1}{2}\ln(2\pi e) + \ln\sigma - \frac{1}{12}\frac{c_3^2}{\sigma^6} - \frac{1}{48}\frac{c_4^2}{\sigma^8} - \frac{7}{48}\frac{c_3^4}{\sigma^{12}} + \frac{1}{8}\frac{c_3^2 c_4}{\sigma^{10}}.    (4)

We suggest using this expression to minimize the single coordinate entropies in the definition of the mutual information (1).

2.2 Measuring the Entropy by Estimating an Approximation

Note that (4) could only be obtained by truncating the expansion (3). It is therefore limited to fourth order statistics, which might not be enough for a satisfactory approximation. Besides, the additional approximation of the logarithm is accurate only for small corrections to the best Gaussian approximation, i.e. for f(y) \approx 1. For distributions with non-Gaussian tails the correction terms might be rather large and even negative, as noted above.
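Approximation (4) can be evaluated directly from the data. The following sketch (ours; the cumulants are estimated by their empirical moment expressions, and all names are ours) implements the formula:

```python
import numpy as np

def entropy_comon(y):
    # entropy approximation (4) from the sample's 3rd and 4th cumulants
    y = np.asarray(y) - np.mean(y)
    s2 = np.mean(y ** 2)
    s = np.sqrt(s2)
    c3 = np.mean(y ** 3)                 # empirical third cumulant
    c4 = np.mean(y ** 4) - 3 * s2 ** 2   # empirical fourth cumulant
    return (0.5 * np.log(2 * np.pi * np.e) + np.log(s)
            - c3 ** 2 / (12 * s ** 6)
            - c4 ** 2 / (48 * s ** 8)
            - 7 * c3 ** 4 / (48 * s ** 12)
            + c3 ** 2 * c4 / (8 * s ** 10))

g = np.random.default_rng(7).normal(size=200_000)
print(entropy_comon(g))  # close to 0.5 ln(2 pi e) ~ 1.419 for Gaussian data
```

For near-Gaussian data the cumulant corrections vanish and the value reduces to the Gaussian bound (2).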
We therefore suggest, alternatively, to measure the entropy by estimating the logarithm of the approximated distribution \ln p_app(y) with the given data points y_\nu, using the Edgeworth approximation (3) for p_app(y),

H[p(y)] \approx -\frac{1}{N}\sum_{\nu=1}^{N} \ln p_app(y_\nu) = const + \ln\sigma - \frac{1}{N}\sum_{\nu=1}^{N} \ln f(y_\nu).    (5)

Furthermore, we suggest correcting the truncated expansion p_app by setting f_app(y) \to 0 for all f_app(y) < 0. For the entropy measurement (5) there is in principle no limitation to any specific order.

In table 1 the different measures of entropy are compared. The values in the row labeled 'partition' are measured by counting the numbers n(i) of data points falling in equidistant intervals i of width \Delta y and summing -p(i)\Delta y \ln p(i) over all intervals, with p(i)\Delta y = n(i)/N. This gives good results compared to the theoretical values only because of the relatively large sample size. These values are presented here in order to have a reliable estimate for the case of the exponential distribution, where cumulant methods tend to fail.

The results for the exponential distribution show the difficulty of the measurement proposed by Comon, whereas the estimation measurement given by equation (5) is stable even when considering (for this case) unreliable 5th and 6th order cumulants. The results for the symmetric triangular and uniform distributions demonstrate the insensitivity of the Gaussian upper bound for the example of figure 2. A uniform square distribution is rotated by an angle \alpha. On the abscissa and ordinate a triangular or uniform distribution is observed for the angles \alpha = \pi/4 and \alpha = 0, respectively. The approximation of the single coordinate entropies with a Gaussian measure is the same in both cases, whereas measurements including higher order statistics correctly detect minimal entropy (at fixed total information) for the uniform distribution at \alpha = 0.
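Estimate (5) can be sketched as follows (our illustration; we keep the h_3 and h_4 correction terms of the fourth-order Edgeworth expansion, and we clamp the truncated correction f at a small positive floor rather than exactly zero so that the logarithm stays finite; all names are ours):

```python
import numpy as np

def entropy_estimate(y, eps=1e-3):
    # entropy estimate (5): average of -ln p_app(y_nu) over the sample,
    # with p_app the fourth-order Edgeworth approximation (3)
    y = np.asarray(y) - np.mean(y)
    s = y.std()
    u = y / s
    c3 = np.mean(y ** 3) / s ** 3                   # standardized 3rd cumulant
    c4 = (np.mean(y ** 4) - 3 * s ** 4) / s ** 4    # standardized 4th cumulant
    h3 = u ** 3 - 3 * u                             # probabilists' Hermite polynomials
    h4 = u ** 4 - 6 * u ** 2 + 3
    f = 1 + c3 / 6 * h3 + c4 / 24 * h4              # truncated correction f(y)
    f = np.maximum(f, eps)                          # floor keeps ln finite
    log_papp = -0.5 * np.log(2 * np.pi) - np.log(s) - 0.5 * u ** 2 + np.log(f)
    return -np.mean(log_papp)

g = np.random.default_rng(3).normal(size=200_000)
print(entropy_estimate(g))  # ~ 1.419 nats for Gaussian data
```

Unlike (4), this estimate never expands the logarithm, so large or negative corrections f only affect the clamped points rather than the whole expression.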
3 Explicit Symplectic Transformation

Different ways of realizing a volume conserving transformation that guarantees H[p(y)] = H[p(x)] have been proposed (Deco & Schurman, 1994; Parra et al., 1995).

Measured entropy of          Gauss         uniform       symmetric     exponential
sampled distributions                                    triangular    + Gauss noise
partition                    1.35 ± .02    .024 ± .006   .14 ± .02     1.31 ± .03
Gaussian upper bound (2)     1.415 ± .02   .18 ± .016    .18 ± .02     1.53 ± .04
Comon, eq. (4)               1.414 ± .02   .14 ± .015    .17 ± .02     3.0 ± 2.5
Estimate (5) - 4th order     1.414 ± .02   .13 ± .015    .17 ± .02     1.39 ± .05
Estimate (5) - 6th order     1.414 ± .02   .092 ± .001   .16 ± .02     1.3 ± .5
theoretical value            1.419         .0            .153

Table 1: Entropy values for different distributions sampled with N = 1000 data points and the different estimation methods explained in the text. The standard deviations are obtained by multiple repetitions of the experiment.

A general class of volume conserving transformations are the symplectic maps (Abraham & Marsden, 1978). An interesting and, for our purpose, important fact is that any symplectic transformation can be expressed in terms of a scalar function, and in turn any scalar function defines a symplectic map. In (Parra et al., 1995) a non-reflecting symplectic transformation has been presented. But its implicit definition results in the need to solve a nonlinear equation for each data point. This leads to time consuming computations which in practice limit the applications to low dimensional problems (n \approx 10). In this work reflecting symplectic transformations with an explicit definition are used to define a \"feed-forward\" volume conserving map.
The input and output space is divided into two partitions x = (x_1, x_2) and y = (y_1, y_2), with x_1, x_2, y_1, y_2 \in R^{n/2},

y_1 = x_1 + \frac{\partial Q(x_2)}{\partial x_2}, \qquad y_2 = x_2 + \frac{\partial P(y_1)}{\partial y_1}.    (6)

The structure of this symplectic map is represented in figure 1, right. The two scalar functions P: R^{n/2} \to R and Q: R^{n/2} \to R can be chosen arbitrarily. Note that for quadratic functions equation (6) represents a linear transformation. In order to have a general transformation we introduce for each of these scalar functions a 3-layer perceptron with nonlinear hidden units and a single linear output unit:

P(y_1) = w_1^T g(W_1 y_1), \qquad Q(x_2) = w_2^T g(W_2 x_2).    (7)

The scalar functions P and Q are parameterized by the network parameters w_1, w_2 \in R^m and W_1, W_2 \in R^{m \times n/2}. The hidden-unit nonlinear activation function g applies to each component of the vectors W_1 y_1 and W_2 x_2, respectively. Because of the structure of equation (6) the output coordinates y_1 depend only additively on the input coordinates x_1. To obtain a more general nonlinear dependence a second symplectic layer has to be added.

To obtain factorial distributions the parameters of the map have to be trained. The approximations of the single coordinate entropies (4) or (5) are inserted into the mutual information optimization criterion (1). These approximations are expressed through moments in terms of the measured output data points. Therefore, the gradient of these expressions with respect to the parameters of the map can be computed in principle.
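A minimal sketch of such a layer (our construction, assuming the two-shear reading of (6), y_1 = x_1 + dQ/dx_2 followed by y_2 = x_2 + dP/dy_1, with g = tanh as an example activation; all names are ours) checks volume conservation numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 6                  # input dimension and hidden units per scalar net
half = n // 2
# parameters of the two scalar networks as in eq. (7):
# P(y1) = w1 . g(W1 y1),  Q(x2) = w2 . g(W2 x2)
w1, w2 = rng.normal(size=m), rng.normal(size=m)
W1, W2 = rng.normal(size=(m, half)), rng.normal(size=(m, half))

def grad_scalar(w, W, v):
    # gradient of w . tanh(W v) with respect to v
    return W.T @ (w * (1 - np.tanh(W @ v) ** 2))

def symplectic_layer(x):
    # explicit volume conserving map: two successive shears
    x1, x2 = x[:half], x[half:]
    y1 = x1 + grad_scalar(w2, W2, x2)   # y1 depends only additively on x1
    y2 = x2 + grad_scalar(w1, W1, y1)
    return np.concatenate([y1, y2])

# volume conservation: |det J| = 1, checked with a finite-difference Jacobian
x = rng.normal(size=n)
eps = 1e-6
J = np.stack([(symplectic_layer(x + eps * e) - symplectic_layer(x - eps * e)) / (2 * eps)
              for e in np.eye(n)], axis=1)
print(abs(np.linalg.det(J)))  # ~ 1.0
```

Each shear has a block-triangular Jacobian with identity diagonal blocks, so its determinant is exactly one; stacking two such layers supplies the more general nonlinear dependence mentioned above.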
For that matter different kinds of averages need to be computed. Although this does not substantially increase the computational complexity compared with the efficient minimum variance criterion (2), it increases the complexity of the algorithm considerably. Therefore we applied an optimization algorithm that does not require any gradient information: the simple stochastic and parallel update algorithm ALOPEX (Unnikrishnan & Venugopal, 1994).

4 Experiments

As explained above, finding the correct statistically independent directions of a rotated two-dimensional uniform distribution causes problems for techniques which include only second order statistics. The statistically independent coordinates are simply the axes parallel to the edges of the distribution (see figure 2). A rotation, i.e. a linear transformation, suffices for this task. The covariance matrix of the data is diagonal for any rotation of the square distribution and, hence, does not provide any information about the correct orientation of the square. It is well known that PCA fails to find the statistically independent coordinates in the case of non-Gaussian distributions. Similarly, the Gaussian upper bound technique (2) is not capable of minimizing the mutual information in this case. Instead, with any one of the higher order criteria explained in the previous section one finds the appropriate coordinates for any linearly transformed multi-dimensional uniform distribution. This has been observed empirically for a series of setups. The symplectic map was restricted in these experiments to linearity by using quadratic scalar functions.

Figure 2: Sampled 2-dimensional square uniform distribution rotated by \pi/4. Solid lines represent the directions found by any of the higher order techniques explained in the text. Dashed lines represent directions calculated by linear PCA. (This result is arbitrary and varies with noise.)

The second example shows that the proposed technique in fact finds nonlinear relations between the input coordinates. A one-dimensional signal distributed according to the distribution of figure 1 was nonlinearly transformed into a two-dimensional signal and corrupted with additive noise, leading to the distribution shown in figure 3, left. The task of finding statistically independent coordinates has been tackled by an explicit symplectic transformation with n = 2 and m = 6. Figure 3 shows the different results for the optimization according to the Gaussian upper bound criterion (2) and the approximated entropy criterion (5). Obviously, considering higher order statistics in fact improves the result by finding the better representation of the nonlinear dependency.

Figure 3: Symplectic map trained with 4th and 2nd order statistics corresponding to equations (5) and (2) respectively. Left: input distribution. The line at the center of the distribution gives the nonlinearly transformed noiseless signal distributed according to the distribution shown in figure 1. Center and Right: Output distribution of the symplectic map corresponding to the 4th order (right) and 2nd order (center) criterion.

References

Abraham, R., & Marsden, J. (1978). Foundations of Mechanics. The Benjamin-Cummings Publishing Company, Inc., London.

Comon, P. (1994). Independent component analysis, a new concept. Signal Processing, 36, 287-314.

Deco, G., & Brauer, W. (1994). Higher Order Statistical Decorrelation by Volume Conserving Nonlinear Maps. Neural Networks, submitted.

Deco, G., & Schurman, B. (1994). Learning Time Series Evolution by Unsupervised Extraction of Correlations. Physical Review E, submitted.

Kendall, M. G., & Stuart, A. (1969). The Advanced Theory of Statistics (3rd edition), Vol. 1. Charles Griffin and Company Limited, London.

Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes (3rd edition). McGraw-Hill, New York.

Parra, L., Deco, G., & Miesbach, S. (1995). Redundancy reduction with information-preserving nonlinear maps.
Network, 6(1), 61-72.

Unnikrishnan, K. P., & Venugopal, K. P. (1994). Alopex: A Correlation-Based Learning Algorithm for Feedforward and Recurrent Neural Networks. Neural Computation, 6(3), 469-490.
", "award": [], "sourceid": 1080, "authors": [{"given_name": "Lucas", "family_name": "Parra", "institution": null}]}