{"title": "Structural equations and divisive normalization for energy-dependent component analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1872, "page_last": 1880, "abstract": "Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model of the energy-correlations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones, as in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization, which effectively reduces energy correlation. Our new two-stage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with a synthetic dataset, natural images and brain signals.", "full_text": "Structural equations and divisive normalization for energy-dependent component analysis\n\nJun-ichiro Hirayama\nDept. of Systems Science\nGraduate School of Informatics\nKyoto University\n611-0011 Uji, Kyoto, Japan\n\nAapo Hyv\u00e4rinen\nDept. of Mathematics and Statistics\nDept. of Computer Science and HIIT\nUniversity of Helsinki\n00560 Helsinki, Finland\n\nAbstract\n\nComponents estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. 
Here, we propose a principled probabilistic model of the energy-correlations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones, as in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization, which effectively reduces energy correlation. Our new two-stage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with a synthetic dataset, natural images and brain signals.\n\n1 Introduction\n\nStatistical models of natural signals have provided a rich framework to describe how sensory neurons process and adapt to ecologically-valid stimuli [28, 12]. In early studies, independent component analysis (ICA) [2, 31, 13] and sparse coding [22] successfully showed that V1 simple cell-like edge filters, or receptive fields, emerge as optimal inference on latent quantities under linear generative models trained on natural image patches. In the subsequent developments over the last decade, many studies (e.g. [10, 32, 11, 14, 23, 17]) have focused explicitly or implicitly on modeling a particular type of nonlinear dependency between the responses of the linear filters, namely correlations in their variances or energies. Some of them showed that models of energy-correlation could account for, e.g., response properties of V1 complex cells [10, 15], cortical topography [11, 23], and contrast gain control [26].\n\nInterestingly, such energy correlations are also prominent in other kinds of data, including brain signals [33] and presumably even financial time series, which have strong heteroscedasticity. 
Thus, developing a general model for energy-correlations of linear latent variables is an important problem in the theory of machine learning, and such models are likely to have a wide domain of applicability.\n\nHere, we propose a new statistical model incorporating energy-correlations within the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones, as in ICA, and a model of the energy-correlations based on the structural equation model (SEM) [3], in particular the recently developed Linear Non-Gaussian (LiNG) SEM [27, 18]. As a model of natural signals, an important feature of our model is its connection to \u201cdivisive normalization\u201d (DN) [7, 4, 26], which effectively reduces energy-correlations of linearly-transformed natural signals [32, 26, 29, 19, 21] and is now part of a well-accepted model of V1 single cell responses [12]. We provide a new generative interpretation of DN based on the SEM, which is an important contribution of this work. Also, from a machine learning perspective, causal analysis using SEMs has recently become very popular; our model could extend the applicability of LiNG-SEM to blindly mixed signals.\n\nAs a two-stage extension of ICA, our model is also closely related to both the scale-mixture-based models, e.g. [11, 30, 14] (see also [32]) and the energy-based models, e.g. [23, 17]. An advantage of our new model is its tractability: unlike previous models, it requires neither an approximation of the likelihood function nor non-canonical principles for modeling and estimation.\n\n2 Structural equation model and divisive normalization\n\nA structural equation model (SEM) [3] of a random vector y = (y1, y2, . . . , yd)\u22a4 is formulated as simultaneous equations of random variables, such that\n\nyi = \u03bai(yi, y\u2212i, ri),\n\ni = 1, 2, . . . 
, d,\n\n(1)\n\nor y = \u03ba(y, r), where the function \u03bai describes how each single variable yi is related to other variables y\u2212i, possibly including itself, and a corresponding stochastic disturbance or external input ri which is independent of y. These equations, called structural equations, specify the distribution of y, as y is an implicit function (assuming the system is invertible) of the random vector r = (r1, r2, . . . , rd)\u22a4. If there exists a permutation \u03c0 : y \u21a6 y\u2032 such that each y\u2032_i only depends on the preceding ones {y\u2032_j | j < i}, an SEM is called recursive or acyclic, associated with a directed acyclic graph (DAG); the model is then a cascade of (possibly) nonlinear regressions of yi\u2019s on the preceding variables on the graph, and is also seen as a Bayesian network. Otherwise, the SEM is called non-recursive or cyclic, where the structural equations cannot be simply decomposed into regressive models. In a standard interpretation, a cyclic SEM rather describes the distribution of equilibrium points of a dynamical system, y(t) = \u03ba(y(t \u2212 1), r) (t = 0, 1, . . .), where every realized input r is fixed until y(t) converges to y [24, 18]; some conditions are usually needed to make the interpretation valid.\n\n2.1 Divisive normalization as non-linear SEM\n\nNow, we briefly point out the connection of SEM to DN, which strongly motivated us to explore the application of SEM to natural signal statistics.\n\nLet s1, s2, . . . , sd be scalar-valued outputs of d linear filters applied to a multivariate input, collectively written as s = (s1, s2, . . . , sd)\u22a4. The linear filters may either be derived/designed with some mathematical principles (e.g. wavelets) or be learned from data (e.g. ICA). The outputs of linear filters often have the property that their energies \u03d5(|si|) (i = 1, 2, . . . 
, d) have non-negligible dependencies or correlations to each other, even when the outputs themselves are linearly uncorrelated. The nonlinear function \u03d5 is any appropriate measure of energy, typically given by the squaring function, i.e. \u03d5(|s|) = s^2 [26, 12], while other choices are not excluded; we assume \u03d5 is continuously differentiable and strictly increasing over [0,\u221e), and \u03d5(0) = 0.\n\nDivisive Normalization (DN) [26] is an effective nonlinear transformation for eliminating the energy-dependencies remaining in the filtered outputs. Although several variants have been proposed, a basic form can be formulated as follows: Given the d outputs, their energies are normalized (divided) by a linear combination of the energies of other signals, such that\n\n$$z_i = \\frac{\\phi(|s_i|)}{\\sum_j h_{ij}\\,\\phi(|s_j|) + h_{i0}}, \\qquad i = 1, 2, \\ldots, d, \\tag{2}$$\n\nwhere hij and hi0 are real-valued parameters of this transform. Now, it is straightforward to see that the following structural equations in the log-energy domain,\n\n$$y_i := \\ln \\phi(|s_i|) = \\ln\\Big(\\sum_j h_{ij} \\exp(y_j) + h_{i0}\\Big) + r_i, \\qquad i = 1, 2, \\ldots, d, \\tag{3}$$\n\ncorrespond to Eq. (2) where zi = exp(ri) is another representation of the disturbance. The SEM will typically be cyclic, since the coefficients hij in Eq. (2) are seldom constrained to satisfy acyclicity; Eq. (3) thus implies a nonlinear dynamical system, and this can be interpreted as the data-generating process underlying DN. Interestingly, Eq. (3) also implies a linear system with multiplicative input, $e^{y_i} = (\\sum_j h_{ij} e^{y_j} + h_{i0}) z_i$, in the energy domain, i.e. $e^{y_i} := \\phi(|s_i|)$. The DN transform of Eq. (2) gives the optimal mapping under the SEM to infer the disturbance from given si\u2019s; if the true disturbances are independent, it optimally reduces the energy-dependencies. This is consistent with the redundancy reduction view of DN [29, 19].\n\nNote also that the SEM above implies $e^{\\mathbf{y}} = (\\mathbf{I} - \\mathrm{diag}(\\mathbf{z})\\mathbf{H})^{-1} \\mathrm{diag}(\\mathbf{h}_0)\\,\\mathbf{z}$ with H = (hij) and h0 = (hi0), as shown in [20] in the context of DN (footnote 1). Although mathematically equivalent, such a complicated dependence [20] on the disturbance z does not provide an elegant model of the underlying data-generating process, compared to the relatively simple form of Eq. (3).\n\nFootnote 1: To be precise, [20] showed the invertibility of the entire mapping s \u21a6 z in the case of a \u201csigned\u201d DN transform that keeps the signs of zi and si the same.\n\n3 Energy-dependent ICA using structural equation model\n\nNow, we define a new generative model which models energy-dependencies of linear latent components using an SEM.\n\n3.1 Scale-mixture model\n\nLet s now be a random vector of d source signals underlying an observation x = (x1, x2, . . . , xd)\u22a4, which has the same dimensionality for simplicity. They follow a standard linear generative model:\n\n$$\\mathbf{x} = \\mathbf{A}\\mathbf{s}, \\tag{4}$$\n\nwhere A is a square mixing matrix. We assume here E[x] = E[s] = 0 without loss of generality, by always subtracting the sample mean from every observation. Then, assuming A is invertible, each transposed row wi of the demixing (filtering) matrix W = A^{\u22121} gives the optimal filter to recover si from x, which is constrained to have unit norm, $\\|\\mathbf{w}_i\\|_2^2 = 1$, to fix the scaling ambiguity.\n\nTo introduce energy-correlations into the sources, a classic approach is to use a scale-mixture representation of sources, such that si = ui\u03c3i, where ui represents a normalized signal having zero mean and constant variance, and \u03c3i is a positive factor that is independent of ui and modulates the variance (energy) of si [32, 11, 30, 14, 16]. Also, in vector notation, we write\n\n$$\\mathbf{s} = \\mathbf{u} \\odot \\boldsymbol{\\sigma}, \\tag{5}$$\n\nwhere \u2299 denotes component-wise multiplication. Here, u and \u03c3 are mutually independent, and ui\u2019s are also independent of each other. Then E[s|\u03c3] = 0 and E[ss\u22a4|\u03c3] = diag(\u03c31^2, \u03c32^2, . . . , \u03c3d^2) for any given \u03c3, where \u03c3i\u2019s may be dependent on each other and introduce energy-correlations. A drawback of this approach is that to learn the model effectively based on the likelihood, we usually need some approximation to deal with the marginalization over u.\n\n3.2 Linear Non-Gaussian SEM\n\nHere, we simplify the above scale-mixture model by restricting ui to be binary, i.e. ui \u2208 {\u22121, 1}, and uniformly distributed. Although the simplification reduces the flexibility of the source distribution, the resultant model is tractable, i.e. no approximation is needed for likelihood computation, as will be shown below. Also, this implies that ui = sign(si) and \u03c3i = |si|, and hence the log-energy above now has a simple deterministic relation to \u03c3i, i.e. yi = ln \u03d5(\u03c3i), which can be inverted to \u03c3i = \u03d5^{\u22121}(exp(yi)).\n\nWe particularly assume the log-energies yi follow the Linear Non-Gaussian (LiNG) [27, 18] SEM:\n\n$$y_i = \\sum_j h_{ij} y_j + h_{i0} + r_i, \\qquad i = 1, 2, \\ldots, d, \\tag{6}$$\n\nwhere the disturbances are zero-mean and in particular assumed to be non-Gaussian and independent of each other, which has been shown to greatly improve the identifiability of linear SEMs [27]; the interaction structure in Eq. (6) can be represented by a directed graph for which the matrix H = (hij) serves as the weighted adjacency matrix. In the energy domain, Eq. (6) is equivalent to $e^{y_i} = \\big(\\prod_j (e^{y_j})^{h_{ij}}\\big) e^{h_{i0}} z_i$ (i = 1, 2, . . . , d), and interestingly, these SEMs further imply a novel form of DN transform, given by\n\n$$z_i = \\frac{\\phi(|s_i|)}{e^{h_{i0}} \\prod_j \\phi(|s_j|)^{h_{ij}}}, \\qquad i = 1, 2, \\ldots, d, \\tag{7}$$\n\nwhere the denominator is now not additive but multiplicative. 
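\n\nThe step from Eq. (6) to Eq. (7) can be spelled out as a short derivation. The following is only a sketch that uses the definitions already given above, namely e^{y_i} = \u03d5(|s_i|) and z_i = exp(r_i), and introduces no new quantities:\n\n```latex\n\\begin{aligned}\ny_i &= \\sum_j h_{ij} y_j + h_{i0} + r_i \\\\\n\\Rightarrow\\quad e^{y_i} &= \\Big(\\prod_j e^{h_{ij} y_j}\\Big)\\, e^{h_{i0}}\\, e^{r_i}\n  = \\Big(\\prod_j \\phi(|s_j|)^{h_{ij}}\\Big)\\, e^{h_{i0}}\\, z_i \\\\\n\\Rightarrow\\quad z_i &= \\frac{\\phi(|s_i|)}{e^{h_{i0}} \\prod_j \\phi(|s_j|)^{h_{ij}}}.\n\\end{aligned}\n```\n\nIn words, a sum of weighted log-energies becomes a product of powers of energies after exponentiation, which is why the normalizing denominator in Eq. (7) is multiplicative rather than additive.\n\n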
This form provides an interesting alternative to the original DN.\n\nTo recapitulate the new generative model proposed here: 1) the log-energies y are generated according to the SEM in Eq. (6); 2) the sources are generated according to Eq. (5) with \u03c3i = \u03d5^{\u22121}(exp(yi)) and random signs, ui; and 3) the observation x is obtained by linearly mixing the sources as in Eq. (4). Under this model, the optimal mapping to infer zi = exp(ri) from x is the linear filtering W followed by the new DN transform, Eq. (7). On the other hand, it would also be possible to define the energy-dependent ICA by using the nonlinear SEM in Eq. (3) instead. Then, the optimal inference would be given by the divisive normalization in Eq. (2). However, estimation and other theoretical issues (e.g. identifiability) related to nonlinear SEMs, particularly in the case of non-Gaussianity of the disturbances, are quite involved, and are still under development, e.g. [8].\n\n3.3 Identifiability issues\n\nBoth the theory and algorithms related to LiNG coincide largely with those of ICA, since Eq. (6) with non-Gaussian r implies the generative model of ICA, y = Br + b0, where B = (I \u2212 H)^{\u22121} and b0 = Bh0 with h0 = (hi0). Like ICA [13], Eq. (6) is not completely identifiable due to the ambiguities related to scaling (with signs) and permutation [27, 18]. To fix the scaling, we set E[rr\u22a4] = I here. The permutation ambiguity is more serious than in the case of ICA, because the row-permutation of H completely changes the structure of the corresponding directed graph, and is typically addressed by constraining the graph structure, as will be discussed next.\n\nTwo classes of LiNG-SEM have been proposed, corresponding to different constraints on the graph structure. One is LiNGAM [27], which ensures full identifiability by the DAG constraint. 
The other is generally referred to as LiNG [18], which allows general cyclic graphs; the \u201cLiNG discovery\u201d algorithm in [18] dealt with the non-identifiability of cyclic SEMs by finding multiple solutions that give the same distribution.\n\nHere we define two variants of our model. One is the acyclic model, using LiNGAM. In contrast to the original LiNGAM, our targets are (linear) latent variables, not observed ones. The ordering of latent variables is not meaningful, because the rows of the filter matrix W can be arbitrarily permuted. The acyclic constraint thus can be simplified into a lower-triangular constraint on H. The other is the symmetric model, which uses a special case of cyclic SEM, i.e. those with a symmetric constraint on H. Such a constraint would be relatively new to the context of SEM, although it is a well-known setting in the ICA literature (e.g. [5]). The SEM is then identifiable using only the first- and second-order statistics, based on the relations h0 = VE[y] and V := I \u2212 H = Cov[y]^{\u22121/2} [5], provided that V is positive definite (footnote 2). This implies that non-Gaussianity is not essential for identifiability, in contrast to the acyclic model, which is not identifiable without non-Gaussianity [27]. The above relations also suggest moment-based estimators of h0 and V, which can be used either as the final estimates or as the initial conditions in the maximum likelihood algorithm below.\n\nFootnote 2: Under the dynamical system interpretation, the matrix H should have absolute eigenvalues smaller than one for stability [18], where V = I \u2212 H is naturally positive definite because the eigenvalues are all positive.\n\n3.4 Maximum likelihood\n\nLet \u03c8(s) := ln \u03d5(|s|) for notational simplicity, and denote \u03c8\u2032(s) := sign(s)(ln \u03d5)\u2032(|s|) as a convention, e.g. (ln |s|)\u2032 := 1/s. Also, following the basic theory of ICA, we assume the disturbances have a joint probability density function (pdf) $p_r(\\mathbf{r}) = \\prod_i \\rho(r_i)$ with a common fixed marginal pdf \u03c1. Then, we have the following pdf of s without any approximation (see Appendix for derivation):\n\n$$p_s(\\mathbf{s}) = \\frac{1}{2^d}\\, |\\det \\mathbf{V}| \\prod_{i=1}^{d} \\rho(\\mathbf{v}_i^\\top \\psi(\\mathbf{s}) - h_{i0})\\, |\\psi'(s_i)|, \\tag{8}$$\n\nwhere vi is the i-th transposed row vector of V (= I \u2212 H). The pdf of x is given by px(x) = |det W| ps(Wx), and the corresponding loss function, l = \u2212 ln px(x) + const., is given by\n\n$$l(\\mathbf{x}, \\mathbf{W}, \\mathbf{V}, \\mathbf{h}_0) = \\bar{f}(\\mathbf{V}\\psi(\\mathbf{W}\\mathbf{x}) - \\mathbf{h}_0) + \\bar{g}(\\mathbf{W}\\mathbf{x}) - \\ln|\\det \\mathbf{W}| - \\ln|\\det \\mathbf{V}|, \\tag{9}$$\n\nwhere $\\bar{f}(\\mathbf{r}) = \\sum_i f(r_i)$, $f(r_i) = -\\ln \\rho(r_i)$, $\\bar{g}(\\mathbf{s}) = \\sum_i g(s_i)$, and $g(s_i) = -\\ln|\\psi'(s_i)|$. Note that the loss function above is closely related to the ones in previous studies, such as those of energy-based models [23, 17]. Our model is less flexible than these models, since it is limited to the case that A is square, but the exact likelihood is available. It is also interesting to see that the loss function above includes an additional second term that has not appeared in previous models, due to the formal derivation of the pdf by the argument of transformation of random variables.\n\nTo obtain the maximum likelihood estimates of W, V, and h0, we minimize the negative log-likelihood (i.e. empirical average of the losses) by the projected gradient method (for the unit-norm constraints, $\\|\\mathbf{w}_i\\|_2^2 = 1$). The required first derivatives are given by\n\n$$\\frac{\\partial l}{\\partial \\mathbf{V}} = f'(\\mathbf{V}\\mathbf{y} - \\mathbf{h}_0)\\,\\mathbf{y}^\\top - \\mathbf{V}^{-\\top}, \\qquad \\frac{\\partial l}{\\partial \\mathbf{h}_0} = -f'(\\mathbf{r}), \\tag{10a}$$\n\n$$\\frac{\\partial l}{\\partial \\mathbf{W}} = \\big\\{ \\mathrm{diag}(\\psi'(\\mathbf{W}\\mathbf{x}))\\, \\mathbf{V}^\\top f'(\\mathbf{V}\\mathbf{y} - \\mathbf{h}_0) + g'(\\mathbf{W}\\mathbf{x}) \\big\\}\\, \\mathbf{x}^\\top - \\mathbf{W}^{-\\top}, \\tag{10b}$$\n\nwhere y = \u03c8(Wx), r = Vy \u2212 h0, and f\u2032 and g\u2032 are applied component-wise.\n\nIn both acyclic and symmetric cases, only the lower-triangular elements in V are free parameters. If acyclic, the upper-triangular elements are fixed at zero; if symmetric, they are determined by the lower-triangular elements, and thus \u2202l/\u2202vij (i > j) should be replaced with \u2202l/\u2202vij + \u2202l/\u2202vji.\n\nFigure 1: Estimation performance of mixing matrix measured by the \u201cAmari Index\u201d [1] (non-negative, and zero denotes perfect estimation up to unavoidable indeterminacies) versus sample size, shown in log-log scales. Each panel corresponds to a particular value of \u03b1, which determined the relative connection strength between sources. The solid lines denote the median of ten runs.\n\n4 Simulations\n\nTo demonstrate the applicability of our method, we conducted the following simulation experiments. In all experiments below, we set \u03d5(|s|) = |s|, and \u03c1(r) = (1/2)sech(\u03c0r/2) corresponding to the standard tanh nonlinearity in ICA: f\u2032(r) = (\u03c0/2) tanh((\u03c0/2)r). In our projected gradient algorithm, the matrix W was first initialized by FastICA [9]; the SEM parameters, H and h0, were initialized by the moment-based estimator described above (symmetric model) or by the LiNGAM algorithm [27] (acyclic model). The algorithm was terminated when the decrease of the objective value was smaller than 10^{\u22126}; the learning rate was adjusted in each step by simply multiplying it by the factor 0.9 until the new point did not increase the objective value.\n\n4.1 Synthetic dataset\n\nFirst, we examined how the energy-dependence learned in the SEM affects the estimation of linear filters. 
We artificially sampled the dataset with d = 10 from our generative model by setting the matrix V to be tridiagonal, where all the main and the first diagonals were set at 10 and 10\u03b1, respectively. Figure 1 shows the \u201cAmari Index\u201d [1] of estimated W by three methods, at several factors \u03b1 and sample sizes, with ten runs for every condition. In each run, the true mixing matrix was given by inverting W randomly generated from standard Gaussian and then row-normalized to have unit norms. The three methods were: 1) FastICA (footnote 3) with the tanh nonlinearity, 2) our method (symmetric model) without energy-dependence (NoDep) initialized by FastICA, and 3) our full method (symmetric model) initialized by NoDep. NoDep was the same as the full method except that the off-diagonal elements of H were kept zero. Note that our two algorithms used exactly the same criterion for termination, while FastICA used a different one. This could cause the relatively poor performance of FastICA in this figure. The comparison between the full method and NoDep showed that energy-dependence learned in the SEM could improve the estimation of the filter matrix, especially when the dependence was relatively strong.\n\n[Figure 1 panels: \u03b1 = \u22120.4, \u22120.3, \u22120.2, 0, 0.2, 0.3, 0.4; x-axis: Sample Size; y-axis: Amari Index; legend: FastICA, No Dep., Proposed]\n\nFigure 2: Connection weights versus pairwise differences of four properties of linear basis functions, estimated by fitting 2D Gabor functions. The curves were fit by local Gaussian smoothing.\n\n4.2 Natural images\n\nThe dataset consisted of 50,000 image patches of 16 \u00d7 16 pixels randomly taken from the original gray-scale pictures of natural scenes (footnote 4). As a preprocessing, the sample mean was subtracted and the dimensionality was reduced to 160 by the principal component analysis (PCA) where 99% of the variance was retained. 
We constrained the SEM to be symmetric. Both the obtained basis functions and filters were qualitatively very similar to those reported in many previous studies, and are given in the Supplementary Material.\n\nFigure 2 shows the values of connection weights hij (after a row-wise re-scaling of V to set any hii = 1 \u2212 vii to be zero, as a standard convention in SEM [18]) for all d(d \u2212 1) pairs, compared with the pairwise difference of four properties of learned features (i.e. basis functions), estimated by fitting 2D Gabor functions: spatial positions, frequencies, orientations and phases. As is clearly seen, the connection weights tended to be large if the features were similar to each other, except for their phases; the phases were not strongly correlated with the weights as suggested by the fitted curve, while they exhibited a weak tendency to be the same or the opposite (shifted \u00b1\u03c0) to each other. We can also see a weak tendency for the negative weights to have large magnitudes when the pairs have near-orthogonal directions or different frequencies. Figure 3 illustrates how the learned features are associated with the other ones, using iconified representations. We can see: 1) associations with positive weights between features were quite spatially localized and occurred particularly with similar orientations, and 2) those with negative weights especially occurred from cross-oriented features to a target, which were sometimes non-localized and overlapped with the target feature. Notice that in the DN transform (7), these positive weights learned in the SEM act as inhibitory and will suppress the energies of the filters having similar properties.\n\n4.3 Magnetoencephalography (MEG)\n\nBrain activity was recorded in a single healthy subject who received alternating visual, auditory, and tactile stimulation interspersed with rest periods [25]. 
The original signals were measured in 204 channels (sensors) for several minutes with a sampling rate of 75 Hz; the total number of measurements, i.e. sample size, was N = 73,760. As a preprocessing, we applied a band-pass filter (8-30 Hz) and removed some outliers. Also, we subtracted the sample mean and then reduced the dimensionality by PCA to d = 24, with 90% of variance still retained.\n\nFootnote 3: Matlab package is available at http://research.ics.tkk.fi/ica/fastica/. We used the following options: g=tanh, approach=symm, epsilon=10^{\u22126}, MaxNumIterations=10^4, finetune=tanh.\n\nFootnote 4: Available in Imageica Toolbox by Patrik Hoyer, at http://www.cs.helsinki.fi/u/phoyer/software.html\n\n[Figure 2 panels: Position (x-axis: Pairwise Distance), Orientation (x-axis: Pairwise Difference, mod \u00b1\u03c0/2), Frequency (x-axis: Pairwise Difference), Phase (x-axis: Pairwise Difference, mod \u00b1\u03c0); y-axis in each panel: Connection Weight]\n\nFigure 3: Depiction of connection properties between learned basis functions in a manner similar to that used in e.g. [6]. In each small panel, the black bar depicts the position, orientation and length of a single Gabor-like basis function obtained by our method; the red (resp. blue) pattern of superimposed bars is a linear combination of the bars of the other basis functions according to the absolute values of positive (resp. negative) connection weights to the target one. The intensities of red and blue colors were adjusted separately from each other in each panel; the ratio of the maximum positive and negative connection strengths is depicted at the bottom of each small panel by the relative length of horizontal color bars.\n\nFigure 4: Estimated interaction graph (DAG) for MEG data. The red and blue edges respectively denote the positive and negative connections. 
Only the edges with strong connections are drawn, where the absolute threshold value was the same for positive and negative weights. The two manually-inserted contours denote possible clusters of sources (see text).\n\nFigure 4 shows an interaction graph under the DAG constraint. One cluster of components, highlighted in the figure by the manually inserted yellow contour, seems to consist of components related to auditory processing. The components are located in the temporal cortex, and all but one in the left hemisphere. The direction of influence, which we can estimate in the acyclic model, seems to be from the anterior areas to posterior ones. This may be related to top-down influence, since the primary auditory cortex seems to be included in the posterior areas on the left hemisphere; at the end of the chain, the signal goes to the right hemisphere. Such temporal components are typically quite difficult to find because the modulation of their energies is quite weak. Our method may help in grouping such components together by analyzing the energy correlations.\n\nAnother cluster of components consists of low-level visual areas, highlighted by the green contour. It is more difficult to interpret these interactions because the areas corresponding to the components are very close to each other. It seems, however, that here the influences are mainly from the primary visual areas to the higher-order visual areas.\n\n5 Conclusion\n\nWe proposed a new statistical model that uses SEM to model energy-dependencies of latent variables in a standard linear generative model. In particular, with a simplified form of scale-mixture model, the likelihood function was derived without any approximation. The SEM has both acyclic and cyclic variants. 
In the acyclic case, non-Gaussianity is essential for identifiability, while in the cyclic case we introduced the constraint of symmetry, which also guarantees identifiability. We also provided a new generative interpretation of the DN transform based on a nonlinear SEM. Our method exhibited high applicability in three simulations with a synthetic dataset, natural images, and brain signals, respectively.\n\nAppendix: Derivation of Eq. (8)\n\nFrom the uniformity of signs, we have ps(s) = ps(Ds) for any D = diag(\u00b11, . . . , \u00b11); particularly, let Dk correspond to the signs of the k-th orthant Sk of R^d, and S1 = (0,\u221e)^d. Then, the relation\n\n$$\\int_{S_1} d\\boldsymbol{\\sigma}\\, p_\\sigma(\\boldsymbol{\\sigma}) = \\sum_{k=1}^{K} \\int_{S_k} d\\mathbf{s}\\, p_s(\\mathbf{s}) = \\sum_{k=1}^{K} \\int_{S_1} d\\boldsymbol{\\sigma}\\, p_s(\\mathbf{D}_k \\boldsymbol{\\sigma}) = 2^d \\int_{S_1} d\\boldsymbol{\\sigma}\\, p_s(\\boldsymbol{\\sigma})$$\n\nimplies ps(s) = (1/2^d) p\u03c3(s) for any s \u2208 S1; thus ps(s) = (1/2^d) p\u03c3(|s|) for any s \u2208 R^d. Now, y = ln \u03d5(\u03c3) (for every component) and thus $p_\\sigma(\\boldsymbol{\\sigma}) = p_y(\\mathbf{y}) \\prod_i |(\\ln \\phi)'(\\sigma_i)|$, where we assume \u03d5 is differentiable. Let \u03c8(s) := ln \u03d5(|s|) and \u03c8\u2032(s) := sign(s)(ln \u03d5)\u2032(|s|). Then it follows that $p_s(\\mathbf{s}) = (1/2^d)\\, p_y(\\psi(\\mathbf{s})) \\prod_i |\\psi'(s_i)|$, where \u03c8(s) acts component-wise. Since y maps linearly to r with the absolute Jacobian |det V|, we have $p_y(\\mathbf{y}) = |\\det \\mathbf{V}| \\prod_i \\rho(r_i)$; combining it with ps above, we obtain Eq. (8).\n\nAcknowledgements\n\nWe would like to thank Jes\u00fas Malo and Valero Laparra for inspiring this work, Michael Gutmann and Patrik Hoyer for helpful discussions and providing codes for fitting Gabor functions and visualization. The MEG data was kindly provided by Pavan Ramkumar and Riitta Hari. J.H. was partially supported by JSPS Research Fellowships for Young Scientists.\n\nReferences\n\n[1] S. Amari, A. Cichocki, and H. H. 
Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, volume 8, 1996.\n[2] A. J. Bell and T. J. Sejnowski. The \u2018independent components\u2019 of natural scenes are edge filters. Vision Res., 37:3327\u20133338, 1997.\n[3] K. A. Bollen. Structural Equations with Latent Variables. Wiley, New York, 1989.\n[4] M. Carandini, D. J. Heeger, and J. A. Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17:8621\u20138644, 1997.\n[5] A. Cichocki and P. Georgiev. Blind source separation algorithms with matrix constraints. IEICE Trans. Fundamentals, E86-A(3):522\u2013531, 2003.\n[6] P. Garrigues and B. A. Olshausen. Learning horizontal connections in a sparse coding model of natural images. In Advances in Neural Information Processing Systems, volume 20, pages 505\u2013512, 2008.\n[7] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181\u2013197, 1992.\n[8] P. O. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Sch\u00f6lkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, volume 21, pages 689\u2013696, 2009.\n[9] A. Hyv\u00e4rinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626\u2013634, 1999.\n[10] A. Hyv\u00e4rinen and P. O. Hoyer. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Comput., 12(7):1705\u20131720, 2000.\n[11] A. Hyv\u00e4rinen, P. O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Comput., 13(7):1527\u20131558, 2001.\n[12] A. Hyv\u00e4rinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics \u2013 A probabilistic approach to early computational vision. Springer-Verlag, 2009.\n[13] A. 
Hyv\u00e4rinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.\n[14] Y. Karklin and M. S. Lewicki. A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Comput., 17:397\u2013423, 2005.\n[15] Y. Karklin and M. S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83\u201386, January 2009.\n[16] M. Kawanabe and K.-R. M\u00fcller. Estimating functions for blind separation when sources have variance dependencies. Journal of Machine Learning Research, 6:453\u2013482, 2005.\n[17] U. K\u00f6ster and A. Hyv\u00e4rinen. A two-layer model of natural stimuli estimated with score matching. Neural Comput., 22:2308\u20132333, 2010.\n[18] G. Lacerda, P. Spirtes, J. Ramsey, and P. Hoyer. Discovering cyclic causal models by independent components analysis. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI\u201908), pages 366\u2013374, 2008.\n[19] S. Lyu. Divisive normalization: Justification and effectiveness as efficient coding transform. In Advances in Neural Information Processing Systems 23, pages 1522\u20131530, 2010.\n[20] J. Malo, I. Epifanio, R. Navarro, and E. P. Simoncelli. Nonlinear image representation for efficient perceptual coding. IEEE Trans. Image Process., 15(1):68\u201380, 2006.\n[21] J. Malo and V. Laparra. Psychophysically tuned divisive normalization approximately factorizes the PDF of natural images. Neural Comput., 22(12):3179\u20133206, 2010.\n[22] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607\u2013609, 1996.\n[23] S. Osindero, M. Welling, and G. E. Hinton. Topographic product models applied to natural scene statistics. Neural Comput., 18:381\u2013414, 2006.\n[24] J. Pearl. 
On the statistical interpretation of structural equations. Technical Report R-200, UCLA Cognitive Systems Laboratory, 1993.\n[25] P. Ramkumar, L. Parkkonen, R. Hari, and A. Hyv\u00e4rinen. Characterization of neuromagnetic brain rhythms over time scales of minutes using spatial independent component analysis. Human Brain Mapping, 2011. In press.\n[26] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8), 2001.\n[27] S. Shimizu, P. O. Hoyer, A. Hyv\u00e4rinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003\u20132030, 2006.\n[28] E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representation. Annu. Rev. Neurosci., 24:1193\u20131216, 2001.\n[29] R. Valerio and R. Navarro. Optimal coding through divisive normalization models of V1 neurons. Network: Computation in Neural Systems, 14:579\u2013593, 2003.\n[30] H. Valpola, M. Harva, and J. Karhunen. Hierarchical models of variance sources. Signal Processing, 84(2):267\u2013282, 2004.\n[31] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:359\u2013366, 1998.\n[32] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In Advances in Neural Information Processing Systems, volume 12, pages 855\u2013861, 2000.\n[33] K. Zhang and A. Hyv\u00e4rinen. Source separation and higher-order causal analysis of MEG and EEG. In Proceedings of the Twenty-Sixth Conference (UAI 2010), pages 709\u2013716, 2010.", "award": [], "sourceid": 1061, "authors": [{"given_name": "Jun-ichiro", "family_name": "Hirayama", "institution": null}, {"given_name": "Aapo", "family_name": "Hyv\u00e4rinen", "institution": null}]}