{"title": "Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 2544, "page_last": 2552, "abstract": "The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise.   In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement a more efficient version of a Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case.  The algorithm is based on a special form of gradient iteration (different from gradient descent).   We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. 
We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case.", "full_text": "Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis\n\nJames Voss\nOhio State University\nComputer Science and Engineering,\n2015 Neil Avenue, Dreese Labs 586.\nColumbus, OH 43210\nvossj@cse.ohio-state.edu\n\nLuis Rademacher\nOhio State University\nComputer Science and Engineering,\n2015 Neil Avenue, Dreese Labs 495.\nColumbus, OH 43210\nlrademac@cse.ohio-state.edu\n\nMikhail Belkin\nOhio State University\nComputer Science and Engineering,\n2015 Neil Avenue, Dreese Labs 597.\nColumbus, OH 43210\nmbelkin@cse.ohio-state.edu\n\nAbstract\n\nThe performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise.\n\nIn our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows:\n\n1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions.\n\n2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. 
We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case.\n\n1 Introduction and Related Works\n\nIn the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + \u03b7 where A is a linear mixing matrix, b \u2208 R^d is a constant vector, s is a latent random vector with independent coordinates, and \u03b7 is an unknown random noise independent of s. For simplicity, we assume A \u2208 R^{d\u00d7d} is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed.\n\nArguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process:\n\n(1) The data is centered and whitened, that is, made to have identity covariance matrix. 
This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R.\n\n(2) Recover the orthogonal matrix R.\n\nMost practical ICA algorithms differ only in the second step. In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R.\n\nStep 1 of ICA is affected by the addition of Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see [2] and [3]) consider the case where the noise \u03b7 is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s.\n\nIn [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.^1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. 
However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d^4 to d^3, and also allowing for a fully vectorized implementation.\n\nWe also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r \u2212 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d^3) for step 1 and O(N d^2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration.\n\nInterestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to FastICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of FastICA for more general quasi-orthogonalized data.\n\nWe present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. 
We also note that the GI-ICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge.net/projects/giica/.\n\n^1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals.\n\nFinally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise.\n\nThe paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allow for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 and the Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. 
Section 6 discusses numerical experiments demonstrating the applicability of our techniques.\n\nRelated Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor.\n\nWe note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. 
Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise.\n\n2 Using Cumulants to Orthogonalize the Independent Components\n\nProperties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Q_x the fourth order cumulant tensor for the random vector x. So, (Q_x)_{ijkl} is the cross-cumulant between the random variables x_i, x_j, x_k, and x_l, which we alternatively denote as Cum(x_i, x_j, x_k, x_l). Cumulant tensors are symmetric, i.e. (Q_x)_{ijkl} is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants):\n\n1. (Multilinearity) Cum(\u03b1x_i, x_j, x_k, x_l) = \u03b1 Cum(x_i, x_j, x_k, x_l) for random vector x and scalar \u03b1. If y is a random variable, then Cum(x_i + y, x_j, x_k, x_l) = Cum(x_i, x_j, x_k, x_l) + Cum(y, x_j, x_k, x_l).\n\n2. (Independence) If x_i and x_j are independent random variables, then Cum(x_i, x_j, x_k, x_l) = 0. When x and y are independent, Q_{x+y} = Q_x + Q_y.\n\n3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables.\n\nThe first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by \u03ba_r(x) the order-r univariate cumulant, which is equivalent to the cross-cumulant of x with itself r times: \u03ba_r(x) := Cum(x, x, . . . , x) (where x appears r times). 
Univariate r-cumulants are additive for independent random variables, i.e. 
\u03ba_r(x + y) = \u03ba_r(x) + \u03ba_r(y), and homogeneous of degree r, i.e. \u03ba_r(\u03b1x) = \u03b1^r \u03ba_r(x).\n\nQuasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + \u03b7 gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q \u2208 R^{d\u00d7d\u00d7d\u00d7d} and M \u2208 R^{d\u00d7d}, Q(M) is the matrix such that\n\nQ(M)_{ij} := \u2211_{k=1}^{d} \u2211_{l=1}^{d} Q_{ijkl} m_{lk} .    (1)\n\nWe can use this operation to orthogonalize the latent random signals.\n\nDefinition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that WA = RD.\n\nWe will need the following results from [3]. Here we use A_q to denote the qth column of A.\n\nLemma 2.2. Let M \u2208 R^{d\u00d7d} be an arbitrary matrix. Then, Q_x(M) = A D A^T where D is a diagonal matrix with entries d_{qq} = \u03ba_4(s_q) A_q^T M A_q.\n\nTheorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Q_x(I), and let C = Q_x(M^{\u22121}). Then C = A D A^T where D is a diagonal matrix with entries d_{qq} = 1/\u2016A_q\u2016_2^2. In particular, C is positive definite, and for any factorization B B^T of C, B^{\u22121} is a quasi-orthogonalization matrix.\n\n3 Quasi-Orthogonalization using Cumulant Hessians\n\nWe have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method naively requires the estimation of O(d^4) terms from data. There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. 
The Hessian version requires only O(d^3) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations.\n\nLet H_u denote the Hessian operator with respect to a vector u \u2208 R^d. The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]).\n\nLemma 3.1. H_u(\u03ba_4(u^T x)) = A D A^T where d_{qq} = 12 (u^T A_q)^2 \u03ba_4(s_q).\n\nIn Lemma 3.1, the diagonal entries can be rewritten as d_{qq} = 12 \u03ba_4(s_q)(A_q^T (u u^T) A_q). By comparing with Lemma 2.2, we see that applying Q_x against a symmetric, rank one matrix u u^T can be rewritten in terms of the Hessian operations: Q_x(u u^T) = (1/12) H_u(\u03ba_4(u^T x)). This formula extends to arbitrary symmetric matrices by the following Lemma.\n\nLemma 3.2. Let M be a symmetric matrix with eigen decomposition U \u039b U^T such that U = (u_1, u_2, . . . , u_d) and \u039b = diag(\u03bb_1, \u03bb_2, . . . , \u03bb_d). Then, Q_x(M) = (1/12) \u2211_{i=1}^{d} \u03bb_i H_{u_i} \u03ba_4(u_i^T x).\n\nThe matrices I and M^{\u22121} in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1.\n\n4 Gradient Iteration ICA\n\nIn the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasi-orthogonalization matrix. Then, define y := Wx = WAs + W\u03b7. Note that since \u03b7 is Gaussian noise, so is W\u03b7. There exists a rotation matrix R and a diagonal matrix D such that WA = RD. Let s\u0303 := Ds. The coordinates of s\u0303 are still independent random variables. Gaussian noise makes recovering the scaling matrix D impossible. 
We aim to recover the rotation matrix R.\n\nAlgorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix.\nfunction FINDQUASIORTHOGONALIZATIONMATRIX(x)\n    Let M = (1/12) \u2211_{i=1}^{d} H_u \u03ba_4(u^T x)|_{u=e_i}. See Equation (4) for the estimator.\n    Let U \u039b U^T give the eigendecomposition of M^{\u22121}\n    Let C = \u2211_{i=1}^{d} \u03bb_i H_u \u03ba_4(u^T x)|_{u=U_i}. See Equation (4) for the estimator.\n    Factorize C as B B^T.\n    return B^{\u22121}\nend function\n\nTo see why recovery of D is impossible, we note that a white Gaussian random variable \u03b7_1 has independent components. It is impossible to distinguish between the case where \u03b7_1 is part of the signal, i.e. WA(s + \u03b7_1) + W\u03b7, and the case where A\u03b7_1 is part of the additive Gaussian noise, i.e. WAs + W(A\u03b7_1 + \u03b7), when s, \u03b7_1, and \u03b7 are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time.\n\nFast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u \u2208 R^d. Based on some criterion (typically maximization or minimization of f), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons:\n\n1. As an approximate Newton method, FastICA requires computation of \u2207_u f and a quick-to-compute estimate of (H_u(f))^{\u22121} at each iterative step. Due to the estimate, the computation runs in O(N d) time, where N is the number of samples.\n\n2. 
The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13].\n\nWe note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on \u2207_u \u03ba_r. We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA.\n\nWhile cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs \u201crobust cumulants\u201d which are only minimally affected by sparse noise. Welling\u2019s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality.\n\nLet G be a function of univariate random variables that satisfies the additivity, degree-r (r \u2265 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r \u2212 1 convergence. In particular, if G is \u03ba_3, then we obtain quadratic convergence; and if G is \u03ba_4, we obtain cubic convergence. Lemma 4.1 helps explain why this is true.\n\nLemma 4.1. 
\u2207_v G(v \u00b7 y) = r \u2211_{i=1}^{d} (v \u00b7 R_i)^{r\u22121} G(s\u0303_i) R_i.\n\nIf we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r \u2212 1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r \u2212 1 convergence.\n\nTheorem 4.2. If for a unit vector input v to Algorithm 2 h = arg max_i |(v \u00b7 R_i)^{r\u22122} G(s\u0303_i)| has a unique answer, then v has order r \u2212 1 convergence to R_h up to sign. In particular, if the following conditions are met: (1) There exists a coordinate random variable s_i of s such that G(s_i) \u2260 0. (2) v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S^{d\u22121}. Then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r \u2212 1.\n\nAlgorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of \u2207_v \u03ba_3 and \u2207_v \u03ba_4, which can be used as practical choices of \u2207_v G on real data.\nfunction GI-ICA(v, y)\n    repeat\n        v \u2190 \u2207_v G(v^T y)\n        v \u2190 v/\u2016v\u2016_2\n    until Convergence\n    return v\nend function\n\nAlgorithm 3 Algorithm for ICA in the presence of Gaussian noise. A\u0303 recovers A up to column order and scaling. 
R^T W is the demixing matrix for the observed random vector x.\nfunction GAUSSIANROBUSTICA(G, x)\n    W = FINDQUASIORTHOGONALIZATIONMATRIX(x)\n    y = W x\n    R_columns = \u2205\n    for i = 1 to d do\n        Draw v from S^{d\u22121} \u2229 span(R_columns)^\u22a5 uniformly at random.\n        R_columns = R_columns \u222a {GI-ICA(v, y)}\n    end for\n    Construct a matrix R using the elements of R_columns as columns.\n    s\u0303 = R^T y\n    A\u0303 = (R^T W)^{\u22121}\n    return A\u0303, s\u0303\nend function\n\nBy convergence up to sign, we include the possibility that v oscillates between R_h and \u2212R_h on alternating steps. This can occur if G(s\u0303_i) < 0 and r is odd. Due to space limitations, the proof is omitted.\n\nRecovering all Independent Components. As a Corollary to Theorem 4.2 we get:\n\nCorollary 4.3. Suppose R_1, R_2, . . . , R_k are known for some k < d. Suppose there exists i > k such that G(s_i) \u2260 0. If v is drawn uniformly at random from S^{d\u22121} \u2229 span(R_1, . . . , R_k)^\u22a5 where S^{d\u22121} denotes the unit sphere in R^d, then Algorithm 2 with input v converges to a new column of R almost surely.\n\nSince the indexing of R is arbitrary, Corollary 4.3 gives a solution to noisy ICA, in Algorithm 3. In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G.\n\n5 Time Complexity Analysis and Estimation of Cumulants\n\nTo implement Algorithms 1 and 2 requires the estimation of functions from data. We will limit our discussion to estimation of the third and fourth cumulants, as lower order cumulants are more statistically stable to estimate than higher order cumulants. \u03ba_3 is useful in Algorithm 2 for non-symmetric distributions. 
However, since \u03ba_3(s_i) = 0 whenever s_i has a symmetric distribution, it is plausible that \u03ba_3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use \u03ba_4 for G. Alternatively, one can fall back to \u03ba_4 from \u03ba_3 when \u03ba_3 is detected to be near 0.\n\nDenote by z^{(1)}, z^{(2)}, . . . , z^{(N)} the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by k_r(z^{(i)}) the k-statistic sample estimate of \u03ba_r(z). Letting m_r(z^{(i)}) := (1/N) \u2211_{i=1}^{N} (z^{(i)} \u2212 z\u0304)^r give the rth sample central moment, then\n\nk_3(z^{(i)}) := N^2 m_3(z^{(i)}) / ((N \u2212 1)(N \u2212 2)) ,    k_4(z^{(i)}) := N^2 [ (N + 1) m_4(z^{(i)}) \u2212 3(N \u2212 1) m_2(z^{(i)})^2 ] / ((N \u2212 1)(N \u2212 2)(N \u2212 3))\n\ngives the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following Lemma shows how to obtain unbiased estimates:\n\nLemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z^{(i)} be an iid sample of z. Let \u03b1 \u2208 N^d be a multi-index. 
Then \u2202_u^\u03b1 k_r(u \u00b7 z^{(i)}) is an unbiased estimate for \u2202_u^\u03b1 \u03ba_r(u \u00b7 z).\n\nIf we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are:\n\n\u2207_u k_3(u \u00b7 y) = (N \u2212 1)^{\u22121} (N \u2212 2)^{\u22121} 3N \u2211_{i=1}^{N} (u \u00b7 y^{(i)})^2 y^{(i)}    (2)\n\n\u2207_u k_4(u \u00b7 y) = [ N^2 / ((N \u2212 1)(N \u2212 2)(N \u2212 3)) ] { 4 ((N + 1)/N) \u2211_{i=1}^{N} (u \u00b7 y^{(i)})^3 y^{(i)} \u2212 12 ((N \u2212 1)/N^2) ( \u2211_{i=1}^{N} (u \u00b7 y^{(i)})^2 ) ( \u2211_{i=1}^{N} (u \u00b7 y^{(i)}) y^{(i)} ) }    (3)\n\nH_u k_4(u \u00b7 x) = [ 12N^2 / ((N \u2212 1)(N \u2212 2)(N \u2212 3)) ] { ((N + 1)/N) \u2211_{i=1}^{N} (u \u00b7 x^{(i)})^2 (x x^T)^{(i)} \u2212 ((N \u2212 1)/N^2) ( \u2211_{i=1}^{N} (u \u00b7 x^{(i)})^2 ) ( \u2211_{i=1}^{N} (x x^T)^{(i)} ) \u2212 ((2N \u2212 2)/N^2) ( \u2211_{i=1}^{N} (u \u00b7 x^{(i)}) x^{(i)} ) ( \u2211_{i=1}^{N} (u \u00b7 x^{(i)}) x^{(i)} )^T }    (4)\n\nUsing (4) to estimate H_u \u03ba_4(u^T x) from data when implementing Algorithm 1, the resulting quasi-orthogonalization algorithm runs in O(N d^3) time. Using (2) or (3) to estimate \u2207_v G(v^T y) (with G chosen to be \u03ba_3 or \u03ba_4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d^2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved.\n\n6 Simulation Results\n\nIn Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA the baselines were run using default settings. All tests were done using artificially generated data. 
In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In Figure 1, comparison on five distributions indicates that each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the t-distribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-\u03ba3 inadmissible.\n\nWhen generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari Index introduced in [1]. Let B\u0302 denote the approximate demixing matrix returned by an ICA algorithm, and let M = B\u0302A. Then, the Amari index is given by:\n\nE := \u2211_{i=1}^{n} ( \u2211_{j=1}^{n} |m_{ij}| / max_k |m_{ik}| \u2212 1 ) + \u2211_{j=1}^{n} ( \u2211_{i=1}^{n} |m_{ij}| / max_k |m_{kj}| \u2212 1 ).\n\nThe Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix PD (where P is a permutation matrix and D is a diagonal matrix).\n\nFrom the noiseless data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth order methods (GI-\u03ba4, JADE, and \u03ba4-FastICA) perform comparably. The difference between GI-\u03ba4 and \u03ba4-FastICA is not statistically significant over 50 runs with 100 000 samples. 
We note that GI-\u03ba4 under whitening and \u03ba4-FastICA have the same update step (up to a slightly different choice of estimators), with GI-\u03ba4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2\u03c3 confidence interval on the mean Amari index. In all cases, error bars for our algorithms are provided, and error bars for the baseline algorithms are provided when they do not hinder readability.\n\nFigure 1: Comparison of ICA algorithms under various levels of noise. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10.\n\nIt is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-\u03ba4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-\u03ba4 outperforms all other algorithms (given sufficient samples) including the log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GI-\u03ba4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary to be robust to Gaussian noise.\n\nRun times were indeed reasonably fast. For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-\u03ba4 (including the orthogonalization step) had an average running time^2 of 0.19 seconds using PCA whitening, and 0.23 seconds under quasi-orthogonalization. 
The\ncorresponding average numbers of iterations to convergence per independent component (at 0.0001\nerror) were 4.16 and 4.08. In the following table, we report the mean number of steps to convergence\n(per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that\nonce sufficiently many samples were taken, the number of steps to convergence becomes remarkably\nsmall.\n\nNumber of data pts | 500 | 1000 | 5000 | 10000 | 50000 | 100000\nwhitening+GI-\u03ba4: mean num steps | 11.76 | 5.92 | 4.99 | 4.59 | 4.35 | 4.16\nquasi-orth.+GI-\u03ba4: mean num steps | 213.92 | 65.95 | 4.48 | 4.36 | 4.06 | 4.08\n\nFootnote 2: Using a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM.\n\n7 Acknowledgments\n\nThis work was supported by NSF grant IIS 1117707.\n\n[Figure 1 panels: ICA Comparison on 5 distributions, for d=5 and d=10, each with noiseless data, 25% noise magnitude, and 50% noise magnitude. x-axis: Number of Samples (100 to 100000); y-axis: Amari Index. Curves: GI-\u03ba4 (white), GI-\u03ba4 (quasi-orthogonal), \u03ba4-FastICA, log cosh-FastICA, JADE.]\n\nReferences\n[1] S. Amari, A. Cichocki, H. H. Yang, et al. A new learning algorithm for blind signal separation. Advances in neural information processing systems, pages 757\u2013763, 1996.\n[2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384\u20132392, 2012.\n[3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP, volume 30: COLT, pages 270\u2013287, 2013.\n[4] C. M. Bishop. Variational principal components. Proc. Ninth Int. Conf. on Artificial Neural Networks. ICANN, 1:509\u2013514, 1999.\n[5] E. J. Cand\u00e8s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, abs/0912.3599, 2009.\n[6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362\u2013370. IET, 1993.\n[7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http://perso.telecom-paristech.fr/~cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013].\n[8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010.\n[9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. Image Processing, IEEE Transactions on, 20(12):3419\u20133430, 2011.\n[10] H. G\u00e4vert, J. Hurri, J. S\u00e4rel\u00e4, and A. Hyv\u00e4rinen. Matlab FastICA v 2.5. http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013].\n[11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. 
In ITCS, pages 11\u201320, 2013.\n[12] A. Hyv\u00e4rinen. Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood. Neurocomputing, 22(1-3):49\u201367, 1998.\n[13] A. Hyv\u00e4rinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626\u2013634, 1999.\n[14] A. Hyv\u00e4rinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411\u2013430, 2000.\n[15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, part 2. van Nostrand, 1962.\n[16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408\u2013420, 2008.\n[17] L. Mattner. What are cumulants. Documenta Mathematica, 4:601\u2013622, 1999.\n[18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139\u2013160, 2009.\n[19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. http://sourceforge.net/projects/giica/, 2013. [Online].\n[20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405\u2013412, 2005.\n[21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897\u2013902, 2000.\n[22] V. Zarzoso and P. Comon. How fast is FastICA. EUSIPCO, 2006.\n", "award": [], "sourceid": 1199, "authors": [{"given_name": "James", "family_name": "Voss", "institution": "Ohio State University"}, {"given_name": "Luis", "family_name": "Rademacher", "institution": "Ohio State University"}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": "Ohio State University"}]}