{"title": "ICA based on a Smooth Estimation of the Differential Entropy", "book": "Advances in Neural Information Processing Systems", "page_first": 433, "page_last": 440, "abstract": "In this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance on the proposed ICA algorithm is demonstrated on standard tests in comparison with state-of-the-art techniques.", "full_text": "ICA based on a Smooth Estimation of the Differential Entropy\n\nLev Faivishevsky\nSchool of Engineering, Bar-Ilan University\nlevtemp@gmail.com\n\nJacob Goldberger\nSchool of Engineering, Bar-Ilan University\ngoldbej@eng.biu.ac.il\n\nAbstract\n\nIn this paper we introduce the MeanNN approach for estimation of main information theoretic measures such as differential entropy, mutual information and divergence. As opposed to other nonparametric approaches the MeanNN results in smooth differentiable functions of the data samples with clear geometrical interpretation. Then we apply the proposed estimators to the ICA problem and obtain a smooth expression for the mutual information that can be analytically optimized by gradient descent methods. The improved performance of the proposed ICA algorithm is demonstrated on several test examples in comparison with state-of-the-art techniques.\n\n1 Introduction\n\nIndependent component analysis (ICA) is the problem of recovering a latent random vector from observations of unknown linear functions of that vector. Assume that data $S \\in \\mathbb{R}^d$ are generated by d independent sources. 
We observe $X = AS$, where $A$ is an unknown square matrix called the mixing matrix. We are given a dataset of repeated observations $\\{x_1, ..., x_n\\}$, and our goal is to recover the linear transformation $A$ and the sources $s_1, ..., s_n$ that generated our data, $x_i = A s_i$.\n\nGiven the minimal statement of the problem, it has been shown [6] that one can recover the original sources up to a scaling and a permutation, provided that at most one of the underlying sources is Gaussian and the rest are non-Gaussian. Upon pre-whitening the observed data, the problem reduces to a search over rotation matrices in order to recover the source and mixing matrix in the sense described above [10]. We will assume henceforth that such pre-processing has been done. Specifying distributions for the components of X, one obtains a parametric model that can be estimated via maximum likelihood [3, 4]. Working with $W = A^{-1}$ as the parametrization, one readily obtains a gradient or fixed-point algorithm that yields an estimate $\\hat{W}$ and provides estimates of the latent components via $\\hat{S} = \\hat{W} X$ [10].\n\nIn practical applications the distributions of the d components of X are unknown. Therefore it is preferable to consider the ICA model as a semiparametric model in which the distributions of the components of X are left unspecified. The problem is then to find a suitable contrast function, i.e. a target function to be minimized in order to estimate the ICA model. The earliest ICA algorithms were based on contrast functions defined in terms of expectations of a single fixed nonlinear function, chosen in an ad hoc manner [5]. 
More sophisticated algorithms have been obtained by careful choice of a single fixed nonlinear function, such that the expectations of this function yield a robust approximation to the mutual information [9].\n\nMaximizing the likelihood in the semiparametric ICA model is essentially equivalent to minimizing the mutual information between the components of the estimate $\\hat{S} = \\hat{W} X$ [4]. The use of the mutual information as a contrast function to be minimized in estimating the ICA model is well motivated, quite apart from the link to maximum likelihood [6].\n\nEstimating the mutual information (MI) from a given finite sample set is difficult. Several modern approaches rely on k-nearest neighbor estimates of entropy and mutual information [12, 16]. Recently the Vasicek estimator [17] for the differential entropy of 1D random variables, based on k-nearest neighbor statistics, was applied to ICA [8, 13]. In addition, ICA was studied with another recently introduced MI estimator [16]. However, the derivatives of estimators that are based on order statistics can hardly be computed, and therefore the optimization of such numerical criteria cannot be based on gradient techniques. Also, the resulting numerical criteria tend to have a non-smooth dependency on the sample values. The optimization therefore has to involve computation of the contrast function on a whole grid of searched parameters.\n\nIn addition, such estimators do not make optimal use of all the data contained in the samples of random vectors. Therefore they require significant artificial enlargement of data sets by a technique called data augmentation [13], which replaces each data point in the sample with an R-tuple (R is usually 30) of points given by a statistical procedure with ad hoc parameters. 
An alternative is the Fourier filtering of the estimated values of the evaluated MI estimators [16].\n\nIn the present paper we propose new smooth estimators for the differential entropy, the mutual information and the divergence. The estimators are obtained by a novel approach that averages k-nearest neighbor statistics over all possible values of the order statistic k. The estimators are smooth, and their derivatives can easily be calculated analytically, thus enabling fast gradient optimization techniques. They fully utilize the data contained in a random variable sample. The estimators provide a novel geometrical interpretation of the entropy. When applied to the ICA problem, the proposed estimator leads, for many distributions, to the most precise results known at present.\n\nThe rest of the paper is organized as follows: Section 2 reviews the kNN approach for entropy and divergence estimation, Section 3 introduces the mean estimator for the differential entropy, the mutual information and the divergence, Section 4 describes the application of the proposed estimators to the ICA problem, and Section 5 describes the numerical experiments.\n\n2 kNN Estimators for the Differential Entropy\n\nWe review the nearest neighbor technique for Shannon entropy estimation. The differential entropy of X is defined as:\n\n$$H(X) = -\\int f(x) \\log f(x)\\,dx \\tag{1}$$\n\nWe describe the derivation of the Shannon differential entropy estimate of [11, 18]. Our aim is to estimate $H(X)$ from a random sample $(x_1, ..., x_n)$ of n realizations of a d-dimensional random variable X with unknown density function $f(x)$. The entropy is the average of $-\\log f(x)$. If one had unbiased estimators for $\\log f(x_i)$, one would arrive at an unbiased estimator for the entropy. 
We will estimate $\\log f(x_i)$ by considering the probability density function $P_{ik}(\\epsilon)$ for the distance between $x_i$ and its k-th nearest neighbor (the probability is computed over the positions of all other $n - 1$ points, with $x_i$ kept fixed). The probability $P_{ik}(\\epsilon)\\,d\\epsilon$ is equal to the chance that there is one point within distance $r \\in [\\epsilon, \\epsilon + d\\epsilon]$ from $x_i$, that there are $k - 1$ other points at smaller distances, and that the remaining $n - k - 1$ points have larger distances from $x_i$. Denote the mass of the $\\epsilon$-ball centered at $x_i$ by $p_i(\\epsilon)$, i.e. $p_i(\\epsilon) = \\int_{\\|x - x_i\\| < \\epsilon} f(x)\\,dx$. Applying the trinomial formula we obtain:\n\n$$P_{ik}(\\epsilon) = \\frac{(n-1)!}{1!\\,(k-1)!\\,(n-k-1)!} \\frac{dp_i(\\epsilon)}{d\\epsilon}\\, p_i^{k-1} (1 - p_i)^{n-k-1} \\tag{2}$$\n\nIt can be easily verified that indeed $\\int P_{ik}(\\epsilon)\\,d\\epsilon = 1$. Hence, the expected value of the function $\\log p_i(\\epsilon)$ according to the distribution $P_{ik}(\\epsilon)$ is:\n\n$$E_{P_{ik}(\\epsilon)}(\\log p_i(\\epsilon)) = \\int_0^\\infty P_{ik}(\\epsilon) \\log p_i(\\epsilon)\\,d\\epsilon = k \\binom{n-1}{k} \\int_0^1 p^{k-1} (1 - p)^{n-k-1} \\log p\\,dp = \\psi(k) - \\psi(n) \\tag{3}$$\n\nwhere $\\psi(x)$ is the digamma function (the logarithmic derivative of the gamma function). To verify the last equality, differentiate the identity $\\int_0^1 x^{a-1} (1 - x)^{b-1}\\,dx = \\Gamma(a)\\Gamma(b)/\\Gamma(a + b)$ with respect to the parameter a and recall that $\\Gamma'(x) = \\psi(x)\\Gamma(x)$. The expectation is taken over the positions of all other $n - 1$ points, with $x_i$ kept fixed. Assuming that $f(x)$ is almost constant in the entire $\\epsilon$-ball around $x_i$, we obtain:\n\n$$p_i(\\epsilon) \\approx c_d\\, \\epsilon^d f(x_i) \\tag{4}$$\n\nwhere d is the dimension of x and $c_d$ is the volume of the d-dimensional unit ball ($c_d = \\pi^{d/2} / \\Gamma(1 + d/2)$ for the Euclidean norm). Substituting Eq. (4) into Eq. (3), we obtain:\n\n$$-\\log f(x_i) \\approx \\psi(n) - \\psi(k) + \\log(c_d) + d\\,E(\\log \\epsilon) \\tag{5}$$\n\nwhich finally leads to the unbiased kNN estimator for the differential entropy [11]:\n\n$$H_k(X) = \\psi(n) - \\psi(k) + \\log(c_d) + \\frac{d}{n} \\sum_{i=1}^n \\log \\epsilon_i \\tag{6}$$\n\nwhere $\\epsilon_i$ is the distance from $x_i$ to its k-th nearest neighbor. An alternative proof of the asymptotic unbiasedness and consistency of the kNN estimator is found in [15].\n\nA similar approach can be used to obtain a kNN estimator for the Kullback-Leibler divergence [19]. The estimator works as follows. Let $\\{x_1, ..., x_n\\}$ and $\\{y_1, ..., y_m\\}$ be i.i.d. d-dimensional samples drawn independently from the densities p and q respectively. By definition the divergence is given by:\n\n$$D(p\\|q) = \\int p(x) \\log \\frac{p(x)}{q(x)}\\,dx \\tag{7}$$\n\nThe distance of $x_i$ to its nearest neighbor in $\\{x_j\\}_{j \\neq i}$ is defined as\n\n$$\\rho_n(i) = \\min_{j \\neq i} d(x_i, x_j) \\tag{8}$$\n\nWe also define the distance of $x_i$ to its nearest neighbor in $\\{y_j\\}$:\n\n$$\\nu_m(i) = \\min_{j=1,...,m} d(x_i, y_j) \\tag{9}$$\n\nThen the estimator of [19] is given by\n\n$$\\hat{D}_{n,m} = \\frac{d}{n} \\sum_{i=1}^n \\log \\frac{\\nu_m(i)}{\\rho_n(i)} + \\log \\frac{m}{n-1} \\tag{10}$$\n\nThe authors established asymptotic unbiasedness and mean-square consistency of the estimator (10). The same proofs could be applied to obtain a k-nearest neighbor version of the estimator:\n\n$$\\hat{D}^k_{n,m} = \\frac{d}{n} \\sum_{i=1}^n \\log \\frac{\\nu^k_m(i)}{\\rho^k_n(i)} + \\log \\frac{m}{n-1} \\tag{11}$$\n\nBeing non-parametric, the kNN estimators (6, 11) rely on order statistics. This makes the analytical calculation of the gradient hardly possible. It also leads to a certain lack of smoothness of the estimator value as a function of the sample coordinates. One should also mention that finding the k-nearest neighbor is a computationally intensive problem. 
It becomes necessary to use involved approximate nearest neighbor techniques for large data sets.\n\n3 The MeanNN Entropy Estimator\n\nWe propose a novel approach for entropy estimation as a function of the sample coordinates. It is based on the fact that the kNN estimator (6) is valid for every k. Therefore the differential entropy can also be extracted from a mean of several estimators corresponding to different values of k. Next we consider all the possible values of the order statistic k from 1 to $n - 1$:\n\n$$H_{mean} = \\frac{1}{n-1} \\sum_{k=1}^{n-1} H_k = \\log(c_d) + \\psi(n) + \\frac{1}{n-1} \\sum_{k=1}^{n-1} \\left( -\\psi(k) + \\frac{d}{n} \\sum_{i=1}^n \\log \\epsilon_{i,k} \\right) \\tag{12}$$\n\nwhere $\\epsilon_{i,k}$ is the distance from $x_i$ to its k-th nearest neighbor. Consider the double summation in the last term of Eq. (12). Exchanging the order of summation, the last sum adds, for each sample point $x_i$, the sum of the logs of its distances to all its nearest neighbors in the sample. It is of course equivalent to the sum of the logs of its distances to all other points in the sample set. Hence the mean estimator (12) for the differential entropy can be written as:\n\n$$H_{mean} = \\mathrm{const} + \\frac{d}{n(n-1)} \\sum_{i \\neq j} \\log \\|x_i - x_j\\| \\tag{13}$$\n\nwhere the constant depends only on the sample size and dimensionality. We dub this estimator the MeanNN estimator for differential entropy. It follows that the differential entropy (approximation) has a clear geometric meaning. Up to an additive constant, it is proportional to the log of the product of the distances between each two points in a random i.i.d. sample. This is intuitive, since a higher entropy leads to a larger scattering of the samples, so the pairwise distances grow, resulting in a larger product of all distances. Moreover, the MeanNN estimator (13) is a smooth function of the sample coordinates. Its gradient can be easily found. 
The asymptotic unbiasedness and consistency of the estimator follow from the same properties of the kNN estimator (6). Obviously, the same method gives the mean estimator for the mutual information by use of the well-known equality connecting the mutual information with the marginal and joint entropies:\n\n$$I_{mean}(X; Y) = H_{mean}(X) + H_{mean}(Y) - H_{mean}(X, Y) \\tag{14}$$\n\nWe demonstrate the MeanNN estimator for the entropy in the case of an exponentially distributed random variable, $f(x, \\mu) = \\frac{1}{\\mu} e^{-x/\\mu}$, $x > 0$, $\\mu > 0$. In this case the entropy may be analytically calculated as $H = \\log \\mu + 1$. We compared the performance of the MeanNN estimator with the k-nearest neighbor estimator (6) for various values of k. Results are given in Table 1. One may see that the mean square error of the MeanNN estimator is comparable to or worse than that of the traditional kNN estimators. But the standard deviation of the estimator values is lowest for the MeanNN estimator. Further we will apply MeanNN for optimization of a certain criterion based on the entropy. In such cases the most important characteristic of an estimator is its monotonic dependency on the estimated value, and the prediction of the exact value of the entropy is less important. 
Therefore one may conclude that MeanNN is better suited for optimization of entropy-based numerical criteria.\n\nEstimator                                1NN     4NN     10NN    MeanNN\nMean square error of entropy estimation  0.0290  0.0136  0.0117  0.0248\nSTD of estimator values                  0.1698  0.1166  0.1079  0.1029\n\nTable 1: Performance of MeanNN entropy estimator in comparison with kNN entropy estimators. 100 samples of random variable, 10 various values of the $\\mu$ parameter, 100 repetitions.\n\nTo obtain the estimator for the divergence we apply the same mean approach to estimator (11), setting $m = n - 1$:\n\n$$\\hat{D}^{mean}_{n,n-1} = \\frac{d}{n(n-1)} \\sum_{k=1}^{n-1} \\sum_{i=1}^n \\log \\frac{\\nu^k_m(i)}{\\rho^k_n(i)} = \\frac{d}{n(n-1)} \\left( \\sum_{i,j} \\log d(x_i, y_j) - \\sum_{i \\neq j} \\log d(x_i, x_j) \\right) \\tag{15}$$\n\nThe mean estimator for the divergence has a clear geometric interpretation. If the product of all distances inside one sample is small in comparison with the product of pairwise distances between the samples, then one concludes that the divergence is large, and vice versa.\n\n4 The MeanNN ICA Algorithm\n\nAs many approaches do, we will use a contrast function\n\n$$J(Y) = \\int q(y_1, ..., y_d) \\log \\frac{q(y_1, ..., y_d)}{\\prod_{i=1}^d q(y_i)}\\,d\\mu = D\\left( q(y_1, ..., y_d) \\,\\Big\\|\\, \\prod_{i=1}^d q(y_i) \\right) = \\sum_{i=1}^d H(Y_i) - H(Y_1, ..., Y_d) \\tag{16}$$\n\nConsidering Y as a linear function of X, $Y = WX$, it is easily verified [3, 7, 10] that\n\n$$J(Y) = \\sum_{t=1}^d H(Y_t) - H(X_1, ..., X_d) - \\log(|W|) \\tag{17}$$\n\nIn particular, the change in the entropy of the joint distribution under a linear transformation is simply the logarithm of the Jacobian of the transformation. As we will assume the X's to be pre-whitened, W will be restricted to rotation matrices, therefore $\\log(|W|) = 0$ and the minimization of $J(Y)$ reduces to finding\n\n$$\\hat{W} = \\arg\\min_W \\; H(Y_1) + ... + H(Y_d) \\tag{18}$$\n\nDenoting the rows of the matrix W by $W = (w_1, ..., w_d)^\\top$, we can explicitly write the minimization expression as a function of W:\n\n$$\\hat{W} = \\arg\\min_W \\sum_{t=1}^d H(w_t^\\top X) \\tag{19}$$\n\nThen we can plug the MeanNN entropy estimator into Eq. (19) to obtain (after omitting irrelevant constants) an explicit contrast function to minimize:\n\n$$\\hat{W} = \\arg\\min_W S(W) = \\arg\\min_W \\sum_{t=1}^d \\sum_{i \\neq j}^n \\log\\left( (w_t^\\top (x_i - x_j))^2 \\right) \\tag{20}$$\n\nThe gradient of the contrast function $S(W)$ with respect to a rotation matrix W may be found with the assistance of the so-called Givens rotations (see e.g. [14]). In this parametrization a rotation matrix $W \\in \\mathbb{R}^{d \\times d}$ is represented by a product of $d(d - 1)/2$ plane rotations:\n\n$$W = \\prod_{s=1}^{d-1} \\prod_{t=s+1}^d G_{st} \\tag{21}$$\n\nwhere $G_{st}$ is a rotation matrix corresponding to a rotation in the st plane by an angle $\\lambda_{st}$. 
It is the identity matrix except that its elements (s, s), (s, t), (t, s), (t, t) form a two-dimensional (2-D) rotation matrix:\n\n$$\\begin{bmatrix} G_{st}(s,s) & G_{st}(s,t) \\\\ G_{st}(t,s) & G_{st}(t,t) \\end{bmatrix} = \\begin{bmatrix} \\cos(\\lambda_{st}) & \\sin(\\lambda_{st}) \\\\ -\\sin(\\lambda_{st}) & \\cos(\\lambda_{st}) \\end{bmatrix} \\tag{22}$$\n\nThe gradient of a single rotation matrix $G_{st}$ with respect to $\\lambda_{st}$ is a zero matrix except for the elements (s, s), (s, t), (t, s), (t, t), for which\n\n$$\\frac{\\partial}{\\partial \\lambda_{st}} \\begin{bmatrix} G_{st}(s,s) & G_{st}(s,t) \\\\ G_{st}(t,s) & G_{st}(t,t) \\end{bmatrix} = \\begin{bmatrix} -\\sin(\\lambda_{st}) & \\cos(\\lambda_{st}) \\\\ -\\cos(\\lambda_{st}) & -\\sin(\\lambda_{st}) \\end{bmatrix} \\tag{23}$$\n\nIt can easily be verified that the gradient of the contrast function (20) is given by\n\n$$\\frac{\\partial}{\\partial \\lambda_{st}} S = \\sum_{q,r=1}^d \\frac{\\partial S}{\\partial w_{qr}} \\frac{\\partial w_{qr}}{\\partial \\lambda_{st}} = 2 \\sum_{q,r=1}^d \\sum_{i \\neq j}^n \\frac{x_{ir} - x_{jr}}{w_q^\\top (x_i - x_j)} \\left[ \\prod_{u=1}^{d-1} \\prod_{v=u+1}^d \\tilde{G}_{uv} \\right]_{qr} \\tag{24}$$\n\nwhere $\\tilde{G}_{uv} = \\frac{\\partial}{\\partial \\lambda_{st}} G_{uv}$ if both $u = s$ and $v = t$, and $\\tilde{G}_{uv} = G_{uv}$ otherwise.\n\nThe contrast function $S(W)$ and its gradient $\\frac{\\partial}{\\partial \\lambda_{st}} S$ may in theory suffer from discontinuities if a row $w_t$ is perpendicular to a vector $x_i - x_j$. To overcome this numerical difficulty we utilize a smoothed version of the contrast function, $S(W, \\epsilon)$, and give the expression for its gradient:\n\n$$S(W, \\epsilon) = \\sum_{t=1}^d \\sum_{i \\neq j}^n \\log\\left( (w_t^\\top (x_i - x_j))^2 + \\epsilon \\right) \\tag{25}$$\n\n$$\\frac{\\partial}{\\partial \\lambda_{st}} S(W, \\epsilon) = \\sum_{q,r=1}^d \\frac{\\partial S}{\\partial w_{qr}} \\frac{\\partial w_{qr}}{\\partial \\lambda_{st}} = 2 \\sum_{q,r=1}^d \\sum_{i \\neq j}^n \\frac{w_q^\\top (x_i - x_j)\\,(x_{ir} - x_{jr})}{(w_q^\\top (x_i - x_j))^2 + \\epsilon} \\left[ \\prod_{u=1}^{d-1} \\prod_{v=u+1}^d \\tilde{G}_{uv} \\right]_{qr} \\tag{26}$$\n\nFor the optimization of the contrast function we apply the conjugate gradient method. 
The algorithm is summarized in Figure 1.\n\nInput: Data vectors $x_1, x_2, ..., x_n \\in \\mathbb{R}^d$, assumed whitened\nOutput: Unmixing matrix W (an estimate of $A^{-1}$)\nMethod:\n\n\u2022 Initialize $d(d - 1)/2$ rotation angles $\\lambda_{st}$\n\u2022 Apply the conjugate gradient optimization to the contrast function $S(W(\\lambda))$ (25) to find the optimal angles\n\u2022 Reconstruct the rotation matrix W from the found angles by Givens rotations (21)\n\nFigure 1: The MeanNN ICA algorithm\n\n5 Experiments\n\nFirst we study the set of 9 problems proposed by [2]. Each problem corresponds to a 1D probability distribution $q(x)$. One thousand pairs of random numbers x and y are mixed as $x' = x \\cos\\phi + y \\sin\\phi$, $y' = -x \\sin\\phi + y \\cos\\phi$ with a random angle $\\phi$ common to all pairs (i.e. A is a pure rotation). We applied the conjugate gradient method for the optimization of the contrast function (25) with $\\epsilon = 1/n = 0.001$ in order to recover this rotation matrix. This was repeated 100 times with different angles $\\phi$ and with different random sets of pairs (x, y). To assess the quality of the estimator $\\hat{A}$ (or, equivalently, of the back transformation $\\hat{W} = \\hat{A}^{-1}$), we use the Amari performance index $P_{err}$ from [1]:\n\n$$P_{err} = \\frac{1}{2d} \\sum_{i,j=1}^d \\left( \\frac{|p_{ij}|}{\\max_k |p_{ik}|} + \\frac{|p_{ij}|}{\\max_k |p_{kj}|} \\right) - 1 \\tag{27}$$\n\nwhere $p_{ij} = (\\hat{A}^{-1} A)_{ij}$. We compared our method with three state-of-the-art approaches: MILCA [16], RADICAL [13] and KernelICA [2]. We used the official code provided by the authors (footnote 1). For the first two techniques, which utilize different information-theoretic measures assessed by order statistics, it is highly recommended to use dataset augmentation. 
This is a computationally intensive technique for dataset enlargement that replaces each data set point with a fixed number (usually 30) of new data points randomly generated in a small neighborhood of the original point. The proposed method gives smooth results without any additional augmentation due to its smooth nature (see Eq. (13)).\n\n1 http://www.klab.caltech.edu/~kraskov/MILCA/, https://www.cs.umass.edu/~elm/ICA/, http://www.di.ens.fr/~fbach/kernel-ica/index.htm\n\npdf  MILCA  MILCA Aug  RADICAL  RADICAL Aug  KernelICA  MeanNN ICA\na    3.3    2.5        3.6      2.8          3.3        2.4\nb    3.4    3.0        3.6      3.3          3.0        2.6\nc    7.5    4.4        7.6      5.4          4.9        4.2\nd    1.8    1.7        1.4      1.6          1.4        1.4\ne    1.7    1.6        1.5      1.7          1.5        1.4\nf    1.4    1.3        1.6      1.4          1.4        1.4\ng    1.4    1.3        1.6      1.4          1.4        1.4\nh    1.7    2.0        1.6      1.7          1.4        1.5\ni    1.9    2.1        1.8      1.8          1.5        1.8\n\nTable 2: Amari performance (multiplied by 100) for two-component ICA. The distributions are: (a) Student with 3 degrees of freedom; (b) double exponential; (c) Student with 5 degrees of freedom; (d) exponential; (e) mixture of two double exponentials; (f) symmetric mixtures of two Gaussians; (g) nonsymmetric mixtures of two Gaussians; (h) symmetric mixtures of four Gaussians; (i) nonsymmetric mixtures of four Gaussians.\n\nIn the explored cases the proposed method achieves the level of state-of-the-art performance. This is well explained by the inherent smoothness of the MeanNN estimator, see Figure 2. Here we present the comparison of different contrast functions based on different order statistics estimators for a grid of possible rotation angles for the mixture of two exponentially distributed random variables (case e). The contrast function corresponding to the order statistic k = 10 generally coincides with the MILCA approach. Also, the contrast function corresponding to the order statistic $k = 30 \\simeq \\sqrt{n}$ generally coincides with the RADICAL method. 
One may see that the MeanNN ICA contrast function leads to a much more robust prediction of the rotation angle. One should mention that gradient-based optimization makes it possible to obtain the global optimum with high precision, as opposed to the MILCA and RADICAL schemes, which utilize subspace grid optimization.\n\nApplication of gradient-based optimization schemes also leads to a computational advantage. The number of needed function evaluations was limited to 20, as opposed to 150 evaluations for the grid optimization schemes MILCA and RADICAL.\n\n[Plot omitted: contrast function $S(W(\\phi))$ versus rotation angle $\\phi$ for the MeanNN, 10NN and 30NN estimators.]\n\nFigure 2: Convergence analysis for a mixture of two exponentially distributed random variables. Contrast function dependence on a rotation angle for different entropy estimators. 1000 samples, 0.01 radian grid.\n\nWe also studied the application of MeanNN ICA to multidimensional problems. For that purpose we chose at random D (generally) different distributions, then we mixed them by a random rotation and ran the compared ICA algorithms to recover the rotation matrix. The results are presented in Table 3. MeanNN ICA achieved the best performance.\n\ndims  MILCA  MILCA Aug  RADICAL  RADICAL Aug  KernelICA  MeanNN ICA\n2     3.0    3.3        3.1      3.0          2.9        2.5\n4     2.7    2.7        2.8      2.3          2.6        2.2\n\nTable 3: Amari index (multiplied by 100) for multidimensional ICA. 1000 samples, 10 repetitions.\n\n6 Conclusion\n\nWe proposed a novel approach for estimation of the main information-theoretic measures such as differential entropy, mutual information and divergence. The estimators are smooth differentiable functions with clear geometrical meaning. Next, this novel estimation technique was applied to the ICA problem. Compared to state-of-the-art ICA methods, the proposed method demonstrated superior results in the conducted tests.\n\nThe studied state-of-the-art approaches can be divided into two groups. 
The first group is based on exact entropy estimation, which usually leads to high performance, as demonstrated by MILCA and RADICAL. The drawback of such estimators is the lack of a gradient and therefore numerical difficulties in optimization. The second group applies criteria other than entropy, which benefit from easy calculation of the gradient (KernelICA). However, such methods may suffer from deteriorated performance. MeanNN ICA combines the advantages of these two kinds of estimators. It represents a contrast function based on an accurate entropy estimation, and its gradient is given analytically, therefore it may be readily optimized.\n\nFinally we mention that the proposed estimation method may further be applied to various problems in the field of machine learning and beyond.\n\nReferences\n\n[1] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, 8, 1996.\n[2] F. Bach and M. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3, 2002.\n[3] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1995.\n[4] J.-F. Cardoso. Multidimensional independent component analysis. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'98), 1998.\n[5] C. Jutten and J. Herault. Blind separation of sources, part 1: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 1991.\n[6] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3), 1994.\n[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, August 1991.\n[8] D. T. Pham and P. Garat. 
Blind separation of mixtures of independent signals through a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing, 45(7), 1997.\n[9] A. Hyvarinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1997.\n[10] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. 2001.\n[11] L. Kozachenko and N. Leonenko. On statistical estimation of entropy of random vector. Problems Infor. Transmiss., 23(2), 1987.\n[12] A. Kraskov, H. St\u00f6gbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69:066138, 2004.\n[13] E. Miller and J. Fisher. ICA using spacing estimates of entropy. Proc. Fourth International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan, Apr. 2003, pp. 1047\u20131052, 2003.\n[14] J. Peltonen and S. Kaski. Discriminative components of data. IEEE Transactions on Neural Networks, 16(1), 2005.\n[15] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 2003.\n[16] H. St\u00f6gbauer, A. Kraskov, S. Astakhov, and P. Grassberger. Least-dependent-component analysis based on mutual information. Phys. Rev. E, 70(6):066123, Dec 2004.\n[17] O. Vasicek. A test for normality based on sample entropy. J. Royal Stat. Soc. B, 38(1):54\u201359, 1976.\n[18] J. D. Victor. Binless strategies for estimation of information from neural data. Physical Review, 2002.\n[19] Q. Wang, S. R. Kulkarni, and S. Verdu. A nearest-neighbor approach to estimating divergence between continuous random vectors. IEEE Int. Symp. Information Theory, Seattle, WA, 2006.", "award": [], "sourceid": 513, "authors": [{"given_name": "Lev", "family_name": "Faivishevsky", "institution": null}, {"given_name": "Jacob", "family_name": "Goldberger", "institution": null}]}