{"title": "Gaussianization", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 429, "abstract": null, "full_text": "Gaussianization \n\nScott Shaobing Chen \n\nRenaissance Technologies \nEast Setauket, NY  11733 \n\nschen@rentec.com \n\nRamesh A. Gopinath \n\nIBM TJ. Watson Research Center \n\nYorktown Heights, NY  10598 \n\nrameshg@us.ibm.com \n\nAbstract \n\nHigh dimensional data modeling is difficult mainly because the so-called \n\"curse of dimensionality\". We propose a technique called \"Gaussianiza(cid:173)\ntion\" for high dimensional density estimation, which alleviates the curse \nof dimensionality by exploiting the independence structures in  the data. \nGaussianization is  motivated from  recent developments in  the statistics \nliterature:  projection pursuit, independent component analysis and Gaus(cid:173)\nsian  mixture  models  with  semi-tied  covariances.  We  propose  an  iter(cid:173)\native  Gaussianization  procedure  which  converges  weakly:  at  each  it(cid:173)\neration,  the  data is  first  transformed  to  the  least dependent coordinates \nand then each coordinate is  marginally Gaussianized by univariate tech(cid:173)\nniques.  Gaussianization offers density estimation sharper than traditional \nkernel  methods and radial  basis function  methods.  Gaussianization can \nbe viewed as efficient solution of nonlinear independent component anal(cid:173)\nysis and high dimensional projection pursuit. \n\n1  Introduction \n\nDensity Estimation is  a fundamental  problem in  statistics.  In  the  statistics  literature,  the \nunivariate problem is well-understood and well-studied. Techniques such as  (variable) ker(cid:173)\nnel methods, radial basis function methods, Gaussian mixture models, etc, can be applied \nsuccessfully to obtain univariate density estimates.  However, the high dimensional problem \nis  very challenging, mainly due to  the problem of the so-called \"curse of dimensionality\". \nIn  high  dimensional  space,  data  samples  are  often  sparsely  distributed:  it  requires  very \nlarge neighborhoods to  achieve sufficient counts,  or the  number of samples has to  grows \nexponentially according  to  the  dimensions  in  order to  achieve  sufficient coverage of the \nsampling space.  As  a result, direct extension of univariate techniques can be highly biased, \nbecause they are neighborhood-based. \n\nIn this paper, we attempt to  overcome the curse of dimensionality by  exploiting indepen(cid:173)\ndence structures in the data.  We advocate the notion that \n\nIndependence lifts the curse of dimensionality! \n\nIndeed, if the  dimensions are independent,  then  there is  no  curse of dimensionality  since \nthe high dimensional problem can be reduced to univariate problems along each dimension. \n\nFor natural data sets which do not have independent dimensions, we would like to construct \ntransforms such that after the transformation, the dimensions become independent. We pro(cid:173)\npose a technique called \"Gaussianization\" which finds and exploits independence structures \n\n\fin the data for high dimensional density estimation.  For a random variable X  EnD, we \ndefine  its  Gaussianization  transform to  be  an  invertible  and  differential  transform T(X) \nsuch that the transformed variable T(X) follows the standard Gaussian distribution: \n\nT(X) '\" N(O, I) \n\nIt is clear that density estimates can be derived from Gaussianization transforms.  
We propose an iterative procedure which converges weakly: at each iteration, the data is first transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized by univariate techniques which are based on univariate density estimation. At each iteration, the coordinates become less dependent in terms of the mutual information, and the transformed data samples become more Gaussian in terms of the Kullback-Leibler divergence. In fact, the convergence result still holds as long as the data is linearly transformed to less dependent coordinates at each iteration. Our convergence proof of Gaussianization is closely related to Huber's convergence proof of projection pursuit [4].

Algorithmically, each Gaussianization iteration amounts to performing linear independent component analysis. Since the assumption of linear independent component analysis may not be valid, the resulting linear transform does not necessarily make the coordinates independent; however, it does make the coordinates as independent as possible. Therefore the engine of our algorithm is linear independent component analysis. We propose an efficient EM algorithm which jointly estimates the linear transform and the marginal univariate Gaussianization transform at each iteration. Our parametrization is identical to the independent factor analysis proposed by Attias (1999) [1]. However, we apply the variational method in the M-step, as in the semi-tied covariance algorithm proposed for Gaussian mixture models by Gales (1999) [3].

2 Existence of Gaussianization

We first show the existence of Gaussianization transforms. Denote by $\phi(\cdot)$ the probability density function of the standard normal $N(0, I)$, by $\phi(\cdot\,; \mu, \Sigma)$ the probability density function of $N(\mu, \Sigma)$, and by $\Phi(\cdot)$ the cumulative distribution function (CDF) of the standard normal.

2.1 Univariate Gaussianization

Univariate Gaussianization exists, is essentially unique, and can be derived from univariate density estimation. Let $X \in \mathbb{R}^1$ be a univariate variable. We assume that the density function of $X$ is strictly positive and differentiable. Let $F(\cdot)$ be the cumulative distribution function of $X$. Then $T(\cdot)$ is a Gaussianization transform if and only if it satisfies the following differential equation:
$$ p(x) = \phi\big(T(x)\big)\,\Bigl|\frac{\partial T}{\partial x}\Bigr|. $$
It can easily be verified that this differential equation has only two solutions:
$$ T(x) = \pm\,\Phi^{-1}\big(F(x)\big) \sim N(0, 1). \qquad (1) $$
In practice, the CDF $F(\cdot)$ is not available; it has to be estimated from the training data. We choose to approximate it by a Gaussian mixture model: $p(x) = \sum_{i=1}^{I} \pi_i\,\phi(x; \mu_i, \sigma_i^2)$; equivalently, we assume the CDF $F(x) = \sum_{i=1}^{I} \pi_i\,\Phi\bigl(\frac{x-\mu_i}{\sigma_i}\bigr)$, where the parameters $\{\pi_i, \mu_i, \sigma_i\}$ can be estimated via maximum likelihood using the standard EM algorithm. Therefore we can parametrize the Gaussianization transform as
$$ T(x) = \Phi^{-1}\Bigl(\sum_{i=1}^{I} \pi_i\,\Phi\bigl(\tfrac{x-\mu_i}{\sigma_i}\bigr)\Bigr). \qquad (2) $$
In practice there is an issue of model selection: we suggest using model selection techniques such as the Bayesian information criterion [6] to determine the number of Gaussians $I$.
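As a concrete illustration, here is a minimal sketch of the parametric univariate Gaussianizer (2), assuming scikit-learn and SciPy are available; the function name and the library choices are ours, not part of the original paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def fit_univariate_gaussianizer(x, n_components=4):
    """Fit a univariate Gaussian mixture to the samples x (shape (N,)) and
    return the transform T(x) = Phi^{-1}( sum_i pi_i Phi((x - mu_i)/sigma_i) )."""
    gmm = GaussianMixture(n_components=n_components).fit(x.reshape(-1, 1))
    pi = gmm.weights_
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())

    def transform(x_new):
        # Mixture CDF: F(x) = sum_i pi_i * Phi((x - mu_i) / sigma_i)
        F = np.sum(pi * norm.cdf((np.asarray(x_new)[:, None] - mu) / sigma), axis=1)
        # Clip to keep Phi^{-1} finite at the most extreme samples
        return norm.ppf(np.clip(F, 1e-12, 1 - 1e-12))

    return transform
```

In practice the number of mixture components would be chosen by a model selection criterion such as BIC, as suggested above.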
Throughout the paper, we shall assume that univariate density estimation and univariate Gaussianization can be solved by univariate Gaussian mixture models.

2.2 High Dimensional Gaussianization

The existence of high dimensional Gaussianization, however, is non-trivial. We present here a theoretical construction. For simplicity, we consider the two dimensional case. Let $X = (X_1, X_2)^T$ be the random variable. Gaussianization can be achieved in two steps. We first marginally Gaussianize the first coordinate $X_1$ and leave the second coordinate $X_2$ unchanged; the transformed variable has the density
$$ p(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1) = \phi(x_1)\,p(x_2 \mid x_1). $$
We then marginally Gaussianize each conditional density $p(\cdot \mid x_1)$. Notice that this marginal Gaussianization is different for different $x_1$:
$$ T_{x_1}(X_2) = \Phi^{-1}\big(F_{X_2 \mid x_1}(X_2)\big). $$
Once all the conditional densities are marginally Gaussianized, we achieve joint Gaussianization:
$$ p(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1) = \phi(x_1)\,\phi(x_2). $$
The existence of high dimensional Gaussianization in general can be proved by a similar construction.

The above construction, however, is not practical, since the marginal Gaussianization of the conditional densities $p(X_2 = x_2 \mid X_1 = x_1)$ requires estimating the conditional density for every $x_1$, which is impossible with finite samples. In the following sections, we develop an iterative Gaussianization algorithm that is practical and can also be proved to converge weakly.

High dimensional Gaussianization is unique up to any invertible transform which preserves the measure on $\mathbb{R}^D$ induced by the standard Gaussian distribution. Examples of such transforms are orthogonal linear transforms and certain nontrivial nonlinear transforms.

3 Gaussianization with the Linear ICA Assumption

Let $(x_1, \ldots, x_N)$ be i.i.d. samples from the random variable $X \in \mathbb{R}^D$. We assume that there exists a linear transform $A_{D \times D}$ such that the transformed variable $Y = (Y_1, \ldots, Y_D)^T = AX$ has independent components: $p(y_1, \ldots, y_D) = p(y_1) \cdots p(y_D)$. In this case, Gaussianization reduces to linear ICA: we can first find the linear transform $A$ by linear independent component analysis, and then Gaussianize each individual dimension of $Y$ by univariate Gaussianization; a simplified two-stage sketch along these lines is given below.

We parametrize the marginal Gaussianization by univariate Gaussian mixtures (2). This amounts to modeling the coordinates of the transformed variable by univariate Gaussian mixtures: $p(y_d) = \sum_{i=1}^{I_d} \pi_{d,i}\,\phi(y_d; \mu_{d,i}, \sigma_{d,i}^2)$. We would like to jointly optimize both the linear transform $A$ and the marginal Gaussianization parameters $(\pi, \mu, \sigma)$ via maximum likelihood. In fact, this is the same parametrization as in Attias (1999) [1]. We point out that modeling the coordinates after the linear transform as non-Gaussian distributions, for which we assume univariate Gaussian mixtures are adequate, leads to ICA, whereas modeling them as single Gaussians leads to PCA.

The joint estimation of the parameters can be computed via the EM algorithm.
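The following sketch illustrates the two-stage view under the linear ICA assumption. It is only a simplified stand-in for the joint EM algorithm described next: we substitute an off-the-shelf FastICA for the estimation of $A$, and reuse the hypothetical helper fit_univariate_gaussianizer from the earlier sketch for the marginal step.

```python
import numpy as np
from sklearn.decomposition import FastICA

def gaussianize_under_linear_ica(X, n_components_per_dim=4):
    """Two-stage sketch: (i) estimate Y = A X with an off-the-shelf linear ICA,
    (ii) marginally Gaussianize each coordinate of Y with the univariate
    GMM-based transform of Eq. (2) (fit_univariate_gaussianizer above)."""
    N, D = X.shape
    ica = FastICA(n_components=D)
    Y = ica.fit_transform(X)                  # estimated (least dependent) coordinates
    transforms = [fit_univariate_gaussianizer(Y[:, d], n_components_per_dim)
                  for d in range(D)]
    Z = np.column_stack([transforms[d](Y[:, d]) for d in range(D)])
    return Z, ica, transforms                 # Z should be approximately N(0, I)
```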
The auxiliary function to be maximized in the M-step has the following form:
$$ Q(A, \pi, \mu, \sigma) = N \log|\det(A)| + \sum_{n=1}^{N} \sum_{d=1}^{D} \sum_{i=1}^{I_d} w_{n,d,i} \Bigl[ \log \pi_{d,i} - \tfrac{1}{2}\log 2\pi\sigma_{d,i}^2 - \frac{(y_{n,d} - \mu_{d,i})^2}{2\sigma_{d,i}^2} \Bigr], $$
where the $(w_{n,d,i})$ are the posterior counts computed in the E-step. It can easily be shown that the priors $(\pi_{d,i})$ have a simple update and that the means $(\mu_{d,i})$ are entirely determined by the linear transform $A$. However, updating the linear transform $A$ and the variances $(\sigma_{d,i}^2)$ has no closed form solution and has to be done iteratively by numerical methods. Attias (1999) [1] proposed to optimize $Q$ via gradient descent: at each iteration, one fixes the linear transform and computes the Gaussian mixture parameters, then fixes the Gaussian mixture parameters and updates the linear transform via gradient descent using the so-called natural gradient.

We propose an iterative algorithm for the M-step, as in Gales (1999) [3], which does not involve gradient descent and avoids the nuisance and instability caused by the step size parameter. At each iteration, we fix the linear transform $A$ and update the variances $(\sigma_{d,i}^2)$; we then fix $(\sigma_{d,i}^2)$ and update each row of $A$ with all the other rows of $A$ fixed: updating each row amounts to solving a system of linear equations. Our iterative scheme guarantees that the auxiliary function $Q$ increases at every iteration. Notice that each iteration in our M-step updates the rows of the matrix $A$ by solving $D$ systems of linear equations. Although our iterative scheme may be slightly more expensive per iteration than standard numerical optimization techniques such as Attias' algorithm, in practice it converges after very few iterations, as observed in Gales (1999) [3]. In contrast, the numerical optimization scheme may take an order of magnitude more iterations. In fact, in our experiments, our algorithm converges much faster than Attias' algorithm. Furthermore, our algorithm is stable, since each iteration is guaranteed to increase the likelihood.

The M-step in both Attias' algorithm and our algorithm can be implemented efficiently by storing and accessing the sufficient statistics. Typically in our M-steps, most of the improvement in the likelihood comes in the first few iterations. Therefore we can stop each M-step after, say, one iteration of updating the parameters; even though the auxiliary function is not fully optimized, it is guaranteed to improve. We thus obtain the so-called generalized EM algorithm. Attias (1999) [1] reported faster convergence of the generalized EM algorithm than of the standard EM algorithm.

4 Iterative Gaussianization

In this section we develop an iterative algorithm which Gaussianizes arbitrary random variables. At each iteration, the data is first transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized by univariate techniques which are based on univariate density estimation. We shall show that transforming the data into the least dependent coordinates can be achieved by linear independent component analysis. We also prove the weak convergence result.

We define the negentropy¹ of a random variable $X = (X_1, \ldots, X_D)^T$ as the Kullback-Leibler divergence between $X$ and the standard Gaussian distribution.
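Written out in terms of the density $p$ of $X$ and the standard Gaussian density $\phi$, this reads
$$ J(X) = D_{\mathrm{KL}}\big(p \,\|\, \phi\big) = \int_{\mathbb{R}^D} p(x)\,\log\frac{p(x)}{\phi(x)}\,dx. $$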
We define the marginal negentropy to be $J_M(X) = \sum_{d=1}^{D} J(X_d)$. One can show that the negentropy decomposes as the sum of the marginal negentropy and the mutual information: $J(X) = J_M(X) + I(X)$. Gaussianization is equivalent to finding an invertible transform $T(\cdot)$ such that the negentropy of the transformed variable vanishes: $J(T(X)) = 0$.

¹ We are abusing the terminology slightly: normally the negentropy of a random variable is defined to be the Kullback-Leibler distance between itself and the Gaussian variable with the same mean and covariance.

For an arbitrary random variable $X \in \mathbb{R}^D$, we propose the following iterative Gaussianization algorithm. Let $X^{(0)} = X$. At each iteration,

(A) Linearly transform the data: $Y^{(k)} = A X^{(k)}$.

(B) Nonlinearly transform the data by marginal Gaussianization:
$$ X^{(k+1)} = \Psi_{\pi,\mu,\sigma}\big(Y^{(k)}\big), $$
where the marginal Gaussianization $\Psi_{\pi,\mu,\sigma}(\cdot)$, which approximates the ideal marginal Gaussianization $\Psi(\cdot)$, is derived from univariate Gaussian mixtures (2), applied coordinate-wise:
$$ \bigl[\Psi_{\pi,\mu,\sigma}(y)\bigr]_d = \Phi^{-1}\Bigl(\sum_{i=1}^{I_d} \pi_{d,i}\,\Phi\bigl(\tfrac{y_d-\mu_{d,i}}{\sigma_{d,i}}\bigr)\Bigr). $$

The parameters are chosen by minimizing the negentropy of the transformed variable $X^{(k+1)}$:
$$ (\hat{A}, \hat{\pi}, \hat{\mu}, \hat{\sigma}) = \arg\min_{A,\pi,\mu,\sigma} J\big(\Psi_{\pi,\mu,\sigma}(A X^{(k)})\big). \qquad (3) $$
Thus, after each iteration, the transformed variable becomes as close as possible to the standard Gaussian in the Kullback-Leibler distance.

First, the problem of minimizing the negentropy (3) is equivalent to the maximum likelihood problem for Gaussianization with the linear ICA assumption in Section 3, and thus can be solved by the same efficient EM algorithm.

Second, since the data $X^{(k)}$ might not satisfy the linear ICA assumption, the optimal linear transform might not map $X^{(k)}$ into independent coordinates. However, it does map $X^{(k)}$ into the least dependent coordinates, since
$$ J(X^{(k+1)}) = J_M\big(\Psi(A X^{(k)})\big) + I\big(\Psi(A X^{(k)})\big) = I\big(A X^{(k)}\big). $$
Furthermore, if the linear transform $A$ is constrained to be orthogonal, then finding the least dependent coordinates is equivalent to finding the marginally most non-Gaussian coordinates, since
$$ J(X^{(k)}) = J\big(A X^{(k)}\big) = J_M\big(A X^{(k)}\big) + I\big(A X^{(k)}\big) $$
(notice that the negentropy is invariant under orthogonal transforms).

Therefore our iterative algorithm can be viewed as follows. At each iteration, the data is linearly transformed to the least dependent coordinates and then each coordinate is marginally Gaussianized. In practice, after the first iteration, the algorithm finds linear transforms which are almost orthogonal. Therefore, in practice, one can also view each iteration as linearly transforming the data to the marginally most non-Gaussian coordinates and then marginally Gaussianizing each coordinate.

For the sake of simplicity, we assume that we can achieve the perfect marginal Gaussianization $\Psi(\cdot)$ by $\Psi_{\pi,\mu,\sigma}(\cdot)$, which is derived from univariate Gaussian mixtures. In fact, when the number of Gaussians and the number of samples both go to infinity, one can show that
$$ \lim \Psi_{\pi,\mu,\sigma} = \Psi. $$
Thus it suffices to analyze the ideal iterative Gaussianization
$$ X^{(k+1)} = \Psi\big(\hat{A} X^{(k)}\big), \qquad \hat{A} = \arg\min_A J\big(\Psi(A X^{(k)})\big) = \arg\min_A I\big(A X^{(k)}\big). $$
Following Huber's argument [4], we can show that
$$ X^{(k)} \to N(0, I) $$
in the sense of weak convergence, i.e. the density function of $X^{(k)}$ converges pointwise to the density function of the standard normal.
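The following is a minimal sketch of this iteration, not the paper's joint EM solution of (3): an off-the-shelf FastICA stands in for the linear step, and the function names are our own. It alternates a linear ICA step with coordinate-wise GMM-based Gaussianization and accumulates the log density of the training samples by the change of variables formula used in Section 5.

```python
import numpy as np
from scipy.stats import norm
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture

def marginal_gaussianize(y, n_components=4):
    """Fit Eq. (2) to one coordinate; return the transformed samples and the
    pointwise log-Jacobian log|dT/dy| = log p_gmm(y) - log phi(T(y))."""
    gmm = GaussianMixture(n_components=n_components).fit(y.reshape(-1, 1))
    pi, mu = gmm.weights_, gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel())
    F = np.clip(np.sum(pi * norm.cdf((y[:, None] - mu) / sigma), axis=1),
                1e-12, 1 - 1e-12)
    t = norm.ppf(F)
    log_jac = gmm.score_samples(y.reshape(-1, 1)) - norm.logpdf(t)
    return t, log_jac

def iterative_gaussianization(X, n_iter=8):
    """Sketch of the iterative algorithm: at each iteration, a linear ICA step
    (FastICA here, standing in for the joint EM solution of (3)) followed by
    marginal Gaussianization of every coordinate.  Returns the transformed
    training samples and their log densities under the induced estimate."""
    N, D = X.shape
    log_p = np.zeros(N)
    for _ in range(n_iter):
        ica = FastICA(n_components=D).fit(X)
        A = ica.components_                    # linear map: y = A (x - mean)
        Y = ica.transform(X)
        log_p += np.log(abs(np.linalg.det(A)))
        cols = []
        for d in range(D):
            t, log_jac = marginal_gaussianize(Y[:, d])
            cols.append(t)
            log_p += log_jac
        X = np.column_stack(cols)
    # After the last iteration X is approximately N(0, I), so
    # log p(x) = log phi(x^(K)) + sum of the accumulated log-Jacobians.
    log_p += np.sum(norm.logpdf(X), axis=1)
    return X, log_p
```

For brevity the sketch evaluates the density only at the training samples; on held-out points one would apply the stored linear maps and per-coordinate transforms rather than refitting them.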
[Figure 1 here: scatter plots of the original data and of the transformed data after iterations 1 through 8, followed by the Gaussianization density estimate and the Gaussian mixture density estimate.]

Figure 1: Iterative Gaussianization on a synthetic circular data set.

We point out that our iterative algorithm can be relaxed as follows. At each iteration, the data can be linearly transformed into coordinates which are merely less dependent, instead of into the least dependent coordinates:
$$ I\big(X^{(k)}\big) - I\big(A_k X^{(k)}\big) \;\geq\; \epsilon\,\Bigl[\,I\big(X^{(k)}\big) - \inf_A I\big(A X^{(k)}\big)\Bigr], $$
where the constant $\epsilon > 0$. We can show that this relaxed algorithm still converges weakly.

5 Examples

We demonstrate the progress of our iterative Gaussianization algorithm on a very difficult two dimensional synthetic data set. The true underlying variable is circularly distributed: in the polar coordinate system, the angle is uniformly distributed and the radius follows a mixture of four non-overlapping Gaussians. We drew 1000 i.i.d. samples from this distribution and ran 8 iterations to Gaussianize the data set. Figure 1 displays the transformed data set at each iteration. Clearly, the transformed data gradually becomes standard Gaussian.

Let $X^{(0)} = X$ and assume that the iterative Gaussianization procedure converges after $K$ iterations, i.e. $X^{(K)} \sim N(0, I)$. Since the transforms at each iteration are invertible, we can compute the Jacobian and obtain a density estimate for $X$. The Jacobian can be computed rapidly thanks to the chain rule. Figure 1 also compares the Gaussianization density estimate (8 iterations) with a Gaussian mixture density estimate (40 Gaussians). Clearly, the Gaussianization density estimate recovers the four circular structures, whereas the Gaussian mixture estimate lacks resolution.

6 Discussion

Gaussianization is closely connected with the exploratory projection pursuit algorithm proposed by Friedman (1987) [2]. In fact, we argue that our iterative Gaussianization procedure can easily be constrained to give an efficient parametric solution of high dimensional projection pursuit. Assume that we are interested in $l$-dimensional projections, where $1 \le l \le D$. If we constrain the linear transform at each iteration to be orthogonal, and marginally Gaussianize only the first $l$ coordinates of the transformed variable, then the iterative Gaussianization algorithm achieves $l$-dimensional projection pursuit. The bottleneck of Friedman's high dimensional projection pursuit is to find the jointly most non-Gaussian projection and to jointly Gaussianize that projection.
In contrast, our algorithm finds the marginally most non-Gaussian projection and marginally Gaussianizes that projection; it can be computed by an efficient EM algorithm.

We argue that Gaussianization density estimation indeed alleviates the curse of dimensionality. At each iteration, the curse of dimensionality bears solely on finding a linear transform such that the transformed coordinates are less dependent, a much easier problem than the original problem of high dimensional density estimation itself; after the linear transform, the marginal Gaussianization can be derived from univariate density estimation, which does not suffer from the curse of dimensionality. Hwang et al. (1994) [5] performed an extensive comparative study of three popular density estimates: one dimensional projection pursuit density estimates (a special case of our iterative Gaussianization algorithm), adaptive kernel density estimates and radial basis function density estimates; they concluded that projection pursuit density estimates outperform the others on most data sets.

We are currently experimenting with applications of Gaussianization density estimation in automatic speech and speaker recognition.

References

[1] H. Attias, "Independent factor analysis", Neural Computation, vol. 11, pp. 803-851, 1999.
[2] J. H. Friedman, "Exploratory projection pursuit", Journal of the American Statistical Association, vol. 82, pp. 249-266, 1987.
[3] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models", IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.
[4] P. J. Huber, "Projection pursuit", Annals of Statistics, vol. 13, pp. 435-525, 1985.
[5] J. Hwang, S. Lay and A. Lippman, "Nonparametric multivariate density estimation: a comparative study", IEEE Transactions on Signal Processing, vol. 42, pp. 2795-2810, 1994.
[6] G. Schwarz, "Estimating the dimension of a model", Annals of Statistics, vol. 6, pp. 461-464, 1978.