{"title": "Kernel Measures of Conditional Dependence", "book": "Advances in Neural Information Processing Systems", "page_first": 489, "page_last": 496, "abstract": null, "full_text": "Kernel Measures of Conditional Dependence\n\nKenji Fukumizu\nInstitute of Statistical Mathematics\n4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569 Japan\nfukumizu@ism.ac.jp\n\nArthur Gretton\nMax-Planck Institute for Biological Cybernetics\nSpemannstraße 38, 72076 Tübingen, Germany\narthur.gretton@tuebingen.mpg.de\n\nXiaohai Sun\nMax-Planck Institute for Biological Cybernetics\nSpemannstraße 38, 72076 Tübingen, Germany\nxiaohi@tuebingen.mpg.de\n\nBernhard Schölkopf\nMax-Planck Institute for Biological Cybernetics\nSpemannstraße 38, 72076 Tübingen, Germany\nbernhard.schoelkopf@tuebingen.mpg.de\n\nAbstract\n\nWe propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.\n\n1 Introduction\n\nMeasuring dependence of random variables is one of the main concerns of statistical inference. A typical example is the inference of a graphical model, which expresses the relations among variables in terms of independence and conditional independence. Independent component analysis employs a measure of independence as the objective function, and feature selection in supervised learning looks for a set of features on which the response variable most depends.\n\nKernel methods have been successfully used for capturing (conditional) dependence of variables [1, 5, 8, 9, 16]. 
With the ability to represent high order moments, mapping variables into reproducing kernel Hilbert spaces (RKHSs) allows us to infer properties of the distributions, such as independence and homogeneity [7]. A drawback of previous kernel dependence measures, however, is that their value depends not only on the distribution of the variables, but also on the kernel, in contrast to measures such as mutual information.\n\nIn this paper, we propose to use the Hilbert-Schmidt norm of the normalized conditional cross-covariance operator, and show that this operator encodes the dependence structure of random variables. Our criterion includes a measure of unconditional dependence as a special case. We prove, in the limit of infinite data and under assumptions on the richness of the RKHS, that this measure has an explicit integral expression which depends only on the probability densities of the variables, despite being defined in terms of kernels. We also prove that its empirical estimate converges to the kernel-independent value as the sample size increases. Furthermore, we provide a general formulation of the “richness” of an RKHS, and a theoretically motivated kernel selection method. We successfully apply our measure in experiments on synthetic and real data.\n\n2 Measuring conditional dependence with kernels\n\nThe probability law of a random variable $X$ is denoted by $P_X$, and the space of square integrable functions with probability $P$ by $L^2(P)$. The symbol $X \perp\!\!\!\perp Y \mid Z$ indicates the conditional independence of $X$ and $Y$ given $Z$. The null space and the range of an operator $T$ are written $\mathcal{N}(T)$ and $\mathcal{R}(T)$, respectively.\n\n2.1 Dependence measures with normalized cross-covariance operators\n\nCovariance operators on RKHSs have been successfully used for capturing dependence and conditional dependence of random variables, by incorporating high order moments [5, 8, 16]. 
We give a brief review here; see [5, 6, 2] for further detail. Suppose we have a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, and RKHSs $\mathcal{H}_X$ and $\mathcal{H}_Y$ on $\mathcal{X}$ and $\mathcal{Y}$, respectively, with measurable positive definite kernels $k_X$ and $k_Y$. Throughout this paper, we assume the integrability condition\n\n(A-1)  $E[k_X(X, X)] < \infty$, $E[k_Y(Y, Y)] < \infty$.\n\nThis assumption ensures $\mathcal{H}_X \subset L^2(P_X)$ and $\mathcal{H}_Y \subset L^2(P_Y)$. The cross-covariance operator $\Sigma_{YX} : \mathcal{H}_X \to \mathcal{H}_Y$ is defined as the unique bounded operator that satisfies\n\n$\langle g, \Sigma_{YX} f \rangle_{\mathcal{H}_Y} = \mathrm{Cov}[f(X), g(Y)] \; \bigl( = E[f(X)g(Y)] - E[f(X)]E[g(Y)] \bigr)$  (1)\n\nfor all $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$. If $Y = X$, $\Sigma_{XX}$ is called the covariance operator, which is self-adjoint and positive. The operator $\Sigma_{YX}$ naturally extends the covariance matrix $C_{YX}$ on Euclidean spaces, and represents higher order correlations of $X$ and $Y$ through $f(X)$ and $g(Y)$ with nonlinear kernels. It is known [2] that the cross-covariance operator can be decomposed into the covariance of the marginals and the correlation; that is, there exists a unique bounded operator $V_{YX}$ such that\n\n$\Sigma_{YX} = \Sigma_{YY}^{1/2} V_{YX} \Sigma_{XX}^{1/2}$,  (2)\n\n$\mathcal{R}(V_{YX}) \subset \mathcal{R}(\Sigma_{YY})$, and $\mathcal{N}(V_{YX})^{\perp} \subset \mathcal{R}(\Sigma_{XX})$. The operator norm of $V_{YX}$ is less than or equal to 1. We call $V_{YX}$ the normalized cross-covariance operator (NOCCO, see also [4]). While the operator $V_{YX}$ encodes the same information regarding the dependence of $X$ and $Y$ as $\Sigma_{YX}$, the former expresses this information more directly, with less influence of the marginals. This relation can be understood as an analogue of the difference between the covariance $\mathrm{Cov}[X, Y]$ and the correlation $\mathrm{Cov}[X, Y] / (\mathrm{Var}(X)\mathrm{Var}(Y))^{1/2}$. Note also that kernel canonical correlation analysis [1] uses the largest eigenvalue of $V_{YX}$ and its corresponding eigenfunctions [4]. Suppose we have another random variable $Z$ on $\mathcal{Z}$ and an RKHS $(\mathcal{H}_Z, k_Z)$, which satisfy the analogue of (A-1). 
We then define the normalized conditional cross-covariance operator\n\n$V_{YX|Z} = V_{YX} - V_{YZ} V_{ZX}$  (3)\n\nfor measuring the conditional dependence of $X$ and $Y$ given $Z$, where $V_{YZ}$ and $V_{ZX}$ are defined similarly to Eq. (2). The operator $V_{YX|Z}$ may be better understood by expressing it as\n\n$V_{YX|Z} = \Sigma_{YY}^{-1/2} \bigl( \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX} \bigr) \Sigma_{XX}^{-1/2}$,\n\nwhere $\Sigma_{YX|Z} = \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX}$ can be interpreted as a nonlinear extension of the conditional covariance matrix $C_{YX} - C_{YZ} C_{ZZ}^{-1} C_{ZX}$ of Gaussian random variables.\n\nThe operator $\Sigma_{YX}$ can be used to determine the independence of $X$ and $Y$: roughly speaking, $\Sigma_{YX} = O$ if and only if $X \perp\!\!\!\perp Y$. Similarly, a relation between $\Sigma_{YX|Z}$ and conditional independence, $X \perp\!\!\!\perp Y \mid Z$, has been established in [5]: if the extended variables $\ddot{X} = (X, Z)$ and $\ddot{Y} = (Y, Z)$ are used, $X \perp\!\!\!\perp Y \mid Z$ is equivalent to $\Sigma_{\ddot{X}\ddot{Y}|Z} = O$. We will give a rigorous treatment in Section 2.2. Noting that the conditions $\Sigma_{YX} = O$ and $\Sigma_{YX|Z} = O$ are equivalent to $V_{YX} = O$ and $V_{YX|Z} = O$, respectively, we propose to use the Hilbert-Schmidt norms of the latter operators as dependence measures. Recall that an operator $A : \mathcal{H}_1 \to \mathcal{H}_2$ is called Hilbert-Schmidt if, for complete orthonormal systems (CONSs) $\{\phi_i\}$ of $\mathcal{H}_1$ and $\{\psi_j\}$ of $\mathcal{H}_2$, the sum $\sum_{i,j} \langle \psi_j, A\phi_i \rangle^2_{\mathcal{H}_2}$ is finite (see [13]). For a Hilbert-Schmidt operator $A$, the Hilbert-Schmidt (HS) norm $\|A\|_{HS}$ is defined by
$\|A\|^2_{HS} = \sum_{i,j} \langle \psi_j, A\phi_i \rangle^2_{\mathcal{H}_2}$. It is easy to see that this sum is independent of the choice of CONSs. Provided that $V_{YX}$ and $V_{YX|Z}$ are Hilbert-Schmidt, we propose the following measures:\n\n$I^{COND}(X, Y \mid Z) = \|V_{\ddot{Y}\ddot{X}|Z}\|^2_{HS}$,  (4)\n\n$I^{NOCCO}(X, Y) = \|V_{YX}\|^2_{HS}$.  (5)\n\nA sufficient condition for these operators to be Hilbert-Schmidt will be discussed in Section 2.3. It is easy to provide empirical estimates of the measures. Let $(X_1, Y_1, Z_1), \ldots, (X_n, Y_n, Z_n)$ be an i.i.d. sample from the joint distribution. Using the empirical mean elements $\hat{m}^{(n)}_X = \frac{1}{n}\sum_{i=1}^n k_X(\cdot, X_i)$ and $\hat{m}^{(n)}_Y = \frac{1}{n}\sum_{i=1}^n k_Y(\cdot, Y_i)$, an estimator of $\Sigma_{YX}$ is\n\n$\hat{\Sigma}^{(n)}_{YX} = \frac{1}{n} \sum_{i=1}^n \bigl( k_Y(\cdot, Y_i) - \hat{m}^{(n)}_Y \bigr) \otimes \bigl( k_X(\cdot, X_i) - \hat{m}^{(n)}_X \bigr)$,\n\nand $\hat{\Sigma}^{(n)}_{XX}$ and $\hat{\Sigma}^{(n)}_{YY}$ are defined similarly. The estimators of $V_{YX}$ and $V_{YX|Z}$ are respectively\n\n$\hat{V}^{(n)}_{YX} = \bigl( \hat{\Sigma}^{(n)}_{YY} + \varepsilon_n I \bigr)^{-1/2} \hat{\Sigma}^{(n)}_{YX} \bigl( \hat{\Sigma}^{(n)}_{XX} + \varepsilon_n I \bigr)^{-1/2}$ and $\hat{V}^{(n)}_{YX|Z} = \hat{V}^{(n)}_{YX} - \hat{V}^{(n)}_{YZ} \hat{V}^{(n)}_{ZX}$,  (6)\n\nfrom Eq. (3), where $\varepsilon_n > 0$ is a regularization constant used in the same way as in [1, 5]. The HS norm of the finite rank operator $\hat{V}^{(n)}_{YX|Z}$ is easy to calculate. Let $G_X$, $G_Y$, and $G_Z$ be the centered Gram matrices, such that $G_{X,ij} = \langle k_X(\cdot, X_i) - \hat{m}^{(n)}_X, \, k_X(\cdot, X_j) - \hat{m}^{(n)}_X \rangle_{\mathcal{H}_X}$ and so on, and define $R_X$, $R_Y$, and $R_Z$ as $R_X = G_X (G_X + n\varepsilon_n I_n)^{-1}$, $R_Y = G_Y (G_Y + n\varepsilon_n I_n)^{-1}$, and $R_Z = G_Z (G_Z + n\varepsilon_n I_n)^{-1}$. 
The empirical dependence measures are then\n\n$\hat{I}^{COND}_n \equiv \|\hat{V}^{(n)}_{\ddot{Y}\ddot{X}|Z}\|^2_{HS} = \mathrm{Tr}\bigl[ R_{\ddot{Y}} R_{\ddot{X}} - 2 R_{\ddot{Y}} R_{\ddot{X}} R_Z + R_{\ddot{Y}} R_Z R_{\ddot{X}} R_Z \bigr]$,  (7)\n\n$\hat{I}^{NOCCO}_n(X, Y) \equiv \|\hat{V}^{(n)}_{YX}\|^2_{HS} = \mathrm{Tr}\bigl[ R_Y R_X \bigr]$,  (8)\n\nwhere the extended variables are used for $\hat{I}^{COND}_n$. These empirical estimators, and the use of $\varepsilon_n$, will be justified in Section 2.4 by showing convergence to $I^{NOCCO}$ and $I^{COND}$. With the incomplete Cholesky decomposition [17] of rank $r$, the complexity to compute $\hat{I}^{COND}_n$ is $O(r^2 n)$.\n\n2.2 Inference on probabilities by characteristic kernels\n\nTo relate $I^{NOCCO}$ and $I^{COND}$ with independence and conditional independence, respectively, the RKHS should contain a sufficiently rich class of functions to represent all higher order moments. Similar notions have already appeared in the literature: universal kernels on compact domains [15] and Gaussian kernels on the entire $\mathbb{R}^m$ characterize independence via the cross-covariance operator [8, 1]. We now discuss a unified class of kernels for inference on probabilities. Let $(\mathcal{X}, \mathcal{B})$ be a measurable space, $X$ a random variable on $\mathcal{X}$, and $(\mathcal{H}, k)$ an RKHS on $\mathcal{X}$ satisfying assumption (A-1). The mean element of $X$ on $\mathcal{H}$ is defined as the unique element $m_X \in \mathcal{H}$ such that $\langle m_X, f \rangle_{\mathcal{H}} = E[f(X)]$ for all $f \in \mathcal{H}$ (see [7]). If the distribution of $X$ is $P$, we also use $m_P$ to denote $m_X$. 
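The empirical counterpart of the mean element, $\hat{m}^{(n)}_X = \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$, is simple to evaluate pointwise through the reproducing property $m_X(u) = \langle m_X, k(\cdot, u) \rangle_{\mathcal{H}} = E[k(u, X)]$. A minimal NumPy sketch of this estimate (our own illustration, not code from the paper; the Gaussian kernel and all names are assumptions):

```python
import numpy as np

def gauss_kernel(u, x, sigma=1.0):
    # Gaussian kernel k(u, x) = exp(-||u - x||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - x) ** 2) / (2.0 * sigma ** 2))

def empirical_mean_element(sample, sigma=1.0):
    # Returns the function u -> (1/n) sum_i k(u, X_i), i.e. the empirical
    # mean element evaluated at u; it estimates E[k(u, X)] by the
    # reproducing property m_X(u) = <m_X, k(., u)>_H.
    def m_hat(u):
        return float(np.mean([gauss_kernel(u, x, sigma) for x in sample]))
    return m_hat

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2000, 1))   # sample from P = N(0, 1)
m_hat = empirical_mean_element(X)
# For X ~ N(0,1) and sigma = 1, E[k(0, X)] = E[exp(-X^2/2)] = 1/sqrt(2).
print(m_hat(np.zeros(1)))  # close to 0.7071
```

Comparing the estimates $\hat{m}_P$ and $\hat{m}_Q$ of two samples at many evaluation points $u$ gives an empirical handle on whether the two samples could share a distribution, which is exactly the injectivity question for $M_k$ discussed next.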
Letting $\mathcal{P}$ be the family of all probabilities on $(\mathcal{X}, \mathcal{B})$, we define the map $M_k$ by\n\n$M_k : \mathcal{P} \to \mathcal{H}, \quad P \mapsto m_P$.\n\nThe kernel $k$ is said to be characteristic¹ if the map $M_k$ is injective, or equivalently, if the condition $E_{X \sim P}[f(X)] = E_{X \sim Q}[f(X)]$ ($\forall f \in \mathcal{H}$) implies $P = Q$. The notion of a characteristic kernel is an analogy to the characteristic function $E_P[e^{\sqrt{-1}\, u^T X}]$, which is the expectation of the Fourier kernel $k_F(x, u) = e^{\sqrt{-1}\, u^T x}$. Noting that $m_P = m_Q$ iff $E_P[k(u, X)] = E_Q[k(u, X)]$ for all $u \in \mathcal{X}$, the definition of a characteristic kernel generalizes the well-known property of the characteristic function that $E_P[k_F(u, X)]$ uniquely determines a Borel probability $P$ on $\mathbb{R}^m$. The next lemma is useful for showing that a kernel is characteristic.\n\n¹Although the same notion was called probability-determining in [5], we call it “characteristic” by analogy with the characteristic function.\n\nLemma 1. Let $q \geq 1$. Suppose that $(\mathcal{H}, k)$ is an RKHS on a measurable space $(\mathcal{X}, \mathcal{B})$ with $k$ measurable and bounded. If $\mathcal{H} + \mathbb{R}$ (the direct sum of the two RKHSs) is dense in $L^q(\mathcal{X}, P)$ for any probability $P$ on $(\mathcal{X}, \mathcal{B})$, then the kernel $k$ is characteristic.\n\nProof. Assume $m_P = m_Q$. By the assumption, for any $\varepsilon > 0$ and any measurable set $A$, there are a function $f \in \mathcal{H}$ and $c \in \mathbb{R}$ such that $|E_P[f(X)] + c - P(A)| < \varepsilon$ and $|E_Q[f(Y)] + c - Q(A)| < \varepsilon$, from which we have $|P(A) - Q(A)| < 2\varepsilon$. Since $\varepsilon > 0$ is arbitrary, this means $P(A) = Q(A)$.\n\nMany popular kernels are characteristic. For a compact metric space, it is easy to see that the RKHS given by a universal kernel [15] is dense in $L^2(P)$ for any $P$, and thus characteristic (see also [7], Theorem 3). It is also important to consider kernels on non-compact spaces, since many standard random variables, such as Gaussian variables, are defined on non-compact spaces. 
The next theorem implies that many kernels on the entire $\mathbb{R}^m$, including the Gaussian and Laplacian kernels, are characteristic. The proof is an extension of Theorem 2 in [1], and is given in the supplementary material.\n\nTheorem 2. Let $\phi(z)$ be a continuous positive function on $\mathbb{R}^m$ with Fourier transform $\tilde{\phi}(u)$, and let $k$ be a kernel of the form $k(x, y) = \phi(x - y)$. If for any $\xi \in \mathbb{R}^m$ there exists $\tau_0$ such that $\int \frac{\tilde{\phi}(\tau(u + \xi))^2}{\tilde{\phi}(u)}\, du < \infty$ for all $\tau > \tau_0$, then the RKHS associated with $k$ is dense in $L^2(P)$ for any Borel probability $P$ on $\mathbb{R}^m$. Hence $k$ is characteristic with respect to the Borel $\sigma$-field.\n\nThe assumptions needed to relate the operators with independence are well described using characteristic kernels and denseness. The next result generalizes Corollary 9 in [5] (we omit the proof; see [5, 6]).\n\nTheorem 3. (i) Assume (A-1) for the kernels. If the product $k_X k_Y$ is characteristic, then we have\n\n$V_{YX} = O \iff X \perp\!\!\!\perp Y$.\n\n(ii) Denote $\ddot{X} = (X, Z)$ and $k_{\ddot{X}} = k_X k_Z$. In addition to (A-1), assume that the product $k_{\ddot{X}} k_Y$ is a characteristic kernel on $(\mathcal{X} \times \mathcal{Z}) \times \mathcal{Y}$, and that $\mathcal{H}_Z + \mathbb{R}$ is dense in $L^2(P_Z)$. Then,\n\n$V_{Y\ddot{X}|Z} = O \iff X \perp\!\!\!\perp Y \mid Z$.\n\nFrom the above results, we can guarantee that $V_{YX}$ and $V_{Y\ddot{X}|Z}$ will detect independence and conditional independence if we use a Gaussian or Laplacian kernel, either on a compact set or on the whole of $\mathbb{R}^m$. Note also that we can substitute $V_{\ddot{Y}\ddot{X}|Z}$ for $V_{Y\ddot{X}|Z}$ in Theorem 3 (ii).\n\n2.3 Kernel-free integral expression of the measures\n\nA remarkable property of $I^{NOCCO}$ and $I^{COND}$ is that, under certain assumptions, they do not depend on the kernels, having integral expressions that contain only the probability density functions. 
The probability $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ on $\mathcal{X} \times \mathcal{Y}$ is defined by $E_Z[P_{Y|Z} \otimes P_{X|Z}](B \times A) = \int E[\chi_B(Y) \mid Z = z]\, E[\chi_A(X) \mid Z = z]\, dP_Z(z)$ for $A \in \mathcal{B}_X$ and $B \in \mathcal{B}_Y$.\n\nTheorem 4. Let $\mu_X$ and $\mu_Y$ be measures on $\mathcal{X}$ and $\mathcal{Y}$, respectively, and assume that the probabilities $P_{XY}$ and $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ are absolutely continuous with respect to $\mu_X \times \mu_Y$, with probability density functions $p_{XY}$ and $p_{X \perp\!\!\!\perp Y|Z}$, respectively. If $\mathcal{H}_Z + \mathbb{R}$ and $(\mathcal{H}_X \otimes \mathcal{H}_Y) + \mathbb{R}$ are dense in $L^2(P_Z)$ and $L^2(P_X \otimes P_Y)$, respectively, and $V_{YX}$ and $V_{YZ}V_{ZX}$ are Hilbert-Schmidt, then we have\n\n$I^{COND} = \|V_{YX|Z}\|^2_{HS} = \int\!\!\int_{\mathcal{X} \times \mathcal{Y}} \Bigl( \frac{p_{XY}(x,y)}{p_X(x) p_Y(y)} - \frac{p_{X \perp\!\!\!\perp Y|Z}(x,y)}{p_X(x) p_Y(y)} \Bigr)^2 p_X(x) p_Y(y)\, d\mu_X d\mu_Y$,\n\nwhere $p_X$ and $p_Y$ are the density functions of the marginal distributions $P_X$ and $P_Y$, respectively. As the special case $Z = \emptyset$, we have\n\n$I^{NOCCO} = \|V_{YX}\|^2_{HS} = \int\!\!\int_{\mathcal{X} \times \mathcal{Y}} \Bigl( \frac{p_{XY}(x,y)}{p_X(x) p_Y(y)} - 1 \Bigr)^2 p_X(x) p_Y(y)\, d\mu_X d\mu_Y$.  (9)\n\nSketch of the proof (see the supplement for the complete proof). Since it is known [8] that $\Sigma_{ZZ}$ is Hilbert-Schmidt under (A-1), there exist CONSs $\{\phi_i\}_{i=1}^{\infty} \subset \mathcal{H}_X$ and $\{\psi_j\}_{j=1}^{\infty} \subset \mathcal{H}_Y$ consisting of the eigenfunctions of $\Sigma_{XX}$ and $\Sigma_{YY}$, respectively, with $\Sigma_{XX}\phi_i = \lambda_i \phi_i$ ($\lambda_i \geq 0$) and $\Sigma_{YY}\psi_j = \nu_j \psi_j$ ($\nu_j \geq 0$). Then, $\|V_{YX|Z}\|^2_{HS}$ admits the expansion\n\n$\sum_{i,j=1}^{\infty} \bigl\{ \langle \psi_j, V_{YX}\phi_i \rangle^2_{\mathcal{H}_Y} - 2 \langle \psi_j, V_{YX}\phi_i \rangle_{\mathcal{H}_Y} \langle \psi_j, V_{YZ}V_{ZX}\phi_i \rangle_{\mathcal{H}_Y} + \langle \psi_j, V_{YZ}V_{ZX}\phi_i \rangle^2_{\mathcal{H}_Y} \bigr\}$.\n\nLet $I^X_+ = \{i \in \mathbb{N} \mid \lambda_i > 0\}$ and $I^Y_+ = \{j \in \mathbb{N} \mid \nu_j > 0\}$, and define $\tilde{\phi}_i = (\phi_i - E[\phi_i(X)])/\sqrt{\lambda_i}$ and $\tilde{\psi}_j = (\psi_j - E[\psi_j(Y)])/\sqrt{\nu_j}$ for $i \in I^X_+$ and $j \in I^Y_+$. For simplicity, $L^2$ denotes $L^2(P_X \otimes P_Y)$. With the notations $\tilde{\phi}_0 = 1$ and $\tilde{\psi}_0 = 1$, it is easy to see that the class $\{\tilde{\phi}_i \tilde{\psi}_j\}_{i \in I^X_+ \cup \{0\},\, j \in I^Y_+ \cup \{0\}}$ is a CONS of $L^2$. From Parseval's equality, the first term of the above expansion is rewritten as\n\n$\sum_{i \in I^X_+, j \in I^Y_+} \langle \psi_j, V_{YX}\phi_i \rangle^2_{\mathcal{H}_Y} = \sum_{i \in I^X_+, j \in I^Y_+} E_{YX}\bigl[ \tilde{\psi}_j(Y) \tilde{\phi}_i(X) \bigr]^2 = \sum_{i \in I^X_+, j \in I^Y_+} \bigl( \tilde{\phi}_i \tilde{\psi}_j, \tfrac{p_{XY}}{p_X p_Y} \bigr)^2_{L^2} = \bigl\| \tfrac{p_{XY}(x,y)}{p_X(x) p_Y(y)} \bigr\|^2_{L^2} - \sum_{i \in I^X_+} E[\tilde{\phi}_i(X)]^2 - \sum_{j \in I^Y_+} E[\tilde{\psi}_j(Y)]^2 - 1 = \bigl\| \tfrac{p_{XY}(x,y)}{p_X(x) p_Y(y)} \bigr\|^2_{L^2} - 1$.\n\nBy a similar argument, the second and third terms of the expansion are rewritten as $-2 \bigl( \tfrac{p_{XY}}{p_X p_Y}, \tfrac{p_{X \perp\!\!\!\perp Y|Z}}{p_X p_Y} \bigr)_{L^2} + 2$ and $\bigl\| \tfrac{p_{X \perp\!\!\!\perp Y|Z}}{p_X p_Y} \bigr\|^2_{L^2} - 1$, respectively. This completes the proof.\n\nMany practical kernels, such as the Gaussian and Laplacian, satisfy the assumptions of the above theorem, as we saw in Theorem 2 and the remark after Lemma 1. While the empirical estimate from finite samples depends on the choice of kernels, it is a desirable property for the empirical dependence measure to converge to a value that depends only on the distributions of the variables.\n\nEq. (9) shows that, under the assumptions, $I^{NOCCO}$ is equal to the mean square contingency, a well-known dependence measure [14] commonly used for discrete variables. As we show in Section 2.4, $\hat{I}^{NOCCO}_n$ works as a consistent kernel estimator of the mean square contingency. The expression of Eq. 
(9) can be compared with the mutual information,\n\n$MI(X, Y) = \int\!\!\int_{\mathcal{X} \times \mathcal{Y}} p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x) p_Y(y)}\, d\mu_X d\mu_Y$.\n\nBoth the mutual information and the mean square contingency are nonnegative, and equal zero if and only if $X$ and $Y$ are independent. Note also that from $\log z \leq z - 1$, the inequality $MI(X, Y) \leq I^{NOCCO}(X, Y)$ holds under the assumptions of Theorem 4. While the mutual information is the best known dependence measure, its finite sample empirical estimate is not straightforward, especially for continuous variables. The direct estimation of a probability density function is infeasible if the joint space has even a moderate number of dimensions.\n\n2.4 Consistency of the measures\n\nIt is important to ask whether the empirical measures converge to the population values $I^{COND}$ and $I^{NOCCO}$, since this provides a theoretical justification for the empirical measures. It is known [4] that $\hat{V}^{(n)}_{YX}$ converges in probability to $V_{YX}$ in operator norm. The next theorem asserts convergence in HS norm, provided that $V_{YX}$ is Hilbert-Schmidt. Although the proof is analogous to the operator norm case, the HS norm is more involved to handle; we give the proof in the supplementary material.\n\nTheorem 5. Assume that $V_{YX}$, $V_{YZ}$, and $V_{ZX}$ are Hilbert-Schmidt, and that the regularization constant $\varepsilon_n$ satisfies $\varepsilon_n \to 0$ and $\varepsilon_n^3 n \to \infty$. Then, we have the convergence in probability\n\n$\|\hat{V}^{(n)}_{YX} - V_{YX}\|_{HS} \to 0$ and $\|\hat{V}^{(n)}_{YX|Z} - V_{YX|Z}\|_{HS} \to 0 \quad (n \to \infty)$.  (10)\n\nIn particular, $\hat{I}^{NOCCO}_n \to I^{NOCCO}$ and $\hat{I}^{COND}_n \to I^{COND}$ ($n \to \infty$) in probability.\n\n2.5 Choice of kernels\n\nAs with all empirical measures, the sample estimates $\hat{I}^{NOCCO}_n$ and $\hat{I}^{COND}_n$ depend on the kernel, and the problem of choosing a kernel has yet to be solved. 
Unlike supervised learning, there are no easy criteria for choosing a kernel for dependence measures. We propose a method of choosing a kernel by considering large sample behavior; we explain the method only briefly in this paper.\n\nThe basic idea is that a kernel should be chosen so that the covariance operator detects independence of variables as effectively as possible. It has recently been shown [10] that, under independence of $X$ and $Y$, the measure $HSIC = \|\hat{\Sigma}^{(n)}_{YX}\|^2_{HS}$ ([8]) multiplied by $n$ converges to an infinite mixture of $\chi^2$ distributions with variance $\mathrm{Var}_{\lim}[n \cdot HSIC] = 2 \|\Sigma_{XX}\|^2_{HS} \|\Sigma_{YY}\|^2_{HS}$. We choose a kernel so that the bootstrapped variance $\mathrm{Var}_B[n \cdot HSIC]$ of $n \cdot HSIC$ is close to this theoretical limit variance. More precisely, we compare the ratio $T = \mathrm{Var}_B[n \cdot HSIC] / \mathrm{Var}_{\lim}[n \cdot HSIC]$ for various candidate kernels. In preliminary experiments for choosing the variance parameter $\sigma$ of Gaussian kernels, we often observed that the ratio decays and saturates below 1 as $\sigma$ increases. We therefore use the $\sigma$ at which the saturation begins, choosing the minimum $\sigma$ among all candidates that satisfy $|T_\sigma - \alpha| \leq (1 + \delta) \min_\sigma |T_\sigma - \alpha|$ for $\delta > 0$, $\alpha \in (0, 1]$. We always use $\delta = 0.1$ and $\alpha = 0.5$. We can expect that the chosen kernel uses the data effectively.\n\nFigure 1: Left and Middle: examples of data ($\theta = 0$ and $\theta = \pi/4$). Right: the marks “o” and “+” show $\hat{I}^{NOCCO}_n$ for each angle and the 95th percentile of the permutation test, respectively.\n\n
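As a rough illustration of this selection rule, the following NumPy sketch (our own, not code from the paper) computes the ratio $T$ for a candidate $\sigma$, under two simplifying assumptions we make explicit: the bootstrap is a plain permutation bootstrap, and each HS norm $\|\Sigma\|^2_{HS}$ in the limit variance is estimated by its empirical counterpart $\mathrm{Tr}[G G]/n^2$ with a centered Gram matrix $G$.

```python
import numpy as np

def centered_gram(X, sigma):
    # Centered Gaussian Gram matrix G = H K H, H = I - (1/n) 11^T.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def hsic(GX, GY):
    # Empirical HSIC = ||Sigma_YX^(n)||_HS^2 = Tr[GX GY] / n^2.
    n = len(GX)
    return float(np.trace(GX @ GY)) / n ** 2

def variance_ratio(X, Y, sigma, n_boot=200, seed=0):
    # Ratio T of the permutation-bootstrap variance of n*HSIC to the
    # estimated limit variance 2 ||Sigma_XX||_HS^2 ||Sigma_YY||_HS^2.
    rng = np.random.default_rng(seed)
    n = len(X)
    GX, GY = centered_gram(X, sigma), centered_gram(Y, sigma)
    # Permuting the sample permutes rows and columns of the centered Gram.
    boot = [n * hsic(GX, GY[np.ix_(p, p)])
            for p in (rng.permutation(n) for _ in range(n_boot))]
    var_lim = 2.0 * hsic(GX, GX) * hsic(GY, GY)
    return float(np.var(boot)) / var_lim

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
Y = rng.normal(size=(100, 1))
for sigma in (0.5, 1.0, 2.0):
    print(sigma, variance_ratio(X, Y, sigma))
```

One would scan a grid of $\sigma$ values and pick the smallest one whose $|T_\sigma - \alpha|$ is within a factor $(1 + \delta)$ of the best, as in the rule above.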
While there is no rigorous theoretical guarantee, in the next section we see that the method gives reasonable results for $\hat{I}^{NOCCO}_n$ and $\hat{I}^{COND}_n$.\n\n3 Experiments\n\nTo evaluate the dependence measures, we use a permutation test of independence for data sets with various degrees of dependence. The test randomly permutes the order of $Y_1, \ldots, Y_n$ to make many samples independent of $(X_1, \ldots, X_n)$, thus simulating the null distribution under independence. For the evaluation of $\hat{I}^{COND}_n$, the range of $Z$ is partitioned into $\mathcal{Z}_1, \ldots, \mathcal{Z}_L$ with the same number of data points in each bin, and the sample $\{(X_i, Y_i) \mid Z_i \in \mathcal{Z}_\ell\}$ within the $\ell$-th bin is randomly permuted. The significance level is always set to 5%. In the following experiments, we always use Gaussian kernels $e^{-\frac{1}{2\sigma^2}\|x_1 - x_2\|^2}$ and choose $\sigma$ by the method proposed in Section 2.5.\n\nSynthetic data for dependence. The random variables $X^{(0)}$ and $Y^{(0)}$ are independent and uniformly distributed on $[-2, 2]$ and $[a, b] \cup [-b, -a]$, respectively, so that $(X^{(0)}, Y^{(0)})$ has a scalar covariance matrix. $(X^{(\theta)}, Y^{(\theta)})$ is the rotation of $(X^{(0)}, Y^{(0)})$ by $\theta \in [0, \pi/4]$ (see Figure 1). $X^{(\theta)}$ and $Y^{(\theta)}$ are always uncorrelated, but dependent for $\theta \neq 0$. We generate 100 sets of 200 data points. We perform permutation tests with $\hat{I}^{NOCCO}_n$, $HSIC = \|\hat{\Sigma}^{(n)}_{YX}\|^2_{HS}$, and the mutual information (MI). For the empirical estimates of MI, we use the advanced method of [11], which requires no explicit estimation of the densities. Since $\hat{I}^{NOCCO}_n$ is an estimate of the mean square contingency, we also apply a relevant contingency-table-based independence test ([12]), partitioning the variables into bins. Figure 1 shows the values of $\hat{I}^{NOCCO}_n$ for a sample. In Table 1, we see that the results of $\hat{I}^{NOCCO}_n$ are stable w.r.t. the choice of $\varepsilon_n$, provided it is sufficiently small. 
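For concreteness, the estimator $\hat{I}^{NOCCO}_n = \mathrm{Tr}[R_Y R_X]$ of Eq. (8) and the permutation test described above can be sketched in NumPy as follows. This is our own sketch, not code from the paper: the kernel width is fixed rather than chosen by the rule of Section 2.5, and all names are ours.

```python
import numpy as np

def centered_gram(X, sigma=1.0):
    # Centered Gaussian Gram matrix G = H K H with H = I - (1/n) 11^T.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def i_nocco(X, Y, sigma=1.0, eps=1e-6):
    # Empirical measure of Eq. (8): Tr[R_Y R_X], R = G (G + n*eps*I)^{-1}.
    n = len(X)
    GX, GY = centered_gram(X, sigma), centered_gram(Y, sigma)
    RX = GX @ np.linalg.inv(GX + n * eps * np.eye(n))
    RY = GY @ np.linalg.inv(GY + n * eps * np.eye(n))
    return float(np.trace(RY @ RX))

def permutation_test(X, Y, n_perm=100, alpha=0.05, seed=0, **kw):
    # Accept independence iff the statistic is at or below the
    # (1 - alpha) quantile of its permutation null distribution.
    rng = np.random.default_rng(seed)
    stat = i_nocco(X, Y, **kw)
    null = [i_nocco(X, Y[rng.permutation(len(Y))], **kw)
            for _ in range(n_perm)]
    return stat <= np.quantile(null, 1.0 - alpha)  # True = independent

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
Y_dep = X + 0.3 * rng.normal(size=(100, 1))   # strongly dependent on X
Y_ind = rng.normal(size=(100, 1))             # independent of X
print(i_nocco(X, Y_dep), i_nocco(X, Y_ind))
```

Repeating the test over many generated samples, as in Table 1, would estimate the acceptance rate of independence at each dependence strength.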
We fix $\varepsilon_n = 10^{-6}$ for all remaining experiments. While all the methods are able to detect the dependence, $\hat{I}^{NOCCO}_n$ with the asymptotic choice of $\sigma$ is the most sensitive to very small dependence. We also observe that the chosen parameter $\sigma_Y$ for $Y$ increases from 0.58 to 2.0 as $\theta$ increases. The small $\sigma_Y$ for small $\theta$ seems reasonable, because the range of $Y$ is split into two small regions.\n\nAngle (degree): 0, 4.5, 9, 13.5, 18, 22.5, 27, 31.5, 36, 40.5, 45\n$\hat{I}^{NOCCO}_n$ ($\varepsilon = 10^{-4}$, Median): 94, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0\n$\hat{I}^{NOCCO}_n$ ($\varepsilon = 10^{-6}$, Median): 92, 20, 1, 0, 0, 0, 0, 0, 0, 0, 0\n$\hat{I}^{NOCCO}_n$ ($\varepsilon = 10^{-8}$, Median): 93, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0\n$\hat{I}^{NOCCO}_n$ (Asymp. Var.): 94, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0\nHSIC (Median): 93, 92, 63, 5, 0, 0, 0, 0, 0, 0, 0\nHSIC (Asymp. Var.): 93, 44, 1, 0, 0, 0, 0, 0, 0, 0, 0\nMI (#Nearest Neighbors = 1): 93, 62, 11, 0, 0, 0, 0, 0, 0, 0, 0\nMI (#Nearest Neighbors = 3): 96, 43, 0, 0, 0, 0, 0, 0, 0, 0, 0\nMI (#Nearest Neighbors = 5): 97, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0\nConting. Table (#Bins = 3): 100, 96, 46, 9, 1, 0, 0, 0, 0, 0, 0\nConting. Table (#Bins = 4): 98, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0\nConting. Table (#Bins = 5): 98, 82, 5, 0, 0, 0, 0, 0, 0, 0, 0\n\nTable 1: Comparison of dependence measures. The number of times independence is accepted out of 100 permutation tests is shown. “Asymp. Var.” is the method in Section 2.5. “Median” is a heuristic method [8] which chooses $\sigma$ as the median of pairwise distances of the data.\n\nChaotic time series. We evaluate a chaotic time series derived from the coupled Hénon map. The variables $X$ and $Y$ are four dimensional: the components $X_1, X_2, Y_1, Y_2$ follow the dynamics $(X_1(t+1), X_2(t+1)) = (1.4 - X_1(t)^2 + 0.3 X_2(t),\, X_1(t))$, $(Y_1(t+1), Y_2(t+1)) = (1.4 - \{\gamma X_1(t) Y_1(t) + (1 - \gamma) Y_2(t)^2\} + 0.1 Y_2(t),\, Y_1(t))$, and $X_3, X_4, Y_3, Y_4$ are independent noise distributed as $N(0, (0.5)^2)$. $X$ and $Y$ are independent for $\gamma = 0$, while they are synchronized chaos for $\gamma > 0$ (see Figure 2 for examples). A sample consists of 100 data points generated from this system. Table 2 shows the results of the permutation tests of independence for the instantaneous pairs $(X(t), Y(t))_{t=1}^{100}$. The proposed $\hat{I}^{NOCCO}_n$ outperforms the other methods in capturing small dependence.\n\nFigure 2: Chaotic time series. (a) Plot of the Hénon map and (b) $X_{t,1}$-$Y_{t,1}$ ($\gamma = 0.25$): examples of data. (c) $I(X_{t+1}, Y_t \mid X_t)$ and (d) $I(Y_{t+1}, X_t \mid Y_t)$: examples of $\hat{I}^{COND}_n$ (colored “o”) and the thresholds of the permutation test with significance level 5% (black “+”).\n\nNext, we apply $\hat{I}^{COND}_n$ to detect the causal structure of the same time series. 
Note that the series $X$ is a cause of $Y$ for $\gamma > 0$, but there is no causality in the opposite direction, i.e., $X_{t+1} \perp\!\!\!\perp Y_t \mid X_t$ and $Y_{t+1} \not\!\perp\!\!\!\perp X_t \mid Y_t$. In Table 3, it is remarkable that $\hat{I}^{COND}_n$ detects the small causal influence from $X_t$ to $Y_{t+1}$ for $\gamma \geq 0.1$, while for $\gamma = 0$ the result is close to the theoretical value of 95%.\n\nGraphical modeling from medical data. This is the inference of a graphical model from data with no time structure. The data consist of three variables: creatinine clearance (C), digoxin clearance (D), and urine flow (U). These were taken from 35 patients, and analyzed with graphical models in [3, Section 3.1.4]. From medical knowledge, D should be independent of U when controlling for C. Table 4 shows the results of the permutation tests and a comparison with the linear method. The relation $D \perp\!\!\!\perp U \mid C$ is strongly affirmed by $\hat{I}^{COND}_n$, while the partial correlation does not find it.\n\n$\gamma$ (strength of coupling): 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6\n$\hat{I}^{NOCCO}_n$: 97, 66, 21, 1, 0, 1, 0\nHSIC: 75, 70, 58, 52, 13, 1, 0\nMI (k = 3): 87, 91, 83, 73, 23, 6, 0\nMI (k = 5): 87, 88, 75, 67, 23, 5, 0\nMI (k = 7): 87, 86, 75, 64, 21, 5, 0\n\nTable 2: Results of the independence tests for the chaotic time series. The number of times independence was accepted out of 100 permutation tests is shown. $\gamma = 0$ implies independence.\n\n$\gamma$ (coupling): 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6\nH0: $Y_t$ is not a cause of $X_{t+1}$:\n  $\hat{I}^{NOCCO}_n$: 97, 96, 93, 85, 81, 68, 75\n  HSIC: 94, 94, 92, 81, 60, 73, 66\nH0: $X_t$ is not a cause of $Y_{t+1}$:\n  $\hat{I}^{NOCCO}_n$: 96, 0, 0, 0, 0, 0, 0\n  HSIC: 93, 95, 85, 56, 1, 1, 1\n\nTable 3: Results of the permutation test of non-causality for the chaotic time series. 
The number of times non-causality was accepted out of 100 tests is shown.\n\nKernel measure / Linear method:\n$D \perp\!\!\!\perp U \mid C$: $\hat{I}^{COND}_n$ = 1.458 (P-value 0.924); Parcorr$(D, U \mid C)$ = 0.4847 (P-value 0.0037)\n$C \perp\!\!\!\perp D$: $\hat{I}^{COND}_n$ = 0.776 (P-value < 0.001); Corr$(C, D)$ = 0.7754 (P-value 0.0000)\n$C \perp\!\!\!\perp U$: $\hat{I}^{COND}_n$ = 0.194 (P-value 0.117); Corr$(C, U)$ = 0.3092 (P-value 0.0707)\n$D \perp\!\!\!\perp U$: $\hat{I}^{COND}_n$ = 0.343 (P-value 0.023); Corr$(D, U)$ = 0.5309 (P-value 0.0010)\n\nTable 4: Graphical modeling from the medical data. Higher P-values indicate (conditional) independence more strongly.\n\n4 Concluding remarks\n\nThere are many dependence measures, and further theoretical and experimental comparison is important. That said, one unambiguous strength of the kernel measure we propose is its kernel-free population expression. It is interesting to ask whether other classical dependence measures, such as the mutual information, can be estimated by kernels (in a broader sense than the expansion about independence of [9]). A relevant measure is the kernel generalized variance (KGV [1]), which is based on a sum of the logarithms of the eigenvalues of $V_{YX}$, while $I^{NOCCO}$ is their squared sum. It is also interesting to investigate whether the KGV has a kernel-free expression. Another topic for further study is causal inference with the proposed measure, both with and without time information ([16]).\n\nReferences\n\n[1] F. Bach and M. Jordan. Kernel independent component analysis. J. Machine Learning Res., 3:1\u201348, 2002.\n[2] C. Baker. Joint measures and cross-covariance operators. Trans. Amer. Math. Soc., 186:273\u2013289, 1973.\n[3] D. Edwards. Introduction to Graphical Modelling. Springer-Verlag, New York, 2000.\n[4] K. Fukumizu, F. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Machine Learning Res., 8:361\u2013383, 2007.\n[5] K. Fukumizu, F. Bach, and M. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. 
Machine Learning Res., 5:73\u201399, 2004.\n[6] K. Fukumizu, F. Bach, and M. Jordan. Kernel dimension reduction in regression. Tech. Report 715, Dept. of Statistics, University of California, Berkeley, 2006.\n[7] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in NIPS 19. MIT Press, 2007.\n[8] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. 16th Intern. Conf. Algorithmic Learning Theory, pp. 63\u201377. Springer, 2005.\n[9] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. J. Machine Learning Res., 6:2075\u20132129, 2005.\n[10] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. Advances in NIPS 21, 2008, to appear.\n[11] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69:066138, 2004.\n[12] T. Read and N. Cressie. Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, 1988.\n[13] M. Reed and B. Simon. Functional Analysis. Academic Press, 1980.\n[14] A. Rényi. Probability Theory. North-Holland, 1970.\n[15] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Machine Learning Res., 2:67\u201393, 2001.\n[16] X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proc. 24th Intern. Conf. Machine Learning, 2007, to appear.\n[17] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J. 
Machine Learning Res., 2:243\u2013264, 2001.", "award": [], "sourceid": 559, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}, {"given_name": "Arthur", "family_name": "Gretton", "institution": null}, {"given_name": "Xiaohai", "family_name": "Sun", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}