{"title": "Independence clustering (without a matrix)", "book": "Advances in Neural Information Processing Systems", "page_first": 4013, "page_last": 4023, "abstract": "The  independence clustering problem is considered in the following formulation: given a set $S$ of random variables,  it is required to find the finest partitioning $\\{U_1,\\dots,U_k\\}$ of  $S$ into clusters  such that the clusters $U_1,\\dots,U_k$ are mutually independent. Since mutual independence is the target, pairwise similarity measurements are of no use, and thus traditional clustering algorithms are inapplicable.   The distribution of the random variables in $S$ is, in general, unknown, but a sample  is available.  Thus, the problem is cast in terms of time series.  Two forms of sampling are considered: i.i.d.\\ and stationary  time series, with the main emphasis being on the latter, more general, case. A consistent, computationally tractable algorithm for each of the settings is proposed, and a number of fascinating open directions for further research are outlined.", "full_text": "Independence clustering (without a matrix)\n\nDaniil Ryabko\nINRIA Lille,\n\n40 avenue de Halley, Villeneuve d\u2019Ascq, France\n\ndaniil@ryabko.net\n\nAbstract\n\nThe independence clustering problem is considered in the following formulation:\ngiven a set S of random variables, it is required to find the finest partitioning\n{U1, . . . , Uk} of S into clusters such that the clusters U1, . . . , Uk are mutually\nindependent. Since mutual independence is the target, pairwise similarity measurements\nare of no use, and thus traditional clustering algorithms are inapplicable. The\ndistribution of the random variables in S is, in general, unknown, but a sample is\navailable. Thus, the problem is cast in terms of time series. Two forms of sampling\nare considered: i.i.d. and stationary time series, with the main emphasis being on\nthe latter, more general, case. 
A consistent, computationally tractable algorithm for\neach of the settings is proposed, and a number of fascinating open directions for\nfurther research are outlined.\n\n1 Introduction\n\nMany applications face the situation where a set S = {x1, . . . , xN} of samples has to be divided into\nclusters in such a way that inside each cluster the samples are dependent, but the clusters between\nthemselves are as independent as possible. Here each xi may itself be a sample or a time series\nxi = X^i_1, . . . , X^i_n. For example, in financial applications, xi can be a series of recordings of prices of\na stock i over time. The goal is to find the segments of the market such that different segments evolve\nindependently, but within each segment the prices are mutually informative [15, 17]. In biological\napplications, each xi may be a DNA sequence, or may represent gene expression data [28, 20], or, in\nother applications, an fMRI record [4, 13].\nThe staple approach to this problem in applications is to construct a matrix of (pairwise) correlations\nbetween the elements, and use traditional clustering methods, e.g., linkage-based methods or k-means\nand its variants, with this matrix [15, 17, 16]. If mutual information is used, it is used as a (pairwise)\nproximity measure between individual inputs, e.g. [14].\nWe remark that pairwise independence is but a surrogate for (mutual) independence, and, in addition,\ncorrelation is a surrogate for pairwise independence. There is, however, no need to resort to surrogates\nunless forced to do so by statistical or computational hardness results. We therefore propose to\nreformulate the problem from first principles, and then show that it is indeed solvable both\nstatistically and computationally, but calls for completely different algorithms. The formulation\nproposed is as follows.\nGiven a set S = (x1, . . . , xN) of random variables, it is required to find the finest partitioning\n{U1, . 
. . , Uk} of S into clusters such that the clusters U1, . . . , Uk are mutually independent.\nTo our knowledge, this problem in its full generality has not been addressed before. A similar\ninformal formulation appears in the work [1] that is devoted to optimizing a generalization of the\nICA objective. However, the actual problem considered only concerns the case of tree-structured\ndependence, which allows for a solution based on pairwise measurements of mutual information.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nNote that in the fully general case pairwise measurements are useless, as are, furthermore, bottom-up\n(e.g., linkage-based) approaches. Thus, in particular, a proximity matrix cannot be used for the\nanalysis. Indeed, it is easy to construct examples in which any pair or any small group of elements\nare independent, but are dependent when the same group is considered jointly with more elements.\nFor instance, consider a group of Bernoulli 1/2-distributed random variables x1, . . . , xN+1, where\nx1, . . . , xN are i.i.d. and xN+1 = (\u2211_{i=1..N} xi) mod 2. Note that any N out of these N + 1 random\nvariables are i.i.d., but together the N + 1 are dependent. Add then two more groups like this, say,\ny1, . . . , yN+1 and z1, . . . , zN+1 that have the exact same distribution, with the groups of x, y and z\nmutually independent. Naturally, these are the three clusters we would want to recover. However, if\nwe try to cluster the union of the three, then any algorithm based on pairwise correlations will return\nan essentially arbitrary result. What is more, if we try to find clusters that are pairwise independent,\nthen, for example, the clustering {(xi, yi, zi)}i=1..N+1 of the input set into N + 1 clusters appears\ncorrect, but, in fact, the resulting clusters are dependent. 
Of course, real-world data does not come\nin the form of summed up Bernoulli variables, but this simple example shows that considering\nindependence of small subsets may be very misleading.\nThe considered problem is split into two parts considered separately: the computational and the\nstatistical part. This is done by first considering the problem assuming the joint distribution of\nall the random variables is known, and is accessible via an oracle. Thus, the problem becomes\ncomputational. A simple, computationally efficient algorithm is proposed for this case. We then\nproceed to the time-series formulations: the distribution of (x1, . . . , xN) is unknown, but a sample\n(X^1_1, . . . , X^N_1), . . . , (X^1_n, . . . , X^N_n) is provided, so that xi can be identified with the time series\nX^i_1, . . . , X^i_n. The sample may be either independent and identically distributed (i.i.d.), or, in a more\ngeneral formulation, stationary. As one might expect, relying on the existing statistical machinery, the\ncase of known distributions can be directly extended to the case of i.i.d. samples. Thus, we show that\nit is possible to replace the oracle access with statistical tests and estimators, and then use the same\nalgorithm as in the case of known distributions. The general case of stationary samples turns out\nto be much more difficult, in particular because of a number of strong impossibility results. In fact,\nit is challenging already to determine what is possible and what is not from the statistical point of\nview. In this case, it is not possible to replicate the oracle access to the distribution, but only its weak\nversion that we call a fickle oracle. We find that, in this case, it is only possible to have a consistent\nalgorithm for the case of known k. An algorithm that has this property is constructed. 
This algorithm\nis computationally feasible when the number of clusters k is small, as its complexity is O(N^{2k}).\nBesides, a measure of information divergence is proposed for time-series distributions that may be\nof independent interest, since it can be estimated consistently without any assumptions at all on the\ndistributions or their densities (the latter may not exist).\nThe main results of this work are theoretical. The goal is to determine, as a first step, what is\npossible and what is not from both statistical and computational points of view. The main emphasis\nis placed on highly dependent time series, as warranted by the applications cited above, leaving\nexperimental investigations for future work. The contribution can be summarized as follows:\n\u2022 a consistent, computationally feasible algorithm for known distributions, unknown number of clusters, and an extension to the case of unknown distributions and i.i.d. samples;\n\u2022 an algorithm that is consistent under stationary ergodic sampling with arbitrary, unknown distributions, but with a known number k of clusters;\n\u2022 an impossibility result for clustering stationary ergodic samples with k unknown;\n\u2022 an information divergence measure for stationary ergodic time-series distributions along with its estimator that is consistent without any extra assumptions.\n\nIn addition, an array of open problems and exciting directions for future work is proposed.\nRelated work. Apart from the work on independence clustering mentioned above, it is worth pointing\nout the relation to some other problems. First, the proposed problem formulation can be viewed\nas a Bayesian-network learning problem: given an unknown network, it is required to split it into\nindependent clusters. In general, learning a Bayesian network is NP-hard [5], even for rather restricted\nclasses of networks (e.g., [18]). 
Here the problem we consider is much less general, which is why it\nadmits a polynomial-time solution. A related clustering problem, proposed in [23] (see also [12]) is\nclustering time series with respect to distribution. Here, it is required to put two time series samples\nx1, x2 into the same cluster if and only if their distribution is the same. Similar to the independence\nclustering introduced here, this problem admits a consistent algorithm if the samples are i.i.d. (or\nmixing) and the number of distributions (clusters) is unknown, and in the case of stationary ergodic\nsamples if and only if k is known.\n\n2 Set-up and preliminaries\nA set S := {x1, . . . , xN} is given, where we will either assume that the joint distribution of xi is\nknown, or else that the distribution is unknown but a sample (X^1_1, . . . , X^1_n), . . . , (X^N_1, . . . , X^N_n) is\ngiven. In the latter case, we identify each xi with the sequence (sample) X^i_1, . . . , X^i_n, or X^i_{1..n} for\nshort, of length n. The lengths of the samples are the same only for the sake of notational convenience;\nit is easy to generalize all algorithms to the case of different sample lengths ni, but the asymptotics\nwould then be with respect to n := min_{i=1..N} ni. It is assumed that X^i_j \u2208 X := R are real-valued,\nbut extensions to more general cases are straightforward.\nFor random variables A, B, C we write (A \u22a5 B)|C to say that A is conditionally independent of B\ngiven C, and A \u22a5 B \u22a5 C to say that A, B and C are mutually independent.\nThe (unique up to a permutation) partitioning U := {U1, . . . , Uk} of the set S is called the ground-truth\nclustering if U1, . . . , Uk are mutually independent (U1 \u22a5 \u00b7\u00b7\u00b7 \u22a5 Uk) and no refinement of U\nhas this property. A clustering algorithm is consistent if it outputs the ground-truth clustering, and\nit is asymptotically consistent if w.p. 
1 it outputs the ground-truth clustering from some n on.\nFor a discrete A-valued r.v. X its Shannon entropy is defined as H(X) := \u2211_{a\u2208A} \u2212P(X = a) log P(X = a), letting 0 log 0 = 0. For a distribution with a density f its (differential) entropy is\ndefined as H(X) := \u2212\u222b f(x) log f(x) dx. For two random variables X, Y their mutual information\nI(X, Y) is defined as I(X, Y) = H(X) + H(Y) \u2212 H(X, Y). For discrete random variables, as well\nas for continuous ones with a density, X \u22a5 Y if and only if I(X, Y) = 0; see, e.g., [6]. Likewise,\nI(X1, . . . , Xm) is defined as \u2211_{i=1..m} H(Xi) \u2212 H(X1, . . . , Xm).\n\nFor the sake of convenience, in the next two sections we make the assumption stated below. However,\nwe will show (Sections 5, 6) that this assumption can be dropped as well.\nAssumption 1. All distributions in question have densities bounded away from zero on their support.\n\n3 Known distributions\n\nAs with any statistical problem, it is instructive to start with the case where the (joint) distribution of\nall the random variables in question is known. Finding out what can be done and how to do it in this\ncase helps us to set the goals for the (more realistic) case of unknown distributions.\nThus, in this section, x1, . . . , xN are not time series, but random variables whose joint distribution is\nknown to the statistician. The access to this distribution is via an oracle; specifically, our oracle will\nprovide answers to the following questions about mutual information (where, for convenience, we\nassume that the mutual information with the empty set is 0):\nOracle TEST. Given sets of random variables A, B, C, D \u2282 {x1, . . . , xN} answer whether\nI(A, B) > I(C, D).\nRemark 1 (Conditional independence oracle). Equivalently, one can consider an oracle that answers\nconditional independence queries of the form (A \u22a5 B)|C. 
The definition above is chosen for the sake\nof continuity with the next section, and it also makes the algorithm below more intuitive. However, in\norder to test conditional independence statistically one does not have to use mutual information, but\nmay resort to any other divergence measure instead.\n\nThe proposed algorithm (see the pseudocode listing in Figure 1) works as follows. It attempts to split the\ninput set recursively into two independent clusters, until it is no longer possible. To split a set in\ntwo, it starts with putting one element x from the input set S into a candidate cluster C := {x}, and\nmeasures its mutual information I(C, R) with the rest of the set, R := S \\ C. If I(C, R) is already 0\nthen we have split the set into two independent clusters and can stop. Otherwise, the algorithm then\ntakes the elements out of R one by one without replacement and each time checks whether I(C, R)\nhas decreased. Once such an element is found, it is moved from R to C and the process is restarted\nfrom the beginning with C thus updated. Note that, if we have started with I(C, R) > 0, then taking\nelements out of R without replacement we eventually should find one that decreases I(C, R), since\nI(C, \u2205) = 0 and I(C, R) cannot increase in the process.\n\nTheorem 1. The algorithm CLIN outputs the correct clustering using at most 2kN^2 oracle calls.\n\nProof. We shall first show that the procedure for splitting a set into two indeed splits the input set into\ntwo independent sets, if and only if such two sets exist. First, note that if I(C, S \\ C) = 0 then C \u22a5 R\nand the function terminates. In the opposite case, when I(C, S \\ C) > 0, by removing an element\nfrom R := S \\ C, I(C, R) can only decrease (indeed, h(C|R) \u2264 h(C|R \\ {x}) by the information\nprocessing inequality). Eventually when all elements are removed, I(C, R) = I(C,{}) = 0, so\nthere must be an element x whose removal decreases I(C, R). 
When such an element x is found it is\nmoved to C. Note that, in this case, indeed x is not independent of C. However, it is possible that removing an element x\nfrom R does not reduce I(C, R), yet x is not independent of C. This is why the while loop is needed, that is, the whole\nprocess has to be repeated until no elements can be moved to C. By the end of each for loop, we\nhave either found at least one element to move to C, or we have assured that C \u22a5 S \\ C and the\nloop terminates. Since there are only finitely many elements in S \\ C, the while loop eventually\nterminates. Moreover, each of the two loops (while and for) terminates in at most N iterations.\nFinally, notice that if (C1, C2) \u22a5 C3 and C1 \u22a5 C2 then also C1 \u22a5 C2 \u22a5 C3, which means that by\nrepeating the Split function recursively we find the correct clustering.\nFrom the above, the bound on the number of oracle calls is easily obtained by direct calculation.\n\nFigure 1: CLIN: cluster with k unknown, given an oracle for MI\nINPUT: The set S.\n(C1, C2) := Split(S)\nif C2 \u2260 \u2205 then\n  Output: CLIN(C1), CLIN(C2)\nelse\n  Output: C1\nend if\nFunction Split(Set S of samples)\n  Initialize: C := {x1}, R := S \\ C;\n  while TEST(I(C; R) > 0) do\n    for each x \u2208 R do\n      if TEST(I(C; R) > I(C; R \\ {x})) then\n        move x from R to C\n        break the for loop\n      else\n        move x from R to M\n      end if\n    end for\n    M := {}, R := S \\ C;\n  end while\n  Return(C, R)\nEND function\n\n4 I.I.D. sampling\n\nIn this section we assume that the distribution of (x1, . . . , xN) is not known, but an i.i.d. sample (X^1_1, . . . , X^N_1), . . . , (X^1_n, . . . , X^N_n) is provided. We identify xi with the (i.i.d.) time series X^i_{1..n}. Formally, N X-valued processes are just a single X^N-valued process. The latter can be seen as a matrix (X^i_j)_{i=1..N, j=1..\u221e}, where each row i is the sample xi = X^i_{1..n..} and each column j is what is observed at time j: X^1_j..X^N_j.\nThe case of i.i.d. samples is not much different from the case of a known distribution. What we need is to replace the oracle test with (nonparametric) statistical tests. First, a test for independence is needed to replace the oracle call TEST(I(C, R) > 0) in the while loop. Such tests are indeed available, see, for example, [8]. Second, we need an estimator of mutual information I(X, Y), or, which is sufficient, for entropies, but with a rate of convergence. If the rate of convergence is known to be asymptotically bounded by, say, t(n), then, in order to construct an asymptotically consistent test, we can take any t\u2032(n) \u2192 0 such that t(n) = o(t\u2032(n)) and decide our inequality as follows: if \u02c6I(C; R \\ {x}) < \u02c6I(C; R) \u2212 t\u2032(n) then say that I(C; R \\ {x}) < I(C; R). The required rates of convergence, which are of order \u221an under Assumption 1, can be found in [3].\nGiven the result of the previous section, it is clear that if the oracle is replaced by the tests described, then CLIN is a.s. consistent. Thus, we have demonstrated the following.\nTheorem 2. Under Assumption 1, there is an asymptotically consistent algorithm for independence clustering with i.i.d. sampling.\nRemark 2 (Necessity of the assumption). The independence test of [8] does not need Assumption 1,\nas it is distribution-free. Since the mutual information is defined in terms of densities, if we want\nto completely get rid of Assumption 1, we would need to use some other measure of dependence\nfor the test. One such measure is defined in the next section already for the general case of process\ndistributions. However, the rates of convergence of its empirical estimates under i.i.d. sampling\nremain to be studied.\n\nRemark 3 (Estimators vs. tests). 
As noted in Remark 1, the tests we are using are, in fact, tests\nfor (conditional) independence: testing I(C; R) > I(C; R \\ {x}) is testing for (C \u22a5 {x}|R \\\n{x}). Conditional independence can be tested directly, without estimating I (see, for example, [27]),\npotentially allowing for tighter performance guarantees under more general conditions.\n\n5 Stationary sampling\n\nWe now enter the hard mode. The general case of stationary sampling presents numerous obstacles,\noften in the form of theoretical impossibility results: there are (provably) no rates of convergence,\nno independence test, and zero mutual information rate does not guarantee independence. Besides,\nsome simple-looking questions regarding the existence of consistent tests, which indeed have simple\nanswers in the i.i.d. case, remain open in the stationary ergodic case. Despite all this, a computationally\nfeasible asymptotically consistent independence clustering algorithm can be obtained, although only\nfor the case of a known number of clusters. This parallels the situation of clustering according to the\ndistribution [23, 12].\nIn this section we assume that the distribution of (x1, . . . , xN) is not known, but a jointly stationary\nergodic sample (X^1_1, . . . , X^N_1), . . . , (X^1_n, . . . , X^N_n) is provided. Thus, xi is a stationary ergodic time\nseries X^i_{1..n}. Here is also where we drop Assumption 1; in particular, densities do not have to exist.\nThis new relaxed set of assumptions can be interpreted as using a weaker oracle, as explained in\nRemark 5 below.\nWe start with preliminaries about stationary processes, followed by impossibility results, and concluding\nwith an algorithm for the case of known k.\n\n5.1 Preliminaries: stationary ergodic processes\nStationarity, ergodicity, information rate. (Time-series) distributions, or processes, are measures\non the space (X^\u221e, F_{X^\u221e}), where F_{X^\u221e} is the Borel sigma-algebra of X^\u221e. 
Recall that N X-valued\nprocesses are just a single X^N-valued process. So the distributions are on the space ((X^N)^\u221e, F_{(X^N)^\u221e});\nthis will be often left implicit. For a sequence x \u2208 X^n and a set B \u2208 B denote \u03bd(x, B) the\nfrequency with which the sequence x falls in the set B. A process \u03c1 is stationary if \u03c1(X1..|B| =\nB) = \u03c1(Xt..t+|B|\u22121 = B) for any measurable B \u2208 X^* and t \u2208 N. We further abbreviate\n\u03c1(B) := \u03c1(X1..|B| = B). A stationary process \u03c1 is called (stationary) ergodic if the frequency of\noccurrence of each measurable B \u2208 X^* in a sequence X1, X2, . . . generated by \u03c1 tends to its a priori\n(or limiting) probability a.s.: \u03c1(limn\u2192\u221e \u03bd(X1..n, B) = \u03c1(B)) = 1. By virtue of the ergodic theorem,\nthis definition can be shown to be equivalent to the more standard definition of stationary ergodic\nprocesses given in terms of shift-invariant sets [26]. Denote S and E the sets of all stationary and\nstationary ergodic processes correspondingly. The ergodic decomposition theorem for stationary\nprocesses (see, e.g., [7]) states that any stationary process can be expressed as a mixture of stationary\nergodic processes. That is, a stationary process \u03c1 can be envisaged as first selecting a stationary\nergodic distribution according to some measure W\u03c1 over the set of all such distributions, and then\nusing this ergodic distribution to generate the sequence. More formally, for any \u03c1 \u2208 S there is a\nmeasure W\u03c1 on (S, FS), such that W\u03c1(E) = 1, and \u03c1(B) = \u222b dW\u03c1(\u00b5) \u00b5(B), for any B \u2208 F_{X^\u221e}.\n\nFor a stationary time series x, its m-order entropy hm(x) is defined as E_{X1..m\u22121} h(Xm|X1..m\u22121) (so\nthe usual Shannon entropy is the entropy of order 0). By stationarity, the limit limm\u2192\u221e hm exists\nand equals limm\u2192\u221e (1/m) h(X1..m) (see, for example, [6] for more details). This limit is called the entropy\nrate and is denoted h\u221e. For l stationary processes xi = (X^i_1, . . . , X^i_n, . . . ), i = 1..l, the m-order\nmutual information is defined as Im(x1, . . . , xl) := \u2211_{i=1..l} hm(xi) \u2212 hm(x1, . . . , xl) and the mutual\ninformation rate is defined as the limit\n\nI\u221e(x1, . . . , xl) := limm\u2192\u221e Im(x1, . . . , xl).   (1)\n\nDiscretisations and a metric. For each m, l \u2208 N, let Bm,l be a partitioning of X^m into 2^l sets such\nthat jointly they generate Fm of X^m, i.e. \u03c3(\u222al\u2208N Bm,l) = Fm. The distributional distance between a\npair of process distributions \u03c11, \u03c12 is defined as follows [7]:\n\nd(\u03c11, \u03c12) = \u2211_{m,l=1..\u221e} wm wl \u2211_{B\u2208Bm,l} |\u03c11(B) \u2212 \u03c12(B)|,   (2)\n\nwhere we set wj := 1/j(j + 1), but any summable sequence of positive weights may be used.\nAs shown in [22], empirical estimates of this distance are asymptotically consistent for arbitrary\nstationary ergodic processes. These estimates are used in [23, 12] to construct time-series clustering\nalgorithms for clustering with respect to distribution. Here we will only use this distance in the\nimpossibility results. Building on these ideas, Gy\u00f6rfi [9] suggested using a similar construction for\nstudying independence, namely d(\u03c11, \u03c12) = \u2211_{m,l=1..\u221e} wm wl \u2211_{A,B\u2208Bm,l} |\u03c11(A)\u03c12(B) \u2212 \u03c1(A \u00d7 B)|,\nwhere \u03c11 and \u03c12 are the two marginals of a process \u03c1 on pairs, and noted that its empirical estimates\nare asymptotically consistent. 
The distance we will use is similar, but is based on mutual information.\n\n5.2 Impossibility results\n\nFirst of all, while the definition of ergodic processes guarantees convergence of frequencies to the\ncorresponding probabilities, this convergence can be arbitrarily slow [26]: there are no meaningful\nbounds on |\u03bd(X1..n, 0) \u2212 \u03c1(X1 = 0)| in terms of n for ergodic \u03c1. This means that we cannot use\ntests for (conditional) independence of the kind employed in the i.i.d. case (Section 4).\nThus, the first question we have to pose is whether it is possible to test independence, that is, to say\nwhether x1 \u22a5 x2 based on stationary ergodic samples X^1_{1..n}, X^2_{1..n}. Here we show that the answer\nin a certain sense is negative, but some important questions remain open.\nAn (independence) test \u03d5 is a function that takes two samples X^1_{1..n}, X^2_{1..n} and a parameter \u03b1 \u2208 (0, 1),\ncalled the confidence level, and outputs a binary answer: independent or not. A test \u03d5 is \u03b1-level\nconsistent if, for every stationary ergodic distribution \u03c1 over a pair of samples (X^1_{1..n..}, X^2_{1..n..}), for\nevery confidence level \u03b1, \u03c1(\u03d5\u03b1(X^1_{1..n}, X^2_{1..n}) = 1) < \u03b1 if the marginal distributions of the samples\nare independent, and \u03d5\u03b1(X^1_{1..n}, X^2_{1..n}) converges to 1 as n \u2192 \u221e with \u03c1-probability 1 otherwise.\nThe next proposition can be established using the criterion of [25]. Recall that, for \u03c1 \u2208 S, the measure\nW\u03c1 over E is its ergodic decomposition. The criterion states that there is an \u03b1-level consistent test for\nH0 against E \\ H0 if and only if W\u03c1(H0) = 1 for every \u03c1 \u2208 cl H0.\nProposition 1. There is no \u03b1-level consistent independence test (jointly stationary ergodic samples).\n\nProof. 
The example is based on the so-called translation process, constructed as follows. Fix\nsome irrational \u03b1 \u2208 (0, 1) and select r0 \u2208 [0, 1] uniformly at random. For each i = 1..n.. let\nri = (ri\u22121 + \u03b1) mod 1 (the previous element is shifted by \u03b1 to the right, considering the [0,1]\ninterval looped). The samples Xi are obtained from ri by thresholding at 1/2, i.e. Xi := I{ri > 0.5}\n(here ri can be considered hidden states). This process is stationary and ergodic; besides, it has 0\nentropy rate [26], and this is not the last of its peculiarities. Take now two independent copies of this\nprocess to obtain a pair (x1, x2) = (X^1_1, X^2_1, . . . , X^1_n, X^2_n, . . . ). The resulting process on pairs, which\nwe denote \u03c1, is stationary, but it is not ergodic. To see the latter, observe that the difference between\nthe corresponding hidden states remains constant. In fact, each initial state (r1, r2) corresponds to\nan ergodic component of our process on pairs. By the same argument, these ergodic components\nare not independent. Thus, we have taken two independent copies of a stationary ergodic process,\nand obtained a stationary process which is not ergodic and whose ergodic components are pairs of\nprocesses that are not independent! To apply the criterion cited above, it remains to show that the\nprocess \u03c1 we constructed can be obtained as a limit of stationary ergodic processes on pairs. To see\nthis, consider, for each \u03b5, a process \u03c1\u03b5, whose construction is identical to \u03c1 except that instead of\nshifting the hidden states by \u03b1 we shift them by \u03b1 + u^\u03b5_i where u^\u03b5_i are i.i.d. uniformly random on\n[\u2212\u03b5, \u03b5]. 
It is easy to see that lim\u03b5\u21920 \u03c1\u03b5 = \u03c1 in distributional distance, and all \u03c1\u03b5 are stationary ergodic.\nThus, if H0 is the set of all stationary ergodic distributions on pairs, we have found a distribution\n\u03c1 \u2208 cl H0 such that W\u03c1(H0) = 0.\n\nThus, there is no consistent test that could provide a given level of confidence under H0, even if\nonly asymptotic consistency is required under H1. However, a yet weaker notion of consistency\nmight suffice to construct asymptotically consistent clustering algorithms. Namely, we could ask\nfor a test whose answer converges to either 0 or 1 according to whether the distributions generating\nthe samples are independent or not. Unfortunately, it is not known whether a test consistent in this\nweaker sense exists or not. I conjecture that it does not. The conjecture is based not only on the\nresult above, but also on the result of [24] that shows that there is no such test for the related problem\nof homogeneity testing, that is, for testing whether two given samples have the same or different\ndistributions. This negative result holds even if the distributions are independent, binary-valued, the\ndifference is restricted to P (X0 = 0), and, finally, for a smaller family of processes (B-processes).\nThus, for now what we can say is that there is no test for independence available that would be\nconsistent under ergodic sampling. Therefore, we cannot distinguish even between the cases of 1 and\n2 clusters. Thus, in the following it is assumed that the number of clusters k is given.\nThe last problem we have to address is mutual information for processes. The analogue of mutual\ninformation for stationary processes is the mutual information rate (1). Unfortunately, 0 mutual\ninformation rate does not imply independence. This is manifest on processes with 0 entropy rate, for\nexample those of the example in the proof of Proposition 1. 
What happens is that, if two processes\nare dependent, then indeed at least one of the m-order mutual informations Im is non-zero, but the limit may\nstill be zero. Since we do not know in advance which Im to take, we will have to consider all of them,\nas is explained in the next subsection.\n\n5.3 Clustering with the number of clusters known\n\nThe quantity introduced below, which we call sum-information, will serve as an analogue of mutual\ninformation in the i.i.d. case, allowing us to get around the problem that the mutual information\nrate may be 0 for a pair of dependent stationary ergodic processes. Defined in the same vein as the\ndistributional distance (2), this new quantity is a weighted sum over all the mutual informations up\nto time n; in addition, all the individual mutual informations are computed for quantized versions\nof random variables in question, with decreasing cell size of quantization, keeping all the mutual\ninformation resulting from different quantizations. The latter allows us not to require the existence\nof densities. Weighting is needed in order to be able to obtain consistent empirical estimates of the\ntheoretical quantity under study.\nFirst, for a process x = (X1, . . . , Xn, . . . ) and for each m, l \u2208 N define the l\u2019th quantized version\n[X1..m]^l of X1..m as the index of the cell of Bm,l to which X1..m belongs. Recall that each of the\npartitions Bm,l consists of 2^l cells, and that wl := 1/l(l + 1).\nDefinition 1 (sum-information). For stationary x1..xN define the sum-information\n\nsI(x1, . . . , xN) := \u2211_{m=1..\u221e} (1/m) wm \u2211_{l=1..\u221e} (1/l) wl ( \u2211_{i=1..N} h([X^i_{1..m}]^l) \u2212 h([X^1_{1..m}]^l, . . . , [X^N_{1..m}]^l) )   (3)\n\nThe next lemma follows from the fact that \u222al\u2208N Bm,l generates Fm and \u222am\u2208N Fm generates F\u221e.\nLemma 1. sI(x1, . . . , xN) = 0 if and only if x1, . . . 
, xN are mutually independent.\n\nThe empirical estimates \u0125_n([X^i_{1..m}]^l) of entropy are defined by replacing unknown probabilities by\nfrequencies; the estimate \u015dI_n(x1, . . . , xN ) of sI(x1, . . . , xN ) is obtained by replacing h in (3) with \u0125_n.\nRemark 4 (Computing \u015dI_n). The expression (3) might appear to hint at a computational disaster, as\nit involves two infinite sums, and, in addition, the number of elements in the sum inside h([\u00b7]^l) grows\nexponentially in l. However, it is easy to see that, when we replace the probabilities with frequencies,\nall but a finite number of summands are either zero or can be collapsed (because they are constant).\nMoreover, the sums can be further truncated so that the total computation becomes quasilinear in n.\nThis can be done exactly the same way as for distributional distance, as described in [12, Section 5].\n\nThe following lemma can be proven analogously to the corresponding statement about consistency of\nempirical estimates of the distributional distance, given in [22, Lemma 1].\nLemma 2. Let the distribution \u03c1 of x1, . . . , xN be jointly stationary ergodic. Then\n\u015dI_n(x1, . . . , xN ) \u2192 sI(x1, . . . , xN ) \u03c1-a.s.\n\nThis lemma alone is enough to establish the existence of a consistent clustering algorithm. To see this,\nfirst consider the following problem, which is the \u201cindependence\u201d version of the classical statistical\nthree-sample problem.\nThe 3-sample-independence problem. Three samples x1, x2, x3 are given, and it is known that\neither (x1, x2) \u22a5 x3 or x1 \u22a5 (x2, x3) but not both. It is required to find out which one is the case.\nProposition 2. 
There exists an algorithm for solving the 3-sample-independence problem that is\nasymptotically consistent under ergodic sampling.\n\nIndeed, it is enough to consider an algorithm that compares \u015dI_n((x1, x2), x3) and \u015dI_n(x1, (x2, x3))\nand answers according to whichever is smaller.\nThe independence clustering problem which we are after is a generalisation of the\n3-sample-independence problem to N samples. We can also have a consistent algorithm for the clustering\nproblem, simply comparing all possible clusterings U1, . . . , Uk of the N samples given and selecting\nwhichever minimizes \u015dI_n(U1, . . . , Uk). Such an algorithm is of course not practical, since the number\nof computations it makes must be exponential in N and k. We will show that the number of candidate\nclusterings can be reduced dramatically, making the problem amenable to computation.\n\nFigure 2: CLINk: cluster given k and an estimator of mutual sum-information\n\nConsider all the clusterings obtained by applying recursively the function Split to each of the sets in each of\nthe candidate partitions, starting with the input set S, until k clusters are obtained. Output the clustering U that\nminimizes \u015dI(U).\n\nFunction Split(Set S of samples)\n  Initialize: C := {x1}, R := S \\ C, P := {}\n  while R \u2260 \u2205 do\n    Initialize: M := {}, d := 0; xmax := index of any x in R\n    Add (C, R) to P\n    for each x \u2208 R do\n      r := \u015dI(C, R)\n      move x from R to M\n      r\u2032 := \u015dI(C, R); d\u2032 := r \u2212 r\u2032\n      if d\u2032 > d then\n        d := d\u2032, xmax := index of(x)\n      end if\n    end for\n    Move x_xmax from M to C; R := S \\ C\n  end while\n  Return(List of candidate splits P)\nEND function\n\nThe proposed algorithm CLINk (Algorithm 2, Figure 2) works similarly to CLIN, but with some important\ndifferences. Like before, the main procedure is to attempt\nto split the given set of samples into two clusters. 
This\nsplitting procedure starts with a single element x1 and estimates its sum-information \u015dI(x1, R) with the rest of\nthe elements, R. It then takes the elements out of R one by one without replacement, each time measuring how\nthis changes \u015dI(x1, R). As before, once and if we find an element that is not independent of x1, this change will\nbe positive. However, unlike in the i.i.d. case, here we cannot test whether this change is 0. Yet, we can say that\nif, among the tested elements, there is one that gives a non-zero change in sI, then one of such elements will be\nthe one that gives the maximal change in \u015dI (provided, of course, that we have enough data for the estimates \u015dI to\nbe close enough to the theoretical values sI). Thus, we keep each split that arises from such a maximal-change\nelement, resulting in O(N^2) candidate splits for the case of 2 clusters. For k clusters, we have to consider all the\ncombinations of the splits, resulting in O(N^{2k\u22122}) candidate clusterings. Then select the one that minimizes \u015dI.\n\nTheorem 3. CLINk is asymptotically consistent under ergodic sampling. This algorithm makes at most N^{2k\u22122}\ncalls to the estimator of mutual sum-information.\n\nProof. The consistency of \u015dI (Lemma 2) implies that, for every \u03b5 > 0, from some n on w.p. 1, all the estimates of\nsI the algorithm uses will be within \u03b5 of their sI values. Since sI(U1, . . . , Uk) = 0 if and only if U1, . . . , Uk is\nthe correct clustering (Lemma 1), it is enough to show that, assuming all the \u015dI estimates are close enough to\nthe sI values, the clustering that minimizes \u015dI(U1, . . . 
, Uk)\nis among those the algorithm searches through, that is,\namong the clusterings obtained by applying recursively the function Split to each of the sets in each\nof the candidate partitions, starting with the input set S, until k clusters are obtained.\nTo see the latter, note that on each iteration of the while loop, we either already have a correct candidate\nsplit in P, that is, a split (U1, U2) such that sI(U1, U2) = 0, or we find (executing the for loop) an\nelement x\u2032 to add to the set C such that x\u2032 is not independent of C. Indeed, if at least one such element x\u2032 exists, then\namong all such elements there is one that maximizes the difference d\u2032. Since the set C is initialized as\na singleton, a correct split is eventually found if it exists. Applying the same procedure exhaustively\nto each of the elements of each of the candidate splits producing all the combinations of k candidate\nclusterings, under the assumption that all the estimates \u015dI are sufficiently close to the corresponding\nvalues, we are guaranteed to have the one that minimizes sI(U1, . . . , Uk) among the output.\nRemark 5 (Fickle oracle). Another way to look at the difference between the stationary and the\ni.i.d. cases is to consider the following \u201cfickle\u201d version of the oracle test of Section 3. Consider\nthe oracle that, as before, given sets of random variables A, B, C, D \u2282 {x1, . . . , xN} answers\nwhether sI(A, B) > sI(C, D). However, the answer is only guaranteed to be correct in the case\nsI(A, B) \u2260 sI(C, D). If sI(A, B) = sI(C, D) then the answer is arbitrary (and can be considered\nadversarial). One can see that Lemma 2 guarantees the existence of the oracle that has the requisite\nfickle correctness property asymptotically, that is, w.p. 1 from some n on. 
It is also easy to see that\nAlgorithm 2 can be rewritten in terms of calls to such an oracle.\n\n6 Generalizations, future work\n\nA general formulation of the independence clustering problem has been presented, and an attempt\nhas been made to trace out broadly the limits of what is possible and what is not possible in this\nformulation. In doing so, clear-cut formulations have been favoured over utmost generality, and over,\non the other end of the spectrum, precise performance guarantees. Thus, many interesting questions\nhave been left out; some of these are outlined in this section.\nBeyond time series. For the case when the distribution of the random variables xi is unknown, we\nhave assumed that a sample X^i_{1..n} is available for each i = 1..N. Thus, each xi is represented by a\ntime series. A time series is but one form the data may come in. Other forms include functional data,\nmulti-dimensional or continuous-time processes, or graphs. Generalizations to some of these models,\nsuch as, for example, space-time stationary processes, are relatively straightforward, while others\nrequire more care. Some generalizations to infinite stationary graphs may be possible along the lines\nof [21]. In any case, the generalization problem is statistical (rather than algorithmic). If the number\nof clusters is unknown, we need to be able to emulate the oracle test of Section 3 with\nstatistical tests. As explained in Section 4, it is sufficient to find a test for conditional independence,\nor an estimator of entropy along with guarantees on its convergence rates. If these are not available,\nas is the case of stationary ergodic samples, we can still have a consistent algorithm for k known,\nas long as we have an asymptotically consistent estimator of mutual information (without rates), or,\nmore generally, if we can emulate the fickle oracle (Remark 5).\nBeyond independence. 
The problem formulation considered rests on the assumption that there exists\na partition U1, . . . , Uk of the input set S such that U1, . . . , Uk are jointly independent, that is, such\nthat I(U1, . . . , Uk) = 0. In reality, perhaps, nothing is really independent, and so some relaxations\nare in order. It is easy to introduce some thresholding in the algorithms (replacing 0 in each test by\nsome threshold \u03b1) and derive some basic consistency guarantees for the resulting algorithms. The\ngeneral problem formulation is to find a finest clustering such that I(U1, . . . , Uk) \u2264 \u03b5, for a given \u03b5\n(note that, unlike in the independence case of \u03b5 = 0, such a clustering may not be unique). If one\nwants to get rid of \u03b5, a tree of clusterings may be considered for all \u03b5 \u2265 0, which is a common way to\ntreat unknown parameters in the clustering literature (e.g., [2]). Another generalization can be obtained\nby considering the problem from the graphical model point of view. The random variables xi are\nvertices of a graph, and edges represent dependencies. In this representation, clusters are connected\ncomponents of the graph. A generalization then is to clusters that are the smallest components that\nare connected (to each other) by at most l edges, where l is a parameter. Yet another generalization\nwould be to decomposable distributions of [10].\nPerformance guarantees. Non-asymptotic results (finite-sample performance guarantees) can be\nobtained under additional assumptions, using the corresponding results on (conditional) independence\ntests and on estimators of divergence between distributions. Here it is worth noting that we are\nnot restricted to using the mutual information I, but any measure of divergence can be used, for\nexample, R\u00e9nyi divergence; a variety of relevant estimators and corresponding bounds, obtained\nunder such assumptions as H\u00f6lder continuity, can be found in [19, 11]. 
From any such bounds, at\nleast some performance guarantees for CLIN can be obtained simply using the union bound over all\nthe invocations of the tests.\nComplexity. The algorithmic aspects of the problem have only been touched upon in this work. Thus,\nit remains to find out what is the computational complexity of the studied problem. So far, we have\npresented only some upper bounds, by constructing algorithms and bounding their complexity (kN^2\nfor CLIN and N^{2k} for CLINk). Lower bounds (and better upper bounds) are left for future work.\nA subtlety worth noting is that, for the case of known distributions, the complexity may be affected\nby the choice of the oracle. In other words, some calculations may be \u201cpushed\u201d inside the oracle.\nIn this regard, it may be better to consider the oracle for testing conditional independence, rather\nthan a comparison of mutual informations, as explained in Remarks 1, 3. The complexity of the\nstationary-sampling version of the problem can be studied using the fickle oracle of Remark 5. The\nconsistency of the algorithm should then be established for every assignment of those answers of the\noracle that are arbitrary (adversarial).\n\nReferences\n[1] Francis R Bach and Michael I Jordan. Beyond independent components: trees and clusters.\nJournal of Machine Learning Research, 4(Dec):1205\u20131233, 2003.\n\n[2] Maria-Florina Balcan, Yingyu Liang, and Pramod Gupta. Robust hierarchical clustering. Journal\nof Machine Learning Research, 15(1):3831\u20133871, 2014.\n\n[3] Jan Beirlant, Edward J Dudewicz, L\u00e1szl\u00f3 Gy\u00f6rfi, and Edward C Van der Meulen. Nonparametric\nentropy estimation: An overview. International Journal of Mathematical and Statistical\nSciences, 6(1):17\u201339, 1997.\n\n[4] Simon Benjaminsson, Peter Fransson, and Anders Lansner. 
A novel model-free data analysis\ntechnique based on clustering in a mutual information space: application to resting-state fMRI.\nFrontiers in systems neuroscience, 4:34, 2010.\n\n[5] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from\ndata, pages 121\u2013130. Springer, 1996.\n\n[6] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience,\nNew York, NY, USA, 2006.\n\n[7] Robert M. Gray. Probability, Random Processes, and Ergodic Properties. Springer Verlag,\n1988.\n\n[8] Arthur Gretton and L\u00e1szl\u00f3 Gy\u00f6rfi. Consistent nonparametric tests of independence. Journal of\nMachine Learning Research, 11(Apr):1391\u20131423, 2010.\n\n[9] L\u00e1szl\u00f3 Gy\u00f6rfi. Private communication. 2011.\n\n[10] Radim Jirou\u0161ek. Solution of the marginal problem and decomposable distributions.\nKybernetika, 27(5):403\u2013412, 1991.\n\n[11] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and\nJames M Robins. Influence functions for machine learning: Nonparametric estimators for\nentropies, divergences and mutual informations. arXiv preprint arXiv:1411.4342, 2014.\n\n[12] Azadeh Khaleghi, Daniil Ryabko, J\u00e9r\u00e9mie Mary, and Philippe Preux. Consistent algorithms for\nclustering time series. Journal of Machine Learning Research, 17:1\u201332, 2016.\n\n[13] Artemy Kolchinsky, Martijn P van den Heuvel, Alessandra Griffa, Patric Hagmann, Luis M\nRocha, Olaf Sporns, and Joaqu\u00edn Go\u00f1i. Multi-scale integration and predictability in resting\nstate brain activity. Frontiers in Neuroinformatics, 8, 2014.\n\n[14] Alexander Kraskov, Harald St\u00f6gbauer, Ralph G Andrzejak, and Peter Grassberger. Hierarchical\nclustering using mutual information. EPL (Europhysics Letters), 70(2):278, 2005.\n\n[15] Rosario N Mantegna. Hierarchical structure in financial markets. 
The European Physical\nJournal B-Condensed Matter and Complex Systems, 11(1):193\u2013197, 1999.\n\n[16] Guillaume Marrelec, Arnaud Mess\u00e9, and Pierre Bellec. A Bayesian alternative to mutual\ninformation for the hierarchical clustering of dependent random variables. PloS one, 10(9):e0137278,\n2015.\n\n[17] Gautier Marti, S\u00e9bastien Andler, Frank Nielsen, and Philippe Donnat. Clustering financial time\nseries: How long is enough? In IJCAI\u201916, 2016.\n\n[18] Christopher Meek. Finding a path is harder than finding a tree. J. Artif. Intell. Res. (JAIR),\n15:383\u2013389, 2001.\n\n[19] D\u00e1vid P\u00e1l, Barnab\u00e1s P\u00f3czos, and Csaba Szepesv\u00e1ri. Estimation of R\u00e9nyi entropy and mutual\ninformation based on generalized nearest-neighbor graphs. In Advances in Neural Information\nProcessing Systems, pages 1849\u20131857, 2010.\n\n[20] Ido Priness, Oded Maimon, and Irad Ben-Gal. Evaluation of gene-expression clustering via\nmutual information distance measure. BMC bioinformatics, 8(1):111, 2007.\n\n[21] D. Ryabko. Hypotheses testing on infinite random graphs. In Proceedings of the 28th\nInternational Conference on Algorithmic Learning Theory (ALT\u201917), volume 76 of PMLR, pages\n400\u2013411, Kyoto, Japan, 2017. JMLR.\n\n[22] D. Ryabko and B. Ryabko. Nonparametric statistical inference for ergodic processes. IEEE\nTransactions on Information Theory, 56(3):1430\u20131435, 2010.\n\n[23] Daniil Ryabko. Clustering processes. In Proc. the 27th International Conference on Machine\nLearning (ICML 2010), pages 919\u2013926, Haifa, Israel, 2010.\n\n[24] Daniil Ryabko. Discrimination between B-processes is impossible. Journal of Theoretical\nProbability, 23(2):565\u2013575, 2010.\n\n[25] Daniil Ryabko. Testing composite hypotheses about discrete ergodic processes. Test, 21(2):317\u2013329,\n2012.\n\n[26] P. Shields. The interactions between ergodic theory and information theory. IEEE Trans. 
on\nInformation Theory, 44(6):2079\u20132093, 1998.\n\n[27] K Zhang, J Peters, D Janzing, and B Sch\u00f6lkopf. Kernel-based conditional independence test and\napplication in causal discovery. In Proceedings of the 27th Annual Conference on Uncertainty\nin Artificial Intelligence (UAI), 2011.\n\n[28] Xiaobo Zhou, Xiaodong Wang, Edward R Dougherty, Daniel Russ, and Edward Suh. Gene\nclustering based on clusterwide mutual information. Journal of Computational Biology, 11(1):147\u2013161,\n2004.\n", "award": [], "sourceid": 2139, "authors": [{"given_name": "Daniil", "family_name": "Ryabko", "institution": "INRIA"}]}