Locating Changes in Highly Dependent Data with Unknown Number of Change Points

Advances in Neural Information Processing Systems, pages 3086-3094

Azadeh Khaleghi
SequeL-INRIA/LIFL-CNRS, Université de Lille, France
azadeh.khaleghi@inria.fr

Daniil Ryabko
SequeL-INRIA/LIFL-CNRS
daniil@ryabko.net

Abstract

The problem of multiple change point estimation is considered for sequences with an unknown number of change points. A consistency framework is suggested that is suitable for highly dependent time-series, and an asymptotically consistent algorithm is proposed. The only assumption required to establish consistency is that the data are generated by stationary ergodic time-series distributions. No modeling, independence or parametric assumptions are made; the data are allowed to be dependent, and the dependence can be of arbitrary form. The theoretical results are complemented with experimental evaluations.

1 Introduction

We are given a sequence x := X_1, X_2, ..., X_n formed as the concatenation of an unknown number k + 1 of sequences, such that x = X_{1..π_1} X_{π_1+1..π_2} ... X_{π_k+1..n}.
The time-series distributions that generate each pair of adjacent sequences separated by an index π_i, i = 1..k, are different. (Non-adjacent sequences are allowed to be generated by the same distribution.) The so-called change points π_i, i = 1..k, are unknown and are to be estimated. Change point estimation is one of the core problems in statistics and, as such, has been studied extensively under various formulations. However, even nonparametric formulations of the problem typically assume that the data in each segment are independent and identically distributed, and that the change necessarily affects the single-dimensional marginal distributions. In this paper we consider the most general nonparametric setting, where the changes may be completely arbitrary (e.g., in the form of long-range dependence). We propose a change point estimation algorithm that is asymptotically consistent under such minimal assumptions.

Motivation. Change point analysis is an indispensable tool in a broad range of applications such as market analysis, bioinformatics, network traffic, and audio/video segmentation, to name only a few. Clearly, in these applications the data can be highly dependent and cannot be easily modeled by parametric families of distributions. From a machine learning perspective, change point estimation is a difficult unsupervised learning problem: the objective is to estimate the change points in a given sequence while no labeled examples are available. To better understand the challenging nature of the problem, it is useful to compare it to time-series clustering. In time-series clustering, a set of sequences is to be partitioned, whereas in change point estimation the partitioning is done on a sequence of sequences. While the objectives are the same, in the latter case information about the individual elements is no longer available, since only a single sequence formed by their concatenation is provided as input.
This makes change point estimation a more challenging problem than time-series clustering.

In the general setting of highly dependent time-series, correct estimation of the number of change points is provably impossible, even in the weakest asymptotic sense, and even if there is at most one change [23]. While a popular mitigation is to consider more restrictive settings, we are interested in intermediate formulations that admit asymptotically consistent solutions under the most general assumptions. In light of the similarities between clustering and change point analysis, we propose a formulation that is motivated by hierarchical clustering. When the number of clusters is unknown, a hierarchical clustering algorithm produces a tree, such that some pruning of this tree gives the ground-truth clustering (e.g., [3]). In change point estimation with an unknown number k of change points, we suggest to aim for a sorted list of change points whose first k elements are some permutation of the true change points. An algorithm that achieves this goal is called consistent.

Related Work. Change point analysis is a classical problem in mathematical statistics [6, 4, 5, 17]. In a typical formulation, samples within each segment are assumed to be i.i.d. and the change usually refers to a change in the mean. More general formulations are often considered as well; however, it is usually assumed that the samples are i.i.d. in each of the segments [20, 8, 9, 21] or that they belong to some specific model class (such as hidden Markov processes) [15, 16, 27]. In these frameworks the problem of estimating the number of change points is usually addressed with penalized criteria; see, for example, [19, 18]. In nonparametric settings, the typical assumptions usually impose restrictions on the form of the change or on the nature of the dependence (e.g., the time-series are assumed strongly mixing) [6, 4, 10, 12].
Even when more general settings are considered, it is almost exclusively assumed that the single-dimensional marginal distributions are different [7]. The framework considered in this paper is similar to that of [25] and of our recent paper [13], in the sense that the only assumption made is that the distributions generating the data are stationary ergodic. The particular case of k = 1 is considered in [25]. In [13] we provide a non-trivial extension of [25] to the case where k > 1 is known and is provided to the algorithm. However, as mentioned above, when the number k of change points is unknown, it is provably impossible to estimate it, even under the assumption k ∈ {0, 1} [23]. In particular, if the input k is not the correct number of change points, then the behavior of the algorithm proposed in [13] can be arbitrarily bad.

Results. We present a nonparametric change point estimation algorithm for time-series data with an unknown number of change points. We consider the most general framework in which the only assumption made is that the unknown distributions generating the data are stationary ergodic. This means that we make no assumptions such as independence, finite memory or mixing. Moreover, we do not require the finite-dimensional marginals of any fixed size before and after the change points to be different. Nor are the marginal distributions required to have densities.

We show that the proposed algorithm is asymptotically consistent in the sense that, among the change point estimates that it outputs, the first k converge to the true change points. Moreover, our algorithm can be efficiently computed: it has a computational complexity of O(n² polylog n), where n is the length of the input sequence.
To the best of our knowledge, this work is the first to address the change point problem with an unknown number of change points in such a general framework.

We further confirm our theoretical findings through experiments on synthetic data. Our experimental setup is designed to demonstrate the generality of the suggested framework. To this end, we generate our data by time-series distributions that, while being stationary ergodic, do not belong to any "simpler" class of processes. In particular, they cannot be modeled as hidden Markov processes with finite or countably infinite sets of states. Through our experiments we show that the algorithm is consistent in the sense that, as the length of the input sequence grows, the produced change point estimates converge to the actual change points.

Organization. In Section 2 we introduce some preliminary notation and definitions. We formulate the problem in Section 3. Section 4 presents our main theoretical results, including the proposed algorithm and an informal description of how and why it works. In Section 5 we prove that the proposed algorithm is asymptotically consistent under the general framework considered; we also show that our algorithm can be computed efficiently. In Section 6 we present some experimental results, and finally in Section 7 we provide some concluding remarks and future directions.

2 Notation and definitions

Let X be some measurable space (the domain); in this work we let X = R, but extensions to more general spaces are straightforward. For a sequence X_1, ..., X_n we use the abbreviation X_{1..n}. Consider the Borel σ-algebra B on X^∞ generated by the cylinders {B × X^∞ : B ∈ B^{m,l}, m, l ∈ N}, where the sets B^{m,l}, m, l ∈ N are obtained by partitioning X^m into cubes of dimension m and volume 2^{−ml} (starting at the origin). Let also B^m := ∪_{l∈N} B^{m,l}.
Processes are probability measures on the space (X^∞, B). For x = X_{1..n} ∈ X^n and B ∈ B^m, let ν(x, B) denote the frequency with which x falls in B, i.e.

  ν(x, B) := (I{n ≥ m} / (n − m + 1)) Σ_{i=1}^{n−m+1} I{X_{i..i+m−1} ∈ B}.   (1)

A process ρ is stationary if for any i, j ∈ 1..n and B ∈ B^m, m ∈ N, we have ρ(X_{1..j} ∈ B) = ρ(X_{i..i+j−1} ∈ B). A stationary process ρ is called (stationary) ergodic if for all B ∈ B we have lim_{n→∞} ν(X_{1..n}, B) = ρ(B) with ρ-probability 1. The distributional distance between a pair of process distributions ρ_1 and ρ_2 is defined as

  d(ρ_1, ρ_2) := Σ_{m,l=1}^∞ w_m w_l Σ_{B∈B^{m,l}} |ρ_1(B) − ρ_2(B)|,

where w_i := 2^{−i}, i ∈ N. Note that any summable sequence of positive weights also works. It is easy to see that d(·,·) is a metric. For more on the distributional distance and its properties see [11]. In this work we use empirical estimates of this distance. Specifically, the empirical estimate of the distance between a sequence x = X_{1..n} ∈ X^n, n ∈ N, and a process distribution ρ is defined as

  d̂(x, ρ) := Σ_{m,l=1}^∞ w_m w_l Σ_{B∈B^{m,l}} |ν(x, B) − ρ(B)|,   (2)

and for a pair of sequences x_i ∈ X^{n_i}, n_i ∈ N, i = 1, 2, it is defined as

  d̂(x_1, x_2) := Σ_{m,l=1}^∞ w_m w_l Σ_{B∈B^{m,l}} |ν(x_1, B) − ν(x_2, B)|.   (3)

Although expressions (2) and (3) involve infinite sums, they can be easily calculated [22].
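To illustrate how such a truncated sum can be computed in practice, the following Python sketch estimates the distance between two real-valued sequences by counting cell frequencies. It is only a minimal illustration, not the authors' implementation: the helper names (`nu`, `d_hat`) and the truncation bounds `m_max`, `l_max` are ours (standing in for truncating the infinite sums, in the spirit of the more general estimate ď), and sample values are assumed to lie in [0, 1).

```python
from collections import Counter

def nu(x, m, l):
    """Empirical frequencies of the m-tuples of x over cells of B^{m,l}.

    A cell is identified by quantizing each coordinate to a multiple of
    2^{-l} (cubes of side 2^{-l}, hence volume 2^{-ml} in dimension m).
    Assumes values in [0, 1) so integer truncation indexes cells correctly.
    """
    n = len(x)
    if n < m:
        return Counter()
    counts = Counter(
        tuple(int(x[i + j] * 2 ** l) for j in range(m))  # cell of X_{i..i+m-1}
        for i in range(n - m + 1)
    )
    total = n - m + 1
    return Counter({cell: c / total for cell, c in counts.items()})

def d_hat(x1, x2, m_max, l_max):
    """Truncated empirical distributional distance between two sequences."""
    w = lambda i: 2.0 ** -i  # the weights w_i = 2^{-i}
    total = 0.0
    for m in range(1, m_max + 1):
        for l in range(1, l_max + 1):
            f1, f2 = nu(x1, m, l), nu(x2, m, l)
            # Counter returns 0 for missing cells, so this sums |nu1 - nu2|
            # over every cell visited by either sequence.
            total += w(m) * w(l) * sum(
                abs(f1[b] - f2[b]) for b in set(f1) | set(f2)
            )
    return total
```

Identical sequences yield distance 0, while sequences concentrated in different cells yield a strictly positive value; with longer samples from two stationary ergodic processes the value approximates their distributional distance.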
Moreover, the estimates d̂(·,·) are asymptotically consistent [25]: for any pair of stationary ergodic distributions ρ_1, ρ_2 generating sequences x_i ∈ X^{n_i}, i = 1, 2, we have

  lim_{n_1,n_2→∞} d̂(x_1, x_2) = d(ρ_1, ρ_2) a.s., and   (4)

  lim_{n_i→∞} d̂(x_i, ρ_j) = d(ρ_i, ρ_j), i, j ∈ 1, 2, a.s.   (5)

Moreover, a more general estimate of d(·,·) may be obtained as

  ď(x_1, x_2) := Σ_{m=1}^{m_n} Σ_{l=1}^{l_n} w_m w_l Σ_{B∈B^{m,l}} |ν(x_1, B) − ν(x_2, B)|,   (6)

where m_n and l_n are any sequences of integers that go to infinity with n. As shown in [22], the consistency results (4) and (5) for d̂ equally hold for ď, so long as m_n and l_n go to infinity with n.

Let x = X_{1..n} be a sequence and consider a subsequence X_{a..b} of x with a < b ∈ 1..n. We define the intra-subsequence distance of X_{a..b} as

  Δ_x(a, b) := d̂(X_{a..⌊(a+b)/2⌋}, X_{⌈(a+b)/2⌉..b}).   (7)

We further define the single-change point estimator of X_{a..b}, a < b, as

  Φ_x(a, b, α) := argmax_{t∈[a,b]} d̂(X_{a−nα..t}, X_{t..b+nα}), α ∈ (0, 1).   (8)

3 Problem Formulation

We formalize the multiple change point estimation problem as follows. We are given a sequence

  x := X_1, ..., X_n ∈ X^n

which is the concatenation of an unknown number κ + 1 of sequences

  X_{1..π_1}, X_{π_1+1..π_2}, ..., X_{π_κ+1..n}.

Each of these sequences is generated by an unknown stationary ergodic process distribution. Moreover, every two consecutive sequences are generated by two different process distributions.
(A pair of non-consecutive sequences may be generated by the same distribution.) The process distributions are not required to be independent. The parameters π_k, k = 1..κ, are unknown and have to be estimated; they are called change points. Note that it is not required for the means, variances or single-dimensional marginals of the distributions to be different. We consider the most general scenario, in which all that is required is that the process distributions are different.

Definition 1 (change point estimator). A change point estimator is a function that takes a sequence x and a parameter λ ∈ (0, 1) and outputs a sequence of change point estimates, π̂ := π̂_1, π̂_2, ..., π̂_{1/λ}. (Note that the total number 1/λ of estimated change points may be larger than the true number of change points κ.)

To construct consistent algorithms, we assume that the change points π_k are linear in n, i.e. π_k := nθ_k, where the θ_k ∈ (0, 1), k = 1..κ, are unknown. We also define the minimum normalized distance between the change points as

  λ_min := min_{k=1..κ+1} (θ_k − θ_{k−1})   (9)

where θ_0 := 0 and θ_{κ+1} := 1, and assume λ_min > 0. The reason why we impose these conditions is that the consistency properties we are after are asymptotic in n. If the length of one of the sequences is constant or sublinear in n, then asymptotic consistency is impossible in this setting. We define the consistency of a change point estimator as follows.

Definition 2 (Consistency of a change point estimator). Let π̂ := π̂_1, π̂_2, ..., π̂_{1/λ} be the output of a change point estimator. Let θ̂(κ) = (θ̂_1, ..., θ̂_κ) := sort(π̂_1/n, ..., π̂_κ/n), where sort(·) orders the first κ elements π̂_1, ..., π̂_κ of π̂ with respect to their order of appearance in x. We call the change point estimator asymptotically consistent if with probability 1 we have

  lim_{n→∞} sup_{k=1..κ} |θ̂_k − θ_k| = 0.

4 Theoretical Results

In this section we introduce a nonparametric multiple change point estimation algorithm for the case where the number of change points is unknown. We also give an informal description of the algorithm and intuitively explain why it works. The main result is Theorem 1, which states that the proposed algorithm is consistent under the most general assumptions. Moreover, the computational complexity of the algorithm is O(n² polylog n), where n denotes the length of the input sequence.

The main steps of the algorithm are as follows. Given λ ∈ (0, 1), two sequences of evenly-spaced indices are formed. Each index-sequence is used to partition x = X_{1..n} into consecutive segments of length nα, where α := λ/3. The single-change point estimator Φ(·,·,·) is used to generate a candidate change point within every segment. Moreover, the intra-subsequence distance Δ(·,·) of each segment is used as its performance score s(·,·). The change point candidates are ordered according to the performance scores of their corresponding segments. The algorithm assumes the input parameter λ to be a lower bound on the true minimum normalized distance λ_min between actual change points. Hence, the sorted list of estimated change points is filtered in such a way that its elements are at least λn apart. The algorithm outputs an ordered sequence π̂ of change point estimates, where the ordering is done with respect to the performance scores s(·,·). The length of π̂ may be larger than κ.
However, as we show in Theorem 1, from some n on, the first κ elements π̂_k, k = 1..κ, of the output π̂ converge to some permutation of the true change points π_1, ..., π_κ.

Theorem 1. Let x := X_{1..n} ∈ X^n, n ∈ N, be a sequence whose change points are at least nλ_min apart, for some λ_min ∈ (0, 1). Then Alg1(x, λ) is asymptotically consistent for every λ ∈ (0, λ_min].

Remark 2 (Computational complexity). While the definition (3) of d̂(·,·) involves taking infinite sums, the distance can be calculated efficiently. Indeed, in (3) all summands corresponding to m > max_{i=1,2} n_i equal 0; moreover, all summands are equal for those values of l whose partition places at most one point in each cell, which holds for all l with 2^{−l} < s_min, where s_min := min_{i,j∈1..n, X_i≠X_j} |X_i − X_j|. Thus, even with a most naive implementation the computational complexity of the algorithm is at most polynomial in all arguments. A more efficient implementation can be obtained if one uses ď(·,·) given by (6) instead of d̂(·,·), with m_n = log n, where n is the length of the samples; in this case, the consistency results are unaffected, and the computational complexity of calculating the distance becomes n polylog n, making the complexity of the algorithm n² polylog n. The choice m_n = log n is further justified by the fact that the frequencies of cells in B^{m,l} corresponding to higher values of m are not consistent estimates of their probabilities (and thus only add to the error of the estimate); see [22, 14] for further discussion.

Algorithm 1 Estimating the change points

input: sequence x = X_{1..n}; minimum normalized distance λ between the change points
initialize: step size α ← λ/3; output change point sequence π̂ ← ()
1. Generate 2 index-sequences:
   b^t_i ← nα(i + 1/(t+1)), i = 0..1/α, t = 1, 2
2. Calculate the intra-subsequence distance (given by (7)) of every segment X_{b^t_i..b^t_{i+1}}, i = 1..1/α − 1, t = 1, 2, as its performance score:
   s(t, i) ← Δ_x(b^t_i, b^t_{i+1}), i = 1..1/α − 1, t = 1, 2
3. Use the single-change point estimator (given by (8)) to estimate a change point in every segment:
   p̂(t, i) ← Φ_x(b^t_i, b^t_{i+1}, α), i = 1..1/α − 1, t = 1, 2
4. Remove duplicates and sort based on scores:
   U ← {(t, i) : i ∈ 1..1/α − 1, t = 1, 2}
   while U ≠ ∅ do
      i. Select an available change point estimate of highest score and append it to π̂ (breaking ties arbitrarily):
         (τ, l) ← argmax_{(t,i)∈U} s(t, i)
         π̂ ← π̂ ⊕ p̂(τ, l), i.e. append p̂(τ, l) to π̂
      ii. Remove the estimates within a radius of λn/2 of p̂(τ, l):
         U ← U \ {(t, i) : p̂(t, i) ∈ (p̂(τ, l) − λn/2, p̂(τ, l) + λn/2)}
   end while
return: a sequence π̂ of change point estimates. Note: the elements of π̂ are at least nλ apart and are sorted in decreasing order of their scores s(·,·).

The proof of the theorem is given in the next section. Here we provide an intuition as to why the consistency statement holds. First, recall that the empirical distributional distance between a given pair of sequences converges to the distributional distance between the corresponding process distributions.
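To make the structure of Algorithm 1 concrete, here is a schematic Python rendering of its four steps. It is a sketch rather than the authors' implementation: `dist` is assumed to be any consistent estimate of the distributional distance between two sample sequences (such as ď), boundary handling is simplified, and the variable names are ours.

```python
def estimate_change_points(x, lam, dist):
    """Schematic rendering of Algorithm 1.

    x    : the input sequence X_1..n (any indexable of samples)
    lam  : assumed lower bound on the minimum normalized gap lambda_min
    dist : consistent estimate of the distributional distance between
           two sample sequences (e.g. an empirical estimate of d)
    """
    n = len(x)
    step = int(n * lam / 3.0)  # segment length n*alpha with alpha = lam/3

    def delta(a, b):
        # intra-subsequence distance (7): compare the two halves of X_a..b
        mid = (a + b) // 2
        return dist(x[a:mid], x[mid:b])

    def phi(a, b):
        # single-change point estimator (8) over X_a..b, margin n*alpha
        lo, hi = max(a - step, 0), min(b + step, n)
        return max(range(a + 1, b), key=lambda t: dist(x[lo:t], x[t:hi]))

    # Steps 1-3: two offset index grids (t = 1, 2); score each segment and
    # produce one candidate change point inside it.
    candidates = []
    for t in (1, 2):
        bounds = [step * i + step // (t + 1) for i in range(n // step)]
        for a, b in zip(bounds, bounds[1:]):
            candidates.append((delta(a, b), phi(a, b)))  # (score, estimate)

    # Step 4: highest score first; drop candidates within radius lam*n/2
    # of an already accepted estimate.
    out = []
    for score, p in sorted(candidates, reverse=True):
        if all(abs(p - q) >= lam * n / 2 for q in out):
            out.append(p)
    return out
```

For instance, with a sequence whose halves come from two sources distinguishable by `dist`, the top-ranked estimate lands at the boundary between them; lower-ranked entries (possibly more than κ of them) trail behind, as Definition 1 allows.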
Consider a sequence x = X_{1..n}, and assume that a segment X_{a..b}, a, b ∈ 1..n, does not contain any change points, so that X_{a..⌊(a+b)/2⌋} and X_{⌈(a+b)/2⌉..b} are generated by the same process. If the length of X_{a..b} is linear in n, so that b − a = αn for some α ∈ (0, 1), then its intra-subsequence distance Δ_x(a, b) (defined by (7)) converges to 0 as n goes to infinity. On the other hand, if there is a single change point π within X_{a..b} whose distance from a and b is linear in n, then Δ_x(a, b) converges to a non-zero constant. Now assume that X_{a..b} with its change point at π ∈ a..b is contained within a larger segment X_{a−nα′..b+nα′} for some α′ ∈ (0, 1). In this case, the single-change point estimator Φ_x(a, b, α′) (defined by (8)) produces an estimate that from some n on converges to π, provided that π is the only change point in X_{a−nα′..b+nα′}. These observations are key to the consistency of the algorithm.

When λ ≤ λ_min, each of the index-sequences generated with α := λ/3 partitions x in such a way that every three consecutive segments of the partition contain at most one change point. Also, the segments are of lengths linear in n. In this scenario, from some n on, the change point estimator Φ(·,·,·) produces correct candidates within each of the segments that contain a true change point. Moreover, from some n on, the performance scores s(·,·) of the segments without change points converge to 0, while those corresponding to the segments that encompass a change point converge
Thus from some n on, the \u03ba change point candidates of highest performance\nscore that are at least at a distance \u03bbn from one another, each converge to a unique change point.\nA problem occurs if the generated index-sequence is such that it includes some of the change points\nas elements. As a mitigation strategy, we generate two index-sequences with the same gap \u03b1n\nbetween their consecutive elements but with distinct starting points: one starts at n\u03b1\n2 and the other\nat n\u03b1\n3 . Each index-sequence gives a different partitioning of x into consecutive segments. This way,\nevery change point is fully encompassed by at least one segment from either of the two partitions.\nWe choose the appropriate segments based on their performance scores. From the above argument\nwe can see that segments with change points will have higher scores, and the change points within\nwill be estimated correctly; \ufb01nally, this is used to prove the theorem in the next session.\n\n5 Proof of Theorem 1\n\nThe proof relies on Lemma 1 and Lemma 2, which we borrow from [13] and state here without proof.\nWe also require the following additional notation.\nDe\ufb01nition 3. For every change point \u03c0k, k = 1..\u03ba and every \ufb01xed t = 1, 2 we denote by Lt(\u03c0k)\nand by Rt(\u03c0k) the elements of the index-sequence bt\n\u03b1 that appear immediately to the left\nand to the right of \u03c0k respectively, i.e. Lt(\u03c0k) :=\ni.\nbt\nmin\n(Equality occurs when \u03c0k for some k \u2208 1..\u03ba is exactly at the start or at the end of a segment.)\nLemma 1 ([13]). Let x = X1..n be generated by a stationary ergodic process \u03c1. For all \u03b6 \u2208 [0, 1)\nand \u03b1 \u2208 (0, 1) we have,\n\ni, i = 1.. 1\ni\u2264\u03c0k, i=0.. 1\nbt\n\ni and Rt(\u03c0k) :=\nbt\n\ni\u2265\u03c0k, i=0.. 1\nbt\n\n\u2206x(b1, b2) = 0.\n\nmax\n\nsup\n\n\u03b1\n\n\u03b1\n\nlim\nn\u2192\u221e\n\nb1\u2265\u03b6n, b2\u2265b1+\u03b1n\n\nLemma 2 ([13]). 
Let δ denote the minimum distance between the distinct distributions generating the data. Denote by κ the "unknown" number of change points, and assume that for some ζ ∈ (0, 1) and some t = 1, 2 we have inf_{k=1..κ} inf_{i=0..1/α} |b^t_i − π_k| ≥ ζn.

(i) With probability one we have

  lim_{n→∞} inf_{k∈1..κ} Δ_x(L^t(π_k), R^t(π_k)) ≥ δζ.

(ii) If additionally we have that [L^t(π_k) − nα, R^t(π_k) + nα] ⊆ [π_{k−1}, π_{k+1}], then with probability one we obtain

  lim_{n→∞} sup_{k∈1..κ} (1/n)|Φ_x(L^t(π_k), R^t(π_k), α) − π_k| = 0.

Proof of Theorem 1. We first give an outline of the proof. In order for a change point π_k, k ∈ 1..κ, to be estimated correctly by this algorithm, there needs to be at least one t = 1, 2 such that

  1. π_k ∈ (L^t(π_k), R^t(π_k)), and 2. [L^t(π_k) − nα, R^t(π_k) + nα] ⊆ [π_{k−1}, π_{k+1}],

where α := λ/3, as specified by the algorithm. We show that from some n on, for every change point the algorithm selects an appropriate segment satisfying these conditions, and assigns it a performance score s(·,·) that converges to a non-zero constant. Moreover, the performance scores of the segments without change points converge to 0. Recall that the change point candidates are finally sorted according to their performance scores, and the sorted list is filtered to include only elements that are at least λn apart. For λ ≤ λ_min, from some n on, the first κ elements of the output change point sequence π̂ are some permutation of the true change points. The detailed proof follows.

Fix an ε > 0.
Recall that the algorithm specifies α := λ/3 and generates two sequences of evenly-spaced indices b^t_i := nα(i + 1/(t+1)), i = 0..1/α, t = 1, 2. Observe that

  b^t_i − b^t_{i−1} = nα, i = 1..1/α.   (10)

For every i ∈ 0..1/α and t = 1, 2, the index b^t_i is either exactly equal to a change point or has a linear distance from it. More formally, define ζ(t, i) := min_{k∈1..κ} |α(i + 1/(t+1)) − θ_k|, i ∈ 0..1/α, t ∈ 1..2. (Note that ζ(t, i) can also be zero.) For all i ∈ 0..1/α, t = 1, 2 and k ∈ 1..κ we have

  |b^t_i − π_k| ≥ nζ(t, i).   (11)

For every t = 1, 2 and i = 0..1/α, a performance score s(t, i) is calculated as the intra-subsequence distance Δ_x(b^t_i, b^t_{i+1}) of the segment X_{b^t_i..b^t_{i+1}}. Let I := {(t, i) : t ∈ 1, 2, i ∈ 1..1/α s.t. ∃k ∈ 1..κ, π_k ∈ (b^t_i, b^t_{i+1})}. Also define the complement set I′ := {1, 2} × {1..1/α} \ I. By (10), (11) and Lemma 1, there exists some N_1 such that for all n ≥ N_1 we have

  sup_{(t,i)∈I′} s(t, i) ≤ ε.   (12)

Since λ ≤ λ_min, we have α ∈ (0, λ_min/3]. Therefore, for every t = 1, 2 and every change point π_k, k ∈ 1..κ, we have

  [L^t(π_k) − nα, R^t(π_k) + nα] ⊆ [π_{k−1}, π_{k+1}].   (13)

Define μ_min := min_{(t,i)∈I} ζ(t, i). It follows from the definition of I that

  μ_min > 0.   (14)

By (10), (11), (13), (14) and Lemma 2(i), there exists some N_2 such that for all n ≥ N_2 we have

  inf_{(t,i)∈I} s(t, i) ≥ δμ_min   (15)

where δ denotes the minimum distance between the distinct distributions. Let π(t, i), (t, i) ∈ I, denote the change point that is contained within b^t_i..b^t_{i+1}, i.e. π(t, i) := π_k, k ∈ 1..κ s.t. π_k ∈ (b^t_i, b^t_{i+1}). As specified in Step 3, the change point candidates are obtained as p̂(t, i) := Φ_x(b^t_i, b^t_{i+1}, α), i = 1..1/α − 1. By (10), (11), (13), (14) and Lemma 2(ii) there exists some N_3 such that for all n ≥ N_3 we have

  sup_{(t,i)∈I} (1/n)|p̂(t, i) − π(t, i)| ≤ ε.   (16)

Let N := max_{i=1..3} N_i. Recall that (as specified in Step 4) the algorithm generates an output sequence π̂ := π̂_1, ..., π̂_{1/λ} by first sorting the change point candidates according to their performance scores, and then filtering the sorted list so that the remaining elements are at least nλ apart. It remains to see that the corresponding estimate of every change point appears exactly once in π̂. By (12) and (15), for all n ≥ N the segments X_{b^t_i..b^t_{i+1}}, (t, i) ∈ I, are assigned higher scores than those with (t, i) ∈ I′. Moreover, by construction, for every change point π_k, k = 1..κ, there exists some (t, i) ∈ I such that π_k = π(t, i), which by (16) is estimated correctly for all n ≥ N. Next we show that every estimate appears at most once in the output sequence π̂.
By (16), for all (t, i), (t′, i′) ∈ I such that π(t, i) = π(t′, i′) and all n ≥ N we have

  (1/n)|p̂(t, i) − p̂(t′, i′)| ≤ (1/n)|p̂(t, i) − π(t, i)| + (1/n)|p̂(t′, i′) − π(t′, i′)| ≤ 2ε.   (17)

On the other hand, for all (t, i), (t′, i′) ∈ I such that π(t, i) ≠ π(t′, i′) and all n ≥ N we have

  (1/n)|p̂(t, i) − p̂(t′, i′)| ≥ (1/n)|π(t, i) − π(t′, i′)| − (1/n)|p̂(t, i) − π(t, i)| − (1/n)|p̂(t′, i′) − π(t′, i′)|
    ≥ (1/n)|π(t, i) − π(t′, i′)| − 2ε ≥ λ_min − 2ε   (18)

where the last inequality follows from (16) and from the fact that the true change points are at least nλ_min apart. By (17) and (18), the duplicate estimates of every change point are filtered out, while estimates corresponding to different change points are left untouched. Finally, following the notation of Definition 2, let θ̂(κ) = (θ̂_1, ..., θ̂_κ) := sort(π̂_1/n, ..., π̂_κ/n) (sorted with respect to their order of appearance in x). For n ≥ N we have sup_{k∈1..κ} |θ̂_k − θ_k| ≤ ε, and the statement follows.

6 Experimental Results

In this section we use synthetically generated time-series data to empirically evaluate our algorithm. To generate the data we have selected distributions that, while being stationary ergodic, do not belong to any "simpler" class of time-series and are difficult to approximate by finite-state models.
In\nparticular they cannot be modeled by a hidden Markov process with a \ufb01nite state-space. These dis-\ntributions were used in [26] as examples of stationary ergodic processes which are not B-processes.\n\n7\n\n\fFigure 1: Left (Experiment 1): Average (over 20 runs) error as a function of the length of the input\nsequence. Right (Experiment 2): Average (over 25 runs) error as a function the input parameter \u03bb.\n\nTime-series generation. To generate a sequence x = X1..n we proceed as follows. Fix some pa-\nrameter \u03b1 \u2208 (0, 1) and select r0 \u2208 [0, 1]. For each i = 1..n let ri = ri\u22121 + \u03b1 \u2212 (cid:98)ri\u22121 + \u03b1(cid:99). The\nsamples Xi are obtained from ri by thresholding at 0.5, i.e. Xi := I{ri > 0.5}. We call this pro-\ncedure DAS(\u03b1). If \u03b1 is irrational then x forms a stationary ergodic time-series. We simulate \u03b1 by\na longdouble with a long mantisa. For the purpose of our experiments we use four different process\ndistributions DAS(\u03b1i), i = 1..4 with \u03b11 = 0.30..., \u03b12 = 0.35..., \u03b13 = 0.40... and \u03b14 = 0.45....\nTo generate an input sequence x = X1..n we \ufb01x some \u03bbmin = 0.23 and randomly generate \u03ba = 3\nchange points at a minimum distance n\u03bbmin. We use DAS(\u03b1i), i = 1..4 to respectively generate\nthe four subsequences between every pair of consecutive change points.\nExperiment 1: (Convergence with Sequence Length) In this experiment we demonstrate that the\nestimation error converges to 0 as the sequence length grows. We iterate over n = 1000..20000; at\nevery iteration we generate an input sequence of length n as described above. We apply Algorithm 1\nwith \u03bb = 0.18 to \ufb01nd the change points. Figure 1 (Left) shows the average error-rate as a function\nof sequence length.\nExperiment 2: (Dependence on \u03bb) Algorithm 1 requires \u03bb \u2208 (0, 1) as a lower-bound on \u03bbmin.\nIn this experiment we show that this lower bound need not be tight. 
In particular, there is a rather large range of λ ≤ λ_min for which the estimation error is low. To demonstrate this, we fixed the sequence length n = 20000 and observed the error-rate as we varied the input parameter λ between 0.01 and 0.35. Figure 1 (Right) shows the average error-rate as a function of λ.

7 Outlook

In this work we propose a consistency framework for multiple change point estimation in highly dependent time-series, for the case where the number of change points is unknown. The notion of consistency that we consider requires an algorithm to produce a list of change points such that the first k change points asymptotically approach the true unknown change points. While in the general setting that we consider it is not possible to estimate the number of change points, other related formulations may be of interest. For example, if the number of different time-series distributions is known, but the number of change points is not, it may still be possible to estimate the latter. A simple example of this scenario would be when two distributions generate many segments in alternation. While the consistency results here (and in the previous works [14, 22, 25]) rely on the convergence of frequencies, recent results of [1, 2] on uniform convergence can be used (see [24]) to solve related statistical problems about time-series (e.g., clustering) and thus may also prove useful in change point analysis.

Acknowledgements. This work is supported by the French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and FEDER through CPER 2007-2013, ANR projects EXPLO-RA (ANR-08-COSI-004) and Lampada (ANR-09-EMER-007), by an INRIA Ph.D.
grant to Azadeh Khaleghi, by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (project CompLACS), and by Pascal-2.

References

[1] Terrence M. Adams and Andrew B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38:1345–1367, 2010.
[2] Terrence M. Adams and Andrew B. Nobel. Uniform approximation and bracketing properties of VC classes. Bernoulli, to appear.
[3] M.F. Balcan and P. Gupta. Robust hierarchical clustering. In COLT, 2010.
[4] M. Basseville and I.V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice Hall information and system sciences series. Prentice Hall, 1993.
[5] P.K. Bhattacharya. Some aspects of change-point analysis. Lecture Notes–Monograph Series, pages 28–56, 1994.
[6] B.E. Brodsky and B.S. Darkhovsky. Nonparametric Methods in Change-Point Problems. Mathematics and its applications. Kluwer Academic Publishers, 1993.
[7] E. Carlstein and S. Lele. Nonparametric change-point estimation for data from an ergodic sequence. Teor. Veroyatnost. i Primenen., 38:910–917, 1993.
[8] L. Dumbgen. The asymptotic behavior of some nonparametric change-point estimators. The Annals of Statistics, 19(3):1471–1495, 1991.
[9] D. Ferger. Exponential and polynomial tailbounds for change-point estimators. Journal of Statistical Planning and Inference, 92(1-2):73–109, 2001.
[10] L. Giraitis, R. Leipus, and D. Surgailis. The change-point problem for dependent observations. Journal of Statistical Planning and Inference, 53(3), 1996.
[11] R. Gray. Probability, Random Processes, and Ergodic Properties. Springer Verlag, 1988.
[12] S. B. Hariz, J. J.
Wylie, and Q. Zhang. Optimal rate of convergence for nonparametric change-point estimators for nonstationary sequences. Annals of Statistics, 35(4):1802–1826, 2007.
[13] A. Khaleghi and D. Ryabko. Multiple change-point estimation in highly dependent time series. Technical report, arXiv:1203.1515, 2012.
[14] A. Khaleghi, D. Ryabko, J. Mary, and P. Preux. Online clustering of processes. In AISTATS, JMLR W&CP 22, pages 601–609, 2012.
[15] J. Kohlmorgen and S. Lemm. A dynamic HMM for on-line segmentation of sequential data. Advances in Neural Information Processing Systems, 14:793–800, 2001.
[16] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[17] T.L. Lai. Sequential changepoint detection in quality control and dynamical systems. Journal of the Royal Statistical Society, pages 613–658, 1995.
[18] Marc Lavielle. Using penalized contrasts for the change-point problem. Signal Processing, 85(8):1501–1510, 2005.
[19] E. Lebarbier. Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing, 85(4):717–736, 2005.
[20] C.B. Lee. Nonparametric multiple change-point estimators. Statistics & Probability Letters, 27(4):295–304, 1996.
[21] Hidetoshi Murakami. A nonparametric location-scale statistic for detecting a change point. The International Journal of Advanced Manufacturing Technology, 2001.
[22] D. Ryabko. Clustering processes. In ICML, pages 919–926, Haifa, Israel, 2010.
[23] D. Ryabko. Discrimination between B-processes is impossible. Journal of Theoretical Probability, 23(2):565–575, 2010.
[24] D. Ryabko and J. Mary. Reducing statistical time-series problems to binary classification. In NIPS, Lake Tahoe, USA, 2012.
[25] D. Ryabko and B. Ryabko. Nonparametric statistical inference for ergodic processes. IEEE Transactions on Information Theory, 56(3), 2010.
[26] P. Shields. The Ergodic Theory of Discrete Sample Paths. AMS Bookstore, 1996.
[27] X. Xuan and K. Murphy. Modeling changing dependency structure in multivariate time series. In ICML, pages 1055–1062. ACM, 2007.
", "award": [], "sourceid": 1426, "authors": [{"given_name": "Azadeh", "family_name": "Khaleghi", "institution": null}, {"given_name": "Daniil", "family_name": "Ryabko", "institution": null}]}