{"title": "Kernel Change-point Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 609, "page_last": 616, "abstract": "We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an (unlabelled) sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to another different distribution. We propose a test statistic based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish the consistency under the alternative hypothesis (a change occurs). This allows us to build a statistical hypothesis testing procedure for testing the presence of a change-point, with a prescribed false-alarm probability and detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistic also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented.", "full_text": "Kernel Change-point Analysis\n\nZa\u00efd Harchaoui\n\nLTCI, TELECOM ParisTech and CNRS\n\n46, rue Barrault, 75634 Paris cedex 13, France\n\nzaid.harchaoui@enst.fr\n\nFrancis Bach\n\nWillow Project, INRIA-ENS\n\n45, rue d\u2019Ulm, 75230 Paris, France\nfrancis.bach@mines.org\n\n\u00c9ric Moulines\n\nLTCI, TELECOM ParisTech and CNRS\n\n46, rue Barrault, 75634 Paris cedex 13, France\n\neric.moulines@enst.fr\n\nAbstract\n\nWe introduce a kernel-based method for change-point analysis within a sequence\nof temporal observations. 
Change-point analysis of an unlabelled sample of observations\nconsists in, first, testing whether a change in the distribution occurs within\nthe sample, and second, if a change occurs, estimating the change-point instant\nafter which the distribution of the observations switches from one distribution to\nanother different distribution. We propose a test statistic based upon the maximum\nkernel Fisher discriminant ratio as a measure of homogeneity between segments.\nWe derive its limiting distribution under the null hypothesis (no change occurs),\nand establish the consistency under the alternative hypothesis (a change occurs).\nThis allows us to build a statistical hypothesis testing procedure for testing the presence\nof a change-point, with a prescribed false-alarm probability and detection\nprobability tending to one in the large-sample setting. If a change actually occurs,\nthe test statistic also yields an estimator of the change-point location. Promising\nexperimental results in temporal segmentation of mental tasks from BCI data and\npop song indexation are presented.\n\n1 Introduction\n\nThe need to partition a sequence of observations into several homogeneous segments arises in many\napplications, ranging from speaker segmentation to pop song indexation. So far, such tasks were\nmost often dealt with using probabilistic sequence models, such as hidden Markov models [1], or\ntheir discriminative counterparts such as conditional random fields [2]. These probabilistic models\nrequire a sound knowledge of the transition structure between the segments and demand careful\ntraining beforehand to yield competitive performance; when data are acquired online, inference in\nsuch models is also not straightforward (see, e.g., [3, Chap. 8]). 
Such models essentially perform\nmultiple change-point estimation, while one is often also interested in meaningful quantitative measures\nfor the detection of a change-point within a sample.\n\nWhen a parametric model is available to model the distributions before and after the change, a comprehensive\nliterature for change-point analysis has been developed, which provides optimal criteria\nfrom the maximum likelihood framework, as described in [4]. Nonparametric procedures were also\nproposed, as reviewed in [5], but were limited to univariate data and simple settings. Online counterparts\nhave also been proposed and mostly built upon the cumulative sum scheme (see [6] for\nextensive references). However, so far, even the extension to the case where the distribution before the\nchange is known, and the distribution after the change is not known, remains an open problem [7].\nThis brings to light the need to develop statistically grounded change-point analysis algorithms,\nworking on multivariate, high-dimensional, and also structured data.\n\nWe propose here a regularized kernel-based test statistic, which simultaneously provides\nquantitative answers to both questions: 1) is there a change-point within the sample? 2) if there is\none, then where is it? We prove that our test statistic for change-point analysis has a false-alarm probability\ntending to \u03b1 and a detection probability tending to one as the number of observations tends\nto infinity. Moreover, the test statistic directly provides an accurate estimate of the change-point\ninstant. Our method readily extends to multiple change-point settings, by performing a sequence of\nchange-point analyses in sliding windows running along the signal. 
Usually, physical considerations\nallow one to set the window length small enough to guarantee that at most one\nchange-point occurs within each window, and large enough for our decision rule to be\nstatistically significant (typically n > 50).\nIn Section 2, we set up the framework of change-point analysis, and in Section 3, we describe how\nto devise a regularized kernel-based approach to the change-point problem. Then, in Section 4\nand in Section 5, we respectively derive the limiting distribution of our test statistic under the null\nhypothesis H0 : \u201cno change occurs\u201d, and establish the consistency in power under the alternative\nHA : \u201ca change occurs\u201d. These theoretical results allow us to build a test statistic which provably has a\nfalse-alarm probability tending to a prescribed level \u03b1, and a detection probability tending to one, as\nthe number of observations tends to infinity. Finally, in Section 7, we display the performance of our\nalgorithm for, respectively, segmentation into mental tasks from BCI data and temporal segmentation\nof pop songs.\n\n2 Change-point analysis\n\nIn this section, we outline the change-point problem, and describe formally a strategy for building\nchange-point analysis test statistics.\n\nChange-point problem\nLet X1, . . . , Xn be a time series of independent random variables. The\nchange-point analysis of the sample {X1, . . . , Xn} consists in the following two steps.\n\n1) Decide between\n\n$$\\begin{aligned} H_0 &: \\; P_{X_1} = \\cdots = P_{X_k} = \\cdots = P_{X_n} \\\\ H_A &: \\; \\text{there exists } 1 < k^\\star < n \\text{ such that } P_{X_1} = \\cdots = P_{X_{k^\\star}} \\neq P_{X_{k^\\star+1}} = \\cdots = P_{X_n} . \\end{aligned} \\qquad (1)$$\n\n2) Estimate k\u22c6 from the sample {X1, . . . , Xn} if HA is true.\n\nWhile sharing many similarities with usual machine learning problems, the change-point problem is\ndifferent.\n\nStatistical hypothesis testing An important aspect of the above formulation of the change-point\nproblem is its natural embedding in a statistical hypothesis testing framework. Let us recall\nbriefly the main concepts in statistical hypothesis testing, in order to rephrase them within\nthe change-point problem framework (see, e.g., [8]). The goal is to build a decision rule to\nanswer question 1) in the change-point problem stated above. Set a false-alarm probability \u03b1\nwith 0 < \u03b1 < 1 (also called level or Type I error), whose purpose is to theoretically guarantee\nthat P(decide HA, when H0 is true) is close to \u03b1. Now, if there actually is a change-point\nwithin the sample, one would like not to miss it, that is, the detection probability\n\u03c0 = P(decide HA, when HA is true), also called power and equal to one minus the Type II\nerror, should be close to one. The purpose of Sections 4-5 is to give theoretical guarantees to those\npractical requirements in the large-sample setting, that is when the number of observations n tends\nto infinity.\n\nRunning maximum partition strategy An efficient strategy for building change-point analysis\nprocedures is to select the partition of the sample which yields a maximum heterogeneity between\nthe two segments: given a sample {X1, . . . , Xn} and a candidate change point k with 1 < k < n,\nassume we may compute a measure of heterogeneity \u2206n,k between the segments {X1, . . . , Xk} on\nthe one hand, and {Xk+1, . . . , Xn} on the other hand. Then, the \u201crunning maximum partition strategy\u201d\nconsists in using max1<k<n \u2206n,k as a building block for change-point analysis (cf. 
Figure 1).\nNot only max1<k<n \u2206n,k may be used to test for the presence of a change-point and assess/discard\n\n[Figure 1 diagram; labels: P(\u2113), P(r); axis marks: 1, k, k\u22c6, n]\n\nFigure 1: The running maximum strategy for change-point analysis. 
The test statistic for change-point\nanalysis runs a candidate change-point k with 1 < k < n along the sequence of observations,\nhoping to catch the true change-point k\u22c6.\n\nthe overall homogeneity of the sample; besides, k\u0302 = argmax1<k<n \u2206n,k provides a sensible estimator\nof the true change-point instant k\u22c6 [5].\n\n3 Kernel Change-point Analysis\n\nIn this section, we describe how the kernel Fisher discriminant ratio, which has proven relevant for\nmeasuring the homogeneity of two samples in [9], may be embedded into the running maximum partition\nstrategy to provide a powerful test statistic, coined KCpA for Kernel Change-point Analysis,\nfor addressing the change-point problem. This is described in the operator-theoretic framework,\ndeveloped for the statistical analysis of kernel-based learning and testing algorithms in [10, 11].\n\nReproducing kernel Hilbert space\nLet (X , d) be a separable measurable metric space. Let\nX be an X -valued random variable, with probability measure P; the expectation with respect to\nP is denoted by E[\u00b7] and the covariance by Cov(\u00b7,\u00b7). Consider a reproducing kernel Hilbert space\n(RKHS) (H, \u27e8\u00b7,\u00b7\u27e9H) of functions from X to R. To each point x \u2208 X , there corresponds an element\n\u03a6(x) \u2208 H such that \u27e8\u03a6(x), f\u27e9H = f(x) for all f \u2208 H, and \u27e8\u03a6(x), \u03a6(y)\u27e9H = k(x, y), where\nk : X \u00d7 X \u2192 R is a positive definite kernel [12]. In the following, we exclusively work with the\nAronszajn-map, that is, we take \u03a6(x) = k(x,\u00b7) for all x \u2208 X . It is assumed from now on that\nH is a separable Hilbert space. Note that this is always the case if X is a separable metric space\nand if the kernel is continuous [13]. 
We make the following two assumptions on the kernel (which\nare satisfied in particular for the Gaussian kernel; see [14]): (A1) the kernel k is bounded, that is\nsup(x,y)\u2208X \u00d7X k(x, y) < \u221e, (A2) for all probability distributions P on X , the RKHS associated\nwith k(\u00b7,\u00b7) is dense in L2(P).\n\nKernel Fisher Discriminant Ratio\nConsider a sequence of independent observations\nX1, . . . , Xn \u2208 X . For any [i, j] \u2282 {2, . . . , n \u2212 1}, define the corresponding empirical mean elements\nand covariance operators as follows\n\n$$\\hat{\\mu}_{i:j} := \\frac{1}{j - i + 1} \\sum_{\\ell=i}^{j} k(X_\\ell, \\cdot) , \\qquad \\hat{\\Sigma}_{i:j} := \\frac{1}{j - i + 1} \\sum_{\\ell=i}^{j} \\{k(X_\\ell, \\cdot) - \\hat{\\mu}_{i:j}\\} \\otimes \\{k(X_\\ell, \\cdot) - \\hat{\\mu}_{i:j}\\} .$$\n\nThese quantities have obvious population counterparts, the population mean element and the population\ncovariance operator, defined for any probability measure P as \u27e8\u00b5P, f\u27e9H := E[f(X)] for\nall f \u2208 H, and \u27e8f, \u03a3P g\u27e9H := CovP[f(X), g(X)] for f, g \u2208 H. For all k \u2208 {2, . . . , n \u2212 1} the\n(maximum) kernel Fisher discriminant ratio, which we abbreviate as KFDR, is defined as\n\n$$\\mathrm{KFDR}_{n,k;\\gamma}(X_1, \\dots, X_n) := \\frac{k(n-k)}{n} \\left\\| \\left( \\frac{k}{n} \\hat{\\Sigma}_{1:k} + \\frac{n-k}{n} \\hat{\\Sigma}_{k+1:n} + \\gamma I \\right)^{-1/2} (\\hat{\\mu}_{k+1:n} - \\hat{\\mu}_{1:k}) \\right\\|_{\\mathcal{H}}^{2} .$$\n\nNote that, if we merge two labelled samples {X1, . . . , Xn1} and {X\u20321, . . . , X\u2032n2} into a single sample\nas {X1, . . . , Xn1, X\u20321, . . . , X\u2032n2}, then with KFDRn1+n2,n1+1;\u03b3(X1, . . . , Xn1, X\u20321, . . . , X\u2032n2) we recover\nthe test statistic considered in [9] for testing the homogeneity of two samples {X1, . . . , Xn1}\nand {X\u20321, . . . , X\u2032n2}.\n\nFollowing [9], we make the following assumptions on all the covariance operators \u03a3 considered in\nthis paper: (B1) the eigenvalues {\u03bbp(\u03a3)}p\u22651 satisfy $\\sum_{p=1}^{\\infty} \\lambda_p^{1/2}(\\Sigma) < \\infty$, (B2) there are infinitely\nmany strictly positive eigenvalues {\u03bbp(\u03a3)}p\u22651 of \u03a3.\n\nKernel change-point analysis\nNow, we may apply the strategy described before (cf. Figure 1)\nto obtain the main building block of our test statistic for change-point analysis. Indeed, we define\nour test statistic Tn;\u03b3(k) as\n\n$$T_{n;\\gamma}(k) := \\max_{a_n < k < b_n} \\frac{\\mathrm{KFDR}_{n,k;\\gamma} - d_{1,n,k;\\gamma}(\\hat{\\Sigma}^W_{n,k})}{\\sqrt{2} \\, d_{2,n,k;\\gamma}(\\hat{\\Sigma}^W_{n,k})} ,$$\n\nwhere $n \\hat{\\Sigma}^W_{n,k} := k \\hat{\\Sigma}_{1:k} + (n - k) \\hat{\\Sigma}_{k+1:n}$. The quantities $d_{1,n,k;\\gamma}(\\hat{\\Sigma}^W_{n,k})$ and $d_{2,n,k;\\gamma}(\\hat{\\Sigma}^W_{n,k})$, defined\nrespectively as\n\n$$d_{1,n,k;\\gamma}(\\hat{\\Sigma}^W_{n,k}) := \\mathrm{Tr}\\{ (\\hat{\\Sigma}^W_{n,k} + \\gamma I)^{-1} \\hat{\\Sigma}^W_{n,k} \\} , \\qquad d_{2,n,k;\\gamma}(\\hat{\\Sigma}^W_{n,k}) := \\mathrm{Tr}\\{ (\\hat{\\Sigma}^W_{n,k} + \\gamma I)^{-2} (\\hat{\\Sigma}^W_{n,k})^{2} \\} ,$$\n\nact as normalizing constants for Tn;\u03b3(k) to have zero-mean and unit-variance as n tends to infinity,\na standard statistical transformation known as studentization. The maximum is searched within the\ninterval [an, bn] with an > 1 and bn < n, which is a restriction of ]1, n[, in order to prevent the\ntest statistic from uncontrolled behaviour in the neighborhood of the interval boundaries, which is\nstandard practice in this setting [15].\n\nRemark\nNote that, if the input space is Euclidean, for instance X = R^d, and if the kernel is linear\nk(x, y) = x^T y, then Tn;\u03b3(k) may be interpreted as a regularized version of the classical maximum-likelihood\nmultivariate test statistic used to test change in mean with unequal covariances, under the\nassumption of normal observations, described in [4, Chap. 3]. 
Yet, as the next section shall show,\nour test statistic is truly nonparametric, and its large-sample properties do not require any \u201cGaussian\nin the feature space\u201d-type of assumption. Moreover, in practice it may be computed thanks to the\nkernel trick, adapted to the kernel Fisher discriminant analysis and outlined in [16, Chapter 6].\n\nFalse-alarm and detection probability\nIn order to build a principled testing procedure, a proper\ntheoretical analysis from a statistical point of view is necessary. First, as the next section shows, for a\nprescribed \u03b1, we may build a procedure which has, as n tends to infinity, the false-alarm probability\n\u03b1 under the null hypothesis H0, that is when the sample is completely homogeneous and contains\nno change-point. Besides, when the sample actually contains at most one change-point, we prove\nthat our test statistic is able to catch it with probability tending to one as n tends to infinity.\n\nLarge-sample setting\nFor the sake of generality, we describe here the large-sample setting for\nthe change-point problem under the alternative hypothesis HA. Essentially, it corresponds to normalizing\nthe signal sampling interval to [0, 1] and letting the resolution increase as we observe more\ndata points [4].\nAssume there is 0 < k\u22c6 < n such that PX1 = \u00b7\u00b7\u00b7 = PXk\u22c6 \u2260 PXk\u22c6+1 = \u00b7\u00b7\u00b7 = PXn. Define\n\u03c4\u22c6 := k\u22c6/n such that \u03c4\u22c6 \u2208 ]0, 1[, and define P(\u2113) the probability distribution prevailing within the\nleft segment of length \u03c4\u22c6, and P(r) the probability distribution prevailing within the right segment\nof length 1 \u2212 \u03c4\u22c6. 
Then, we want to study what happens if we have \u230an\u03c4\u22c6\u230b observations from P(\u2113)\n(before change) and \u230an(1 \u2212 \u03c4\u22c6)\u230b observations from P(r) (after change) where n is large and \u03c4\u22c6 is\nkept fixed.\n\n4 Limiting distribution under the null hypothesis\n\nThroughout this section, we work under the null hypothesis H0 that is PX1 = \u00b7\u00b7\u00b7 = PXk = \u00b7\u00b7\u00b7 =\nPXn for all 2 \u2264 k \u2264 n. The first result gives the limiting distribution of Tn;\u03b3(k) as the number of\nobservations n tends to infinity.\nBefore stating the theoretical results, let us describe informally the crux of our approach. We may\nprove, under H0, using operator-theoretic perturbation results similar to [9], that it is sufficient to\nstudy the large-sample behaviour of $\\tilde{T}_{n;\\gamma}(k) := \\max_{a_n < k < b_n} (\\sqrt{2} \\, d_{2;\\gamma}(\\Sigma))^{-1} Q_{n,\\infty;\\gamma}(k)$ where\n\n$$Q_{n,\\infty;\\gamma}(k) := \\frac{k(n-k)}{n} \\left\\| (\\Sigma + \\gamma I)^{-1/2} (\\hat{\\mu}_{k+1:n} - \\hat{\\mu}_{1:k}) \\right\\|_{\\mathcal{H}}^{2} - d_{1;\\gamma}(\\Sigma) , \\qquad 1 < k < n , \\qquad (2)$$\n\nand d1;\u03b3(\u03a3) and d2;\u03b3(\u03a3) are respectively the population recentering and rescaling quantities with\n$\\Sigma = \\Sigma_{1:n} = \\Sigma^W_{1:n}$ the within-class covariance operator. Note that the only remaining stochastic\nterm in (2) is $\\hat{\\mu}_{k+1:n} - \\hat{\\mu}_{1:k}$. 
Let us expand (2) onto the eigenbasis {\u03bbp, ep}p\u22651 of the covariance\noperator \u03a3, as follows:\n\n$$Q_{n,\\infty;\\gamma}(k) = \\sum_{p=1}^{\\infty} (\\lambda_p + \\gamma)^{-1} \\left\\{ \\frac{k(n-k)}{n} \\langle \\hat{\\mu}_{k+1:n} - \\hat{\\mu}_{1:k}, e_p \\rangle^{2} - \\lambda_p \\right\\} , \\qquad 1 < k < n . \\qquad (3)$$\n\nThen, defining $S_{1:k,p} := n^{-1/2} \\lambda_p^{-1/2} \\sum_{i=1}^{k} (e_p(X_i) - E[e_p(X_1)])$, we may rewrite Qn,\u221e;\u03b3(k) as\nan infinite-dimensional quadratic form in the tied-down partial sums $S_{1:k,p} - \\frac{k}{n} S_{1:n,p}$, which yields\n\n$$Q_{n,\\infty;\\gamma}(k) = \\sum_{p=1}^{\\infty} (\\lambda_p + \\gamma)^{-1} \\lambda_p \\left\\{ \\frac{n^{2}}{k(n-k)} \\left( S_{1:k,p} - \\frac{k}{n} S_{1:n,p} \\right)^{2} - 1 \\right\\} , \\qquad 1 < k < n . \\qquad (4)$$\n\nThe idea is to view {Qn,\u221e;\u03b3(k)}1<k<n as a stochastic process, that is a random function [k \u21a6\nQn,\u221e;\u03b3(k, \u03c9)] for any \u03c9 \u2208 \u2126, where (\u2126, F, P) is a probability space. Then, invoking the so-called\ninvariance principle in distribution [17], we realize that the random sum S1:\u230ant\u230b,p(\u03c9), which\nfor all \u03c9 linearly interpolates between the values S1:i/n,p(\u03c9) at points i/n for i = 1, . . . , n, behaves,\nasymptotically as n tends to infinity, like a Brownian motion (also called Wiener process)\n{Wp(t)}0<t<1. Hence, along each component ep, we may define a Brownian bridge {Bp(t)}0<t<1,\nthat is a tied-down Brownian motion Bp(t) := Wp(t) \u2212 tWp(1) which yields a continuous approximation\nin distribution of the corresponding $\\{S_{1:k,p} - \\frac{k}{n} S_{1:n,p}\\}_{1<k<n}$. 
The proof (omitted due to\nspace limitations) consists in deriving a functional (noncentral) limit theorem for KFDRn,k;\u03b3, and\nthen applying a continuous mapping argument.\n\nProposition 1 Assume (A1) and (B1), and that H0 holds, that is PXi = P for all 1 \u2264 i \u2264 n.\nAssume in addition that the regularization parameter \u03b3 is held fixed as n tends to infinity, and that\nan/n \u2192 u > 0 and bn/n \u2192 v < 1 as n tends to infinity. Then,\n\n$$T_{n;\\gamma}(k) \\xrightarrow{\\;D\\;} \\sup_{u < t < v} Q_{\\infty;\\gamma}(t) , \\qquad Q_{\\infty;\\gamma}(t) := \\frac{1}{\\sqrt{2} \\, d_{2;\\gamma}(\\Sigma)} \\sum_{p=1}^{\\infty} \\frac{\\lambda_p(\\Sigma)}{\\lambda_p(\\Sigma) + \\gamma} \\left( \\frac{B_p^{2}(t)}{t(1-t)} - 1 \\right) ,$$\n\nwhere {\u03bbp(\u03a3)}p\u22651 is the sequence of eigenvalues of the overall covariance operator \u03a3, while\n{Bp(t)}p\u22651 is a sequence of independent Brownian bridges.\nDefine t1\u2212\u03b1 as the (1 \u2212 \u03b1)-quantile of supu<t<v Q\u221e;\u03b3(t). We may compute t1\u2212\u03b1 either by Monte-Carlo\nsimulations, as described in [18], or by bootstrap resampling under the null hypothesis (see).\nThe next result proves that, using the limiting distribution under the null stated above, we may build\na test statistic with prescribed false-alarm probability \u03b1 for large n.\n\nCorollary 2 The test maxan<k<bn Tn,\u03b3(k) \u2265 t1\u2212\u03b1(\u03a3, \u03b3) has false-alarm probability \u03b1, as n tends\nto infinity.\n\nBesides, when the sequence of regularization parameters {\u03b3n}n\u22651 decreases to zero slowly enough\n(in particular slower than n^{\u22121/2}), the test statistic maxan<k<bn Tn,\u03b3n(k) turns out to be asymptotically\nkernel-independent as n tends to infinity. 
While the proof hinges upon martingale functional\nlimit theorems [17], still, we may point out that if we replace \u03b3 by \u03b3n in the limiting null distribution,\nthen Q\u221e;\u03b3(\u00b7) is correctly normalized for all n \u2265 1 to have zero-mean and variance one.\n\nProposition 3 Assume (A1) and (B1-B2) and that H0 holds, that is PXi = P for all 1 \u2264 i \u2264 n.\nAssume in addition that the sequence of regularization parameters {\u03b3n}n\u22651 is such that\n\n$$\\gamma_n + \\frac{d_{1,n;\\gamma_n}(\\Sigma)}{d_{2,n;\\gamma_n}(\\Sigma)} \\, \\gamma_n^{-1} n^{-1/2} \\to 0 ,$$\n\nand that an/n \u2192 u > 0 and bn/n \u2192 v < 1 as n tends to infinity. Then,\n\n$$\\max_{a_n < k < b_n} T_{n;\\gamma_n}(k) \\xrightarrow{\\;D\\;} \\sup_{u < t < v} \\frac{B(t)}{\\sqrt{t(1-t)}} .$$\n\nRemark\nA closer look at Proposition 1 brings to light that the reweighting by t(1 \u2212 t) of the\nsquared Brownian bridges on each component is crucial for our test statistic to be immune against\nimbalance between segment lengths under the alternative HA, that is when \u03c4\u22c6 is far from 1/2.\nIndeed, swapping out the reweighting by t(1 \u2212 t), to simply consider the corresponding unweighted\ntest statistic is hazardous, and yields a loss of power for alternatives when \u03c4\u22c6 is far from 1/2.\nThis section allowed us to get an \u03b1-level test statistic for the change-point problem, by looking at the\nlarge-sample behaviour of the test statistic under the null hypothesis H0. The next step is to prove\nthat the test statistic is consistent in power, that is the detection probability tends to one as n tends\nto infinity under the alternative hypothesis HA.\n\n5 Consistency in power\n\nThis section shows that, when the alternative hypothesis HA holds, our test statistic is able to detect\nthe presence of a change with probability tending to one in the large-sample setting. 
The next proposition is proved\nwithin the same framework as the one considered in the previous section, except that now, along each\ncomponent ep, one has to split the random sum into three parts [1, k], [k + 1, k\u22c6], [k\u22c6 + 1, n], and\nthen the large-sample behaviour of each projected random sum is the one of a two-sided Brownian\nmotion with drifts.\n\nProposition 4 Assume (A1-A2) and (B1-B2), and that HA holds, that is there exists u < \u03c4\u22c6 < v\nwith u > 0 and v < 1 such that PX\u230an\u03c4\u22c6\u230b \u2260 PX\u230an\u03c4\u22c6\u230b+1 for all 1 \u2264 i \u2264 n. Assume in addition that\nthe regularization parameter \u03b3 is held fixed as n tends to infinity, and that limn\u2192\u221e an/n > u and\nlimn\u2192\u221e bn/n < v. Then, for any 0 < \u03b1 < 1, we have\n\n$$\\mathbb{P}_{H_A}\\left( \\max_{a_n < k < b_n} T_{n;\\gamma}(k) > t_{1-\\alpha} \\right) \\to 1 , \\qquad \\text{as } n \\to \\infty . \\qquad (5)$$\n\n6 Extensions and related works\n\nExtensions\nIt is worthwhile to note that we could also have built similar procedures from the\nmaximum mean discrepancy (MMD) test statistic proposed by [19]. Note also that, instead of the\nTikhonov-type regularization of the covariance operator, other regularization schemes may also be\napplied, such as the spectral truncation regularization of the covariance operator, equivalent to preprocessing\nby a centered kernel principal component analysis [20, 21], as used in [22] for instance.\n\nRelated works\nA related problem is the abrupt change detection problem, explored in [23],\nwhich is naturally also encompassed by our framework. Here, one is interested in the early detection\nof a change from a nominal distribution to an erratic distribution. 
The procedure KCD of\n[23] consists in running a window-limited detection algorithm, using two one-class support vector\nmachines trained respectively on the left and the right part of the window, and comparing the sets\nof obtained weights. Their approach differs from ours in two respects. First, we have the limiting\nnull distribution of KCpA, which allows us to compute decision thresholds in a principled way. Second,\nour test statistic incorporates a reweighting to keep power against alternatives with unbalanced\nsegments.\n\n7 Experiments\n\nComputational considerations\nIn all experiments, we set \u03b3 = 10^{-5} and took the Gaussian kernel\nwith isotropic bandwidth set by the plug-in rule used in density estimation. Moreover, since from k\nto k + 1, the test statistic changes from KFDRn,k;\u03b3 to KFDRn,k+1;\u03b3, this corresponds to taking into account\nthe change from {(X1, Y1 = \u22121), . . . , (Xk, Yk = \u22121), (Xk+1, Yk+1 = +1), . . . , (Xn, Yn =\n+1)} to {(X1, Y1 = \u22121), . . . , (Xk, Yk = \u22121), (Xk+1, Yk+1 = \u22121), (Xk+2, Yk+2 =\n+1), . . . , (Xn, Yn = +1)} in the labelling in KFDR [9, 16]. This motivates an efficient strategy\nfor the computation of the test statistic. We invert the regularized kernel\nGram matrix once and for all, at the cost of O(n^3), and then compute all values of the test statistic for all\npartitions in one matrix multiplication, in O(n^2). As for computing the decision threshold t1\u2212\u03b1,\nwe used bootstrap resampling calibration with 10,000 runs. Other Monte-Carlo based calibration\nprocedures are possible, but are left for future research.\n\n        Subject 1   Subject 2   Subject 3\nKCpA    79%         74%         61%\nSVM     76%         69%         60%\n\nTable 1: Average classification accuracy for each subject\n\nBrain-computer interface data\nSignals acquired during Brain-Computer Interface (BCI) trial\nexperiments naturally exhibit temporal structure. 
We considered a dataset proposed in BCI competition\nIII (see footnote 1) acquired during 4 non-feedback sessions on 3 normal subjects, where each subject was\nasked to perform different tasks, the times at which the subject switches from one task to another being\nrandom (see also [24]). Mental tasks segmentation is usually tackled with supervised classification\nalgorithms, which require labelled data to be acquired beforehand. Besides, standard supervised\nclassification algorithms are context-sensitive, and sometimes yield poor performance on BCI data.\nWe performed a sequence of change-point analyses on sliding windows overlapping by 20% along\nthe signals. We provide here two ways of measuring the performance of our method. First, in Figure\n2 (left), we give the empirical ROC curve of our test statistic, averaged over all the signals at\nhand. This shows that our test statistic yields competitive performance for testing the presence of a\nchange-point, when compared with a standard parametric multivariate procedure (param) [4]. 
Second,\nin Table 1, we give experimental results in terms of classification accuracy, which shows that\nour completely unsupervised kernel change-point analysis algorithm can reach performance comparable to,\nor better than, that of supervised multi-class (one-versus-one) classification algorithms (SVM).\nIf each segment is considered as a sample of a given class, then the classification accuracy corresponds\nhere to the proportion of correctly assigned points at the end of the segmentation process.\nThis also clearly shows that the KCpA algorithm gives accurate estimates of the change-points, since the\nchange-point estimation error is directly measured by the classification accuracy.\n\n[Figure 2 plots: two panels titled \u201cROC Curve\u201d, Power (0 to 1) versus Level (0 to 0.5); left panel legend: KCpA, param; right panel legend: KCpA, KCD]\n\nFigure 2: Comparison of ROC curves for task segmentation from BCI data (left), and pop songs\nsegmentation (right).\n\nPop song segmentation\nIndexation of music signals aims to provide a temporal segmentation\ninto several sections with different dynamic or tonal or timbral characteristics. We investigated\nthe performance of KCpA on a database of 100 full-length \u201cpop music\u201d signals, whose manual\nsegmentation is available. In Figure 2 (right), we provide the respective ROC curves of KCD of [23]\nand KCpA. Our approach is indeed competitive in this context.\n\n8 Conclusion\n\nWe proposed a principled approach for the change-point analysis of a time series of independent\nobservations. It provides a powerful testing procedure for testing the presence of a change in distribution\nin a sample. Moreover, we saw in experiments that it also allows one to accurately estimate the\nchange-point when a change occurs. We are currently exploring several extensions of KCpA. 
Since\nexperimental results are promising on real data, in which the assumption of independence is rather\nunrealistic, it is worthwhile to analyze the effect of dependence on the large-sample behaviour of our\ntest statistic, and explain why the test statistic remains powerful even for (weakly) dependent data.\nWe are also investigating adaptive versions of the change-point analysis, in which the regularization\nparameter \u03b3 and the reproducing kernel k are learned from the data.\n\n1 See http://ida.first.fraunhofer.de/projects/bci/competition_iii/\n\nAcknowledgments\n\nThis work has been supported by Agence Nationale de la Recherche under contract ANR-06-BLAN-0078\nKERNSIG.\n\nReferences\n[1] F. De la Torre Frade, J. Campoy, and J. F. Cohn. Temporal segmentation of facial behavior. In ICCV, 2007.\n[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.\n[3] O. Capp\u00e9, E. Moulines, and T. Ryden. Inference in Hidden Markov Models. Springer, 2005.\n[4] J. Chen and A.K. Gupta. Parametric Statistical Change-point Analysis. Birkh\u00e4user, 2000.\n[5] M. Cs\u00f6rg\u00f6 and L. Horv\u00e1th. Limit Theorems in Change-Point Analysis. Wiley and sons, 1998.\n[6] M. Basseville and N. Nikiforov. Detection of abrupt changes. Prentice-Hall, 1993.\n[7] T. L. Lai. Sequential analysis: some classical problems and new challenges. Statistica Sinica, 11, 2001.\n[8] E. Lehmann and J. Romano. Testing Statistical Hypotheses (3rd ed.). Springer, 2005.\n[9] Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Adv. NIPS, 2007.\n[10] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66, 2007.\n[11] K. Fukumizu, F. Bach, and A. Gretton. Statistical convergence of kernel canonical correlation analysis. 
JMLR, 8, 2007.\n[12] C. Gu. Smoothing Spline ANOVA Models. Springer, 2002.\n[13] I. Steinwart, D. Hush, and C. Scovel. An explicit description of the RKHS of Gaussian RBF kernels. IEEE Trans. on Inform. Th., 2006.\n[14] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Sch\u00f6lkopf. Injective Hilbert space embeddings of probability measures. In COLT, 2008.\n[15] B. James, K. L. James, and D. Siegmund. Tests for a change-point. Biometrika, 74, 1987.\n[16] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Camb. UP, 2004.\n[17] P. Billingsley. Convergence of Probability Measures (2nd ed.). Wiley Interscience, 1999.\n[18] P. Glasserman. Monte Carlo Methods in Financial Engineering (1st ed.). Springer, 2003.\n[19] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A.J. Smola. A kernel method for the two-sample problem. In Adv. NIPS, 2006.\n[20] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.\n[21] G. Blanchard and L. Zwald. Finite-dimensional projection for classification and statistical learning. IEEE Transactions on Information Theory, 54(9):4169\u20134182, 2008.\n[22] Z. Harchaoui, F. Vallet, A. Lung-Yut-Fong, and O. Capp\u00e9. A regularized kernel-based approach to unsupervised audio segmentation. In ICASSP, 2009.\n[23] F. D\u00e9sobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Trans. on Signal Processing, 53(8):2961\u20132974, August 2005.\n[24] Z. Harchaoui and O. Capp\u00e9. Retrospective multiple change-point estimation with kernels. In IEEE Workshop on Statistical Signal Processing (SSP), 2007.", "award": [], "sourceid": 590, "authors": [{"given_name": "Za\u00efd", "family_name": "Harchaoui", "institution": null}, {"given_name": "Eric", "family_name": "Moulines", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}