{"title": "White Functionals for Anomaly Detection in Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 432, "page_last": 440, "abstract": "We propose new methodologies to detect anomalies in discrete-time processes taking values in a set. The method is based on the inference of functionals whose evaluations on successive states visited by the process have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subset of a reproducing kernel Hilbert space associated with the set where the process takes values. We provide experimental results which show that these techniques compare favorably with other algorithms.", "full_text": "White Functionals for Anomaly Detection in\n\nDynamical Systems\n\nMarco Cuturi\n\nORFE - Princeton University\nmcuturi@princeton.edu\n\nJean-Philippe Vert\n\nMines ParisTech, Institut Curie, INSERM U900\n\nJean-Philippe.Vert@mines.org\n\nAlexandre d\u2019Aspremont\n\nORFE - Princeton University\n\naspremon@princeton.edu\n\nAbstract\n\nWe propose new methodologies to detect anomalies in discrete-time processes\ntaking values in a probability space. These methods are based on the inference\nof functionals whose evaluations on successive states visited by the process are\nstationary and have low autocorrelations. Deviations from this behavior are used\nto \ufb02ag anomalies. The candidate functionals are estimated in a subspace of a\nreproducing kernel Hilbert space associated with the original probability space\nconsidered. We provide experimental results on simulated datasets which show\nthat these techniques compare favorably with other algorithms.\n\n1 Introduction\n\nDetecting abnormal points in small and simple datasets can often be performed by visual inspec-\ntion, using notably dimensionality reduction techniques. 
However, non-parametric techniques are\noften the only credible alternative to address these problems on the many high-dimensional, richly\nstructured data sets available today.\n\nWhen carried out on independent and identically distributed (i.i.d.) observations, anomaly detection\nis usually referred to as outlier detection and is in many ways equivalent to density estimation.\nSeveral density estimators have been used in this context and we refer the reader to the exhaustive\nreview in [1]. Among such techniques, methods which estimate non-parametric alarm functions in\nreproducing kernel Hilbert spaces (rkHs) are particularly relevant to our work. They form alarm\nfunctions of the type f(\u00b7) = \u2211_{i\u2208I} c_i k(x_i, \u00b7), where k is a positive definite kernel and (c_i)_{i\u2208I}\nis a family of coefficients paired with a family (x_i)_{i\u2208I} of previously observed data points. A new\nobservation x is flagged as anomalous whenever f(x) goes outside predetermined boundaries which\nare also provided by the algorithm. Two well-known kernel methods have been used so far for\nthis purpose, namely kernel principal component analysis (kPCA) [2] and one-class support vector\nmachines (ocSVM) [3]. The ocSVM is a popular density estimation tool and it is thus not surprising\nthat it has already found successful applications to detect anomalies in i.i.d. data [4]. kPCA can also\nbe used to detect outliers as described in [5], where an outlier is defined as any point far enough\nfrom the boundaries of an ellipsoid in the rkHs containing most of the observed points.\n\nThese outlier detection methods can also be applied to dynamical systems. We now monitor discrete-time\nstochastic processes Z = (Z_t)_{t\u2208\u2115} taking values in a space Z and, based on previous observations\nz_{t\u22121}, \u00b7\u00b7\u00b7, z_0, we seek to detect whether a new observation z_t abnormally deviates from the\nusual dynamics of the system. 
As explained in [1], this problem can be reduced to density estimation\nwhen either Zt or a suitable representation of Zt that includes a \ufb01nite number of lags is Markovian,\ni.e. when the conditional probability of Zt given its past depends only on the values taken by Zt\u22121.\n\n1\n\n\fIn practice, anomaly detection then involves a two step procedure. It \ufb01rst produces an estimator\n\u02c6Zt of the conditional expectation of Zt given Zt\u22121 to extract an empirical estimator for the residues\n\u02c6\u03b5t = Zt \u2212 \u02c6Zt. Under an i.i.d assumption, abnormal residues can then be used to \ufb02ag anomalies. This\napproach and advanced extensions can be used both for multivariate data [6, 7] and linear processes\nin functional spaces [8] using spaces of H\u00a8olderian functions.\n\nThe main contribution of our paper is to propose an estimation approach of alarm functionals that\ncan be used on arbitrary Hilbert spaces and which bypasses the estimation of residues \u02c6\u03b5t \u2208 Z by fo-\ncusing directly on suitable properties for alarm functionals. Our approach is based on the following\nintuition. Detecting anomalies in a sequence generated by white noise is a task which is arguably\neasier than detecting anomalies in arbitrary time-series. In this sense, we look for functionals \u03b1 such\nthat \u03b1(Zt) exhibits a stationary behavior with low autocorrelations, ideally white noise, which can\nbe used in turn to \ufb02ag an anomaly whenever \u03b1(Zt) departs from normality. We call functionals \u03b1\nthat strike a good balance between exhibiting a low autocovariance of order 1 and a high variance on\nsuccessive values Zt a white functional of the process Z. 
Our de\ufb01nition can be naturally generalized\nto higher autocovariance orders as the reader will naturally see in the remaining of the paper.\n\nOur perspective is directly related to the concept of cointegration (see [9] for a comprehensive re-\nview) for multivariate time series, extensively used by econometricians to study equilibria between\nvarious economic and \ufb01nancial indicators. For a multivariate stochastic process X = (Xt)t\u2208Z tak-\ning values in Rd, X is said to be cointegrated if there exists a vector a of Rd such that (aT Xt)t\u2208Z is\nstationary. Economists typically interpret the weights of a as describing a stable linear relationship\nbetween various (non-stationary) macroeconomic or \ufb01nancial indicators. In this work we discard the\nimmediate interpretability of the weights associated with linear functionals aT Xt to focus instead\non functionals \u03b1 in a rkHs H such that \u03b1(Zt) is stationary, and use this property to detect anomalies.\nThe rest of this paper is organized as follows. In Section 2, we study different criterions to measure\nthe autocorrelation of a process, directly inspired by min/max autocorrelation factors [10] and the\nseminal work of Box-Tiao [11] on cointegration. We study the asymptotic properties of \ufb01nite sample\nestimators of these criterions in Section 3 and discuss the practical estimation of white functionals\nin Section 4. We discuss relationships with existing methods in Section 5 and provide experimental\nresults to illustrate the effectiveness of these approaches in Section 6.\n\n2 Criterions to de\ufb01ne white functionals\n\nConsider a process Z = (Zt)t\u2208 Z taking values in a probability space Z. Z will be mainly considered\nin this work under the light of its mapping onto a rkHs H associated with a bounded and continuous\nkernel k on Z \u00d7 Z. 
Z is assumed to be second-order stationary, that is the densities p(Z_t = z) and\njoint densities p(Z_t = z, Z_{t+k} = z\u2032) for k \u2208 \u2115 are independent of t. Following [12, 13] we write\n\n\u03c6_t = \u03d5(Z_t) \u2212 E_p[\u03d5(Z_t)],\n\nfor the centered projection of Z in H, where \u03d5 : z \u2208 Z \u2192 k(z, \u00b7) \u2208 H is the feature map associated\nwith k. For two elements \u03b1 and \u03b2 of H we write \u03b1 \u2297 \u03b2 for their tensor product, namely the linear\nmap of H into itself such that \u03b1 \u2297 \u03b2 : x \u2192 \u27e8\u03b1, x\u27e9_H \u03b2. Using the notations of [14] we write\n\nC = E_p[\u03c6_t \u2297 \u03c6_t],\n\nD = E_p[\u03c6_t \u2297 \u03c6_{t+1}],\n\nrespectively for the covariance and autocovariance of order 1 of \u03c6_t. Both C and D are linear\noperators of H by weak stationarity [14, Definition 2.4] of (\u03c6_t)_{t\u2208\u2124}, which can be deduced from the\nsecond-order stationarity of Z. The following definitions introduce two criteria which quantify\nhow related two successive evaluations of \u03b1(Z_t) are.\nDefinition 1 (Autocorrelation Factor [10]). Given an element \u03b1 of H such that \u27e8\u03b1, C\u03b1\u27e9_H > 0,\n\u03b3(\u03b1) is the absolute autocorrelation of \u03b1(Z) of order 1,\n\n\u03b3(\u03b1) = |corr(\u03b1(Z_t), \u03b1(Z_{t+1}))| = |\u27e8\u03b1, D\u03b1\u27e9_H| / \u27e8\u03b1, C\u03b1\u27e9_H.    (1)\n\nThe condition \u27e8\u03b1, C\u03b1\u27e9_H > 0 requires that var \u03b1(\u03c6_t) is not zero, which excludes constant or\nvanishing functions on the support of the density of \u03c6_t. 
Note also that defining \u03b3 requires no other\nassumption than second-order stationarity of Z.\n\nIf we assume further that \u03c6 is an autoregressive Hilbertian process of order 1 [14], ARH(1) for short,\nthere exists a compact operator \u03c1 : H \u2192 H and an H strong white noise\u00b9 (\u03b5_t)_{t\u2208\u2124} such that\n\n\u03c6_{t+1} = \u03c1 \u03c6_t + \u03b5_t.\n\nIn their seminal work, Box and Tiao [11] quantify the predictability of the linear functionals of\na vector autoregressive process in terms of variance ratios. The following definition is a direct\nadaptation of that principle to autoregressive processes in Hilbert spaces. From [14, Theorem 3.2]\nwe have that C = \u03c1 C \u03c1^\u2217 + C_\u03b5 where for any linear operator A of H, A^\u2217 is its adjoint.\nDefinition 2 (Predictability in the Box-Tiao sense [11]). Given an element \u03b1 of H such that\n\u27e8\u03b1, C\u03b1\u27e9_H > 0, the predictability \u03bb(\u03b1) is the quotient\n\n\u03bb(\u03b1) = var\u27e8\u03b1, \u03c1 \u03c6_t\u27e9_H / var\u27e8\u03b1, \u03c6_t\u27e9_H = \u27e8\u03b1, \u03c1 C \u03c1^\u2217 \u03b1\u27e9_H / \u27e8\u03b1, C\u03b1\u27e9_H = \u27e8\u03b1, D C^{\u22121} D^\u2217 \u03b1\u27e9_H / \u27e8\u03b1, C\u03b1\u27e9_H.    (2)\n\nThe right-hand side of Equation (2) follows from the fact that \u03c1 C = D and \u03c1^\u2217 = C^{\u22121} D^\u2217 [14],\nthe latter equality being always valid irrespective of the existence of C^{\u22121} on the whole of H as noted\nin [15]. Combining these two equalities gives \u03c1 C \u03c1^\u2217 = D C^{\u22121} D^\u2217.\nBoth \u03b3 and \u03bb are convenient ways to quantify, for a given function f of H, the independence of f(Z_t)\nfrom its immediate past. We provide in this paragraph a common representation for \u03bb and \u03b3. 
For any\nlinear operator A of H and any non-zero element x of H write R(A, x) for the Rayleigh quotient\n\nR(A, x) = \u27e8x, Ax\u27e9_H / \u27e8x, x\u27e9_H.\n\nWe use the notations in [12] and introduce the normalized cross-covariance (or rather\nautocovariance in the context of this paper) operator V = C^{\u22121/2} D C^{\u22121/2}. Note that for any\nskew-symmetric operator B, that is B = \u2212B^\u2217, we have that \u27e8x, Bx\u27e9_H = \u27e8B^\u2217x, x\u27e9_H = \u2212\u27e8Bx, x\u27e9_H = 0.\nSince A \u2212 (A + A^\u2217)/2 = (A \u2212 A^\u2217)/2 is skew-symmetric for any linear operator A, it follows that\nR(A, x) = R((A + A^\u2217)/2, x). Both \u03bb and \u03b3 applied on a function \u03b1 \u2208 H can thus be written as\n\n\u03b3(\u03b1) = |R((V + V^\u2217)/2, C^{1/2}\u03b1)|,    \u03bb(\u03b1) = R(V V^\u2217, C^{1/2}\u03b1).\n\nAs detailed in Section 4, our goal is to estimate functions in H from data such that they have either\nlow \u03b3 or \u03bb values. Minimizing \u03bb is equivalent to solving a generalized eigenvalue problem through\nthe Courant-Fischer-Weyl theorem. Minimizing \u03b3 is a more challenging problem since the operator\nV + V^\u2217 is not necessarily positive definite. The S-lemma from control theory [16, Appendix B.2]\ncan be used to cast the problem of estimating functions with low \u03b3 as a semi-definite program. 
In\npractice the eigen-decomposition of V + V \u2217 provides good approximate answers.\nThe formulation of \u03b3 and \u03bb as Rayleigh quotients is also useful to obtain the asymptotic convergence\nof their empirical counterparts (Section 3) and to draw comparisons with kernel-CCA (Section 5).\n\n3 Asymptotics and matrix expressions for empirical estimators of \u03b3 and \u03bb\n\n3.1 Asymptotic convergence of the normalized cross-covariance operator V\n\nThe covariance operator C and cross-covariance operator D can be estimated through a \ufb01nite sample\nof points z0, \u00b7 \u00b7 \u00b7 , zn translated into a sample of centered points \u03c61, \u00b7 \u00b7 \u00b7 , \u03c6n in H, where \u03c6i =\n\u03d5(zi) \u2212 1\n\nj=0 \u03d5(zj). We write\n\nn+1Pn\n\nCn =\n\n1\n\nn \u2212 1\n\n\u03c6i \u2297 \u03c6i, Dn =\n\n1\n\nn \u2212 1\n\n\u03c6i \u2297 \u03c6i+1,\n\nn\u22121\n\nXi=1\n\nn\n\nXi=1\n\nfor the estimates of C and D respectively which converge in Hilbert-Schmidt norm [14]. Estimators\nfor \u03b3 or \u03bb require approximating C\u2212 1\n2 , which is a typical challenge encountered when studying\n\n1namely a sequence (\u03b5t)t\u2208Z of H random variables such that (i) 0 < E k\u03b5tk2 = \u03c32, E \u03b5t = 0 and the\n\ncovariance C\u03b5t is constant, equal to C\u03b5; (ii) (\u03b5t) is a sequence of i.i.d H-random variables\n\n3\n\n\fARH(1) processes and more generally stationary linear processes in Hilbert spaces [14, Section\n8]. This issue is addressed in this section through a Tikhonov-regularization, that is considering a\nsequence of positive numbers \u01ebn we write\n\nVn = (Cn + \u01ebnI)\u2212 1\n\n2 Dn(Cn + \u01ebnI)\u2212 1\n2 ,\n\nfor the empirical estimate of V regularized by \u01ebn. We have already assumed that k is bounded and\ncontinuous. The convergence of Vn to V in norm is ensured under the additional conditions below\nTheorem 3. 
Assume that V is a compact operator,\nwriting k \u00b7 kS for the Hilbert-Schmidt operator norm,\n\n\u01ebn = 0 and lim\nn\u2192\u221e\nkVn \u2212 V kS = 0.\n\n= 0. Then\n\n(log n/n)\n\n\u01ebn\n\n1\n3\n\nlim\nn\u2192\u221e\nlim\nn\u2192\u221e\n\n2 D(C + \u01ebnI)\u2212 1\n\nProof. The structure of the proof is identical to that of of [12, Theorem 1] except that the i.i.d\nassumption does not hold here. In [12], the norm kVn \u2212 V kS is upper-bounded by the two terms\nkVn \u2212 (C + \u01ebnI)\u2212 1\n2 \u2212 V kS. The second term\nconverges under the assumption that \u01ebn \u2192 0 [12, Lemma 7] while the \ufb01rst term decreases at a rate\nthat is proportional to the rates of kCn \u2212 CkS and kDn \u2212 DkS. With the assumptions above [14,\nCorollary4.1,Theorem 4.8] gives us that kCn\u2212CkS = O(( log n\n2 ).\nn )\nWe use this result to substitute the latter rate to the faster rate obtained for i.i.d observations in [12,\nLemma 5] and conclude the proof.\n\n2 ) and kDn\u2212DkS = O(( log n\nn )\n\n2 kS + k(C + \u01ebnI)\u2212 1\n\n2 D(C + \u01ebnI)\u2212 1\n\n1\n\n1\n\n3.2 Empirical estimators and matrix expressions\n\nGiven \u03b1 \u2208 H, consider the following estimators of \u03b3(\u03b1) and \u03bb(\u03b1) de\ufb01ned in Equations (1) and (2),\n\nR(cid:18) Vn + V \u2217\n\n2\n\nn\n\n, (Cn + \u01ebnI)\n\n\u03bbn(\u03b1) = R(VnV \u2217\n\nn , (Cn + \u01ebnI)\n\n1\n\n2 \u03b1) =\n\n\u03b3n(\u03b1) =(cid:12)(cid:12)(cid:12)(cid:12)\n\n= (cid:12)(cid:12)h\u03b1, 1\n\n1\n\n2 \u03b1(cid:19)(cid:12)(cid:12)(cid:12)(cid:12)\n\nh\u03b1, (Cn + \u01ebnI)\u03b1iH\nn\u03b1iH\n\nh\u03b1, Dn(Cn + \u01ebnI)\u22121D\u2217\nh\u03b1, (Cn + \u01ebnI)\u03b1iH\n\n,\n\n2 (Dn + D\u2217\n\nn)\u03b1iH(cid:12)(cid:12)\n\n,\n\narbitrary decomposition \u03b1 = Pn\n\nwhich converge to the adequate values through the convergence of (Cn + \u01ebnI)\nn and\nn . The n observations \u03c61, . . . 
, \u03c6n which de\ufb01ne the empirical estimators above also span a\nVnV \u2217\nsubspace Hn in H which can be used to estimate white functionals. Given \u03b1 \u2208 Hn we use any\nai\u03c6i. We write K for the original n + 1 \u00d7 n + 1 Gram matrix\n[k(zi, zj)]i,j and \u00afK for its centered counterpart \u00afK = (In \u2212 1\n1n,n) = [h\u03c6i, \u03c6j iH]i,j.\nn\nBecause of the centering span{\u03c60, . . . , \u03c6n} is actually equal to span{\u03c61, . . . , \u03c6n} and we will only\nuse the n \u00d7 n matrix K obtained by removing the \ufb01rst row and column of \u00afK.\nFor a n \u00d7 n matrix M , we write M\u2212i for the n \u00d7 n \u2212 1 matrix obtained by removing the ith column\nof M . With these notations, \u03bbn and \u03b3n take the following form when evaluated on \u03b1 \u2208 Hn,\n\n1n,n)K(In \u2212 1\nn\n\n2 , Vn + V \u2217\n\ni=1\n\n1\n\n\u03b3n(\u03b1) = \u03b3n  n\nXi=1\n\u03bbn(\u03b1) = \u03bbn  n\nXi=1\n\nai\u03c6i! =\nai\u03c6i! =\n\n1\n2\n\n|aT (K\u22121KT\n\n\u2212n + K\u22121KT\naT (K2 + n\u01ebnK)a\n\n\u2212n)a|\n\n,\n\naT K\u22121KT\n\n\u2212n(K2 + n\u01ebnK)\u22121K\u2212nKT\n\u22121\naT (K2 + n\u01ebnK)a\n\na\n\n.\n\nIf \u01ebn follows the assumptions of Theorem 3, both \u03b3n and \u03bbn converge to \u03b3 and \u03bb pointwise in Hn.\n\n4 Selecting white functionals in practice\n\nBoth \u03b3(\u03b1) and \u03bb(\u03b1) are proxies to quantify the independence of successive observations \u03b1(Zt).\nNamely, functions with low \u03b3 and \u03bb are likely to have low autocorrelations and be stationary when\nevaluated on the process Z, and the same can be said of functions with low \u03b3n and \u03bbn asymptotically.\nHowever, when H is of high or in\ufb01nite dimension, the direct minimization of \u03b3n and \u03bbn is likely\nto result in degenerate functions2 which may have extremely low autocovariance on Z but very low\nvariance as well. 
We select white functionals having this trade off in mind, such that both h\u03b1, C, \u03b1iH\nis not negligible and \u03b3 or \u03bb are low at the same time.\n\n2Since the rank of operator Vn is actually n \u2212 1, we are even guaranteed to \ufb01nd in Hn a minimizer for \u03b3n\n\nand another for \u03bbn with respectively zero predictability and zero absolute autocorrelation.\n\n4\n\n\f4.1 Enforcing a lower bound on h\u03b1, C\u03b1iH\n\ncan be decomposed as Cn =Pn\n\nWe consider the following strategy: following the approach outlined in [14, Section 8] to estimate\nautocorrelation operators, and more generally in [17] in the context of kernel methods, we restrict\nHn to the directions spanned by the p \ufb01rst eigenfunctions of the operator Cn. Namely, suppose Cn\ni=1 giei \u2297 ei where ei is an orthonormal basis of eigenvectors with\neigenvalues in decreasing order g1 \u2265 g2 \u2265 \u00b7 \u00b7 \u00b7 \u2265 gn \u2265 0. For 1 \u2264 p \u2264 n We write Hp for the\nspan{e1, . . . , ep} of the p \ufb01rst eigenfunctions. Any function \u03b1 in Hp is such that h\u03b1, Cn\u03b1iH \u2265 gp\nand thus allows us to keep the empirical variance of \u03b1(Zt) above a certain threshold. Let Ep be the\nn \u00d7 p coordinate matrix of eigenvectors3 e1, . . . , ep expressed in the family of n vectors \u03c61, . . . , \u03c6n\nbiei\n\nand G the p \u00d7 p diagonal matrix of terms (g1, . . . , gp). 
We consider now a function \u03b2 = Pp\n\nin Hp, and note that\n\ni\n\n1\n2\nbT ET\np\n\n\u03b3n(\u03b2) =\n\n\u03bbn(\u03b2) =\n\n|bT ET\n\np (K\u22121KT\n\n\u2212n + K\u22121KT\n\n\u2212n)Epb|\n\nbT (G + n\u01ebnI)b\n\u2212n(K2 + n\u01ebnK)\u22121K\u2212nKT\n\u22121\n\nK\u22121KT\n\n,\n\nEpb\n\n.\n\n(3)\n\n(4)\n\nbT (G + n\u01ebnI)b\n\nWe de\ufb01ne two different functions of Hp, \u03b2mac and \u03b2BT, as the the functionals in Hp whose coef\ufb01-\ncients correspond to the eigenvector with minimal (absolute) eigenvalue of the two Rayleigh quo-\ntients of Equations (3) and (4) respectively. We call these functionals the minimum autocorrelation\n(MAC) and Box-Tiao (BT) functionals of Z. Below is a short recapitulation of all the computational\nsteps we have described so far.\n\n\u2022 Input: n + 1 observations z0, \u00b7 \u00b7 \u00b7 , zn \u2208 Z of a time-series Z, a p.d. kernel k on Z \u00d7 Z\n\nand a parameter p (we propose an experimental methodology to set p in Section 6.3)\ni=0 cik(zi, \u00b7) that is a white functional of Z.\n\n\u2022 Output: a real-valued function f (\u00b7) =Pn\n\n\u2022 Algorithm:\n\n\u2013 Compute the (n + 1) \u00d7 (n + 1) kernel matrix K, center it and drop the \ufb01rst row and\n\ncolumn to obtain K.\n\n\u2013 Store K\u2019s p \ufb01rst eigenvectors and eigenvalues in matrices U and diag(v1, \u00b7 \u00b7 \u00b7 , vp).\n\u2013 Compute Ep = U diag(v1, \u00b7 \u00b7 \u00b7 , vp)\u22121/2 and G = 1\n\u2013 Compute the matrix numerator N and denominator D of either Equation (3) or Equa-\ntion (4) and recover the eigenvector b with minimal absolute eigenvalue of the gener-\nalized eigenvalue problem (N, D)\n\u2013 Set a = Epb \u2208 Rn. 
Set c0 = \u2212 1\n\nn diag(v1, \u00b7 \u00b7 \u00b7 , vp).\n\naj and ci = ai \u2212 1\n\naj\n\n1\n\nnPn\n\n1\n\nnPn\n\n5 Relation to other methods and discussion\n\nThe methods presented in this work offer numerous parallels with other kernel methods such as\nkernel-PCA or kernel-CCA which, similarly to the BT and MAC functionals, provide a canonical\ndecomposition of Hn into n ranked eigenfunctions.\nWhen Z is \ufb01nite dimensional, the authors of [18] perform PCA on a time-series sample z0, . . . , zn\nand consider its eigenvector with smallest eigenvalue to detect cointegrated relationships in the pro-\ncess Zt. Their assumption is that a linear mapping \u03b1T Zt that has small variance on the whole\nsample can be interpreted as an integrated relationship. Although the criterion considered by PCA,\nnamely variance, disregards the temporal structure of the observations and only focuses on the val-\nues spanned by the process, this technique is useful to get rid of all non-stationary components of\nZt. On the other hand, kernel-PCA [2], a non-parametric extension of PCA, can be naturally applied\nfor anomaly detection in an i.i.d. setting [5]. It is thus natural to use kernel-PCA, namely an eigen-\nfunction with low variance, and hope that it will have low autocorrelation to de\ufb01ne white functionals\nof a process. Our experiments show that this is indeed the case and in agreement with [5] seem to\n\n3Recall that if (ui, vi) are eigenvalue and eigenvector pairs of K, the matrix E of coordinates of eigenfunc-\n) and the eigenvalues gi are equal\n\ntions ei expressed in the n points \u03c61, . . . 
, \u03c6n can be written as U diag(v\u22121/2\nto vi\n\nn if taken in the same order[2].\n\ni\n\n5\n\n\findicate that the eigenfunctions which lie at the very low end of the spectrum, usually discarded as\nnoise and less studied in the literature, can prove useful for anomaly detection tasks.\n\nkernel-CCA and variations such as NOCCO [12] are also directly related to the BT functional.\nIndeed, the operator V V \u2217 used in this work to de\ufb01ne \u03bb is used in the context of kernel-CCA to\nextract one of the two functions which maximally correlate two samples, the other function being\nobtained from V \u2217V . Notable differences between our approach and kernel-CCA are: 1.\nin the\ncontext of this paper, V is an autocorrelation operator while the authors of [12] consider normalized\ncovariances between two different samples; 2. kernel-CCA assumes that samples are independently\nand identically drawn, which is de\ufb01nitely not the case for the BT functional; 3. while kernel-CCA\nmaximizes the Rayleigh quotient of V V \u2217, we look for eigenfunctions which lie at the lower end of\nthe spectrum of the same operator. A possible extension of our work is to look for two functionals f\nand g which, rather than maximize the correlation of two distinct samples as is the case in CCA, are\nestimated to minimize the correlation between g(zt) and f (zt+1). This direction has been explored\nin [19] to shed a new light on the Box-Tiao approach in the \ufb01nite dimensional case.\n\n6 Experimental results using a population dynamics model\n\n6.1 Generating sample paths polluted by anomalies\n\nWe consider in this experimental section a simulated dynamical system perturbed by arbitrary\nanomalies. To this effect, we use the Lotka-Volterra equations to generate time-series quantify-\ning the populations of different species competing for common resources. 
For S species, the model\ntracks the population level X_{t,i} at time t of each species i, which is a number bounded between 0\nand 1. Values of 0 and 1 account respectively for the extinction and the saturation levels of each\nspecies. Writing \u25e6 for the coordinate-wise (Hadamard) product of vectors and h > 0 for\na discretization step, the population vector X_t \u2208 [0, 1]^S follows the discrete-time dynamic equation\n\nX_{t+1} = X_t + (1/h) r \u25e6 X_t \u25e6 (1_S \u2212 A X_t).\n\nWe consider the following coefficients introduced in [20] which are known to yield chaotic behavior,\n\nS = 4,    r = (1, 0.72, 1.53, 1.27)^T,\n\nA = [ 1     1.09  1.52  0\n      0     1     0.44  1.36\n      2.33  0     1     0.47\n      1.21  0.51  0.35  1    ],\n\nwhich can be turned into a stochastic system by adding an i.i.d. standard Gaussian noise \u03b5_t,\n\nZ_{t+1} = Z_t + (1/h) r \u25e6 Z_t \u25e6 (1_4 \u2212 A Z_t) + \u03c3_\u03b5 \u03b5_t.    (5)\n\nWhenever the equations generate coordinates below 0 or above 1, the violating coordinates are set\nto 0 + u or 1 \u2212 u respectively, where u is uniform over [0, 0.01].\nWe consider trajectories of length 800 of the Lotka-Volterra system described in Equation (5). For\neach experiment we draw a starting point Z_0 randomly with uniform distribution on [0, 1]^4, discard\nthe first 10 iterations and generate 400 iterations following Equation (5). 
Following this we select\nrandomly (uniformly over the remaining 400 steps) 40 time stamps t_1, \u00b7\u00b7\u00b7, t_40 where we introduce\na random perturbation at t_k such that Z_{t_k}, rather than following the dynamic of Equation (5), is\nrandomly perturbed by a noise \u03b4_t chosen uniformly over {\u22121, 1}^4 with a magnitude \u03c3_\u03b4, that is\n\nZ_{t_k} = Z_{t_k \u2212 1} + \u03c3_\u03b4 \u03b4_{t_k \u2212 1}.\n\nFor all other timestamps t_k < t < t_{k+1}, the system follows the usual dynamic of Equation (5).\nAnomalies violate the usual dynamics in two different ways: first, they ignore the usual dynamical\nequations and the current location of the process to create instead purely random increments; second,\ndepending on the magnitude of \u03c3_\u03b4 relative to \u03c3_\u03b5, such anomalies may induce unusual jumps.\n\n6.2 Estimation of white functionals and other alarm functions\n\nWe compare in this experiment five techniques to detect the anomalies described above: the Box-Tiao\nfunctional and a variant described in the paragraph below, the minimal autocorrelation functional,\na one-class SVM and the low-variance functional defined by the (p + 1)th eigenfunction of\n\n[Figure 1 plot: top panel Lotka Volterra System; below, panels Box-Tiao (AUC: 0.828), kMAC (AUC: 0.797), ocSVM (AUC: 0.444) and kPCA (AUC: 0.628), each paired with a weights panel.]\n\nFigure 1: The figure on the top plots a sample path of length 200 of a 4-dimensional Lotka-Volterra\ndynamic system with perturbations drawn with \u03c3_\u03b5 = .01 and \u03c3_\u03b4 = 0.02. 
The data is split between\n80 regular observations and 120 observations polluted by 10 anomalies. All four functionals have\nbeen estimated using \u03c1 = 1, and we highlight by a red dot the values they take when an anomaly\nis actually observed. The respective weights associated to each of the 80 training observations are\ndisplayed on the right of each methodology.\n\nthe empirical covariance Cn, given by kernel-PCA. All techniques are parameterized by a kernel k.\nWriting \u2206zi = zi \u2212 zi\u22121, we use the following mixture of kernels k :\n\nk(zi, zj) = \u03c1 e\u2212100k\u2206zi\u2212\u2206zjk2\n\n+ (1 \u2212 \u03c1)e\u221210kzi\u2212zjk2\n\n,\n\n(6)\n\nwith \u03c1 \u2208 [0, 1]. The \ufb01rst term in k discriminates observations according to their location in [0, 1]4.\nWhen \u03c1 = 0.5, k accounts for both the state of the system and its most recent increments, while\nonly increments are considered for \u03c1 = 1. Anomalies can be detected with both criterions, since\nthey can be tracked down when the process visits unusual regions or undergoes brusque and atypical\nchanges. The kernel widths have been set arbitrarily.\n\nWe discuss in this paragraph a variant of the BT functional. While the MAC functional is de\ufb01ned\nand estimated in order to behave as closely as possible to random i.i.d noise, the BT functional \u03b2BT\nis tuned to be stationary as discussed in [11]. In order to obtain a white functional from \u03b2BT it is\npossible to model the time series \u03b2BT(zt) as an unidimensional autoregressive model, that is estimate\n(on the training sample again) coef\ufb01cients r1, r2, . . . , rq such that\n\nq\n\n\u03b2BT(zt) =\n\nri\u03b2BT(zt\u2212i) + \u02c6\u03b5BT\nt\n\n.\n\nXi=1\n\nBoth the order q and the autoregressive coef\ufb01cients can be estimated on the training sample with\nstandard AR packages, using for instance Schwartz\u2019s criterion to select q. 
Note that although \u03c6(Z_t)\nis assumed to be ARH(1), this does not necessarily translate into the fact that the real-valued process\n\u03b2_BT(Z_t) = \u27e8\u03b2_BT, \u03c6_t\u27e9_H is AR(1) as pointed out in [14, Theorem 3.4]. In practice however we use\nthe residuals \u03b5\u0302^BT_t = \u03b2_BT(z_t) \u2212 \u2211_{i=1}^{q} r_i \u03b2_BT(z_{t\u2212i}) to define the Box-Tiao residuals functional which\nwe write \u03b2\u0303_BT.\n\n[Figure 2 plot: panels \u03c1 = 0, \u03c1 = 0.5, \u03c1 = 1; x-axis: Noise Amplitude \u03c3_\u03b4; y-axis: AUC; legend: BT Res, BT, MAC, ocSVM, kpca.]\n\nFigure 2: The three successive plots stand for three different values of \u03c1 = 0, 0.5, 1. The detection\nrate naturally increases with the size of the anomaly, to the extent that the task becomes only a gap\ndetection problem when \u03c3_\u03b4 becomes closer to 0.05. Functionals \u03b2_BT, \u03b2\u0303_BT and \u03b2_mac have a similar\nperformance and outperform other techniques when the task is most difficult and \u03c3_\u03b4 is small.\n\n6.3 Parameter selection methodology and numerical results\n\nThe BT functional \u03b2_BT and its residuals \u03b2\u0303_BT, the MAC function \u03b2_mac, the one-class SVM f\u0302_ocSVM\nand the (p + 1)th eigenfunction e_{p+1} are estimated on a set of 400 observations. We set p through the\nrule that the first p directions must carry at least 98% of the total variance of C_n, that is p is the first\ninteger such that \u2211_{i=1}^{p} g_i > 0.98 \u00b7 \u2211_{i=1}^{n} g_i. We fix the \u03bd parameter of the ocSVM to 0.1. The BT and\nMAC functionals additionally require the use of a regularization term \u03f5_n which we select by finding\nthe best ridge regressor of \u03c6_{t+1} given \u03c6_t through a 4-fold cross-validation procedure on the training\nset. 
For \u03b2BT, \u02dc\u03b2BT, \u03b2mac and the kPCA functional ep+1 we use their respective empirical mean \u00b5\nand variance \u03c3 on the training set to rescale and whiten their output on the test set, namely consider\nvalues (f (z) \u2212 \u00b5)/\u03c3. Although more elaborate anomaly detection schemes on such unidimensional\ntime-series might be considered, for the sake of simplicity we treat directly these raw outputs as\nalarm scores.\n\nHaving on the one hand the correct labels for anomalies and the scores for all detectors, we vary\nthe threshold at which an alarm is raised to produce ROC curves. We use the area under the curve\nof each method on each sample path as a performance measure for that path. Figure 1 provides\na summary of the performance of each method on a unique sample path of 200 observations and\n10 anomalies. Perturbation parameters are set such that \u03c3\u03b5 = 0.01 and \u03c3\u03b4 varies between 0.005\nand 0.055. For each couple (\u03c3\u03b5, \u03c3\u03b4) we generate 500 draws and compute the mean AUC of each\ntechnique on such draws. We report in Figure 2 these averaged performances for three different\nchoices of the kernel, namely three different values for \u03c1 as de\ufb01ned in Equation (6).\n\n6.4 Discussion\n\nIn the experimental setting, anomalies can be characterized as unusual increments between two\nsuccessive states of an otherwise smooth dynamical system. Anomalies are unusual due to their\nsize, controlled by \u03c3\u03b4, and their directions, sampled in {\u22121, 1}4. When the step \u03c3\u03b4 is relatively\nsmall, it is dif\ufb01cult to \ufb02ag correctly an anomaly without taking into account the system\u2019s dynamic\nas illustrated by the relatively poor performance of the ocSVM and the kPCA compared to the\nBT, BTres and MAC functions. On the contrary, when \u03c3\u03b4 is big, anomalies can be more simply\ndiscriminated as big gaps. 
The methods we propose do not perform as well as the ocSVM in such a\nsetting. We can hypothesize two reasons for this. First, in a regime that puts little emphasis on dynamics,\nwhite functionals may be less useful than a simple ocSVM with an adequate kernel. Second,\nin this study the BT and MAC functions flag anomalies whenever an evaluation goes outside of a\ncertain bounding tube. More advanced detectors of a deviation or change from normality, such as\nCUSUM [21], might be studied in future work.\n\nReferences\n\n[1] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 2009.\n[2] B. Sch\u00f6lkopf, A. Smola, and K. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10(5):1299\u20131319, 1998.\n[3] B. Sch\u00f6lkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Comput., 13:2001, 1999.\n[4] A.B. Gardner, A.M. Krieger, G. Vachtsevanos, and B. Litt. One-class novelty detection for seizure analysis from intracranial EEG. J. Mach. Learn. Res., 7:1025\u20131044, 2006.\n[5] H. Hoffmann. Kernel PCA for novelty detection. Pattern Recognit., 40(3):863\u2013874, 2007.\n[6] A. J. Fox. Outliers in time series. J. R. Stat. Soc. Ser. B, 34(3):350\u2013363, 1972.\n[7] R.S. Tsay, D. Pena, and A.E. Pankratz. Outliers in multivariate time series. Biometrika, 87(4):789\u2013804, 2000.\n[8] A. Laukaitis and A. Ra\u010dkauskas. Testing changes in Hilbert space autoregressive models. Lithuanian Mathematical Journal, 42(4):343\u2013354, 2002.\n[9] G. S. Maddala and I. M. Kim. Unit roots, cointegration, and structural change. Cambridge Univ. Pr., 1998.\n[10] P. Switzer and A.A. Green. Min/max autocorrelation factors for multivariate spatial imagery. Computer science and statistics, 16:13\u201316, 1985.\n[11] G. Box and G. C. Tiao. 
A canonical analysis of multiple time series. Biometrika, 64(2):355\u2013365, 1977.\n[12] K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res., 8:361\u2013383, 2007.\n[13] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2003.\n[14] D. Bosq. Linear Processes in Function Spaces: Theory and Applications. Springer, 2000.\n[15] A. Mas. Asymptotic normality for the empirical estimator of the autocorrelation operator of an ARH(1) process. Compt. Rendus Acad. Sci. Math., 329(10):899\u2013902, 1999.\n[16] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[17] L. Zwald and G. Blanchard. Finite dimensional projection for classification and statistical learning. IEEE Trans. Inform. Theory, 54:4169, 2008.\n[18] J. H. Stock and M. W. Watson. Testing for common trends. J. Am. Stat. Assoc., pages 1097\u20131107, 1988.\n[19] P. Bossaerts. Common nonstationary components of asset prices. J. Econ. Dynam. Contr., 12(2-3):347\u2013364, 1988.\n[20] J. A. Vano, J. C. Wildenberg, M. B. Anderson, J. K. Noel, and J. C. Sprott. Chaos in low-dimensional Lotka-Volterra models of competition. Nonlinearity, 19(10):2391\u20132404, 2006.\n[21] M. Basseville and I.V. Nikiforov. Detection of abrupt changes: theory and applications. Prentice-Hall, 1993.\n", "award": [], "sourceid": 1195, "authors": [{"given_name": "Marco", "family_name": "Cuturi", "institution": null}, {"given_name": "Jean-philippe", "family_name": "Vert", "institution": null}, {"given_name": "Alexandre", "family_name": "D'aspremont", "institution": null}]}