{"title": "Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 6067, "page_last": 6077, "abstract": "We study two problems in high-dimensional robust statistics: \\emph{robust mean estimation} and \\emph{outlier detection}.\nIn robust mean estimation the goal is to estimate the mean $\\mu$ of a distribution on $\\mathbb{R}^d$ given $n$ independent samples, an $\\epsilon$-fraction of which have been corrupted by a malicious adversary.\nIn outlier detection the goal is to assign an \\emph{outlier score} to each element of a data set such that elements more likely to be outliers are assigned higher scores.\nOur algorithms for both problems are based on a new outlier scoring method we call QUE-scoring based on \\emph{quantum entropy regularization}.\nFor robust mean estimation, this yields the first algorithm with optimal error rates and nearly-linear running time $\\tilde{O}(nd)$ in all parameters, improving on the previous fastest running time $\\tilde{O}(\\min(nd/\\e^6, nd^2))$.\nFor outlier detection, we evaluate the performance of QUE-scoring via extensive experiments on synthetic and real data, and demonstrate that it often performs better than previously proposed algorithms.", "full_text": "Quantum Entropy Scoring for Fast Robust Mean\n\nEstimation and Improved Outlier Detection\n\nYihe Dong\n\nMicrosoft Research\n\nyihedong@gmail.com\n\nSamuel B. Hopkins\n\nUniversity of California, Berkeley\n\nhopkins@berkeley.edu\n\nJerry Li\n\nMicrosoft Research\n\njerrl@microsoft.com\n\nAbstract\n\nWe study two problems in high-dimensional robust statistics: robust mean esti-\nmation and outlier detection. In robust mean estimation the goal is to estimate\nthe mean \u00b5 of a distribution on Rd given n independent samples, an \u03b5-fraction\nof which have been corrupted by a malicious adversary. 
In outlier detection the goal is to assign an outlier score to each element of a data set such that elements more likely to be outliers are assigned higher scores. Our algorithms for both problems are based on a new outlier scoring method we call QUE-scoring, based on quantum entropy regularization. For robust mean estimation, this yields the first algorithm with optimal error rates and nearly-linear running time Õ(nd) in all parameters, improving on the previous fastest running time Õ(min(nd/ε⁶, nd²)). For outlier detection, we evaluate the performance of QUE-scoring via extensive experiments on synthetic and real data, and demonstrate that it often performs better than previously proposed algorithms. Code for these experiments is available at https://github.com/twistedcubic/que-outlier-detection.

1 Introduction

We study outlier-robust statistics in high dimensions, focusing on the question: can theoretically sound outlier-robust algorithms have practical running times for large, high-dimensional data sets? We address two related problems: robust mean estimation, which is primarily theoretical, and an applied counterpart, outlier detection.

Robust mean estimation: Our main theoretical contribution is the first nearly-linear time algorithm for robust mean estimation with nearly-optimal error. Here the goal is to estimate the mean µ ∈ R^d of a d-dimensional distribution D given ε-corrupted samples X1, ..., Xn – that is, i.i.d. samples, an unknown ε-fraction of which have been maliciously corrupted. Under (for instance) the assumption that the covariance of D is bounded by I_d, it has long been known to be possible in exponential time to estimate µ by µ̂ having ‖µ − µ̂‖₂ ≤ O(√ε). 
In particular, this rate of error is independent of d.
Polynomial-time algorithms provably achieving such d-independent error became known only recently, starting with the works [8, 15]. Until our work, the running time of algorithms with provably d-independent error remained suboptimal by polynomial factors in d or ε: the fastest running time achieved before this work was Õ(min(nd², nd/ε⁶)) [6, 8, 15, 9]. (Here the Õ(·) notation hides logarithmic factors in n and d.) While these running times represent a dramatic improvement over previous exponential-time algorithms, there are still many interesting regimes where the additional runtime overheads these algorithms incur render them impractically slow. We give the first algorithm for robust mean estimation with running time Õ(nd) which achieves error ‖µ − µ̂‖₂ ≤ O(√ε). Note that this running time is nearly-linear in the input size nd. Similar to prior works, our algorithm has information-theoretically optimal sample complexity and nearly-optimal error rates in both the bounded-covariance and sub-Gaussian regimes.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Outlier detection: Our main applied contribution is a new algorithm for high-dimensional outlier detection, which we assess via experiments on both synthetic and real data.¹ Our goal is to take a dataset X1, ..., Xn ∈ R^d and assign to each Xi an outlier score τi ≥ 0, so that higher scores τi are assigned to points Xi more likely to be outliers. 
Of course, what constitutes an outlier varies across applications, so no single algorithm for outlier detection is likely to be the best in all domains. We show that our method performs well in settings where individual outliers are difficult to pick out on their own (by, say, their ℓ₂ norms or their distances to nearby points), but still collectively bias empirical statistics such as the mean and covariance.
We compare our method to baselines based on PCA and Euclidean distances, as well as to more sophisticated algorithms from the existing literature based on nearest-neighbor distances. Our algorithm has nearly-linear running time in theory, and simple implementations in practice incur minimal overhead beyond standard spectral methods, allowing us to run on 1024-dimensional data with no special optimizations and on 8192-dimensional data with a fast approximate implementation. It can therefore be used in practice to complement existing approaches to outlier detection in exploratory data analysis.

1.1 What is an outlier, and why are outliers hard to find?

For us, an outlier is an element of a data set which was generated according to a different process than the majority of the data. For instance, we may imagine that our samples X1, ..., Xn were sampled i.i.d. from a distribution (1 − ε)D + εN over R^d, where D is the distribution of inliers, N is the distribution of outliers, and ε > 0 is a small number – that is, we imagine that a constant fraction of our data may be outliers.
For this discussion, we also informally imagine that N is sufficiently distinct from D that the set of outliers could be approximately identified by brute-force search over subsets of (1 − ε)n samples, if given unlimited computational resources. Otherwise, outlier detection is not a meaningful problem, and robust mean estimation is easy (because the empirical mean will be a good estimator). 
Under these circumstances, what makes identifying outliers and estimating the mean in their presence difficult? Chiefly:
Outliers may not be identifiable in isolation. On its own, a typical outlier Xi ∼ N may look much like a typical inlier Xj ∼ D. For instance, it could be that ‖Xi‖₂ ≈ ‖Xj‖₂, and Xi, Xj may have similar distances to their nearest few neighboring samples, especially in high dimensions, where samples are far apart.
Outliers still introduce bias, collectively. Even if individual outliers look innocuous, the collective effect of a modified ε-fraction of the samples Xi can still substantially change the empirical distribution of X1, ..., Xn. As a result, even simple statistical tasks like estimating the mean or covariance of D require sophisticated estimators: naively pruning individual outliers and then employing standard empirical estimators typically leads to far-suboptimal error rates. For example, an ε-fraction of X1, ..., Xn which are all slightly biased in a single direction may shift the empirical mean of X1, ..., Xn, but this bias will be difficult to detect by looking at small numbers of samples at once. This also demonstrates that successful outlier detection can require global geometric information about a high-dimensional dataset, such as whether or not a direction exists in which many (say, εn) samples are unusually biased.
Outliers may be inhomogeneous. Outliers need not exhibit unusual bias in only one direction, or all have the same norm, or lie in a single cluster. Rather, if a dataset exhibits several forms of corruption, there may be as many different-looking kinds of outliers. 
In the theoretical robust mean estimation setting, the adversary producing ε-corrupted samples may corrupt εn/10 samples by biasing them in some direction, another εn/10 samples by unusually enlarging their norms, and so forth.
Since robust mean estimation involves a malicious adversary, all of the above phenomena must be addressed by our robust mean estimation algorithm. In the empirical section of this paper, we focus on designing an outlier detection method suited to situations where at least one of them occurs – in other cases, existing methods (such as those based on Euclidean norms or local neighborhoods of individual samples [5]) may be more appropriate.

¹Code is available at https://github.com/twistedcubic/que-outlier-detection.

1.2 QUE: Quantum Entropy Scoring

Recent innovations in robust mean estimation [15, 8] rely on the following crucial observation about ε-corrupted samples X1, ..., Xn from a distribution D with covariance Σ ⪯ I_d. Namely: any subset S ⊆ {X1, ..., Xn} of samples which shifts the empirical mean by distance more than √ε in some direction v also introduces an eigenvalue of magnitude greater than 1 to the empirical covariance.
In robust mean estimation, this leads to (amongst others) the filter algorithm of [8, 9], one of the first to achieve dimension-independent error rates. Roughly speaking, the algorithm iterates the following until the empirical covariance Σ has small spectral norm: (1) compute the top eigenvector v of the empirical covariance Σ, then (2) throw out samples Xi whose projections |⟨Xi − µ, v⟩| ≫ 1 are unusually large, where µ is the empirical mean of the corrupted dataset. 
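To make the loop concrete, here is a minimal sketch of this style of filtering. It is not the paper's implementation; the stopping threshold `spectral_bound` and pruning fraction `prune_frac` are hypothetical choices for illustration.

```python
import numpy as np

def naive_spectral_filter(X, spectral_bound=2.0, prune_frac=0.01, max_iter=50):
    """Filter in the spirit of [8, 9]: while the empirical covariance has a
    large eigenvalue, remove the points with the largest projection onto its
    top eigenvector.  All thresholds here are illustrative, not the paper's."""
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalue order
        if eigvals[-1] <= spectral_bound:        # covariance already small: done
            break
        v = eigvecs[:, -1]                       # top eigenvector
        scores = np.abs((X - mu) @ v)            # naive spectral scores
        k = max(1, int(prune_frac * len(X)))     # prune the k most extreme points
        X = X[np.argsort(scores)[:len(X) - k]]
    return X.mean(axis=0)
```

On data with a small cluster of outliers biased in one shared direction, this recovers the inlier mean far more accurately than the raw empirical mean; when outliers are spread over many orthogonal directions, it may need many passes, which is exactly the bottleneck discussed next.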
For outlier detection this suggests a natural scoring rule: let the outlier score τi of sample Xi be proportional to |⟨Xi − µ, v⟩|.
The main drawback of these algorithms is that they do not adequately account for inhomogeneity of outliers. For the filter, this leads to a worst-case running time of Õ(nd²), because the filter operation (which can be implemented in Õ(nd) time) may have to be repeated as many as d times if the adversary introduces outliers lying in d orthogonal directions. The rule τi = |⟨Xi − µ, v⟩| may miss outliers causing a large eigenvalue of Σ, but in a direction orthogonal to the top eigenvector v.
In the opposite extreme, if outliers are maximally inhomogeneous – no group of them is unusually biased in some shared direction v – then the only way they can bias the empirical mean is for their individual ℓ₂ norms ‖Xi − µ‖₂ to be larger than typical. This suggests a different scoring rule: τi = ‖Xi − µ‖₂. This approach, however, breaks down in the situation we started with: groups of outliers are biased in a shared direction but do not have larger norms than good samples.
Our main conceptual contribution is an approach to utilizing information about outliers beyond what is available in the top eigenvector of the empirical covariance Σ and in individual ℓ₂ norms. Appropriately adapted to their respective settings, this leads to our algorithms for both robust mean estimation and outlier detection.
Our first observation is that any eigenvalue/eigenvector pair λ, v – not just the top one – of the empirical covariance with λ ≫ 1 must be due to outliers. 
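A quick numerical illustration of this observation (synthetic, with hypothetical parameters): planting two small clusters of biased points in orthogonal directions produces two eigenvalues well above the inlier level, and a rule that looks only at the top eigenvector necessarily ignores one of the corrupted directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 20
X = rng.standard_normal((n, d))   # inliers: N(0, I_d), covariance ~ I_d
X[:80, 0] += 8.0                  # 2% of samples biased along coordinate 0
X[80:160, 1] += 8.0               # another 2% biased along coordinate 1

eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
# Both corrupted directions carry an eigenvalue much larger than 1, but the
# rule tau_i = |<X_i - mu, v>| sees only the single top eigenvector v.
print(eigvals[:3])
```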
We therefore consider the intermediate goal of finding a distribution over directions v ∈ R^d containing information about as many outlier directions as possible. We formalize this as the following entropy-regularized convex program over d × d positive semidefinite matrices:

max_{U ∈ R^{d×d}}  α·⟨U, Σ⟩ + S(U)  such that  U ⪰ 0, tr(U) = 1,    (1)

where α ≥ 0 is some constant and ⟨A, B⟩ = tr(AB⊤) denotes the trace inner product of matrices. Here, S(U) = −⟨U, log U⟩ is the quantum entropy (also known as the von Neumann entropy) of the matrix U. If U = Σ_{i=1}^d µi·vi·vi⊤ is the eigendecomposition of U, since it has tr(U) = 1 we may interpret it as a distribution over orthonormal vectors v1, ..., vd with weights µ1, ..., µd, and hence with entropy S(U). Under this interpretation, ⟨U, Σ⟩ = E_{vi∼µ}⟨vi, Σvi⟩. As α varies, (1) trades off optimizing for a distribution supported on many distinct directions against a distribution supported on eigenvectors of Σ with large eigenvalues. The optimizer of (1) takes the form U = exp(α·Σ)/tr exp(α·Σ), where exp(·) is the matrix exponential function.
Definition 1.1. Let U = exp(α·Σ)/tr exp(α·Σ) be the optimizer of (1) for some data set X = X1, ..., Xn ∈ R^d, where Σ is the covariance of X. 
The quantum entropy (QUE) scores with parameter α are given by τi = (Xi − µ)⊤U(Xi − µ), where µ is the mean of X.
Intuitively, the QUE scores penalize any point which is causing a large eigenvalue in any direction, which should allow us to find more outliers than the naive spectral scores presented above. QUE scores also interpolate between two more naive scoring rules: when α = 0 we have U = I_d/d and so τi = (1/d)‖Xi − µ‖₂² is the ℓ₂ norm (up to a scaling), while when α → ∞ we have U → vv⊤ where v is the top eigenvector of Σ, recovering naive spectral scoring. In both experiments and theory we find that choosing α strictly between 0 and ∞ outperforms either of the extreme choices.
QUE scores are also appealing from a computational perspective: we show that a list of approximate QUE scores τ′i = (1 ± 0.01)τi can be computed from X1, ..., Xn in nearly-linear time, by appropriate use of Johnson–Lindenstrauss sketching and efficient computation of the matrix exponential by series expansion. This is crucial both to the nearly-linear running time of our algorithm for robust mean estimation and to the scalability of our outlier detection method.
In Section 1.4 we describe refinements of QUE scoring which fit it into the matrix multiplicative weights framework [3], leading to our nearly-linear time algorithms for robust mean estimation. We give two very similar algorithms: one for when the distribution of inliers is only assumed to have bounded covariance, and one for when the inliers are assumed to be sub-Gaussian. The resulting algorithms are conceptually similar to the following modification of the filter mentioned above: until ‖Σ‖₂ ≤ O(1), compute QUE scores τi, throw out data points Xi with τi ≫ 1, and repeat. (To obtain provable guarantees, our final algorithms are somewhat more complex: in some iterations we use QUE scores based on certain reweightings of the data learned in previous iterations.)
In Section 1.5 we describe experiments validating the QUE scoring rule on both synthetic and real data sets. We show that it performs especially well by comparison to local-neighborhood methods and to scoring based on only the top eigenvector in data sets where the inliers are close to isotropic (or can be made so by applying data whitening procedures) and in which there are heterogeneous outliers.

1.3 Related work

Robust mean estimation: The study of robust statistics and in particular robust mean estimation began with major works by Anscombe, Huber, Tukey and others in the 1960s [2, 25, 12, 26]. The literature on polynomial-time algorithms for robust statistics has exploded in recent years, following works by Diakonikolas et al. and Lai, Rao and Vempala giving the first polynomial-time algorithms for robust mean estimation with dimension-independent (or nearly dimension-independent) error [8, 15]. A full survey is beyond our scope here – see e.g. the recent theses [17, 24] for a thorough account.
Particularly relevant to our work is the recent work of Cheng, Diakonikolas, and Ge, who design an algorithm for robust mean estimation with running time Õ(nd/ε⁶) – the first to achieve nearly linear time for constant ε – by appeal to nearly linear time solvers for packing and covering semidefinite programs [6]. 
Our algorithms carry two advantages over this prior work: first, our algorithm runs in nearly linear time for any choice of ε = ε(n, d), and second, because we avoid the 1/ε⁶ scaling and the appeal to semidefinite programming, our theoretical ideas lead to a practical method for outlier detection. The techniques of Diakonikolas et al. were later extended to robust covariance estimation [7]; it remains an interesting direction to extend our techniques to covariance estimation.
Concurrent work: After this manuscript was initially submitted, we became aware of the concurrent work [16], which also obtains a nearly-linear time algorithm for robust mean estimation of distributions with bounded covariance. The algorithm of [16] also obtains sub-Gaussian confidence intervals (see e.g. [19]), which the algorithm in this work does not. By contrast, the algorithms in our work also obtain improved rates of error with respect to ε when the underlying distribution is sub-Gaussian, and our method is sufficiently practical that we are able to implement parts of it to run our experiments on outlier detection. (The method of [16] relies on nearly-linear time solvers for packing/covering semidefinite programs, which are not yet practical.) Finally, implicit in the work [16] is a reduction from arbitrary ε to the case ε = 1/100; we describe this reduction and some consequences in supplementary material.
Outlier detection: Detection of outliers goes back nearly to the beginning of statistics itself [11]. Even restricting to the high-dimensional case, it has a literature too broad to survey here. Much recent work has focused on so-called local outlier factor-based methods, which assign outlier scores based on the local density of other samples near each Xi – see e.g. [13, 14] and further references in [5]. We find that QUE scoring compares favorably to such local methods in high-dimensional datasets like those we describe in Section 1.1 – see Section 1.5 and supplementary material for details.

1.4 Robust mean estimation: results and algorithm overview

We turn to our algorithm for robust mean estimation, deferring details to supplementary material.
Definition 1.2 (ε-corrupted samples). Let D be a distribution on R^d. We say that X1, ..., Xn are an ε-corrupted set of samples from D if they are first drawn i.i.d. from D, then modified by an adversary who may adaptively inspect all the samples, remove εn of them, and replace them with arbitrary vectors in R^d.

Note that ε-corruption is a stronger outlier model than the (1 − ε)D + εN mixture model we described in Section 1; our algorithms also work in this milder mixture model. Our main theoretical result is:
Theorem 1.1. For every n, d ∈ N and ε > 0 there are algorithms QUESCOREFILTER, S.G.-QUESCOREFILTER with running time Õ(nd), such that for every distribution D on R^d with mean µ and covariance Σ, given n ε-corrupted samples from D, QUESCOREFILTER produces µ̂ such that ‖µ̂ − µ‖₂ ≤ O(√ε) + Õ(√(d/n)) if Σ ⪯ I_d, and S.G.-QUESCOREFILTER produces µ̂ such that ‖µ̂ − µ‖₂ ≤ O(ε√(log(1/ε)) + √(d/n)) if D is sub-Gaussian with Σ = I_d, all with probability at least 0.99.
For the bounded covariance case, the O(√ε) term is information-theoretically optimal up to constant factors. The other term, Õ(√(d/n)), is information-theoretically optimal up to the logarithmic factors in the Õ(·) even without corruptions. For the sub-Gaussian case, the O(ε√(log 1/ε)) term is believed to be necessary for computationally efficient algorithms (see e.g. the statistical-query lower bound [10]), although that term can be made O(ε) by using computationally-intractable estimators such as the Tukey median, and the latter is information-theoretically optimal [26]. The √(d/n) term is information-theoretically optimal even without corruptions.
In this section we discuss our algorithm for the bounded-covariance case Σ ⪯ I_d in the setting that the adversary may not remove samples, leaving technical details and the modifications necessary to handle removed samples and sub-Gaussian D to supplementary material.
Definition 1.3 (Simplified robust mean estimation). Let S = {X1, ..., Xn} ⊆ R^d be a dataset with the property that S partitions into S = Sg ∪ Sb with |Sb| ≤ εn and E_{i∼Sg}(Xi − µg)(Xi − µg)⊤ ⪯ I_d, where µg = E_{i∼Sg} Xi. Given S, the goal is to find a vector µ̂ with ‖µg − µ̂‖₂ ≤ O(√ε).

Like prior algorithms for robust mean estimation, ours maintains a weight vector w1, ..., wn ≥ 0 with Σ wi ≤ 1, initialized to wi = 1/n. The algorithm iteratively decreases the weight of points suspected to be outliers that are causing ‖µ(w) − µg‖₂ to be large.² A key insight of recent work on robust mean estimation is that it suffices to find weights w which place almost as much mass on Sg as does the uniform weighting and whose empirical covariance is small. This is formalized in the following lemma. For a weight vector w, let |w| = Σ wi, µ(w) = (1/|w|) Σ wi·Xi, and M(w) = (1/|w|) Σ wi·(Xi − µ(w))(Xi − µ(w))⊤. Let ‖M‖₂ be the spectral norm of a matrix M.
Lemma 1.2 (Implicit in prior work). 
Let S = {X1, ..., Xn} be as in Definition 1.3. Suppose that w is a weight vector such that ‖M(w)‖₂ ≤ O(1) and w is mostly good, by which we mean |(1/n)1_Sg − wg| ≤ |(1/n)1_Sb − wb|, where 1_Sg, 1_Sb are the indicators of Sg, Sb and wg, wb are w restricted to Sg, Sb respectively. (Intuitively, w is mostly good if it results by removing from the uniform weighting 1_S/n more weight from Sb than from Sg.) Then ‖µ(w) − µg‖₂ ≤ O(√ε).
Lemma 1.2 captures the following geometric intuition: if the bad points Sb receive enough weight in w to cause ‖µ(w) − µg‖₂ ≫ √ε, then an O(ε)-fraction of the mass of w is on Xi which are unusually correlated with the vector µ(w) − µg, which leads to a large maximum eigenvalue in M(w). Prior works employ a variety of methods to find a mostly good weight vector w with ‖M(w)‖₂ ≤ O(1). Perhaps the simplest is the filter of [8], which iterates: while ‖M(w)‖₂ ≫ 1, compute its top eigenvector v and naive spectral scores τi = ⟨Xi − µ(w), v⟩²; throw out Xi with large τi and repeat.
The filter ensures that the weight vector it maintains is mostly good because (in an averaged sense) τi can be large only for Xi which are corrupted. This is because the (weighted) sum of all scores has Σ wiτi = ⟨M(w), vv⊤⟩ ≫ 1, while the contribution to this sum from Sg has

Σ_{i∈Sg} wiτi ≈ ⟨ (1/n) Σ_{i∈Sg} (Xi − µg)(Xi − µg)⊤, vv⊤ ⟩ ≤ 1.

(Here we ignore some details about centering Xi at µg rather than µ(w).) Thus, the τi from Sb must make up almost all of Σ wiτi. 
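The weighted quantities |w|, µ(w), M(w) and the naive spectral scores above translate directly into code (a plain transcription of the definitions, with no claim about the paper's implementation):

```python
import numpy as np

def weighted_mean_cov(X, w):
    """mu(w) = (1/|w|) sum_i w_i X_i  and
    M(w) = (1/|w|) sum_i w_i (X_i - mu(w))(X_i - mu(w))^T, with |w| = sum_i w_i."""
    w = np.asarray(w, dtype=float)
    total = w.sum()                     # |w|
    mu = (w @ X) / total
    Y = X - mu
    M = (Y * w[:, None]).T @ Y / total
    return mu, M

def naive_spectral_scores(X, w):
    """tau_i = <X_i - mu(w), v>^2 for the top eigenvector v of M(w)."""
    mu, M = weighted_mean_cov(X, w)
    v = np.linalg.eigh(M)[1][:, -1]     # eigenvectors sorted by ascending eigenvalue
    return ((X - mu) @ v) ** 2
```

With the uniform weighting wi = 1/n, the weighted score sum Σ wiτi equals ⟨M(w), vv⊤⟩, i.e. the top eigenvalue of M(w), matching the identity used in the argument above.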
Simple approaches to removing or downweighting Xi with large τi then remove strictly more weight from Sb than from Sg.
However, filtering based on naive spectral scores alone faces a barrier to achieving nearly-linear running time. If the corruptions Sb are split among many orthogonal directions, the naive spectral filter will have to find those directions one at a time. Thus, it may require Ω(d) iterations (leading to Ω(nd²) running time) to arrive at w with ‖M(w)‖₂ ≤ O(1).
Our main idea is that by replacing naive spectral scores with slightly modified QUE scores, each iteration of the filter can take into account projections of each sample onto many large eigenvectors of M(w). We show that our modified QUE scores τi maintain the property that Σ_{i∈Sb} wiτi ≫ Σ_{i∈Sg} wiτi, and so downweighting according to τi removes more mass from Sb than Sg. 

²Some prior algorithms, e.g. the filter of [8], instead iteratively throw out points suspected to be outliers. However, since those algorithms are (necessarily) randomized, they can also be viewed as weighting points, where the weight of Xi is the probability it has not been thrown out. The algorithm we present here can also be implemented by throwing out points in a randomized fashion – we discuss further in the appendix.
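For reference, QUE scores themselves (Definition 1.1, without the reweighting refinements used in the final algorithm) admit a short dense-matrix implementation; it costs O(d³) per call via an eigendecomposition, whereas the paper's algorithm reaches nearly-linear time by sketching. The default α = 4.0 is an arbitrary illustrative choice, not the paper's.

```python
import numpy as np

def que_scores(X, alpha=4.0):
    """QUE scores tau_i = (X_i - mu)^T U (X_i - mu), with
    U = exp(alpha * Sigma) / tr exp(alpha * Sigma) as in Definition 1.1.
    Dense O(d^3) reference version; the paper's algorithm never forms U."""
    mu = X.mean(axis=0)
    Y = X - mu
    Sigma = (Y.T @ Y) / len(X)                 # empirical covariance
    lam, V = np.linalg.eigh(Sigma)             # Sigma = V diag(lam) V^T
    e = np.exp(alpha * (lam - lam.max()))      # shift for stability; cancels below
    U = (V * (e / e.sum())) @ V.T              # exp(alpha*Sigma)/tr exp(alpha*Sigma)
    return np.einsum('ij,jk,ik->i', Y, U, Y)   # tau_i = Y_i^T U Y_i
```

At α = 0 this reduces to τi = ‖Xi − µ‖₂²/d, and as α → ∞ it approaches naive spectral scoring, matching the interpolation described in Section 1.2.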
However, filtering with QUE scores makes faster progress than with naive spectral scores: roughly speaking, we show that only O(log d)² rounds of filtering according to QUE scores are required to find a mostly-good weight vector w with ‖M(w)‖₂ ≤ O(1).
The core of our algorithm is a subroutine, DECREASESPECTRALNORM, which takes a mostly good weight vector w with ‖M(w)‖₂ ≫ 1 and in O(log d) rounds of QUE filtering produces another mostly good w′ with ‖M(w′)‖₂ ≤ (3/4)‖M(w)‖₂. Repeating this subroutine O(log d) times and then outputting the resulting µ(w) yields our main algorithm. An outline of this subroutine is presented as Algorithm 1.
We first establish a rigorous sense in which downweighting according to outlier scores τi makes progress: it decreases the weighted average of the scores while removing more weight from bad points than good.
Lemma 1.3 (Progress in one round of downweighting, informal). There is a downweighting algorithm which takes a density matrix U and a mostly good weight vector w and produces a mostly good weight vector w′ by downweighting points with large score τi = ⟨Xi − µ(w), U(Xi − µ(w))⟩, such that Σ w′iτi ≤ (1/3) Σ wiτi so long as Σ wiτi ≫ 1. Furthermore, M(w′) ⪯ M(w).
Let us give a geometric interpretation to Lemma 1.3: it establishes that if Σ wiτi = ⟨U, M(w)⟩ ≫ 1 then the quadratic form of M(w′) decreases in the directions defined by U, since

⟨M(w′), U⟩ ≈ Σ w′iτi ≤ (1/3) Σ wiτi = (1/3)⟨M(w), U⟩.    (2)

This guarantee becomes more meaningful as the entropy S(U) increases, because it suggests the quadratic form of M(w) has decreased in more directions. To make this formal, we appeal to the matrix multiplicative weights framework. DECREASESPECTRALNORM applies downweighting iteratively using a sequence of entropy-maximizing density matrices U1, ..., UT chosen according to the matrix multiplicative weights update rule, leading to a series of mostly good weight vectors w1, ..., wT such that ‖M(wT)‖₂ ≤ (3/4)‖M(w0)‖₂. We choose

Ut = exp( (1/‖M(w)‖₂) Σ_{k=0}^{t−1} M(wk) ) / tr exp( (1/‖M(w)‖₂) Σ_{k=0}^{t−1} M(wk) ),    (3)

where w0 = w is the input weight vector, U0 = I_d, and wt results from applying the downweighting of Lemma 1.3 to wt−1 using Ut (if ⟨M(wt−1), Ut⟩ ≫ 1). The following lemma is a special case of the standard (local norm) regret bound for matrix multiplicative weights.
Lemma 1.4 (Special case of Theorem 3.1, [1]). For any w0, ..., wT, if α ≤ 1/‖M(wt)‖₂ for all t ≤ T, then

‖ Σ_{t=0}^{T−1} M(wt) ‖₂ ≤ Σ_{t=0}^{T−1} ⟨Ut, M(wt)⟩ + α Σ_{t=0}^{T−1} ⟨Ut, M(wt)⟩·‖M(wt)‖₂ + (log d)/α.    (4)

Now we sketch the analysis of DECREASESPECTRALNORM.
Claim 1.5 (Informal). 
If w = w0 is mostly good, with ‖M(w0)‖₂ ≥ 100, then DECREASESPECTRALNORM produces mostly good wT with ‖M(wT)‖₂ ≤ (3/4)‖M(w)‖₂.
Proof sketch. Since M(wt+1) ⪯ M(wt) by Lemma 1.3, we have ‖M(wt)‖₂ ≤ ‖M(w0)‖₂ for all t, and hence α = 1/‖M(w0)‖₂ ≤ 1/‖M(wt)‖₂ for all t, so w0, ..., wT and U0, ..., UT−1 satisfy the hypotheses of Lemma 1.4. By our choice of α and M(wT) ⪯ M(wt) for all t, (4) implies

T·‖M(wT)‖₂ ≤ ‖ Σ_{t=0}^{T−1} M(wt) ‖₂ ≤ 2 Σ_{t=0}^{T−1} ⟨Ut, M(wt)⟩ + ‖M(w0)‖₂·log d.

If ⟨Ut, M(wt−1)⟩ ≥ ‖M(w0)‖₂/3 ≫ 1, then DECREASESPECTRALNORM performs downweighting, and by Lemma 1.3 and (2) (which we establish rigorously in supplemental material), ⟨M(wt), Ut⟩ ≤ (1/3)⟨M(wt−1), Ut⟩ ≤ (1/3)‖M(w0)‖₂. Otherwise, by hypothesis ⟨M(wt), Ut⟩ = ⟨M(wt−1), Ut⟩ ≤ ‖M(w0)‖₂/3. Using this bound and dividing by T, we obtain ‖M(wT)‖₂ ≤ (2/3 + (log d)/T)‖M(w0)‖₂. Choosing T ≥ 20 log d completes the proof sketch.
Running time: Our overall algorithm only requires log(nd)^{O(1)} iterations of DECREASESPECTRALNORM, and the latter only requires O(log d) iterations of downweighting, so we just have to implement downweighting in nearly-linear time. We show in supplemental material that this can be done by avoiding representing any of the matrices Ut explicitly in memory: instead, we maintain only low-rank sketches of them. This leads to some approximation error in computing the QUE scores, 
This leads to some approximation error in computing the QUE scores, but we show that approximations to the QUE scores suffice for all arguments above. For remaining technical details and full proofs, see Sections 5-9 of supplemental materials.

Algorithm 1 DECREASESPECTRALNORM
1: Input: $X_1, \dots, X_n$ as in Definition 1.3, mostly good weight vector $w_0$.
2: For iteration $t = 0, \dots, O(\log d)$: if $\|M(w_t)\|_2 \le \frac{3}{4} \|M(w_0)\|_2$, output $w_t$ and halt. Otherwise, let $U_t$ be as in (3). If $\langle U_t, M(w_t) \rangle \le \frac{1}{3} \|M(w_0)\|_2$, let $w_{t+1} = w_t$; else let $w_{t+1}$ be the output of downweighting from Lemma 1.3 with $U_t$.
3: Output $w_T$.

1.5 Outlier detection: algorithm and experimental results

In this section, we empirically evaluate outlier detection using QUE scoring. QUE scoring can detect (some kinds of) spectral outliers. We call $X \in \mathbb{R}^d$ a spectral outlier with respect to a dataset $S$ if the list of squared projections $(\langle \bar{X}, v_1 \rangle^2, \dots, \langle \bar{X}, v_d \rangle^2)$ is atypical by comparison to most $Y \in S$, where $\bar{X} = X - \mathbb{E}_{Y \sim S} Y$ and $v_1, \dots, v_d$ are the eigenvectors of the covariance matrix of $S$. The QUE scoring approach of aggregating the list $(\langle \bar{X}, v_1 \rangle^2, \dots, \langle \bar{X}, v_d \rangle^2)$ into one number carries (at least) two distinct advantages: first, the QUE scores of a dataset can be computed approximately in nearly-linear time, and second, the QUE scores weigh $\langle \bar{X}, v_i \rangle^2$ more heavily for larger $\lambda_i$, while still incorporating more information than $\langle \bar{X}, v_1 \rangle^2$ alone (which is the naive spectral approach). There may be many other useful ways to go beyond the naive spectral approach to combine the projections $(\langle \bar{X}, v_1 \rangle^2, \dots$
$, \langle \bar{X}, v_d \rangle^2)$ into a single outlier score; indeed, by varying $\alpha$, QUE scoring already provides a tunable range of methods.

Experimental setup: We must work with data containing well-defined and known inliers and outliers so that we can compare our results to ground truth. We generate such data sets in three distinct ways, leading to three main experiments. (In supplemental material we also study some outlier-detection data sets appearing in prior work [5].)

Synthetic: We create synthetic data sets in 128 dimensions with $10^3$-$10^4$ samples and an $\epsilon$-fraction of inhomogeneous outliers in $k$ directions by sampling from a mixture of $k+1$ Gaussians
\[ (1-\epsilon) N(0, \mathrm{Id}) + \sum_{i=1}^k \epsilon_i \left[ \tfrac{1}{2} N(C\sqrt{k/\epsilon} \cdot e_i, \sigma^2 \mathrm{Id}) + \tfrac{1}{2} N(-C\sqrt{k/\epsilon} \cdot e_i, \sigma^2 \mathrm{Id}) \right], \]
where $e_1, \dots, e_k$ are standard basis vectors, $C \approx 1$, and $\sigma \ll 1$. The outliers are the samples from $N(\pm C\sqrt{k/\epsilon} \cdot e_i, \sigma^2 \mathrm{Id})$. By varying $\epsilon$, $k$, and the distribution $\epsilon_1, \dots, \epsilon_k$ of outlier weights, we demonstrate in this simplified model how max-entropy outlier scoring improves on baseline algorithms in the presence of inhomogeneous outliers. We choose the scaling $C\sqrt{k/\epsilon} \cdot e_i$ because standard calculations then predict that if $\epsilon_i \approx \epsilon/k$, the outliers from $N(\pm C\sqrt{k/\epsilon} \cdot e_i, \sigma^2 \mathrm{Id})$ will contribute an eigenvalue greater than 1 to the overall empirical covariance.

Mixed – word embeddings: We create a data set consisting of word embeddings drawn from several sources.
Inliers are the 100-dimensional GloVe embeddings ([21]) of the words in a random $\approx 10^3$-word-long section of a novel (we use Sherlock Holmes), and outliers are embeddings of the first paragraphs of $k$ featured Wikipedia articles from May 2019 [27].

Perturbed – images: We create a data set consisting of CIFAR-10 images, some of which have artificially-introduced dead pixels. Inliers are $\approx 4500$ random CIFAR-10 images $X \in \{1, \dots, 256\}^{1024}$ (restricted to the red color channel). Outliers are $\approx 500$ random CIFAR-10 images, partitioned into groups $S_1, \dots, S_k$, such that for each group $i$ a random coordinate $p_i \in \{1, \dots, 1024\}$ and a random value $c_i \in \{1, \dots, 256\}$ are chosen, and for each $X \in S_i$ we set $X_{p_i} = c_i$.

Metric: All the methods we evaluate produce a vector of scores $\tau_1, \dots, \tau_n \in \mathbb{R}$. We use the standard ROCAUC metric to compare these scores to a ground-truth partition $S = S_g \cup S_b$ into inlier and outlier sets. $\mathrm{ROCAUC}(\tau_1, \dots, \tau_n, S_b, S_g) = \Pr_{i \sim S_b, j \sim S_g}(\tau_i \ge \tau_j)$ is simply the probability that a randomly chosen outlier is scored higher than a random inlier.

Baselines: We compare QUE scoring to the following other scoring rules.
$\ell_2$: $\tau_i = \|X_i - \mu\|$ is the distance of $X_i$ to the empirical mean; naive spectral (top eigenvector): $\tau_i = \langle X_i - \mu, v \rangle^2$, where $v$ is the top eigenvector of the empirical covariance; $k$-nearest neighbors ($k$-NN) [22, 5] and local outlier factor (LOF) [4, 5] methods: $\tau_i$ is a function of the distances of $X_i$ to its $k$ nearest neighbors; isolation forest and elliptic envelope: standard outlier detection methods as implemented in scikit-learn [23, 18, 20].

Whitening: Scoring methods based on the projection of data points $X_i$ onto large eigenvectors of the empirical covariance work best when those eigenvectors correspond to directions in which many outliers lie. In particular, if $\Sigma_g$, the covariance of $S_g$, itself has large eigenvalues, then such spectral methods perform poorly. We assume access to a whitening transformation $W \in \mathbb{R}^{d \times d}$, which captures a small amount of prior knowledge about the distribution of inliers $S_g$. For best performance $W$ should approximate $W^* = (\Sigma_g)^{-1/2}$, since the $W^* X_i$ form an isotropic set of vectors. Of course, to compute $W^*$ exactly would require knowing which points are inliers, but we find that relatively naive approximations suffice. In particular, if a clean dataset $Y_1, \dots, Y_m$ whose distribution is similar to the distribution of inliers is available, its empirical covariance can be used to find a good whitening transformation $W$. In our synthetic data we use $W = \mathrm{Id}$. In our word-embeddings experiment, we obtain $W$ using the empirical covariance of the embedding of another random section of Sherlock Holmes. In our CIFAR-10 experiment, we obtain $W$ from the empirical covariance of a fresh sample of $\approx 5000$ randomly chosen images from CIFAR-10.

Algorithm 2 QUE-Scoring for Outlier Detection
1: Input: dataset $X_1, \dots$
$, X_n \in \mathbb{R}^d$, optional whitening transformation $W \in \mathbb{R}^{d \times d}$, scalar $\alpha > 0$.
2: Let $X'_i = W X_i$ be the whitened data, $\mu = \frac{1}{n} \sum_{i=1}^n X'_i$, and $\Sigma = \frac{1}{n} \sum_{i=1}^n (X'_i - \mu)(X'_i - \mu)^\top$.
3: For $i \le n$, let $\tau_i = (X'_i)^\top \exp(\alpha \Sigma / \|\Sigma\|_2) X'_i \,/\, \operatorname{Tr} \exp(\alpha \Sigma / \|\Sigma\|_2)$. Return $\tau_1, \dots, \tau_n$.

Note on $\alpha$: in both synthetic and real data we find that $\alpha = 4$ is a good rule-of-thumb choice, consistently resulting in improved scores over baseline methods.

High-dimensional scaling: Implementing Algorithm 2 by explicitly forming the matrix $\Sigma$ and performing a singular value decomposition (SVD) to compute $\exp(\alpha \Sigma)$ is feasible on relatively low-dimensional data ($d \approx 100$). See supplementary material for discussion and results of a nearly-linear time implementation.

Figure 1: Panels (a, d, g) show synthetic data; panels (b, e, h) whitened CIFAR-10; panels (c, f, i) whitened word embeddings. (a-f): We plot the difference between ROCAUC performance of QUE and naive spectral (a-c) and $\ell_2$ scoring (d-f) on all three data sets, as $\alpha$ varies. Error bars represent one empirical standard deviation over 20 trials. Note that in all three cases the mean improvement in ROCAUC score given by QUE is at least one standard deviation above 0 for a wide range of $\alpha$. Observe also that in synthetic data (which most closely parallels theory) the optimal $\alpha$ decreases with increasing number of outlier directions, in accord with the need to find a higher-entropy solution to (1). (g-i): We plot ROCAUC scores of QUE (with $\alpha = 4$) and a variety of other methods as the number of outlier directions increases.
Error bars represent one standard deviation over 3-4 trials; the number of trials is small due to the large running-time requirements of the scikit-learn methods IsolationForest and EllipticEnvelope. The methods "lof" and "knn" are based on nearest-neighbor distances [5]. All except spectral methods perform poorly on synthetic data; as $k$ increases, the performance gap between QUE and naive spectral scoring grows. In all plots $\epsilon = 0.2$. Experiments were run on a quad-core 2.6GHz machine with 16GB RAM and an NVIDIA P100 GPU.

References

[1] Zeyuan Allen-Zhu, Zhenyu Liao, and Lorenzo Orecchia. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 237–245. ACM, 2015.

[2] Frank J Anscombe. Rejection of outliers. Technometrics, 2(2):123–146, 1960.

[3] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[4] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: identifying density-based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 93–104, 2000.

[5] Guilherme O Campos, Arthur Zimek, Jörg Sander, Ricardo JGB Campello, Barbora Micenková, Erich Schubert, Ira Assent, and Michael E Houle. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30(4):891–927, 2016.

[6] Yu Cheng, Ilias Diakonikolas, and Rong Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2755–2771. SIAM, 2019.

[7] Yu Cheng, Ilias Diakonikolas, Rong Ge, and David Woodruff. Faster algorithms for high-dimensional robust covariance estimation.
In Proceedings of the 32nd Annual Conference on Learning Theory (COLT 2019).

[8] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.

[9] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 999–1008. JMLR.org, 2017.

[10] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.

[11] Douglas M Hawkins. Identification of outliers, volume 11. Springer, 1980.

[12] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics, pages 492–518. Springer, 1992.

[13] Edwin M Knorr and Raymond T Ng. A unified notion of outliers: Properties and computation. In KDD, volume 97, pages 219–222, 1997.

[14] Edwin M Knorr and Raymond T Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases, pages 392–403. Citeseer, 1998.

[15] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.

[16] Guillaume Lecué and Jules Depersin. Robust subgaussian estimation of a mean vector in nearly linear time. arXiv preprint arXiv:1906.03058, 2019.

[17] Jerry Zheng Li. Principled approaches to robust machine learning and beyond.
PhD thesis, Massachusetts Institute of Technology, 2018.

[18] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.

[19] Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics, 47(2):783–794, 2019.

[20] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[21] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[22] S Ramaswamy, R Rastogi, and K Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 427–438, 2000.

[23] Peter J Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212–223, 1999.

[24] Jacob Steinhardt. Robust Learning: Information Theory and Algorithms. PhD thesis, Stanford University, 2018.

[25] John W Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pages 448–485, 1960.

[26] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.

[27] https://en.wikipedia.org/wiki/Wikipedia:Today’s_featured_article/May_2019. May 2019.
Wikimedia Foundation.