__ Summary and Contributions__: The paper considers the problem of robust mean estimation when the data is both heavy-tailed and corrupted with outliers. In this adversarial setup, the authors show that existing estimators do achieve the near-optimal subgaussian non-asymptotic rate, with optimal dependence on the contamination level \epsilon, under a bounded second-moment condition. Moreover, when the covariance is known, their estimator achieves the optimal \epsilon^{1-1/k} rate under a bounded k-th central-moment condition. The key technical result is to show the existence of a "good"/stable subset, which is at the heart of efficient robust estimators.

__ Strengths__: I think the paper is technically strong, and the results are very significant. It shows that existing spectral-filtering-based approaches do indeed give optimal rates for heavy-tailed estimation, which I find very interesting.
### After Rebuttal ###
I thank the authors for their thoughtful response. After looking at it and reading the other reviews, my score remains the same.

__ Weaknesses__: Nothing major.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: The authors state that [DL19,LLVZ19] work only in the additive contamination setting, and not in the strong contamination model. Why is that?

__ Summary and Contributions__: The paper presents a robust mean estimator with sub-Gaussian concentration.
---
Post-rebuttal: I thank the authors for their feedback.

__ Strengths__: The estimator runs in polynomial time.

__ Weaknesses__: The authors did not explicitly specify the polynomial complexity, thus the algorithm may still be impractical.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: CONTEXT
The workhorse behind the robust statistics results of the last few years is an iterative filtering procedure that eventually identifies a subset of the corrupted data whose empirical mean is close to the true mean. This framework works under a "stability" assumption on the clean data: any large subset has empirical mean/covariance close to the true mean/covariance.
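For concreteness, the filtering loop described above can be sketched roughly as follows. This is my own illustrative simplification, not the paper's exact procedure: the constant threshold and the one-point-per-round removal are assumptions (real filters remove many points at once, with randomization to bound how many clean points are discarded).

```python
import numpy as np

def filter_mean(X, iters=20, thresh=9.0):
    """Illustrative spectral filtering: repeatedly score points along the
    top eigenvector of the empirical covariance and drop the worst one."""
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    for _ in range(iters):
        S = X[keep]
        mu = S.mean(axis=0)
        cov = np.atleast_2d(np.cov(S.T))
        evals, evecs = np.linalg.eigh(cov)
        if evals[-1] <= thresh:
            break  # covariance already bounded: the kept set looks stable
        v = evecs[:, -1]  # direction of excess empirical variance
        scores = ((X - mu) @ v) ** 2
        scores[~keep] = -np.inf
        keep[np.argmax(scores)] = False  # remove the most outlying point
    return X[keep].mean(axis=0)
```

On data with identity covariance plus a few gross outliers, the loop removes the outliers until the top eigenvalue falls below the (assumed) constant threshold, then returns the mean of the surviving points.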
Another line of work, starting with Lugosi-Mendelson '19, has focused on getting sub-Gaussian rates for mean estimation of heavy-tailed (uncorrupted) distributions in high dimensions; e.g., for identity-covariance distributions, the goal is to obtain \sqrt{d/n} + \sqrt{\log(1/\delta)/n} rates (as opposed to \sqrt{d/(n\delta)} via the empirical mean or \sqrt{d\log(1/\delta)/n} via the geometric median). Subsequent works have culminated in estimators that achieve such optimal sub-Gaussian rates in near-linear time and that can also tolerate *additive* corruptions.
------------------------------------------------------------
THIS WORK
This work is the first to get the best of both worlds: it gives a refined analysis of the stability-based framework from the robust statistics literature, showing that (modulo a standard bucketing preprocessing step) the iterative filtering approach already achieves optimal sub-Gaussian rates for mean estimation, even in the presence of corruptions in the strong contamination model (where the adversary can both add and delete points). They also show that their analysis extends to the k-hypercontractive, identity-covariance case, at the loss of an extra \sqrt{\log d} factor.
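The bucketing preprocessing mentioned above can be sketched as follows (a hedged illustration of the median-of-means idea: the choice of bucket count k is a tuning parameter, typically of order \log(1/\delta) plus the number of corrupted points, and is my assumption, not a prescription from the paper):

```python
import numpy as np

def bucket_means(X, k, rng=None):
    """Median-of-means style preprocessing: randomly partition the n
    points into k buckets and return the k bucket averages. Filtering
    is then run on these averages instead of the raw points."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    idx = rng.permutation(len(X))  # random partition of the sample
    buckets = np.array_split(idx, k)
    return np.stack([X[b].mean(axis=0) for b in buckets])
```

The point is that bucket averages of heavy-tailed data concentrate much better than individual samples, while each corrupted point spoils at most one bucket.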
------------------------------------------------------------
TECHNIQUES
The main technical contribution of this work is to strengthen the existing bound on the existence of a large stable subset within a collection of i.i.d. samples from a distribution with bounded covariance, in the high-probability regime. The key idea is as follows. For w a set of high-min-entropy weights on the clean points and Sigma_w the weighted empirical covariance, a standard step in the robust mean estimation papers is to solve the minimax problem of searching for w minimizing ||Sigma_w - I|| (the max player picks a test matrix in the spectrahedron). Crucially, they pass to the maximin problem, which then becomes an empirical-process question, where the empirical process is indexed over the spectrahedron, and one can then reason about the concentration of min_w ||Sigma_w - I|| using Talagrand's concentration inequality. This establishes the covariance bound in the definition of stability; the mean deviation bound then follows by taking the w minimizing ||Sigma_w - I|| and applying similar reasoning to refine w further.
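For a fixed weight vector w, the inner quantity ||Sigma_w - I|| discussed above is just a top-eigenvalue computation; a minimal numerical sketch (illustrative only — a real solver would also optimize over w, e.g. via multiplicative weights or the filtering loop):

```python
import numpy as np

def sigma_w_distance(X, w, mu=None):
    """Spectral-norm distance ||Sigma_w - I|| for a weight vector w, where
    Sigma_w = sum_i w_i (x_i - mu_w)(x_i - mu_w)^T with w normalized."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_w = w @ X if mu is None else mu  # weighted empirical mean
    Y = X - mu_w
    Sigma_w = (Y * w[:, None]).T @ Y    # weighted empirical covariance
    d = X.shape[1]
    return np.linalg.norm(Sigma_w - np.eye(d), ord=2)
```

The max player's test matrix corresponds to the rank-one projector onto the eigenvector achieving this spectral norm; downweighting outliers shrinks the value.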
POST-AUTHOR FEEDBACK: Thanks for providing additional details on the sample complexity achieved by prior works. My opinion is unchanged; that is, I continue to be in favor of acceptance.

__ Strengths__: This work is very nice for simultaneously 1) answering the open question of efficiently getting optimal sub-Gaussian rates for heavy-tailed mean estimation in the strong contamination model, and 2) showing that the existing filtering approach *already* achieves this (modulo a standard preprocessing step). The latter point ties together the two lines of work on robust mean estimation and heavy-tailed mean estimation in an especially satisfying way. While previous works like Cheng-Diakonikolas-Ge have looked at the dual SDP max_M min_w <Sigma_w,M>, they did so from an algorithmic perspective by using it as part of their descent procedure, whereas the nice idea of this work is to bound the value of this SDP in the *analysis*, for the sake of showing the existence of a stable set.

__ Weaknesses__: The only "limitations" are that 1) they still need to bucket before running iterative filtering on the bucketed averages for eps = o(1), and 2) the results for k-hypercontractive are not yet optimal, but these aren't really weaknesses so much as interesting follow-up questions.

__ Correctness__: I have verified that the proofs are correct.

__ Clarity__: The paper is very cleanly written, and the steps of the argument are quite easy to follow, especially because they do a good job providing intuitive explanations for the main steps.

__ Relation to Prior Work__: The author(s) clearly explain the relation of this work to previous works.

__ Reproducibility__: Yes

__ Additional Feedback__: It would be good to flesh out lines 140 to 143 a bit more. In particular, it would be good to directly compare the bound of Theorem 1.4 to the deterministic regularity condition bounds from the resilience papers, DHL19, DKK+17/DK19, etc.
Minor typos in supplement:
- equation in 310: N/((1-eps)n) -> 2N/((1-eps)n)
- equation in line 631 should use Q_0
- equation in line 686: w^1 -> w

__ Summary and Contributions__: This paper addresses the problem of computationally efficient robust mean estimation from a (potentially adversarially) corrupted sample of n datapoints, with the properties of:
1. Optimal dimension dependence of order \sqrt{d/n};
2. Optimal additive dependence of order \sqrt{\epsilon}, where \epsilon is the fraction of corrupted points; and
3. Optimal subgaussian rates of \sqrt{\log(1/\tau)/n}, where \tau is the failure probability.
In this context the paper makes the following contributions.
1. They show that, with exponentially high probability, a sufficiently large subset of i.i.d. samples satisfies stability (defined in DK) with optimal epsilon rate and subgaussian dependence, and near-optimal dimension dependence (d log d rather than d). This implies (see DK) the existence of a polynomial-time algorithm that obtains a near-optimal estimate of mu (in the sense of 1, 2, 3 above).
2. This is upgraded to the optimal rate (in terms of dimension) using a simple preprocessing step: the algorithm is run not on the original data, but on the empirical means of a random bucketing of the data. This analysis also relies on stability, and the bucketing idea comes from the median-of-means literature. This yields the first computationally efficient procedure satisfying all of 1, 2, 3 above.
3. However, the earlier stability analysis is not 'instance optimal' in the following sense: the distribution may have additional favorable properties, such as higher-order moment bounds, that improve the estimation rate in terms of epsilon. The paper establishes a stronger stability condition in this setting, again yielding a polynomial-time estimator.
Reference key:
DK: Diakonikolas and Kane "recent advances ...."
DKK+: Diakonikolas et al 'robust estimators in high ...'

__ Strengths__: - Robust mean estimation from contaminated data is a problem of central interest to the machine learning and statistics community. This paper will be of interest to the NeurIPS audience.
- The main strength of the paper is its focus on the stability structural property as a route to a robust and computationally efficient estimator.
- The paper also reduces stability, via Claim 2.1, to control of the mean and (generalized) second moment. These conditions are then established using empirical-process tools like Talagrand's concentration inequality. This may be of independent interest.
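To make the reduction concrete: stability of a weighted sample boils down to two checkable conditions, a mean-deviation bound and a covariance bound. A minimal sketch, with the caveat that the exact constants and normalizations in Definition 1.2 differ from the simplified thresholds assumed here:

```python
import numpy as np

def check_stability(X, w, mu, eps, delta):
    """Check two illustrative stability-style conditions for weights w:
    (1) weighted mean within delta of mu;
    (2) weighted covariance within delta^2/eps of the identity.
    Thresholds are simplified assumptions, not the paper's exact ones."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_w = w @ X
    Y = X - mu_w
    Sigma_w = (Y * w[:, None]).T @ Y
    mean_ok = np.linalg.norm(mu_w - mu) <= delta
    cov_ok = np.linalg.norm(Sigma_w - np.eye(X.shape[1]), 2) <= delta**2 / eps
    return mean_ok and cov_ok
```

The paper's contribution, in these terms, is showing that such weights exist on the clean points with the optimal delta, with exponentially high probability.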

__ Weaknesses__: I do not have significant weaknesses to point out for the paper. I have some comments on presentation and clarity in other portions of the review.

__ Correctness__: At this point I do not see significant correctness issues with the paper.

__ Clarity__: The paper is generally well-written and relatively easy to follow. Some comments:
1. There is inconsistent use of big-O notation and of the constants C, c notation. I would recommend using one of them as far as possible, preferably the latter. Some examples are in the additional comments section.
2. Def 1.2: Why is the generalized covariance compared to the identity? Should this not be defined strictly for covariance = identity? Otherwise, if Sigma is rank-deficient, stability would not hold for non-trivial delta.

__ Relation to Prior Work__: Related work is appropriately cited and introduced throughout the paper.

__ Reproducibility__: Yes

__ Additional Feedback__: Post author feedback, I realized I did not put these comments in the review. Apologies!
- L.19: 'highly suboptimal' is odd phrasing. This is repeated in a number of places.
- L.140: why would epsilon' not be order one? It is anyway bounded by 1. Perhaps this is meant to be \Theta(1)?
- L.157: 'non-constructive' should probably be 'computationally inefficient'.
- Something is off in equation 4: there is a w on the LHS but not on the RHS, where it is an optimization variable in the min-max formula. Probably this should be the second-moment discrepancy from Claim 2.1?
EDIT POST RESPONSE:
After reading the author response, I am maintaining my score on the paper.