__ Summary and Contributions__: The paper considers the problem of robust mean estimation when the data is both heavy-tailed and corrupted with outliers. In this adversarial setup, the authors show that existing estimators do achieve the near-optimal subgaussian non-asymptotic rate, with optimal dependence on the contamination level \epsilon, under a bounded second-moment condition. Moreover, when the covariance is known, their estimator achieves the optimal \epsilon^{1-1/k} rate under a bounded k-th central-moment condition. The key technical result is to show the existence of a "good"/stable subset, which is at the heart of efficient robust estimators.

__ Strengths__: I think the paper is technically strong, and the results are very significant. It shows that existing spectral-filtering-based approaches do indeed give optimal rates for heavy-tailed estimation, which I find very interesting.
### After Rebuttal ###
I thank the authors for their thoughtful response. After looking at it and reading the other reviews, my score remains the same.

__ Weaknesses__: Nothing major.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: The authors state that [DL19,LLVZ19] work only in the additive contamination setting, and not in the strong contamination model. Why is that?

__ Summary and Contributions__: The paper presents a robust mean estimator with sub-Gaussian concentration.
---
Post-rebuttal: I thank the authors for their feedback.

__ Strengths__: The estimator runs in polynomial time.

__ Weaknesses__: The authors did not explicitly specify the polynomial complexity, thus the algorithm may still be impractical.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: CONTEXT
The workhorse behind the robust statistics results of the last few years is an iterative filtering procedure that eventually identifies a subset of the corrupted data whose empirical mean is close to the true mean. This framework works under a "stability" assumption on the clean data: any large subset has empirical mean/covariance close to the true mean/covariance.
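For concreteness, the filtering loop described above can be sketched roughly as follows. This is my own illustrative simplification, not the paper's exact procedure: the constant threshold and the one-point-per-round removal are assumptions (real filters remove many points at once, with randomization to bound how many clean points are discarded).

```python
import numpy as np

def filter_mean(X, iters=20, thresh=9.0):
    """Illustrative spectral filtering: repeatedly score points along the
    top eigenvector of the empirical covariance and drop the worst one."""
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    for _ in range(iters):
        S = X[keep]
        mu = S.mean(axis=0)
        cov = np.atleast_2d(np.cov(S.T))
        evals, evecs = np.linalg.eigh(cov)
        if evals[-1] <= thresh:
            break  # covariance already bounded: the kept set looks stable
        v = evecs[:, -1]  # direction of excess empirical variance
        scores = ((X - mu) @ v) ** 2
        scores[~keep] = -np.inf
        keep[np.argmax(scores)] = False  # remove the most outlying point
    return X[keep].mean(axis=0)
```

On data with identity covariance plus a few gross outliers, the loop removes the outliers until the top eigenvalue falls below the (assumed) constant threshold, then returns the mean of the surviving points.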
Another line of work, starting with Lugosi-Mendelson '19, has focused on getting sub-Gaussian rates for mean estimation of heavy-tailed (uncorrupted) distributions in high dimensions; e.g., for identity-covariance distributions, the goal is to obtain \sqrt{d/n} + \sqrt{\log(1/\delta)/n} rates (as opposed to \sqrt{d/(n\delta)} via the empirical mean or \sqrt{d\log(1/\delta)/n} via the geometric median). Subsequent works have culminated in estimators that achieve such optimal sub-Gaussian rates in near-linear time and that can also tolerate *additive* corruptions.
------------------------------------------------------------
THIS WORK
This work is the first to get the best of both worlds: it gives a refined analysis of the stability-based framework from the robust statistics literature, showing that (modulo a standard bucketing preprocessing step) the iterative filtering approach already achieves optimal sub-Gaussian rates for mean estimation, even in the presence of corruptions in the strong contamination model (where the adversary can both add and delete points). They also show that their analysis extends to the k-hypercontractive, identity-covariance case, at the loss of an extra \sqrt{\log d} factor.
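The bucketing preprocessing mentioned above can be sketched as follows (a hedged illustration of the median-of-means idea: the choice of bucket count k is a tuning parameter, typically of order \log(1/\delta) plus the number of corrupted points, and is my assumption, not a prescription from the paper):

```python
import numpy as np

def bucket_means(X, k, rng=None):
    """Median-of-means style preprocessing: randomly partition the n
    points into k buckets and return the k bucket averages. Filtering
    is then run on these averages instead of the raw points."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    idx = rng.permutation(len(X))  # random partition of the sample
    buckets = np.array_split(idx, k)
    return np.stack([X[b].mean(axis=0) for b in buckets])
```

The point is that bucket averages of heavy-tailed data concentrate much better than individual samples, while each corrupted point spoils at most one bucket.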
------------------------------------------------------------
TECHNIQUES
The main technical contribution of this work is to strengthen the existing bound on the existence of a large stable subset within a collection of i.i.d. samples from a distribution with bounded covariance, in the high-probability regime. The key idea is as follows. For w a set of high-min-entropy weights on the clean points and Sigma_w the weighted empirical covariance, a standard step in the robust mean estimation papers is to solve the minimax problem of searching for w minimizing ||Sigma_w - I|| (the max player picks a test matrix in the spectrahedron). Crucially, they pass to the maximin problem, which then becomes an empirical-process question, where the empirical process is indexed over the spectrahedron, and one can then reason about the concentration of min_w ||Sigma_w - I|| using Talagrand's concentration inequality. This establishes the covariance bound in the definition of stability; the mean deviation bound then follows by taking the w minimizing ||Sigma_w - I|| and applying similar reasoning to refine w further.
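For a fixed weight vector w, the inner quantity ||Sigma_w - I|| discussed above is just a top-eigenvalue computation; a minimal numerical sketch (illustrative only — a real solver would also optimize over w, e.g. via multiplicative weights or the filtering loop):

```python
import numpy as np

def sigma_w_distance(X, w, mu=None):
    """Spectral-norm distance ||Sigma_w - I|| for a weight vector w, where
    Sigma_w = sum_i w_i (x_i - mu_w)(x_i - mu_w)^T with w normalized."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_w = w @ X if mu is None else mu  # weighted empirical mean
    Y = X - mu_w
    Sigma_w = (Y * w[:, None]).T @ Y    # weighted empirical covariance
    d = X.shape[1]
    return np.linalg.norm(Sigma_w - np.eye(d), ord=2)
```

The max player's test matrix corresponds to the rank-one projector onto the eigenvector achieving this spectral norm; downweighting outliers shrinks the value.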
POST-AUTHOR FEEDBACK: Thanks for providing additional details on the sample complexity achieved by prior works. My opinion is unchanged; that is, I continue to be in favor of acceptance.

__ Strengths__: This work is very nice for simultaneously 1) answering the open question of efficiently getting optimal sub-Gaussian rates for heavy-tailed mean estimation in the strong contamination model, and 2) showing that the existing filtering approach *already* achieves this (modulo a standard preprocessing step). The latter point ties together the two lines of work on robust mean estimation and heavy-tailed mean estimation in an especially satisfying way. While previous works like Cheng-Diakonikolas-Ge have looked at the dual SDP max_M min_w <Sigma_w,M>, they did so from an algorithmic perspective by using it as part of their descent procedure, whereas the nice idea of this work is to bound the value of this SDP in the *analysis*, for the sake of showing the existence of a stable set.

__ Weaknesses__: The only "limitations" are that 1) they still need to bucket before running iterative filtering on the bucketed averages for eps = o(1), and 2) the results for k-hypercontractive are not yet optimal, but these aren't really weaknesses so much as interesting follow-up questions.

__ Correctness__: I have verified that the proofs are correct.

__ Clarity__: The paper is very cleanly written, and the steps of the argument are quite easy to follow, especially because they do a good job providing intuitive explanations for the main steps.

__ Relation to Prior Work__: The author(s) clearly explain the relation of this work to previous works.

__ Reproducibility__: Yes

__ Additional Feedback__: It would be good to flesh out lines 140 to 143 a bit more. In particular, it would be good to directly compare the bound of Theorem 1.4 to the deterministic regularity condition bounds from the resilience papers, DHL19, DKK+17/DK19, etc.
Minor typos in supplement:
- equation in 310: N/((1-eps)n) -> 2N/((1-eps)n)
- equation in line 631 should use Q_0
- equation in line 686: w^1 -> w

__ Summary and Contributions__: This paper addresses the problem of computationally efficient robust mean estimation from a (potentially adversarially) corrupted sample of n datapoints, with the properties of:
1. Optimal dimension dependence of order \sqrt{d/n};
2. Optimal additive dependence of order \sqrt{\epsilon}, where \epsilon is the fraction of corrupted points; and
3. Optimal subgaussian rates of \sqrt{\log(1/\tau)/n}, where \tau is the failure probability.
In this context the paper makes the following contributions.
1. They show that, with exponentially high probability, a sufficiently large subset of i.i.d. samples satisfies stability (defined in DK) with optimal epsilon rate and subgaussian dependence, and near-optimal dimension dependence (d log d rather than d). This implies (see DK) the existence of a polynomial-time algorithm that obtains a near-optimal estimate of mu (in the sense of 1, 2, 3 above).
2. This is upgraded to the optimal rate (in terms of dimension) using a simple preprocessing step: the algorithm is run not on the original data, but on the empirical means of a random bucketing of the data. This analysis also relies on stability, and the bucketing idea comes from the median-of-means literature. This yields the first computationally efficient procedure satisfying all of 1, 2, 3 above.
3. However, the earlier stability analysis is not 'instance optimal' in the following sense: the distribution may have additional favorable properties, such as higher-order moment bounds, that improve the estimation rate in terms of epsilon. The paper establishes a stronger stability condition in this setting, again yielding a polynomial-time estimator.
Reference key:
DK: Diakonikolas and Kane "recent advances ...."
DKK+: Diakonikolas et al 'robust estimators in high ...'

__ Strengths__: - Robust mean estimation from contaminated data is a problem of central interest to the machine learning and statistics community. This paper will be of interest to the NeurIPS audience.
- The main strength of the paper is its focus on the stability structural property as a route to a robust and computationally efficient estimator.
- The paper also reduces stability, via Claim 2.1, to control of the mean and (generalized) second moment. These conditions are then established using empirical-process tools like Talagrand's concentration inequality. This may be of independent interest.
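To make the reduction concrete: stability of a weighted sample boils down to two checkable conditions, a mean-deviation bound and a covariance bound. A minimal sketch, with the caveat that the exact constants and normalizations in Definition 1.2 differ from the simplified thresholds assumed here:

```python
import numpy as np

def check_stability(X, w, mu, eps, delta):
    """Check two illustrative stability-style conditions for weights w:
    (1) weighted mean within delta of mu;
    (2) weighted covariance within delta^2/eps of the identity.
    Thresholds are simplified assumptions, not the paper's exact ones."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_w = w @ X
    Y = X - mu_w
    Sigma_w = (Y * w[:, None]).T @ Y
    mean_ok = np.linalg.norm(mu_w - mu) <= delta
    cov_ok = np.linalg.norm(Sigma_w - np.eye(X.shape[1]), 2) <= delta**2 / eps
    return mean_ok and cov_ok
```

The paper's contribution, in these terms, is showing that such weights exist on the clean points with the optimal delta, with exponentially high probability.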

__ Weaknesses__: I do not have significant weaknesses to point out for the paper. I have some comments on presentation and clarity in other portions of the review.

__ Correctness__: At this point I do not see significant correctness issues with the paper.

__ Clarity__: The paper is generally well-written and relatively easy to follow. Some comments:
1. There is inconsistent use of big-O notation and of the constants C, c notation. I would recommend using one of them as far as possible, preferably the latter. Some examples are in the additional comments section.
2. Def 1.2: Why is the generalized covariance compared to the identity? Should this not be defined strictly for covariance = identity? Otherwise, if Sigma is rank-deficient, stability would not hold for non-trivial delta.

__ Relation to Prior Work__: Related work is appropriately cited and introduced throughout the paper.

__ Reproducibility__: Yes

__ Additional Feedback__: Post author feedback, I realized I did not put these comments in the review. Apologies!
- L.19: 'highly suboptimal' is odd phrasing. This is repeated in a number of places.
- L.140: why would epsilon' not be order one? It is anyway bounded by 1. Perhaps this is meant to be \Theta(1)?
- L.157: 'non-constructive' should probably be 'computationally inefficient'.
- Something is off in equation 4: there is a w on the LHS but not on the RHS, where it is an optimization variable in the min-max formula. Probably this should be the second-moment discrepancy from Claim 2.1?
EDIT POST RESPONSE:
After reading the author response, I am maintaining my score on the paper.