Reviews: Minimizers of the Empirical Risk and Risk Monotonicity

The paper considers the behaviour of learning curves, and formally defines a notion of risk monotonicity. It formally proves that for the commonly used square and hinge loss, these may be non-monotone for certain discrete distributions (i.e., that optimal performance may be achieved on a finite number of samples). Additional results are shown for density estimation, and several implications and resulting questions are discussed. The problem considered here is fundamental in nature, and concerns the intriguing phenomenon of performance getting worse with more samples. As noted in the introduction, this behaviour is non-intuitive, and certainly warrants careful study. The paper is rather well-written, and does a nice job of surveying the existing literature on this topic, some of which have demonstrated the existence of non-monotone learning curves. Compared to these works, the present paper aims to formally prove that non-monotonicity is possible for low-complexity learners, and for the loss being used for training (e.g., hinge). This is a reasonable setting for theoretical study, although distinct from the current trend in analysing generalisation properties of neural networks, such as the double-descent and jamming phenomena noted in line 99. While it would arguably be of more practical interest to directly study the latter, I think it of interest to demonstrate that non-intuitive behaviour is possible even in seemingly benign settings. The basic definition of monotonicity is sensible, and not particularly surprising: one asks that a learner yield lower expected risk (over the draw of the training sample) for any possible distribution over the inputs, and possibly for only sufficiently large sample sizes. Certainly it is useful to explicate the notion of monotonicity, but I did feel there was scope to shorten the discussion a bit to allow more space for the subsequent sections (see below). The theoretical results first present examples of learners that are provably risk-monotone, and then learners that are not. The former are useful to give some grounding as to where one obtains the expected behaviour, and include the simple case of estimating the mean of a Gaussian with known variance. It appeared that the discussion on fitting a categorical distribution was a bit more speculative: the authors state that "It therefore seems reasonable to expect that this learner acts risk monotonically. One route towards making a rigorous argument...". My suggestion is to compress or omit this case if a formal result is lacking, again to make space for the results that are actually proven, and also because the main interest in the paper is in the cases where we do not have monotonicity. The results for non-monotone learners involve a discrete distribution defined over two instances. Here, it is shown that unless the loss function satisfies a particular condition, one will not have monotonicity. At this stage, I found the discussion a little lacking: one has to wait until Sec 6 for the details of the result to get some mention, and even then the details of the setup considered remain mysterious. Specifically, it is not made particularly obvious how to interpret the condition on the loss in Lemma 1 or Remark 2, and to assess whether it is reflective of a scenario likely to be encountered in practice. As hinted in the discussion preceding, and explicated in the proof, the distribution where the bad behaviour is manifest involves there being two regions of space, one of which is very rare compared to the other. The condition on the loss is effectively a condition on how well the learner can perform on the dominant region, as a function of the sample size. The paper then uses this result to prove non-monotonicity is possible for learners using the square, hinge and absolute losses. Essentially, this relies on establishing that the condition on the loss function from Remark 2 is not satisfied for certain distributions. While the result appears correct, I again was hoping for some discussion on how to interpret the result: can one give some sense as to what is the quirk in the distribution which interacts badly with the studied losses? Though perhaps non-trivial to construct, it would've been ideal to see an example similar to that of [Loog and Duin 2012], Sec 2, which neatly provides intuition as to why non-monotonicity is possible for LDA. The empirical illustrations are on synthetic data, which I do not object to as means of demonstrating that non-intuitive phenomena are possible. Per earlier comments, one does wonder whether one can isolate specific properties of real-world distributions that increase the risk of non-monotone behaviour; and whether one can thus obtain similar results to the double-descent curve in the present setting. Minor comments: - line 15, "27" appears abruptly. - citation style, \citet is sometimes incorrectly used. - the paper notes that prior work on the possible non-monotonicity of curves relies on misspecified models. It was not clear if the the present paper fares differently. Is perhaps the point that the condition on the loss is an abstract statement that may hold whether or hold the Bayes-optimal predictor is in our class? - Sec 4.2 heading, append "monotonically" - "is a distribution where a very small part of the density is located relatively far away" -> it was not clear how exactly this is captured in Lemma 1. Presumably q -> 0 is meant to capture the "very small part of the density", but where is the assumption that the parts are "relatively far away"? Is this implicit in the assumption on the loss function? - Sec 4.2, for symmetry, consider starting with the case of fitting a Gaussian with unknown mean and variance, to act as counterpart to the beginning of Sec 4.1. - Sec 5, why do you need different values of q for the different losses? - a natural question is whether one can obtain stronger results about the shape of the learning curve (e.g., double-descent) in the simple scenarios considered here, rather than purely establishing that the behaviour is not monotone.

Update: I have reviewed the author feedback and was satisfied with some of the answers I have received. The motivation for the definition itself is still not clear enough to me (more specifically - its strictness) but I believe I can be patient and let the research of different but similar definitions be postponed to future work (as mentioned toward the end of the paper). Good luck. Original review below: The idea of the paper is interesting and discusses the fact that expected risk does not necessarily improve with larger data sets. It is clearly written and the points (and proofs) come across very neatly. On the other hand, the authors state that intuition would lead to the fact that the learning curve is monotonous. While it is an interesting discussion, I believe it is rather general to assume this and I believe that there are quite a few researchers that would put this statement to doubt. The analysis of the example distributions is interesting and clear, and the empirical results provide neat support to the proven theorems and examples. For this reason choose to accept this paper. The reason for my low acceptance score is that I believe the paper would have been more whole with more research of the exact conditions for monotonicity. The discussion that monotonicity is non-trivial is a large part of the paper which I believe could have been navigated towards the mentioned ideas. With further research regarding the conditions and the relation to PAC learnability (and combinatorial measures such as the VC dimension) I think this would be a very good paper. One last point which I would like to address: the notion of monotonicity seems rather strict. Would the proofs hold if instead of monotonicity you would require convergence of the risk? Best of luck

Paper ID:	4074
Title:	Minimizers of the Empirical Risk and Risk Monotonicity

Reviewer 1

Reviewer 2

Reviewer 3