Paper ID: | 3930 |
---|---|

Title: | Toward a Characterization of Loss Functions for Distribution Learning |

The paper studied loss functions for distribution approximation. It define various properties and prove different Theorems that show some nice inequalities with respect to these properties. Yet, the main punchline in the paper is missing. Why these properties are really important? I would expect to see a real example where the analysis made in the paper helps with. This part is lacking in the paper. So it is not really clear what is the applicability of the derived results. Also the claim that only log-loss is used for learning distributions is not true. Unless I miss something, generative models are designed to learn distributions and they don't use only the log-loss. Where do they fit in the story? I think that giving them as an example and studying for example the earth moving distance and seeing what properties it gives etc. will make the paper stronger and clearer of why it is important. At the moment, it is far from clear to me.

The paper is written on a good level and includes the ideas which allow to characterize the properties of loss-function for distribution learning on the finite discrete domain. Main such property is the calibration which allows to state several results on the size of sample size which ensures concentration of the empirical loss around true loss. However, there are some points, which to my mind should have been enlighted more in paper. Namely, for the case of log-loss function optimizing a loss-function with respect to measure $q$ is the same as minimizing the KL-divergence for given measure p and q - to choose. Authors only firstly mention this when considering the example of properness of the log-loss function. Results for the concentration sample size of the empirical loss function (provided restriction to calibrated distributions) are interesting, but, taking into account the folklore result on the log-loss sample complexity are not surprising. Several typos: line 137: will be are characterized line 138 : we will generally assume functions are differentiable. Efficient implementation strategy would be interesting to consider in the paper as well as the explicit procedure how we build the estimator $\hat{p}$. Is it the estimator based on the relative frequencies of occuring of each element? In this case what value should be considered for the elements which are in the domain but did not occure in the sample?

Originality: This work is original; the idea of restricting to calibrated distributions and subsequently finding a condition to obtain properness is new AFAIK. Clarity: For readers somewhat familiar with the material surrounding the paper content, the paper is clear. The organization of the paper is good. All of the proofs I checked were correct, aside from typos/what I think are minor errors that do not affect the result. However, I think the example from lines 218 to 223 is incorrect, where the authors try to show an example where the linear loss is not calibrated sample proper. Significance: For me, this is the weak point of the paper. Firstly, I think Theorems 3,4 can be improved (see comments in contributions section). Second, though mathematically nice, the restriction to calibrated distributions seems potentially problematic. Knowledge of the set of calibrated distributions is essentially knowledge of the target distribution itself. Thus, it is unclear why properness under restriction to calibrated distributions is significant. However, there may be a way to simultaneously learn calibrated distributions and minimize a loss over currently learned calibrated distributions; if the authors show something like that, in my opinion it would be very significant (although that seems worthy of another paper). The note on constructing approximately calibrated distributions in the appendix is a step towards this, but is far from it. Quality: I would score 6 or 7 at the moment. Here is a list of errata/points to be clarified: - Line 137: "Our main results will be are" - Line 214: "Since it is not a proper loss function over \Delta_X, by definition it is not sample proper over \Delta_X for any m(\epsilon, \delta, N)." I don't follow this justification - Line 221: If q_x is equal to 0 for x=N/2 + 1,..., N, how is it calibrated w.r.t p_x, which is nonzero for x=N/2+1,...,N? Also, \sqrt{m} \pm 2\sqrt{m} seems off, and I do not know why there is a -\Theta(1/N). - Line 237: reasonable loss -> reasonable losses - Theorem 1 should probably say strictly concave function f - Line 301: for where -> where - Line 412: Equation underneath should have a minus sign instead of a plus sign - Derivation under line 438: I think the third line should have an 8 in the denominator, not a 4. Also, I don't think \epsilon \leq p(B) is generally true, since if p=[0,0,0,0,1/2,1/2], B={1,2,3,4,5,6}, then t(B) = 1/6, so \epsilon = \sum_i |p_i-1/6| = 4/3, but p(B) = 1. I think \epsilon \leq 2p(B) is true, which adds another factor of 2 to the denominator. With the previous change, there is a 16 in the denominator, which doesn't change the end result. - Line 457: parenthesis in subscript - Line 463: I think c_2^{1/(1-r)} should be in the denominator, not c_2^{r/(1-r)} - Line 466: What's the point of Bernstein's inequality here instead of the simpler Hoeffding's? - Line 474 derivation: Why is \sum_x p_x^{1-r} = N^r? (It is \leq N, which still lets the result go through) - Line 477 equation underneath: How do you get a net power of N^{r-2}? I get N^{2r-3} (which still lets the result go through) - Observation 1. in appendix E does not seem to be correct, as despite e^{-x} being nonnegative, twice differentiable, decreasing, and convex, e^{-1/x} is not concave on strictly positive reals. - For the proof of observation 2, casework seems unnecessary, and there is an extra minus sign under line 509. - Line 537: q_t should be q_x. Edit after response: I think the authors' fixed example works. Now the linear loss is not calibrated sample proper when m is o(N), which is still saying something since the authors give losses that are calibrated sample proper with m=O(polylog(ln N)). I buy that with constant probability, \hat p_1 \leq 1/4 - \Theta(1/sqrt{m}) and \hat p_2 \leq 1+ 1/4 + \Theta(1/\sqrt{m}), since with constant probability a pair of i.i.d. Gaussians are less than 1 sd below/greater than 1 sd above their mean, but a more rigorous argument should be used in the final version. On the calibration assumption, I think the authors clarifying that this work is currently more applicable to evaluating algorithm outputs than designing learning algorithms, and the references cited on designing calibrated prediction schemes are good justification for the work. It would still be nice to have a clearer link between this work and existing distribution learning algorithms. My final score is a strong 7.