Paper ID: 737
Title: Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
Reviews

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper demonstrates calibrated convex surrogate losses for multiclass classification. If the n by k loss matrix (where the label set has size n and the prediction set has size k) has rank d, minimizing this surrogate loss corresponds to a d-dimensional least squares problem, and converting the solution to a prediction corresponds to a d-dimensional linear maximization over a set of size k. The paper describes two examples (precision at q and expected rank utility) for which n and k are exponentially large, but predictions can be calculated efficiently. It describes two more examples (mean average precision and pairwise disagreement) with exponential n and k for which the calibrated predictions apparently cannot be calculated efficiently, but efficiently computable predictions are calibrated for restricted sets of probability distributions.
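
For concreteness, a rough sketch of the construction in our own notation (the paper's exact definitions may differ): if the loss matrix factors as $L(y,t) = \langle a_y, b_t \rangle + c_t$ with $a_y, b_t \in \mathbb{R}^d$, the conditional risk of predicting $t$ is $\langle \mathbb{E}[a_Y \mid x], b_t \rangle + c_t$, which is linear in the $d$-dimensional quantity $\mathbb{E}[a_Y \mid x]$. The surrogate regresses onto this quantity via squared loss, roughly $\psi(u,y) = \|u - a_y\|_2^2$, and a prediction is then obtained by the linear optimization $\arg\min_{t} (\langle \hat{u}(x), b_t \rangle + c_t)$ over the $k$ possible predictions (a maximization under the opposite sign convention).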

This is an exciting contribution that seems to be important for a wide variety of multiclass losses of great practical significance. Technically, it combines two results from ref [16]; in particular, the important low-rank observation already appeared in [16]. The paper is clearly written.

The main criticism is of the results on calibration with respect to restricted sets of probability distributions. For instance, Theorem 4 gives a result that the efficiently computable prediction rule is calibrated for a certain family P_reinforce. Why is this family interesting? What is the intuition behind it? Are there interesting examples of distributions that satisfy these conditions? Similar questions apply to the set of distributions considered in Theorem 7.

Minor comments:
line 227: 1(\sigma^{-1}(i)\leq q) should be 1(\sigma(i)\le q). (Similarly at line 236)
242: 1 and 0 interchanged.
254: Having max(y_i-v,0) in place of simply y_i seems silly. Why not just redefine the set of labels as {0,1,...,s-v}^r?
320: u^p_{ij} is only defined for i\geq j. The appendix mentions that it is made symmetric, but not the paper.
Q2: Please summarize your review in 1-2 sentences
The paper demonstrates calibrated convex surrogate losses for multiclass classification that are especially important when the loss matrix has low rank. This is an important contribution that applies to several significant losses.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper shows how to design a convex, least-squares-style surrogate loss for high-arity multiclass losses that have a low-rank structure. The paper is very well written.

Only comment: given your introduction of the loss matrix \mathbf{L}, why not state the condition in Theorem 3 in matrix form?
Q2: Please summarize your review in 1-2 sentences
This is a very nicely written paper that shows how to design convex surrogates for multiclass losses in a manner that makes the surrogate more tractable. This works when the target loss has a low-rank structure. As is explained in the paper, a number of popular losses have this structure, and the authors show how to construct the efficient surrogate.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper considers convex calibrated surrogates for consistency. In particular, it constructs a least-squares surrogate loss that can be calibrated for ranking problems associated with a low-rank target loss matrix. The results seem novel and interesting, and are potentially of interest to the NIPS audience since ranking is a popular topic.

However, the choice to consider low-rank target loss matrices could be better motivated.
Q2: Please summarize your review in 1-2 sentences
The paper addresses convex calibrated surrogates for ranking with a low-rank loss matrix, and the results seem novel and interesting. However, the choice of low-rank loss matrices could be better motivated.

My review confidence is not high, since I have no research experience in ranking.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for the careful reading and helpful feedback. Below are brief responses to the main questions/concerns.

Reviewer 6 -

Intuition for noise conditions: The noise conditions in Theorems 4, 6, and 7 can essentially be viewed as conditions under which it is `clear and easy' to predict a ranking:

- Theorem 4: The ideal predictor (line 296) uses the entire $u$ matrix, but the predictor in Theorem 4 uses only the diagonal elements. The noise condition P_reinforce can be viewed as requiring that the diagonal elements dominate and enforce a clear ordering among themselves. E.g., one simple (though extreme) example where the noise condition holds is when the relevance judgements associated with different documents (conditioned on the query) are independent of each other.

- Theorem 6: The noise condition P_f here requires that the expected value of the function f used in the surrogate must decide the `right' ordering. As mentioned in the paper, it is a generalization of the condition in Duchi et al. [11].

- Theorem 7: The condition P_DAG here is somewhat easier to understand than the others. In particular, it implies that the expected pairwise preferences are acyclic, and this is exploited by the algorithm used in the predictor. What makes this condition especially interesting is that it contains all the conditions in Theorem 6 (and therefore is more general than these conditions, including in particular that of Duchi et al. [11]).

We will try to include some of these intuitions for the conditions in the final version. Thanks also for pointing out the typos - we really appreciate it.

Reviewer 9 -

Motivation for considering low-rank losses: The motivation is simply that such losses occur very often in natural problems. Indeed, all four ranking losses considered in the paper are such examples. Further examples arise in various structured prediction problems; e.g., the Hamming loss used in sequence prediction is another such loss, as briefly illustrated below.
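
As a brief illustration of the low-rank structure (notation ours, not the paper's): for sequences of length $r$ over an alphabet of size $s$, the Hamming loss $\ell_{Ham}(y,t) = \sum_{i=1}^r 1(y_i \neq t_i)$ can be rewritten as $r - \sum_{i=1}^r \sum_{\sigma} 1(y_i = \sigma)\, 1(t_i = \sigma)$, so the corresponding $s^r \times s^r$ loss matrix has rank at most $rs + 1$ even though the label and prediction spaces are exponentially large.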