Paper ID: 1413

Title: Supervised learning through the lens of compression

This paper deals with supervised multicategory classification, with general loss functions, in the classic and agnostic PAC settings. Learnability is related to compression, with several novel equivalences and separations established. The results are novel, interesting, and important; moreover, deep mathematical tools are cleverly deployed. There is no question that the paper should be accepted.

Please see the detailed comments in the pdf file: https://www.dropbox.com/s/frlmxiklqg6u53w/1413-review.pdf

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This paper studies sample compression schemes for classification problems. Concerning supervised learning, four settings are considered: zero/one loss vs. general loss functions, each in the realizable and in the agnostic case. It is shown in which of these cases learnability (or the uniform convergence property) implies compression and vice versa. Not only compression to constant size is considered, but also the more general version, where the size of the compression scheme is a function mapping the size of an original sample to the size of a compressed sample. For the study of general loss functions, a model of approximate compression schemes is introduced. As a result of this study, one obtains new theorems about:
- the (non-)equivalence of PAC learnability and agnostic PAC learnability in the multi-label classification / general loss function settings,
- a generalization of a known compactness result on learnability to the multi-label case,
- non-trivial compressibility implying logarithmic compressibility.
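For context, the size-as-a-function notion discussed above can be sketched as follows (my notation, not necessarily the paper's): a selection scheme consists of a compression map $\kappa$ and a reconstruction map $\rho$, where $\kappa$ maps a sample to a subsample together with side information measured in bits, and the size of the scheme on samples of a given cardinality is the worst-case length of the compressed representation:

```latex
% A selection scheme is a pair (kappa, rho): kappa compresses a sample
% to a subsample plus binary side information, rho reconstructs a hypothesis.
\[
  \kappa : (\mathcal{X}\times\mathcal{Y})^m \to
           (\mathcal{X}\times\mathcal{Y})^{*} \times \{0,1\}^{*},
  \qquad
  \rho : (\mathcal{X}\times\mathcal{Y})^{*} \times \{0,1\}^{*}
         \to \mathcal{Y}^{\mathcal{X}},
\]
% and the size of the scheme is the function
\[
  k(m) \;=\; \max_{S \,:\, |S| = m} \bigl|\kappa(S)\bigr|,
\]
% i.e., the largest compressed representation (subsample plus side
% information) produced on any input sample of size m.
```

Compression to constant size corresponds to $k(m)$ being bounded by a constant independent of $m$.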

The paper is well written. I have only minor suggestions for improvement of the presentation itself. Related work is adequately cited, with the exception of Samei et al.'s work (COLT 2014, ALT 2014) on sample compression in the multi-label categorization setting, which should be referenced so as not to leave the impression that this paper is the first to address that context. My major issue is with the significance of the results. The paper does not provide a substantial number of new results, although I have to say that it is quite insightful and would deserve publication at some good venue. Currently, I do not see that this paper will have great impact, and this is the only respect in which I consider it below the standard for NIPS. Do the authors have any concrete suggestions for how their results could advance research on sample compression or on supervised learning?

A few minor comments:
- line 51: is equivalent [to possessing an] approximate ...
- line 222: should start with "have" rather than "has"
- line 237: "exists" should be "exist"
- line 244: "phenomena" should be "phenomenon"
- line 249: "function" should be "functions"
- line 276: "a a"
- the font in the references appears too small
- in the references, some letters that should be capitalized are not (VC, PAC-Bayesian)

Post-feedback addition: I really suggest that the authors add some of the motivation given in their feedback to the paper.

2-Confident (read it all; understood it all reasonably well)

This paper studies learning with general label spaces, under either the 0-1 loss or, in some cases, more general losses, and establishes relationships between several quantities of interest in learning theory: namely, realizable-case PAC sample complexity, agnostic PAC sample complexity, the size of a sample compression scheme (in both realizable-case and agnostic-case variants), and the size of an approximate sample compression scheme (again, in both variants). Under the 0-1 loss, they show that the leading complexity factor in all of these quantities can be bounded by the others up to certain logarithmic factors. Under general losses, some of these relations break down, but they still find a relation between realizable-case sample complexity and realizable-case approximate sample compression, and between agnostic-case sample complexity and agnostic-case approximate sample compression. The paper also includes a very nice separation result, arguing that for losses other than the 0/1 loss (in fact, even for 3-valued losses), realizable-case learnability does not imply agnostic learnability. Specifically, they introduce an interesting construction of a label space and loss function such that realizable-case learning is trivial, but there are distributions in the agnostic case for which there exists a hypothesis of risk 1/2, yet it is not possible to identify a classifier of risk less than 3/4 from a finite data set. They establish the negative result by combining their equivalence of approximate compressibility and learnability with some combinatorial arguments on hypergraphs (of size a very, very large infinity); they also sketch a much simpler construction not requiring such large infinities.
Given the setup, it seems a direct proof of the negative result (not going through the compression equivalence) is also possible, but making use of the equivalence to approximate compressibility offers a nice simplification and ties the result in with the theme of the rest of the paper.
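For concreteness, the separation described above amounts to the following (in my own notation, which may differ from the paper's): there are distributions $D$ for which a good hypothesis exists, yet no learner can come close to it from any finite sample:

```latex
% Agnostic non-learnability despite realizable-case triviality:
% a hypothesis of risk 1/2 exists, but no algorithm A, at any
% finite sample size m, can guarantee risk below 3/4.
\[
  \exists\, h^{\star} \in \mathcal{H}:\;
     L_D(h^{\star}) = \tfrac{1}{2},
  \qquad\text{yet}\qquad
  \forall A,\ \forall m:\;\;
     L_D\bigl(A(S)\bigr) \ge \tfrac{3}{4}
     \quad \text{with high probability over } S \sim D^{m}.
\]
```

So the agnostic excess risk is bounded away from zero uniformly over sample sizes, which is exactly the failure of agnostic learnability.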

Most of the results established in the paper would, in the special case of binary classification, trivially follow from the known upper and lower bounds on sample complexity based on the VC dimension. However, the results were not previously known for multiclass learning and other general loss functions. The results for the 0-1 loss are not particularly surprising, but it is good to know that, for instance, in multiclass classification with the 0-1 loss, the complexity measure in the agnostic sample complexity is the same as that in the realizable case (up to log factors, but with no extra factors, such as log(|Y|), not present in the realizable-case sample complexity). They also prove a tighter lower bound than previously known for the sample complexity of uniform convergence for multiclass classification in Theorem 3.6. The techniques used in the proofs are mostly straightforward or have appeared previously in other related contexts. They seem primarily to be pointing out ways to combine known results and arguments to reach new conclusions. The separation result in Section 4.3 may be the exception, as I do not believe I have seen this kind of construction before this paper. There are also some novel aspects in the proof of the lower bound in Theorem 3.6. Even if most of the arguments are straightforward, the results add up to a fairly complete picture of how these quantities are related in general. Furthermore, as demonstrated by the separation result in Section 4.3, the results may be useful for simplifying proofs of learnability / non-learnability, by allowing one to focus on the existence / non-existence of finite approximate compression schemes. The paper is generally well written and easy to read. I did notice several grammatical mistakes, so a careful read-through to correct these would improve readability.
Minor comments:
- I think the upper bound in Theorem 3.6 can be improved, removing the log(1/epsilon) factor, simply by using the tail bound of Talagrand (based on the chaining technique) in place of the bound of Vapnik and Chervonenkis; this is the argument that appeared in the proof of Theorem 5 of Daniely, Sabato, Ben-David & Shalev-Shwartz (2015).
- In the second proof of Theorem 4.4 (last paragraph of the full paper), the fonts for X and Y need to be made consistent. Also, the range of the hypothesis class H should have Y_i in the union, rather than X_i.
- Finally, since most readers will find this second proof much more accessible than the first, it seems important to include enough detail that they can easily follow the argument. In particular, instead of merely claiming "one can verify" points (i) and (ii), it would be best to include these arguments explicitly.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

The main contribution of this paper is to prove that for 0-1 loss functions, learnability is equivalent to sample compression, while for general bounded loss functions, it is equivalent to an approximate sample compression scheme. The paper, then, discusses the implications of this equivalence. While one direction of this equivalence is easy to see, I find the other direction quite surprising but I need further clarification on its proof.

Specifically, the direction "compression -> learning" is easy to see. The surprising part (to me at least) is the reverse direction. The proof in Appendix C is easy to follow once we take for granted the first couple of equations. However, I could not understand how those equations could be deduced in the first place. The first equation is the claim that there exists $h\in\mathcal{H}_S$ such that $L_D(h) < 1/3$. Did the authors mean to say that $L_D(h) < 1/3$ holds with probability at least 2/3 (easy to see) or with probability one (not so easy to see)? Also, can the authors please elaborate on how the minmax theorem has been used to deduce the next equation? Since this is the main result of the paper, and everything else is built on it, I think that a detailed step-by-step proof in Appendix C would be prudent.

The implications of the equivalence are interesting and add value to this main result. However, I have a general remark (which the authors can feel free to ignore). Throughout the paper, it is never mentioned how the description length of an instance plays a role in the bounds. It should play a role because the definition of k(m) assumes that the side information is measured in bits, which is a rather arbitrary unit of measurement. This should show up as well in the proof of Theorem 4.1: by taking M to infinity, we are essentially increasing the description length of every training example. What if the description length of an instance is bounded (or, equivalently, the side information is measured in units of instances, not in bits)? The proof would no longer be valid, and it may turn out that sample compression is equivalent to learnability for general loss functions under this definition. This, however, is just a remark.

The paper is very well written. It is a pleasure to read. Some minor comments:
- In Line 40, ref 6 is cited twice.
- In Line 77, the authors claim that the results can be extended to more general loss functions. Can they be extended to loss functions with unbounded range?
- In the definition of the selection scheme in Line 97, does the reconstructed hypothesis have to belong to the same hypothesis space H? This should be clarified in Line 101.
- In Line 117, is "a selection scheme based learner fits its training data" a typo? Did the authors intend to say "a selection scheme based learner does not overfit its training data"?
- In the proofs, it would be easier to read if the authors denoted d(1/3,1/3) with a special mark, e.g. d^\star = d(1/3,1/3), instead of writing d = d(1/3,1/3).
- In Ref 10, the order of the authors should be exchanged.
- In Line 208, did the authors mean "for general label sets" instead of "for general hypothesis classes"?

Post-rebuttal remark: I have increased the scores after reading the clarification on the proofs. I find the minmax proof quite nice (and it could be a useful technique in the future). I find the equivalence between learnability and sample compression quite important. As the authors mentioned, it can be used to link learnability with other concepts as well, such as robust generalization.

2-Confident (read it all; understood it all reasonably well)

Explores the connection between learnability and compressibility using a selection scheme object introduced by the authors. Unfortunately, I have no idea what the "knowledge nuggets" are that the authors might wish a reader to take away.

I'd like to rate this paper highly, as it appears to be tackling a deep and important issue: the connection between learnability and compressibility. However, the paper gets deeply bogged down in terminology and many theorems, which cause the inexpert reader to lose all context as to what is being proven, or why it is significant. Some form of toy or practical example could really help. The paper needs a conclusion or summary, and/or a nice table that encapsulates the directions of the implications between compression and learning and when they hold or fail.

Line 103: actually write out k(m) = .... This is a crucial definition, and it's buried in the paragraph.
Line 260: reference? Review how or why?

Typos:
- Line 23: "to translate" --> "translation of"
- Line 32: VC --> spell out
- Line 222: "has" --> "have"
- Line 297: "hypothesys" --> "hypothesis"
- Line 319: "hypotheses" --> "hypothesis"

References are appearing as numbers, not author-year. My submission inserted them as author-year, but some papers I'm reviewing appear as numbers and some as author-year.

1-Less confident (might not have understood significant parts)

This paper studies the relation between sample compression schemes and learnability. First, the authors establish the equivalence of learnability and compressibility under zero/one loss functions. For general loss functions, even though the equivalence between sample compression schemes and learnability does not hold in general, the authors define approximate sample compression schemes and prove that these are equivalent to learnability.

1. The theorems in this paper show the relationships between learnability and compressibility. However, some explanation after each theorem would be really useful in helping readers understand their importance.
2. The presentation of the paper could be improved.

1-Less confident (might not have understood significant parts)