NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5227 On the Value of Target Data in Transfer Learning

### Reviewer 1

Originality: I think the analysis given is novel, and related work is sufficiently cited and contrasted with the proposed discrepancies. What is especially nice is the section with examples that shows the benefit of the proposed discrepancies; the authors even go to the length of trying to fix the $\gamma$ discrepancy, but show that even this fixed version may not be competitive with their proposed measure. I also think that the distributional conditions are novel and not usually introduced in the theoretical analyses I've seen before. It also has many contributions.

Regarding citations:

- [6] has recently been published in JMLR; please update the citation.
- Line 157: please add the Wasserstein notion to the related work.

Quality:

- The work is extremely technical and, sadly, I could not check the proofs.
- The work is purely theoretical, with no practical component.

The work seems completely finished, and all proofs are there in the supplement. There are some typos but nothing major; I think the work was well proofread.

3) One major issue I find is that the authors do not reflect on their own work, its limits, or its applicability. There is no Discussion, Conclusion or Future Work section. I think this is very important and should be added.

4) In particular, for a NIPS audience that is a bit more interested in practical algorithms, it would be nice if the authors reflected on whether the proposed algorithms and theory are practically applicable. I'm not asking for a practical algorithm (which is clearly out of the scope of this theoretical paper, which already has enough contributions), but the authors could at least say something about what steps are necessary to bring this theory into practice. For example, are the proposed algorithms computable? Are the optimizations doable (convex?), or are they perhaps NP-hard? What is the complexity? And, for example, when we need distributions f (for the reweighting algorithm), how would we go about choosing them?
I think the paper would be much more influential if the authors could say something about this (even if only one or two paragraphs), because it could help non-theory people try to build practical algorithms based on the insights of this work.

5) I don't see how the d_A and d_\Gamma discrepancies imply any rates (I have not seen those in the literature). Can you back this up with an explanation or citation?

6) Line 34: Do I understand correctly that, for the hypothesis-transfer approach, it is not simply \hat{h}_P or \hat{h}_Q that is selected? This was a bit unclear on my first reading.

7) Line 245: Why 'for our purposes'? This is also vague. Is this choice necessary?

8) Line 253: Well chosen in what way? I find this a bit vague.

9) Line 251: What is near-minimal? Why? Can you elaborate on this?

10) Line 266: Could you say something about the choice of the set P? Could we choose a set of parametric functions or non-parametric ones? Do they need support on all of X, or could it be sufficient to choose functions that only have support on the observed samples S_P and S_Q?

These clarifications could perhaps be added to the supplement.

Clarity: I think the work is extremely dense and difficult to read. It took me a long time, and I had to go back and forth a lot. I think the clarity can be greatly improved. After reading it and going back to the introduction, I also could not match all of the content with everything mentioned in the Introduction. Either some things or arguments are missing, or they were not stated explicitly in the main body, or it was too unclear to match everything up (which, after spending so much time reading this paper, you would hope I would be able to do). I will now give some suggestions to improve the clarity:

11) In the introduction, give a list of contributions, and in which sections we can find them.
Try to go over all sections; then I don't need to figure out what is where by myself, and I know in what order I can expect the material to come. It would also be nice if you mentioned which theorems and algorithms are where, so that, if needed, you can refer to them in the main text before they are introduced. For example, in line 136 you refer to a Theorem that has yet to come, and in line 202 I would find it useful to have a reference to the Theorem that does that (the best upper bound).

12) Introduction: I think it would be good to give a motivation and an explanation of the problem for a broader audience. Why do we care about transfer learning in the first place? Possibly also give a short explanation of what transfer learning is. You could also refer to a survey paper for interested readers.

13) I also think that, if you add a Discussion and/or Conclusion section, you could summarize and point back to the work, and reiterate the main points. I was missing this.

14) After reading the paper and going back to the introduction, I feel some things are missing:

14.1) Line 30: I did not see where it was shown that any classifier that has access to both P and Q would not achieve the optimal Q rate. Maybe this can be made more explicit.

14.2) Line 31: It would be good to give, in the corresponding section, an argument that explains why naive cross-validation gives a suboptimal rate.

14.3) Line 33: What is hypothesis transfer? Reference / citation?

14.4) Line 36: I found it a bit confusing that the marginal transfer-exponent actually depends on h^*. So it does not depend only on the marginal, but also indirectly on P(Y|X) through h^*. Maybe it's good to clarify that or make it a bit more explicit.

14.5) Line 38: After reading the paper, it is still a bit unclear to me why we need the marginal exponent $\gamma$ instead of $\rho$ for the case of unlabeled data.
After all, both $\rho$ and $\gamma$ depend on P(Y|X) through h_P^*. Or is it because, in the case of unlabeled data, we can estimate h_P^*, and therefore $\gamma$ is more applicable than $\rho$, since $\rho$ depends on the entire $P(X,Y)$?

14.6) Line 33: The generic approach is introduced first (Section 5), and then the oracle algorithm that chooses either h_P or h_Q is introduced (Section 6). This does not match the order of the introduction. Or it may be that I'm mistaken and this is not Section 6; in that case I simply could not find this argument, namely that an oracle that ignores data can be optimal (but how to choose h_Q or h_P?).

15) Line 68: 'However, a significant downside to these notions ...' What does 'these notions' refer to? The proposed transfer-exponent notions, or the KL-divergence and Renyi divergence?

More detailed comments:

16) Line 36: The sentence is not grammatically correct. Also, I do not understand what 'it most accurately capture the fact that practical decisions in transfer most often involve'... means. I think this sentence is very convoluted. I would propose: 'We then propose a related notion of marginal transfer-exponent, which computes discrepancy w.r.t. the marginals, which can use unlabeled data. This is of practical relevance since unlabeled data is often cheaper to collect.'

17) Line 17: Generally, in most domain adaptation and transfer learning papers I've read, I think the accepted practice is to indicate the source with Q and the target with P.

18) Line 28: I think there may be a typo. What is ignoring the 'worse' of the data?

19) Line 175: I find that the section title contains very little information. Maybe 'Lower-Bounds for Transfer'.

20) Line 208: I would like to see the section title improved here also.

21) Line 231: I would include 'unlabeled data' in the section title to improve the clarity.

22) Line 97: Replace 'Transfer' with 'transfer learning'.
23) Line 91: I think it would be nice to explain that NC stands for Noise Condition and RCS stands for Relaxed Covariate Shift.

24) Line 233: 'The idea is that ... in many applications'. This sentence is a bit empty: 'X is true, so X is realistic'.

Significance: I think the work is very important. It provides new insights into the problem of transfer and what the limits of the setting are, e.g. what kind of improvements are possible. This will definitely be important for the community. Also, several algorithms are given that may be of practical interest. In particular, the non-symmetric transfer exponent is interesting, and the example of 'super-transfer' is also surprising and cool. I think, generally, the authors may have a problem with the page limit. But really, the proposed additions (contribution list, discussion / future work, conclusion) could greatly benefit the paper. Your readers will really appreciate it! My suggestion is to move some of the algorithms to the supplement, to make space for these important additions. For example, you could easily push Section 7.1 to the supplement.

Update after author feedback: Seeing as the authors will move the proof ideas to the supplement, and since the authors have at least addressed 3 (discussion / limitations), 4 (practical / algorithmic questions), and 13 (summarizing and listing contributions), I trust 12 and 14 will be sufficiently addressed in the camera-ready version. Thus I have changed my score to 9, as promised. Great paper, and good luck preparing the camera-ready version; I am looking forward to reading it.

### Reviewer 2

Originality and significance: It is clear that this work makes substantial contributions to understanding the worst-case behavior of transfer learning.

Clarity: Given the heavily theoretical nature of the work, the submission's clarity can be improved in a few ways. A few comments on writing and structure:

- From reading the introduction, it is unclear what the marginal transfer exponent is, and why it's needed.
- Clearly state that the paper only studies binary classification problems with 0-1 losses. From the first two pages, it is unclear that the paper deals only with purely theoretical topics. This is not to say the paper does not add meaningful contributions to learning theory. The introduction is somewhat misleading, and makes it hard to place the true contributions (and limitations) of this paper.
- Throughout the paper, the use of "\gamma" and "\rho" as pronouns for transfer exponents is confusing, as these are not uniquely defined concepts (e.g. Line 123 or Example 3).
- Examples 2-4 would be better placed after Theorem 3. The forward referencing is confusing at best.
- A brief explanation of why Theorem 2 is stated in terms of expectations would be helpful.
- I'm guessing \epsilon_U = \epsilon_H in Remark 1.
- Line 179: Define the VC dimension notation before using it.
- Line 57: "first first"; Line 70: "does the"; Line 74: "efforst"; Line 200: the second \epsilon_1 should be \epsilon_2.

### Reviewer 3

This paper aims to contribute to the theory of transfer learning by introducing new divergence measures across domains and using them to develop minimax bounds, bounds on sampling cost, and covariate shift correction bounds based on instance reweighting. While it is very useful to have minimax theory for the transfer learning setting (in this case covariate shift), it is difficult to parse the significance of the theoretical results from the way the paper is written. In particular, the current results lack sufficient comparison with prior work, for example with transductive bounds for covariate shift, such as in [Cortes et al., 2008; Gretton et al., 2009]. Furthermore, there does not appear to be sufficient treatment of how the results depend on the various terms. For example, there is almost no discussion of the implications of Theorems 1, 3, 4, 5 and 6. However, the new quantities introduced by the authors to measure divergence, and the kinds of bounds being derived, can be useful if they are put into better context and more intuition is provided about the theoretical results presented.

References:

- Cortes, Corinna, et al. "Sample selection bias correction theory." International conference on algorithmic learning theory. Springer, Berlin, Heidelberg, 2008.
- Gretton, Arthur, et al. "Covariate shift by kernel mean matching." Dataset shift in machine learning 3.4 (2009): 5.

### Reviewer 4

* The addressed problem is of theoretical and practical interest in transfer learning. Starting from the definition of transfer exponents, a notion introduced in [1], the paper proposes several theoretical results on bounding the excess risk over the target domain under different settings (sampling cost minimization, source data re-weighting, choosing from multiple sources). These results are established for any classifier from a class of functions with some VC dimension. My feeling is that the assumptions and the obtained results are meaningful and reasonable, at least from the conceptual point of view of transfer learning benefits.
* All results rely heavily on the relaxation of the covariate shift assumption. Do the established results still hold if the covariate shift assumption is strictly enforced, that is, $P_{Y|X} = Q_{Y|X}$? If not, how do the rates change?
* The work most closely related to this submission is the paper of Kpotufe and Martinet [1]. How do the obtained results contrast with the ones presented in [1]? Discussion and analysis on this point may be of interest to the reader.
* Usually, sampling labels for target data may be much costlier than for source data. From Thm 4, what are the transition regimes for $n_P$ and $n_Q$ w.r.t. the sampling costs $c_P$ and $c_Q$? Under which conditions is sampling few target labels sufficient to attain the desired excess risk?
* All learning procedures proposed in the paper rely on probabilities $P_D(h \neq h')$. How are these probabilities evaluated in practice without explicit knowledge of the source and target distributions? To turn these procedures into practical algorithms, an instantiation on a family of classifiers (linear models, k-NN or others) may be interesting and helpful.

Minor comments:

* Lines 141-145: the text is hard to read.
* Line 249 (learning procedure): parameter $\tilde C$ is defined at line 0 and is not applied afterwards. Lines 2 and 4: parameter $C$ is undefined.
Line 7: can the paper provide an intuition about the upper bound $\epsilon/4$ used to select hypothesis $\hat h_{S_P}$?