NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1446
Title:Conformal Prediction Under Covariate Shift

Reviewer 1

The presented paper generalizes the standard setting of regression via conformal prediction to adjust for covariate shift between train and test sets (i.e. while the conditional probability P_{Y|X} is constant, the covariate marginal P_X changes from train to test set). Although the structure of the paper seems somewhat uncommon, it is very easy to follow and gives excellent explanations of assumptions, interpretations, and theoretical and empirical results. Proofs are supplied in adequate detail. A broader look at the context and importance of the generalized exchangeability is provided, indicating the paper's significance beyond the special case of covariate shift.

I would consider the contributions of this paper very significant. Conformal prediction is especially relevant in safety-sensitive situations (such as the medical domain or autonomous driving), but in these settings covariate shift is extremely common. To properly deploy ML algorithms in these settings, covariate shift has to be taken care of; this paper provides both the algorithmic and the theoretical basis for this.

Some (minor) points of criticism:
- A theoretical analysis of the approximated likelihood ratio would have been very helpful. In practice, the true likelihood ratio will not be known, and although the authors give a practical way of dealing with this situation, its theoretical properties were not established.
- It would have been interesting to see whether and how the proposed method translates to the classification setting, instead of only regression.
- The empirical evaluation is somewhat limited. A more rigorous evaluation on a more diverse range of data sets would have been helpful.
- Code could easily have been provided via anonymized links or in the supplement.

Reviewer 2

Summary: Conformal prediction is a distribution-free method for confidence region construction, which resembles the jackknife procedure. This paper addresses one failure of the exchangeability assumption, which is at the core of the theory underlying conformal prediction. The essence of the proposed adjustment to the theory, and of the modified confidence interval construction, is the addition of importance weights, which help deal with differences in the distribution of the observed sample, e.g. due to covariate shift in supervised learning tasks. The authors numerically compare the classical procedure to the weighted prediction. The experiments also compare oracle weights against weights estimated via a contrast-based density estimation approach.
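For concreteness, the weighting step summarized above can be sketched roughly as follows. This is an illustrative sketch of a weighted split-conformal interval and is not the authors' code: the function names, the absolute-residual score, and the clipping constant are assumptions made for the example. The helper `odds_from_probability` assumes the weights come from a classifier trained to separate test covariates from training covariates, which is one common way to estimate a likelihood ratio up to a constant factor.

```python
import math


def odds_from_probability(p_test, eps=1e-6):
    """Turn a classifier's estimate of P(point is from the test set | x)
    into a weight proportional to the likelihood ratio (hypothetical helper).
    Probabilities are clipped to [eps, 1 - eps] to avoid division by zero."""
    p = min(max(p_test, eps), 1.0 - eps)
    return p / (1.0 - p)


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Level-(1 - alpha) quantile of the weighted distribution of calibration
    scores, where the test point's own mass is placed at +infinity."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights) + test_weight
    cumulative = 0.0
    # Walk through the scores in increasing order, accumulating normalized mass.
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # The mass on finite scores never reached 1 - alpha.
    return math.inf


def weighted_split_conformal_interval(prediction, scores, weights,
                                      test_weight, alpha=0.1):
    """Symmetric interval around a point prediction, assuming the scores are
    absolute residuals |y_i - mu(x_i)| computed on a held-out calibration set."""
    q = weighted_conformal_quantile(scores, weights, test_weight, alpha)
    return prediction - q, prediction + q
```

With all weights equal, the quantile reduces to the usual unweighted split-conformal quantile. When the test point carries a large share of the total weight, the quantile can be infinite, in which case the interval is the whole real line.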

Reviewer 3

[Replies to author feedback] I thank the authors for the provided answers, in particular regarding the extent to which the extensions concerning sample-wise covariate may prove useful.

Originality: the problem of considering test distributions that differ from the input distribution is not new, and the originality of the paper mainly lies in showing that this can be handled in conformal prediction, provided we can find a map from the training to the test distribution (in this case, by estimating a likelihood ratio). I also missed a discussion of relations with techniques such as transfer learning, and also with importance sampling ideas (admittedly less connected, but I think relevant nevertheless).

Clarity: the paper is quite clear.

Significance: this is maybe the weakest point of the paper. In particular:
- The experiments are more a proof of concept than a demonstration that the method is competitive and applicable. In particular, conformal prediction having been designed to perform instance-wise covered predictions, it is unclear how realistic the idea of storing shifted observations is. Or could we detect the drift incrementally?
- The interest of the last part, substantiated by Theorem 2, is unclear to me from a practical point of view. I understand this generalizes previous results, but how practical is a setting where each data point may be "drifted" or may come from a different distribution? How could we solve the estimation problem in such a setting? In short, what is the added value of this generalisation (beyond being a generalisation)?

Typos:
* P2, L3 after top: for multiple instances OF the same