{"title": "Computing Full Conformal Prediction Set with Approximate Homotopy", "book": "Advances in Neural Information Processing Systems", "page_first": 1386, "page_last": 1395, "abstract": "If you are predicting the label $y$ of a new object with $\\hat y$, how confident are you that $y = \\hat y$? Conformal prediction methods provide an elegant framework for answering such question by building a $100 (1 - \\alpha)\\%$ confidence region without assumptions on the distribution of the data. It is based on a refitting procedure that parses all the possibilities for $y$ to select the most likely ones. Although providing strong coverage guarantees, conformal set is impractical to compute exactly for many regression problems. We propose efficient algorithms to compute conformal prediction set using approximated solution of (convex) regularized empirical risk minimization. Our approaches rely on a new homotopy continuation technique for tracking the solution path with respect to sequential changes of the observations. We also provide a detailed analysis quantifying its complexity.", "full_text": "Computing Full Conformal Prediction Set with\n\nApproximate Homotopy\n\nEugene Ndiaye\n\nRIKEN Center for Advanced Intelligence Project\n\neugene.ndiaye@riken.jp\n\nIchiro Takeuchi\n\nNagoya Institute of Technology\n\ntakeuchi.ichiro@nitech.ac.jp\n\nAbstract\n\nIf you are predicting the label y of a new object with \u02c6y, how con\ufb01dent are you\nthat y = \u02c6y? Conformal prediction methods provide an elegant framework for\nanswering such question by building a 100(1 \u2212 \u03b1)% con\ufb01dence region without\nassumptions on the distribution of the data. It is based on a re\ufb01tting procedure that\nparses all the possibilities for y to select the most likely ones. Although providing\nstrong coverage guarantees, conformal set is impractical to compute exactly for\nmany regression problems. We propose ef\ufb01cient algorithms to compute conformal\nprediction set using approximated solution of (convex) regularized empirical risk\nminimization. Our approaches rely on a new homotopy continuation technique for\ntracking the solution path with respect to sequential changes of the observations.\nWe also provide a detailed analysis quantifying its complexity.\n\n1\n\nIntroduction\n\nIn many practical applications of regression models it is bene\ufb01cial to provide, not only a point-\nprediction, but also a prediction set that has some desired coverage property. This is especially\ntrue when a critical decision is being made based on the prediction, e.g., in medical diagnosis or\nexperimental design. Conformal prediction is a general framework for constructing non-asymptotic\nand distribution-free prediction sets. Since the seminal work of [23, 21], the statistical properties and\ncomputational algorithms for conformal prediction have been developed for a variety of machine\nlearning problems such as density estimation, clustering, and regression - see the review of [2].\nLet Dn = {(x1, y1),\u00b7\u00b7\u00b7 , (xn, yn)} be a sequence of features and labels of random variables in\nRp \u00d7 R from a distribution P. 
Based on the observed data $D_n$ and a new test instance $x_{n+1}$ in $\mathbb{R}^p$, the goal of conformal prediction is to build a $100(1 - \alpha)\%$ confidence set that contains the unobserved variable $y_{n+1}$, for $\alpha$ in $(0, 1)$, without any specific assumption on the distribution $\mathbb{P}$.

The conformal prediction set for $y_{n+1}$ is defined as the set of $z \in \mathbb{R}$ whose typicalness is sufficiently large. The typicalness of each $z$ is defined based on the residuals of the regression model trained with the augmented training set $D_{n+1}(z) = D_n \cup \{(x_{n+1}, z)\}$. On average, prediction sets constructed within a conformal prediction framework are shown to have a desirable coverage property, as long as the training instances $\{(x_i, y_i)\}_{i=1}^{n+1}$ are exchangeable and the regression estimator is symmetric with respect to the training instances (even when the model is not correctly specified).

Despite these attractive properties, the computation of conformal prediction sets has remained intractable, since one needs to fit infinitely many regression models with an augmented training set $D_{n+1}(z)$, for all possible $z \in \mathbb{R}$. Except for simple regression estimators with quadratic loss (such as least-squares, ridge or lasso estimators), where an explicit and exact solution of the model parameter can be written as a piecewise linear function of the observation vector, the computation of the full and exact conformal set for general regression problems is challenging and still open.

Contributions. We propose a general method to compute the full conformal prediction set for a wider class of regression estimators. The main novelties are summarized in the following points:

• We introduce a new homotopy continuation technique, inspired by [8, 16], which can efficiently update an approximate solution with tolerance $\epsilon > 0$ when the data are streamed sequentially. For this, we show that the variation of the optimization error depends only on the loss at the new input data. Thus, exploiting the regularity of the loss, we can provide a range of observations for which an approximate solution is still valid. This allows us to approximately fit infinitely many regression models for all possible $z$ in a pre-selected range $[y_{\min}, y_{\max}]$ using only a finite number of candidates $z$. For example, when the loss function is smooth, the number of model fittings required for constructing the prediction set is $O(1/\sqrt{\epsilon})$.

• Exploiting the approximation error bounds of the proposed homotopy continuation method, we can construct the prediction set based on the $\epsilon$-solution, which satisfies the same valid coverage properties, under the same mild assumptions, as the conformal prediction framework. When the approximation tolerance $\epsilon$ decreases to $0$, the prediction set converges to the exact conformal prediction set which would be obtained by fitting an infinitely large number of regression models. Furthermore, if the loss function of the regression estimator is smooth and some other regularity conditions are satisfied, the prediction set constructed by the proposed method is shown to contain the exact conformal prediction set.

For reproducibility, our implementation is available at
https://github.com/EugeneNdiaye/homotopy_conformal_prediction

Notation.
For a non-zero integer $n$, we denote by $[n]$ the set $\{1, \cdots, n\}$. The dataset of size $n$ is denoted $D_n = (x_i, y_i)_{i \in [n]}$, the row-wise feature matrix is $X = [x_1, \cdots, x_{n+1}]^\top$, and $X_{[n]}$ is its restriction to the first $n$ rows. Given a proper, closed and convex function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, we denote $\mathrm{dom}\, f = \{x \in \mathbb{R}^n : f(x) < +\infty\}$. Its Fenchel-Legendre transform is $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ defined by $f^*(x^*) = \sup_{x \in \mathrm{dom}\, f} \langle x^*, x \rangle - f(x)$. The smallest integer larger than a real value $r$ is denoted $\lceil r \rceil$. We denote by $Q_{1-\alpha}$ the $(1-\alpha)$-quantile of a real-valued sequence $(U_i)_{i \in [n+1]}$, defined as the variable $Q_{1-\alpha} = U_{(\lceil (n+1)(1-\alpha) \rceil)}$, where $U_{(i)}$ is the $i$-th order statistic. For $j$ in $[n+1]$, the rank of $U_j$ among $U_1, \cdots, U_{n+1}$ is defined as $\mathrm{Rank}(U_j) = \sum_{i=1}^{n+1} 1_{U_i \le U_j}$. The interval $[a - \tau, a + \tau]$ will be denoted $[a \pm \tau]$.

2 Background and Problem Setup

We consider the framework of regularized empirical risk minimization (see for instance [22]) with a convex loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, a convex regularizer $\Omega : \mathbb{R}^p \to \mathbb{R}$ and a positive scalar $\lambda$:

$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} P(\beta) := \sum_{i=1}^{n} \ell(y_i, x_i^\top \beta) + \lambda \Omega(\beta)$ .   (1)

For simplicity, we will assume that for any real values $z$ and $z_0$, the quantities $\ell(z_0, z)$ and $\ell(z, z_0)$ are non-negative, and $\ell(z_0, z_0)$ and $\ell^*(z_0, 0)$ are equal to zero. These assumptions are easy to satisfy, and we refer the reader to the appendix for more details.

Examples. A popular example of a loss function found in the literature is power norm regression, where $\ell(a, b) = |a - b|^q$. When $q = 2$, this corresponds to classical linear regression. Cases where $q \in [1, 2)$ are common in robust statistics; in particular, $q = 1$ is known as least absolute deviation. The logcosh loss $\ell(a, b) = \gamma \log(\cosh((a - b)/\gamma))$ is a differentiable alternative to the $\ell_\infty$-norm (Chebychev approximation). One can also use the Linex loss function [9, 4], which provides an asymmetric loss $\ell(a, b) = \exp(\gamma(a - b)) - \gamma(a - b) - 1$, for $\gamma \neq 0$. Any convex regularization function $\Omega$, e.g., Ridge [10] or the sparsity-inducing norms in [1], can be considered.
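To make these examples concrete, here is a minimal Python sketch of the three losses (the default $\gamma$ values are arbitrary choices of ours, and the parenthesization of the logcosh loss is the one compatible with the assumption $\ell(z_0, z_0) = 0$ above):

    import numpy as np

    def power_loss(a, b, q=2):
        """Power norm loss |a - b|^q: q = 2 is least squares, q = 1 is least absolute deviation."""
        return np.abs(a - b) ** q

    def logcosh_loss(a, b, gamma=1.0):
        """Smooth loss gamma * log(cosh((a - b) / gamma)), used for the smooth Chebychev approximation."""
        return gamma * np.log(np.cosh((a - b) / gamma))

    def linex_loss(a, b, gamma=0.5):
        """Asymmetric Linex loss exp(gamma (a - b)) - gamma (a - b) - 1, for gamma != 0."""
        d = a - b
        return np.exp(gamma * d) - gamma * d - 1.0

    # All three losses vanish on the diagonal, as required by the assumption l(z0, z0) = 0.
    assert power_loss(1.3, 1.3) == logcosh_loss(1.3, 1.3) == linex_loss(1.3, 1.3) == 0.0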
For a new test instance $x_{n+1}$, the goal is to construct a prediction set $\hat\Gamma^{(\alpha)}(x_{n+1})$ for $y_{n+1}$ such that

$\mathbb{P}_{n+1}(y_{n+1} \in \hat\Gamma^{(\alpha)}(x_{n+1})) \ge 1 - \alpha$ for $\alpha \in (0, 1)$ .   (2)

2.1 Conformal Prediction

Conformal prediction [23] is a general framework for constructing confidence sets, with the remarkable properties of being distribution-free, having a finite-sample coverage guarantee, and being adaptable to any estimator under mild assumptions. We recall the arguments in [21, 14].

Let us introduce the extension of the optimization problem (1) with augmented training data $D_{n+1}(z) := D_n \cup \{(x_{n+1}, z)\}$ for $z \in \mathbb{R}$:

$\hat\beta(z) \in \arg\min_{\beta \in \mathbb{R}^p} P_z(\beta) := \sum_{i=1}^{n} \ell(y_i, x_i^\top \beta) + \ell(z, x_{n+1}^\top \beta) + \lambda \Omega(\beta)$ .   (3)

Then, for any $z$ in $\mathbb{R}$, we define the conformity measure for $D_{n+1}(z)$ as

$\forall i \in [n],\ \hat R_i(z) = \psi(y_i, x_i^\top \hat\beta(z))$ and $\hat R_{n+1}(z) = \psi(z, x_{n+1}^\top \hat\beta(z))$ ,   (4)

where $\psi$ is a real-valued function that is invariant with respect to any permutation of the input data. For example, in a linear regression problem, one can take the absolute value of the residual as a conformity measure, i.e. $\hat R_i(z) = |y_i - x_i^\top \hat\beta(z)|$.

The main idea for constructing a conformal confidence set is to consider the typicalness of a candidate point $z$, measured as

$\hat\pi(z) = \hat\pi(D_{n+1}(z)) := 1 - \frac{1}{n+1} \mathrm{Rank}(\hat R_{n+1}(z))$ .   (5)

If the sequence $(x_i, y_i)_{i \in [n+1]}$ is exchangeable and identically distributed, then $(\hat R_i(y_{n+1}))_{i \in [n+1]}$ is exchangeable as well, by the invariance of $\hat R$ with respect to permutations of the data. Since the rank of one variable among an exchangeable and identically distributed sequence is (sub-)uniformly distributed in $\{1, \cdots, n+1\}$ (see [3]), we have $\mathbb{P}_{n+1}(\hat\pi(y_{n+1}) \le \alpha) \le \alpha$ for any $\alpha$ in $(0, 1)$. This means that the function $\hat\pi$ takes a small value on atypical data. Classical statistics for hypothesis testing, such as a p-value function, satisfy such a condition under the null hypothesis (see [12, Lemma 3.3.1]). In particular, this implies that the desired coverage guarantee in Equation (2) is verified by the conformal set defined as

$\hat\Gamma^{(\alpha)}(x_{n+1}) := \{z \in \mathbb{R} : \hat\pi(z) > \alpha\}$ .   (6)

The conformal set gathers the real values $z$ such that $\hat\pi(z) > \alpha$, which holds if and only if $\hat R_{n+1}(z)$ is ranked no higher than $\lceil (n+1)(1-\alpha) \rceil$ among the $\hat R_i(z)$ for $i$ in $[n+1]$. For regression problems where $y_{n+1}$ lies in a subset of $\mathbb{R}$, obtaining the conformal set $\hat\Gamma^{(\alpha)}(x_{n+1})$ in Equation (6) is computationally challenging: it requires re-fitting the prediction model $\hat\beta(z)$ for infinitely many candidates $z$ in $\mathbb{R}$ in order to compute a conformity measure such as $\hat R_i(z) = |y_i - x_i^\top \hat\beta(z)|$.
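Equations (5) and (6) translate directly into code once the residuals are available; a minimal sketch with the absolute-residual conformity measure (the function names are ours):

    import numpy as np

    def typicalness(residuals):
        """pi(z) of Equation (5); `residuals` holds R_1(z), ..., R_{n+1}(z),
        the last entry being the one of the candidate point (x_{n+1}, z)."""
        rank = np.sum(residuals <= residuals[-1])   # Rank(R_{n+1}(z)) = #{i : R_i <= R_{n+1}}
        return 1.0 - rank / residuals.size          # 1 - Rank / (n + 1)

    def in_conformal_set(z, beta, X, y, x_new, alpha):
        """Membership test of Equation (6) for a candidate z, given beta fitted on D_{n+1}(z)."""
        res = np.abs(np.append(y, z) - np.append(X @ beta, x_new @ beta))
        return typicalness(res) > alpha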
Existing Approaches for Computing a Conformal Prediction Set. In Ridge regression, for any $x$ in $\mathbb{R}^p$, $z \mapsto x^\top \hat\beta(z)$ is a linear function of $z$, implying that $\hat R_i(z)$ is piecewise linear. Exploiting this fact, an exact conformal set $\hat\Gamma^{(\alpha)}(x_{n+1})$ for Ridge regression was efficiently constructed in [18]. Similarly, using the piecewise linearity in $z$ of the Lasso solution, [13] proposed a piecewise linear homotopy, under mild assumptions, when a single input sample point is perturbed. Apart from these cases of quadratic loss with Ridge and Lasso regularization, where an explicit formula for the estimator is available, computing such a set is often infeasible. Also, a known drawback of exact path computation is its exponential complexity in the worst case [7], as well as numerical instabilities due to multiple inversions of potentially ill-conditioned matrices.

Another approach is to split the dataset into a training set, on which the regression model is fitted, and a calibration set, on which the conformity scores and their ranks are computed. Although this approach avoids the computational bottleneck of the full conformal prediction framework, statistical efficiency is lost both in the model fitting stage and in the conformity score rank computation stage, due to the effect of a reduced sample size. It also adds another layer of randomness, which may be undesirable for the construction of prediction intervals [13].

A common heuristic approach in the literature is to evaluate the typicalness $\hat\pi(z)$ only for an arbitrary finite number of grid points. Although the prediction set constructed from those finitely many values of $\hat\pi(z)$ might roughly mimic the conformal prediction set, the desirable coverage properties are no longer maintained. To overcome this issue, [5] proposed a discretization strategy with a more careful procedure to round the observation vectors, but it fails to exactly preserve the $1 - \alpha$ coverage guarantee. In the appendix, we discuss in detail critical limitations of such an approach.

Algorithm 1: $\epsilon$-online_homotopy
  Input: $D_n = \{(x_1, y_1), \cdots, (x_n, y_n)\}$, $x_{n+1}$, $[y_{\min}, y_{\max}]$, $\epsilon_0 < \epsilon$
  Initialization: $z_{t_0} = x_{n+1}^\top \beta$ where $\beta$ is an $\epsilon_0$-solution of problem (1) using only $D_n$
  repeat
    $z_{t_{k+1}} = z_{t_k} \pm s_\epsilon$ where $s_\epsilon = \sqrt{\frac{2}{\nu}(\epsilon - \epsilon_0)}$ if the loss is $\nu$-smooth
    Get $\beta(z_{t_{k+1}})$ by minimizing $P_{z_{t_{k+1}}}$ up to accuracy $\epsilon_0$ (warm-started with $\beta(z_{t_k})$)
  until $[y_{\min}, y_{\max}]$ is covered
  Return: $\{z_{t_k}, \beta(z_{t_k})\}_{k \in [T_\epsilon]}$

3 Homotopy Algorithm

To construct an exact conformal set, we need to be able to compute the entire path of the model parameters $\hat\beta(z)$, obtained by solving the augmented optimization problem in Equation (3) for any $z$ in $\mathbb{R}$. In fact, two problems arise. First, even for a single $z$, $\hat\beta(z)$ may not be available because, in general, the optimization problem cannot be solved exactly [17, Chapter 1]. Second, except for simple regression problems such as Ridge or Lasso, the exact path of $\hat\beta(z)$ cannot be computed, since this would require infinitely many model fits.

Our basic idea to circumvent this difficulty is to rely on approximate solutions at a given precision $\epsilon > 0$. Here, we call an $\epsilon$-solution any vector $\beta$ whose objective value satisfies

$P_z(\beta) - P_z(\hat\beta(z)) \le \epsilon$ .   (7)

An $\epsilon$-solution can be found efficiently under mild assumptions on the regularity of the function being optimized. In this section, we show that finite paths of $\epsilon$-solutions can be computed for a wider class of regression problems. Indeed, it is not necessary to re-calculate a new solution for neighboring observations, i.e. $\beta(z)$ and $\beta(z_0)$ have the same performance when $z$ is close to $z_0$. We develop a precise analysis of this idea.
Then, we show how this can be used to effectively approximate the conformal prediction set in Equation (6), defined with exact solutions, while preserving the coverage guarantee.

We recall the dual formulation [20, Chapter 31] of Equation (3):

$\hat\theta(z) \in \arg\max_{\theta \in \mathbb{R}^{n+1}} D_z(\theta) := -\sum_{i=1}^{n} \ell^*(y_i, -\lambda\theta_i) - \ell^*(z, -\lambda\theta_{n+1}) - \lambda \Omega^*(X^\top \theta)$ .   (8)

For a primal/dual pair of vectors $(\beta(z), \theta(z))$ in $\mathrm{dom}\, P_z \times \mathrm{dom}\, D_z$, the duality gap is defined as

$\mathrm{Gap}_z(\beta(z), \theta(z)) := P_z(\beta(z)) - D_z(\theta(z))$ .

Weak duality ensures that $P_z(\beta(z)) \ge D_z(\theta(z))$, which yields an upper bound on the approximation error of $\beta(z)$ in Equation (7), i.e.

$P_z(\beta(z)) - P_z(\hat\beta(z)) \le \mathrm{Gap}_z(\beta(z), \theta(z))$ .

This will allow us to keep track of the approximation error when the parameters of the objective function change. Given any $\beta$ such that $\mathrm{Gap}(\beta, \theta) \le \epsilon$, i.e. an $\epsilon$-solution for problem (1), we explore the candidates for $y_{n+1}$ with the parameterization of the real line

$z_t := z_0 + t$, for $t \in \mathbb{R}$ and $z_0 = x_{n+1}^\top \beta$ .   (9)

This additive parameterization was used in [13] for the case of the Lasso. It has the nice property that adding $(x_{n+1}, z_0)$ as the $(n+1)$-th observation does not change the objective value of $\beta$, i.e. $P(\beta) = P_{z_0}(\beta)$. Thus, if a vector $\beta$ is an $\epsilon$-solution for $P$, it remains so for $P_{z_0}$. Interestingly, such a choice is still valid for sufficiently small $t$. We show that, depending on the regularity of the loss function, we can precisely derive a range of the parameter $t$ over which $\beta$ remains a valid $\epsilon$-solution for $P_{z_t}$ when the dataset $D_n$ is augmented with $\{(x_{n+1}, z_t)\}$.

We define the variation of the duality gap between real values $z$ and $z_0$ as

$\Delta G(x_{n+1}, z, z_0) := \mathrm{Gap}_z(\beta, \theta) - \mathrm{Gap}_{z_0}(\beta, \theta)$ .

Lemma 1. For any $(\beta, \theta) \in \mathrm{dom}\, P_w \times \mathrm{dom}\, D_w$ for $w \in \{z_0, z\}$, we have

$\Delta G(x_{n+1}, z, z_0) = [\ell(z, x_{n+1}^\top \beta) - \ell(z_0, x_{n+1}^\top \beta)] + [\ell^*(z, -\lambda\theta_{n+1}) - \ell^*(z_0, -\lambda\theta_{n+1})]$ .

Lemma 1 shows that the variation of the duality gap between $z$ and $z_0$ depends only on the variations of the loss function $\ell$ and of its conjugate $\ell^*$. Thus, it is enough to exploit the regularity (e.g. smoothness) of the loss function in order to obtain an upper bound on the variation of the duality gap (and therefore on the optimization error).

Construction of a Dual Feasible Vector. A generic method for producing a dual-feasible vector is to re-scale the output of the gradient mapping. For a real value $z$, let $\beta(z)$ be any primal vector and denote $Y_z = (y_1, \cdots, y_n, z)$. The optimality conditions for (3) and (8) imply $\hat\theta(z) = -\nabla\ell(Y_z, X\hat\beta(z))/\lambda$, which suggests making use of [16]

$\theta(z) := \dfrac{-\nabla\ell(Y_z, X\beta(z))}{\max\{\lambda,\ \sigma^{\circ}_{\mathrm{dom}\,\Omega^*}(X^\top \nabla\ell(Y_z, X\beta(z)))\}} \in \mathrm{dom}\, D_z$ ,   (10)

where $\sigma$ is the support function and $\sigma^{\circ}$ its polar. When the regularization is a norm $\Omega(\cdot) = \|\cdot\|$, then $\sigma^{\circ}_{\mathrm{dom}\,\Omega^*}$ is the associated dual norm $\|\cdot\|_*$. When $\Omega$ is strongly convex, the dual vector in Equation (10) simplifies to $\theta(z) = -\nabla\ell(Y_z, X\beta(z))/\lambda$.
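Lemma 1 and the dual construction (10) can be checked numerically on a small instance. Below is a minimal sketch for the squared loss $\ell(a, b) = \frac{1}{2}(a - b)^2$ (so $\nu = 1$) with a Ridge penalty $\Omega = \frac{1}{2}\|\cdot\|^2$, using $\ell^*(y, u) = uy + u^2/2$ and $\Omega^*(v) = \frac{1}{2}\|v\|^2$; this particular instantiation, and the variable names, are ours and serve only as an illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 30, 5, 1.0
    X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
    x_new = rng.standard_normal(p)

    # Exact ridge solution of problem (1) on D_n (any eps_0-solution would do).
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    def primal(X_aug, y_aug, b):    # P_z(beta): squared loss plus ridge penalty
        return 0.5 * np.sum((y_aug - X_aug @ b) ** 2) + 0.5 * lam * b @ b

    def dual(X_aug, y_aug, th):     # D_z(theta) for this loss / regularizer pair
        return lam * th @ y_aug - 0.5 * lam ** 2 * th @ th \
               - 0.5 * lam * np.sum((X_aug.T @ th) ** 2)

    X_aug = np.vstack([X, x_new])
    z0 = float(x_new @ beta)                           # parameterization (9)
    theta = (np.append(y, z0) - X_aug @ beta) / lam    # dual vector (10), strongly convex case

    gap0 = primal(X_aug, np.append(y, z0), beta) - dual(X_aug, np.append(y, z0), theta)
    for t in (0.1, 0.5, 1.0):
        y_t = np.append(y, z0 + t)
        gap_t = primal(X_aug, y_t, beta) - dual(X_aug, y_t, theta)
        # Lemma 1 with theta_{n+1} = 0: the gap increases by exactly l(z_t, z0) = t^2 / 2,
        # matching the (nu / 2) t^2 bound below with equality for the squared loss.
        print(t, gap_t - gap0, 0.5 * t ** 2)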
Using $\theta(z_0)$ in Equation (10) with $z_0 = x_{n+1}^\top \beta$ greatly simplifies the expression of the variation of the duality gap between $z_t$ and $z_0$ in Lemma 1 to

$\Delta G(x_{n+1}, z_t, z_0) = \ell(z_t, x_{n+1}^\top \beta)$ .

This directly follows from the assumptions $\ell(z_0, z_0) = \ell^*(z_0, 0) = 0$ and from the construction of the dual vector, $\theta_{n+1} \propto \partial_2 \ell(z_0, x_{n+1}^\top \beta) = \partial_2 \ell(z_0, z_0) = 0$. Whence, assuming that the loss function is $\nu$-smooth (see the appendix for more details and extensions to other regularity assumptions) and using the parameterization in Equation (9), we obtain

$\Delta G(x_{n+1}, z_t, z_0) \le \frac{\nu}{2}(z_t - z_0)^2 = \frac{\nu}{2} t^2$ .

Proposition 1. Assuming that the loss function $\ell$ is $\nu$-smooth, the variations of the gap $\Delta G(x_{n+1}, z_t, z_0)$ are smaller than $\epsilon$ for all $t$ in $[-\sqrt{2\epsilon/\nu}, \sqrt{2\epsilon/\nu}]$. Moreover, assuming that $\mathrm{Gap}_{z_0}(\beta(z_0), \theta(z_0)) \le \epsilon_0 < \epsilon$, the pair $(\beta(z_0), \theta(z_0))$ is a primal/dual $\epsilon$-solution for the optimization problem (3) with augmented data $D_n \cup \{(x_{n+1}, z_t)\}$ as long as

$|z_t - z_0| \le \sqrt{\frac{2}{\nu}(\epsilon - \epsilon_0)} =: s_\epsilon$ .

Complexity. A given interval $[y_{\min}, y_{\max}]$ can be covered by Algorithm 1 in $T_\epsilon$ steps, where

$T_\epsilon \le \left\lceil \frac{y_{\max} - y_{\min}}{s_\epsilon} \right\rceil \in O\left(\frac{1}{\sqrt{\epsilon}}\right)$ .

We can notice that the step size $s_\epsilon$ (smooth case) used for computing the whole path is independent of the data and of the intermediate solutions. Thus, for computational efficiency, the latter can be computed in parallel or by sequentially warm-starting the initialization. Also, since the grid can be constructed by decreasing or increasing the value of $z_t$, the number of solutions calculated along the path can be halved by using $\beta(z_t)$ as an $\epsilon$-solution on the whole interval $[z_t \pm s_\epsilon]$.

Lower Bound. Using the same reasoning when the loss is $\mu$-strongly convex, we have

$\Delta G(x_{n+1}, z_t, z_0) \ge \frac{\mu}{2}(z_t - z_0)^2$ .

Hence $\Delta G(x_{n+1}, z_t, z_0) > \epsilon$ as soon as $|z_t - z_0| > \sqrt{\frac{2}{\mu}(\epsilon - \epsilon_0)}$. Thus, in order to guarantee $\epsilon$ approximation errors at every candidate $z_t$, all the step sizes are necessarily of order $\sqrt{\epsilon}$.
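A minimal Python sketch of Algorithm 1 in the smooth case follows; the inner solver fit is a placeholder of ours for any routine returning an $\epsilon_0$-solution of the augmented problem (3) from a warm start (e.g. a few passes of coordinate descent), and is not part of the released package:

    import numpy as np

    def eps_online_homotopy(X, y, x_new, y_min, y_max, eps, eps0, nu, fit):
        """Cover [y_min, y_max] with eps-solutions, following Algorithm 1 for a nu-smooth loss.

        `fit(X_aug, y_aug, beta_init)` must return an eps0-solution of P_z warm-started
        at beta_init; it is a user-supplied placeholder here."""
        X_aug = np.vstack([X, x_new])
        beta0 = fit(X, y, np.zeros(X.shape[1]))       # eps0-solution on D_n only
        z0 = float(x_new @ beta0)                     # z_{t_0}; note P(beta0) = P_{z0}(beta0)
        step = np.sqrt(2.0 * (eps - eps0) / nu)       # s_eps from Proposition 1
        grid, betas = [z0], [beta0]
        z, b = z0, beta0
        while z < y_max:                              # sweep upwards, warm-starting each fit
            z += step
            b = fit(X_aug, np.append(y, z), b)
            grid.append(z); betas.append(b)
        z, b = z0, beta0
        while z > y_min:                              # then sweep downwards from z_{t_0}
            z -= step
            b = fit(X_aug, np.append(y, z), b)
            grid.insert(0, z); betas.insert(0, b)
        return np.array(grid), betas                  # the pairs {z_{t_k}, beta(z_{t_k})}

Each returned $\beta(z_{t_k})$ is a valid $\epsilon$-solution on the whole cell $[z_{t_k} \pm s_\epsilon]$, which is what Section 4 exploits.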
Figure 1: Illustration of conformal prediction sets at level $\alpha = 0.1$ with exact and approximate solutions for ridge regression, on a synthetic data set generated using sklearn with X, y = make_regression(n = 100, p = 50). (a) Exact conformal prediction sets for ridge regression with one hundred regularization parameters ranging from $\lambda_{\max} = \log(p)$ to $\lambda_{\min} = \lambda_{\max}/10^4$, spaced evenly on a log scale. (b) Evolution of the conformal set of the proposed homotopy method for different optimization errors, spaced evenly on a geometric scale ranging from $\epsilon_{\max} = \|(y_1, \cdots, y_n)\|^2$ to $\epsilon_{\min} = \epsilon_{\max}/10^{10}$. We have chosen the hyperparameter with the smallest confidence set in panel (a) to generate panel (b).

Choice of $[y_{\min}, y_{\max}]$. We follow the usual practice in the literature [13, Remark 5] and set $y_{\min} = y_{(1)}$ and $y_{\max} = y_{(n)}$. In that case, we have $\mathbb{P}(y_{n+1} \in [y_{\min}, y_{\max}]) \ge 1 - 2/(n+1)$. This implies a loss in the coverage guarantee of $2/(n+1)$, which is negligible when $n$ is sufficiently large.

Related Works on Approximate Homotopy. Recent papers [8, 16] have developed approximation path methods when a function is concavely parameterized. Such techniques cannot be used here since, for any $\beta \in \mathbb{R}^p$, the function $z \mapsto P_z(\beta)$ is not concave, so our problem does not fit within their description. Using homotopy continuation to update an exact Lasso solution in the online setting was performed by [6, 13]. Allowing an approximate solution lets us generalize those approaches to a broader class of machine learning tasks, under a variety of regularity assumptions.

4 Practical Computation of a Conformal Prediction Set

We now present how to compute a conformal prediction set based on the approximate homotopy algorithm of Section 3. We show that the set obtained preserves the coverage guarantee and tends to the exact set when the optimization error $\epsilon$ decreases to zero. In the case of a smooth loss function, we present a variant of the conformal set, built from an approximate solution, which contains the exact conformal set.

4.1 Conformal Sets Directly Based on an Approximate Solution

For a real value $z$, we cannot evaluate $\hat\pi(z)$ in Equation (5) in many cases because it depends on the exact solution $\hat\beta(z)$, which is unknown. Instead, we only have access to a given $\epsilon$-solution $\beta(z)$ and the corresponding (approximate) conformity measure:

$\forall i \in [n],\ R_i(z) = \psi(y_i, x_i^\top \beta(z))$ and $R_{n+1}(z) = \psi(z, x_{n+1}^\top \beta(z))$ .   (11)

However, for establishing a coverage guarantee, one can note that any estimator that preserves exchangeability can be used. Whence, we define

$\pi(z, \epsilon) := 1 - \frac{1}{n+1} \mathrm{Rank}(R_{n+1}(z))$ and $\Gamma^{(\alpha, \epsilon)}(x_{n+1}) := \{z \in \mathbb{R} : \pi(z, \epsilon) > \alpha\}$ .   (12)

Proposition 2. Given a significance level $\alpha \in (0, 1)$ and an optimization tolerance $\epsilon > 0$, if the observations $(x_i, y_i)_{i \in [n+1]}$ are exchangeable and identically distributed under the probability $\mathbb{P}$, then the conformal set $\Gamma^{(\alpha, \epsilon)}(x_{n+1})$ satisfies the coverage guarantee $\mathbb{P}_{n+1}(y_{n+1} \in \Gamma^{(\alpha, \epsilon)}(x_{n+1})) \ge 1 - \alpha$.

          Coverage   Length   Time (s)
Oracle    0.9        1.685    0.59
Split     0.9        3.111    0.26
1e-2      0.9        1.767    2.17
1e-4      0.9        1.727    8.02
1e-6      0.9        1.724    45.94
1e-8      0.9        1.722    312.56

Table 1: Computing a conformal set for a Lasso regression problem on the climate data set NCEP/NCAR Reanalysis [11] with n = 814 observations and p = 73570 features. On the left, we compare the time needed to compute the full approximation path with our homotopy strategy, a single coordinate descent (CD) run on the full data $D_{n+1}(y_{n+1})$, and an update of the solution after initialization with an approximate solution using $D_n$.
On the right, we display the coverage, length and time of the different methods, averaged over 100 randomly held-out validation data sets.

The conformal prediction set $\Gamma^{(\alpha, \epsilon)}(x_{n+1})$ (with an approximate solution) preserves the $1 - \alpha$ coverage guarantee and converges to $\Gamma^{(\alpha, 0)}(x_{n+1}) = \hat\Gamma^{(\alpha)}(x_{n+1})$ (with an exact solution) when the optimization error decreases to zero. It is also easier to compute, in the sense that only a finite number of candidates $z$ need to be evaluated. Indeed, as soon as an approximate solution $\beta(z)$ is allowed, we have shown in Section 3 that a solution update is not necessary for neighboring observation candidates.

We consider the parameterization in Equation (9). It holds that

$\Gamma^{(\alpha, \epsilon)} = \{z \in \mathbb{R} : \pi(z, \epsilon) > \alpha\} = \{z_t : t \in \mathbb{R},\ \pi(z_t, \epsilon) > \alpha\}$ .

Using Algorithm 1, we can build a set $\{z_{t_1}, \cdots, z_{t_{T_\epsilon}}\}$ that covers $[y_{\min}, y_{\max}]$ with $\epsilon$-solutions, i.e.:

$\forall z \in [y_{\min}, y_{\max}],\ \exists k \in [T_\epsilon]$ such that $\mathrm{Gap}_z(\beta(z_{t_k}), \theta(z_{t_k})) \le \epsilon$ .

Using the classical conformity measure $\hat R_i(z) = |y_i - x_i^\top \hat\beta(z)|$ and computing a piecewise constant approximation of the solution path $t \mapsto \hat\beta(z_t)$ with the set $\{\beta(z_{t_k}) : k \in [T_\epsilon]\}$, we have

$\Gamma^{(\alpha, \epsilon)} \cap [y_{\min}, y_{\max}] = \bigcup_{k \in [T_\epsilon]} [z_{t_k}, z_{t_{k+1}}] \cap [x_{n+1}^\top \beta(z_{t_k}) \pm Q_{1-\alpha}(z_{t_k})]$ ,

where $Q_{1-\alpha}(z)$ is the $(1-\alpha)$-quantile of the sequence of approximate residuals $(R_i(z))_{i \in [n+1]}$. Details and extensions to more general conformity measures are discussed in the appendix.
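The union-of-intervals formula above is straightforward to evaluate from the output of Algorithm 1; a minimal sketch, reusing the hypothetical grid and betas returned by the homotopy sketch of Section 3 and the absolute-residual conformity measure:

    import numpy as np

    def approx_conformal_set(grid, betas, X, y, x_new, alpha):
        """Gamma^(alpha, eps) restricted to [y_min, y_max]: the union over k of
        [z_{t_k}, z_{t_{k+1}}] intersected with [x_{n+1}^T beta(z_{t_k}) +/- Q_{1-alpha}(z_{t_k})]."""
        n = y.size
        m = int(np.ceil((n + 1) * (1 - alpha)))       # order-statistic index of the quantile
        pieces = []
        for k in range(len(grid) - 1):
            beta, z_k = betas[k], grid[k]
            center = float(x_new @ beta)
            res = np.append(np.abs(y - X @ beta), abs(z_k - center))   # R_1(z), ..., R_{n+1}(z)
            q = np.sort(res)[m - 1]                                    # Q_{1-alpha}(z_{t_k})
            lo, hi = max(z_k, center - q), min(grid[k + 1], center + q)
            if lo <= hi:
                pieces.append((lo, hi))
        return pieces                                  # list of sub-intervals of [y_min, y_max]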
4.2 Wrapping the Exact Conformal Set

Previously, we showed that a full conformal set can be efficiently computed from an approximate solution and that it converges to the conformal set with an exact solution when the optimization error decreases to zero. When the loss function is smooth, and under a gradient-based conformity measure (introduced below), we provide a stronger guarantee: the exact conformal set can be enclosed in a conformal set built using only approximate solutions. For this, we show how the conformity measure can be bounded with respect to the optimization error when the input observation $z$ changes.

Gradient-Based Conformity Measures. The separability of the loss function implies that the coordinate-wise absolute value of its gradient preserves the exchangeability of the data, and hence the coverage guarantee. Whence, it can safely be used as a conformity measure, i.e.

$\hat R_{:}(z) = |\nabla\ell(Y_z, X\hat\beta(z))|$ and $R_{:}(z) = |\nabla\ell(Y_z, X\beta(z))|$ .   (13)

Using Equation (13), we show how the function $\hat\pi$ can be approximated from above and below, thanks to a fine bound on the dual optimal solution [15], which is related to the gradient of the loss function.

Figure 2: Length of the conformal prediction sets at different coverage levels $\alpha \in \{0.1, 0.2, \cdots, 0.9\}$; for each $\alpha$, we display the average over 100 repetitions of randomly held-out validation data sets. (a) Linear regression with $\ell_1$ regularization on the Diabetes dataset (n = 442, p = 10). (b) Logcosh regression with $\ell_2^2$ regularization on the Boston dataset (n = 506, p = 13).

Lemma 2. If the loss function $\ell(z, \cdot)$ is $\nu$-smooth, then for any real value $z$,

$\|\theta(z) - \hat\theta(z)\|^2 \le \frac{2\nu}{\lambda^2} \mathrm{Gap}_z(\beta(z), \theta(z))$, for all $(\beta(z), \theta(z)) \in \mathrm{dom}\, P_z \times \mathrm{dom}\, D_z$ .

Using Equation (13) and further assuming that the dual vector $\theta(z)$ constructed in Equation (10) coincides with $-\nabla\ell(Y_z, X\beta(z))/\lambda$ in $\mathrm{dom}\, D_z$ (this holds whenever $\Omega$ is strongly convex or its domain is bounded; it can also be guaranteed when $\beta(z)$ is built with any converging iterative algorithm, run for sufficiently many iterations, for solving Equation (3)), we have $\hat R_{:}(z) = |\lambda\hat\theta(z)|$ and $R_{:}(z) = |\lambda\theta(z)|$. Thus, combining the triangle inequality and Lemma 2, we have

$\forall i \in [n+1],\ (R_i(z) - \hat R_i(z))^2 \le \|R_{:}(z) - \hat R_{:}(z)\|^2 = \lambda^2 \|\theta(z) - \hat\theta(z)\|^2 \le 2\nu\epsilon$ ,

where the last inequality holds as soon as we can maintain $\mathrm{Gap}_z(\beta(z), \theta(z))$ smaller than $\epsilon$ for any $z$ in $\mathbb{R}$. Whence, $\hat R_i(z)$ belongs to $[R_i(z) \pm \sqrt{2\nu\epsilon}]$ for any $i$ in $[n+1]$. Noting that

$\hat\pi(z) = 1 - \frac{1}{n+1} \mathrm{Rank}(\hat R_{n+1}(z)) = \frac{1}{n+1} \sum_{i=1}^{n+1} 1_{\hat R_i(z) \ge \hat R_{n+1}(z)}$ ,

the function $\hat\pi$ can easily be approximated from above and below by the functions $\overline{\pi}(z, \epsilon)$ and $\underline{\pi}(z, \epsilon)$, which do not depend on the exact solution and are defined as:

$\underline{\pi}(z, \epsilon) = \frac{1}{n+1} \sum_{i=1}^{n+1} 1_{R_i(z) \ge R_{n+1}(z) + 2\sqrt{2\nu\epsilon}}$ and $\overline{\pi}(z, \epsilon) = \frac{1}{n+1} \sum_{i=1}^{n+1} 1_{R_i(z) \ge R_{n+1}(z) - 2\sqrt{2\nu\epsilon}}$ .

Proposition 3. Assume that the loss function is $\nu$-smooth and that we use the gradient-based conformity measure (13). Then $\underline{\pi}(z, \epsilon) \le \hat\pi(z) \le \overline{\pi}(z, \epsilon)$, and approximated lower and upper bounds on the exact conformal set are given by $\underline{\Gamma}^{(\alpha, \epsilon)} \subset \hat\Gamma^{(\alpha)} \subset \overline{\Gamma}^{(\alpha, \epsilon)}$, where

$\underline{\Gamma}^{(\alpha, \epsilon)} = \{z \in \mathbb{R} : \underline{\pi}(z, \epsilon) > \alpha\}$ and $\overline{\Gamma}^{(\alpha, \epsilon)} = \{z \in \mathbb{R} : \overline{\pi}(z, \epsilon) > \alpha\}$ .

In the baseline case of quadratic loss, such sets can easily be computed as

$\underline{\Gamma}^{(\alpha, \epsilon)} \cap [y_{\min}, y_{\max}] = \bigcup_{k \in [T_\epsilon]} [z_{t_k}, z_{t_{k+1}}] \cap [x_{n+1}^\top \beta(z_{t_k}) \pm Q^{-}_{1-\alpha}(t_k)]$ ,

$\overline{\Gamma}^{(\alpha, \epsilon)} \cap [y_{\min}, y_{\max}] = \bigcup_{k \in [T_\epsilon]} [z_{t_k}, z_{t_{k+1}}] \cap [x_{n+1}^\top \beta(z_{t_k}) \pm Q^{+}_{1-\alpha}(t_k)]$ ,

where we have denoted by $Q^{-}_{1-\alpha}(t_k)$ (resp. $Q^{+}_{1-\alpha}(t_k)$) the $(1-\alpha)$-quantile of the sequence of shifted approximate residuals $(R_i(z_{t_k}) - 2\sqrt{2\nu\epsilon})_{i \in [n+1]}$ (resp. $(R_i(z_{t_k}) + 2\sqrt{2\nu\epsilon})_{i \in [n+1]}$), corresponding to the approximate solution $\beta(z_{t_k})$ for $k$ in $[T_\epsilon]$.
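For this quadratic-loss case, the two wrapping sets differ from the previous sketch only through the shifted quantiles $Q^{\mp}_{1-\alpha}(t_k)$; since shifting every residual by a constant shifts their quantile by the same amount, a minimal sketch (same hypothetical names as before) is:

    import numpy as np

    def wrapping_intervals(grid, betas, X, y, x_new, alpha, nu, eps):
        """Lower / upper wrapping sets of Proposition 3 on [y_min, y_max]: same construction
        as before, with the quantile shifted by -/+ 2 * sqrt(2 * nu * eps)."""
        n, shift = y.size, 2.0 * np.sqrt(2.0 * nu * eps)
        m = int(np.ceil((n + 1) * (1 - alpha)))
        inner, outer = [], []
        for k in range(len(grid) - 1):
            beta, z_k = betas[k], grid[k]
            center = float(x_new @ beta)
            res = np.append(np.abs(y - X @ beta), abs(z_k - center))
            q = np.sort(res)[m - 1]
            for pieces, half in ((inner, q - shift), (outer, q + shift)):
                lo, hi = max(z_k, center - half), min(grid[k + 1], center + half)
                if lo <= hi:
                    pieces.append((lo, hi))
        return inner, outer        # lower and upper wrappings of the exact conformal set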
Smooth Chebychev Approx.    Oracle   Split    1e-2     1e-4     1e-6     1e-8
  Coverage                  0.92     0.95     0.92     0.92     0.92     0.92
  Length                    1.940    2.271    1.998    1.990    1.987    1.981
  Time (s)                  0.019    0.016    0.073    0.409    3.742    36.977

Linex regression            Oracle   Split    1e-2     1e-4     1e-6     1e-8
  Coverage                  0.91     0.93     0.91     0.91     0.91     0.91
  Length                    2.189    2.447    2.231    2.209    2.205    2.199
  Time (s)                  0.013    0.012    0.050    0.234    2.054    20.712

Table 2: Computing a conformal set for a logcosh (resp. Linex) regression problem regularized with a Ridge penalty on the Boston (resp. Diabetes) dataset with n = 506 observations and p = 13 features (resp. n = 442 and p = 10). We display the coverage, length and time of the different methods, averaged over 100 randomly held-out validation data sets.

5 Numerical Experiments

We illustrate the approximation of a full conformal prediction set for both linear and non-linear regression problems, using synthetic and real datasets that are publicly available in sklearn. All experiments were conducted with a coverage level of 0.9 ($\alpha = 0.1$) and a regularization parameter selected by cross-validation on a randomly separated training set (for real data, we used 33% of the data).

In the case of Ridge regression, exact and full conformal prediction sets can be computed without any assumptions [18]. We show in Figure 1 the conformal sets for different regularization parameters $\lambda$, together with our proposed method based on an approximate solution for different optimization errors. The results indicate that high precision is not necessary to obtain a conformal set close to the exact one.

For other problem formulations, we define an Oracle as the set $[x_{n+1}^\top \hat\beta(y_{n+1}) \pm \hat Q_{1-\alpha}(y_{n+1})]$ obtained from the estimator trained with machine precision on the oracle data $D_{n+1}(y_{n+1})$ (the target variable $y_{n+1}$ is not available in practice). For comparison, we display, averaged over 100 repetitions of randomly held-out validation data sets, the empirical coverage, the length, and the time needed to compute the conformal set with splitting and with our approach.

We illustrate in Table 1 the computational cost of our proposed homotopy for Lasso regression, using the vanilla coordinate descent (CD) solver of sklearn [19]. For a large range of duality-gap accuracies $\epsilon$, the computational time of our method is roughly the same as a single run of CD on the full data set. However, when $\epsilon$ becomes very small ($\approx 10^{-8}$), we lose computational efficiency due to the large complexity $T_\epsilon$.
This is also visible for the regression problems with non-quadratic loss functions reported in Table 2.

The computational times depend only on the data-fitting part and on the computation of the conformity score functions; thus, the computational efficiency is independent of the coverage level $\alpha$. We show in Figure 2 the variations of the length of the conformal prediction set for different coverage levels. Overall, the results indicate that the homotopy method provides valid and near-perfect coverage, regardless of the optimization error $\epsilon$. The lengths of the confidence sets generated by the homotopy methods gradually increase as $\epsilon$ increases, but all of the sets are consistently smaller than those of the splitting approaches. Our experiments showed that high accuracy has only limited benefits.

Acknowledgments

We would like to thank the reviewers for their valuable feedback and detailed comments, which contributed to improving the quality of this paper. This work was partially supported by MEXT KAKENHI (17H00758, 16H06538), JST CREST (JPMJCR1502), the RIKEN Center for Advanced Intelligence Project, and the JST support program for starting up innovation-hub on materials research by information integration initiative.

References

[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 2012.
[2] V. Balasubramanian, S-S. Ho, and V. Vovk. Conformal prediction for reliable machine learning: theory, adaptations and applications. Elsevier, 2014.
[3] J. Bröcker and H. Kantz. The concept of exchangeability in ensemble forecasting. Nonlinear Processes in Geophysics, 2011.
[4] Y-C. Chang and W-L. Hung. Linex loss functions with applications to determining the optimum process parameters. Quality & Quantity, 2007.
[5] W. Chen, K-J. Chun, and R. F. Barber. Discretized conformal prediction for efficient distribution-free inference. Stat, 2018.
[6] P. Garrigues and L. El Ghaoui. An homotopy algorithm for the lasso with online observations. In Advances in Neural Information Processing Systems, pages 489-496, 2009.
[7] B. Gärtner, M. Jaggi, and C. Maria. An exponential lower bound on the complexity of regularization paths. Journal of Computational Geometry, 2012.
[8] J. Giesen, J. K. Müller, S. Laue, and S. Swiercy. Approximating concavely parameterized optimization problems. In Advances in Neural Information Processing Systems, 2012.
[9] M. Gruber. Regression estimators: A comparative study. JHU Press, 2010.
[10] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970.
[11] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S. Saha, G. White, J. Woollen, et al. The NCEP/NCAR 40-year reanalysis project. Bulletin of the American Meteorological Society, 1996.
[12] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer Science & Business Media, 2006.
[13] J. Lei. Fast exact conformalization of lasso using piecewise linear homotopy. Biometrika, 2019.
[14] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 2018.
[15] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparsity enforcing penalties. Journal of Machine Learning Research, 2017.
[16] E. Ndiaye, T.
Le, O. Fercoq, J. Salmon, and I. Takeuchi. Safe grid search with optimal complexity. ICML, 2019.
[17] Y. Nesterov. Introductory lectures on convex optimization. Kluwer Academic Publishers, 2004.
[18] I. Nouretdinov, T. Melluish, and V. Vovk. Ridge regression confidence machine. ICML, 2001.
[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.
[20] R. T. Rockafellar. Convex analysis. Princeton University Press, 1997.
[21] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 2008.
[22] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
[23] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic learning in a random world. Springer, 2005.
", "award": [], "sourceid": 798, "authors": [{"given_name": "Eugene", "family_name": "Ndiaye", "institution": "Riken AIP"}, {"given_name": "Ichiro", "family_name": "Takeuchi", "institution": "Nagoya Institute of Technology"}]}