{"title": "Generalization Bounds for Uniformly Stable Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 9747, "page_last": 9757, "abstract": "Uniform stability of a learning algorithm is a classical notion of algorithmic stability introduced to derive high-probability bounds on the generalization error (Bousquet and Elisseeff, 2002).  Specifically, for a loss function with range bounded in $[0,1]$, the generalization error of $\\gamma$-uniformly stable learning algorithm on $n$ samples is known to be at most $O((\\gamma +1/n) \\sqrt{n \\log(1/\\delta)})$ with probability at least $1-\\delta$. Unfortunately, this bound does not lead to meaningful generalization bounds in many common settings where $\\gamma \\geq 1/\\sqrt{n}$. At the same time the bound is known to be tight only when $\\gamma = O(1/n)$.\n  Here we prove substantially stronger generalization bounds for uniformly stable algorithms without any additional assumptions. First, we show that the generalization error in this setting is at most $O(\\sqrt{(\\gamma + 1/n) \\log(1/\\delta)})$ with probability at least $1-\\delta$. In addition, we prove a tight bound of $O(\\gamma^2 + 1/n)$ on the second moment of the generalization error. The best previous bound on the second moment of the generalization error is $O(\\gamma + 1/n)$. Our proofs are based on new analysis techniques and our results imply substantially stronger generalization guarantees for several well-studied algorithms.", "full_text": "Generalization Bounds for Uniformly Stable\n\nAlgorithms\n\nVitaly Feldman\nGoogle Brain\n\nJan Vondrak\n\nStanford University\n\nAbstract\n\nis known to be within O((\u03b3 + 1/n)(cid:112)n log(1/\u03b4)) of the empirical error with\n\nUniform stability of a learning algorithm is a classical notion of algorithmic stability\nintroduced to derive high-probability bounds on the generalization error (Bousquet\nand Elisseeff, 2002). 
Speci\ufb01cally, for a loss function with range bounded in [0, 1],\nthe generalization error of a \u03b3-uniformly stable learning algorithm on n samples\nprobability at least 1 \u2212 \u03b4. Unfortunately, this bound does not lead to meaningful\n\u221a\ngeneralization bounds in many common settings where \u03b3 \u2265 1/\nn. At the same\ntime the bound is known to be tight only when \u03b3 = O(1/n).\nWe substantially improve generalization bounds for uniformly stable algorithms\nwithout making any additional assumptions. First, we show that the bound in this\n\nsetting is O((cid:112)(\u03b3 + 1/n) log(1/\u03b4)) with probability at least 1 \u2212 \u03b4. In addition,\n\nwe prove a tight bound of O(\u03b32 + 1/n) on the second moment of the estimation\nerror. The best previous bound on the second moment is O(\u03b3 + 1/n). Our proofs\nare based on new analysis techniques and our results imply substantially stronger\ngeneralization guarantees for several well-studied algorithms.\n\n1\n\nIntroduction\n\n\u221a\n\nWe consider the basic problem of estimating the generalization error of learning algorithms. Over\nthe last couple of decades, a remarkably rich and deep theory has been developed for bounding\nthe generalization error via notions of complexity of the class of models (or predictors) output by\nthe learning algorithm. At the same time, for a variety of learning algorithms this theory does not\nprovide satisfactory bounds (even as compared with other theoretical analyses). Most notable among\nthese are continuous optimization algorithms that play the central role in modern machine learning.\nFor example, the standard generalization error bounds for stochastic gradient descent (SGD) on\nconvex Lipschitz functions cannot be obtained by proving uniform convergence for all empirical risk\nminimizers (ERM) [13, 26]. 
Specifically, there exist empirical risk minimizing algorithms whose generalization error is $\sqrt{d}$ times larger than the generalization error of SGD, where $d$ is the dimension of the problem (without the Lipschitzness assumption the gap is infinite even for $d = 2$) [13]. This disparity stems from the fact that uniform convergence bounds largely ignore the way in which the model output by the algorithm depends on the data. We note that in the restricted setting of generalized linear models one can obtain tight generalization bounds via uniform convergence [15].

Another classical and popular approach to proving generalization bounds is to analyze the stability of the learning algorithm to changes in the dataset. This approach has been used to obtain relatively strong generalization bounds for several convex optimization algorithms. For example, the seminal works of Bousquet and Elisseeff [4] and Shalev-Shwartz et al. [26] demonstrate that for strongly convex losses the ERM solution is stable. The use of stability is also implicit in standard analyses of online convex optimization [26] and online-to-batch conversion [5]. More recently, Hardt et al. [14] showed that for convex smooth losses the solution obtained via (stochastic) gradient descent is stable. They also conjectured that stability can be used to understand the generalization properties of algorithms used for training deep neural networks.

While a variety of notions of stability have been proposed and analyzed, most only lead to bounds on the expectation or the second moment of the estimation error over the random choice of the dataset (where estimation error refers to the difference between the true generalization error and the empirical error).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
In contrast, generalization bounds based on uniform convergence show that the estimation error is small with high probability (more formally, the distribution of the error has exponentially decaying tails). This discrepancy was first addressed by Bousquet and Elisseeff [4], who defined the notion of uniform stability.

Definition 1.1. Let $A : Z^n \to \mathcal{F}$ be a learning algorithm mapping a dataset $S$ to a model in $\mathcal{F}$ and $\ell : \mathcal{F} \times Z \to \mathbb{R}$ be a function such that $\ell(f, z)$ measures the loss of model $f$ on point $z$. Then $A$ is said to have uniform stability $\gamma_n$ with respect to $\ell$ if for any pair of datasets $S, S' \in Z^n$ that differ in a single element and every $z \in Z$, $|\ell(A(S), z) - \ell(A(S'), z)| \leq \gamma_n$.

We denote the empirical loss of the algorithm $A$ on $S = (S_1, \ldots, S_n)$ by $\mathcal{E}_S[\ell(A(S))] \doteq \frac{1}{n} \sum_{i=1}^{n} \ell(A(S), S_i)$ and its expected loss relative to a distribution $P$ over $Z$ by $\mathcal{E}_P[\ell(A(S))] \doteq \mathbf{E}_{z \sim P}[\ell(A(S), z)]$. We denote the estimation error¹ of $A$ on $S$ relative to $P$ by
$$\Delta_{P-S}(\ell(A)) \doteq \mathcal{E}_P[\ell(A(S))] - \mathcal{E}_S[\ell(A(S))].$$

We summarize the generalization properties of uniform stability below (all proved in [4], although properties (1) and (2) are implicit in earlier work and also hold under weaker stability notions). Let $A : Z^n \to \mathcal{F}$ be a learning algorithm that has uniform stability $\gamma_n$ with respect to a loss function $\ell : \mathcal{F} \times Z \to [0, 1]$.
Then for every distribution $P$ over $Z$ and $\delta > 0$:
$$\left| \mathbf{E}_{S \sim P^n}[\Delta_{P-S}(\ell(A))] \right| \leq \gamma_n; \quad (1)$$
$$\mathbf{E}_{S \sim P^n}\left[ (\Delta_{P-S}(\ell(A)))^2 \right] \leq \frac{1}{2n} + 6\gamma_n; \quad (2)$$
$$\Pr_{S \sim P^n}\left[ \Delta_{P-S}(\ell(A)) \geq \left( 4\gamma_n + \frac{1}{n} \right) \sqrt{\frac{n \ln(1/\delta)}{2}} + 2\gamma_n \right] \leq \delta. \quad (3)$$

As can be readily seen from eq. (3), the high probability bound is at least a factor $\sqrt{n}$ larger than the expectation of the estimation error. In addition, the bound on the estimation error implied by eq. (2) is quadratically worse than the stability parameter. We note that eq. (1) does not imply that $\mathcal{E}_P[\ell(A(S))] \leq \mathcal{E}_S[\ell(A(S))] + O(\gamma_n/\delta)$ with probability at least $1 - \delta$, since $\Delta_{P-S}(\ell(A))$ can be negative and Markov's inequality cannot be used. Such a "low-probability" result is known only for ERM algorithms, for which Shalev-Shwartz et al. [26] showed that
$$\mathbf{E}_{S \sim P^n}\left[ |\Delta_{P-S}(\ell(A))| \right] \leq O\left( \gamma_n + \frac{1}{\sqrt{n}} \right). \quad (4)$$

Naturally, for most algorithms the stability parameter needs to be balanced against the guarantees on the empirical error. For example, the ERM solution to convex learning problems can be made uniformly stable by adding a strongly convex term to the objective [26]. This change in the objective introduces an error. In the other example, the stability parameter of gradient descent on smooth objectives is determined by the sum of the rates used for all the gradient steps [14]. Limiting the sum limits the empirical error that can be achieved.
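For intuition, the classical tail bound in eq. (3) can be evaluated numerically. The following sketch takes the constants directly from eq. (3); the sample size and confidence level are arbitrary illustrative choices. It shows that the bound is meaningful when $\gamma_n = O(1/n)$ but exceeds the range of a $[0, 1]$-valued loss when $\gamma_n = 1/\sqrt{n}$:

```python
import math

def tail_bound_eq3(gamma, n, delta):
    """Right-hand side of the event in eq. (3):
    (4*gamma + 1/n) * sqrt(n * ln(1/delta) / 2) + 2*gamma."""
    return (4 * gamma + 1 / n) * math.sqrt(n * math.log(1 / delta) / 2) + 2 * gamma

n, delta = 10**6, 0.01
# Stability gamma_n = 1/n: the bound is small and meaningful.
print(tail_bound_eq3(1 / n, n, delta))             # ~ 0.0076
# Stability gamma_n = 1/sqrt(n): the bound exceeds 1, vacuous for a [0, 1] loss.
print(tail_bound_eq3(1 / math.sqrt(n), n, delta))  # ~ 6.07
```

This is precisely the regime $\gamma_n \geq 1/\sqrt{n}$, common in the convex optimization examples above, in which eq. (3) gives no non-trivial guarantee.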
In both of those examples the optimal expected error can only be achieved when $\gamma_n = \Theta(1/\sqrt{n})$ (which is also the expected suboptimality of the solutions). Unfortunately, in this setting, eq. (3) gives a vacuous bound and only "low-probability" generalization bounds are known for the first example (since it is ERM and eq. (4) applies).

This raises a natural question of whether the known bounds in eq. (2) and eq. (3) are optimal. In particular, Shalev-Shwartz et al. [26] conjecture that better high probability bounds can be achieved.

¹Also referred to as the generalization gap in several recent works.

It is easy to see that the expectation of the absolute value of the estimation error can be at least $\gamma_n + \frac{1}{\sqrt{n}}$. Consequently, as observed already in [4], eq. (3) is optimal when $\gamma_n = O(1/n)$. (Note that this is the optimal level of stability for non-trivial learning algorithms with $\ell$ normalized to $[0, 1]$.) Yet both bounds in eq. (2) and eq. (3) are significantly larger than this lower bound whenever $\gamma_n = \omega(1/n)$. At the same time, to the best of our knowledge, no other upper or lower bounds on the estimation error of uniformly stable algorithms were previously known.

1.1 Our Results

We give two new upper bounds on the estimation error of uniformly stable learning algorithms. Specifically, our bound on the second moment of the estimation error is $O(\gamma_n^2 + 1/n)$, matching (up to a constant) the simple lower bound of $\gamma_n + \frac{1}{\sqrt{n}}$ on the first moment. Our high probability bound improves the rate from $\sqrt{n}(\gamma_n + 1/n)$ to $\sqrt{\gamma_n + 1/n}$. This rate is non-vacuous for any non-trivial stability parameter $\gamma_n = o(1)$ and matches the rate that was previously known only for the second moment (eq. (2)).

For convenience and generality we state our bounds on the estimation error for arbitrary data-dependent functions (and not just losses of models).
Specifically, let $M : Z^n \times Z \to \mathbb{R}$ be an algorithm that is given a dataset $S$ and a point $z$ as an input. It can be thought of as computing a real-valued function $M(S, \cdot)$ and then applying it to $z$. In the case of learning algorithms $M(S, z) = \ell(A(S), z)$, but this notion also captures other data statistics whose choice may depend on the data. We denote the empirical mean by $\mathcal{E}_S[M(S)] \doteq \frac{1}{n} \sum_{i=1}^{n} M(S, S_i)$, the expectation relative to a distribution $P$ over $Z$ by $\mathcal{E}_P[M(S)] \doteq \mathbf{E}_{z \sim P}[M(S, z)]$, and the estimation error by
$$\Delta_{P-S}(M) \doteq \mathcal{E}_P[M(S)] - \mathcal{E}_S[M(S)].$$
Uniform stability for data-dependent functions is defined analogously (Def. 2.1).

Theorem 1.2. Let $M : Z^n \times Z \to [0, 1]$ be a data-dependent function with uniform stability $\gamma_n$. Then for any probability distribution $P$ over $Z$ and any $\delta \in (0, 1)$:
$$\mathbf{E}_{S \sim P^n}\left[ (\Delta_{P-S}(M))^2 \right] \leq 16\gamma_n^2 + \frac{2}{n}; \quad (5)$$
$$\Pr_{S \sim P^n}\left[ \Delta_{P-S}(M) \geq 8 \sqrt{\left( 2\gamma_n + \frac{1}{n} \right) \cdot \ln(8/\delta)} \right] \leq \delta. \quad (6)$$

The results in Theorem 1.2 are stated only for deterministic functions (or algorithms). They can be extended to randomized algorithms in several standard ways [12, 26]. If $M$ is uniformly $\gamma$-stable with high probability over the choice of its random bits then one can obtain a statement which holds with high probability over the choice of both $S$ and the random bits (e.g. [19]). Alternatively, one can always consider the function $M'(S, z) = \mathbf{E}_M[M(S, z)]$. If $M'(S, z)$ is uniformly $\gamma$-stable then Thm. 1.2 can be applied to it. The resulting statement will be only about the expected value of the estimation error, with expectation taken over the randomness of the algorithm.
Further, if $M$ is used with independent randomness in each evaluation of $M(S, S_i)$ then the empirical mean $\mathcal{E}_S[M(S)]$ will be strongly concentrated around $\mathcal{E}_S[M'(S)]$ (whenever the variance of each evaluation is not too large). We note that randomized algorithms also allow us to extend the notion of uniform stability to binary classification algorithms by considering the expectation of the 0/1 loss.

A natural and, we believe, important question left open by our work is whether the high probability result in eq. (6) is tight.

Our techniques. The high-probability generalization result in [4] (eq. (3)) is based on a simple observation that, as a function of $S$, $\Delta_{P-S}(M)$ has the bounded differences property. Replacing any element of $S$ can change $\Delta_{P-S}(M)$ by at most $2\gamma_n + 1/n$ (where $\gamma_n$ comes from changing the function $M(S, \cdot)$ to $M(S', \cdot)$ and $1/n$ comes from the change in one of the points on which this function is evaluated). Applying McDiarmid's concentration inequality immediately implies concentration with rate $\sqrt{n}(2\gamma_n + 1/n)$ around the expectation. The expectation, in turn, is small by eq. (1). In contrast, our approach uses stability itself as a tool for proving concentration inequalities. It is based on ideas developed in [2] to prove generalization bounds for differentially private algorithms in the context of adaptive data analysis [11]. It was recently shown that this proof approach can be used to re-derive and extend several standard concentration inequalities [23, 27].

At a high level, the first step of the argument reduces the task of proving a bound on the tail of a non-negative real-valued random variable to bounding the expectation of the maximum of multiple independent samples of that random variable.
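The precise form of this reduction appears as Lemma 3.1 in Section 3: for any distribution $Q$ over the reals, $\Pr_{v \sim Q}[v \geq 2 \cdot \mathbf{E}[\max\{0, v_1, \ldots, v_m\}]] \leq \ln(2)/m$. As a sanity check of that statement (a sketch only; the two-point distributions below are an arbitrary illustrative family), the inequality can be verified exactly for finite distributions:

```python
import math

def expected_max(values, probs, m):
    """Exact E[max{0, v_1, ..., v_m}] for m i.i.d. draws from a finite
    distribution on non-negative values, via the CDF of the maximum:
    Pr[max <= v] = F(v)^m."""
    exp_max, cdf, prev_pow = 0.0, 0.0, 0.0
    for v, p in sorted(zip(values, probs)):
        cdf += p
        exp_max += v * (cdf ** m - prev_pow)  # v * Pr[max == v]
        prev_pow = cdf ** m
    return exp_max

def tail(values, probs, t):
    """Pr[v >= t] for a single draw."""
    return sum(p for v, p in zip(values, probs) if v >= t)

# Check Pr[v >= 2 * E[max of m draws]] <= ln(2)/m on two-point distributions
# taking value 1 with probability p and 0 otherwise.
m = 20
for p in [0.01, 0.04, 0.2, 0.5]:
    threshold = 2 * expected_max([0.0, 1.0], [1.0 - p, p], m)
    assert tail([0.0, 1.0], [1.0 - p, p], threshold) <= math.log(2) / m
```

The interesting case is small $p$: once $p$ is noticeably above $1/m$, one of the $m$ draws is likely to hit the tail value, so the expected maximum already reaches it and the tail above twice the expected maximum is empty.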
We then show that from multiple executions of $M$ on independently chosen datasets it is possible to select the execution of $M$ with approximately the largest estimation error in a stable way. That is, uniform stability of $M$ allows us to ensure that the selection procedure is itself uniformly stable. The selection procedure is based on the exponential mechanism [21] and satisfies differential privacy [9] (Def. 2.3). The stability of this procedure allows us to bound the expectation of the estimation error of the execution of $M$ with approximately the largest estimation error (among the multiple executions). This gives us the desired bound on the expectation of the maximum of multiple independent samples of the estimation error random variable. We remark that the multiple executions and an algorithm for selecting among them exist purely for the purposes of the proof technique and do not require any modifications to the algorithm itself.

Our approach to proving the bound on the second moment of the estimation error is based on two ideas. First, we decouple the point on which each $M(S)$ is estimated from $S$ by observing that for every dataset $S$ the empirical mean is within $2\gamma_n$ of the "leave-one-out" estimate of the true mean. Specifically, our leave-one-out estimator is defined as $\mathbf{E}_{z \sim P}\left[ \frac{1}{n} \sum_{i=1}^{n} M(S^{i \leftarrow z}, S_i) \right]$, where $S^{i \leftarrow z}$ denotes replacing the element in $S$ at index $i$ with $z$. We then bound the second moment of the estimation error of the leave-one-out estimate by bounding the effect of dependence between the random variables by $O(\gamma_n^2 + 1/n)$.

Applications. We now apply our bounds on the estimation error to several known uniformly stable algorithms in a straightforward way. Our main focus is learning problems that can be formulated as stochastic convex optimization.
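Before the formal setup, a toy numerical sketch of why strong convexity yields stability. Everything here is an illustrative assumption rather than the setting of the paper: it uses the squared loss on a one-dimensional domain instead of 1-Lipschitz losses over the unit ball. For a $\lambda$-strongly convex regularized ERM objective the minimizer has a closed form, and replacing a single sample moves it by $O(1/(\lambda n))$, the order of the uniform stability of regularized ERM discussed next [4, 26]:

```python
import random

def erm_regularized(samples, lam):
    """Minimizer over w of (1/n) * sum_i (w - z_i)^2 / 2 + (lam / 2) * w^2.
    Setting the derivative to zero gives w = mean(samples) / (1 + lam)."""
    return sum(samples) / len(samples) / (1.0 + lam)

random.seed(0)
n, lam = 1000, 0.1
S = [random.random() for _ in range(n)]  # toy samples in [0, 1]
S_prime = list(S)
S_prime[17] = random.random()            # replace a single sample

w, w_prime = erm_regularized(S, lam), erm_regularized(S_prime, lam)
# Replacing one sample in [0, 1] moves the empirical mean by at most 1/n,
# hence the minimizer by at most 1/(n * (1 + lam)) <= 1/(lam * n).
assert abs(w - w_prime) <= 1.0 / (n * (1.0 + lam)) + 1e-12
print(abs(w - w_prime))
```

The same mechanism, strong convexity limiting how far the minimizer can move when one sample changes, underlies the $1/(\lambda n)$ stability bound used in the applications below.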
Specifically, these are problems in which the goal is to minimize the expected loss $F_P(w) \doteq \mathbf{E}_{z \sim P}[\ell(w, z)]$ over $w \in K \subset \mathbb{R}^d$ for some convex body $K$ and a family of convex losses $\mathcal{F} = \{\ell(\cdot, z)\}_{z \in Z}$. The stochastic convex optimization problem for a family of losses $\mathcal{F}$ over $K$ is the problem of minimizing $F_P(w)$ for an arbitrary distribution $P$ over $Z$.

For concreteness, we consider the well-studied setting in which $\mathcal{F}$ contains 1-Lipschitz convex functions with range in $[0, 1]$ and $K$ is included in the unit ball. In this case ERM with a strongly convex regularizer $\frac{\lambda}{2}\|w\|^2$ has uniform stability of $1/(\lambda n)$ [4, 26]. From here, applying Markov's inequality to eq. (4), Shalev-Shwartz et al. [26] obtain a "low-probability" generalization bound for the solution. Their bound on the true loss is within $O(1/\sqrt{\delta n})$ of the optimum with probability at least $1 - \delta$. Applying eq. (5) with Chebyshev's inequality improves the dependence on $\delta$ quadratically, that is, to $O(1/(\delta^{1/4}\sqrt{n}))$. Further, using eq. (6) we obtain that for an appropriate choice of $\lambda$, the suboptimality of the solution is at most $O(\sqrt{\log(1/\delta)}/n^{1/3})$.

Another algorithm that was shown to be uniformly stable is gradient descent² on sufficiently smooth convex functions [14]. The generalization bounds we obtain for this algorithm are similar to those we get for the strongly convex ERM. We note that for the stability-based analysis in this case even "low-probability" generalization bounds were not known for the optimal error rate of $1/\sqrt{n}$.

Finally, we show that our results can be used to improve the recent bounds on the estimation error of learning algorithms with differentially private prediction.
These are algorithms introduced to model privacy-preserving learning in settings where users only have black-box access to the learned model via a prediction interface [10]. The properties of differential privacy imply that the expectation over the randomness of a predictor $K : (X \times Y)^n \times X \to Y$ of the loss of $K$ at any point $x \in X$ is uniformly stable. Specifically, for an $\epsilon$-differentially private prediction algorithm, every loss function $\ell : Y \times Y \to [0, 1]$, two datasets $S, S' \in (X \times Y)^n$ that differ in a single element, and $(x, y) \in X \times Y$:
$$\left| \mathbf{E}_K[\ell(K(S, x), y)] - \mathbf{E}_K[\ell(K(S', x), y)] \right| \leq e^{\epsilon} - 1.$$
Therefore, our generalization bounds can be directly applied to the data-dependent function $M(S, (x, y)) \doteq \mathbf{E}_K[\ell(K(S, x), y)]$. These bounds can, in turn, be used to get stronger generalization bounds for one of the learning algorithms proposed in [10] (that has unbounded model complexity).

Additional details of these applications can be found in the supplemental material.

²The analysis in [14] focuses on the stochastic gradient descent and derives uniform stability for the expectation of the loss (over the randomness of the algorithm). However, their analysis applies to gradient steps on smooth functions more generally.

1.2 Additional related work

The use of stability for understanding the generalization properties of learning algorithms dates back to the pioneering work of Rogers and Wagner [25]. They showed that the expected sensitivity of a classification algorithm to changes of individual examples can be used to obtain a bound on the variance of the leave-one-out estimator for the k-NN algorithm.
Early work on stability focused on extensions of these results to other "local" algorithms and estimators, primarily on their variance (a notable exception is [8], where high probability bounds on the generalization error of k-NN are proved). See [7] for an overview. In a somewhat similar spirit, stability is also used for analysis of the variance of the k-fold cross-validation estimator [3, 16, 17].

A long line of work focuses on the relationship between various notions of stability and learnability in the supervised setting (see [24, 26] for an overview). This work employs relatively weak notions of average stability and derives a variety of asymptotic equivalence results. The results in [4] on uniform stability and their applications to generalization properties of strongly convex ERM algorithms have been extended and generalized in several directions (e.g. [18, 28, 30]). Maurer [20] considers generalization bounds for a special case of linear regression with a strongly convex regularizer and a sufficiently smooth loss function. Their bounds are data-dependent and are potentially stronger for large values of the regularization parameter (and hence stability). However, the bound is vacuous when the stability parameter is larger than $n^{-1/4}$ and hence is not directly comparable to ours. Finally, recent work of Abou-Moustafa and Szepesvári [1] gives high-probability generalization bounds similar to those in [4] but using a bound on a high-order moment of stability instead of the uniform stability. We also remark that all these works are based on techniques different from ours.

Uniform stability plays an important role in privacy-preserving learning since a differentially private learning algorithm can usually be obtained by adding noise to the output of a uniformly stable one (e.g. [6, 10, 29]).

2 Preliminaries

For a domain $Z$, a dataset $S \in Z^n$ is an $n$-tuple of elements in $Z$.
We refer to the element with index $i$ by $S_i$ and by $S^{i \leftarrow z}$ to the dataset obtained from $S$ by setting the element with index $i$ to $z$. We refer to a function that takes as an input a dataset $S \in Z^n$ and a point $z \in Z$ as a data-dependent function over $Z$. We think of data-dependent functions as outputs of an algorithm that takes $S$ as an input. For example, in supervised learning $Z$ is the set of all possible labeled examples $Z = X \times Y$ and the algorithm $M$ is defined as estimating some loss function $\ell_Y : Y \times Y \to \mathbb{R}^+$ of the model $h_S$ output by a learning algorithm $A(S)$ on example $z = (x, y)$. That is, $M(S, z) = \ell_Y(h_S(x), y)$. Note that in this setting $\mathcal{E}_P[M(S)]$ is exactly the true loss of $h_S$ on data distribution $P$, whereas $\mathcal{E}_S[M(S)]$ is the empirical loss of $h_S$.

Definition 2.1. A data-dependent function $M : Z^n \times Z \to \mathbb{R}$ has uniform stability $\gamma$ if for all $S \in Z^n$, $i \in [n]$, $z_i, z \in Z$, $|M(S, z) - M(S^{i \leftarrow z_i}, z)| \leq \gamma$.

This definition is equivalent to $M(S, z)$ having sensitivity $\gamma$, or $\gamma$-bounded differences, for all $z \in Z$.

Definition 2.2. A real-valued function $f : Z^n \to \mathbb{R}$ has sensitivity at most $\gamma$ if for all $S \in Z^n$, $i \in [n]$ and $z_i \in Z$, $|f(S) - f(S^{i \leftarrow z_i})| \leq \gamma$.

We will also rely on several elementary properties of differential privacy [9]. In this context differential privacy is simply a form of uniform stability for randomized algorithms.

Definition 2.3 ([9]).
An algorithm $A : Z^n \to \mathcal{Y}$ is $\epsilon$-differentially private if, for all datasets $S, S' \in Z^n$ that differ on a single element,
$$\forall E \subseteq \mathcal{Y}, \quad \Pr[A(S) \in E] \leq e^{\epsilon} \Pr[A(S') \in E].$$

3 Generalization with Exponential Tails

Our approach to proving the high-probability generalization bounds is based on the technique introduced by Nissim and Stemmer [22] (see [2]) to show that differentially private algorithms have strong generalization properties. It has recently been pointed out by Steinke and Ullman [27] that this approach can be used to re-derive the standard Bernstein, Hoeffding, and Chernoff concentration inequalities. Nissim and Stemmer [23] used the same approach to generalize McDiarmid's inequality to functions with unbounded (or high) sensitivity.

We prove a bound on the tail of a random variable by bounding the expectation of the maximum of multiple independent samples of the random variable. Specifically, we rely on the following simple lemma (see [27] for a proof).

Lemma 3.1. Let $Q$ be a probability distribution over the reals. Then
$$\Pr_{v \sim Q}\left[ v \geq 2 \cdot \mathbf{E}_{v_1, \ldots, v_m \sim Q}\left[ \max\{0, v_1, v_2, \ldots, v_m\} \right] \right] \leq \frac{\ln(2)}{m}.$$

The second step relies on the relationship between the maximum of a set of values and the value chosen by the soft-argmax, which we refer to as the stable-max. Specifically, we define
$$\mathrm{stablemax}_{\epsilon}\{v_1, \ldots, v_m\} \doteq \sum_{i \in [m]} v_i \cdot \frac{e^{\epsilon v_i}}{\sum_{\ell \in [m]} e^{\epsilon v_\ell}},$$
where $\frac{e^{\epsilon v_i}}{\sum_{\ell \in [m]} e^{\epsilon v_\ell}}$ should be thought of as the relative weight assigned to value $v_i$. (We remark that this vector of weights is commonly referred to as softmax and soft-argmax. We therefore use stable-max to avoid confusion between the weights and the weighted sum of values.) The first property of the stable-max is that its value is close to the maximum:
$$\mathrm{stablemax}_{\epsilon}\{v_1, \ldots, v_m\} \geq \max\{v_1, \ldots, v_m\} - \frac{\ln m}{\epsilon}.$$
The second property that we will use is that the weight (or probability) assigned to each value is stable: it changes by a factor of at most $e^{2\gamma\epsilon}$ whenever each of the values changes by at most $\gamma$. These two properties are known properties of the exponential mechanism [21]. More formally, the exponential mechanism is the randomized algorithm that, given values $\{v_1, \ldots, v_m\}$ and $\epsilon$, outputs the index $i$ with probability $\frac{e^{\epsilon v_i}}{\sum_{\ell \in [m]} e^{\epsilon v_\ell}}$. We state the properties of the exponential mechanism specialized to our context below.

Theorem 3.2 ([2, 21]). Let $f_1, \ldots, f_m : Z^n \to \mathbb{R}$ be $m$ scoring functions of a dataset, each of sensitivity at most $\Delta$. Let $A$ be the algorithm that, given a dataset $S \in Z^n$ and a parameter $\epsilon > 0$, outputs an index $\ell \in [m]$ with probability proportional to $e^{\frac{\epsilon}{2\Delta} \cdot f_\ell(S)}$. Then $A$ is $\epsilon$-differentially private and, further, for every $S \in Z^n$:
$$\mathbf{E}_{\ell = A(S)}[f_\ell(S)] \geq \max_{\ell \in [m]}\{f_\ell(S)\} - \frac{2\Delta}{\epsilon} \cdot \ln m.$$

We now define the scoring functions designed to select the execution of $M$ with the worst estimation error. For these purposes our dataset will consist of $m$ datasets, each of size $n$. To avoid confusion, we emphasize this by referring to it as a multi-dataset and using $\mathbf{S}$ to denote it. That is, $\mathbf{S} \in Z^{m \times n}$, and we refer to the sub-datasets as $\mathbf{S}_1, \ldots, \mathbf{S}_m$ and to element $i$ of sub-dataset $\ell$ as $\mathbf{S}_{\ell,i}$.

Lemma 3.3. Let $M : Z^n \times Z \to [0, 1]$ be a data-dependent function with uniform stability $\gamma$.
For a probability distribution $P$ over $Z$, a multi-dataset $\mathbf{S} \in Z^{m \times n}$ and an index $\ell \in [m]$ we define the scoring function
$$f_\ell(\mathbf{S}) \doteq \Delta_{P-\mathbf{S}_\ell}(M) = \mathcal{E}_P[M(\mathbf{S}_\ell)] - \mathcal{E}_{\mathbf{S}_\ell}[M(\mathbf{S}_\ell)].$$
Then $f_\ell$ has sensitivity $2\gamma + 1/n$.

Proof. Let $\mathbf{S}$ and $\mathbf{S}'$ be two multi-datasets that differ in a single element at index $i$ in sub-dataset $k$. Clearly, if $k \neq \ell$ then $\mathbf{S}_\ell = \mathbf{S}'_\ell$ and $f_\ell(\mathbf{S}) = f_\ell(\mathbf{S}')$. Otherwise, $\mathbf{S}_\ell$ and $\mathbf{S}'_\ell$ differ in a single element. Thus
$$\left| \mathcal{E}_P[M(\mathbf{S}_\ell)] - \mathcal{E}_P[M(\mathbf{S}'_\ell)] \right| = \left| \mathbf{E}_{z \sim P}\left[ M(\mathbf{S}_\ell, z) - M(\mathbf{S}'_\ell, z) \right] \right| \leq \gamma$$
and
$$\left| \mathcal{E}_{\mathbf{S}_\ell}[M(\mathbf{S}_\ell)] - \mathcal{E}_{\mathbf{S}'_\ell}[M(\mathbf{S}'_\ell)] \right| = \left| \frac{1}{n} \sum_{j \in [n]} M(\mathbf{S}_\ell, \mathbf{S}_{\ell,j}) - \frac{1}{n} \sum_{j \in [n]} M(\mathbf{S}'_\ell, \mathbf{S}'_{\ell,j}) \right|$$
$$\leq \left| \frac{1}{n} \sum_{j \in [n], j \neq i} \left( M(\mathbf{S}_\ell, \mathbf{S}_{\ell,j}) - M(\mathbf{S}'_\ell, \mathbf{S}_{\ell,j}) \right) \right| + \frac{1}{n} \cdot \left| M(\mathbf{S}_\ell, \mathbf{S}_{\ell,i}) - M(\mathbf{S}'_\ell, \mathbf{S}'_{\ell,i}) \right| \leq \gamma + \frac{1}{n}.$$

The final (and new) ingredient of our proof is a bound on the expected estimation error of any uniformly stable algorithm on a sub-dataset chosen in a differentially private way.

Lemma 3.4. For $\ell \in [m]$, let $M_\ell : Z^n \times Z \to [0, 1]$ be a data-dependent function with uniform stability $\gamma$.
Let $A : Z^{n \times m} \to [m]$ be an $\epsilon$-differentially private algorithm. Then for any distribution $P$ over $Z$, we have that:
$$e^{-\epsilon} V_{\mathbf{S}} - \gamma \leq \mathbf{E}_{\mathbf{S} \sim P^{mn}, \ell = A(\mathbf{S})}\left[ \mathcal{E}_P[M_\ell(\mathbf{S}_\ell)] \right] \leq e^{\epsilon} V_{\mathbf{S}} + \gamma,$$
where $V_{\mathbf{S}} \doteq \mathbf{E}_{\mathbf{S} \sim P^{mn}, \ell = A(\mathbf{S})}\left[ \mathcal{E}_{\mathbf{S}_\ell}[M_\ell(\mathbf{S}_\ell)] \right]$.

Proof.
$$V_{\mathbf{S}} = \mathbf{E}_{\mathbf{S} \sim P^{mn}, \ell = A(\mathbf{S})}\left[ \frac{1}{n} \sum_{i \in [n]} M_\ell(\mathbf{S}_\ell, \mathbf{S}_{\ell,i}) \right]$$
$$= \mathbf{E}_{A, \mathbf{S} \sim P^{mn}}\left[ \frac{1}{n} \sum_{i \in [n]} \sum_{\ell \in [m]} \mathbf{1}(A(\mathbf{S}) = \ell) \cdot M_\ell(\mathbf{S}_\ell, \mathbf{S}_{\ell,i}) \right]$$
$$= \frac{1}{n} \sum_{i \in [n]} \sum_{\ell \in [m]} \mathbf{E}_{\mathbf{S} \sim P^{mn}}\left[ \mathbf{E}_A[\mathbf{1}(A(\mathbf{S}) = \ell)] \cdot M_\ell(\mathbf{S}_\ell, \mathbf{S}_{\ell,i}) \right]$$
$$\leq \frac{1}{n} \sum_{i \in [n]} \sum_{\ell \in [m]} \mathbf{E}_{\mathbf{S} \sim P^{mn}, z \sim P}\left[ e^{\epsilon} \cdot \mathbf{E}_A\left[ \mathbf{1}(A(\mathbf{S}^{\ell,i \leftarrow z}) = \ell) \right] \cdot \left( M_\ell(\mathbf{S}_\ell^{i \leftarrow z}, \mathbf{S}_{\ell,i}) + \gamma \right) \right]$$
$$= \frac{1}{n} \sum_{i \in [n]} \sum_{\ell \in [m]} \mathbf{E}_{\mathbf{S} \sim P^{mn}, z \sim P}\left[ e^{\epsilon} \cdot \mathbf{E}_A[\mathbf{1}(A(\mathbf{S}) = \ell)] \cdot \left( M_\ell(\mathbf{S}_\ell, z) + \gamma \right) \right]$$
$$= \mathbf{E}_{\mathbf{S} \sim P^{mn}, z \sim P, \ell = A(\mathbf{S})}\left[ e^{\epsilon} \cdot (M_\ell(\mathbf{S}_\ell, z) + \gamma) \right] = e^{\epsilon} \cdot \left( \mathbf{E}_{\mathbf{S} \sim P^{mn}, z \sim P, \ell = A(\mathbf{S})}[M_\ell(\mathbf{S}_\ell, z)] + \gamma \right).$$
This gives the left hand side of the stated inequality. The right hand side is obtained analogously.

We are now ready to put the ingredients together to prove the claimed result:

Proof of eq. (6) in Theorem 1.2. We choose $m = \ln(2)/\delta$. Let $f_1, \ldots, f_m$ be the scoring functions defined in Lemma 3.3. Let $f_{m+1}(\mathbf{S}) \equiv 0$. Let $A$ be the execution of the exponential mechanism with $\Delta = 2\gamma + 1/n$ on scoring functions $f_1, \ldots, f_{m+1}$ and $\epsilon$ to be defined later. Note that this corresponds to the setting of Lemma 3.4 with $M_\ell \equiv M$ for all $\ell \in [m]$ and $M_{m+1} \equiv 0$. By Lemma 3.4 we have that
$$\mathbf{E}_{\mathbf{S} \sim P^{(m+1)n}}\left[ \mathbf{E}_{\ell = A(\mathbf{S})}[f_\ell(\mathbf{S})] \right] = \mathbf{E}_{\mathbf{S} \sim P^{(m+1)n}, \ell = A(\mathbf{S})}\left[ \mathcal{E}_P[M_\ell(\mathbf{S}_\ell)] - \mathcal{E}_{\mathbf{S}_\ell}[M_\ell(\mathbf{S}_\ell)] \right] \leq e^{\epsilon} - 1 + \gamma.$$
By Theorem 3.2,
$$\mathbf{E}_{\mathbf{S} \sim P^{(m+1)n}}\left[ \max\left\{ 0, \max_{\ell \in [m]} \left( \mathcal{E}_P[M(\mathbf{S}_\ell)] - \mathcal{E}_{\mathbf{S}_\ell}[M(\mathbf{S}_\ell)] \right) \right\} \right] = \mathbf{E}_{\mathbf{S} \sim P^{(m+1)n}}\left[ \max_{\ell \in [m+1]} f_\ell(\mathbf{S}) \right]$$
$$\leq \mathbf{E}_{\mathbf{S} \sim P^{(m+1)n}}\left[ \mathbf{E}_{\ell = A(\mathbf{S})}[f_\ell(\mathbf{S})] \right] + \frac{2\Delta}{\epsilon} \ln(m+1) \leq e^{\epsilon} - 1 + \gamma + \frac{4\gamma + 2/n}{\epsilon} \ln(m+1).$$
To bound this expression we choose $\epsilon = \sqrt{\left( 2\gamma + \frac{1}{n} \right) \cdot \ln(m+1)}$. Our bound is at least $2\epsilon$ and hence holds trivially if $\epsilon \geq 1/2$. Otherwise $(e^{\epsilon} - 1) \leq 2\epsilon$ and we obtain the following bound on the expectation of the maximum:
$$4\sqrt{\left( 2\gamma + \frac{1}{n} \right) \cdot \ln(e \ln(2)/\delta)} + \gamma \leq 4\sqrt{\left( 2\gamma + \frac{1}{n} \right) \cdot \ln(8/\delta)},$$
where we used that $\gamma \leq \sqrt{\gamma}$. Finally, plugging this bound into Lemma 3.1 we obtain that
$$\Pr_{S \sim P^n}\left[ \mathcal{E}_P[M(S)] - \mathcal{E}_S[M(S)] \geq 8\sqrt{\left( 2\gamma + \frac{1}{n} \right) \cdot \ln(8/\delta)} \right] \leq \frac{\ln(2)}{m} \leq \delta.$$

4 Second Moment of the Estimation Error

In this section we prove eq. (5) of Theorem 1.2.
It will be more convenient to directly work with the unbiased version of $M$. Specifically, we define $L(S, z) \doteq M(S, z) - \mathbf{E}_{\mathcal{P}}[M(S)]$. Clearly, $L$ is unbiased with respect to $\mathcal{P}$ in the sense that for every $S \in Z^n$, $\mathbf{E}_{\mathcal{P}}[L(S)] = 0$. Note that if the range of $M$ is $[0,1]$ then the range of $L$ is $[-1,1]$. Further, $L$ has uniform stability of at most $2\gamma$ since for two datasets $S$ and $S'$ that differ in a single element,
\[
\bigl|\mathbf{E}_{\mathcal{P}}[M(S)] - \mathbf{E}_{\mathcal{P}}[M(S')]\bigr| \le \bigl|\mathbf{E}_{z \sim \mathcal{P}}[M(S, z) - M(S', z)]\bigr| \le \gamma.
\]
Observe that
\[
\Delta_{\mathcal{P}-S}(M(S)) = \frac{1}{n} \sum_{i=1}^{n} \bigl(\mathbf{E}_{\mathcal{P}}[M(S)] - M(S, S_i)\bigr) = \frac{-1}{n} \sum_{i=1}^{n} L(S, S_i) = -\mathbf{E}_S[L(S)]. \tag{7}
\]
By eq. (7) we obtain that
\[
\mathbf{E}_{S \sim \mathcal{P}^n}\Bigl[\bigl(\Delta_{\mathcal{P}-S}(M(S))\bigr)^2\Bigr] = \mathbf{E}_{S \sim \mathcal{P}^n}\Bigl[\bigl(\mathbf{E}_S[L(S)]\bigr)^2\Bigr].
\]
Therefore eq. (5) of Theorem 1.2 will follow immediately from the following lemma (by using it with stability $2\gamma$).

Lemma 4.1. Let $L \colon Z^n \times Z \to [-1,1]$ be a data-dependent function with uniform stability $\gamma$ and let $\mathcal{P}$ be an arbitrary distribution over $Z$. If $L$ is unbiased with respect to $\mathcal{P}$ then:
\[
\mathbf{E}_{S \sim \mathcal{P}^n}\Bigl[\bigl(\mathbf{E}_S[L(S)]\bigr)^2\Bigr] \le 4\gamma^2 + \frac{2}{n}.
\]

Our proof starts by first establishing this result for the leave-one-out estimate.

Lemma 4.2.
For a data-dependent function $L \colon Z^n \times Z \to [-1,1]$, a dataset $S \in Z^n$ and a distribution $\mathcal{P}$, define
\[
\mathbf{E}_S\bigl[L\bigl(S^{\leftarrow \mathcal{P}}\bigr)\bigr] \doteq \frac{1}{n} \sum_{i \in [n]} \mathbf{E}_{z \sim \mathcal{P}}\bigl[L(S^{i \leftarrow z}, S_i)\bigr].
\]
If $L$ has uniform stability $\gamma$ and is unbiased with respect to $\mathcal{P}$ then:
\[
\mathbf{E}_{S \sim \mathcal{P}^n}\Bigl[\bigl(\mathbf{E}_S\bigl[L\bigl(S^{\leftarrow \mathcal{P}}\bigr)\bigr]\bigr)^2\Bigr] \le \gamma^2 + \frac{1}{n}.
\]

Proof.
\begin{align*}
\mathbf{E}_{S \sim \mathcal{P}^n}\Bigl[\bigl(\mathbf{E}_S\bigl[L\bigl(S^{\leftarrow \mathcal{P}}\bigr)\bigr]\bigr)^2\Bigr]
&\le \mathbf{E}_{S \sim \mathcal{P}^n,\, z \sim \mathcal{P}}\Biggl[\Biggl(\frac{1}{n} \sum_{i \in [n]} L(S^{i \leftarrow z}, S_i)\Biggr)^2\Biggr] \\
&= \frac{1}{n^2} \sum_{i \in [n]} \mathbf{E}_{S \sim \mathcal{P}^n,\, z \sim \mathcal{P}}\Bigl[\bigl(L(S^{i \leftarrow z}, S_i)\bigr)^2\Bigr] + \frac{1}{n^2} \sum_{i,j \in [n],\, i \ne j} \mathbf{E}_{S \sim \mathcal{P}^n,\, z \sim \mathcal{P}}\bigl[L(S^{i \leftarrow z}, S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr] \\
&\le \frac{1}{n} + \frac{1}{n^2} \sum_{i,j \in [n],\, i \ne j} \mathbf{E}_{S \sim \mathcal{P}^n,\, z \sim \mathcal{P}}\bigl[L(S^{i \leftarrow z}, S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr], \tag{8}
\end{align*}
where we used convexity to obtain the first line and the bound on the range of $L$ to obtain the last inequality. For a fixed $i \ne j$ and a fixed setting of all the elements in $S$ with other indices (which we denote by $S_{-i,j}$) we now analyze the cross term
\[
v_{i,j} \doteq \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[L(S^{i \leftarrow z}, S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr].
\]
For $z \in Z$, define
\[
g(z) = \min_{z_i, z_j \in Z} L(S^{i,j \leftarrow z_i, z_j}, z) + \gamma.
\]
(We remark that $g$ implicitly depends on $i$, $j$ and $S_{-i,j}$.)
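Lemma 4.2's quantities can be computed exactly on a small example. In the sketch below (purely illustrative; the function $L$, the Bernoulli distribution, and $n = 2$ are hypothetical choices, not objects from the paper), $L(S, z) = (z - 1/2)\cdot\mathrm{mean}(S)$ over $Z = \{0,1\}$ is unbiased with respect to $\mathcal{P} = \mathrm{Bernoulli}(1/2)$ and has uniform stability $\gamma = 1/(2n)$; the second moment of the leave-one-out estimate is obtained by exhaustive enumeration:

```python
from itertools import product

n = 2
gamma = 0.5 / n  # |z - 0.5| = 0.5 and mean(S) moves by at most 1/n per element

def L(S, z):
    # Unbiased: E_{z ~ Bernoulli(1/2)}[z - 0.5] = 0 for every fixed S.
    return (z - 0.5) * (sum(S) / len(S))

def loo_estimate(S):
    """E_S[L(S^{<-P})] = (1/n) * sum_i E_{z~P}[L(S^{i<-z}, S_i)],
    with the expectation over z in {0, 1} computed exactly."""
    total = 0.0
    for i in range(n):
        for z in (0, 1):
            Sz = list(S)
            Sz[i] = z                   # replace the i-th element of S by z
            total += 0.5 * L(Sz, S[i])  # Pr[z] = 1/2 under Bernoulli(1/2)
    return total / n

# Exact second moment over S ~ P^n: each S in {0,1}^n has probability 2^{-n}.
second_moment = sum(loo_estimate(S) ** 2
                    for S in product((0, 1), repeat=n)) / 2 ** n
```

On this toy example the exact second moment is 0.046875, well within the guaranteed $\gamma^2 + 1/n = 0.5625$; the lemma is of course a worst-case bound over all $\gamma$-stable unbiased $L$.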
Uniform stability of $L$ implies that
\[
\max_{z_i, z_j \in Z} L(S^{i,j \leftarrow z_i, z_j}, z) \le \min_{z_i, z_j \in Z} L(S^{i,j \leftarrow z_i, z_j}, z) + 2\gamma. \tag{9}
\]
This means that for all $z_i, z_j, z \in Z$, $\bigl|L(S^{i,j \leftarrow z_i, z_j}, z) - g(z)\bigr| \le \gamma$. Using this inequality we obtain
\begin{align*}
v_{i,j} &= \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[L(S^{i \leftarrow z}, S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr] \\
&= \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[(L(S^{i \leftarrow z}, S_i) - g(S_i)) \cdot (L(S^{j \leftarrow z}, S_j) - g(S_j))\bigr] + \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[g(S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr] \\
&\quad + \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[g(S_j) \cdot L(S^{i \leftarrow z}, S_i)\bigr] - \mathbf{E}_{S_i, S_j \sim \mathcal{P}}\bigl[g(S_i) \cdot g(S_j)\bigr] \\
&\le \gamma^2 + \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[g(S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr] + \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[g(S_j) \cdot L(S^{i \leftarrow z}, S_i)\bigr] - \Bigl(\mathbf{E}_{z' \sim \mathcal{P}}[g(z')]\Bigr)^2.
\end{align*}
Note that $L$ is unbiased and $g$ does not depend on $S_i$ or $S_j$. Therefore, for every fixed setting of $S_i$ and $z$,
\[
\mathbf{E}_{S_j \sim \mathcal{P}}\bigl[g(S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr] = g(S_i) \cdot \mathbf{E}_{\mathcal{P}}\bigl[L(S^{j \leftarrow z})\bigr] = 0.
\]
Therefore,
\[
\mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[g(S_i) \cdot L(S^{j \leftarrow z}, S_j)\bigr] + \mathbf{E}_{S_i, S_j, z \sim \mathcal{P}}\bigl[g(S_j) \cdot L(S^{i \leftarrow z}, S_i)\bigr] = 0,
\]
implying that $v_{i,j} \le \gamma^2$. Substituting this into eq. (8) we obtain the claim.

We can now obtain the proof of Lemma 4.1 by observing that for every $S$, the empirical mean $\mathbf{E}_S[L(S)]$ is within $\gamma$ of our leave-one-out estimator $\mathbf{E}_S\bigl[L\bigl(S^{\leftarrow \mathcal{P}}\bigr)\bigr]$ (see supplemental material for the proof).

References

[1] Karim T. Abou-Moustafa and Csaba Szepesvári. An exponential tail bound for Lq stable learning rules: Application to k-folds cross-validation. In ISAIM, 2018. URL http://isaim2018.cs.virginia.edu/papers/ISAIM2018_Abou-Moustafa_Szepesvari.pdf.

[2] Raef Bassily, Kobbi Nissim, Adam D.
Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In STOC, pages 1046–1059, 2016.

[3] Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In COLT, pages 203–208, 1999.

[4] Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.

[5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[6] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.

[7] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[8] Luc Devroye and Terry J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Trans. Information Theory, 25(2):202–207, 1979.

[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.

[10] Cynthia Dwork and Vitaly Feldman. Privacy-preserving prediction. CoRR, abs/1803.10266, 2018. URL http://arxiv.org/abs/1803.10266. Extended abstract in COLT 2018.

[11] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Preserving statistical validity in adaptive data analysis. CoRR, abs/1411.2664, 2014. Extended abstract in STOC 2015.

[12] André Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6:55–79, 2005. URL http://www.jmlr.org/papers/v6/elisseeff05a.html.

[13] Vitaly Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back.
CoRR, abs/1608.04414, 2016. URL http://arxiv.org/abs/1608.04414. Extended abstract in NIPS 2016.

[14] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, pages 1225–1234, 2016. URL http://jmlr.org/proceedings/papers/v48/hardt16.html.

[15] S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, pages 793–800, 2008.

[16] Satyen Kale, Ravi Kumar, and Sergei Vassilvitskii. Cross-validation and mean-square stability. In Innovations in Computer Science - ICS, pages 487–495, 2011. URL http://conference.itcs.tsinghua.edu.cn/ICS2011/content/papers/31.html.

[17] Ravi Kumar, Daniel Lokshtanov, Sergei Vassilvitskii, and Andrea Vattani. Near-optimal bounds for cross-validation via loss stability. In ICML, pages 27–35, 2013. URL http://jmlr.org/proceedings/papers/v28/kumar13a.html.

[18] Tongliang Liu, Gábor Lugosi, Gergely Neu, and Dacheng Tao. Algorithmic stability and hypothesis complexity. In ICML, pages 2159–2167, 2017. URL http://proceedings.mlr.press/v70/liu17c.html.

[19] Ben London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In NIPS, pages 2935–2944, 2017.

[20] Andreas Maurer. A second-order look at stability and generalization. In COLT, pages 1461–1475, 2017. URL http://proceedings.mlr.press/v65/maurer17a.html.

[21] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.

[22] Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.

[23] Kobbi Nissim and Uri Stemmer. Concentration bounds for high sensitivity functions through differential privacy. CoRR, abs/1703.01970, 2017.
URL http://arxiv.org/abs/1703.01970.

[24] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.

[25] W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, 6(3):506–514, 1978.

[26] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.

[27] Thomas Steinke and Jonathan Ullman. Subgaussian tail bounds via stability arguments. arXiv preprint arXiv:1701.03493, 2017. URL https://arxiv.org/abs/1701.03493.

[28] Andre Wibisono, Lorenzo Rosasco, and Tomaso Poggio. Sufficient conditions for uniform stability of regularization algorithms. Technical Report MIT-CSAIL-TR-2009-060, MIT, 2009.

[29] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In SIGMOD, pages 1307–1322, 2017.

[30] Tong Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 15(6):1397–1437, 2003.