{"title": "A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 2931, "page_last": 2940, "abstract": "We study the generalization error of randomized learning algorithms -- focusing on stochastic gradient descent (SGD) -- using a novel combination of PAC-Bayes and algorithmic stability. Importantly, our generalization bounds hold for all posterior distributions on an algorithm's random hyperparameters, including distributions that depend on the training data. This inspires an adaptive sampling algorithm for SGD that optimizes the posterior at runtime. We analyze this algorithm in the context of our generalization bounds and evaluate it on a benchmark dataset. Our experiments demonstrate that adaptive sampling can reduce empirical risk faster than uniform sampling while also improving out-of-sample accuracy.", "full_text": "A PAC-Bayesian Analysis of Randomized Learning\n\nwith Application to Stochastic Gradient Descent\n\nBen London\n\nblondon@amazon.com\n\nAmazon AI\n\nAbstract\n\nWe study the generalization error of randomized learning algorithms\u2014focusing\non stochastic gradient descent (SGD)\u2014using a novel combination of PAC-Bayes\nand algorithmic stability. Importantly, our generalization bounds hold for all pos-\nterior distributions on an algorithm\u2019s random hyperparameters, including distribu-\ntions that depend on the training data. This inspires an adaptive sampling algo-\nrithm for SGD that optimizes the posterior at runtime. We analyze this algorithm\nin the context of our generalization bounds and evaluate it on a benchmark dataset.\nOur experiments demonstrate that adaptive sampling can reduce empirical risk\nfaster than uniform sampling while also improving out-of-sample accuracy.\n\n1\n\nIntroduction\n\nRandomized algorithms are the workhorses of modern machine learning. 
One such algorithm is\nstochastic gradient descent (SGD), a \ufb01rst-order optimization method that approximates the gradient\nof the learning objective by a random point estimate, thereby making it ef\ufb01cient for large datasets.\nRecent interest in studying the generalization properties of SGD has led to several breakthroughs.\nNotably, Hardt et al. [10] showed that SGD is stable with respect to small perturbations of the\ntraining data, which let them bound the risk of a learned model. Related studies followed thereafter\n[13, 16]. Simultaneously, Lin and Rosasco [15] derived risk bounds that show that early stopping\nacts as a regularizer in multi-pass SGD (echoing studies of incremental gradient descent [19]).\nIn this paper, we study generalization in randomized learning, with SGD as a motivating example.\nUsing a novel analysis that combines PAC-Bayes with algorithmic stability (reminiscent of [17]),\nwe prove new generalization bounds for randomized learning algorithms, which apply to SGD un-\nder various assumptions on the loss function and optimization objective. Our bounds improve on\nrelated work in two important ways. While some previous bounds for SGD [1, 10, 13, 16] hold in\nexpectation over draws of the training data, our bounds hold with high probability. Further, existing\ngeneralization bounds for randomized learning [6, 7] only apply to algorithms with \ufb01xed distribu-\ntions (such as SGD with uniform sampling); thanks to our PAC-Bayesian treatment, our bounds hold\nfor all posterior distributions, meaning they support data-dependent randomization. The penalty for\nover\ufb01tting the posterior to the data is captured by the posterior\u2019s divergence from a \ufb01xed prior.\nOur generalization bounds suggest a sampling strategy for SGD that adapts to the training data and\nmodel, focusing on useful examples while staying close to a uniform prior. 
We therefore propose an adaptive sampling algorithm that dynamically updates its distribution using multiplicative weight updates (similar to boosting [8, 21], focused online learning [22] and exponentiated gradient dual coordinate ascent [4]). The algorithm requires minimal tuning and works with any stochastic gradient update rule. We analyze the divergence of the adaptive posterior and conduct experiments on a benchmark dataset, using several combinations of update rule and sampling utility function. Our experiments demonstrate that adaptive sampling can reduce empirical risk faster than uniform sampling while also improving out-of-sample accuracy.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Preliminaries

Let X denote a compact domain; let Y denote a set of labels; and let Z ≜ X × Y denote their Cartesian product. We assume there exists an unknown, fixed distribution, D, supported on Z. Given a dataset of examples, S ≜ (z1, . . . , zn) = ((x1, y1), . . . , (xn, yn)), drawn independently and identically from D, we wish to learn the parameters of a predictive model, X ↦ Y, from a class of hypotheses, H, which we assume is a subset of Euclidean space. We have access to a deterministic learning algorithm, A : Z^n × Θ → H, which, given S, and some hyperparameters, θ ∈ Θ, produces a hypothesis, h ∈ H.

We measure the quality of a hypothesis using a loss function, L : H × Z → [0, M], which we assume is M-bounded¹ and λ-Lipschitz (see Appendix A for the definition). Let L(A(S, θ), z) denote the loss of a hypothesis that was output by A(S, θ) when applied to example z. Ultimately, we want the learning algorithm to have low expected loss on a random example; i.e., low risk, denoted R(S, θ) ≜ E_{z∼D}[L(A(S, θ), z)]. (The learning algorithm should always be clear from context.)
Since this expectation cannot be computed, we approximate it by the average loss on the training data; i.e., the empirical risk, R̂(S, θ) ≜ (1/n) Σ_{i=1}^n L(A(S, θ), zi), which is what most learning algorithms attempt to minimize. By bounding the difference of the two, G(S, θ) ≜ R(S, θ) − R̂(S, θ), which we refer to as the generalization error, we obtain an upper bound on R(S, θ).

Throughout this document, we will view a randomized learning algorithm as a deterministic learning algorithm whose hyperparameters are randomized. For instance, stochastic gradient descent (SGD) performs a sequence of hypothesis updates, for t = 1, . . . , T, of the form

  ht ← Ut(ht−1, z_{i_t}) ≜ ht−1 − ηt ∇F(ht−1, z_{i_t}),

using a sequence of random example indices, θ = (i1, . . . , iT), sampled according to a distribution, P, on Θ = {1, . . . , n}^T. The objective function, F : H × Z → R+, may be different from L; it is usually chosen as an optimizable upper bound on L, and need not be bounded. The parameter ηt is a step size for the update at iteration t. SGD can be viewed as taking a dataset, S, drawing θ ∼ P, then running a deterministic algorithm, A(S, θ), which executes the sequence of hypothesis updates.

Since learning is randomized, we will deal with the expected loss over draws of random hyperparameters. We therefore overload the above notation for a distribution, P, on the hyperparameter space, Θ; let R(S, P) ≜ E_{θ∼P}[R(S, θ)], R̂(S, P) ≜ E_{θ∼P}[R̂(S, θ)], and G(S, P) ≜ R(S, P) − R̂(S, P).

2.1 Relationship to PAC-Bayes

Conditioned on the training data, a posterior distribution, Q, on the hyperparameter space, Θ, induces a distribution on the hypothesis space, H.
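To make this randomized-learning view concrete, here is a minimal sketch (the least-squares setup, and the names `sgd`, `grad_F` and `loss`, are illustrative, not from the paper): A(S, θ) is a deterministic function of the dataset and a sampled index sequence, and the empirical risk of its output is the average training loss.

```python
import numpy as np

def sgd(S, theta, h0, grad_F, eta):
    """Deterministic algorithm A(S, theta): run SGD over the index sequence theta."""
    h = np.array(h0, dtype=float)
    for t, i in enumerate(theta, start=1):
        x, y = S[i]
        h = h - eta(t) * grad_F(h, x, y)  # h_t <- h_{t-1} - eta_t * grad F(h_{t-1}, z_{i_t})
    return h

# Squared loss on a linear model; for simplicity, the objective F equals the loss L.
def grad_F(h, x, y):
    return 2.0 * (h @ x - y) * x

def loss(h, x, y):
    return (h @ x - y) ** 2

rng = np.random.default_rng(0)
n, d, T = 50, 3, 200
h_star = rng.normal(size=d)
S = [(x, x @ h_star) for x in rng.normal(size=(n, d))]

# Uniform prior P on Theta = {1,...,n}^T: draw theta ~ P, then run A(S, theta).
theta = rng.integers(0, n, size=T)
h_T = sgd(S, theta, np.zeros(d), grad_F, eta=lambda t: 0.1 / t)

# Empirical risk of the learned hypothesis.
emp_risk = float(np.mean([loss(h_T, x, y) for x, y in S]))
```

Drawing many θ ∼ P and averaging the resulting empirical risks would give a Monte Carlo estimate of R̂(S, P).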
If we ignore the learning algorithm altogether and\nthink of Q as a distribution on H directly, then Eh\u223cQ[L(h, z)] is the Gibbs loss; that is, the expected\nloss of a random hypothesis. The Gibbs loss has been studied extensively using PAC-Bayesian anal-\nysis (also known simply as PAC-Bayes) [3, 9, 14, 18, 20]. In the PAC-Bayesian learning framework,\nwe \ufb01x a prior distribution, P, then receive some training data, S \u223c Dn, and learn a posterior dis-\ntribution, Q. PAC-Bayesian bounds frame the generalization error, G(S, Q), as a function of the\nposterior\u2019s divergence from the prior, which penalizes over\ufb01tting the posterior to the training data.\nIn Section 4, we derive new upper bounds on G(S, Q) using a novel PAC-Bayesian treatment. While\ntraditional PAC-Bayes analyzes distributions directly on H, we instead analyze distributions on\n\u0398. Thus, instead of applying the loss directly to a random hypothesis, we apply it to the output\nof a learning algorithm, whose inputs are a dataset and a random hyperparameter instantiation.\nThis distinction is subtle, but important.\nIn our framework, a random hypothesis is explicitly a\nfunction of the learning algorithm, whereas in traditional PAC-Bayes this dependence may only be\nimplicit\u2014for instance, if the posterior is given by random permutations of a learned hypothesis. The\nadvantage of making the learning aspect explicit is that it isolates the source of randomness, which\nmay help in analyzing the distribution of learned hypotheses. Indeed, it may be dif\ufb01cult to map the\noutput of a randomized learning algorithm to a distribution on the hypothesis space. That said, the\ndisadvantage of making learning explicit is that, due to the learning algorithm\u2019s dependence on the\ntraining data and hyperparameters, the generalization error could be sensitive to certain examples or\nhyperparameters. 
This condition is quantified with algorithmic stability, which we discuss next.

1Accommodating unbounded loss functions is possible [11], but requires additional assumptions.

3 Algorithmic Stability

Informally, algorithmic stability measures the change in loss when the inputs to a learning algorithm are perturbed; a learning algorithm is stable if small perturbations lead to proportional changes in the loss. In other words, a learning algorithm should not be overly sensitive to any single input. Stability is crucial for learnability [23], and has also been linked to differentially private learning [24]. In this section, we discuss several notions of stability tailored for randomized learning algorithms. From this point on, let DH(v, v′) ≜ Σ_{i=1}^{|v|} 1{vi ≠ v′i} denote the Hamming distance.

3.1 Definitions of Stability

The literature traditionally measures stability with respect to perturbations of the training data. We refer to this general property as data stability. Data stability has been defined in many ways. The following definitions, originally proposed by Elisseeff et al. [6], are designed to accommodate randomized algorithms via an expectation over the hyperparameters, θ ∼ P.

Definition 1 (Uniform Stability). A randomized learning algorithm, A, is βZ-uniformly stable with respect to a loss function, L, and a distribution, P on Θ, if

  sup_{S,S′∈Z^n : DH(S,S′)=1}  sup_{z∈Z}  | E_{θ∼P}[L(A(S, θ), z) − L(A(S′, θ), z)] |  ≤  βZ.

Definition 2 (Pointwise Hypothesis Stability). For a given dataset, S, let S^{i,z} denote the result of replacing the ith example with example z. A randomized learning algorithm, A, is βZ-pointwise hypothesis stable with respect to a loss function, L, and a distribution, P on Θ, if

  sup_{i∈{1,...,n}}  E_{z∼D}  E_{S∼D^n}  E_{θ∼P} [ | L(A(S, θ), zi) − L(A(S^{i,z}, θ), zi) | ]  ≤  βZ.

Uniform stability measures the maximum change in loss from replacing any single training example, whereas pointwise hypothesis stability measures the expected change in loss on a random example when said example is removed from the training data. Under certain conditions, βZ-uniform stability implies βZ-pointwise hypothesis stability, but not vice versa. Thus, while uniform stability enables sharper bounds, pointwise hypothesis stability supports a wider range of learning algorithms.

In addition to data stability, we might also require stability with respect to changes in the hyperparameters. From this point forward, we will assume that the hyperparameter space, Θ, decomposes into the product of T subspaces, ∏_{t=1}^T Θt. For instance, Θ could be the set of all sequences of example indices, {1, . . . , n}^T, such as one would sample from in SGD.

Definition 3 (Hyperparameter Stability). A randomized learning algorithm, A, is βΘ-uniformly stable with respect to a loss function, L, if

  sup_{S∈Z^n}  sup_{z∈Z}  sup_{θ,θ′∈Θ : DH(θ,θ′)=1}  | L(A(S, θ), z) − L(A(S, θ′), z) |  ≤  βΘ.

When A is both βZ-uniformly and βΘ-uniformly stable, we say that A is (βZ, βΘ)-uniformly stable.

Remark 1. For SGD, Definition 3 can be mapped to Bousquet and Elisseeff's [2] original definition of uniform stability using the resampled example sequence.
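As a small numeric illustration of Definition 3 (not from the paper; the least-squares setup and all names are hypothetical), one can run SGD on two index sequences at Hamming distance 1 and measure the resulting loss gap, which is a single sample of the quantity whose supremum βΘ bounds:

```python
import numpy as np

def sgd(S, theta, eta0=0.05):
    """Run least-squares SGD over the index sequence theta (the hyperparameters)."""
    d = len(S[0][0])
    h = np.zeros(d)
    for t, i in enumerate(theta, start=1):
        x, y = S[i]
        h -= (eta0 / t) * 2.0 * (h @ x - y) * x
    return h

def loss(h, z):
    x, y = z
    return (h @ x - y) ** 2

rng = np.random.default_rng(1)
n, d, T = 40, 3, 100
h_star = rng.normal(size=d)
S = [(x, x @ h_star) for x in rng.normal(size=(n, d))]

# theta and theta' differ in a single coordinate (Hamming distance 1), as in Definition 3.
theta = rng.integers(0, n, size=T)
theta_prime = theta.copy()
theta_prime[T // 2] = (theta[T // 2] + 1) % n

h, h_prime = sgd(S, theta), sgd(S, theta_prime)
z_test = (rng.normal(size=d), 0.0)
gap = abs(loss(h, z_test) - loss(h_prime, z_test))  # one sample of the sup in Definition 3
```

Taking the maximum of `gap` over many datasets, test points, and perturbed coordinates would approximate (from below) the coefficient βΘ.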
Yet their generalization bounds would still not apply because the resampled data is not i.i.d. and SGD is not a symmetric learning algorithm.

3.2 Stability of Stochastic Gradient Descent

For non-vacuous generalization bounds, we will need the data stability coefficient, βZ, to be of order Õ(n⁻¹). Additionally, certain results will require the hyperparameter stability coefficient, βΘ, to be of order Õ(1/√(nT)). (If T = Θ(n), as it often is, then βΘ = Õ(T⁻¹) suffices.) In this section, we review some conditions under which these requirements are satisfied by SGD. We rely on standard characterizations of the objective function, namely convexity, Lipschitzness and smoothness, the definitions of which are deferred to Appendix A, along with all proofs from this section.

A recent study by Hardt et al. [10] proved that some special cases of SGD, when examples are sampled uniformly, with replacement, satisfy βZ-uniform stability (Definition 1) with βZ = O(n⁻¹). We extend their work (specifically, [10, Theorem 3.7]) in the following result for SGD with a convex objective function, when the step size is at most inversely proportional to the current iteration.

Proposition 1. Assume that the loss function, L, is λ-Lipschitz, and that the objective function, F, is convex, λ-Lipschitz and σ-smooth. Suppose SGD is run for T iterations with a uniform sampling distribution, P, and step sizes ηt ∈ [0, η/t], for η ∈ [0, 2/σ].
Then, SGD is both βZ-uniformly stable and βZ-pointwise hypothesis stable with respect to L and P, with

  βZ ≤ 2λ²η (ln T + 1) / n.    (1)

When T = Θ(n), Equation 1 is Õ(n⁻¹), which is acceptable for proving generalization.

If we do not assume that the objective function is convex, we can borrow a result (with small modification²) from Hardt et al. [10, Theorem 3.8].

Proposition 2. Assume that the loss function, L, is M-bounded and λ-Lipschitz, and that the objective function, F, is λ-Lipschitz and σ-smooth. Suppose SGD is run for T iterations with a uniform sampling distribution, P, and step sizes ηt ∈ [0, η/t], for η ≥ 0. Then, SGD is both βZ-uniformly stable and βZ-pointwise hypothesis stable with respect to L and P, with

  βZ ≤ ((M + (ση)⁻¹) / (n − 1)) (2λ²η)^{1/(ση+1)} T^{ση/(ση+1)}.    (2)

Assuming T = Θ(n), and ignoring constants that depend on M, λ, σ and η, Equation 2 reduces to O(n^{−1/(ση+1)}). As ση approaches 1, the rate becomes O(n^{−1/2}), which, as will become evident in Section 4, yields generalization bounds that are suboptimal, or even vacuous. However, if ση is small, say, η = (10σ)⁻¹, then we get O(n^{−10/11}) ≈ O(n⁻¹), which suffices for generalization.

We can obtain even tighter bounds for βZ-pointwise hypothesis stability (Definition 2) by adopting a data-dependent view. The following result for SGD with a convex objective function is adapted from work by Kuzborskij and Lampert [13, Theorem 3].

Proposition 3. Assume that the loss function, L, is λ-Lipschitz, and that the objective function, F, is convex, λ-Lipschitz and σ-smooth.
Suppose SGD starts from an initial hypothesis, h0, and is run for T iterations with a uniform sampling distribution, P, and step sizes ηt ∈ [0, η/t], for η ∈ [0, 2/σ]. Then, SGD is βZ-pointwise hypothesis stable with respect to L and P, with

  βZ ≤ 2λη (ln T + 1) √(2σ E_{z∼D}[L(h0, z)]) / n.    (3)

Importantly, Equation 3 depends on the risk of the initial hypothesis, h0. If h0 happens to be close to a global optimum, that is, a good first guess, then Equation 3 could be tighter than Equation 1. Kuzborskij and Lampert also proved a data-dependent bound for non-convex objective functions [13, Theorem 5], which, under certain conditions, might be tighter than Equation 2. Though not presented herein, Kuzborskij and Lampert's bound is worth noting.

As we will later show, we can obtain stronger generalization guarantees by combining βZ-uniform stability with βΘ-uniform stability (Definition 3), provided βΘ = Õ(1/√(nT)). Prior stability analyses of SGD [10, 13] have not addressed this form of stability. Elisseeff et al. [6] proved (βZ, βΘ)-uniform stability for certain bagging algorithms, but did not consider SGD. In light of Remark 1, it is tempting to map βΘ-uniform stability to Bousquet and Elisseeff's [2] uniform stability and thereby leverage their study of various regularized objective functions. However, their analysis crucially relies on exact minimization of the learning objective, whereas SGD with a finite number of steps only finds an approximate minimizer. Thus, to our knowledge, no prior work applies to this problem. As a first step, we prove uniform stability, with respect to both data and hyperparameters, for SGD with a strongly convex objective function and decaying step sizes.

Proposition 4.
Assume that the loss function, L, is λ-Lipschitz, and that the objective function, F, is γ-strongly convex, λ-Lipschitz and σ-smooth. Suppose SGD is run for T iterations with a uniform sampling distribution, P, and step sizes ηt ≜ (γt + σ)⁻¹. Then, SGD is (βZ, βΘ)-uniformly stable with respect to L and P, with

  βZ ≤ 2λ² / (γn)  and  βΘ ≤ 2λ² / (γT).    (4)

When T = Θ(n), the βΘ bound in Equation 4 is O(1/√(nT)), which supports good generalization.

2Hardt et al.'s definition of stability and theorem statement differ slightly from ours. See Appendix A.1.

4 Generalization Bounds

In this section, we present new generalization bounds for randomized learning algorithms. While prior work [6, 7] has addressed this topic, ours is the first PAC-Bayesian treatment (the benefits of which will be discussed momentarily). Recall that in the PAC-Bayesian framework, we fix a prior distribution, P, on the hypothesis space, H; then, given a sample of training data, S ∼ D^n, we learn a posterior distribution, Q, also on H. In our extension for randomized learning algorithms, P and Q are instead supported on the hyperparameter space, Θ. Moreover, while traditional PAC-Bayes studies E_{h∼Q}[L(h, z)], we study the expected loss over draws of hyperparameters, E_{θ∼Q}[L(A(S, θ), z)]. Our goal will be to upper-bound the generalization error of the posterior, G(S, Q), which thereby upper-bounds the risk, R(S, Q), by a function of the empirical risk, R̂(S, Q).

Importantly, our bounds are polynomial in δ⁻¹, for a free parameter δ ∈ (0, 1), and hold with probability at least 1 − δ over draws of a finite training dataset. This stands in contrast to related bounds [1, 10, 13, 16] that hold in expectation.
While expectation bounds are useful for gaining insight into\ngeneralization behavior, high-probability bounds are sometimes preferred. Provided the loss is M-\nbounded, it is always possible to convert a high-probability bound of the form PrS\u223cDn{G(S, Q) \u2264\nB(\u03b4)} \u2265 1 \u2212 \u03b4 to an expectation bound of the form ES\u223cDn [G(S, Q)] \u2264 B(\u03b4) + \u03b4M.\nAnother useful property of PAC-Bayesian bounds is that they hold simultaneously for all posteriors,\nincluding those that depend on the training data. In Section 3, we assumed that hyperparameters\nwere sampled according to a \ufb01xed distribution; for instance, sampling training example indices for\nSGD uniformly at random. However, in certain situations, it may be advantageous to sample accord-\ning to a data-dependent distribution. Following the SGD example, suppose most training examples\nare easy to classify (e.g., far from the decision boundary), but some are dif\ufb01cult (e.g., near the deci-\nsion boundary, or noisy). If we sample points uniformly at random, we might encounter mostly easy\nexamples, which could slow progress on dif\ufb01cult examples. If we instead focus training on the dif\ufb01-\ncult set, we might converge more quickly to an optimal hypothesis. Since our PAC-Bayesian bounds\nhold for all hyperparameter posteriors, we can characterize the generalization error of algorithms\nthat optimize the posterior using the training data. Existing generalization bounds for randomized\nlearning [6, 7], or SGD in particular [1, 10, 13, 15, 16], cannot address such algorithms. Of course,\nthere is a penalty for over\ufb01tting the posterior to the data, which is captured by the posterior\u2019s diver-\ngence from the prior.\nOur \ufb01rst PAC-Bayesian theorem requires the weakest stability condition, \u03b2Z-pointwise hypothesis\nstability, but the bound is sublinear in \u03b4\u22121. 
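The high-probability-to-expectation conversion mentioned above is immediate: on the event G(S, Q) ≤ B(δ) the gap is at most B(δ), and on the complement, which has probability at most δ, it is at most M, since the loss is M-bounded. In symbols:

```latex
\mathbb{E}_{S \sim \mathcal{D}^n}\!\left[ G(S, \mathcal{Q}) \right]
  \;\le\; B(\delta)\, \Pr\{ G(S, \mathcal{Q}) \le B(\delta) \}
     \;+\; M \, \Pr\{ G(S, \mathcal{Q}) > B(\delta) \}
  \;\le\; B(\delta) + \delta M .
```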
Our second bound is polylogarithmic in δ⁻¹, but requires the stronger stability conditions, (βZ, βΘ)-uniform stability. All proofs are deferred to Appendix B.

Theorem 1. Suppose a randomized learning algorithm, A, is βZ-pointwise hypothesis stable with respect to an M-bounded loss function, L, and a fixed prior, P on Θ. Then, for any n ≥ 1 and δ ∈ (0, 1), with probability at least 1 − δ over draws of a dataset, S ∼ D^n, every posterior, Q on Θ, satisfies

  G(S, Q) ≤ √( ((χ²(Q‖P) + 1) / δ) (2M²/n + 12M βZ) ),    (5)

where χ²(Q‖P) ≜ E_{θ∼P}[ (Q(θ)/P(θ))² − 1 ] is the χ² divergence from P to Q.

Theorem 2. Suppose a randomized learning algorithm, A, is (βZ, βΘ)-uniformly stable with respect to an M-bounded loss function, L, and a fixed product measure, P on Θ = ∏_{t=1}^T Θt. Then, for any n ≥ 1, T ≥ 1 and δ ∈ (0, 1), with probability at least 1 − δ over draws of a dataset, S ∼ D^n, every posterior, Q on Θ, satisfies

  G(S, Q) ≤ βZ + √( 2 (DKL(Q‖P) + ln(2/δ)) ((M + 2nβZ)²/n + 4T βΘ²) ),    (6)

where DKL(Q‖P) ≜ E_{θ∼Q}[ ln(Q(θ)/P(θ)) ] is the KL divergence from P to Q.

Since Theorems 1 and 2 hold simultaneously for all hyperparameter posteriors, they provide generalization guarantees for SGD with any sampling distribution. Note that the stability requirements only need to be satisfied by a fixed product measure, such as a uniform distribution. This simple
This simple\n\n5\n\n\fsampling distribution can have(cid:0)O(n\u22121), O(T \u22121)(cid:1)-uniform stability under certain conditions, as\n\ndemonstrated in Section 3.2. In the following, we apply Theorem 2 to SGD with a strongly convex\nobjective function, leveraging Proposition 4 to upper-bound the stability coef\ufb01cients.\nCorollary 1. Assume that the loss function, L, is M-bounded and \u03bb-Lipschitz, and that the objective\nfunction, F , is \u03b3-strongly convex, \u03bb-Lipschitz and \u03c3-smooth. Let P denote a uniform prior on\n{1, . . . , n}T . Then, for any n \u2265 1, T \u2265 1 and \u03b4 \u2208 (0, 1), with probability at least 1 \u2212 \u03b4 over draws\nof a dataset, S \u223c Dn, SGD with step sizes \u03b7t (cid:44) (\u03b3t + \u03c3)\u22121 and any posterior sampling distribution,\nQ on {1, . . . , n}T , satis\ufb01es\n\nG(S, Q) \u2264 2\u03bb2\n\u03b3n\n\n+\n\n2\n\nDKL(Q(cid:107)P) + ln\n\n2\n\u03b4\n\n+\n\n16\u03bb4\n\u03b32T\n\nn\n\n(cid:19)(cid:18) (M + 4\u03bb2/\u03b3)2\n\n(cid:19)\n\n.\n\n(cid:115)\n\n(cid:18)\n\nWhen the divergence is polylogarithmic in n, and T = \u0398(n), the generalization bound is \u02dcO(n\u22121/2).\nIn the special case of uniform sampling, the KL divergence is zero, yielding a O(n\u22121/2) bound.\nImportantly, Theorem 1 does not require hyperparameter stability, and is therefore of interest for\nanalyzing non-convex objective functions, since it is not known whether uniform hyperparameter\nstability can be satis\ufb01ed without (strong) convexity. One can use Equation 2 (or [13, Theorem 5]) to\nupper-bound \u03b2Z in Equation 5 and thereby obtain a generalization bound for SGD with a non-convex\nobjective function, such as neural network training. We leave this substitution to the reader.\nEquation 6 holds with high probability over draws of a dataset, but the generalization error is an\nexpected value over draws of hyperparameters. 
To obtain a bound that holds with high probability over draws of both data and hyperparameters, we consider posteriors that are product measures.

Theorem 3. Suppose a randomized learning algorithm, A, is (βZ, βΘ)-uniformly stable with respect to an M-bounded loss function, L, and a fixed product measure, P on Θ = ∏_{t=1}^T Θt. Then, for any n ≥ 1, T ≥ 1 and δ ∈ (0, 1), with probability at least 1 − δ over draws of a dataset, S ∼ D^n, and hyperparameters, θ ∼ Q, from any posterior product measure, Q on Θ,

  G(S, θ) ≤ βZ + βΘ √(2T ln(2/δ)) + √( 2 (DKL(Q‖P) + ln(4/δ)) ((M + 2nβZ)²/n + 4T βΘ²) ).    (7)

If βΘ = Õ(1/√(nT)), then βΘ √(2T ln(2/δ)) vanishes at a rate of Õ(n^{−1/2}). We can apply Theorem 3 to SGD in the same way we applied Theorem 2 in Corollary 1. Further, note that a uniform distribution is a product distribution. Thus, if we eschew optimizing the posterior, then the KL divergence disappears, leaving a O(n^{−1/2}) derandomized generalization bound for SGD with uniform sampling.³

5 Adaptive Sampling for Stochastic Gradient Descent

The PAC-Bayesian theorems in Section 4 motivate data-dependent posterior distributions on the hyperparameter space. Intuitively, certain posteriors may improve, or speed up, learning from a given dataset. For instance, suppose certain training examples are considered valuable for reducing empirical risk; then, a sampling posterior for SGD should weight those examples more heavily than others, so that the learning algorithm can, probabilistically, focus its attention on the valuable examples.
However, a posterior should also try to stay close to the prior, to control the divergence penalty in the generalization bounds.

Based on this idea, we propose a sampling procedure for SGD (or any variant thereof) that constructs a posterior based on the training data, balancing the utility of the sampling distribution with its divergence from a uniform prior. The algorithm operates alongside the learning algorithm, iteratively generating the posterior as a sequence of conditional distributions on the training data. Each iteration of training generates a new distribution conditioned on the previous iterations, so the posterior dynamically adapts to training. We therefore call our algorithm adaptive sampling SGD.

3We can achieve the same result by pairing Proposition 4 with Elisseeff et al.'s generalization bound for algorithms with (βZ, βΘ)-uniform stability [6, Theorem 15]. However, Elisseeff et al.'s bound only applies to fixed product measures on Θ, whereas Theorem 3 applies more generally to any posterior product measure, and when P = Q, Equation 7 is within a constant factor of Elisseeff et al.'s bound.

Algorithm 1 Adaptive Sampling SGD
Require: Examples, (z1, . . . , zn) ∈ Z^n; initial hypothesis, h0 ∈ H; update rule, Ut : H × Z → H; utility function, f : Z × H → R; amplitude, α ≥ 0; decay, τ ∈ (0, 1).
1: (q1, . . . , qn) ← 1    ▷ Initialize sampling weights uniformly
2: for t = 1, . . . , T do
3:   i_t ∼ Qt ∝ (q1, . . . , qn)    ▷ Draw index i_t proportional to sampling weights
4:   ht ← Ut(ht−1, z_{i_t})    ▷ Update hypothesis
5:   q_{i_t} ← q_{i_t}^τ exp(α f(z_{i_t}, ht))    ▷ Update sampling weight for i_t
6: return hT

Algorithm 1 maintains a set of nonnegative sampling weights, (q1, . . . , qn), which define a distribution on the dataset.
The posterior probability of the ith example in the tth iteration, given the previous iterations, is proportional to the ith weight: Qt(i) ≜ Q(i_t = i | i1, . . . , i_{t−1}) ∝ qi. The sampling weights are initialized to 1, thereby inducing a uniform distribution. At each iteration, we draw an index, i_t ∼ Qt, and use example z_{i_t} to update the hypothesis. We then update the weight for i_t multiplicatively as q_{i_t} ← q_{i_t}^τ exp(α f(z_{i_t}, ht)), where: f(z_{i_t}, ht) is a utility function of the chosen example and current hypothesis; α ≥ 0 is an amplitude parameter, which controls the aggressiveness of the update; and τ ∈ (0, 1) is a decay parameter, which lets qi gradually forget past updates.

The multiplicative weight update (line 5) can be derived by choosing a sampling distribution for the next iteration, t + 1, that maximizes the expected utility while staying close to a reference distribution. Consider the following constrained optimization problem:

  max_{Q_{t+1} ∈ Δ_n}  Σ_{i=1}^n Q_{t+1}(i) f(zi, ht) − (1/α) DKL(Q_{t+1} ‖ Qt^τ).    (8)

The term Σ_{i=1}^n Q_{t+1}(i) f(zi, ht) is the expected utility under the new distribution, Q_{t+1}. This is offset by the KL divergence, which acts as a regularizer, penalizing Q_{t+1} for diverging from a reference distribution, Qt^τ, where Qt^τ(i) ∝ qi^τ. The decay parameter, τ, controls the temperature of the reference distribution, allowing it to interpolate between the current distribution (τ = 1) and a uniform distribution (τ = 0). The amplitude parameter, α, scales the influence of the regularizer relative to the expected utility.
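A compact implementation sketch of Algorithm 1 follows (assumed choices, all illustrative: NumPy, a least-squares update rule, and the example's current loss as the utility f; the algorithm itself accepts any update rule and utility):

```python
import numpy as np

def adaptive_sampling_sgd(S, h0, update, utility, alpha=0.5, tau=0.9, T=200, seed=0):
    """Adaptive sampling SGD (Algorithm 1): multiplicative weight updates on the
    per-example sampling distribution, alongside any SGD-style update rule."""
    rng = np.random.default_rng(seed)
    n = len(S)
    q = np.ones(n)                                  # line 1: uniform initial weights
    h = np.array(h0, dtype=float)
    for t in range(1, T + 1):
        i = rng.choice(n, p=q / q.sum())            # line 3: i_t ~ Q_t, proportional to q
        h = update(h, S[i], t)                      # line 4: h_t <- U_t(h_{t-1}, z_{i_t})
        q[i] = q[i] ** tau * np.exp(alpha * utility(S[i], h))   # line 5
    return h, q

# Least-squares example; utility = current loss, so hard examples get sampled more.
def update(h, z, t):
    x, y = z
    return h - (0.1 / t) * 2.0 * (h @ x - y) * x

def utility(z, h):
    x, y = z
    return (h @ x - y) ** 2

rng = np.random.default_rng(42)
n, d = 30, 3
h_star = rng.normal(size=d)
S = [(x, x @ h_star) for x in rng.normal(size=(n, d))]
h_T, q = adaptive_sampling_sgd(S, np.zeros(d), update, utility)
```

The per-draw normalization `q / q.sum()` costs O(n); a tree-structured sampler reduces each draw and weight update to O(log n).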
We can solve Equation 8 analytically using the method of Lagrange multipliers, which yields

    Q⋆_{t+1}(i) ∝ Q_t^τ(i) exp(α f(z_i, h_t) − 1) ∝ q_i^τ exp(α f(z_i, h_t)).

Updating q_i for all i = 1, ..., n is impractical for large n, so we approximate the above solution by only updating the weight for the last sampled index, i_t, effectively performing coordinate ascent.

The idea of tuning the empirical data distribution through multiplicative weight updates is reminiscent of AdaBoost [8] and focused online learning [22], but note that Algorithm 1 learns a single hypothesis, not an ensemble. In this respect, it is similar to SelfieBoost [21]. One could also draw parallels to exponentiated gradient dual coordinate ascent [4]. Finally, note that when the gradient estimate is unbiased (i.e., weighted by the inverse sampling probability), we obtain a variant of importance sampling SGD [25], though we do not necessarily need unbiased gradient estimates.

It is important to note that we do not actually need to compute the full posterior distribution (which would take O(n) time per iteration) in order to sample from it. Indeed, using an algorithm and data structure described in Appendix C, we can sample from and update the distribution in O(log n) time, using O(n) space. Thus, the additional iteration complexity of adaptive sampling is logarithmic in the size of the dataset, which is suitably efficient for learning from large datasets.

In practice, SGD is typically applied with mini-batching, whereby multiple examples are drawn at each iteration, instead of just one. Given the massive parallelism of today's computing hardware, mini-batching is simply a more efficient way to process a dataset, and can result in more accurate gradient estimates than single-example updates.
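One standard realization of the O(log n) sampling scheme mentioned above is a Fenwick (binary-indexed) tree over the sampling weights; this is a common construction and only an assumption about what Appendix C describes, not a reproduction of it:

```python
import random

class WeightedSampler:
    """Sample index i with probability q_i / sum(q), and update any q_i,
    each in O(log n) time, via a Fenwick (binary-indexed) tree of weights."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        self.q = [0.0] * self.n
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, new_weight):
        delta = new_weight - self.q[i]
        self.q[i] = new_weight
        j = i + 1
        while j <= self.n:                  # propagate the change up the tree
            self.tree[j] += delta
            j += j & (-j)

    def total(self):
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & (-j)
        return s

    def sample(self, rng=random):
        # descend the implicit tree to find where the prefix sum first exceeds u
        u = rng.random() * self.total()
        pos, mask = 0, 1
        while mask * 2 <= self.n:
            mask *= 2
        while mask > 0:
            nxt = pos + mask
            if nxt <= self.n and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                pos = nxt
            mask //= 2
        return pos                          # 0-based index
```

Pairing `sample` (line 3 of Algorithm 1) with `update` (line 5) keeps the per-iteration overhead logarithmic in n, as claimed.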
Though Algorithm 1 is stated for single-example updates, it can be modified for mini-batching by replacing line 3 with multiple independent draws from Q_t, and line 5 with sampling weight updates for each unique⁴ example in the mini-batch.

⁴If an example is drawn multiple times in a mini-batch, its sampling weight is only updated once.

5.1 Divergence Analysis

Recall that our generalization bounds use the posterior's divergence from a fixed prior to penalize the posterior for overfitting the training data. Thus, to connect Algorithm 1 to our bounds, we analyze the adaptive posterior's divergence from a uniform prior on {1, ..., n}^T. This quantity reflects the potential cost, in generalization performance, of adaptive sampling. The goal of this section is to upper-bound the KL divergence resulting from Algorithm 1 in terms of interpretable, data-dependent quantities. All proofs are deferred to Appendix D.

Our analysis requires introducing some notation. Given a sequence of sampled indices, (i_1, ..., i_t), let N_{i,t} ≜ |{t′ : t′ < t, i_{t′} = i}| denote the number of times that index i was chosen before iteration t. Let O_{i,j} denote the jth iteration in which i was chosen; for instance, if i was chosen at iterations 13 and 47, then O_{i,1} = 13 and O_{i,2} = 47. With these definitions, we can state the following bound, which exposes the influences of the utility function, amplitude and decay on the KL divergence.

Theorem 4. Fix a uniform prior, P, a utility function, f : Z × H → R, an amplitude, α ≥ 0, and a decay, τ ∈ (0, 1).
If Algorithm 1 is run for T iterations, then its posterior, Q, satisfies

    D_KL(Q ‖ P) ≤ Σ_{t=2}^T E_{(i_1,...,i_t)∼Q} [ α Σ_{j=1}^{N_{i_t,t}} f(z_{i_t}, h_{O_{i_t,j}}) τ^{N_{i_t,t}−j} − (α/n) Σ_{i=1}^n Σ_{k=1}^{N_{i,t}} f(z_i, h_{O_{i,k}}) τ^{N_{i,t}−k} ].    (9)

Equation 9 can be interpreted as measuring, on average, how the cumulative past utilities of each sampled index, i_t, differ from the cumulative utilities of any other index, i.⁵ When the posterior becomes too focused on certain examples, this difference is large. The accumulated utilities decay exponentially, with the rate of decay controlled by τ. The amplitude, α, scales the entire bound, which means that aggressive posterior updates may adversely affect generalization.

An interesting special case of Theorem 4 is when the utility function is nonnegative, which results in a simpler, more interpretable bound.

Theorem 5. Fix a uniform prior, P, a nonnegative utility function, f : Z × H → R_+, an amplitude, α ≥ 0, and a decay, τ ∈ (0, 1). If Algorithm 1 is run for T iterations, then its posterior, Q, satisfies

    D_KL(Q ‖ P) ≤ (α / (1 − τ)) Σ_{t=1}^{T−1} E_{(i_1,...,i_t)∼Q} [ f(z_{i_t}, h_t) ].    (10)

Equation 10 is simply the sum of expected utilities computed over T − 1 iterations of training, scaled by α/(1 − τ). The implications of this bound are interesting when the utility function is defined as the loss, f(z, h) ≜ L(h, z); then, if SGD quickly converges to a hypothesis with low maximal loss on the training data, it can reduce the generalization error.⁶ The caveat is that tuning the amplitude or decay to speed up convergence may actually counteract this effect.

It is worth noting that similar guarantees hold for a mini-batch variant of Algorithm 1.
The bounds are essentially unchanged, modulo notational intricacies.

6 Experiments

To demonstrate the effectiveness of Algorithm 1, we conducted several experiments with the CIFAR-10 dataset [12]. This benchmark dataset contains 60,000 (32×32)-pixel RGB images from 10 object classes, with a standard, static partitioning into 50,000 training examples and 10,000 test examples. We specified the hypothesis class as the following convolutional neural network architecture: 32 (3 × 3) filters with rectified linear unit (ReLU) activations in the first and second layers, followed by (2 × 2) max-pooling and 0.25 dropout⁷; 64 (3 × 3) filters with ReLU activations in the third and fourth layers, again followed by (2 × 2) max-pooling and 0.25 dropout; finally, a fully-connected, 512-unit layer with ReLU activations and 0.5 dropout, followed by a fully-connected, 10-output softmax layer. We trained the network using the cross-entropy loss. We emphasize that our goal was not to achieve state-of-the-art results on the dataset; rather, to evaluate Algorithm 1 in a simple, yet realistic, application.

⁵When N_{i,t} = 0 (i.e., i has not yet been sampled), a summation over j = 1, ..., N_{i,t} evaluates to zero.
⁶This interpretation concurs with ideas in [10, 22].
⁷It can be shown that dropout improves data stability [10, Lemma 4.4].

Following the intuition that sampling should focus on difficult examples, we experimented with two utility functions for Algorithm 1 based on common loss functions. For an example z = (x, y), with h(x, y) denoting the predicted probability of label y given input x under hypothesis h, let

    f_0(z, h) ≜ 1{arg max_{y′∈Y} h(x, y′) ≠ y}    and    f_1(z, h) ≜ 1 − h(x, y).

The first utility function, f_0, is the 0-1 loss; the second, f_1, is the L1 loss, which accounts for uncertainty in the most likely label.
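In code, these two utility functions are one-liners. Here the paper's h(x, ·) is represented as a list `probs` of predicted class probabilities indexed by label; that array convention is our choice, not the paper's:

```python
def f0(probs, y):
    """0-1 utility: 1 if the argmax prediction differs from the true label y."""
    y_hat = max(range(len(probs)), key=lambda k: probs[k])
    return 1.0 if y_hat != y else 0.0

def f1(probs, y):
    """L1 utility: one minus the probability assigned to the true label,
    so confidently correct examples receive low sampling weight."""
    return 1.0 - probs[y]
```

For instance, with `probs = [0.1, 0.7, 0.2]` and true label 1, f_0 is 0 (correct argmax) while f_1 is 0.3 (residual uncertainty).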
We combined these utility functions with two parameter update rules: standard SGD with decreasing step sizes, η_t ≜ η/(1 + νt) ≤ η/(νt), for η > 0 and ν > 0; and AdaGrad [5], a variant of SGD that automatically tunes a separate step size for each parameter. We used mini-batches of 100 examples per update. The combination of utility functions and update rules yields four adaptive sampling algorithms: AdaSamp-01-SGD, AdaSamp-01-AdaGrad, AdaSamp-L1-SGD and AdaSamp-L1-AdaGrad. We compared these to their uniform sampling counterparts, Unif-SGD and Unif-AdaGrad.

We tuned all hyperparameters using random subsets of the training data for cross-validation. We then ran 10 trials of training and testing, using different seeds for the pseudorandom number generator at each trial to generate different random initializations⁸ and training sequences. Figures 1a and 1b plot learning curves of the average cross-entropy and accuracy, respectively, on the training data; Figure 1c plots the average accuracy on the test data. We found that all adaptive sampling variants reduced empirical risk (increased training accuracy) faster than their uniform sampling counterparts. Further, AdaGrad with adaptive sampling exhibited modest, yet consistent, improvements in test accuracy in early iterations of training. Figure 1d illustrates the effect of varying the amplitude parameter, α. Higher values of α led to faster empirical risk reduction, but lower test accuracy: a sign of overfitting the posterior to the data, which concurs with Theorems 4 and 5 regarding the influence of α on the KL divergence. Figure 1e plots the KL divergence from the conditional prior, P_t, to the conditional posterior, Q_t, given sampled indices (i_1, ..., i_{t−1}); i.e., D_KL(Q_t ‖ P_t).
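For reference, the quantity plotted in Figure 1e can be computed directly from the sampling weights at iteration t, since the conditional prior P_t is uniform; a short sketch (the weights below are illustrative, not from the experiments):

```python
import math

def kl_from_uniform(q):
    """D_KL(Q_t || P_t), where Q_t(i) = q_i / sum(q) and P_t is uniform on n items."""
    n, Z = len(q), sum(q)
    return sum((w / Z) * math.log((w / Z) * n) for w in q if w > 0)

# uniform weights give zero divergence; skewed weights give positive divergence
kl_uniform = kl_from_uniform([1.0, 1.0, 1.0, 1.0])
kl_skewed = kl_from_uniform([10.0, 1.0, 1.0, 1.0])
```

A uniform weight vector yields zero divergence, while a skewed one yields a strictly positive value, mirroring the rise and fall of the curve in Figure 1e.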
The sampling distribution quickly diverged in early iterations, to focus on examples where the model erred, then gradually converged to a uniform distribution as the empirical risk converged.

Figure 1: Experimental results on CIFAR-10, averaged over 10 random initializations and training runs. (Best viewed in color.) Figure 1a plots learning curves of training cross-entropy (lower is better). Figures 1b and 1c, respectively, plot train and test accuracies (higher is better). Figure 1d highlights the impact of the amplitude parameter, α, on accuracy. Figure 1e plots the KL divergence from the conditional prior, P_t, to the conditional posterior, Q_t, given sampled indices (i_1, ..., i_{t−1}).

7 Conclusions and Future Work

We presented new generalization bounds for randomized learning algorithms, using a novel combination of PAC-Bayes and algorithmic stability. The bounds inspired an adaptive sampling algorithm for SGD that dynamically updates the sampling distribution based on the training data and model. Experimental results with this algorithm indicate that it can reduce empirical risk faster than uniform sampling while also improving out-of-sample accuracy. Future research could investigate different utility functions and distribution updates, or explore the connections to related algorithms. We are also interested in providing stronger generalization guarantees, with polylogarithmic dependence on δ⁻¹, for non-convex objective functions, but proving Õ(1/√(nT))-uniform hyperparameter stability without (strong) convexity is difficult. We hope to address this problem in future work.

⁸Each training algorithm started from the same initial hypothesis.

References

[1] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Neural Information Processing Systems, 2008.
[2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[3] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Institute of Mathematical Statistics Lecture Notes – Monograph Series. Institute of Mathematical Statistics, 2007.
[4] M. Collins, A. Globerson, T. Koo, X. Carreras, and P. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775–1822, 2008.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[6] A. Elisseeff, T. Evgeniou, and M. Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6:55–79, 2005.
[7] J. Feng, T. Zahavy, B. Kang, H. Xu, and S. Mannor. Ensemble robustness of deep learning algorithms. CoRR, abs/1602.02389, 2016.
[8] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory, 1995.
[9] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning, 2009.
[10] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
[11] A. Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In International Conference on Machine Learning, 2014.
[12] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[13] I. Kuzborskij and C. Lampert. Data-dependent stability of stochastic gradient descent. CoRR, abs/1703.01678, 2017.
[14] J. Langford and J. Shawe-Taylor. PAC-Bayes and margins. In Neural Information Processing Systems, 2002.
[15] J. Lin and L. Rosasco. Optimal learning for multi-pass stochastic gradient methods. In Neural Information Processing Systems, 2016.
[16] J. Lin, R. Camoriano, and L. Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In International Conference on Machine Learning, 2016.
[17] B. London, B. Huang, and L. Getoor. Stability and generalization in structured prediction. Journal of Machine Learning Research, 17(222):1–52, 2016.
[18] D. McAllester. PAC-Bayesian model averaging. In Computational Learning Theory, 1999.
[19] L. Rosasco and S. Villa. Learning with incremental iterative regularization. In Neural Information Processing Systems, 2015.
[20] M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.
[21] S. Shalev-Shwartz. SelfieBoost: A boosting algorithm for deep learning. CoRR, abs/1411.3436, 2014.
[22] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why. In International Conference on Machine Learning, 2016.
[23] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.
[24] Y. Wang, J. Lei, and S. Fienberg. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. Journal of Machine Learning Research, 17(183):1–40, 2016.
[25] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, 2015.