{"title": "A Stability-based Validation Procedure for Differentially Private Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2652, "page_last": 2660, "abstract": "Differential privacy is a cryptographically motivated definition of privacy which has gained considerable attention in the algorithms, machine-learning and data-mining communities. While there has been an explosion of work on differentially private machine learning algorithms, a major barrier to achieving end-to-end differential privacy in practical machine learning applications is the lack of an effective procedure for differentially private parameter tuning, or, determining the parameter value, such as a bin size in a histogram, or a regularization parameter, that is suitable for a particular application.   In this paper, we introduce a generic validation procedure for differentially private machine learning algorithms that apply when a certain stability condition holds on the training algorithm and the validation performance metric. The training data size and the privacy budget used for training in our procedure is independent of the number of parameter values searched over. We apply our generic procedure to two fundamental tasks in statistics and machine-learning -- training a regularized linear classifier and building a histogram density estimator that result in end-to-end differentially private solutions for these problems.", "full_text": "A Stability-based Validation Procedure for\nDifferentially Private Machine Learning\n\nKamalika Chaudhuri\n\nDepartment of Computer Science and Engineering\n\nUC San Diego, La Jolla CA 92093\n\nkamalika@cs.ucsd.edu\n\nStaal Vinterbo\n\nDivision of Biomedical Informatics\nUC San Diego, La Jolla CA 92093\n\nsav@ucsd.edu\n\nAbstract\n\nDifferential privacy is a cryptographically motivated de\ufb01nition of privacy which\nhas gained considerable attention in the algorithms, machine-learning and data-\nmining communities. 
While there has been an explosion of work on differentially\nprivate machine learning algorithms, a major barrier to achieving end-to-end dif-\nferential privacy in practical machine learning applications is the lack of an ef-\nfective procedure for differentially private parameter tuning, or, determining the\nparameter value, such as a bin size in a histogram, or a regularization parameter,\nthat is suitable for a particular application.\nIn this paper, we introduce a generic validation procedure for differentially private\nmachine learning algorithms that apply when a certain stability condition holds on\nthe training algorithm and the validation performance metric. The training data\nsize and the privacy budget used for training in our procedure is independent of\nthe number of parameter values searched over. We apply our generic procedure to\ntwo fundamental tasks in statistics and machine-learning \u2013 training a regularized\nlinear classi\ufb01er and building a histogram density estimator that result in end-to-\nend differentially private solutions for these problems.\n\n1\n\nIntroduction\n\nPrivacy-preserving machine learning algorithms are increasingly essential for settings where sensi-\ntive and personal data are mined. The emerging standard for privacy-preserving computation for\nthe past few years is differential privacy [7]. Differential privacy is a cryptographically motivated\nde\ufb01nition, which guarantees privacy by ensuring that the log-likelihood of any outcome does not\nchange by more than \u03b1 due to the participation of a single individual; an adversary will thus have\ndif\ufb01culty inferring the private value of a single individual when \u03b1 is small. This is achieved by\nadding random noise to the data or to the result of a function computed on the data. The value \u03b1 is\ncalled the privacy budget, and measures the level of privacy risk allowed. 
As more noise is needed\nto achieve lower \u03b1, the price of higher privacy is reduced utility or accuracy. The past few years\nhave seen an explosion in the literature on differentially private algorithms, and there currently exist\ndifferentially private algorithms for many statistical and machine-learning tasks such as classi\ufb01cation\n[4, 15, 23, 10], regression [18], PCA [2, 5, 17, 12], clustering [2], density estimation [28, 19],\namong others.\nMany statistics and machine learning algorithms involve one or more parameters, for example, the\nregularization parameter \u03bb in Support Vector Machines and the number of clusters in k-means.\nAccurately setting these parameters is critical to performance. However, there is no good a priori way\nto set these parameters, and common practice is to run the algorithm for a few different plausible\nparameter values on a dataset, and then select the output that yields the best performance on held-out\nvalidation data. This process is often called parameter-tuning, and is an essential component of any\npractical machine-learning system.\nA major barrier to achieving end-to-end differential privacy in practical machine-learning applications\nis the absence of an effective procedure for differentially private parameter-tuning. Most\nprevious experimental works either assume that a good parameter value is known a priori [15, 5] or\nuse a heuristic to determine a suitable parameter value [19, 28]. Currently, parameter-tuning with\ndifferential privacy is done in two ways. The \ufb01rst is to run the training algorithm on the same data\nmultiple times. However, re-using the data leads to a degradation in the privacy guarantees, and thus\nto maintain the privacy budget \u03b1, for each training, we need to use a privacy budget that shrinks\npolynomially with the number of parameter values. 
The second procedure, used by [4], is to divide\nthe training data into disjoint sets and train for each parameter value using a different set. Both so-\nlutions are highly sub-optimal, particularly, if a large number of parameter values are involved \u2013 the\n\ufb01rst due to the lower privacy budget, and the second due to less data. Thus the challenge is to design\na differentially private validation procedure that uses the data and the privacy budget effectively, but\ncan still do parameter-tuning. This is an important problem, and has been mentioned as an open\nquestion by [28] and [4].\nIn this paper, we show that it is indeed possible to do effective parameter-tuning with differential\nprivacy in a fairly general setting, provided the training algorithm and the performance measure\nused to evaluate its output on the validation data together obey a certain stability condition. We\ncharacterize this stability condition by introducing a notion of (\u03b21, \u03b22, \u03b4)-stability; loosely speaking,\nstability holds if the validation performance measure does not change very much when one person\u2019s\nprivate value in the training set changes, when exactly the same random bits are used in the training\nalgorithm in both cases or, when one person\u2019s private value in the validation set changes. The second\ncondition is fairly standard, and our key insight is in characterizing the \ufb01rst condition and showing\nthat it can help in differentially private parameter tuning.\nWe next design a generic differentially private training and validation procedure that provides end-\nto-end privacy provided this stability condition holds. 
The training set size and the privacy budget\nused by our training algorithms are independent of k, the number of parameter values, and the\naccuracy of our validation procedure degrades only logarithmically with k.\nWe apply our generic procedure to two fundamental tasks in machine-learning and statistics \u2013 train-\ning a linear classi\ufb01er using regularized convex optimization, and building a histogram density esti-\nmator. We prove that existing differentially private algorithms for these problems obey our notion\nof stability with respect to standard validation performance measures, and we show how to combine\nthem to provide end-to-end differentially private solutions for these tasks. In particular, our appli-\ncation to linear classi\ufb01cation is based on existing differentially private procedures for regularized\nconvex optimization due to [4], and our application to histogram density estimation is based on the\nalgorithm variant due to [19].\nFinally we provide an experimental evaluation of our procedure for training a logistic regression\nclassi\ufb01er on real data.\nIn our experiments, even for a moderate value of k, our procedure out-\nperformed existing differentially private solutions for parameter tuning, and achieved performance\nonly slightly worse than knowing the best parameter to use ahead of time. We also observed that\nour procedure, in contrast to the other procedures we tested, improved the correspondence between\npredicted probabilities and observed outcomes, often referred to as model calibration.\nRelated Work. 
Differential privacy, proposed by [7], has gained considerable attention in the algorithms,\ndata-mining and machine-learning communities over the past few years as there has been a\nlarge explosion of theoretical and experimental work on differentially private algorithms for statistical\nand machine-learning tasks [10, 2, 15, 19, 27, 28, 3] \u2013 see [24] for a recent survey of machine\nlearning methods with a focus on continuous data. In particular, our case study on linear classi\ufb01cation\nis based on existing differentially private procedures for regularized convex optimization,\nwhich were proposed by [4], and extended by [23, 18, 15]. There has also been a large body of\nwork on differentially private histogram construction in the statistics, algorithms and database literature\n[7, 19, 27, 28, 20, 29, 14]. We use the algorithm variant due to [19].\nWhile the problem of differentially private parameter tuning has been mentioned in several works,\nto the best of our knowledge, an ef\ufb01cient systematic solution has been elusive. Most previous\nexperimental works either assume that a good parameter value is known a priori [15, 5] or use a\nheuristic to determine a suitable parameter value [19, 28]. [4] use a parameter-tuning procedure\nwhere they divide the training data into disjoint sets, and train for a parameter value on each set. [28]\nmentions \ufb01nding a good bin size for a histogram using a differentially private validation procedure as\nan open problem.\nFinally, our analysis uses ideas similar to the analysis of the Multiplicative Weights Method for\nanswering a set of linear queries [13].\n\n2 Preliminaries\n\nPrivacy De\ufb01nition and Composition Properties. 
We adopt differential privacy as our notion of\nprivacy.\nDe\ufb01nition 1 A (randomized) algorithm A whose output lies in a domain \ud835\udcae is said to be (\u03b1, \u03b4)-differentially\nprivate if for all measurable S \u2286 \ud835\udcae, for all datasets D and D\u2032 that differ in the value\nof a single individual, it is the case that: Pr(A(D) \u2208 S) \u2264 e^\u03b1 Pr(A(D\u2032) \u2208 S) + \u03b4. An algorithm is\nsaid to be \u03b1-differentially private if \u03b4 = 0.\n\nHere \u03b1 and \u03b4 are privacy parameters where lower \u03b1 and \u03b4 imply higher privacy. Differential privacy\nhas been shown to have many desirable properties, such as robustness to side information [7] and\nresistance to composition attacks [11].\nAn important property of differential privacy is that the privacy guarantees degrade gracefully if\nthe same sensitive data is used in multiple private computations. In particular, if we apply an \u03b1-differentially\nprivate procedure k times on the same data, the result is k\u03b1-differentially private as\nwell as (\u03b1\u2032, \u03b4)-differentially private for \u03b1\u2032 = k\u03b1(e^\u03b1 \u2212 1) + \u221a(2k log(1/\u03b4)) \u03b1 [7, 8]. These privacy\ncomposition results are the basis of existing differentially private parameter tuning procedures.\nTraining Procedure and Validation Score. Typical (non-private) machine learning algorithms\nhave one or more undetermined parameters, and standard practice is to run the machine learning\nalgorithm for a number of different parameter values on a training set, and evaluate the outputs on a\nseparate held-out validation dataset. The \ufb01nal output is the one which performs best on the validation\ndata. For example, in linear classi\ufb01cation, we train logistic regression or SVM classi\ufb01ers with\nseveral different values of the regularization parameter \u03bb, and then select the classi\ufb01er which has\nthe best performance on held-out validation data. 
Our goal in this paper is to design a differentially\nprivate version of this procedure which uses the privacy budget ef\ufb01ciently.\nThe full validation process thus has two components \u2013 a training procedure, and a validation score\nwhich evaluates how good the training procedure is.\nWe assume that training and validation data are drawn from a domain X , and the result of the\ndifferentially private training algorithm lies in a domain C. For example, for linear classi\ufb01cation, X\nis the set of all labelled examples (x, y) where x \u2208 R^d and y \u2208 {\u22121, 1}, and C is the set of linear\nclassi\ufb01ers in d dimensions. We use n to denote the size of a training set, m to denote the size of a\nheld-out validation set, and \u0398 to denote a set of parameters.\nA differentially private training procedure is a randomized algorithm, which takes as input a (sensitive)\ntraining dataset, a parameter (of the training procedure), and a privacy parameter \u03b1 and outputs\nan element of C; the procedure is expected to be \u03b1-differentially private. For ease of exposition and\nproof, we represent a differentially private training procedure \ud835\udcaf as a tuple \ud835\udcaf = (G, F ), where G is\na density over sequences of real numbers, and F is a function, which takes as input a training set, a\nparameter in the parameter set \u0398, a privacy parameter \u03b1, and a random sequence drawn from G, and\noutputs an element of C. F is thus a deterministic function, and the randomization in the training\nprocedure is isolated in the draw from G.\nObserve that any differentially private algorithm can be represented as such a tuple. For example,\ngiven x_1, . . . , x_n \u2208 [0, 1], an \u03b1-differentially private approximation to the sample mean x\u0304 is\nx\u0304 + (1/(\u03b1n)) Z, where Z is drawn from the standard Laplace distribution. We can represent this procedure\nas a tuple \ud835\udcaf = (G, F ) as follows: G is the standard Laplace density over reals, and for any \u03b8,\nF ({x_1, . . . , x_n}, \u03b8, \u03b1, r) = x\u0304 + r/(\u03b1n). In general, more complicated procedures will require more\ninvolved functions F .\nA validation score is a function q : C \u00d7 X^m \u2192 R which takes an object h in C and a validation\ndataset V , and outputs a score which re\ufb02ects the quality of h with respect to V . For example, a\ncommon validation score used in linear classi\ufb01cation is classi\ufb01cation accuracy.\nIn (non-private)\nvalidation, if h_i is obtained by running the machine learning algorithm with parameter \u03b8_i, then the\ngoal is to output the i (or equivalently the h_i) which maximizes q(h_i, V ); our goal is to output\nan i that approximately maximizes q(h_i, V ) while still preserving the privacy of V as well as the\nsensitive training data used in constructing the h_i.\n\n3 Stability and Generic Validation Procedure\n\nWe now introduce and discuss our notion of stability, and provide a generic validation procedure\nthat uses the privacy budget ef\ufb01ciently when this notion of stability holds.\n\nDe\ufb01nition 2 ((\u03b21, \u03b22, \u03b4)-Stability) A validation score q is said to be (\u03b21, \u03b22, \u03b4)-stable with respect\nto a training procedure \ud835\udcaf = (G, F ), a privacy parameter \u03b1, and a parameter set \u0398 if the following\nholds. There exists a set \u03a3 such that Pr_{R\u223cG}(R \u2208 \u03a3) \u2265 1 \u2212 \u03b4, and whenever R \u2208 \u03a3, the following\ntwo conditions hold:\n\n1. Training Stability: For all \u03b8 \u2208 \u0398, V , and all training sets T and T\u2032 that differ in a single\nentry, |q(F (T, \u03b8, \u03b1, R), V ) \u2212 q(F (T\u2032, \u03b8, \u03b1, R), V )| \u2264 \u03b21/n.\n\n2. 
Validation Stability: For all T , \u03b8 \u2208 \u0398, and for all V and V\u2032 that differ in a single entry,\n|q(F (T, \u03b8, \u03b1, R), V ) \u2212 q(F (T, \u03b8, \u03b1, R), V\u2032)| \u2264 \u03b22/m.\n\nCondition (1), the training stability condition, bounds the change in the validation score q, when one\nperson\u2019s private data in the training set T changes, and the validation set V as well as the value of the\nrandom variable R remains the same. Our validation procedure critically relies on this condition,\nand our main contribution in this paper is to identify and exploit it to provide a validation procedure\nthat uses the privacy budget ef\ufb01ciently.\nAs F (T, \u03b8, \u03b1, R) is a deterministic function, Condition (2), the validation stability condition, bounds\nthe change in q when one person\u2019s private data in the validation set V changes, and the output of the\ntraining procedure remains the same. We observe that (some version of) Condition (2) is a standard\nrequirement in existing differentially private algorithms that preserve the privacy of the validation\ndataset while selecting an h \u2208 C that approximately maximizes q(h, V ), even if it is not required to\nmaintain privacy with respect to the training data.\nSeveral remarks are in order. First, observe that Condition (1) is a property of the differentially\nprivate training algorithm (in addition to q and the non-private quantity being approximated). Even\nif all else remains the same, different differentially private approximations to the same non-private\nquantity will have different values of \u03b21.\nSecond, Condition (1) does not always hold for small \u03b21 as an immediate consequence of differential\nprivacy of the training procedure. 
Differential privacy ensures that the probability of any outcome is\nalmost the same when the inputs differ in the value of a single individual; Condition (1) requires that\neven when the same randomness is used, the validation score evaluated on the actual output of the\nalgorithm does not change very much when the inputs differ by a single individual\u2019s private value.\nIn Section 6.1, we present an example of a problem and two \u03b1-differentially private training algorithms\nwhich approximately optimize the same function; the \ufb01rst algorithm is based on the exponential\nmechanism, and the second on a maximum of Laplace random variables mechanism. We show\nthat while both provide \u03b1-differential privacy guarantees, the \ufb01rst algorithm does not satisfy training\nstability for \u03b21 = o(n) and small enough \u03b4 while the second one ensures training stability for\n\u03b21 = 1 and \u03b4 = 0. In Section 4, we present two case studies of commonly used differentially private\nalgorithms where Conditions (1) and (2) hold for constant \u03b21 and \u03b22.\nWhen the (\u03b21, \u03b22, \u03b4)-stability condition holds, we can design an end-to-end differentially private\nparameter tuning algorithm, which is shown in Algorithm 2. The algorithm \ufb01rst uses a validation\nprocedure to determine which parameter out of the given set \u0398 is (approximately) optimal based\non the held-out data (see Algorithm 1). In the next step, the training data is re-used along with the\nparameter output by Algorithm 1 and fresh randomness to generate the \ufb01nal output. Note that we\nuse Exp(\u03b3) to denote the exponential distribution with expectation \u03b3.\n\nAlgorithm 1 Validate(\u0398, \ud835\udcaf, T , V , \u03b21, \u03b22, \u03b11, \u03b12)\n1: Inputs: Parameter list \u0398 = {\u03b8_1, . . . , \u03b8_k}, training procedure \ud835\udcaf = (G, F ), validation score q,\ntraining set T , validation set V , stability parameters \u03b21 and \u03b22, training privacy parameter \u03b11,\nvalidation privacy parameter \u03b12.\n2: for i = 1, . . . , k do\n3:   Draw R_i \u223c G. Compute h_i = F (T, \u03b8_i, \u03b11, R_i).\n4:   Let \u03b2 = max(\u03b21/n, \u03b22/m).\n5:   Let t_i = q(h_i, V ) + 2\u03b2 Z_i, where Z_i \u223c Exp(1/\u03b12).\n6: end for\n7: Output i* = argmax_i t_i.\n\nAlgorithm 1 takes as input a training procedure \ud835\udcaf, a parameter list \u0398, a validation score q, training\nand validation datasets T and V , and privacy parameters \u03b11 and \u03b12. It runs the training procedure\n\ud835\udcaf on the same training set T with privacy budget \u03b11 for each parameter in \u0398 to generate outputs\nh_1, h_2, . . ., and then uses an \u03b12-differentially private procedure to select the index i* such that\nthe validation score q(h_{i*}, V ) is (approximately) maximum. For simplicity, we use a maximum of\nExponential random variables procedure, inspired by [1], to \ufb01nd the approximate maximum; an\nexponential mechanism [21] may also be used instead. Algorithm 2 then re-uses the training data\nset T to train with parameter \u03b8_{i*} to get the \ufb01nal output.\n\nAlgorithm 2 End-to-end Differentially Private Training and Validation Procedure\n1: Inputs: Parameter list \u0398 = {\u03b8_1, . . . , \u03b8_k}, training procedure \ud835\udcaf = (G, F ), validation score q,\ntraining set T , validation set V , stability parameters \u03b21 and \u03b22, training privacy parameter \u03b11,\nvalidation privacy parameter \u03b12.\n2: i* = Validate(\u0398, \ud835\udcaf, T, V, \u03b21, \u03b22, \u03b11, \u03b12).\n3: Draw R \u223c G. Output h = F (T, \u03b8_{i*}, \u03b11, R).\n\n3.1 Performance Guarantees\n\nTheorem 1 shows that Algorithm 1 is (\u03b12, \u03b4)-differentially private, and Theorem 2 shows privacy\nguarantees on Algorithm 2. Detailed proofs of both theorems are provided in the Supplementary\nMaterial. We observe that Conditions (1) and (2) are critical to the proof of Theorem 1.\n\nTheorem 1 (Privacy Guarantees for Validation Procedure) If the validation score q is\n(\u03b21, \u03b22, \u03b4/k)-stable with respect to the training procedure \ud835\udcaf, the privacy parameter \u03b11 and the\nparameter set \u0398, then Algorithm 1 guarantees (\u03b12, \u03b4)-differential privacy.\nTheorem 2 (End-to-end Privacy Guarantees) If the conditions in Theorem 1 hold, and if \ud835\udcaf is\n\u03b11-differentially private, then Algorithm 2 is (\u03b11 + \u03b12, \u03b4)-differentially private.\nTheorem 3 shows guarantees on the utility of the validation procedure \u2013 that it selects an index i*\nwhich is not too suboptimal.\n\nTheorem 3 (Utility Guarantees) Let h_1, . . . , h_k be the output of the differentially private training\nprocedure in Step (3) of Algorithm 1. Then, with probability \u2265 1 \u2212 \u03b40,\nq(h_{i*}, V ) \u2265 max_{1\u2264i\u2264k} q(h_i, V ) \u2212 2\u03b2 log(k/\u03b40)/\u03b12.\n\n4 Case Studies\n\nWe next show that Algorithm 2 may be applied to design end-to-end differentially private training\nand validation procedures for two fundamental statistical and machine-learning tasks \u2013 training a linear\nclassi\ufb01er, and building a histogram density estimator. In each case, we use existing differentially\nprivate algorithms and validation scores for these tasks. 
We show that the validation score satis\ufb01es\nthe (\u03b21, \u03b22, \u03b4)-stability property with respect to the training procedure for small values of \u03b21 and\n\u03b22, and thus we can apply Algorithm 2 with a small value of \u03b2 to obtain end-to-end differential\nprivacy.\nDetails of the case study for regularized linear classi\ufb01cation are shown in Section 4.1, and those for\nhistogram density estimation are presented in the Supplementary Material.\n\n4.1 Linear Classi\ufb01cation based on Logistic Regression and SVM\nGiven a set of labelled examples (x_1, y_1), . . . , (x_n, y_n) where x_i \u2208 R^d, \u2016x_i\u2016 \u2264 1 for all i, and\ny_i \u2208 {\u22121, 1}, the goal in linear classi\ufb01cation is to train a linear classi\ufb01er that largely separates\nexamples from the two classes. A popular solution in machine learning is to \ufb01nd a classi\ufb01er w* by\nsolving a regularized convex optimization problem:\n\nw* = argmin_{w \u2208 R^d} (\u03bb/2) \u2016w\u2016^2 + (1/n) \u2211_{i=1}^{n} \u2113(w, x_i, y_i)    (1)\n\nHere \u03bb is a regularization parameter, and \u2113 is a convex loss function. When \u2113 is the logistic loss\nfunction \u2113(w, x, y) = log(1 + e^{\u2212y w\u22a4x}), then we have logistic regression. When \u2113 is the hinge loss\n\u2113(w, x, y) = max(0, 1 \u2212 y w\u22a4x), then we have Support Vector Machines. The optimal value of \u03bb\nis data-dependent, and there is no good pre-de\ufb01ned way to select \u03bb a priori. In practice, the optimal\n\u03bb is determined by training a small number of classi\ufb01ers with different \u03bb values, and picking the one\nthat has the best performance on a held-out validation dataset.\n[4] present two algorithms for computing differentially private approximations to these regularized\nconvex optimization problems for \ufb01xed \u03bb: output perturbation and objective perturbation. We restate\noutput perturbation as Algorithm 4 (in the Supplementary Material) and objective perturbation as\nAlgorithm 3. It was shown by [4] that provided certain conditions hold on \u2113 and the data, Algorithm 4\nis \u03b1-differentially private; moreover, with some additional conditions on \u2113, Algorithm 3 is\n(\u03b1 + 2 log(1 + c/(\u03bbn)))-differentially private, where c is a constant that depends on the loss function \u2113, and\n\u03bb is the regularization parameter.\n\nAlgorithm 3 Objective Perturbation for Differentially Private Linear Classi\ufb01cation\n1: Inputs: Regularization parameter \u03bb, training set T = {(x_i, y_i), i = 1, . . . , n}, privacy parameter \u03b1.\n2: Let G be the following density over R^d: \u03c1_G(r) \u221d e^{\u2212\u2016r\u2016}. Draw R \u223c G.\n3: Solve the convex optimization problem:\n\nw* = argmin_{w \u2208 R^d} (\u03bb/2) \u2016w\u2016^2 + (1/n) \u2211_{i=1}^{n} \u2113(w, x_i, y_i) + (2/(\u03b1n)) R\u22a4w    (2)\n\n4: Output w*.\n\nIn the sequel, we use the notation X to denote the set {x \u2208 R^d : \u2016x\u2016 \u2264 1}.\nDe\ufb01nition 3 A function g : R^d \u00d7 X \u00d7 {\u22121, 1} \u2192 R is said to be L-Lipschitz if for all w, w\u2032 \u2208 R^d,\nfor all x \u2208 X , and for all y, |g(w, x, y) \u2212 g(w\u2032, x, y)| \u2264 L \u00b7 \u2016w \u2212 w\u2032\u2016.\nLet V = {(x\u0304_i, y\u0304_i), i = 1, . . . , m} be the validation dataset. For our validation score, we choose a\nfunction of the form:\n\nq(w, V ) = \u2212(1/m) \u2211_{i=1}^{m} g(w, x\u0304_i, y\u0304_i)    (3)\n\nwhere g is an L-Lipschitz loss function. In particular, the logistic loss and the hinge loss are 1-Lipschitz,\nwhereas the 0/1 loss is not L-Lipschitz for any L. Other examples of 1-Lipschitz but\nnon-convex losses include the ramp loss: g(w, x, y) = min(1, max(0, 1 \u2212 y w\u22a4x)).\nThe following theorem shows that any non-negative and L-Lipschitz validation score is stable with\nrespect to Algorithms 3 and 4 and a set of regularization parameters \u039b; a detailed proof is provided\nin the Supplementary Material. Thus we can use Algorithm 2 along with this training procedure\nand any L-Lipschitz validation score to get an end-to-end differentially private algorithm for linear\nclassi\ufb01cation.\nTheorem 4 (Stability of differentially private linear classi\ufb01ers) Let \u039b = {\u03bb_1, . . . , \u03bb_k} be a set\nof regularization parameters, let \u03bb_min = min_{i=1,...,k} \u03bb_i, and let g* = max_{(x,y)\u2208X, w\u2208R^d} g(w, x, y). If\n\u2113 is convex and 1-Lipschitz, and if g is L-Lipschitz and non-negative, then the validation score q in\nEquation 3 is (\u03b21, \u03b22, \u03b4/k)-stable with respect to Algorithms 3 and 4, \u03b1 and \u039b for:\n\n\u03b21 = 2L/\u03bb_min,    \u03b22 = min(g*, (L/\u03bb_min) (1 + d log(dk/\u03b4)/(\u03b1n)))\n\nExample. If g is chosen to be the hinge loss, then \u03b21 = 2/\u03bb_min and\n\u03b22 = (1/\u03bb_min) (1 + d log(dk/\u03b4)/(\u03b1n)). This follows from the fact that the hinge loss is 1-Lipschitz, but may be\nunbounded for w of unbounded norm.\nIf g is chosen to be the ramp loss, then \u03b21 = 2/\u03bb_min, and \u03b22 = 1 (assuming that \u03bb_min \u2264 1). This\nfollows from the fact that the ramp loss is 1-Lipschitz, but bounded at 1 for any w and (x, y) \u2208 X .\n\n5 Experiments\n\nIn order to evaluate Algorithm 2 empirically, we compare the regularizer parameter values and performance\nof regularized logistic regression classi\ufb01ers the algorithm produces with those produced\nby four alternative methods. 
We used datasets from two domains, and used 10 times 10-fold cross-validation\n(CV) to reduce variability in the computed performance averages.\n\nThe Methods Each method takes input (\u03b1, \u0398, T, V ), where \u03b1 denotes the allowed differential\nprivacy, T is a training set, V is a validation set, and \u0398 = {\u03b8_1, . . . , \u03b8_k} a list of k regularizer values.\nAlso, let oplr(\u03b1, \u03bb, T ) denote the application of the objective perturbation training procedure given\nin Algorithm 3 such that it yields \u03b1-differential privacy.\nThe \ufb01rst of the \ufb01ve methods we compare is Stability, the application of Algorithm 2 with oplr used\nfor learning classi\ufb01ers, \u03b4 chosen in an ad-hoc manner to be 0.01, average negative ramp loss used as\nvalidation score q, and with \u03b11 = \u03b12 = \u03b1/2.\nThe four other methods work by performing the following 4 steps: (1) for each \u03b8_i \u2208 \u0398, train a\ndifferentially private classi\ufb01er f_i = oplr(\u03b1_i, \u03b8_i, T_i), (2) determine the number of errors e_i each f_i\nmakes on validation set V , (3) randomly choose i* from {1, 2, . . . , k} with probability P (i* = i | p_i),\nand (4) output (\u03b8_{i*}, f_{i*}).\nWhat differentiates the four alternative methods is how \u03b1_i, T_i, and p_i are determined. For\nalphaSplit: \u03b1_i = \u03b1/k, T_i = T , p_i \u221d e^{\u2212\u03b1 e_i/2}; dataSplit: \u03b1_i = \u03b1, partition T into k equally\nsized sets T_i, p_i \u221d e^{\u2212\u03b1 e_i/2} (used in [4]); Random: \u03b1_i = \u03b1, T_i = T , p_i \u221d 1; and Control: \u03b1_i = \u03b1,\nT_i = T , p_i \u221d 1(i = arg max_j q(f_j, V )). Note that for alphaSplit, \u03b1/k > \u03b1\u2032, where \u03b1\u2032 is the\nsolution of \u03b1 = k(e^{\u03b1\u2032} \u2212 1)\u03b1\u2032 + \u221a(2k log(1/\u03b4)) \u03b1\u2032, for all of our experimental settings, except when\n\u03b1 = 0.3, in which case \u03b1/k > \u03b1\u2032 \u2212 0.0003. The method Control is not private, and serves to provide an\napproximate upper bound on the performance of Stability. The three other alternative methods are\ndifferentially private, which we state in the following theorem.\n\nTheorem 5 (Privacy of alternative methods) If T and V are disjoint, both alphaSplit and\ndataSplit are \u03b1-differentially private. Random is \u03b1-differentially private even if T and V are\nnot disjoint, in which case alphaSplit and dataSplit are 2\u03b1-differentially private.\n\nProcedures and Data We performed 10 times 10-fold CV as follows. For round i in each of the CV\nexperiments, fold i was used as a test set W on which the produced classi\ufb01ers were evaluated, fold\n(i mod 10)+1 was used as V , and the remaining 8 folds were used as T . Furthermore, k = 10 with\n\u0398 = {0.001, 0.112, 0.223, 0.334, 0.445, 0.556, 0.667, 0.778, 0.889, 1}. Note that the order of \u0398 is\nchosen such that i < j implies \u03b8_i < \u03b8_j. By Theorems 2 and 5, all methods except Control produce\na (\u03b1, \u03b4)-differentially private classi\ufb01er. Classi\ufb01er performance was evaluated using the area under\nthe receiver operating characteristic curve [25] (AUC) as well as mean squared error (MSE). All computations\nwere done using the R environment [22], and data sets were scaled such that covariate vectors were\nconstrained to the unit ball. We used the following data available from the UCI Machine Learning\nRepository [9]:\nAdult \u2013 98 predictors (14 original including categorical variables that needed to be recoded). The\ndata set describes measurements on cases taken from the 1994 Census data base. The classi\ufb01cation is\nwhether or not a person has an annual income exceeding 50000 USD, which has a prevalence of 0.22.\nEach experiment involves computing more than 24000 classi\ufb01ers. 
In order to reduce computation\ntime, we selected 52 predictors using the step procedure for a model computed by glm with family\nbinomial and logit link function.\nMagic \u2013 10 predictors on 19020 cases. The data set describes simulated high energy gamma\nparticles registered by a ground-based atmospheric Cherenkov gamma telescope. The classification is\nwhether particles are primary gammas (signal) or from hadronic showers initiated by cosmic rays in\nthe upper atmosphere (background). The prevalence of primary gammas is 0.35.\n\n(a) Averages of AUC for the two data sets.\n\n(b) Averages of MSE for the two data sets.\n\nFigure 1: A summary of 10 times 10-fold cross-validation experiments for different privacy levels\n\u03b1. Each point in the figure represents a summary of 100 data points. The error bars indicate a\nbootstrap sample estimate of the 95% confidence interval of the mean. A small amount of jitter was\nadded to positions on the x-axes to avoid over-plotting.\n\nResults Figure 1 summarizes classifier performances and regularizer choices for the different\nvalues of the privacy parameter \u03b1, aggregated over all cross-validation runs. Figure 1a shows average\nperformance in terms of AUC, and Figure 1b shows average performance in terms of MSE.\nLooking at AUC in our experiments, Stability significantly outperformed alphaSplit and dataSplit.\nHowever, Stability only outperformed Random for \u03b1 > 1 in the Magic data set, and was in fact\noutperformed by Random in the Adult data set. In the Adult data set, regularizer choice did not seem\nto matter, as Random performed as well as Control. For MSE, on the other hand, Stability\noutperformed the differentially private alternatives in all experiments. We suggest the following\nintuition regarding these results. 
The calibration of a logistic regression model instance, i.e., the\ndifference between predicted probabilities and a 0/1 encoding of the corresponding labels, is not\ncaptured well by AUC (or 0/1 error rate) as AUC is insensitive to all strictly monotonically increasing\ntransformations of the probabilities. MSE is often used as a measure of probabilistic model\ncalibration and can be decomposed into two terms: reliability (a calibration term), and refinement\n(a discrimination measure) which is related to the AUC. In the Adult data set, the minor change\nin AUC of Control and Random for \u03b1 > 0.5, together with the apparent insensitivity of AUC\nto regularizer value, suggests that any improvement in Stability performance can only come from\n(the observed) improved calibration. Unlike in the Adult data set, there is an AUC performance gap\nbetween Control and Random in the Magic data set. This means that regularizer choice matters for\ndiscrimination, and we observe improvement for Stability in both discrimination and calibration.\n\nAcknowledgements This work was supported by NIH grants R01 LM07273 and U54\nHL108460, the Hellman Foundation, and NSF IIS 1253942.\n\nReferences\n[1] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering frequent patterns in sensitive data. In KDD, 2010.\n[2] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In PODS, 2005.\n[3] K. Chaudhuri and D. Hsu. Convergence rates for differentially private statistical estimation. In ICML, 2012.\n[4] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069\u20131109, March 2011.\n[5] K. Chaudhuri, A.D. Sarwate, and K. Sinha. 
Near-optimal algorithms for differentially-private principal components. Journal of Machine Learning Research, 2013 (to appear).\n[6] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer, 2001.\n[7] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, Berlin, Heidelberg, 2006.\n[8] C. Dwork, G. Rothblum, and S. Vadhan. Boosting and differential privacy. In FOCS, 2010.\n[9] A. Frank and A. Asuncion. UCI machine learning repository, 2013.\n[10] A. Friedman and A. Schuster. Data mining with differential privacy. In KDD, 2010.\n[11] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, 2008.\n[12] M. Hardt and A. Roth. Beyond worst-case analysis in private singular vector computation. In STOC, 2013.\n[13] M. Hardt and G. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, pages 61\u201370, 2010.\n[14] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. PVLDB, 3(1):1021\u20131032, 2010.\n[15] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, 2012.\n[16] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. JASA, 91(433):401\u2013407, 1996.\n[17] M. Kapralov and K. Talwar. On differentially private low rank approximation. In SODA, 2013.\n[18] D. Kifer, A. Smith, and A. Thakurta. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. In COLT, 2012.\n[19] J. Lei. Differentially private M-estimators. In NIPS 24, 2011.\n[20] A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In ICDE, 2008.\n[21] F. McSherry and K. Talwar. 
Mechanism design via differential privacy. In FOCS, 2007.\n[22] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation.\n[23] B. Rubinstein, P. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for svm learning. Journal of Privacy and Confidentiality, 2012.\n[24] A.D. Sarwate and K. Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE Signal Process. Mag., 2013.\n[25] J. A. Swets and R. M. Pickett. Evaluation of Diagnostic Systems. Methods from Signal Detection Theory. Academic Press, New York, 1982.\n[26] B. A. Turlach. Bandwidth selection in kernel density estimation: A review. In CORE and Institut de Statistique. Citeseer, 1993.\n[27] S. Vinterbo. Differentially private projected histograms: Construction and use for prediction. In ECML, 2012.\n[28] L. Wasserman and S. Zhou. A statistical framework for differential privacy. JASA, 105(489):375\u2013389, 2010.\n[29] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu. Differentially private histogram publication. In ICDE, 2012.\n\n6 Appendix\n\n6.1 An Example to Show Training Stability is not a Direct Consequence of Differential Privacy\n\nWe now present an example to illustrate that training stability is a property of the training algorithm\nand not a direct consequence of differential privacy. We present a problem and two \u03b1-differentially\nprivate training algorithms which approximately optimize the same function; the first algorithm\nis based on exponential mechanism, and the second on a maximum of Laplace random variables\nmechanism. We show that while both provide \u03b1-differential privacy guarantees, the first algorithm\ndoes not satisfy training stability while the second one does.\nLet i \u2208 {1, . . . 
, l}, and let f : X^n \u00d7 R \u2192 [0, 1] be a function such that for all i and all datasets D\nand D' of size n that differ in the value of a single individual, |f(D, i) \u2212 f(D', i)| \u2264 1/n.\nConsider the following training and validation problem. Given a sensitive dataset D, the private\ntraining procedure A outputs a tuple (i\u2217, t1, . . . , tl), where i\u2217 is the output of the \u03b1/2-differentially\nprivate exponential mechanism [21] run to approximately maximize f(D, i), and each ti is equal to\nf(D, i) plus an independent Laplace random variable with standard deviation 2l/(\u03b1n). For any validation\ndataset V, the validation score q((i\u2217, t1, . . . , tl), V) = t_{i\u2217}.\nIt follows from standard results that A is \u03b1-differentially private. Moreover, A can be represented\nby a tuple T_A = (G_A, F_A), where G_A is the following density over sequences of real numbers of\nlength l + 1:\n\nG_A(r0, r1, . . . , rl) = 1(0 \u2264 r0 \u2264 1) \u00b7 (1/2^l) e^{\u2212(|r1|+|r2|+...+|rl|)}\n\nThus G_A is the product of the uniform density on [0, 1] and l standard Laplace densities. Consider\nthe following map E0. For r \u2208 [0, 1], let\n\nE0(r) = i, if [\u03a3_{j<i} e^{n\u03b1f(D,j)/4}] / [\u03a3_j e^{n\u03b1f(D,j)/4}] \u2264 r \u2264 [\u03a3_{j\u2264i} e^{n\u03b1f(D,j)/4}] / [\u03a3_j e^{n\u03b1f(D,j)/4}]\n\nIn other words, E0(r) is the map that converts a random number r drawn from the uniform distribution\non [0, 1] to the \u03b1/2-differentially private exponential mechanism distribution that approximately\nmaximizes f(D, i). Given an (l + 1)-tuple R = (R0, R1, . . . , Rl), F_A is now the following map:\n\nF_A(D, \u03b1, R) = (E0(R0), f(D, 1) + 2lR1/(\u03b1n), f(D, 2) + 2lR2/(\u03b1n), . . . , f(D, l) + 2lRl/(\u03b1n))\n\nLet l = 2 and D and D' be two datasets that differ in the value of a single individual. Suppose it\nis the case that f(D, 1) = 1, f(D, 2) = 1/2 and f(D', 1) = 1 \u2212 1/n, f(D', 2) = 1/2 + 1/n. Observe\nthat for D, the exponential mechanism picks 1 with probability e^{n\u03b1/4} / (e^{n\u03b1/4} + e^{n\u03b1/8}), and 2 with probability\ne^{n\u03b1/8} / (e^{n\u03b1/4} + e^{n\u03b1/8}), whereas for D', it picks 1 with probability\ne^{(n\u22121)\u03b1/4} / (e^{(n\u22121)\u03b1/4} + e^{(n+2)\u03b1/8}) and 2 with probability\ne^{(n+2)\u03b1/8} / (e^{(n\u22121)\u03b1/4} + e^{(n+2)\u03b1/8}). Thus, if R0 lies in the interval\n[e^{(n\u22121)\u03b1/4} / (e^{(n\u22121)\u03b1/4} + e^{(n+2)\u03b1/8}), e^{n\u03b1/4} / (e^{n\u03b1/4} + e^{n\u03b1/8})], then\nF_A(D, \u03b1, R) = t1 whereas F_A(D', \u03b1, R) = t2. When n is large enough, with high probability,\n|t1 \u2212 t2| \u2265 1/3; thus, the training stability condition does not hold for A for \u03b21 = o(n) and\n\n\u03b4 < e^{n\u03b1/8}(e^{\u03b1/2} \u2212 1) / ((e^{n\u03b1/8} + 1)(e^{n\u03b1/8} + e^{\u03b1/2})).\n\nConsider a different algorithm A' which computes t1, . . . , tl first, and then outputs the index i\u2217 that\nmaximizes t_{i\u2217}. Then A' can be represented by a tuple T_{A'} = (G_{A'}, F_{A'}), where G_{A'} is a density\nover sequences of real numbers of length l as follows:\n\nG_{A'}(r1, . . . , rl) = (1/2^l) e^{\u2212(|r1|+...+|rl|)}\n\nand F_{A'} is the map:\n\nF_{A'}(D, \u03b1, R) = (argmax_i (f(D, i) + lRi/(\u03b1n)), f(D, 1) + lR1/(\u03b1n), f(D, 2) + lR2/(\u03b1n), . . . , f(D, l) + lRl/(\u03b1n))\n\nFor the same value of R1, . . . 
, Rl, if i\u2217 = i on input dataset D and if i\u2217 = i' on input dataset D',\nthen, |f(D, i) \u2212 f(D, i')| \u2264 1/n; this implies that\n\n|q(F_{A'}(D, \u03b1, R), V) \u2212 q(F_{A'}(D', \u03b1, R), V)| = |ti \u2212 ti'| = |f(D, i) \u2212 f(D', i')| \u2264 1/n\n\nwith probability 1 over G_{A'}. Thus the training stability condition holds for \u03b21 = 1 and \u03b4 = 0.\n\n6.2 Output Perturbation Algorithm\n\nWe present the output perturbation algorithm for regularized linear classification.\n\nAlgorithm 4 Output Perturbation for Differentially Private Linear Classification\n1: Inputs: Regularization parameter \u03bb, training set T = {(xi, yi), i = 1, . . . , n}, privacy parameter \u03b1.\n2: Let G be the following density over R^d: \u03c1G(r) \u221d e^{\u2212\u2016r\u2016}. Draw R \u223c G.\n3: Solve the convex optimization problem:\n\nw\u2217 = argmin_{w \u2208 R^d} (1/2) \u03bb\u2016w\u2016^2 + (1/n) \u03a3_{i=1}^{n} \u2113(w, xi, yi)    (4)\n\n4: Output w\u2217 + (2/(\u03bb\u03b1n)) R.\n\n6.3 Case Study: Histogram Density Estimation\n\nOur second case study is developing an end-to-end differentially private solution for histogram-based\ndensity estimation. In density estimation, we are given n samples x1, . . . , xn drawn from\nan unknown density f, and our goal is to build an approximation \u02c6f to f. In a histogram density\nestimator, we divide the range of the data into equal-sized bins of width h; if ni out of n of the input\nsamples lie in bin i, then \u02c6f is the density function: \u02c6f(x) = \u03a3_{i=1}^{1/h} (ni/(hn)) \u00b7 1(x \u2208 Bin i).\n\nA critical parameter while constructing the histogram density estimator is the bin size h. There is\nmuch theoretical literature on how to choose h \u2013 see [16, 26] for surveys. 
However, the choice\nof h is usually data-dependent, and in practice, the optimal h is often determined by building a\nhistogram density estimator for a few different values of h, and selecting the one which has the best\nperformance on held-out validation data.\nThe most popular measure to evaluate the quality of a density estimator is the L2-distance or the\nIntegrated Square Error (ISE) between the density estimate and the true density:\n\n\u2016\u02c6f \u2212 f\u2016^2 = \u222b_x (\u02c6f(x) \u2212 f(x))^2 dx = \u222b_x f^2(x) dx + \u222b_x \u02c6f^2(x) dx \u2212 2 \u222b_x f(x) \u02c6f(x) dx    (5)\n\nf is typically unknown, so the ISE cannot be computed exactly. Fortunately it is still possible to\ncompare multiple density estimates based on this distance. The first term in the right hand side of\nEquation 5 depends only on f, and is equal for all \u02c6f. The second term is a function of \u02c6f only and can\nthus be computed. The third term is 2E_{x\u223cf}[\u02c6f(x)], and even though it cannot be computed exactly\nwithout knowledge of f, we can estimate it based on a held out validation dataset. Thus, given a\ndensity estimator \u02c6f and a validation dataset V = {z1, . . . , zm}, we will use the following function\nto evaluate the quality of \u02c6f on V:\n\nq(\u02c6f, V) = \u2212\u222b_x \u02c6f^2(x) dx + (2/m) \u03a3_{i=1}^{m} \u02c6f(zi)    (6)\n\nA higher value of q indicates a smaller distance \u2016\u02c6f \u2212 f\u2016^2, and thus a higher quality density estimate.\nFor other measures, see [6].\nIn the sequel, we assume that the data lies in the interval [0, 1] and that this interval is known in\nadvance. For ease of notation, we also assume without loss of generality that 1/h is an integer. For\nease of exposition, we confine ourselves to one-dimensional data, although the general techniques\ncan be easily extended to higher dimensions. 
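The validation score of Equation 6 can be sketched in a few lines for the (non-private) histogram estimator described above. This is only an illustration under stated assumptions: the data lie in [0, 1], 1/h is an integer, and the helper names `histogram_density`, `validation_score`, and `best_bin_size` are hypothetical, not part of the paper.

```python
import numpy as np


def histogram_density(samples, h):
    """Bin heights n_i / (h * n) of the histogram estimator on [0, 1]."""
    num_bins = int(round(1.0 / h))
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)
    return counts / (h * len(samples)), edges


def validation_score(heights, edges, h, validation):
    """Equation 6: q = -integral(f_hat^2) + (2/m) * sum_i f_hat(z_i)."""
    # f_hat is piecewise constant, so its squared integral is h * sum(height^2).
    integral_sq = h * np.sum(heights ** 2)
    # Locate the bin of each validation point; the point 1.0 goes to the last bin.
    idx = np.searchsorted(edges, np.asarray(validation), side="right") - 1
    idx = np.clip(idx, 0, len(heights) - 1)
    return -integral_sq + 2.0 * np.mean(heights[idx])


def best_bin_size(samples, validation, bin_sizes):
    """Non-private selection: the bin size with the highest score q."""
    scores = []
    for h in bin_sizes:
        heights, edges = histogram_density(samples, h)
        scores.append(validation_score(heights, edges, h, validation))
    return bin_sizes[int(np.argmax(scores))]
```

Picking the maximizer directly, as `best_bin_size` does, is not differentially private; it only illustrates the quantity that the private validation procedure has to protect.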
Given n samples and a bin size h, several works,\nincluding [7, 19, 27, 28, 20, 29, 14], have shown different ways of constructing and sampling from\ndifferentially private histograms. The most basic approach is to construct a non-private histogram\nand then add Laplace noise to each cell, followed by some post-processing. Algorithm 5 presents a\nvariant of a differentially private histogram density estimator due to [19] in our framework.\n\nAlgorithm 5 Differentially Private Histogram Density Estimator\n1: Inputs: Bin size h (such that 1/h is an integer), data T = {x1, . . . , xn}, privacy parameter \u03b1.\n2: for i = 1, . . . , 1/h do\n3: Draw Ri independently from the standard Laplace density: \u03c1G(r) = (1/2) e^{\u2212|r|}.\n4: Let Ii = [(i\u22121)h, ih). Define: ni = \u03a3_{j=1}^{n} 1(xj \u2208 Ii), and let \u02dcni = max(0, ni + 2Ri/\u03b1).\n5: end for\n6: Let \u02dcn = \u03a3_i \u02dcni. Return the density estimator: \u02c6f(x) = \u03a3_{i=1}^{1/h} (\u02dcni/(h\u02dcn)) \u00b7 1(x \u2208 Ii)\n\nThe following theorem shows stability guarantees on the differentially private histogram density\nestimator described in Algorithm 5.\nTheorem 6 (Stability of Private Histogram Density Estimator) Let H = {h1, . . . , hk} be a set\nof bin sizes, and let hmin = min_i hi. For any fixed \u03b4, if the sample size n \u2265 1 + 2 ln(4k/\u03b4)/(\u03b1\u221ahmin), then\nthe validation score q in Equation 6 is (\u03b21, \u03b22, \u03b4/k)-Stable with respect to Algorithm 5 and H for:\n\n\u03b21 = 6/((1 \u2212 \u03bd)hmin), \u03b22 = 2/hmin, where: \u03bd = 2 ln(4k/\u03b4)/(n\u03b1\u221ahmin).\n\n6.4 Proofs of Theorems 1, 2 and 3\n\nWe now present the proofs of Theorems 1, 2 and 3. 
Our proofs involve ideas similar to those in\nthe analysis of the multiplicative weights update method for answering a set of linear queries in a\ndifferentially private manner [13].\nLet A(D) denote the output of Algorithm 1 when the input is a sensitive dataset D = (T, V), where\nT is the training part and V is the validation part. Let D' = (T', V) where T and T' differ in the\nvalue of a single individual, and let D'' = (T, V') where V and V' differ in the value of a single\nindividual. The proof of Theorem 1 is a consequence of the following two lemmas.\nLemma 1 Suppose that the conditions in Theorem 1 hold. Then, for all D = (T, V), all D' =\n(T', V), such that T and T' differ in the value of a single individual, and for any set of outcomes S:\n\nPr(A(D) \u2208 S) \u2264 e^{\u03b12} Pr(A(D') \u2208 S) + \u03b4    (7)\n\nLemma 2 Suppose that the conditions in Theorem 1 hold. Then, for all D = (T, V), all D'' =\n(T, V') such that V and V' differ in the value of a single individual, and for any set of outcomes S,\n\nPr(A(D) \u2208 S) \u2264 e^{\u03b12} Pr(A(D'') \u2208 S) + \u03b4    (8)\n\nPROOF: (Of Lemma 1) Let S = (I, C), where I \u2286 [k] is a set of indices and C \u2286 C. Let E be the\nevent that all of R1, . . . , Rk lie in the set \u03a3. We will first show that conditioned on E, for all i, it\nholds that:\n\nPr(i\u2217 = i|D, E) \u2264 e^{\u03b12} Pr(i\u2217 = i|D', E)    (9)\n\nSince Pr(E) \u2265 1 \u2212 \u03b4, from the conditions in Theorem 1, for any subset I of indices, we can write:\n\nPr(i\u2217 \u2208 I|D) \u2264 Pr(i\u2217 \u2208 I|D, E) Pr(E) + (1 \u2212 Pr(E))\n\u2264 e^{\u03b12} Pr(i\u2217 \u2208 I|D', E) Pr(E) + \u03b4\n\u2264 e^{\u03b12} Pr(i\u2217 \u2208 I, E|D') + \u03b4\n\u2264 e^{\u03b12} Pr(i\u2217 \u2208 I|D') + \u03b4    (10)\n\nWe will now prove Equation 9. 
For this purpose, we adopt the following notation. We use the\nnotation Z\\i to denote the random variables Z1, . . . , Zi\u22121, Zi+1, . . . , Zk and z\\i to denote the set of\nvalues z1, . . . , zi\u22121, zi+1, . . . , zk. We also use the notation h(\u00b7) to represent the density induced on\nthe random variables Z1, . . . , Zk by Algorithm 1. In addition, we use the notation R to denote the\nvector (R1, . . . , Rk). We first fix a value z\\i for Z\\i, and a value of R such that R1, . . . , Rk all lie\nin \u03a3, and consider the ratio of probabilities:\n\nPr(i\u2217 = i|Z\\i = z\\i, D, R) / Pr(i\u2217 = i|Z\\i = z\\i, D', R)\n\nObserve that this ratio of probabilities is equal to:\n\n[Pr(Zi + q(F(T, \u03b8i, \u03b11, Ri), V) \u2265 sup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V))]\n/ [Pr(Zi + q(F(T', \u03b8i, \u03b11, Ri), V) \u2265 sup_{j\u2260i} zj + q(F(T', \u03b8j, \u03b11, Rj), V))]\n\nwhich is in turn equal to:\n\n[Pr(Zi \u2265 sup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T, \u03b8i, \u03b11, Ri), V))]\n/ [Pr(Zi \u2265 sup_{j\u2260i} zj + q(F(T', \u03b8j, \u03b11, Rj), V) \u2212 q(F(T', \u03b8i, \u03b11, Ri), V))]\n\nObserve that from the stability condition,\n\n|(q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T, \u03b8i, \u03b11, Ri), V)) \u2212 (q(F(T', \u03b8j, \u03b11, Rj), V) \u2212 q(F(T', \u03b8i, \u03b11, Ri), V))|\n\u2264 |q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T', \u03b8j, \u03b11, Rj), V)| + |q(F(T, \u03b8i, \u03b11, Ri), V) \u2212 q(F(T', \u03b8i, \u03b11, Ri), V)|\n\u2264 2\u03b21/n \u2264 2\u03b2\n\nThus, the ratio of the probabilities is at most the ratio Pr(Zi \u2265 \u03b3)/ Pr(Zi \u2265 \u03b3 + 2\u03b2) where\n\u03b3 = sup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T, \u03b8i, \u03b11, Ri), V), which is at most e^{\u03b12} by properties\nof the exponential distribution. 
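The property of the exponential distribution invoked here can be written out as a short worked step. As a sketch only, assume that each Z_i is exponentially distributed with mean 2\beta/\alpha_2; this scale is an assumption made for illustration, since Algorithm 1 is not restated in this appendix.

```latex
% Assumption (illustrative): Z_i ~ Exponential with mean 2\beta/\alpha_2.
\Pr(Z_i \ge \gamma) = e^{-\alpha_2 \gamma / (2\beta)} \quad \text{for } \gamma \ge 0,
\qquad\text{so}\qquad
\frac{\Pr(Z_i \ge \gamma)}{\Pr(Z_i \ge \gamma + 2\beta)}
  = \frac{e^{-\alpha_2 \gamma / (2\beta)}}{e^{-\alpha_2 (\gamma + 2\beta) / (2\beta)}}
  = e^{\alpha_2}.
% For \gamma < 0 the numerator equals 1 and the denominator is at least
% e^{-\alpha_2}, so the ratio is still at most e^{\alpha_2}.
```

Under that assumption the bound holds with equality for nonnegative \gamma and as an inequality otherwise, which is all the argument uses.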
Thus, we have established that for all z\\i, for all R in \u03a3^k,\n\nPr(i\u2217 = i|Z\\i = z\\i, D, R) \u2264 e^{\u03b12} \u00b7 Pr(i\u2217 = i|Z\\i = z\\i, D', R)\n\nEquation 9 follows by integrating over z\\i and R. The lemma follows. \u25a1\nPROOF: (Of Lemma 2) Let S = (I, C), where I \u2286 [k] is a set of indices and C \u2286 C. Let E be the\nevent that all of R1, . . . , Rk lie in \u03a3. We will first show that conditioned on E, for all i, it holds that:\n\nPr(i\u2217 = i|D, E) \u2264 e^{\u03b12} Pr(i\u2217 = i|D'', E)    (11)\n\nSince Pr(E) \u2265 1 \u2212 \u03b4, from the conditions in Theorem 1, for any subset I of indices, we can write:\n\nPr(i\u2217 \u2208 I|D) \u2264 Pr(i\u2217 \u2208 I|D, E) Pr(E) + (1 \u2212 Pr(E))\n\u2264 e^{\u03b12} Pr(i\u2217 \u2208 I|D'', E) Pr(E) + \u03b4\n\u2264 e^{\u03b12} Pr(i\u2217 \u2208 I, E|D'') + \u03b4\n\u2264 e^{\u03b12} Pr(i\u2217 \u2208 I|D'') + \u03b4    (12)\n\nWe will now focus on showing Equation 11. We first consider the case when event E holds, that is,\nRj \u2208 \u03a3, for j = 1, . . . , k. In this case, the stability definition and the conditions of the theorem\nimply that for all \u03b8j \u2208 \u0398,\n\n|q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T, \u03b8j, \u03b11, Rj), V')| \u2264 \u03b22/m \u2264 \u03b2    (13)\n\nIn what follows, we use the notation Z\\i to denote the random variables Z1, . . . , Zi\u22121, Zi+1, . . . , Zk\nand z\\i to denote the set of values z1, . . . , zi\u22121, zi+1, . . . , zk. We also use the notation h(\u00b7) to\nrepresent the density induced on the random variables Z1, . . . , Zk by Algorithm 1. In addition, we\nuse the notation R to denote the vector (R1, . . . , Rk). 
We first fix a value z\\i for Z\\i, and a value of\nR such that E holds, and consider the ratio of probabilities:\n\nPr(i\u2217 = i|Z\\i = z\\i, D, R) / Pr(i\u2217 = i|Z\\i = z\\i, D'', R)\n\nObserve that this ratio of probabilities is equal to:\n\n[Pr(Zi + q(F(T, \u03b8i, \u03b11, Ri), V) \u2265 sup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V))]\n/ [Pr(Zi + q(F(T, \u03b8i, \u03b11, Ri), V') \u2265 sup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V'))]\n\nwhich is in turn equal to:\n\n[Pr(Zi \u2265 sup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T, \u03b8i, \u03b11, Ri), V))]\n/ [Pr(Zi \u2265 sup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V') \u2212 q(F(T, \u03b8i, \u03b11, Ri), V'))]\n\nObserve that from Equation 13,\n\n|(q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T, \u03b8i, \u03b11, Ri), V)) \u2212 (q(F(T, \u03b8j, \u03b11, Rj), V') \u2212 q(F(T, \u03b8i, \u03b11, Ri), V'))| \u2264 2\u03b22/m \u2264 2\u03b2\n\nThus, the ratio of the probabilities is at most the ratio Pr(Zi \u2265 \u03b3)/ Pr(Zi \u2265 \u03b3 + 2\u03b2) for \u03b3 =\nsup_{j\u2260i} zj + q(F(T, \u03b8j, \u03b11, Rj), V) \u2212 q(F(T, \u03b8i, \u03b11, Ri), V), which is at most e^{\u03b12} by properties of\nthe exponential distribution. Thus, we have established that when R \u2208 \u03a3^k, for all i,\n\nPr(i\u2217 = i|Z\\i = z\\i, D, R) / Pr(i\u2217 = i|Z\\i = z\\i, D'', R) \u2264 e^{\u03b12}\n\nThus for any such R, we can write:\n\nPr(i\u2217 = i|D, R) / Pr(i\u2217 = i|D'', R)\n= [\u222b_{z\\i} Pr(i\u2217 = i|Z\\i = z\\i, D, R) h(z\\i) dz\\i] / [\u222b_{z\\i} Pr(i\u2217 = i|Z\\i = z\\i, D'', R) h(z\\i) dz\\i] \u2264 e^{\u03b12}\n\nEquation 11 now follows by integrating R over E. \u25a1\n\nPROOF: (Of Theorem 1) The proof of Theorem 1 follows from a combination of Lemmas 1 and 2. \u25a1\n\nPROOF: (Of Theorem 2) The proof of Theorem 2 follows from privacy composition; Theorem 1\nensures that Step (2) of Algorithm 2 is (\u03b12, \u03b4)-differentially private; moreover the training procedure\nT is \u03b11-differentially private. The theorem follows by composing these two results. \u25a1\n\nPROOF: (Of Theorem 3) Observe that:\n\nPr(q(h_{i\u2217}, V) < max_{1\u2264i\u2264k} q(hi, V) \u2212 2\u03b2 log(k/\u03b40)/\u03b12) \u2264 Pr(\u2203j s.t. Zj \u2265 log(k/\u03b40)/\u03b12)\n\nBy properties of the exponential distribution, for any fixed j, Pr(Zj \u2265 log(k/\u03b40)/\u03b12) \u2264 \u03b40/k. Thus the\ntheorem follows by a union bound. \u25a1\n\n6.5 Proof of Theorem 4\nPROOF: (Of Theorem 4 for Output Perturbation) Let T and T' be two training sets which differ in\na single labelled example ((xn, yn) vs. (x'n, y'n)), and let w\u2217(T) and w\u2217(T') be the solutions to the\nregularized convex optimization problem in Equation 1 when the inputs are T and T' respectively.\nWe observe that for fixed \u03bb, \u03b1 and R,\n\nF(T, \u03bb, \u03b1, R) \u2212 F(T', \u03bb, \u03b1, R) = w\u2217(T) \u2212 w\u2217(T')\n\nWhen the training sets are T and T', the objective functions in the regularized convex optimization\nproblems are both \u03bb-strongly convex, and they differ by (1/n)(\u2113(w, xn, yn) \u2212 \u2113(w, x'n, y'n)). Combining\nthis fact with Lemma 1 of [4], and using the fact that \u2113 is 1-Lipschitz, we have that for all \u03bb and R,\n\n\u2016F(T, \u03bb, \u03b1, R) \u2212 F(T', \u03bb, \u03b1, R)\u2016 \u2264 2/(\u03bbn)\n\nSince g is L-Lipschitz, this implies that for any fixed validation set V, and for all \u03bb, \u03b1 and R,\n\n|q(F(T, \u03bb, \u03b1, R), V) \u2212 q(F(T', \u03bb, \u03b1, R), V)| \u2264 2L/(\u03bbn)    (14)\n\nNow let V and V' be two validation sets that differ in the value of a single labelled example\n(\u00afxm, \u00afym). Since g \u2265 0 for all inputs, for any such V and V', and for a fixed \u03bb, \u03b1 and R,\n|q(F(T, \u03bb, \u03b1, R), V) \u2212 q(F(T, \u03bb, \u03b1, R), V')| \u2264 gmax/m, where\n\ngmax = sup_{(x,y)\u2208X} g(F(T, \u03bb, \u03b1, R), x, y)\n\nBy definition, gmax \u2264 g\u2217. Moreover, as g is L-Lipschitz,\n\ngmax \u2264 L \u00b7 \u2016F(T, \u03bb, \u03b1, R)\u2016\n\nNow, let E be the event that \u2016R\u2016 \u2264 d log(dk/\u03b4). From Lemma 4 of [4], Pr(E) \u2265 1 \u2212 \u03b4/k. Thus,\nprovided E holds, we have that:\n\n\u2016F(T, \u03bb, \u03b1, R)\u2016 \u2264 \u2016w\u2217\u2016 + d log(dk/\u03b4)/(\u03bb\u03b1n) \u2264 1/\u03bb + d log(dk/\u03b4)/(\u03bb\u03b1n) = (1/\u03bb)(1 + d log(dk/\u03b4)/(n\u03b1))\n\nwhere the bound on \u2016w\u2217\u2016 follows from an application of Lemma 1 of [4] on the functions (1/2)\u03bb\u2016w\u2016^2\nand (1/2)\u03bb\u2016w\u2016^2 + (1/n) \u03a3_{i=1}^{n} \u2113(w, xi, yi). This implies that provided E holds, for all training sets T, and\nfor all \u03bb,\n\n|q(F(T, \u03bb, \u03b1, R), V) \u2212 q(F(T, \u03bb, \u03b1, R), V')| \u2264 (L/(\u03bbm))(1 + d log(dk/\u03b4)/(n\u03b1))    (15)\n\nThe theorem now follows from a combination of Equations 14 and 15, and the definition of g\u2217. \u25a1\nPROOF: (Of Theorem 4 for Objective Perturbation) Let T and T' be two training sets which differ in\na single labelled example (xn, yn). We observe that for a fixed R and \u03bb, the objective of the regularized\nconvex optimization problem in Equation 2 differs in the term (1/n)(\u2113(w, xn, yn) \u2212 \u2113(w, x'n, y'n)).\nCombining this with Lemma 1 of [4], and using the fact that \u2113 is 1-Lipschitz, we have that for all \u03bb,\n\u03b1, R,\n\n\u2016F(T, \u03bb, \u03b1, R) \u2212 F(T', \u03bb, \u03b1, R)\u2016 \u2264 2/(\u03bbn)\n\nSince g is L-Lipschitz, this implies that for any fixed validation set V, and for all \u03bb and R,\n\n|q(F(T, \u03bb, \u03b1, R), V) \u2212 q(F(T', \u03bb, \u03b1, R), V)| \u2264 2L/(\u03bbn)    (16)\n\nNow let V and V' be two validation sets that differ in the value of a single labelled example\n(\u00afxm, \u00afym). Since g \u2265 0, for any such V and V', |q(F(T, \u03bb, \u03b1, R), V) \u2212 q(F(T, \u03bb, \u03b1, R), V')| \u2264\ngmax/m, where\n\ngmax = sup_{(x,y)\u2208X} g(F(T, \u03bb, \u03b1, R), x, y)\n\nBy definition gmax \u2264 g\u2217. Moreover, as g is L-Lipschitz,\n\ngmax \u2264 L \u00b7 \u2016F(T, \u03bb, \u03b1, R)\u2016\n\nLet E be the event that \u2016R\u2016 \u2264 d log(dk/\u03b4). From Lemma 4 of [4], Pr(E) \u2265 1 \u2212 \u03b4/k. 
Thus,\nprovided E holds, we have that:\n\n\u2016F(T, \u03bb, \u03b1, R)\u2016 \u2264 (1 + \u2016R\u2016/(\u03b1n))/\u03bb \u2264 (1/\u03bb)(1 + d log(dk/\u03b4)/(n\u03b1))\n\nThis implies that provided E holds, for all training sets T, and for all \u03bb,\n\n|q(F(T, \u03bb, \u03b1, R), V) \u2212 q(F(T, \u03bb, \u03b1, R), V')| \u2264 (L/(\u03bbm))(1 + d log(dk/\u03b4)/(n\u03b1))    (17)\n\nThe theorem now follows from a combination of Equations 16 and 17, and the definition of g\u2217. \u25a1\n\n6.6 Proof of Theorem 6\nLemma 3 (Concentration of Sum of Laplace Random Variables) Let Z1, . . . , Zs be s \u2265 2 iid\nstandard Laplace random variables, and let Z = Z1 + . . . + Zs. Then, for any \u03b8,\n\nPr(Z \u2265 \u03b8) \u2264 (1 \u2212 1/s)^{\u2212s} e^{\u2212\u03b8/\u221as} \u2264 4e^{\u2212\u03b8/\u221as}\n\nPROOF: The proof follows from using the method of generating functions. The generating function\nfor the standard Laplace distribution is: \u03c8(X) = E[e^{tX}] = 1/(1 \u2212 t^2), for |t| < 1. As Z1, . . . , Zs are\nindependently distributed, the generating function for Z is E[e^{tZ}] = (1 \u2212 t^2)^{\u2212s}. Now, we can write:\n\nPr(Z \u2265 \u03b8) = Pr(e^{tZ} \u2265 e^{t\u03b8}) \u2264 E[e^{tZ}]/e^{t\u03b8} = e^{\u2212t\u03b8} \u00b7 (1 \u2212 t^2)^{\u2212s}\n\nPlugging in t = 1/\u221as, we get that:\n\nPr(Z \u2265 \u03b8) \u2264 (1 \u2212 1/s)^{\u2212s} e^{\u2212\u03b8/\u221as}\n\nThe lemma follows by observing that for s \u2265 2, (1 \u2212 1/s)^s \u2265 1/4. \u25a1\nPROOF: (Of Theorem 6) Let V = {z1, . . . , zm} be a validation dataset, and let V' be a validation\ndataset that differs from V in a single sample (zm vs z'm). We use the notation R to denote\nthe sequence of values R = (R1, R2, . . . , R_{1/h}). 
Given an input sample T , a bin size h, a pri-\nvacy parameter \u03b1, and a sequence R, we use the notation \u02c6fT,h,\u03b1,R to denote the density estimator\nF (T, h, \u03b1, R). For all such T , all h, all \u03b1 and all R, we can write:\n\n|q(F (T, h, \u03b1, R), V ) \u2212 q(F (T, h, \u03b1, R), V (cid:48))| =\n\n2\nm\n\u2264 2\nm\n\n( \u02c6fT,h,\u03b1,R(zm) \u2212 \u02c6fT,h,\u03b1,R(z(cid:48)\n\u00b7 maxi \u02dcni\n\n\u2264 2\nmh\n\nh\u02dcn\n\nm))\n\n(18)\n\nFor a \ufb01xed value of h, we de\ufb01ne the following event E:\n\nRi \u2265 \u2212 ln(4k/\u03b4)\u221a\n\nh\n\n1/h(cid:88)\n\ni=1\n\n1/h(cid:88)\n\ni=1\n\nUsing the symmetry of Laplace random variables and Lemma 3, we get that Pr(E) \u2265 1 \u2212 \u03b4/k. We\nobserve that provided the event E holds,\n\n\u02dcn \u2265 n \u2212\n\nRi \u2265 n \u2212 2 ln(4k/\u03b4)\n\n\u221a\n\n\u03b1\n\nh\n\n\u2265 n(1 \u2212 \u03bd)\n\n(19)\n\nLet T and T (cid:48) be two input datasets that differ in a single sample (xn vs x(cid:48)\nvalue of \u03b1, and a sequence R, and for these \ufb01xed values, we use the notation \u02dcni and \u02dcn(cid:48)\n\nvalue of \u02dcni in Algorithm 5 when the inputs are T and T (cid:48) respectively. Similarly, we use \u02dcn =(cid:80)\nand \u02dcn(cid:48) =(cid:80)\n\nn). We \ufb01x a bin size h, a\ni to denote the\ni \u02dcni\n\ni \u02dcn(cid:48)\ni.\n\nFor any V , we can write:\n\nq(F (T, h, \u03b1, R), V ) \u2212 q(F (T (cid:48), h, \u03b1, R), V ) =\n\n\u2212\n\nm(cid:88)\n1/h(cid:88)\n\n2\nm\n\nj=1\n\nh \u00b7\n\ni=1\n\n( \u02c6fT,h,\u03b1,R(zj) \u2212 \u02c6fT (cid:48),h,\u03b1,R(zj))\n(cid:18) \u02dcn2\nh2 \u02dcn2 \u2212 \u02dcn(cid:48)2\nh2 \u02dcn(cid:48)2\n\n(cid:19)\n\ni\n\ni\n\n(20)\n\nWe now look at bounding the right hand side of Equation 20 term by term. Suppose T (cid:48) is obtained\nrom T by moving a single sample xn from bin a to bin b in the histogram. Then, depending on the\nrelative values of \u02dcna and \u02dcnb, there are four cases:\n\n16\n\n\f1. 
$\\tilde{n}'_a = \\tilde{n}_a - 1$, $\\tilde{n}'_b = \\tilde{n}_b + 1$. Thus $\\tilde{n}' = \\tilde{n}$.\n2. $\\tilde{n}'_a = \\tilde{n}_a = 0$, $\\tilde{n}'_b = \\tilde{n}_b + 1$. Thus $\\tilde{n}' = \\tilde{n} + 1$.\n3. $\\tilde{n}'_a = \\tilde{n}_a - 1$, $\\tilde{n}'_b = \\tilde{n}_b = 0$. Thus $\\tilde{n}' = \\tilde{n} - 1$.\n4. $\\tilde{n}'_a = \\tilde{n}_a = 0$, $\\tilde{n}'_b = \\tilde{n}_b = 0$. Thus $\\tilde{n}' = \\tilde{n}$.\n\nIn the fourth case, $\\hat{f}_{T,h,\\alpha,R} = \\hat{f}_{T',h,\\alpha,R}$, and thus the right hand side of Equation 20 is 0. Moreover,\nthe second and the third cases are symmetric. We thus focus on the first two cases.\nIn the first case, the first term in the right hand side of Equation 20 can be written as:\n\n$$\\left|\\frac{2}{m} \\cdot \\sum_{j=1}^{m} \\sum_{i=1}^{1/h} 1(z_j \\in I_i) \\cdot \\left(\\frac{\\tilde{n}_i}{h\\tilde{n}} - \\frac{\\tilde{n}'_i}{h\\tilde{n}'}\\right)\\right| = \\left|\\frac{2}{m} \\cdot \\sum_{j=1}^{m} \\sum_{i=1}^{1/h} 1(z_j \\in I_i) \\cdot \\frac{\\tilde{n}_i - \\tilde{n}'_i}{h\\tilde{n}}\\right| \\le \\frac{2}{m} \\cdot m \\cdot \\frac{1}{h\\tilde{n}} \\le \\frac{2}{h\\tilde{n}}$$\n\nThe second term on the right hand side of Equation 20 can be written as:\n\n$$\\left|\\sum_{i=1}^{1/h} \\left(\\frac{\\tilde{n}_i^2}{h\\tilde{n}^2} - \\frac{\\tilde{n}_i'^2}{h\\tilde{n}'^2}\\right)\\right| = \\left|\\frac{\\tilde{n}_a^2 + \\tilde{n}_b^2 - (\\tilde{n}_a - 1)^2 - (\\tilde{n}_b + 1)^2}{h\\tilde{n}^2}\\right| = \\left|\\frac{2\\tilde{n}_a - 2\\tilde{n}_b - 2}{h\\tilde{n}^2}\\right| \\le \\frac{2}{h\\tilde{n}}$$\n\nwhere the last step follows from the fact that $\\tilde{n}'_b = \\tilde{n}_b + 1 \\le \\tilde{n}$. Thus, for the first case, the right\nhand side of Equation 20 is at most $\\frac{4}{h\\tilde{n}}$.\nWe now consider the second case. The first term on the right hand side of Equation 20 can be written\nas:\n\n$$\\left|\\frac{2}{m} \\cdot \\sum_{j=1}^{m} \\sum_{i=1}^{1/h} 1(z_j \\in I_i) \\cdot \\left(\\frac{\\tilde{n}_i}{h\\tilde{n}} - \\frac{\\tilde{n}'_i}{h\\tilde{n}'}\\right)\\right| = \\left|\\frac{2}{mh} \\cdot \\sum_{j=1}^{m} \\sum_{i=1}^{1/h} 1(z_j \\in I_i) \\cdot \\left(\\frac{\\tilde{n}_i}{\\tilde{n}} - \\frac{\\tilde{n}'_i}{\\tilde{n} + 1}\\right)\\right|$$\n\n$$\\le \\frac{2}{hm} \\cdot m \\cdot \\frac{1}{\\tilde{n}(\\tilde{n} + 1)} \\cdot \\max\\left(|\\tilde{n}_i(\\tilde{n} + 1) - \\tilde{n}_i\\tilde{n}|, |\\tilde{n}_i(\\tilde{n} + 1) - \\tilde{n}(\\tilde{n}_i + 1)|\\right)$$\n\n$$\\le \\frac{2}{h} \\cdot \\frac{1}{\\tilde{n}(\\tilde{n} + 1)} \\cdot \\max(|\\tilde{n}_i|, |\\tilde{n} - \\tilde{n}_i|) \\le \\frac{2}{h(\\tilde{n} + 1)}$$\n\nwhere the last step follows from the fact that $\\max(|\\tilde{n}_i|, |\\tilde{n} - \\tilde{n}_i|) \\le \\tilde{n}$. The second term on the right\nhand side of Equation 20 can be written as:\n\n$$\\left|\\sum_{i=1}^{1/h} \\left(\\frac{\\tilde{n}_i^2}{h\\tilde{n}^2} - \\frac{\\tilde{n}_i'^2}{h\\tilde{n}'^2}\\right)\\right| = \\left|\\sum_{i \\ne b} \\left(\\frac{\\tilde{n}_i^2}{h\\tilde{n}^2} - \\frac{\\tilde{n}_i^2}{h(\\tilde{n} + 1)^2}\\right) + \\frac{\\tilde{n}_b^2}{h\\tilde{n}^2} - \\frac{(\\tilde{n}_b + 1)^2}{h(\\tilde{n} + 1)^2}\\right|$$\n\n$$\\le \\frac{2\\tilde{n} + 1}{h\\tilde{n}^2(\\tilde{n} + 1)^2} \\cdot \\sum_{i \\ne b} \\tilde{n}_i^2 + \\left|\\frac{(\\tilde{n}_b - \\tilde{n})(2\\tilde{n}_b\\tilde{n} + \\tilde{n} + \\tilde{n}_b)}{h\\tilde{n}^2(\\tilde{n} + 1)^2}\\right|$$\n\n$$\\le \\frac{2\\tilde{n} + 1}{h(\\tilde{n} + 1)^2} + \\frac{\\tilde{n} \\cdot 2\\tilde{n}(\\tilde{n} + 1)}{h\\tilde{n}^2(\\tilde{n} + 1)^2} \\le \\frac{4}{h(\\tilde{n} + 1)}$$\n\nThus, in the second case, the right hand side of Equation 20 is at most $\\frac{6}{h(\\tilde{n} + 1)}$. We observe that the\nthird case is symmetric to the second case, and thus we can carry out very similar calculations in\nthe third case to show that the right hand side is at most $\\frac{6}{h\\tilde{n}}$. Thus, we have that for any T and T',\nprovided the event E holds,\n\n$$|q(F(T, h, \\alpha, R), V) - q(F(T', h, \\alpha, R), V)| \\le \\frac{6}{h\\tilde{n}} \\tag{21}$$\n\nThe theorem now follows by combining Equation 21 with Equation 19. \u25a1\n\n6.7 Proof of Theorem 5\nLemma 4 (Parallel construction) Let $A = \\{A_1, A_2, \\ldots, A_k\\}$ be a list of k independently randomized\nfunctions, and let $A_i$ be $\\alpha_i$-differentially private. Let $\\{D_1, D_2, \\ldots, D_k\\}$ be k subsets of a set\nD such that $i \\ne j \\implies D_i \\cap D_j = \\emptyset$. Algorithm $B(D, A) = (A_1(D_1), A_2(D_2), \\ldots, A_k(D_k))$ is\n$\\max_{1 \\le i \\le k} \\alpha_i$-differentially private.\nPROOF: Let D, D' be two datasets such that their symmetric difference contains one element. We\nhave that\n\n$$\\frac{P(B(D, A) \\in S)}{P(B(D', A) \\in S)} = \\frac{P(B(D, A) \\in S_1 \\times \\cdots \\times S_k)}{P(B(D', A) \\in S_1 \\times \\cdots \\times S_k)} = \\frac{P(A_1(D_1) \\in S_1) \\cdots P(A_k(D_k) \\in S_k)}{P(A_1(D'_1) \\in S_1) \\cdots P(A_k(D'_k) \\in S_k)} \\tag{22}$$\n\nby independence of randomness in the $A_i$. Since $i \\ne j \\implies D_i \\cap D_j = \\emptyset$, there exists at most one\nindex j such that $D_j \\ne D'_j$. If j does not exist, (22) reduces to $e^0 \\le e^{\\max_{1 \\le i \\le k} \\alpha_i}$. Let j exist, then\n\n$$\\frac{P(B(D, A) \\in S)}{P(B(D', A) \\in S)} = \\frac{P(A_j(D_j) \\in S_j)}{P(A_j(D'_j) \\in S_j)} \\le e^{\\alpha_j} \\le e^{\\max_{1 \\le i \\le k} \\alpha_i},$$\n\nwhich concludes the proof. \u25a1\n\nPROOF: (Theorem 5) We begin by separating task (a) of producing the $f_i$ in step 1. from the task\n(b) of computing $e_i$ in step 2. 
and selecting $i^*$ in step 3.\nFrom the parallel construction Lemma 4 it follows that (a) in dataSplit is \u03b1-differentially private.\nFrom standard composition of privacy it follows that (a) in alphaSplit is \u03b1-differentially private.\nIn both alphaSplit and dataSplit, task (b) is an application of the exponential mechanism [21],\nwhich, when choosing with a probability proportional to $e^{\\epsilon(-e_i)}$, yields 2\u03b5\u2206-differential privacy, where\n\u2206 is the sensitivity of $e_i$. Since a single change in V can change the number of errors any fixed\nclassifier can make by at most 1 = \u2206, we get that task (b) is \u03b1-differentially private for \u03b5 = \u03b1/2.\nIf T and V are disjoint, we get by parallel construction that both alphaSplit and dataSplit yield\n\u03b1-differential privacy. If T and V are not disjoint, by standard composition of privacy we get that\nboth alphaSplit and dataSplit yield 2\u03b1-differential privacy.\nIn Random, the results of step 2. in task (b) are never used in step 3. Step 3 is done without looking\nat the input data and does not incur loss of differential privacy. We can therefore simulate Random\nby first choosing $i^*$ uniformly at random, and then computing $f_i$ at \u03b1-differential privacy, which by\nstandard privacy composition is \u03b1-differentially private. \u25a1\n\n6.8 Experimental selection of regularizer index\n\nFigure 2: A summary of 10 times 10-fold cross-validation selection of regularizer index i into \u0398\nfor different privacy levels \u03b1. Each point in the figure represents a summary of 100 data points.\nThe error bars indicate a bootstrap sample estimate of the 95% confidence interval of the mean. 
A\nsmall amount of jitter was added to positions on the x-axes to avoid over-plotting.\n\n[Figure 2 plot: panels Adult and Magic; x-axis alpha (0.3, 0.5, 1.0, 2.0, 3.0, 5.0); legend: Stability, alphaSplit, dataSplit, Random, Control]", "award": [], "sourceid": 1247, "authors": [{"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": "UC San Diego"}, {"given_name": "Staal", "family_name": "Vinterbo", "institution": "UC San Diego"}]}