{"title": "Nearly Optimal Private LASSO", "book": "Advances in Neural Information Processing Systems", "page_first": 3025, "page_last": 3033, "abstract": "We present a nearly optimal differentially private version of the well-known LASSO estimator. Our algorithm provides privacy protection with respect to each training data item. The excess risk of our algorithm, compared to the non-private version, is $\\widetilde{O}(1/n^{2/3})$, assuming all the input data has bounded $\\ell_\\infty$ norm. This is the first differentially private algorithm that achieves such a bound without polynomial dependence on $p$ and without additional assumptions on the design matrix. In addition, we show that this error bound is nearly optimal amongst all differentially private algorithms.", "full_text": "Nearly-Optimal Private LASSO∗

Kunal Talwar
Google Research
kunal@google.com

Abhradeep Thakurta
(Previously) Yahoo! Labs
guhathakurta.abhradeep@gmail.com

Li Zhang
Google Research
liqzhang@google.com

Abstract

We present a nearly optimal differentially private version of the well-known LASSO estimator. Our algorithm provides privacy protection with respect to each training example. The excess risk of our algorithm, compared to the non-private version, is Õ(1/n^{2/3}), assuming all the input data has bounded ℓ∞ norm. This is the first differentially private algorithm that achieves such a bound without polynomial dependence on p and without additional assumptions on the design matrix. In addition, we show that this error bound is nearly optimal amongst all differentially private algorithms.

1 Introduction

A common task in supervised learning is to select the model that best fits the data.
This is frequently achieved by selecting a loss function that associates a real-valued loss with each data point d and model θ, and then selecting, from a class of admissible models, the model θ that minimizes the average loss over all data points in the training set. This procedure is commonly referred to as Empirical Risk Minimization (ERM).

The availability of large datasets containing sensitive information from individuals has motivated the study of learning algorithms that guarantee the privacy of individuals contributing to the database. A rigorous and by-now standard privacy guarantee is via the notion of differential privacy. In this work, we study the design of differentially private algorithms for Empirical Risk Minimization, continuing a long line of work. (See [2] for a survey.)

In particular, we study adding privacy protection to the classical LASSO estimator, which has been widely used and analyzed. We first present a differentially private optimization algorithm for the LASSO estimator. The algorithm is a combination of the classical Frank-Wolfe algorithm [15] and the exponential mechanism for guaranteeing privacy [21]. We then show that our algorithm achieves nearly optimal risk among all differentially private algorithms. The lower-bound proof relies on recently developed techniques with roots in cryptography [4, 14].

Consider the training dataset D consisting of n pairs of data d_i = (x_i, y_i), where x_i ∈ R^p, usually called the feature vector, and y_i ∈ R, the prediction. The LASSO estimator, or sparse linear regression, solves for θ* = argmin_θ (1/n) Σ_i L(θ; d_i), where L(θ; d_i) = (x_i · θ − y_i)², subject to ||θ||_1 ≤ c. To simplify presentation, we assume c = 1, but our results directly extend to general c. The ℓ1 constraint tends to induce a sparse θ*, so it is widely used in the high-dimensional setting when p ≫ n.
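For concreteness, the constrained objective just defined can be written in a few lines of NumPy. This is an illustrative sketch; the function names are ours, not the paper's:

```python
import numpy as np

def lasso_loss(theta, X, y):
    """Mean squared loss (1/n) * sum_i (x_i . theta - y_i)^2."""
    return float(np.mean((X @ theta - y) ** 2))

def is_feasible(theta, c=1.0):
    """Check the LASSO constraint ||theta||_1 <= c."""
    return float(np.abs(theta).sum()) <= c + 1e-12

# Tiny example: two data points in R^2.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
theta = np.array([1.0, 0.0])  # feasible: ||theta||_1 = 1
```

Here θ = (1, 0) fits the toy data exactly, while θ = 0 incurs the average of the squared labels.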
Here, we will study approximating the LASSO estimator with the minimum possible error while protecting the privacy of each individual d_i. Below we define the setting more formally.

∗Part of this work was done at Microsoft Research Silicon Valley Campus.

Problem definition: Given a data set D = {d_1, ···, d_n} of n samples from a domain D, a constraint set C ⊆ R^p, and a loss function L : C × D → R, for any model θ, define its excess empirical risk as

R(θ; D) := (1/n) Σ_{i=1}^n L(θ; d_i) − min_{θ′∈C} (1/n) Σ_{i=1}^n L(θ′; d_i).    (1)

For LASSO, the constraint set is the ℓ1 ball, and the loss is the quadratic loss function. We define the risk of a mechanism A on a data set D as R(A; D) = E[R(A(D); D)], where the expectation is over the internal randomness of A, and the risk R(A) = max_{D∈D^n} R(A; D) is the maximum risk over all possible data sets. Our objective is then to design a mechanism A which preserves (ε, δ)-differential privacy (Definition 1.3) and achieves as low a risk as possible. We call the minimum achievable risk the privacy risk, defined as min_A R(A), where the min is over all (ε, δ)-differentially private mechanisms A.

There has been much work on studying the privacy risk for the LASSO estimator. However, all the previous results either make strong assumptions about the input data or have a polynomial dependence on the dimension p. First [20] and then [24] studied the LASSO estimator with a differential privacy guarantee. They showed that one can avoid the polynomial dependence on p in the excess empirical risk if the data matrix X satisfies the restricted strong convexity and mutual incoherence properties.
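Equation (1) can be checked numerically on instances where the constrained minimizer is known. In the sketch below (our notation, not the paper's), the data is chosen so that the unconstrained least-squares solution already satisfies ||θ||_1 ≤ 1, so it is also the minimizer over the ℓ1 ball:

```python
import numpy as np

def excess_empirical_risk(theta, X, y, theta_star):
    """R(theta; D) from Eq. (1): loss at theta minus the minimum loss over C,
    evaluated via a known constrained minimizer theta_star."""
    loss = lambda t: float(np.mean((X @ t - y) ** 2))
    return loss(theta) - loss(theta_star)

# Data chosen so the least-squares solution (0.5, 0.25) has l1-norm 0.75 <= 1,
# hence it is also the minimizer over the unit l1 ball.
X = np.eye(2)
y = np.array([0.5, 0.25])
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
```

The excess risk of the all-zeros model on this instance is simply the mean of the squared labels.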
While such assumptions seem necessary to prove that LASSO recovers the exact support in the worst case, they are often violated in practice, where LASSO still leads to useful models. It is therefore desirable to design and analyze private versions of LASSO in the absence of such assumptions. In this work, we do so by analyzing the loss achieved by the private optimizer, compared to the true optimizer.

We make two primary contributions in this paper. First, we present an algorithm that achieves a privacy risk of Õ(1/n^{2/3}) for the LASSO problem¹. Compared to the previous work, we only assume that the input data has bounded ℓ∞ norm. In addition, the above risk bound has only logarithmic dependence on p, which fits particularly well for LASSO, as we usually assume n ≪ p when applying LASSO. This bound is achieved by a private version of the Frank-Wolfe algorithm. Assuming that each data point d_i satisfies ||d_i||∞ ≤ 1, we have:

Theorem 1.1. There exists an (ε, δ)-differentially private algorithm A for LASSO such that

R(A) = O( log(np) √(log(1/δ)) / (nε)^{2/3} ).

Our second contribution is to show that, surprisingly, this simple algorithm gives a nearly tight bound. We show that this rather unusual n^{−2/3} dependence is not an artifact of the algorithm or the analysis, but is in fact the right dependence for the LASSO problem: no differentially private algorithm can do better! We prove a lower bound by employing fingerprinting-codes based techniques developed in [4, 14].

Theorem 1.2.
For the sparse linear regression problem where ||x_i||∞ ≤ 1, for ε = 0.1 and δ = o(1/n²), any (ε, δ)-differentially private algorithm A must have

R(A) = Ω(1/(n log n)^{2/3}).

¹Throughout the paper, we use Õ to hide logarithmic factors.

Our improved privacy risk crucially depends on the fact that the constraint set is a polytope with few (polynomially many in the dimension) vertices. This allows us to use a private version of the Frank-Wolfe algorithm, where at each step we use the exponential mechanism to select one of the vertices of the polytope. We also present a variant of Frank-Wolfe that uses objective perturbation instead of the exponential mechanism. We show (Theorem 2.6) that we can obtain a risk bound depending on the Gaussian width of the constraint set, which often results in tighter bounds than bounds based, e.g., on the diameter. While more general, this variant adds much more noise than the Frank-Wolfe based algorithm, as it effectively publishes the whole gradient at each step. When C is not a polytope with a small number of vertices, one can still use the exponential mechanism as long as one has a small list of candidate points which contains an approximate optimizer for every direction. For many simple cases, for example the ℓq ball with 1 < q < 2, the bounds attained in this way have
Whereas such a dependence is provably needed for q = 2, the upper bound jump rather\nabruptly from the logarithmic dependence for q = 1 to a polynomial dependence on p for q > 1.\nWe leave open the question of resolving this discontinuity and interpolating more smoothly between\nthe (cid:96)1 case and the (cid:96)2 case.\nOur results enlarge the set of problems for which privacy comes \u201cfor free\u201d. Given n samples from\na distribution, suppose that \u03b8\u2217 is the empirical risk minimizer and \u03b8priv is the differentially private\napproximate minimizer. Then the non-private ERM algorithm outputs \u03b8\u2217 and incurs expected (on the\ndistribution) loss equal to the loss(\u03b8\u2217, training-set) + generalization-error, where the generalization\nerror term depends on the loss function, C and on the number of samples n. The differentially private\nalgorithm incurs an additional loss of the privacy risk. If the privacy risk is asymptotically no larger\nthan the generalization error, we can think of privacy as coming for free, since under the assumption\nof n being large enough to make the generalization error small, we are also making n large enough\nto make the privacy risk small. In the case when C is the (cid:96)1-ball, and the loss function is the squared\nloss with (cid:107)x(cid:107)\u221e \u2264 1 and |y| \u2264 1, the best known generalization error bounds dominate the privacy\nrisk when n = \u03c9(log3 p) [1, Theorem 18].\n\n1.1 Related work\n\nThere have been much work on private LASSO or more generally private ERM algorithms. The\nerror bounds mainly depend on the shape of the constraint set and the Lipschitz condition of the loss\nfunction. Here we will summarize these related results. Related to our results, we distinguish two\nsettings: i) the constraint set is bounded in the (cid:96)1-norm and the the loss function is 1-Lipschitz in\nthe (cid:96)1-norm. (call it the ((cid:96)1/(cid:96)\u221e)-setting). 
This is directly related to our bounds on LASSO; and ii) the constraint set has bounded ℓ2 norm and the loss function is 1-Lipschitz in the ℓ2 norm (the (ℓ2/ℓ2)-setting), which is related to our bounds using the Gaussian width.

The (ℓ1/ℓ∞)-setting: The results in this setting include [20, 24, 19, 25]. The first two works make certain assumptions about the instance (restricted strong convexity (RSC) and mutual incoherence). Under these assumptions, they obtain privacy risk guarantees that depend logarithmically on the dimension p, thus allowing the guarantees to be meaningful even when p ≫ n. In fact, their bound of O(polylog p / n) can be better than our tight bound of O(polylog p / n^{2/3}). However, these assumptions on the data are strong and may not hold in practice. Our guarantees do not require any such data-dependent assumptions. The result of [19] captures the scenario where the constraint set C is the probability simplex and the loss function is a generalized linear model, but provides a worse bound of O(polylog p / n^{1/3}). For the special case of linear loss functions, which are interesting primarily in the online prediction setting, the techniques of [19, 25] provide a bound of O(polylog p / n).

The (ℓ2/ℓ2)-setting: In all the works on private convex optimization that we are aware of, the excess risk guarantees either depend polynomially on the dimensionality of the problem (p), or assume special structure of the loss (e.g., generalized linear models [19] or linear losses [25]). Similar dependence is also present in the online version of the problem [18, 26]. [2] recently showed that in the private ERM setting this polynomial dependence on p is in general unavoidable.
In our work we show that one can replace this dependence on p with the Gaussian width of the constraint set C, which can be much smaller.

Effect of Gaussian width in risk minimization: Our result on general C has a dependence on the Gaussian width of C. This geometric concept has previously appeared in other contexts. For example, [1] bounds the excess generalization error by the Gaussian width of the constraint set C. Recently, [5] showed that the Gaussian width of a constraint set C is very closely related to the number of generic linear measurements one needs to perform to recover an underlying model θ* ∈ C. The notion of Gaussian width has also been used by [22, 11] for differentially private query release mechanisms, but in the very different context of answering multiple linear queries over a database.

1.2 Background

Differential Privacy: The notion of differential privacy (Definition 1.3) is by now a de facto standard for statistical data privacy [10, 12]. One of the reasons why differential privacy has become so popular is that it provides meaningful guarantees even in the presence of arbitrary auxiliary information. At a semantic level, the privacy guarantee ensures that an adversary learns almost the same thing about an individual independent of his presence or absence in the data set. The parameters (ε, δ) quantify the amount of information leakage. For reasons beyond the scope of this work, ε ≈ 0.1 and δ = 1/n^{ω(1)} are a good choice of parameters. Here n refers to the number of samples in the data set.

Definition 1.3.
A randomized algorithm A is (ε, δ)-differentially private if, for all neighboring data sets D and D′ (i.e., they differ in one record, or equivalently, d_H(D, D′) = 1) and for all events S in the output space of A, we have

Pr(A(D) ∈ S) ≤ e^ε Pr(A(D′) ∈ S) + δ.

Here d_H(D, D′) refers to the Hamming distance.

ℓq-norm, q ≥ 1: For q ≥ 1, the ℓq-norm of any vector v ∈ R^p is defined as ||v||_q = (Σ_{i=1}^p |v(i)|^q)^{1/q}, where v(i) is the i-th coordinate of the vector v.

L-Lipschitz continuity w.r.t. norm ||·||: A function Ψ : C → R is L-Lipschitz within a set C w.r.t. a norm ||·|| if the following holds:

∀θ1, θ2 ∈ C, |Ψ(θ1) − Ψ(θ2)| ≤ L · ||θ1 − θ2||.

Gaussian width of a set C: Let b ∼ N(0, I_p) be a Gaussian random vector in R^p. The Gaussian width of a set C is defined as G_C := E_b[ sup_{w∈C} |⟨b, w⟩| ].

2 Private Convex Optimization by the Frank-Wolfe Algorithm

In this section we analyze a differentially private variant of the classical Frank-Wolfe algorithm [15]. We show that in the setting where the constraint set C is a polytope with k vertices, and the loss function L(θ; d) is Lipschitz w.r.t. the ℓ1-norm, one can obtain an excess privacy risk of roughly O(log k / n^{2/3}). This in particular captures the high-dimensional linear regression setting. One such example is the classical LASSO algorithm [27], which computes argmin_{θ: ||θ||_1 ≤ 1} (1/n)||Xθ − y||²_2. In the usual case of |x_ij|, |y_j| = O(1), the loss L(θ) = (1/n)||Xθ − y||²_2 is O(1)-Lipschitz with respect to the ℓ1-norm, and we show that one can achieve the nearly optimal privacy risk of Õ(1/n^{2/3}).

The Frank-Wolfe algorithm [15] can be regarded as a "greedy" algorithm which moves towards the optimum of the first-order approximation of the objective (see Algorithm 1 for the description). How fast the Frank-Wolfe algorithm converges depends on the "curvature" of L, defined as follows according to [8, 17]. We remark that a β-smooth function on C has curvature constant bounded by β||C||².

Definition 2.1 (Curvature constant). For L : C → R, define Γ_L as below:

Γ_L := sup_{θ1, θ2 ∈ C, γ ∈ (0,1], θ3 = θ1 + γ(θ2 − θ1)} (2/γ²) (L(θ3) − L(θ1) − ⟨θ3 − θ1, ∇L(θ1)⟩).

Remark 1. A useful bound can be derived for a quadratic loss L(θ) = θᵀAᵀAθ + ⟨b, θ⟩. In this case, by [8], Γ_L ≤ max_{a,b ∈ A·C} ||a − b||²_2. When C is centrally symmetric, we have the bound Γ_L ≤ 4 max_{θ∈C} ||Aθ||²_2. For LASSO, A = (1/√n) X.

Define θ* = argmin_{θ∈C} L(θ). The following theorem bounds the convergence rate of the Frank-Wolfe algorithm.

Algorithm 1 Frank-Wolfe algorithm
Input: C ⊆ R^p, L : C → R, step sizes μ_t
1: Choose an arbitrary θ1 from C
2: for t = 1 to T − 1 do
3:   Compute θ̃_t = argmin_{θ∈C} ⟨∇L(θ_t), θ − θ_t⟩
4:   Set θ_{t+1} = θ_t + μ_t(θ̃_t − θ_t)
5: return θ_T.

Theorem 2.2 ([8, 17]).
If we set μ_t = 2/(t + 2), then L(θ_T) − L(θ*) = O(Γ_L/T).

While the Frank-Wolfe algorithm does not necessarily provide faster convergence compared to gradient-descent based methods, it has two major advantages. First, on Line 3, it reduces the problem to minimizing a linear function. When C is defined by a small number of vertices, e.g., when C is an ℓ1 ball, the minimization can be done by checking ⟨∇L(θ_t), x⟩ for each vertex x of C. This can be done efficiently. Secondly, each step of Frank-Wolfe takes a convex combination of θ_t and θ̃_t, where θ̃_t is on the boundary of C. Hence each intermediate solution is always inside C (the method is sometimes called projection free), and the final outcome θ_T is a convex combination of up to T points on the boundary of C (or vertices of C when C is a polytope). Such an outcome may be desirable, for example when C is a polytope, as it corresponds to a sparse solution. For these reasons the Frank-Wolfe algorithm has found many applications in machine learning [23, 16, 8]. As we shall see below, these properties are also useful for obtaining low risk bounds for its private version.

2.1 Private Frank-Wolfe Algorithm

We now present a private version of the Frank-Wolfe algorithm. The algorithm accesses the private data only through the loss function in step 3 of the algorithm. Thus, to achieve privacy, it suffices to replace this step by a private version.

To do so, we apply the exponential mechanism [21] to select an approximate optimizer. In the case when the set C is a polytope, it suffices to optimize over the vertices of C due to the following basic fact:

Fact 2.3. Let C ⊆ R^p be the convex hull of a compact set S ⊆ R^p. For any vector v ∈ R^p, argmin_{θ∈C} ⟨θ, v⟩ ∩ S ≠ ∅.

Thus it suffices to run the exponential mechanism to select θ̃_t from amongst the vertices of C. This leads to a differentially private algorithm with risk logarithmically dependent on |S|. When |S| is polynomial in p, it leads to an error bound with log p dependence. We can bound the error in terms of the ℓ1-Lipschitz constant, which can be much smaller than the ℓ2-Lipschitz constant. In particular, as we show in the next section, the private Frank-Wolfe algorithm is nearly optimal for the important high-dimensional sparse linear regression problem.

Algorithm 2 A_Noise-FW(polytope): Differentially Private Frank-Wolfe Algorithm (Polytope Case)
Input: Data set D = {d_1, ···, d_n}, loss function L(θ; D) = (1/n) Σ_{i=1}^n L(θ; d_i) (with ℓ1-Lipschitz constant L_1 for L), privacy parameters (ε, δ), convex set C = conv(S) with ||C||_1 denoting max_{s∈S} ||s||_1.
1: Choose an arbitrary θ1 from C
2: for t = 1 to T − 1 do
3:   ∀s ∈ S, α_s ← ⟨s, ∇L(θ_t; D)⟩ + Lap(λ), where λ = L_1 ||C||_1 √(8T log(1/δ)) / (nε) and Lap(λ) has density (1/2λ) e^{−|x|/λ}.
4:   θ̃_t ← argmin_{s∈S} α_s.
5:   θ_{t+1} ← (1 − μ_t)θ_t + μ_t θ̃_t, where μ_t = 2/(t+2).
6: Output θ_priv = θ_T.

Theorem 2.4 (Privacy guarantee).
Algorithm 2 is (ε, δ)-differentially private.

Since each data item is assumed to have bounded ℓ∞ norm, for two neighboring databases D and D′ and any θ ∈ C, s ∈ S, we have that

|⟨s, ∇L(θ; D)⟩ − ⟨s, ∇L(θ; D′)⟩| = O(L_1 ||C||_1 / n).

The proof of privacy then follows from a straightforward application of the exponential mechanism [21] (or its noisy-maximum version [3, Theorem 5]) and the strong composition theorem [13]. In Theorem 2.5 we prove the utility guarantee for the private Frank-Wolfe algorithm in the convex polytope case. Define Γ_L = max_{D∈D^n} Γ_{L(·;D)}, the maximum curvature constant over all possible data sets.

Theorem 2.5 (Utility guarantee). Let L_1, S and ||C||_1 be defined as in Algorithm 2 (Algorithm A_Noise-FW(polytope)). Let Γ_L be an upper bound on the curvature constant (defined in Definition 2.1) for the loss function L(·; d) that holds for all d ∈ D. In Algorithm A_Noise-FW(polytope), if we set T = Γ_L^{2/3} (nε)^{2/3} / (L_1||C||_1)^{2/3}, then

E[L(θ_priv; D)] − min_{θ∈C} L(θ; D) = O( Γ_L^{1/3} (L_1||C||_1)^{2/3} log(n|S|) √(log(1/δ)) / (nε)^{2/3} ).

Here the expectation is over the randomness of the algorithm.

The proof of utility uses known bounds on noisy Frank-Wolfe [17], along with error bounds for the exponential mechanism. The details can be found in the full version.

General C: While a variant of this mechanism can be applied to the case when C is not a polytope, its error would depend on the size of a cover of the boundary of C, which can be exponential in p, leading to an error bound with polynomial dependence on p. In the full version, we analyze another variant of private Frank-Wolfe that uses objective perturbation to ensure privacy. This variant is
This variant is\nwell-suited for a general convex set C and the following result, proven in the Appendix, bounds its\nexcess risk in terms of the Gaussian Width of C. For this mechanism, we only need C to be bounded\nin (cid:96)2 diameter, but our error now depends on the (cid:96)2-Lipschitz constant of the loss functions.\nTheorem 2.6. Suppose that each loss function is L2-Lipschitz with respect to the (cid:96)2 norm, and that\nC has (cid:96)2 diameter at most (cid:107)C(cid:107)2. Let GC the Gaussian width of the convex set C \u2286 Rp, and let \u0393L\nbe the curvature constant (de\ufb01ned in De\ufb01nition 2.1) for the loss function (cid:96)(\u03b8; d) for all \u03b8 \u2208 C and\nd \u2208 D. Then there is an (\u0001, \u03b4)-differentially private algorithm ANoise\u2212FW with excess empirical risk:\n\nE(cid:2)L(\u03b8priv; D)(cid:3) \u2212 min\n\n(cid:32)\n\n\u03b8\u2208C L(\u03b8; D) = O\n\n\u0393L1/3 (L2GC)2/3 log2(n/\u03b4)\n\n(n\u0001)2/3\n\n(cid:33)\n\n.\n\nHere the expectation is over the randomness of the algorithm.\n\n2.2 Private LASSO algorithm\nWe now apply the private Frank-Wolfe algorithm ANoise\u2212FW(polytope) to the important case of the\nsparse linear regression (or LASSO) problem.\nProblem de\ufb01nition: Given a data set D = {(x1, y1),\u00b7\u00b7\u00b7 , (xn, yn)} of n-samples from the domain\nD = {(x, y) : x \u2208 Rp, y \u2208 [\u22121, 1],(cid:107)x(cid:107)\u221e \u2264 1}, and the convex set C = (cid:96)p\n1. De\ufb01ne the mean\nsquared loss,\n\nL(\u03b8; D) =\n\n((cid:104)xi, \u03b8(cid:105) \u2212 yi)2 .\n\n(2)\n\n(cid:88)\n\ni\u2208[n]\n\n1\nn\n\nThe objective is to compute \u03b8priv \u2208 C to minimize L(\u03b8; D) while preserving privacy with respect to\nany change of individual (xi, yi) pair. 
The non-private setting of the above problem is a variant of the least squares problem with ℓ1 regularization, which was initiated by the work on LASSO [27, 28] and has been intensively studied in the years since.

Since the ℓ1 ball is the convex hull of 2p vertices, we can apply the private Frank-Wolfe algorithm A_Noise-FW(polytope). For the above setting, it is easy to check that the ℓ1-Lipschitz constant is bounded by O(1). Further, by applying the bound for quadratic losses from Remark 1, we have that Γ_L ≤ 4 max_{θ∈C} (1/n)||Xθ||²_2 = O(1), since C is the unit ℓ1 ball and |x_ij| ≤ 1. Hence Γ_L = O(1). Now applying Theorem 2.5, we have:

Corollary 2.7. Let D = {(x_1, y_1), ···, (x_n, y_n)} be a set of n samples from the domain D = {(x, y) : ||x||∞ ≤ 1, |y| ≤ 1}, and let the convex set C be the unit ℓ1-ball. The output θ_priv of Algorithm A_Noise-FW(polytope) ensures the following:

E[L(θ_priv; D) − min_{θ∈C} L(θ; D)] = O( log(np/δ) / (nε)^{2/3} ).

Remark 2. Compared to the previous work [20, 24], the above upper bound makes no assumption of restricted strong convexity or mutual incoherence, which might be too strong for realistic settings. Our results also significantly improve the bounds of [19], from Õ(1/n^{1/3}) to Õ(1/n^{2/3}), for the case of the set C being the probability simplex and the loss being a generalized linear model.

3 Optimality of Private LASSO

In the following, we show that to ensure privacy, the error bound in Corollary 2.7 is nearly optimal in terms of the dominant factor of 1/n^{2/3}.

Theorem 3.1 (Optimality of private Frank-Wolfe). Let C be the ℓ1-ball and L be the mean squared loss in equation (2).
For every sufficiently large n and every (ε, δ)-differentially private algorithm A with ε ≤ 0.1 and δ = o(1/n²), there exists a data set D = {(x_1, y_1), ···, (x_n, y_n)} of n samples from the domain D = {(x, y) : ||x||∞ ≤ 1, |y| ≤ 1} such that

E[L(A(D); D) − min_{θ∈C} L(θ; D)] = Ω̃(1/n^{2/3}).

We prove the lower bound by following the fingerprinting codes argument of [4] for lower-bounding the error of (ε, δ)-differentially private algorithms. Similar to [4] and [14], we start with the following lemma, which is implicit in [4]. The matrix X in Theorem 3.2 is the padded Tardos code used in [14, Section 5]. For any matrix X, denote by X(i) the matrix obtained by removing the i-th row of X. Call a column of a matrix a consensus column if the entries in the column are either all 1 or all −1. The sign of a consensus column is simply the consensus value of the column. Write w = m/log m and p = 1000m². The following theorem follows immediately from the proof of Corollary 16 in [14].

Theorem 3.2 ([14, Corollary 16], restated). Let m be a sufficiently large positive integer. There exists a matrix X ∈ {−1, 1}^{(w+1)×p} with the following property. For each i ∈ [1, w+1], the set W_i of consensus columns of X(i) has size at least 0.999p. In addition, if an algorithm A, on input matrix X(i) for some i ∈ [1, w+1], produces with probability at least 2/3 a p-dimensional sign vector which agrees with at least (3/4)p columns in W_i, then A is not (ε, δ)-differentially private with respect to a single row change (to some other row in X).

Write τ = 0.001. Let k = τwp. We first form a k × p matrix Y whose column vectors are mutually orthogonal {1, −1} vectors. This is possible as k ≫ p.
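The proof only needs the existence of such a matrix Y. When k is a power of two, one standard way to realize mutually orthogonal {−1, 1} columns is Sylvester's Hadamard construction; the sketch below is ours, not part of the paper's argument:

```python
import numpy as np

def sylvester_hadamard(k):
    """Return a k x k {+1, -1} matrix with mutually orthogonal columns.

    Sylvester's construction: H_{2m} = [[H_m, H_m], [H_m, -H_m]]; requires k
    to be a power of two.
    """
    assert k >= 1 and (k & (k - 1)) == 0, "k must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < k:
        H = np.block([[H, H], [H, -H]])
    return H

# Taking the first p columns of H gives a k x p matrix Y of mutually
# orthogonal +/-1 columns (here with k = 8, p = 3 for illustration).
Y = sylvester_hadamard(8)[:, :3]
```

Orthogonality of the columns, Yᵀ Y = k·I, is exactly the property used below to collapse the Σ_j (z_j · θ)² terms into k||θ||²_2.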
Now we construct w + 1 databases D_i for 1 ≤ i ≤ w + 1 as follows. All the databases contain the common set of examples (z_j, 0) (i.e., vector z_j with label 0) for 1 ≤ j ≤ k, where z_j = (Y_{j1}, ..., Y_{jp}) is the j-th row vector of Y. In addition, each D_i contains w examples (x_j, 1), where x_j = (X_{j1}, ..., X_{jp}), for j ≠ i. Then L(θ; D_i) is defined as follows (for ease of notation in this proof, we work with the un-normalized loss; this does not affect the generality of the arguments in any way):

L(θ; D_i) = Σ_{j≠i} (x_j · θ − 1)² + Σ_{j=1}^k (z_j · θ)² = Σ_{j≠i} (x_j · θ − 1)² + k||θ||²_2.

The last equality holds because the columns of Y are mutually orthogonal {−1, 1} vectors. For each D_i, consider θ* ∈ {−1/p, 1/p}^p such that the sign of each coordinate of θ* matches the sign of the corresponding consensus column of X(i). Plugging θ* into L(θ*; D_i), we have the following:

L(θ*; D_i) ≤ Σ_{j≠i} (2τ)² + k/p = (τ + 4τ²)w,    (3)

since the number of consensus columns is at least (1 − τ)p.

We now prove the crucial lemma, which states that if θ is such that ||θ||_1 ≤ 1 and L(θ; D_i) is small, then θ has to agree with the sign of most of the consensus columns of X(i).

Lemma 3.3. Suppose that ||θ||_1 ≤ 1 and L(θ; D_i) < 1.1τw. For j ∈ W_i, denote by s_j the sign of the consensus column j. Then we have

|{j ∈ W_i : sign(θ_j) = s_j}| ≥ (3/4)p.

Proof. For any S ⊆ {1, ..., p}, denote by θ|_S the projection of θ to the coordinate subset S. Consider three subsets S_1, S_2, S_3, where

S_1 = {j ∈ W_i : sign(θ_j) = s_j},
S_2 = {j ∈ W_i : sign(θ_j) ≠ s_j},
S_3 = {1, ..., p} \ W_i.

The proof is by contradiction. Assume that |S_1| < (3/4)p.

Further denote θ_i = θ|_{S_i} for i = 1, 2, 3. We will bound ||θ_1||_1 and ||θ_3||_1 using the inequality ||x||_2 ≥ ||x||_1/√d for any d-dimensional vector x.

First, ||θ_3||²_2 ≥ ||θ_3||²_1/|S_3| ≥ ||θ_3||²_1/(τp). Hence k||θ_3||²_2 ≥ w||θ_3||²_1, recalling that k = τwp. But k||θ_3||²_2 ≤ k||θ||²_2 ≤ L(θ; D_i) ≤ 1.1τw, so that ||θ_3||_1 ≤ √(1.1τ) ≤ 0.04.

Similarly, by the assumption |S_1| < (3/4)p, ||θ_1||²_2 ≥ ||θ_1||²_1/|S_1| ≥ 4||θ_1||²_1/(3p). Again using k||θ||²_2 < 1.1τw, we have that ||θ_1||_1 ≤ √(1.1 · 3/4) ≤ 0.91.

Now, for each j ≠ i, we have ⟨x_j, θ⟩ − 1 = ||θ_1||_1 − ||θ_2||_1 + β_j − 1, where |β_j| ≤ ||θ_3||_1 ≤ 0.04. Since ||θ_1||_1 + ||θ_2||_1 + ||θ_3||_1 ≤ 1, we have

|⟨x_j, θ⟩ − 1| ≥ 1 − ||θ_1||_1 − |β_j| ≥ 1 − 0.91 − 0.04 = 0.05.

Hence we have that L(θ; D_i) ≥ (0.05)²w ≥ 1.1τw. This leads to a contradiction, so we must have |S_1| ≥ (3/4)p.

With Theorem 3.2 and Lemma 3.3, we can now prove Theorem 3.1.

Proof. Suppose that A is private, and that for the datasets constructed above,

E[L(A(D_i); D_i) − min_θ L(θ; D_i)] ≤ cw

for a sufficiently small constant c.
By Markov's inequality, with probability at least $2/3$ we have $L(A(D_i); D_i) - \min_{\theta} L(\theta; D_i) \le 3cw$. By (3), $\min_{\theta} L(\theta; D_i) \le (\tau + 4\tau^2) w$. Hence, if we choose the constant $c$ small enough, with probability $2/3$,
$$L(A(D_i); D_i) < (\tau + 4\tau^2 + 3c) w \le 1.1\tau w . \qquad (4)$$
By Lemma 3.3, (4) implies that $A(D_i)$ agrees with at least $\frac{3}{4} p$ of the consensus columns of $X^{(i)}$. However, by Theorem 3.2, this violates the privacy of $A$. Hence there exists $i$ such that
$$\mathbb{E}[L(A(D_i); D_i) - \min_{\theta} L(\theta; D_i)] > cw .$$
Recall that $w = m/\log m$ and $n = w + wp = O(m^3/\log m)$. Hence we have that
$$\mathbb{E}[L(A(D_i); D_i) - \min_{\theta} L(\theta; D_i)] = \Omega(n^{1/3}/\log^{2/3} n) .$$
The proof is completed by converting this bound to its normalized version, $\Omega(1/(n \log n)^{2/3})$.
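Two pieces of the argument admit a quick numerical sanity check: the orthogonality identity $\sum_j (z_j \cdot \theta)^2 = k\|\theta\|_2^2$ behind the definition of the loss, and the parameter accounting in the final rate conversion. The sketch below is not part of the paper's formal argument; it uses the Sylvester–Hadamard construction as one concrete way to realize mutually orthogonal $\{-1,1\}$ columns for $Y$, and the function name `hadamard` and the specific parameter values are illustrative only.

```python
import math

def hadamard(k):
    """Sylvester construction of a k x k {-1,1} matrix with mutually
    orthogonal columns (k must be a power of two)."""
    H = [[1]]
    while len(H) < k:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

# Identity behind the loss: for Y with mutually orthogonal {-1,1} columns,
# sum_j (z_j . theta)^2 = theta^T Y^T Y theta = k * ||theta||_2^2.
k = p = 8
Y = hadamard(k)
theta = [0.3, -0.1, 0.05, 0.0, 0.2, -0.25, 0.1, -0.3]
lhs = sum(sum(Y[j][l] * theta[l] for l in range(p)) ** 2 for j in range(k))
rhs = k * sum(t * t for t in theta)
assert abs(lhs - rhs) < 1e-9

# Rate conversion: with w = m/log m and n = O(m^3/log m), the un-normalized
# excess risk w is Theta(n^{1/3}/log^{2/3} n), and dividing by n gives the
# normalized bound Omega(1/(n log n)^{2/3}).
for m in (10**4, 10**6, 10**8):
    w = m / math.log(m)
    n = m**3 / math.log(m)
    unnormalized = n ** (1 / 3) / math.log(n) ** (2 / 3)
    normalized = 1 / (n * math.log(n)) ** (2 / 3)
    assert 1.0 < w / unnormalized < 3.0           # agree up to constants
    assert abs((w / n) / normalized - w / unnormalized) < 1e-9
print("sanity checks passed")
```

The second loop only confirms the algebra of the conversion: the un-normalized bound $w$ and the normalized bound $w/n$ differ by exactly the factor $n$, so matching one against $n^{1/3}/\log^{2/3} n$ up to constants matches the other against $1/(n\log n)^{2/3}$.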