{"title": "Differentially private Bayesian learning on distributed data", "book": "Advances in Neural Information Processing Systems", "page_first": 3226, "page_last": 3235, "abstract": "Many applications of machine learning, for example in health care, would benefit from methods that can guarantee privacy of data subjects. Differential privacy (DP) has become established as a standard for protecting learning results. The standard DP algorithms require a single trusted party to have access to the entire data, which is a clear weakness, or add prohibitive amounts of noise. We consider DP Bayesian learning in a distributed setting, where each party only holds a single sample or a few samples of the data. We propose a learning strategy based on a secure multi-party sum function for aggregating summaries from data holders and the Gaussian mechanism for DP. Our method builds on an asymptotically optimal and practically efficient DP Bayesian inference with rapidly diminishing extra cost.", "full_text": "Differentially private Bayesian learning on\n\ndistributed data\n\nMikko Heikkil\u00e41\n\nmikko.a.heikkila@helsinki.fi\n\nEemil Lagerspetz2\n\neemil.lagerspetz@helsinki.fi\n\nSamuel Kaski3\n\nsamuel.kaski@aalto.fi\n\nKana Shimizu4\n\nshimizu.kana.g@gmail.com\n\nSasu Tarkoma2\n\nsasu.tarkoma@helsinki.fi\n\nAntti Honkela1,5\n\nantti.honkela@helsinki.fi\n\n1 Helsinki Institute for Information Technology HIIT,\n\nDepartment of Mathematics and Statistics, University of Helsinki\n\n2 Helsinki Institute for Information Technology HIIT,\n\nDepartment of Computer Science, University of Helsinki\n\n3 Helsinki Institute for Information Technology HIIT,\nDepartment of Computer Science, Aalto University\n\n4 Department of Computer Science and Engineering, Waseda University\n\n5 Department of Public Health, University of Helsinki\n\nAbstract\n\nMany applications of machine learning, for example in health care, would bene\ufb01t\nfrom methods that can guarantee privacy of data 
subjects. Differential privacy (DP) has become established as a standard for protecting learning results. Standard DP algorithms either require a single trusted party with access to the entire data, which is a clear weakness, or add prohibitive amounts of noise. We consider DP Bayesian learning in a distributed setting, where each party only holds a single sample or a few samples of the data. We propose a learning strategy based on a secure multi-party sum function for aggregating summaries from data holders and the Gaussian mechanism for DP. Our method builds on an asymptotically optimal and practically efficient DP Bayesian inference with rapidly diminishing extra cost.\n\n1 Introduction\n\nDifferential privacy (DP) [9, 11] has recently gained popularity as the theoretically best-founded means of protecting the privacy of data subjects in machine learning. It provides rigorous guarantees against breaches of individual privacy that are robust even against attackers with access to additional side information. DP learning methods have been proposed e.g. for maximum likelihood estimation [24], empirical risk minimisation [5] and Bayesian inference [e.g. 8, 13, 16, 17, 19, 25, 29]. There are DP versions of most popular machine learning methods, including linear regression [16, 28], logistic regression [4], support vector machines [5], and deep learning [1].\nAlmost all existing DP machine learning methods assume that some trusted party has unrestricted access to all the data in order to add the amount of noise needed for the privacy guarantees.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThis is a highly restrictive assumption for many applications, e.g.
for learning with data on mobile devices, and it creates a huge privacy risk through a potential single point of failure.\nIn this paper we introduce a general strategy for DP Bayesian learning in the distributed setting with minimal overhead. Our method builds on the asymptotically optimal sufficient statistic perturbation mechanism [13, 16] and shares its asymptotic optimality. The method is based on a DP secure multi-party computation (SMC) algorithm, called the Distributed Compute Algorithm (DCA), for achieving DP in the distributed setting. We demonstrate good performance of the method on DP Bayesian inference using linear regression as an example.\n\n1.1 Our contribution\n\nWe propose a general approach for privacy-sensitive learning in the distributed setting. Our approach combines SMC with DP Bayesian learning methods, originally introduced for the non-distributed setting with a trusted party, to achieve DP Bayesian learning in the distributed setting.\nTo demonstrate our framework in practice, we combine the Gaussian mechanism for (ε, δ)-DP with efficient DP Bayesian inference using sufficient statistics perturbation (SSP) and an efficient SMC approach for secure distributed computation of the required sums of sufficient statistics. We prove that Gaussian SSP is an efficient (ε, δ)-DP Bayesian inference method and that the distributed version approaches it quickly as the number of parties increases. We also address the subtle challenge of normalising the data privately in a distributed manner, which is required for the proof of DP in distributed DP learning.\n\n2 Background\n\n2.1 Differential privacy\n\nDifferential privacy (DP) [11] gives strict, mathematically rigorous guarantees against intrusions on individual privacy. A randomised algorithm is differentially private if its results on adjacent data sets are likely to be similar.
Here adjacency means that the data sets differ by a single element, i.e., the two data sets have the same number of samples, but they differ in a single one. In this work we utilise a relaxed version of DP called (ε, δ)-DP [9, Definition 2.4].\nDefinition 2.1. A randomised algorithm A is (ε, δ)-DP, if for all S ⊆ Range(A) and all adjacent data sets D, D′,\n\nP(A(D) ∈ S) ≤ exp(ε) P(A(D′) ∈ S) + δ.\n\nThe parameters ε and δ in Definition 2.1 control the privacy guarantee: ε tunes the amount of privacy (smaller ε means stricter privacy), while δ can be interpreted as the proportion of probability space where the privacy guarantee may break down.\nThere are several established mechanisms for ensuring DP. We use the Gaussian mechanism [9, Theorem 3.22]. The theorem says that given a numeric query f with ℓ2-sensitivity Δ2(f), adding noise distributed as N(0, σ²) to each output component guarantees (ε, δ)-DP, when\n\nσ² > 2 ln(1.25/δ) (Δ2(f)/ε)².   (1)\n\nHere, the ℓ2-sensitivity of a function f is defined as\n\nΔ2(f) = sup_{D∼D′} ‖f(D) − f(D′)‖₂,   (2)\n\nwhere the supremum is over all adjacent data sets D, D′.\n\n2.2 Differentially private Bayesian learning\n\nBayesian learning provides a natural complement to DP because it inherently can handle uncertainty, including uncertainty introduced to ensure DP [26], and it provides a flexible framework for data modelling.\nThree distinct types of mechanisms for DP Bayesian inference have been proposed:\n\n1. Drawing a small number of samples from the posterior or an annealed posterior [7, 25];\n2. Sufficient statistics perturbation (SSP) of an exponential family model [13, 16, 19]; and\n3.
Perturbing the gradients in gradient-based MCMC [25] or optimisation in variational inference [17].\n\nFor models where it applies, the SSP approach is asymptotically efficient [13, 16], unlike the posterior sampling mechanisms. The efficiency proof of [16] can be generalised to (ε, δ)-DP and Gaussian SSP as shown in the Supplementary Material.\nThe SSP (#2) and gradient perturbation (#3) mechanisms are of similar form in that the DP mechanism ultimately computes a perturbed sum\n\nz = Σ_{i=1}^{N} z_i + η,   (3)\n\nover quantities z_i computed for different samples i = 1, . . . , N, where η denotes the noise injected to ensure the DP guarantee. For SSP [13, 16, 19], the z_i are the sufficient statistics of a particular sample, whereas for gradient perturbation [17, 25], the z_i are the clipped per-sample gradient contributions. When a single party holds the entire data set, the sum z in Eq. (3) can be computed easily, but the case of distributed data makes things more difficult.\n\n3 Secure and private learning with distributed data\n\nLet us assume there are N data holders (called clients in the following), who each hold a single data sample. We would like to use the aggregate data for learning, but the clients do not want to reveal their data as such to anybody else. The main problem with the distributed setting is that if each client uses a trusted aggregator (TA) DP technique separately, the noise η in Eq. (3) is added by each client, increasing the total noise variance by a factor of N compared to the non-distributed single-TA setting, effectively reducing to naive input perturbation.
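To make the quantities above concrete, the following sketch (hypothetical parameter values, not taken from the paper) computes the Gaussian-mechanism variance of Eq. (1) and illustrates the N-fold variance blow-up of naive per-client perturbation just described:

```python
import numpy as np

def gaussian_sigma2(sensitivity, eps, delta):
    """Variance satisfying the Gaussian mechanism bound of Eq. (1):
    sigma^2 > 2 ln(1.25/delta) (Delta_2(f)/eps)^2."""
    return 2.0 * np.log(1.25 / delta) * (sensitivity / eps) ** 2

# Hypothetical numbers for illustration only.
eps, delta, sensitivity, N = 1.0, 1e-4, 1.0, 1000
s2 = gaussian_sigma2(sensitivity, eps, delta)

# Trusted aggregator: one noise draw of variance s2 on the whole sum.
# Naive distributed: each of N clients adds full-variance noise,
# so the sum carries total noise variance N * s2.
print(s2)      # per-draw variance under a trusted aggregator
print(N * s2)  # total variance under naive per-client noising
```

The point of the distributed scheme in the next section is to get the total noise variance back down to (roughly) the single `s2` without any party seeing the unperturbed contributions.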
To reduce the noise level without compromising privacy, the individual data samples need to be combined without directly revealing them to anyone. Our solution to this problem uses an SMC approach based on a form of secret sharing: each client splits their term of the sum into separate messages and sends these to M servers such that together the messages sum up to the desired value, but individually they are just random noise. This can be implemented efficiently using a fixed-point representation of real numbers, which allows exact cancelling of the noise in the addition. Like any secret sharing approach, this algorithm is secure as long as not all M servers collude. Cryptography is only required to secure the communication between the client and the server. Since this does not need to be homomorphic, as in many other protocols, faster symmetric cryptography can be used for the bulk of the data. We call this the Distributed Compute Algorithm (DCA), which we introduce next in detail.\n\n3.1 Distributed compute algorithm (DCA)\n\nIn order to add the correct amount of noise while avoiding revealing the unperturbed data to any single party, we combine an encryption scheme with the Gaussian mechanism for DP as illustrated in Fig. 1(a). Each individual client adds a small amount of Gaussian noise to their data, so that the aggregated noise is another Gaussian with large enough variance. The details of the noise scaling are presented in Section 3.1.2.\nThe scheme relies on several independent aggregators, called Compute nodes (Algorithm 1). At a general level, the clients divide their data and some blinding noise into shares that are each sent to one Compute node. After receiving shares from all clients, each Compute node decrypts the values, sums them and broadcasts the results.
The final results can be obtained by summing up the values from all Compute nodes, which cancels the blinding noise.\n\n3.1.1 Threat model\n\nWe assume there are at most T clients who may collude to break the privacy, either by revealing the noise they add to their data samples or by abstaining from adding the noise in the first place. The rest are honest-but-curious (HbC), i.e., they will take a peek at other people's data if given the chance, but they will follow the protocol.\n\nFigure 1: 1(a): Schematic diagram of the Distributed Compute Algorithm (DCA). Red refers to encrypted values, blue to unencrypted (but blinded or DP) values. 1(b): Extra scaling factor needed for the noise in the distributed setting with T colluding clients, as compared to the trusted aggregator setting.\n\nAlgorithm 1 Distributed Compute Algorithm for distributed summation with independent Compute nodes\nInput: d-dimensional vectors z_i held by clients i ∈ {1, . . . , N}; distributed Gaussian mechanism noise variances σ²_j, j = 1, . . . , d (public); number of parties N (public); number of Compute nodes M (public)\nOutput: Differentially private sum Σ_{i=1}^{N} (z_i + η_i), where η_i ∼ N(0, diag(σ²_j))\n1: Each client i simulates η_i ∼ N(0, diag(σ²_j)) and M − 1 vectors r_{i,k} of uniformly random fixed-point data, with r_{i,M} = −Σ_{k=1}^{M−1} r_{i,k} to ensure that Σ_{k=1}^{M} r_{i,k} = 0_d (a vector of zeros).\n2: Each client i computes the messages m_{i,1} = z_i + η_i + r_{i,1} and m_{i,k} = r_{i,k}, k = 2, . . . , M, and sends them securely to the corresponding Compute node k.\n3: After receiving messages from all of the clients, Compute node k decrypts the values and broadcasts the noisy aggregate sum q_k = Σ_{i=1}^{N} m_{i,k}. A final aggregator will then add these to obtain Σ_{k=1}^{M} q_k = Σ_{i=1}^{N} (z_i + η_i).\n\nTo break the privacy of individual clients, all Compute nodes need to collude. We therefore assume that at least one Compute node follows the protocol. We further assume that all parties have an interest in the results and hence will not attempt to pollute the results with invalid values.\n\n3.1.2 Privacy of the mechanism\n\nIn order to guarantee that the sum-query results returned by Algorithm 1 are DP, we need to show that the variance of the aggregated Gaussian noise is large enough.\nTheorem 1 (Distributed Gaussian mechanism). If at most T clients collude or drop out of the protocol, the sum-query result returned by Algorithm 1 is (ε, δ)-DP, when the variance of the added noise σ²_j fulfils\n\nσ²_j ≥ (1 / (N − T − 1)) σ²_{j,std},\n\nwhere N is the number of clients and σ²_{j,std} is the variance of the noise in the standard (ε, δ)-DP Gaussian mechanism given in Eq. (1).\n\nProof. See Supplement.\n\nIn the case of all-HbC clients, T = 0. The extra scaling factor increases the variance of the DP noise, but this factor quickly approaches 1 as the number of clients increases, as can be seen from Figure 1(b).\n\n3.1.3 Fault tolerance\n\nThe Compute nodes need to know which clients' contributions they can safely aggregate. This feature is simple to implement e.g. with pairwise communications between all Compute nodes.
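The secret-shared summation of Algorithm 1 can be sketched as follows. This is a minimal, insecure single-process illustration of the arithmetic only: Python's `random` stands in for a cryptographically secure generator, encryption and transport are omitted, and the function and constant names are ours, not the paper's.

```python
import random

random.seed(0)
SCALE = 1 << 16   # fixed-point scaling factor
MOD = 1 << 32     # shares live in Z_MOD, so blinding cancels exactly

def encode(x):
    """Real number -> fixed-point element of Z_MOD."""
    return round(x * SCALE) % MOD

def decode(v):
    """Centred decode of a Z_MOD element back to a real number."""
    if v >= MOD // 2:
        v -= MOD
    return v / SCALE

def dca_sum(values, sigma, M=3):
    """Sketch of Algorithm 1 for scalar inputs: each client adds
    Gaussian DP noise, fixed-point encodes, and splits the result
    into M additive shares; Compute node k sums its shares into q_k;
    a final aggregator adds the q_k, cancelling the blinding noise."""
    q = [0] * M
    for z in values:
        noisy = encode(z + random.gauss(0.0, sigma))
        shares = [random.randrange(MOD) for _ in range(M - 1)]
        shares.append((noisy - sum(shares)) % MOD)  # shares sum to noisy (mod MOD)
        for k in range(M):
            q[k] = (q[k] + shares[k]) % MOD
    return decode(sum(q) % MOD)

# With sigma = 0 the blinding cancels exactly and the result is the exact sum.
print(dca_sum([1.5, 2.25, -0.75], sigma=0.0))  # -> 3.0
```

Because the shares are uniform in Z_MOD, any M − 1 of them reveal nothing about a client's contribution; the fixed-point representation is what makes the cancellation exact rather than approximate.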
In order to avoid having to start from scratch due to insufficient noise for DP, the same strategy used to protect against colluding clients can be utilised: when T > 0, at most T clients in total can drop out or collude and the scheme will still remain private.\n\n3.1.4 Computational scalability\n\nMost of the operations needed in Algorithm 1 are extremely fast: encryption and decryption can use fast symmetric algorithms such as AES (using slower public key cryptography just for the key of the symmetric system) and the rest is just integer additions for the fixed-point arithmetic. The likely first bottlenecks in the implementation would be caused by synchronisation when gathering the messages, as well as by the generation of cryptographically secure random vectors r_{i,k}.\n\n3.2 Differentially private Bayesian learning on distributed data\n\nIn order to perform DP Bayesian learning securely in the distributed setting, we use DCA (Algorithm 1) to compute the required data summaries that correspond to Eq. (3). In this section we consider how to combine this scheme with concrete DP learning methods introduced for the trusted aggregator setting, so as to provide a wide range of possibilities for performing DP Bayesian learning securely with distributed data.\nThe aggregation algorithm is most straightforward to apply to the SSP method [13, 16] for exact and approximate posterior inference on exponential family models. [13] and [16] use Laplace noise to guarantee ε-DP, which is a stricter form of privacy than the (ε, δ)-DP used in DCA [9]. We consider here only the (ε, δ)-DP versions of the methods, and discuss the possible Laplace noise mechanism further in Section 7. The model training in this case is done in a single iteration, so a single application of Algorithm 1 is enough for learning. We consider a more detailed example in Section 3.2.1.\nWe can also apply DCA to DP variational inference [17, 19].
These methods rely on possibly clipped gradients or expected sufficient statistics calculated from the data. Typically, each training iteration would use only a mini-batch instead of the full data. To use variational inference in the distributed setting, an arbitrary party keeps track of the current (public) model parameters and the privacy budget, and asks for updates from the clients.\nAt each iteration, the model trainer selects a random mini-batch of fixed public size from the available clients and sends them the current model parameters. The selected clients then calculate the clipped gradients or expected sufficient statistics using their data, add noise scaled to reflect the batch size, and pass the values on using DCA. The model trainer receives the decrypted DP sums from the output and updates the model parameters.\n\n3.2.1 Distributed Bayesian linear regression with data projection\n\nAs an empirical example, we consider Bayesian linear regression (BLR) with data projection in the distributed setting. The standard BLR model depends on the data only through sufficient statistics, and the approach discussed in Section 3.2 can be used in a straightforward manner to fit the model by running a single round of DCA.\nThe more efficient BLR with projection (Algorithm 2) [16] reduces the data range, and hence the sensitivity, by non-linearly projecting all data points inside stricter bounds, which translates into less added noise. We can select the bounds to optimise the trade-off between bias and DP noise variance. In the distributed setting, we need to run an additional round of DCA and use some privacy budget to estimate the data standard deviations (stds). However, as shown by the test results (Figures 2 and 3), this can still achieve significantly better utility for a given privacy level.\nThe assumed bounds in Step 1 of Algorithm 2 would typically be available from general knowledge of the data.
The initial projection in Step 1 ensures the privacy of the scheme even if the bounds are invalid for some samples. We determine the optimal final projection thresholds p_j in Step 3 using the same general approach as [16]: we create an auxiliary data set of equal size as the original, with data generated as\n\nx_i ∼ N(0, I_d),   (4)\nβ ∼ N(0, λ_0 I),   (5)\ny_i | x_i ∼ N(x_iᵀ β, λ).   (6)\n\nWe then perform grid search on the auxiliary data with varying thresholds to find the one providing optimal prediction performance.\n\nAlgorithm 2 Distributed linear regression with projection\nInput: Data and target values (x_{ij}, y_i), j = 1, . . . , d, held by clients i ∈ {1, . . . , N}; number of clients N (public); assumed data and target bounds (−c_j, c_j), j = 1, . . . , d + 1 (public); privacy budget (ε, δ) (public)\nOutput: DP BLR model sufficient statistics of the projected data, Σ_{i=1}^{N} x̂_i x̂_iᵀ + η^{(1)} and Σ_{i=1}^{N} x̂_iᵀ ŷ_i + η^{(2)}, calculated using projection to the estimated optimal bounds\n1: Each client projects their data to the assumed bounds (−c_j, c_j) ∀j.\n2: Calculate marginal std estimates σ^{(1)}, . . . , σ^{(d+1)} by running Algorithm 1 using the assumed bounds for sensitivity and a chosen share of the privacy budget.\n3: Estimate the optimal projection thresholds p_j, j = 1, . . . , d + 1, as fractions of the std on auxiliary data. Each client then projects their data to the estimated optimal bounds (−p_j σ^{(j)}, p_j σ^{(j)}), j = 1, . . . , d + 1.\n4: Aggregate the unique terms in the DP sufficient statistics by running Algorithm 1 using the estimated optimal bounds for sensitivity and the remaining privacy budget, and combine the DP result vectors into the symmetric d × d matrix and d-dimensional vector of DP sufficient statistics.\n\nThe source code for our implementation is available through GitHub¹ and a more detailed description can be found in the Supplement.\n\n4 Experimental setup\n\nWe demonstrate the secure DP Bayesian learning scheme in practice by testing the performance of BLR with data projection, whose implementation was discussed in Section 3.2.1, along with the DCA (Algorithm 1), in the all-HbC-clients distributed setting (T = 0).\nWith the DCA, our primary interest is scalability. In the case of the BLR implementation, we are mostly interested in comparing the distributed algorithm to the trusted aggregator version, as well as comparing the performance of the straightforward BLR to the variant using data projection, since it is not clear a priori whether the extra privacy cost necessitated by the projection in the distributed setting is offset by the reduced noise level.\nWe use simulated data for the DCA scalability testing, and real data for the BLR tests. As real data, we use the Wine Quality [6] (split into white and red wines) and Abalone data sets from the UCI repository [18], as well as the Genomics of Drug Sensitivity in Cancer (GDSC) project data². The measured task in the GDSC data is to predict the drug sensitivity of cancer cell lines from gene expression data (see Supplement for a more detailed description). The data sets are assumed to be zero-centred. This assumption is not crucial but is made here for simplicity; non-zero data means can be estimated like the marginal stds at the cost of some added noise (see Section 3.2.1).\nFor estimating the marginal std, we also need to assume bounds for the data. For unbounded data, we can enforce arbitrary bounds simply by projecting all data inside the chosen bounds, although a very poor choice of bounds will lead to poor performance.
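The sufficient-statistics route of Section 3.2.1 can be sketched in a few lines. This is an illustrative sketch only, not the released implementation: the helper names, the simplified symmetric noise handling, and the conjugate-posterior form for the model of Eqs. (4)–(6) are our assumptions, and the noise scale `sigma` is assumed to have been calibrated from the bound `c` via the Gaussian mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_to_bounds(a, c):
    """Project every coordinate inside [-c, c] (Steps 1/3 of Algorithm 2)."""
    return np.clip(a, -c, c)

def dp_blr_posterior(X, y, c, sigma, lam=1.0, lam0=1.0):
    """SSP-style Bayesian linear regression: perturb the sufficient
    statistics X^T X and X^T y with Gaussian noise, then form the
    conjugate posterior mean for beta ~ N(0, lam0 I), y|x ~ N(x^T beta, lam)."""
    d = X.shape[1]
    Xc, yc = clip_to_bounds(X, c), clip_to_bounds(y, c)
    # Perturb only the unique entries of the symmetric matrix X^T X.
    E = np.triu(rng.normal(0.0, sigma, (d, d)))
    E = E + np.triu(E, 1).T
    S_xx = Xc.T @ Xc + E
    S_xy = Xc.T @ yc + rng.normal(0.0, sigma, d)
    prec = np.eye(d) / lam0 + S_xx / lam   # posterior precision
    return np.linalg.solve(prec, S_xy / lam)  # posterior mean of beta

# Toy usage with synthetic data (illustrative only).
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 0.1, 200)
print(dp_blr_posterior(X, y, c=10.0, sigma=0.0))  # close to beta_true
```

In the distributed setting, the two sums `Xc.T @ Xc` and `Xc.T @ yc` would not be computed locally but aggregated from per-client terms via Algorithm 1, with the Gaussian noise contributed in pieces by the clients.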
With real distributed data, the assumed bounds could differ from the actual data range. In the UCI tests we simulate this effect by scaling each data dimension to have a range of length 10, and then assuming bounds of [−7.5, 7.5], i.e., the assumed bounds clearly overestimate the length of the true range, thus adding more noise to the results. The actual scaling chosen here is arbitrary. With the GDSC data, the true ranges are mostly known due to the nature of the data (see Supplement).\n\n¹ https://github.com/DPBayes/dca-nips2017\n² http://www.cancerrxgene.org/, release 6.1, March 2017\n\nTable 1: DCA experiment average runtimes in seconds with 5 repeats, using M = 10 Compute nodes, N clients and vector length d.\n\n          N=10²    N=10³    N=10⁴     N=10⁵\nd=10       1.72     1.89     2.99      8.58\nd=10²      2.03     2.86    12.36     65.64\nd=10³      3.43    10.56   101.2     610.55\nd=10⁴     15.30    84.95   994.96   1592.29\n\nFigure 2: Median predictive accuracy measured by mean absolute error (MAE) on several UCI data sets, with error bars denoting the interquartile range (lower is better); panels: (a) red wine, (b) abalone, (c) white wine. The performance of the distributed methods (DDP, proj DDP) is indistinguishable from the corresponding undistributed algorithms (TA, proj TA), and the projection (proj TA, proj DDP) can clearly be beneficial for prediction performance. NP refers to the non-private version, TA to the trusted aggregator setting, and DDP to the distributed scheme.\n\nThe optimal projection thresholds are searched for using 10 (GDSC) or 20 (UCI) repeats on a grid with 20 points between 0.1 and 2.1 times the std of the auxiliary data set. In the search we use one common threshold for all data dimensions and a separate one for the target.\nAs the accuracy measure, we use prediction accuracy on a separate test data set. The size of the test set for UCI in Figure 2 is 500 for red wine, 1000 for white wine, and 1000 for abalone data.
The test set size for GDSC in Figure 3 is 100. For UCI, we compare the median performance measured by mean absolute error over 25 cross-validation (CV) runs, while for GDSC we measure the mean accuracy of predicting drug sensitivity (sensitive vs. insensitive) with Spearman's rank correlation over 25 CV runs. In both cases, we use input perturbation [11] and the trusted aggregator setting as baselines.\n\n5 Results\n\nTable 1 shows runtimes of a distributed Spark implementation of the DCA algorithm. The timing excludes encryption, but running AES for the data of the largest example would take less than 20 s on a single thread on a modern CPU. The runtime increases modestly as N or d is increased. This suggests that the prototype is reasonably scalable. Spark overhead sets a lower bound of approximately 1 s on the runtime for small problems. For large N and d, sequential communication at the 10 Compute threads is the main bottleneck. Larger N could be handled by introducing more Compute nodes, with clients only communicating with a subset of them.\nComparing the results on predictive error with and without projection (Fig. 2 and Fig.
3), it is clear that despite incurring an extra privacy cost for having to estimate the marginal standard deviations, using the projection can improve the results markedly for a given privacy budget.\nThe results also demonstrate that, compared to the trusted aggregator setting, the extra noise added due to the distributed setting with HbC clients is insignificant in practice, as the results of the distributed and trusted aggregator algorithms are effectively indistinguishable.\n\n[Figure 2 plots: MAE vs. ε (ε ∈ {1.0, 1.78, 3.16, 5.62, 10.0, 31.62}, δ = 0.0001) for NP, proj NP, TA, proj TA, DDP, proj DDP and input perturbation; panel settings: red wine d=11, sample size 1000; abalone d=8, sample size 3000; white wine d=11, sample size 3000; 25 repeats.]\n\nFigure 3: Mean drug sensitivity prediction accuracy on the GDSC data set, with error bars denoting the standard deviation over CV runs (higher is better); panels: (a) drug sensitivity prediction, (b) drug sensitivity prediction, selected methods. Distributed results (DDP, proj DDP) do not differ markedly from the corresponding trusted aggregator (TA, proj TA) results. The projection (proj TA, proj DDP) is clearly beneficial for performance. The actual sample size varies between drugs. NP refers to the non-private version, TA to the trusted aggregator setting, and DDP to the distributed scheme.\n\n6 Related work\n\nThe idea of distributed private computation through addition of noise generated in a distributed manner was first proposed by Dwork et al. [10].
However, to the best of our knowledge, there is no prior work on secure DP Bayesian statistical inference in the distributed setting.\nIn machine learning, [20] presented the first method for aggregating classifiers in a DP manner, but their approach is sensitive to the number of parties and the sizes of the data sets held by each party, and cannot be applied in a completely distributed setting. [21] improved upon this with an algorithm for distributed DP stochastic gradient descent that works for any number of parties. The privacy of the algorithm is based on perturbation of gradients, which cannot be directly applied to the efficient SSP mechanism. The idea of aggregating classifiers was further refined in [15] through a method that uses an auxiliary public data set to improve the performance.\nThe first practical method for implementing DP queries in a distributed manner was the distributed Laplace mechanism presented in [22]. The distributed Laplace mechanism could be used instead of the Gaussian mechanism if pure ε-DP is required, but the method, like those in [20, 21], needs homomorphic encryption, which is computationally more demanding, especially for high-dimensional data.\nThere is a wealth of literature on secure distributed computation of DP sum queries, as reviewed in [14]. The methods of [23, 2, 3, 14] also include different forms of noise scaling to provide collusion resistance and/or fault tolerance, where the latter requires a separate recovery round after data holder failures, which is not needed by DCA. [12] discusses low-level details of an efficient implementation of the distributed Laplace mechanism.\nFinally, [27] presents several proofs related to the SMC setting and introduces a protocol for generating approximately Gaussian noise in a distributed manner. Compared to their protocol, our method of noise addition is considerably simpler and faster, and produces exactly rather than approximately Gaussian noise with a negligible increase in noise level.\n\n7 Discussion\n\nWe have presented a general framework for performing DP Bayesian learning securely in a distributed setting. Our method combines a practical SMC method for calculating secure sum queries with efficient Bayesian DP learning techniques adapted to the distributed setting.\n\n[Figure 3 plots: predictive accuracy vs. ε (ε ∈ {1.0, 3.0, 5.0, 7.5, 10.0}, δ = 0.0001); d=10, sample size 840, 25 CV runs.]\n\nDP methods are based on adding sufficient noise to effectively mask the contribution of any single sample. The extra loss in accuracy due to DP tends to diminish as the number of samples increases, and efficient DP estimation methods converge to their non-private counterparts in this limit [13, 16]. A distributed DP learning method can significantly increase the effective number of samples by combining data held by several parties, making DP learning significantly more effective.\nConsidering the DP and the SMC components separately, although both are necessary for efficient privacy-aware learning, it is clear that the choice of method to use for each sub-problem can be made largely independently. Assessing these separately, we can easily change the privacy mechanism from the Gaussian used in Algorithm 1 to the Laplace mechanism, e.g. by utilising one of the distributed Laplace noise addition methods presented in [14], to obtain a pure ε-DP method.
If\nneed be, the secure sum algorithm in our method can also be easily replaced with one that better suits\nthe security requirements at hand.\nWhile the noise introduced for DP will not improve the performance of an otherwise good learning\nalgorithm, a DP solution to a learning problem can yield better results if the DP guarantees allow\naccess to more data than is available without privacy. Our distributed method can further help make\nthis more ef\ufb01cient by securely and privately combining data from multiple parties.\n\nAcknowledgements\n\nThis work was funded by the Academy of Finland [Centre of Excellence COIN and projects 259440,\n278300, 292334, 294238, 297741, 303815, 303816], the Japan Agency for Medical Research and\nDevelopment (AMED), and JST CREST [JPMJCR1688].\n\nReferences\n[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep\n\nlearning with differential privacy. In Proc. CCS 2016, 2016.\n\n[2] G. \u00c1cs and C. Castelluccia. I have a DREAM! (DiffeRentially privatE smArt Metering). In\nProc. 13th International Conference in Information Hiding (IH 2011), pages 118\u2013132, 2011.\n\n[3] T. H. H. Chan, E. Shi, and D. Song. Privacy-preserving stream aggregation with fault tolerance.\nIn Proc. 16th Int. Conf. on Financial Cryptography and Data Security (FC 2012), pages\n200\u2013214, 2012.\n\n[4] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Advances in Neural\n\nInformation Processing Systems 21, pages 289\u2013296. 2009.\n\n[5] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk mini-\n\nmization. J. Mach. Learn. Res., 12:1069\u20131109, 2011.\n\n[6] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data\n\nmining from physicochemical properties. Decision Support Systems, 47(4):547\u2013553, 2009.\n\n[7] C. Dimitrakakis, B. Nelson, A. Mitrokotsa, and B. I. P. Rubinstein. 
Robust and private Bayesian inference. In Proc. ALT 2014, pages 291–305, 2014.

[8] C. Dimitrakakis, B. Nelson, Z. Zhang, A. Mitrokotsa, and B. I. P. Rubinstein. Differential privacy for Bayesian inference through posterior sampling. Journal of Machine Learning Research, 18(11):1–39, 2017.

[9] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[10] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology (EUROCRYPT 2006), pages 486–503, 2006.

[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. 3rd Theory of Cryptography Conference (TCC 2006), pages 265–284, 2006.

[12] F. Eigner, A. Kate, M. Maffei, F. Pampaloni, and I. Pryvalov. Differentially private data aggregation with optimal utility. In Proceedings of the 30th Annual Computer Security Applications Conference, pages 316–325. ACM, 2014.

[13] J. Foulds, J. Geumlek, M. Welling, and K. Chaudhuri. On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI'16, pages 192–201, 2016.

[14] S. Goryczka and L. Xiong. A comprehensive comparison of multiparty secure additions with differential privacy. IEEE Transactions on Dependable and Secure Computing, 2015.

[15] J. Hamm, P. Cao, and M. Belkin. Learning privately from multiparty data. In ICML, 2016.

[16] A. Honkela, M. Das, A. Nieminen, O. Dikmen, and S. Kaski. Efficient differentially private learning improves drug sensitivity prediction. 2016. arXiv:1606.02109 [stat.ML].

[17] J. Jälkö, O. Dikmen, and A. Honkela. Differentially private variational inference for non-conjugate models.
In Proc. 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017), 2017.

[18] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[19] M. Park, J. Foulds, K. Chaudhuri, and M. Welling. Variational Bayes in private settings (VIPS). 2016. arXiv:1611.00340.

[20] M. Pathak, S. Rane, and B. Raj. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems 23, pages 1876–1884, 2010.

[21] A. Rajkumar and S. Agarwal. A differentially private stochastic gradient descent algorithm for multiparty classification. In Proc. AISTATS 2012, pages 933–941, 2012.

[22] V. Rastogi and S. Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proc. 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD 2010), pages 735–746. ACM, 2010.

[23] E. Shi, T. Chan, E. Rieffel, R. Chow, and D. Song. Privacy-preserving aggregation of time-series data. In Proc. NDSS, 2011.

[24] A. Smith. Efficient, differentially private point estimators. 2008. arXiv:0809.4794 [cs.CR].

[25] Y. Wang, S. E. Fienberg, and A. J. Smola. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In Proc. ICML 2015, pages 2493–2502, 2015.

[26] O. Williams and F. McSherry. Probabilistic inference and differential privacy. In Adv. Neural Inf. Process. Syst. 23, 2010.

[27] G. Wu, Y. He, J. Wu, and X. Xia. Inherit differential privacy in distributed setting: Multiparty randomized function computation. In 2016 IEEE Trustcom/BigDataSE/ISPA, pages 921–928, 2016.

[28] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. PVLDB, 5(11):1364–1375, 2012.

[29] Z. Zhang, B. Rubinstein, and C. Dimitrakakis.
On the differential privacy of Bayesian inference. In Proc. AAAI 2016, 2016.