{"title": "Bootstrap Model Aggregation for Distributed Statistical Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1795, "page_last": 1803, "abstract": "In distributed, or privacy-preserving learning, we are often given a set of probabilistic models estimated from different local repositories, and asked to combine them into a single model that gives efficient statistical estimation. A simple method is to linearly average the parameters of the local models, which, however, tends to be degenerate or not applicable on non-convex models, or models with different parameter dimensions. One more practical strategy is to generate bootstrap samples from the local models, and then learn a joint model based on the combined bootstrap set. Unfortunately, the bootstrap procedure introduces additional noise and can significantly deteriorate the performance. In this work, we propose two variance reduction methods to correct the bootstrap noise, including a weighted M-estimator that is both statistically efficient and practically powerful. Both theoretical and empirical analysis is provided to demonstrate our methods.", "full_text": "Bootstrap Model Aggregation for Distributed Statistical Learning\n\nJun Han\n\nDepartment of Computer Science\n\nDartmouth College\n\njun.han.gr@dartmouth.edu\n\nQiang Liu\n\nDepartment of Computer Science\n\nDartmouth College\n\nqiang.liu@dartmouth.edu\n\nAbstract\n\nIn distributed, or privacy-preserving learning, we are often given a set of probabilistic\nmodels estimated from different local repositories, and asked to combine them\ninto a single model that gives efficient statistical estimation. A simple method is to\nlinearly average the parameters of the local models, which, however, tends to be\ndegenerate or not applicable on non-convex models, or models with different parameter\ndimensions. 
One more practical strategy is to generate bootstrap samples from\nthe local models, and then learn a joint model based on the combined bootstrap\nset. Unfortunately, the bootstrap procedure introduces additional noise and can\nsignificantly deteriorate the performance. In this work, we propose two variance\nreduction methods to correct the bootstrap noise, including a weighted M-estimator\nthat is both statistically efficient and practically powerful. Both theoretical and\nempirical analysis is provided to demonstrate our methods.\n\n1 Introduction\n\nModern data science applications increasingly involve learning complex probabilistic models over\nmassive datasets. In many cases, the datasets are distributed across multiple machines at different\nlocations, between which communication is expensive or restricted; this can be either because the\ndata volume is too large to store or process in a single machine, or due to privacy constraints such as\nthose in healthcare or financial systems. There has been a recent growing interest in developing\ncommunication-efficient algorithms for probabilistic learning with distributed datasets; see e.g., Boyd\net al. (2011); Zhang et al. (2012); Dekel et al. (2012); Liu and Ihler (2014); Rosenblatt and Nadler\n(2014) and references therein.\nThis work focuses on a one-shot approach for distributed learning, in which we first learn a set\nof local models from local machines, and then combine them in a fusion center to form a single\nmodel that integrates all the information in the local models. This approach is highly efficient in\nboth computation and communication costs, but poses a challenge in designing statistically efficient\ncombination strategies. 
Many studies have focused on a simple linear averaging method that\nlinearly averages the parameters of the local models (e.g., Zhang et al., 2012, 2013; Rosenblatt\nand Nadler, 2014); although nearly optimal asymptotic error rates can be achieved, this simple\nmethod tends to degenerate in practical scenarios for models with non-convex log-likelihood or\nnon-identifiable parameters (such as latent variable models, and neural models), and is not applicable\nat all for models with non-additive parameters (e.g., when the parameters have discrete or categorical\nvalues, or the parameter dimensions of the local models are different).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nA better strategy that overcomes all these practical limitations of linear averaging is the KL-averaging\nmethod (Liu and Ihler, 2014; Merugu and Ghosh, 2003), which finds a model that minimizes the\nsum of Kullback-Leibler (KL) divergence to all the local models. In this way, we directly combine\nthe models, instead of the parameters. The exact KL-averaging is not computationally tractable\nbecause of the intractability of calculating KL divergence; a practical approach is to draw (bootstrap)\nsamples from the given local models, and then learn a combined model based on all the bootstrap\ndata. Unfortunately, the bootstrap noise can easily dominate in this approach and we need a very large\nbootstrap sample size to obtain accurate results. In Section 3, we show that the MSE of the estimator\nobtained from the naive way is O(N^{-1} + (dn)^{-1}), where N is the total size of the observed data,\nn is the bootstrap sample size of each local model, and d is the number of machines. 
This means that\nto ensure an MSE of O(N^{-1}), which is guaranteed by the centralized method and the simple linear\naveraging, we need dn \u2273 N; this is unsatisfying since N is usually very large by assumption.\nIn this work, we use variance reduction techniques to cancel out the bootstrap noise and get better\nKL-averaging estimates. The difficulty of this task is first illustrated using a relatively straightforward\ncontrol variates method, which unfortunately suffers from some of the practical drawbacks of the linear\naveraging method due to the use of a linear correction term. We then propose a better method based\non a weighted M-estimator, which inherits all the practical advantages of KL-averaging. On the\ntheoretical side, we show that our methods give an MSE of O(N^{-1} + (dn^2)^{-1}), which significantly\nimproves over the original bootstrap estimator. Empirical studies are provided to verify our theoretical\nresults and demonstrate the practical advantages of our methods.\nThis paper is organized as follows. Section 2 introduces the background, and Section 3 introduces\nour methods and analyzes their theoretical properties. We present numerical results in Section 4 and\nconclude the paper in Section 5. Detailed proofs can be found in the appendix.\n\n2 Background and Problem Setting\n\nSuppose we have a dataset X = {x_j, j = 1, 2, ..., N} of size N, i.i.d. drawn from a probabilistic\nmodel p(x | \u03b8\u2217) within a parametric family P = {p(x | \u03b8) : \u03b8 \u2208 \u0398}; here \u03b8\u2217 is the unknown true\nparameter that we want to estimate based on X. In the distributed setting, the dataset X is partitioned\ninto d disjoint subsets, X = \u222a_{k=1}^d X^k, where X^k denotes the k-th subset which we assume is stored\nin a local machine. 
For simplicity, we assume all the subsets have the same data size (N/d).\nThe traditional maximum likelihood estimator (MLE) provides a natural way for estimating the true\nparameter \u03b8\u2217 based on the whole dataset X,\n\nGlobal MLE: \\hat{\\theta}_{\\mathrm{mle}} = \\arg\\max_{\\theta \\in \\Theta} \\sum_{k=1}^{d} \\sum_{j=1}^{N/d} \\log p(x_j^k \\mid \\theta), \\quad \\text{where } X^k = \\{x_j^k\\}. \\qquad (1)\n\nHowever, directly calculating the global MLE is challenging due to the distributed partition of the\ndataset. Although distributed optimization algorithms exist (e.g., Boyd et al., 2011; Shamir et al.,\n2014), they require iterative communication between the local machines and a fusion center, which\ncan be very time consuming in distributed settings, for which the number of communication rounds\nforms the main bottleneck (regardless of the amount of information communicated at each round).\nWe instead consider a simpler one-shot approach that first learns a set of local models based on each\nsubset, and then sends them to a fusion center in which they are combined into a global model that\ncaptures all the information. We assume each of the local models is estimated using an MLE based on\nsubset X^k from the k-th machine:\n\nLocal MLE: \\hat{\\theta}_k = \\arg\\max_{\\theta \\in \\Theta} \\sum_{j=1}^{N/d} \\log p(x_j^k \\mid \\theta), \\quad \\text{where } k \\in [d] = \\{1, 2, \\cdots, d\\}. \\qquad (2)\n\nThe major problem is how to combine these local models into a global model. The simplest way is to\nlinearly average all local MLE parameters:\n\nLinear Average: \\hat{\\theta}_{\\mathrm{linear}} = \\frac{1}{d} \\sum_{k=1}^{d} \\hat{\\theta}_k.\n\nComprehensive theoretical analysis has been done for \u02c6\u03b8linear (e.g., Zhang et al., 2012; Rosenblatt and\nNadler, 2014), which shows that it has an asymptotic MSE of E||\u02c6\u03b8linear \u2212 \u03b8\u2217||^2 = O(N^{-1}). 
In fact,\nit is equivalent to the global MLE \u02c6\u03b8mle up to the first order O(N^{-1}), and several improvements have\nbeen developed to improve the second order term (e.g., Zhang et al., 2012; Huang and Huo, 2015).\nUnfortunately, the linear averaging method can easily break down in practice, or is not even applicable\nwhen the underlying model is complex. For example, it may work poorly when the likelihood has\nmultiple modes, or when there exist non-identifiable parameters for which different parameter values\ncorrespond to the same model (also known as the label-switching problem); models of this kind include\nlatent variable models and neural networks, and appear widely in machine learning. In addition, the\nlinear averaging method is obviously not applicable when the local models have different numbers of\nparameters (e.g., Gaussian mixtures with unknown numbers of components), or when the parameters\nare simply not additive (such as parameters with discrete or categorical values). Further discussions\non the practical limitations of the linear averaging method can be found in Liu and Ihler (2014).\nAll these problems of linear averaging can be well addressed by a KL-averaging method which\naverages the model (instead of the parameters) by finding a geometric center of the local models\nin terms of KL divergence (Merugu and Ghosh, 2003; Liu and Ihler, 2014). 
Specifically, it finds a\nmodel p(x | \u03b8\u2217_KL) where \u03b8\u2217_KL is obtained by \u03b8\u2217_KL = arg min_\u03b8 \u2211_{k=1}^d KL(p(x | \u02c6\u03b8k) || p(x | \u03b8)), which\nis equivalent to,\n\nExact KL Estimator: \\theta^*_{\\mathrm{KL}} = \\arg\\max_{\\theta \\in \\Theta} \\left\\{ \\eta(\\theta) \\equiv \\sum_{k=1}^{d} \\int p(x \\mid \\hat{\\theta}_k) \\log p(x \\mid \\theta) \\, dx \\right\\}. \\qquad (3)\n\nLiu and Ihler (2014) studied the theoretical properties of the KL-averaging method, and showed\nthat it exactly recovers the global MLE, that is, \u03b8\u2217_KL = \u02c6\u03b8mle, when the distribution family is a full\nexponential family, and achieves an optimal asymptotic error rate (up to the second order) among all\nthe possible combination methods of {\u02c6\u03b8k}.\nDespite the attractive properties, the exact KL-averaging is not computationally tractable except for\nvery simple models. Liu and Ihler (2014) suggested a naive bootstrap method for approximation: it\ndraws a parametric bootstrap sample {\u02dcx^k_j}_{j=1}^n from each local model p(x | \u02c6\u03b8k), k \u2208 [d], and uses it to\napproximate each integral in (3). 
The optimization in (3) then reduces to a tractable one,\n\nKL-Naive Estimator: \\hat{\\theta}_{\\mathrm{KL}} = \\arg\\max_{\\theta \\in \\Theta} \\left\\{ \\hat{\\eta}(\\theta) \\equiv \\frac{1}{n} \\sum_{k=1}^{d} \\sum_{j=1}^{n} \\log p(\\tilde{x}_j^k \\mid \\theta) \\right\\}. \\qquad (4)\n\nIntuitively, we can treat each \u02dcX^k = {\u02dcx^k_j}_{j=1}^n as an approximation of the original subset X^k =\n{x^k_j}_{j=1}^{N/d}, and hence it can be used to approximate the global MLE in (1).\nUnfortunately, as we show in the sequel, the accuracy of \u02c6\u03b8KL critically depends on the bootstrap\nsample size n, and one would need n to be nearly as large as the original data size N/d to make \u02c6\u03b8KL\nachieve the baseline asymptotic rate O(N^{-1}) that the simple linear averaging achieves; this is highly\nundesirable since N is often assumed to be large in distributed learning settings.\n\n3 Main Results\n\nWe propose two variance reduction techniques for improving the KL-averaging estimates and discuss\ntheir theoretical and practical properties. We start with a concrete analysis on the KL-naive estimator\n\u02c6\u03b8KL, which was missing in Liu and Ihler (2014).\n\nAssumption 1. 1. log p(x | \u03b8), \u2202 log p(x | \u03b8)/\u2202\u03b8, and \u2202^2 log p(x | \u03b8)/\u2202\u03b8\u2202\u03b8^\u22a4 are continuous for \u2200x \u2208 X and\n\u2200\u03b8 \u2208 \u0398; 2. \u2202^2 log p(x | \u03b8)/\u2202\u03b8\u2202\u03b8^\u22a4 is positive definite and C1 \u2264 \u2016\u2202^2 log p(x | \u03b8)/\u2202\u03b8\u2202\u03b8^\u22a4\u2016 \u2264 C2 in a neighborhood of \u03b8\u2217 for\n\u2200x \u2208 X, and C1, C2 are some positive constants.\nTheorem 2. 
Under Assumption 1, \u02c6\u03b8KL is a consistent estimator of \u03b8\u2217_KL as n \u2192 \u221e, and\n\nE(\\hat{\\theta}_{\\mathrm{KL}} - \\theta^*_{\\mathrm{KL}}) = o\\left(\\frac{1}{dn}\\right), \\qquad E\\|\\hat{\\theta}_{\\mathrm{KL}} - \\theta^*_{\\mathrm{KL}}\\|^2 = O\\left(\\frac{1}{dn}\\right),\n\nwhere d is the number of machines and n is the bootstrap sample size for each local model p(x | \u02c6\u03b8k).\nThe proof is in Appendix A. Because the MSE between the exact KL estimator \u03b8\u2217_KL and the true\nparameter \u03b8\u2217 is O(N^{-1}) as shown in Liu and Ihler (2014), the MSE between \u02c6\u03b8KL and the true\nparameter \u03b8\u2217 is\n\nE\\|\\hat{\\theta}_{\\mathrm{KL}} - \\theta^*\\|^2 \\approx E\\|\\hat{\\theta}_{\\mathrm{KL}} - \\theta^*_{\\mathrm{KL}}\\|^2 + E\\|\\theta^*_{\\mathrm{KL}} - \\theta^*\\|^2 = O(N^{-1} + (dn)^{-1}). \\qquad (5)\n\nTo make the MSE between \u02c6\u03b8KL and \u03b8\u2217 equal O(N^{-1}), as achieved by the simple linear\naveraging, we need to draw dn \u2273 N bootstrap data points in total, which is undesirable since N is often\nassumed to be very large in the distributed learning setting (one exception is when the\ndata is distributed due to privacy constraints, in which case N may be relatively small).\nTherefore, it is a critical task to develop more accurate methods that can reduce the noise introduced\nby the bootstrap process. In the sequel, we introduce two variance reduction techniques to achieve\nthis goal. 
One is based on a (linear) control variates method that improves \u02c6\u03b8KL using a linear correction\nterm, and another is a multiplicative control variates method that modifies the M-estimator in (4) by\nassigning each bootstrap data point a positive weight to cancel the noise. We show that both\nmethods achieve a better O(N^{-1} + (dn^2)^{-1}) rate under mild assumptions, while the second method\nhas more attractive practical advantages.\n\n3.1 Control Variates Estimator\n\nThe control variates method is a technique for variance reduction on Monte Carlo estimation (e.g.,\nWilson, 1984). It introduces a set of correlated auxiliary random variables with known expectations\nor asymptotics (referred to as the control variates), to balance the variation of the original estimator. In\nour case, since each bootstrapped subsample \u02dcX^k = {\u02dcx^k_j}_{j=1}^n is known to be drawn from the local\nmodel p(x | \u02c6\u03b8k), we can construct a control variate by re-estimating the local model based on \u02dcX^k:\n\nBootstrapped Local MLE: \\tilde{\\theta}_k = \\arg\\max_{\\theta \\in \\Theta} \\sum_{j=1}^{n} \\log p(\\tilde{x}_j^k \\mid \\theta), \\quad \\text{for } k \\in [d], \\qquad (6)\n\nwhere \u02dc\u03b8k is known to converge to \u02c6\u03b8k asymptotically. This allows us to define the following control\nvariates estimator:\n\nKL-Control Estimator: \\hat{\\theta}_{\\mathrm{KL-C}} = \\hat{\\theta}_{\\mathrm{KL}} + \\sum_{k=1}^{d} B_k (\\tilde{\\theta}_k - \\hat{\\theta}_k), \\qquad (7)\n\nwhere Bk is a matrix chosen to minimize the asymptotic variance of \u02c6\u03b8KL\u2212C; our derivation shows\nthat the asymptotically optimal Bk has a form of\n\nB_k = -\\left( \\sum_{k=1}^{d} I(\\hat{\\theta}_k) \\right)^{-1} I(\\hat{\\theta}_k), \\quad k \\in [d], \\qquad (8)\n\nwhere I(\u02c6\u03b8k) is the empirical Fisher information matrix of the local model p(x | \u02c6\u03b8k). 
Note that this\ndifferentiates our method from the typical control variates methods where Bk is instead estimated\nusing empirical covariance between the control variates and the original estimator (in our case, we\ncannot directly estimate the covariance because \u02c6\u03b8KL and \u02dc\u03b8k are not averages of i.i.d. samples). The\nprocedure of our method is summarized in Algorithm 1. Note that the form of (7) shares some\nsimilarity with the one-step estimator in Huang and Huo (2015), but Huang and Huo (2015) focuses\non improving the linear averaging estimator, and is different from our setting.\nWe analyze the asymptotic property of the estimator \u02c6\u03b8KL\u2212C, and summarize it as follows.\n\nTheorem 3. Under Assumption 1, \u02c6\u03b8KL\u2212C is a consistent estimator of \u03b8\u2217_KL as n \u2192 \u221e, and its\nasymptotic MSE is guaranteed to be smaller than that of the KL-naive estimator \u02c6\u03b8KL, that is,\n\nn E\\|\\hat{\\theta}_{\\mathrm{KL-C}} - \\theta^*_{\\mathrm{KL}}\\|^2 < n E\\|\\hat{\\theta}_{\\mathrm{KL}} - \\theta^*_{\\mathrm{KL}}\\|^2, \\quad \\text{as } n \\to \\infty.\n\nIn addition, when N > n \u00d7 d, \u02c6\u03b8KL\u2212C has \u201czero-variance\u201d in that E\u2016\u02c6\u03b8KL\u2212C \u2212 \u03b8\u2217_KL\u2016^2 = O((dn^2)^{-1}).\nFurther, in terms of estimating the true parameter, we have\n\nE\\|\\hat{\\theta}_{\\mathrm{KL-C}} - \\theta^*\\|^2 = O(N^{-1} + (dn^2)^{-1}). 
\\qquad (9)\n\nAlgorithm 1 KL-Control Variates Method for Combining Local Models\n1: Input: Local model parameters {\u02c6\u03b8k}_{k=1}^d.\n2: Generate bootstrap data {\u02dcx^k_j}_{j=1}^n from each p(x | \u02c6\u03b8k), for k \u2208 [d].\n3: Calculate the KL-Naive estimator, \u02c6\u03b8KL = arg max_{\u03b8\u2208\u0398} \u2211_{k=1}^d (1/n) \u2211_{j=1}^n log p(\u02dcx^k_j | \u03b8).\n4: Re-estimate the local parameters \u02dc\u03b8k via (6) based on the bootstrapped data subset {\u02dcx^k_j}_{j=1}^n, for k \u2208 [d].\n5: Estimate the empirical Fisher information matrix I(\\hat{\\theta}_k) = \\frac{1}{n} \\sum_{j=1}^{n} \\frac{\\partial \\log p(\\tilde{x}_j^k \\mid \\hat{\\theta}_k)}{\\partial \\theta} \\left( \\frac{\\partial \\log p(\\tilde{x}_j^k \\mid \\hat{\\theta}_k)}{\\partial \\theta} \\right)^{\\top}, for k \u2208 [d].\n6: Output: The parameter \u02c6\u03b8KL\u2212C of the combined model is given by (7) and (8).\n\nThe proof is in Appendix B. From (9), we can see that the MSE between \u02c6\u03b8KL\u2212C and \u03b8\u2217 reduces to\nO(N^{-1}) as long as n \u2273 (N/d)^{1/2}, which is a significant improvement over the KL-naive method\nwhich requires n \u2273 N/d. When the goal is to achieve an O(\u03b5) MSE, we would just need to take\nn \u2273 1/(d\u03b5)^{1/2} when N > 1/\u03b5, that is, n does not need to increase with N when N is very large.\nMeanwhile, because \u02c6\u03b8KL\u2212C requires a linear combination of \u02c6\u03b8k, \u02dc\u03b8k and \u02c6\u03b8KL, it carries the practical\ndrawbacks of the linear averaging estimator as we discuss in Section 2. 
This motivates us to develop\nanother KL-weighted method shown in the next section, which achieves the same asymptotic\nefficiency as \u02c6\u03b8KL\u2212C, while still inheriting all the practical advantages of KL-averaging.\n\n3.2 KL-Weighted Estimator\n\nOur KL-weighted estimator is based on directly modifying the M-estimator for \u02c6\u03b8KL in (4), by\nassigning each bootstrap data point \u02dcx^k_j a positive weight according to the probability ratio p(\u02dcx^k_j | \u02c6\u03b8k)/p(\u02dcx^k_j | \u02dc\u03b8k)\nof the actual local model p(x | \u02c6\u03b8k) and the re-estimated model p(x | \u02dc\u03b8k) in (6). 
Here\nthe probability ratio acts like a multiplicative control variate (Nelson, 1987), which has the advantage\nof being positive and applicable to non-identifiable, non-additive parameters. Our estimator is defined\nas\n\nKL-Weighted Estimator: \\hat{\\theta}_{\\mathrm{KL-W}} = \\arg\\max_{\\theta \\in \\Theta} \\left\\{ \\tilde{\\eta}(\\theta) \\equiv \\sum_{k=1}^{d} \\frac{1}{n} \\sum_{j=1}^{n} \\frac{p(\\tilde{x}_j^k \\mid \\hat{\\theta}_k)}{p(\\tilde{x}_j^k \\mid \\tilde{\\theta}_k)} \\log p(\\tilde{x}_j^k \\mid \\theta) \\right\\}. \\qquad (10)\n\nWe first show that this weighted estimator \u02dc\u03b7(\u03b8) gives a more accurate estimation of \u03b7(\u03b8) in (3) than\nthe straightforward estimator \u02c6\u03b7(\u03b8) defined in (4) for any \u03b8 \u2208 \u0398.\nLemma 4. As n \u2192 \u221e, \u02dc\u03b7(\u03b8) is a more accurate estimator of \u03b7(\u03b8) than \u02c6\u03b7(\u03b8), in that\n\nn \\mathrm{Var}(\\tilde{\\eta}(\\theta)) \\le n \\mathrm{Var}(\\hat{\\eta}(\\theta)), \\quad \\text{as } n \\to \\infty, \\quad \\text{for any } \\theta \\in \\Theta. \\qquad (11)\n\nThis estimator is motivated by Henmi et al. (2007) in which the same idea is applied to reduce the\nasymptotic variance in importance sampling. A similar result is also found in Hirano et al. (2003), in\nwhich it is shown that a similar weighted estimator with estimated propensity score is more efficient\nthan the estimator using true propensity score in estimating the average treatment effects. 
Although\nthey are a very powerful tool, results of this type do not seem to be widely known in machine learning,\nexcept for several applications in semi-supervised learning (Sokolovska et al., 2008; Kawakita and\nKanamori, 2013), and off-policy learning (Li et al., 2015).\nWe go a step further to analyze the asymptotic property of our weighted M-estimator \u02c6\u03b8KL\u2212W that\nmaximizes \u02dc\u03b7(\u03b8). It is natural to expect that the asymptotic variance of \u02c6\u03b8KL\u2212W is smaller than that of\n\u02c6\u03b8KL based on maximizing \u02c6\u03b7(\u03b8); this is shown in the following theorem.\n\nTheorem 5. 
Under Assumption 1, \u02c6\u03b8KL\u2212W is a consistent estimator of \u03b8\u2217_KL as n \u2192 \u221e, and has a\nbetter asymptotic variance than \u02c6\u03b8KL, that is,\n\nn E\\|\\hat{\\theta}_{\\mathrm{KL-W}} - \\theta^*_{\\mathrm{KL}}\\|^2 \\le n E\\|\\hat{\\theta}_{\\mathrm{KL}} - \\theta^*_{\\mathrm{KL}}\\|^2, \\quad \\text{when } n \\to \\infty.\n\nWhen N > n \u00d7 d, we have E\u2016\u02c6\u03b8KL\u2212W \u2212 \u03b8\u2217_KL\u2016^2 = O((dn^2)^{-1}) as n \u2192 \u221e. Further, its MSE for\nestimating the true parameter \u03b8\u2217 is\n\nE\\|\\hat{\\theta}_{\\mathrm{KL-W}} - \\theta^*\\|^2 = O(N^{-1} + (dn^2)^{-1}). \\qquad (12)\n\nAlgorithm 2 KL-Weighted Method for Combining Local Models\n1: Input: Local MLEs {\u02c6\u03b8k}_{k=1}^d.\n2: Generate bootstrap sample {\u02dcx^k_j}_{j=1}^n from each p(x | \u02c6\u03b8k), for k \u2208 [d].\n3: Re-estimate the local model parameter \u02dc\u03b8k in (6) based on bootstrap subsample {\u02dcx^k_j}_{j=1}^n, for each k \u2208 [d].\n4: Output: The parameter \u02c6\u03b8KL\u2212W of the combined model is given by (10).\n\nThe proof is in Appendix C. This result is parallel to Theorem 3 for the linear control variates\nestimator \u02c6\u03b8KL\u2212C. Similarly, it reduces to an O(N^{-1}) rate once we take n \u2273 (N/d)^{1/2}.\nMeanwhile, unlike the linear control variates estimator, \u02c6\u03b8KL\u2212W inherits all the practical advantages of\nKL-averaging: it can be applied whenever the KL-naive estimator can be applied, including for models\nwith non-identifiable parameters, or with different numbers of parameters. The implementation of\n\u02c6\u03b8KL\u2212W is also much more convenient (see Algorithm 2), since it does not need to calculate the\nFisher information matrix as required by Algorithm 1.\n\n4 Empirical Experiments\n\nWe study the empirical performance of our methods on both simulated and real world datasets. 
We\nfirst numerically verify the convergence rates predicted by our theoretical results using simulated\ndata, and then demonstrate the effectiveness of our methods in a challenging setting when the numbers\nof parameters of the local models are different as decided by Bayesian information criterion (BIC).\nFinally, we conclude our experiments by testing our methods on a set of real world datasets.\nThe models we tested include probabilistic principal components analysis (PPCA), mixture of\nPPCA and Gaussian Mixture Models (GMM). GMM is given by p(x | \u03b8) = \u2211_{s=1}^m \u03b1_s N(\u00b5_s, \u03a3_s)\nwhere \u03b8 = (\u03b1_s, \u00b5_s, \u03a3_s). The PPCA model is defined with the help of a hidden variable t, p(x | \u03b8) =\n\u222b p(x | t; \u03b8) p(t | \u03b8) dt, where p(x | t; \u03b8) = N(x; \u00b5 + W t, \u03c3^2), p(t | \u03b8) = N(t; 0, I) and\n\u03b8 = {\u00b5, W, \u03c3^2}. The mixture of PPCA is p(x | \u03b8) = \u2211_{s=1}^m \u03b1_s p_s(x | \u03b8_s), where \u03b8 = {\u03b1_s, \u03b8_s}_{s=1}^m\nand each p_s(x | \u03b8_s) is a PPCA model.\nBecause all these models are latent variable models with unidentifiable parameters, the direct linear\naveraging method is not applicable. For GMM, it is still possible to use a matched linear averaging\nwhich matches the mixture components of the different local models by minimizing a symmetric KL\ndivergence; the same idea can be used on our linear control variates method to make it applicable to\nGMM. On the other hand, because the parameters of PPCA-based models are unidentifiable up to\narbitrary orthonormal transforms, linear averaging and linear control variates can no longer be applied\neasily. 
We use expectation maximization (EM) to learn the parameters in all these three models.\n\n4.1 Numerical Verification of the Convergence Rates\nWe start with verifying the convergence rates in (5), (9) and (12) of the MSE E||\u02c6\u03b8 \u2212 \u03b8\u2217||^2 of the different\nestimators for estimating the true parameters. Because there is also a non-identifiability problem in\ncalculating the MSE, we again use the symmetric KL divergence to match the mixture components,\nand evaluate the MSE on W W^\u22a4 to avoid the non-identifiability w.r.t. orthonormal transforms. To\nverify the convergence rates w.r.t. n, we fix d and let the total dataset size N be very large so that N^{-1} is\nnegligible. Figure 1 shows the results when we vary n, where we can see that the MSE of KL-naive\n\u02c6\u03b8KL is O(n^{-1}) while that of KL-control \u02c6\u03b8KL\u2212C and KL-weighted \u02c6\u03b8KL\u2212W are O(n^{-2}); both are\nconsistent with our results in (5), (9) and (12).\nIn Figure 2(a), we increase the number d of local machines, while using a fixed n and a very large\nN, and find that both \u02c6\u03b8KL and \u02c6\u03b8KL\u2212W scale as O(d^{-1}) as expected. Note that since the total\nobservation data size N is fixed, the number of data in each local machine is (N/d) and it decreases\nas we increase d. It is interesting to see that the performance of the KL-based methods actually\nincreases with more partitions; this is, of course, with a cost of increasing the total bootstrap sample\nsize dn as d increases. Figure 2(b) considers a different setting, in which we increase d when fixing\nthe total observation data size N, and the total bootstrap sample size ntot = n \u00d7 d. According to (5)\nand (12), the MSEs of \u02c6\u03b8KL and \u02c6\u03b8KL\u2212W should be about O(ntot^{-1}) and O(d ntot^{-2}) respectively when\nN is very large, and this is consistent with the results in Figure 2(b). 
It is interesting to note that\nthe MSE of \u02c6\u03b8KL is independent of d while that of \u02c6\u03b8KL\u2212W increases linearly with d. This does not\nconflict with the fact that \u02c6\u03b8KL\u2212W is better than \u02c6\u03b8KL, since we always have d \u2264 ntot.\nFigure 2(c) shows the result when we set n = (N/d)^\u03b1 and vary \u03b1, where we find that \u02c6\u03b8KL\u2212W\nquickly converges to the global MLE as \u03b1 increases, while the KL-naive estimator \u02c6\u03b8KL converges\nsignificantly slower. Figure 2(d) demonstrates the case when we increase N while fixing d and n, where\nwe see our KL-weighted estimator \u02c6\u03b8KL\u2212W matches closely with N, except when N is very large in\nwhich case the O((dn^2)^{-1}) term starts to dominate, while KL-naive is much worse. We also find the\nlinear averaging estimator performs poorly, and does not scale as O(N^{-1}) as the theoretical rate\nclaims; this is due to the unidentifiable orthonormal transform in the PPCA model that we test on.\n\n[Figure 1 plots: Log MSE versus bootstrap size n; curves: KL-Naive, KL-Control, KL-Weighted.]\n\n(a) PPCA\n\n(b) Mixture of PPCA\n\n(c) GMM\n\nFigure 1: Results on different models with simulated data when we change the bootstrap sample size\nn, with fixed d = 10 and N = 6 \u00d7 10^7. The dimensions of the PPCA models in (a)-(b) are 5, and\nthat of GMM in (c) is 3. The numbers of mixture components in (b)-(c) are 3. Linear averaging and\nKL-Control are not applicable for the PPCA-based models, and are not shown in (a) and (b).\n\n[Figure 2 plots: Log MSE versus d, \u03b1, or N; curves: Global MLE, Linear, KL-Naive, KL-Weighted.]\n\n(a) Fix N and n\n\n(b) Fix N and ntot\n\n(c) Fix N, n = (N/d)^\u03b1 and d\n\n(d) Fix n and d\n\nFigure 2: Further experiments on PPCA with simulated data. (a) varying d with fixed n and N = 5 \u00d7 10^7.\n(b) varying d with N = 5 \u00d7 10^7, ntot = n \u00d7 d = 3 \u00d7 10^5. 
(c) varying \u03b1 with n = (N/d)^\u03b1, N = 10^7\nand fixed d. (d) varying N with n = 10^3 and d = 20. 
The dimension of data x is 5 and the dimension of\nlatent variables t is 4.\n\n4.2 Gaussian Mixture with Unknown Number of Components\n\nWe further apply our methods to a more challenging setting for distributed learning of GMM when\nthe number of mixture components is unknown. In this case, we first learn each local model with EM\nand decide its number of components using BIC selection. Neither linear averaging nor KL-control\n\u02c6\u03b8KL\u2212C is applicable in this setting, and we only test KL-naive \u02c6\u03b8KL and KL-weighted\n\u02c6\u03b8KL\u2212W. Since the MSE is also not computable due to the different dimensions, we evaluate \u02c6\u03b8KL\nand \u02c6\u03b8KL\u2212W using the log-likelihood on a hold-out testing dataset as shown in Figure 3. We can\nsee that \u02c6\u03b8KL\u2212W generally outperforms \u02c6\u03b8KL as we expect, and the relative improvement increases\nsignificantly as the dimension of the observation data x increases. This suggests that our variance\nreduction technique works very efficiently in high-dimensional problems.\n\n(a) Dimension of x = 3\n\n(b) Dimension of x = 80\n\n(c) varying the dimension of x\n\nFigure 3: GMM with the number of mixture components estimated by BIC. We set n = 600 and\nthe true number of mixtures to be 10 in all the cases. (a)-(b) vary the total data size N when the\ndimension of x is 3 and 80, respectively. 
(c) varies the dimension of the data with fixed N = 10^5.\nThe y-axis is the testing log likelihood compared with that of global MLE.\n\n4.3 Results on Real World Datasets\n\nFinally, we apply our methods to several real world datasets, including the SensIT Vehicle dataset on\nwhich mixture of PPCA is tested, and the Covertype and Epsilon datasets on which GMM is tested.\nFrom Figure 4, we can see that our KL-Weighted and KL-Control (when it is applicable) again perform\nthe best. The (matched) linear averaging performs poorly on GMM (Figure 4(b)-(c)), and is not\napplicable on mixture of PPCA.\n\n(a) Mixture of PPCA, SensIT Vehicle\n\n(b) GMM, Covertype\n\n(c) GMM, Epsilon\n\nFigure 4: Testing log likelihood (compared with that of global MLE) on real world datasets. (a)\nLearning Mixture of PPCA on SensIT Vehicle. (b)-(c) Learning GMM on Covertype and Epsilon.\nThe number of local machines is 10 in all the cases, and the number of mixture components is\ntaken to be the number of labels in the datasets. The dimension of latent variables in (a) is 90. For\nEpsilon, a PCA is first applied and the top 100 principal components are chosen. Linear-matched and\nKL-Control are not applicable on Mixture of PPCA and are not shown in (a).\n\n5 Conclusion and Discussion\n\nWe propose two variance reduction techniques for distributed learning of complex probabilistic\nmodels, including a KL-weighted estimator that is both statistically efficient and widely applicable\neven in challenging practical scenarios. Both theoretical and empirical analysis is provided to\ndemonstrate our methods. 
Future directions include extending our methods to discriminative learning\ntasks, as well as to the more challenging deep generative networks, for which the exact MLE is not\ncomputationally tractable and surrogate likelihood methods with stochastic gradient descent are needed.\nWe note that the same KL-averaging problem also appears in the \u201cknowledge distillation\u201d problem\nin Bayesian deep neural networks (Korattikara et al., 2015), and it seems that our technique can be\napplied straightforwardly.\n\nAcknowledgement This work is supported in part by NSF CRII 1565796.\n\nReferences\n\nS. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical\nlearning via the alternating direction method of multipliers. Foundations and Trends\u00ae in Machine\nLearning, 3(1), 2011.\n\nY. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical\noptimization. In NIPS, 2012.\n\nO. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using\nmini-batches. In JMLR, 2012.\n\nQ. Liu and A. T. Ihler. Distributed estimation, information loss and exponential families. In NIPS,\n2014.\n\nJ. Rosenblatt and B. Nadler. On the optimality of averaging in distributed statistical learning. arXiv\npreprint arXiv:1407.2724, 2014.\n\nY. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for\ndistributed statistical estimation with communication constraints. In NIPS, 2013.\n\nS. Merugu and J. Ghosh. Privacy-preserving distributed clustering using generative models. In Data\nMining, 2003. ICDM 2003. 
Third IEEE International Conference on, pages 211\u2013218. IEEE, 2003.\n\nO. Shamir, N. Srebro, and T. Zhang. Communication efficient distributed optimization using an\napproximate Newton-type method. In ICML, 2014.\n\nC. Huang and X. Huo. A distributed one-step estimator. arXiv preprint arXiv:1511.01443, 2015.\n\nJ. R. Wilson. Variance reduction techniques for digital simulation. American Journal of Mathematical\nand Management Sciences, 4, 1984.\n\nB. L. Nelson. On control variate estimators. Computers & Operations Research, 14, 1987.\n\nM. Henmi, R. Yoshida, and S. Eguchi. Importance sampling via the estimated sampler. Biometrika,\n94(4), 2007.\n\nK. Hirano, G. W. Imbens, and G. Ridder. Efficient estimation of average treatment effects using the\nestimated propensity score. Econometrica, 71, 2003.\n\nN. Sokolovska, O. Capp\u00e9, and F. Yvon. The asymptotics of semi-supervised learning in discriminative\nprobabilistic models. In ICML. ACM, 2008.\n\nM. Kawakita and T. Kanamori. Semi-supervised learning with density-ratio estimation. Machine\nLearning, 91, 2013.\n\nL. Li, R. Munos, and C. Szepesv\u00e1ri. Toward minimax off-policy value estimation. In AISTATS, 2015.\n\nA. Korattikara, V. Rathod, K. Murphy, and M. Welling. Bayesian dark knowledge. arXiv preprint\narXiv:1506.04416, 2015.", "award": [], "sourceid": 963, "authors": [{"given_name": "JUN", "family_name": "HAN", "institution": "Dartmouth College"}, {"given_name": "Qiang", "family_name": "Liu", "institution": "Dartmouth College"}]}