{"title": "Multi-Task Averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 1169, "page_last": 1177, "abstract": "We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient.", "full_text": "Multi-Task Averaging\n\nSergey Feldman, Maya R. Gupta, and Bela A. Frigyik\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nSeattle, WA 98103\n\nAbstract\n\nWe present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient.\n\n1 Introduction\n\nThe motivating hypothesis behind multi-task learning (MTL) algorithms is that leveraging data from related tasks can yield superior performance over learning from each task independently. Early evidence for this hypothesis is Stein\u2019s work on the estimation of the means of T distributions (tasks) [1]. 
Stein showed that it is better (in a summed squared error sense) to estimate each of the means of T Gaussian random variables using data sampled from all of them, even if they are independent and have different means. That is, it is beneficial to consider samples from seemingly unrelated distributions in the estimation of the tth mean. This surprising result is often referred to as Stein\u2019s paradox [2].\n\nEstimating means is perhaps the most common of all estimation tasks, and often multiple means need to be estimated. In this paper we consider a multi-task regularization approach to the problem of estimating multiple means that we call multi-task averaging (MTA). We show that MTA has provably nice theoretical properties, is effective in practice, and is computationally efficient. We define the MTA objective in Section 2, and review related work in Section 3. We present some key properties of MTA in Section 4 (proofs are omitted due to space constraints). In particular, we state the optimal amount of regularization to be used, and show that this optimal amount can be effectively estimated. Simulations in Section 5 verify the advantage of MTA over standard sample means and James-Stein estimation if the true means are close compared to the sample variance. In Section 6.1, two experiments estimating expected sales show that MTA can reduce real errors by over 30% compared to the sample mean. MTA can be used anywhere multiple averages are needed; we demonstrate this by applying it fruitfully to the averaging step of kernel density estimation in Section 6.2.\n\n2 Multi-Task Averaging\n\nConsider the T-task problem of estimating the means of T random variables that have finite mean and variance. Let $\\{Y_{ti}\\}_{i=1}^{N_t}$ be $N_t$ independent and identically-distributed random samples for t = 1, . . . , T. 
The MTA objective and many of the results in this paper generalize trivially to samples that are vectors rather than scalars, but for notational simplicity we restrict our focus to scalar samples $Y_{ti} \\in \\mathbb{R}$. Key notation is given in Table 1.\n\nTable 1: Key Notation\n\n$T$: number of tasks\n$N_t$: number of samples for tth task\n$Y_{ti} \\in \\mathbb{R}$: ith random sample from tth task\n$\\bar{Y}_t \\in \\mathbb{R}$: tth sample average $\\frac{1}{N_t} \\sum_i Y_{ti}$\n$Y_t^* \\in \\mathbb{R}$: MTA estimate of tth mean\n$\\sigma_t^2$: variance of the tth task\n$\\Sigma$: diagonal covariance matrix of $\\bar{Y}$ with $\\Sigma_{tt} = \\frac{\\sigma_t^2}{N_t}$\n$A \\in \\mathbb{R}^{T \\times T}$: pairwise task similarity matrix\n$L = D - A$: graph Laplacian of A, with diagonal D s.t. $D_{tt} = \\sum_{r=1}^T A_{tr}$\n\nIn addition, assume that the T \u00d7 T matrix A describes the relatedness or similarity of any pair of the T tasks, with $A_{tt} = 0$ for all t without loss of generality (because the diagonal self-similarity terms are canceled in the objective below). The proposed MTA objective is\n\n$$\\{Y_t^*\\}_{t=1}^T = \\arg\\min_{\\{\\hat{Y}_t\\}_{t=1}^T} \\frac{1}{T} \\sum_{t=1}^T \\sum_{i=1}^{N_t} \\frac{(Y_{ti} - \\hat{Y}_t)^2}{\\sigma_t^2} + \\frac{\\gamma}{T^2} \\sum_{r=1}^T \\sum_{s=1}^T A_{rs} (\\hat{Y}_r - \\hat{Y}_s)^2. \\tag{1}$$\n\nThe first term minimizes the sum of the empirical losses, and the second term jointly regularizes the estimates by regularizing their pairwise differences. The regularization parameter \u03b3 balances the empirical risk and the multi-task regularizer. Note that if \u03b3 = 0, then (1) decomposes to T separate minimization problems, producing the sample averages $\\bar{Y}_t$. 
The normalization of each error term in (1) by its task-specific variance $\\sigma_t^2$ (which may be estimated) scales the T empirical loss terms relative to the variance of their distribution; this ensures that high-variance tasks do not disproportionately dominate the loss term.\n\nA more general formulation of MTA is\n\n$$\\{Y_t^*\\}_{t=1}^T = \\arg\\min_{\\{\\hat{Y}_t\\}_{t=1}^T} \\frac{1}{T} \\sum_{t=1}^T \\sum_{i=1}^{N_t} L(Y_{ti}, \\hat{Y}_t) + \\gamma J\\left(\\{\\hat{Y}_t\\}_{t=1}^T\\right),$$\n\nwhere L is some loss function and J is a regularization function. If L is chosen to be any Bregman loss, then setting \u03b3 = 0 will produce the T sample averages [3]. For the analysis and experiments in this paper, we restrict our focus to the tractable squared-error formulation given in (1).\n\nThe task similarity matrix A can be specified as side information (e.g. from a domain expert), or set in an optimal fashion. In Section 4 we derive two optimal choices of A for the T = 2 case: the A that minimizes expected squared error, and a minimax A. We use the T = 2 analysis to propose practical estimators of A for any number of tasks.\n\n3 Related Work\n\nMTA is an approach to the problem of estimating T means. We are not aware of other work in the multi-task literature that addresses this problem; most MTL methods are designed for regression, classification, or feature selection, e.g. [4, 5, 6]. The most closely related work is Stein estimation, an empirical Bayes strategy for estimating multiple means simultaneously [7, 8, 2, 9]. James and Stein [7] showed that the maximum likelihood estimate of the tth mean \u00b5t can be dominated by a shrinkage estimate given Gaussian assumptions. There have been a number of extensions to the original James-Stein estimator. 
We compare to the positive-part residual James-Stein estimator for multiple data points per task and independent unequal variances [8, 10], such that the estimated mean for the tth task is\n\n$$\\xi + \\left(1 - \\frac{T - 3}{(\\bar{Y} - \\xi)^T \\Sigma^{-1} (\\bar{Y} - \\xi)}\\right)_+ (\\bar{Y}_t - \\xi), \\tag{2}$$\n\nwhere $(x)_+ = \\max(0, x)$; \u03a3 is a diagonal matrix of the estimated variances of each sample mean where $\\Sigma_{tt} = \\frac{\\hat{\\sigma}_t^2}{N_t}$ and the estimate is shrunk towards \u03be, which is usually set to be the mean of the sample means (other choices are sometimes used) $\\xi = \\bar{\\bar{Y}} = \\frac{1}{T} \\sum_t \\bar{Y}_t$. Bock\u2019s formulation of (2) uses the effective dimension (defined as the ratio of the trace of \u03a3 to the maximum eigenvalue of \u03a3) rather than the T in the numerator of (2) [8, 7, 10]. In preliminary practical experiments where \u03a3 must be estimated from the data, we found that using the effective dimension significantly crippled the performance of the James-Stein estimator. We hypothesize that this is due to the high variance of the estimate of the maximum eigenvalue of \u03a3.\n\nMTA can be interpreted as estimating means of T Gaussians with an intrinsic Gaussian Markov random field prior [11]. Unlike most work in graphical models, we do not assume any variables are conditionally independent, and generally have non-sparse inverse covariance.\n\nA key issue for MTA and many other multi-task learning methods is how to estimate the similarity (or task relatedness) between tasks and/or samples if it is not provided. A common approach is to estimate the similarity matrix jointly with the task parameters [12, 13, 5, 14, 15]. 
For example, Zhang and Yeung [15] assumed that there exists a covariance matrix for the task relatedness, and proposed a convex optimization approach to estimate the task covariance matrix and the task parameters in a joint, alternating way. Applying such joint and alternating approaches to the MTA objective (1) leads to a degenerate solution with zero similarity. However, the simplicity of MTA enables us to specify the optimal task similarity matrix for T = 2 (see Sec. 4), which we generalize to obtain an estimator for the general multi-task case.\n\n4 MTA Theory\n\nFor symmetric A with non-negative components [footnote 1], the MTA objective given in (1) is continuous, differentiable, and convex. It is straightforward to show that (1) has closed-form solution:\n\n$$Y^* = \\left(I + \\frac{\\gamma}{T} \\Sigma L\\right)^{-1} \\bar{Y}, \\tag{3}$$\n\nwhere $\\bar{Y}$ is the vector of sample averages with tth entry $\\bar{Y}_t = \\frac{1}{N_t} \\sum_{i=1}^{N_t} Y_{ti}$, L is the graph Laplacian of A, and \u03a3 is defined as before. With non-negative A and \u03b3, the matrix inverse in (3) can be shown to always exist using the Gershgorin Circle Theorem [16].\n\nNote that the (r, s)th entry of $\\frac{\\gamma}{T} \\Sigma L$ goes to 0 as $N_t$ approaches infinity, and since matrix inversion is a continuous operation, $\\left(I + \\frac{\\gamma}{T} \\Sigma L\\right)^{-1} \\to I$ in the norm. By the law of large numbers one can conclude that $Y^*$ asymptotically approaches the true means.\n\n4.1 Convexity of MTA Solution\n\nFrom inspection of (3), it is clear that each of the elements of $Y^*$ is a linear combination of the sample averages $\\bar{Y}$. However, a stronger statement can be made:\n\nTheorem: If \u03b3 \u2265 0, $0 \\le A_{rs} < \\infty$ for all r, s and $0 < \\frac{\\sigma_t^2}{N_t} < \\infty$ for all t, then the MTA estimates $\\{Y_t^*\\}$ given in (3) are a convex combination of the task sample averages $\\{\\bar{Y}_t\\}$.\n\nProof Sketch: The theorem requires showing that the matrix $W = \\left(I + \\frac{\\gamma}{T} \\Sigma L\\right)^{-1}$ exists and is right-stochastic. Using the Gershgorin Circle Theorem [16], we can show that the real part of every eigenvalue of $W^{-1}$ is positive. The matrix $W^{-1}$ is a Z-matrix [17], and if the real part of each of the eigenvalues of a Z-matrix is positive, then its inverse has all non-negative entries (See Chapter 6, Theorem 2.3, G20, and N38, [17]). Finally, to prove that W has rows that sum to 1, first note that by definition the rows of the graph Laplacian L sum to zero. Thus $\\left(I + \\frac{\\gamma}{T} \\Sigma L\\right) \\mathbf{1} = \\mathbf{1}$, and because we established invertibility, this implies the desired right-stochasticity: $\\mathbf{1} = \\left(I + \\frac{\\gamma}{T} \\Sigma L\\right)^{-1} \\mathbf{1}$.\n\nFootnote 1: If an asymmetric A is provided, using it with MTA is equivalent to using the symmetric $(A^T + A)/2$.\n\n4.2 Optimal A for the Two Task Case\n\nIn this section we analyze the T = 2 task case, with $N_1$ and $N_2$ samples for tasks 1 and 2 respectively. Suppose $\\{Y_{1i}\\}$ are iid (independently and identically distributed) with finite mean $\\mu_1$ and finite variance $\\sigma_1^2$, and $\\{Y_{2i}\\}$ are iid with finite mean $\\mu_2 = \\mu_1 + \\Delta$ and finite variance $\\sigma_2^2$. Let the task-relatedness matrix be A = [0 a; a 0], and without loss of generality, we fix \u03b3 = 1. 
Then, writing $\\Delta = \\mu_2 - \\mu_1$ for the difference between the task means, the closed-form solution (3) can be simplified:\n\n$$Y_1^* = \\left(\\frac{T + \\frac{\\sigma_2^2}{N_2} a}{T + \\frac{\\sigma_1^2}{N_1} a + \\frac{\\sigma_2^2}{N_2} a}\\right) \\bar{Y}_1 + \\left(\\frac{\\frac{\\sigma_1^2}{N_1} a}{T + \\frac{\\sigma_1^2}{N_1} a + \\frac{\\sigma_2^2}{N_2} a}\\right) \\bar{Y}_2. \\tag{4}$$\n\nIt is straightforward to derive the mean squared error of $Y_1^*$:\n\n$$\\mathrm{MSE}[Y_1^*] = \\frac{\\sigma_1^2}{N_1} \\left(\\frac{T^2 + 2T \\frac{\\sigma_2^2}{N_2} a + \\frac{\\sigma_1^2 \\sigma_2^2}{N_1 N_2} a^2 + \\frac{\\sigma_2^4}{N_2^2} a^2}{\\left(T + \\frac{\\sigma_1^2}{N_1} a + \\frac{\\sigma_2^2}{N_2} a\\right)^2}\\right) + \\frac{\\Delta^2 \\frac{\\sigma_1^4}{N_1^2} a^2}{\\left(T + \\frac{\\sigma_1^2}{N_1} a + \\frac{\\sigma_2^2}{N_2} a\\right)^2}. \\tag{5}$$\n\nComparing to the MSE of the sample average, one obtains the following relationship:\n\n$$\\mathrm{MSE}[Y_1^*] < \\mathrm{MSE}[\\bar{Y}_1] \\quad \\text{if} \\quad \\Delta^2 - \\frac{\\sigma_1^2}{N_1} - \\frac{\\sigma_2^2}{N_2} < \\frac{4}{a}. \\tag{6}$$\n\nThus the MTA estimate of the first mean has lower MSE if the squared mean-separation $\\Delta^2$ is small compared to the variances of the sample averages. Note that as a approaches 0 from above, the RHS of (6) approaches infinity, which means that a small amount of regularization can be helpful even when the difference between the task means \u2206 is large. Summarizing, if the two task means are close relative to each task\u2019s sample variance, MTA will help.\n\nThe risk is the sum of the mean squared errors: $\\mathrm{MSE}[Y_1^*] + \\mathrm{MSE}[Y_2^*]$, which is a convex, continuous, and differentiable function of a, and therefore the first derivative can be used to specify the optimal value $a^*$, when all the other variables are fixed. Minimizing $\\mathrm{MSE}[Y_1^*] + \\mathrm{MSE}[Y_2^*]$ w.r.t. a one obtains the following solution:\n\n$$a^* = \\frac{2}{\\Delta^2}, \\tag{7}$$\n\nwhich is always non-negative.\n\nAnalysis of the second derivative shows that this minimizer always holds for the cases of interest (that is, for $N_1, N_2 \\ge 1$). In the limit case, when the difference in the task means \u2206 goes to zero (while $\\sigma_t^2$ stay constant), the optimal task-relatedness $a^*$ goes to infinity, and the weights in (4) on $\\bar{Y}_1$ and $\\bar{Y}_2$ become 1/2 each.\n\n4.3 Estimating A from Data\n\nBased on our analysis of the optimal A for the two-task case, we propose two methods to estimate A from data for arbitrary T. The first method is designed to minimize the approximate risk using a constant similarity matrix. The second method provides a minimax estimator. With both methods we can use the Sherman-Morrison formula to avoid taking the matrix inverse in (3), and the computation of $Y^*$ is O(T).\n\n4.3.1 Constant MTA\n\nRecalling that $E[\\bar{Y} \\bar{Y}^T] = \\mu \\mu^T + \\Sigma$, the risk of estimator $\\hat{Y} = W \\bar{Y}$ of unknown parameter vector \u00b5 for the squared loss is the sum of the mean squared errors:\n\n$$R(\\mu, W \\bar{Y}) = E[(W \\bar{Y} - \\mu)^T (W \\bar{Y} - \\mu)] = \\mathrm{tr}(W \\Sigma W^T) + \\mu^T (I - W)^T (I - W) \\mu. \\tag{8}$$\n\nOne approach to generalizing the results of Section 4.2 to arbitrary T is to try to find a symmetric, non-negative matrix A such that the (convex, differentiable) risk $R(\\mu, W \\bar{Y})$ is minimized for $W = \\left(I + \\frac{\\gamma}{T} \\Sigma L\\right)^{-1}$ (recall L is the graph Laplacian of A). The problem with this approach is two-fold: (i) the solution is not analytically tractable for T > 2 and (ii) an arbitrary A has T(T \u2212 1) degrees of freedom, which is considerably more than the number of means we are trying to estimate in the first place. 
To avoid these problems, we generalize the two-task results by constraining A to be a scaled constant matrix $A = a \\mathbf{1}\\mathbf{1}^T$, and find the optimal $a^*$ that minimizes the risk in (8). In addition, w.l.o.g. we set \u03b3 to 1, and for analytic tractability we assume that all the tasks have the same variance, estimating \u03a3 as $\\frac{\\mathrm{tr}(\\Sigma)}{T} I$. Then it remains to solve:\n\n$$a^* = \\arg\\min_a R\\left(\\mu, \\left(I + \\frac{1}{T} \\frac{\\mathrm{tr}(\\Sigma)}{T} L(a \\mathbf{1}\\mathbf{1}^T)\\right)^{-1} \\bar{Y}\\right),$$\n\nwhich has the solution\n\n$$a^* = \\frac{2}{\\frac{1}{T(T-1)} \\sum_{r=1}^T \\sum_{s=1}^T (\\mu_r - \\mu_s)^2},$$\n\nwhich reduces to the optimal two task MTA solution (7) when T = 2. In practice, one of course does not have $\\{\\mu_r\\}$ as these are precisely the values one is trying to estimate. So, to estimate $a^*$ we use the sample means $\\{\\bar{y}_r\\}$: $\\hat{a}^* = \\frac{2}{\\frac{1}{T(T-1)} \\sum_{r=1}^T \\sum_{s=1}^T (\\bar{y}_r - \\bar{y}_s)^2}$. Using this estimated optimal constant similarity and an estimated covariance matrix $\\hat{\\Sigma}$ produces what we refer to as the constant MTA estimate\n\n$$Y^* = \\left(I + \\frac{1}{T} \\hat{\\Sigma} L(\\hat{a}^* \\mathbf{1}\\mathbf{1}^T)\\right)^{-1} \\bar{Y}. \\tag{9}$$\n\nNote that we made the assumption that the entries of \u03a3 were the same in order to be able to derive the constant similarity $a^*$, but we do not need nor suggest that assumption on the $\\hat{\\Sigma}$ used with $\\hat{a}^*$ in (9).\n\n4.3.2 Minimax MTA\n\nBock\u2019s James-Stein estimator is minimax in that it minimizes the worst-case loss, not necessarily the expected loss [10]. This leads to a more conservative use of regularization. In this section, we derive a minimax version of MTA, that prescribes less regularization than the constant MTA. 
Formally, an estimator $Y^M$ of \u00b5 is called minimax if it minimizes the maximum risk:\n\n$$\\inf_{\\hat{Y}} \\sup_{\\mu} R(\\mu, \\hat{Y}) = \\sup_{\\mu} R(\\mu, Y^M).$$\n\nFirst, we will specify minimax MTA for the T = 2 case. To find a minimax estimator $Y^M$ it is sufficient to show that (i) $Y^M$ is a Bayes estimator w.r.t. the least favorable prior (LFP) and (ii) it has constant risk [10]. To find a LFP, we first need to specify a constraint set for $\\mu_t$; we use an interval: $\\mu_t \\in [b_l, b_u]$, for all t, where $b_l \\in \\mathbb{R}$ and $b_u \\in \\mathbb{R}$. With this constraint set the minimax estimator is:\n\n$$Y^M = \\left(I + \\frac{2}{T (b_u - b_l)^2} \\Sigma L(\\mathbf{1}\\mathbf{1}^T)\\right)^{-1} \\bar{Y}, \\tag{10}$$\n\nwhich reduces to (7) when T = 2. This minimax analysis is only valid for the case when T = 2, but we found good practical results for larger T using (10) with the data-dependent interval $\\hat{b}_l = \\min_t \\bar{y}_t$ and $\\hat{b}_u = \\max_t \\bar{y}_t$.\n\n5 Simulations\n\nWe first illustrate the performance of the proposed MTA using Gaussian and uniform simulations so that comparisons to ground truth can be made. Simulation parameters are given in the table in Figure 1, and were set so that the variances of the distribution of the true means were the same in both types of simulations. Simulation results are reported in Figure 1 for different values of $\\sigma_\\mu^2$, which determines the variance of the distribution over the means.\n\nWe compared constant MTA and minimax MTA to single-task sample averages and to the James-Stein estimator given in (2). We also compared to a randomized 5-fold 50/50 cross-validated (CV) version of constant MTA, and minimax MTA, and the James-Stein estimator (which is simply a convex regularization towards the average of the sample means: $\\lambda \\bar{y}_t + (1 - \\lambda) \\bar{\\bar{y}}$). 
For the cross-validated versions, we randomly subsampled $N_t/2$ samples and chose the value of \u03b3 for constant/minimax MTA or \u03bb for James-Stein that resulted in the lowest average left-out risk compared to the sample mean estimated with all $N_t$ samples. In the optimal versions of constant/minimax MTA, \u03b3 was set to 1, as this was the case during derivation.\n\nWe used the following parameters for CV: $\\gamma \\in \\{2^{-5}, 2^{-4}, \\ldots, 2^{5}\\}$ for the MTA estimators and a comparable set of \u03bb spanning (0, 1) by the transformation $\\lambda = \\frac{\\gamma}{\\gamma + 1}$. Even when cross-validating, an advantage of using the proposed constant MTA or minimax MTA is that these estimators provide a data-adaptive scale for \u03b3, where \u03b3 = 1 sets the regularization parameter to be $\\frac{a^*}{T}$ or $\\frac{1}{T (b_u - b_l)^2}$, respectively.\n\nGaussian Simulations: $\\mu_t \\sim \\mathcal{N}(0, \\sigma_\\mu^2)$; $\\sigma_t^2 \\sim \\mathrm{Gamma}(0.9, 1.0) + 0.1$; $N_t \\sim U\\{2, \\ldots, 100\\}$; $y_{ti} \\sim \\mathcal{N}(\\mu_t, \\sigma_t^2)$.\n\nUniform Simulations: $\\mu_t \\sim U(-\\sqrt{3\\sigma_\\mu^2}, \\sqrt{3\\sigma_\\mu^2})$; $\\sigma_t^2 \\sim U(0.1, 2.0)$; $N_t \\sim U\\{2, \\ldots, 100\\}$; $y_{ti} \\sim U[\\mu_t - \\sqrt{3\\sigma_t^2}, \\mu_t + \\sqrt{3\\sigma_t^2}]$.\n\n[Figure 1 panels: T = 2, T = 5, and T = 25 for each simulation type; x-axis: $\\sigma_\\mu^2$ (variance of the means); y-axis: % change vs. single-task; curves: Single-Task, James-Stein, MTA constant, MTA minimax, James-Stein (CV), MTA constant (CV), MTA minimax (CV).]\n\nFigure 1: Average (over 10000 random draws) percent change in risk vs. single-task. Lower is better.\n\nSome observations from Figure 1: further to the right on the x-axis, the means are more likely to be further apart, and multi-task approaches help less on average. For T = 2, the James-Stein estimator reduces to the single-task estimator, and is of no help. The MTA estimators provide a gain while $\\sigma_\\mu^2 < 1$ but deteriorate quickly thereafter. For T = 5, constant MTA dominates in the Gaussian case, but in the uniform case does worse than single-task when the means are far apart. Note that for all T > 2 minimax MTA almost always outperforms James-Stein and always outperforms single-task, which is to be expected as it was designed conservatively. For T = 25, we see the trend that all estimators benefit from an increase in the number of tasks.\n\nFor constant MTA, cross-validation is always worse than the estimated optimal regularization. Since both constant MTA and minimax MTA use a similarity matrix of all ones scaled by a constant, cross-validating over a set of possible \u03b3 may result in nearly identical performance, and this can be seen in the Figure (i.e. the green and blue dotted lines are superimposed). To conclude, when the tasks are close to each other compared to their variances, constant MTA is the best estimator to use by a wide margin. 
When the tasks are farther apart, minimax MTA will provide a win over both James-Stein and maximum likelihood.\n\n6 Applications\n\nWe present two applications with real data. The first application parallels the simulations, estimating expected values of sales of related products. The second application uses MTA for multi-task kernel density estimation, highlighting the applicability of MTA to any algorithm that uses sample averages.\n\n6.1 Application: Estimating Product Sales\n\nWe consider two multi-task problems using sales data over a certain time period supplied by Artifact Puzzles, a company that sells jigsaw puzzles online. For both problems, we model the given samples as being drawn iid from each task.\n\nThe first problem estimates the impact of a particular puzzle on repeat business: \u201cEstimate how much a random customer will spend on an order on average, if on their last order they purchased the tth puzzle, for each of T = 77 puzzles.\u201d The samples were the amounts different customers had spent on orders after buying each of the t puzzles, and ranged from 480 down to 0 for customers that had not re-ordered. The number of samples for each puzzle ranged from Nt = 8 to Nt = 348.\n\nThe second problem estimates the expected order size of a particular customer: \u201cEstimate how much the tth customer will spend on an order on average, for each of the T = 477 customers that ordered at least twice during the data timeframe.\u201d The samples were the order amounts for each of the T customers. Order amounts varied from 15 to 480. The number of samples for each customer ranged from Nt = 2 to Nt = 17.\n\nThere is no ground truth. As a metric to compare the estimates, we treat each task\u2019s sample average computed from all of the samples as the ground truth, and compare to estimates computed from a uniformly randomly chosen 50% of the samples. Results in Table 2 are averaged over 1000 random draws of the 50% used for estimation. 
We used 5-fold cross-validation with the same parameter choices as in the simulations section.\n\nTable 2: Percent change in average risk (for puzzle and buyer data, lower is better), and mean reciprocal rank (for terrorist data, higher is better).\n\nEstimator              Puzzles (T = 77)   Customers (T = 477)   Suicide Bombings (T = 7)\nPooled Across Tasks    181.67%            109.21%               0.13\nJames-Stein            -6.87%             -14.04%               0.15\nJames-Stein (CV)       -21.18%            -31.01%               0.15\nConstant MTA           -17.48%            -32.29%               0.19\nConstant MTA (CV)      -21.65%            -30.89%               0.19\nMinimax MTA            -2.96%             -8.41%                0.19\nMinimax MTA (CV)       -19.83%            -25.04%               0.19\nExpert MTA             -                  -                     0.19\nExpert MTA (CV)        -                  -                     0.19\n\n6.2 Density Estimation for Terrorism Risk Assessment\n\nMTA can be used whenever multiple averages are taken. In this section we present multi-task kernel density estimation, as an application of MTA. Recall that for standard single-task kernel density estimation (KDE) [18], a set of random samples $x_i \\in \\mathbb{R}^d$, $i \\in \\{1, \\ldots, N\\}$ are assumed to be iid from an unknown distribution $p_X$, and the problem is to estimate the density for a query sample, $z \\in \\mathbb{R}^d$. Given a kernel function $K(x_i, x_j)$, the un-normalized single-task KDE estimate is $\\hat{p}(z) = \\frac{1}{N} \\sum_{i=1}^N K(x_i, z)$, which is just a sample average.\n\nWhen multiple kernel densities $\\{p_t(z)\\}_{t=1}^T$ are estimated for the same domain, we replace the multiple sample averages with MTA estimates, which we refer to as multi-task kernel density estimation (MT-KDE).\n\nWe compared KDE and MT-KDE on a problem of estimating the probability of terrorist events in Jerusalem using the Naval Research Laboratory\u2019s Adversarial Modeling and Exploitation Database (NRL AMX-DB). The NRL AMX-DB combined multiple open primary sources [footnote 2] to create a rich representation of the geospatial features of urban Jerusalem and the surrounding region, and accurately geocoded locations of terrorist attacks. 
Density estimation models are used to analyze the behavior of such violent agents, and to allocate security and medical resources. In related work, [19] also used a Gaussian kernel density estimate to assess risk from past terrorism events.\n\nThe goal in this application is to estimate a risk density for 40,000 geographical locations (samples) in a 20km \u00d7 20km area of interest in Jerusalem. Each geographical location is represented by a d = 76-dimensional feature vector. Each of the 76 features is the distance in kilometers to the nearest instance of some geographic location of interest, such as the nearest market or bus stop. Locations of past events are known for 17 suicide bombings. All the events are attributed to one of seven terrorist groups. The density estimates for these seven groups are expected to be related, and are treated as T = 7 tasks.\n\nThe kernel K was taken to be a Gaussian kernel with identity covariance. In addition to constant A and minimax A, we also obtained a side-information A from terrorism expert Mohammed M. Hafez of the Naval Postgraduate School; he assessed the similarity between the seven groups during the Second Intifada (the time period of the data), providing similarities between 0 and 1.\n\nWe used leave-one-out cross validation to assess KDE and MT-KDE for this problem, as follows. After computing the KDE and MT-KDE density estimates using all but one of the training examples {xti} for each task, we sort the resulting 40,000 estimated probabilities for each of the seven tasks, and extract the rank of the left-out known event. The mean reciprocal rank (MRR) metric is reported in Table 2. Ideally, the MRR of the left-out events would be as close to 1 as possible, indicating that the location of the left-out event is at high risk. 
The results show that the MRR for MT-KDE are higher or not worse than those for KDE for both problems; there are, however, too few samples to verify statistical significance of these results.\n\n7 Summary\n\nThough perhaps unintuitive, we showed that both in theory and in practice, estimating multiple unrelated means using an MTL approach can improve the overall risk, even more so than James-Stein estimation. Averaging is common, and MTA has potentially broad applicability as a subcomponent in many algorithms, such as k-means clustering, kernel density estimation, or non-local means denoising.\n\nAcknowledgments\n\nWe thank Peter Sadowski, Mohammed Hafez, Carol Chang, Brian Sandberg and Ruth Wilis for help with preliminary experiments and access to the terrorist dataset.\n\nFootnote 2: Primary sources included the NRL Israel Suicide Terrorism Database (ISD) cross referenced with open sources (including the Israel Ministry of Foreign Affairs, BBC, CPOST, Daily Telegraph, Associated Press, Ha\u2019aretz Daily, Jerusalem Post, Israel National News), as well as the University of New Haven Institute for the Study of Violent Groups, the University of Maryland Global Terrorism Database, and the National Counter Terrorism Center Worldwide Incident Tracking System.\n\nReferences\n\n[1] C. Stein, \u201cInadmissibility of the usual estimator for the mean of a multivariate distribution,\u201d Proc. Third Berkeley Symposium on Mathematical Statistics and Probability, pp. 197\u2013206, 1956.\n\n[2] B. Efron and C. N. Morris, \u201cStein\u2019s paradox in statistics,\u201d Scientific American, vol. 236, no. 5, pp. 119\u2013127, 1977.\n\n[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, \u201cClustering with Bregman divergences,\u201d Journal Machine Learning Research, vol. 6, pp. 1705\u20131749, December 2005.\n\n[4] C. A. Micchelli and M. 
Pontil, \u201cKernels for multi\u2013task learning,\u201d in Advances in Neural Infor-\n\nmation Processing Systems (NIPS), 2004.\n\n[5] E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams, \u201cMulti-task Gaussian process prediction,\u201d\n\nin Advances in Neural Information Processing Systems (NIPS). MIT Press, 2008.\n\n[6] A. Argyriou, T. Evgeniou, and M. Pontil, \u201cConvex multi-task feature learning,\u201d Machine\n\nLearning, vol. 73, no. 3, pp. 243\u2013272, 2008.\n\n[7] W. James and C. Stein, \u201cEstimation with quadratic loss,\u201d Proc. Fourth Berkeley Symposium on\n\nMathematical Statistics and Probability, pp. 361\u2013379, 1961.\n\n[8] M. E. Bock, \u201cMinimax estimators of the mean of a multivariate normal distribution,\u201d The\n\nAnnals of Statistics, vol. 3, no. 1, 1975.\n\n[9] G. Casella, \u201cAn introduction to empirical Bayes data analysis,\u201d The American Statistician, pp.\n\n83\u201387, 1985.\n\n[10] E. L. Lehmann and G. Casella, Theory of Point Estimation. New York: Springer, 1998.\n[11] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications, ser. Mono-\n\ngraphs on Statistics and Applied Probability. London: Chapman & Hall, 2005, vol. 104.\n\n[12] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying, \u201cA spectral regularization framework for\nmulti-task structure learning,\u201d in Advances in Neural Information Processing Systems (NIPS),\n2007.\n\n[13] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, \u201cMulti-task learning for classi\ufb01cation with\n\nDirichlet process priors,\u201d Journal Machine Learning Research, vol. 8, pp. 35\u201363, 2007.\n\n[14] L. Jacob, F. Bach, and J.-P. Vert, \u201cClustered multi-task learning: A convex formulation,\u201d in\n\nAdvances in Neural Information Processing Systems (NIPS), 2008, pp. 745\u2013752.\n\n[15] Y. Zhang and D.-Y. Yeung, \u201cA convex formulation for learning task relationships,\u201d in Proc. 
of\n\nthe 26th Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI), 2010.\n\n[16] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990, corrected\n\nreprint of the 1985 original.\n\n[17] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences. Aca-\n\ndemic Press, 1979.\n\n[18] B. W. Silverman, Density Estimation for Statistics and Data Analysis. New York: Chapman\n\nand Hall, 1986.\n\n[19] D. Brown, J. Dalton, and H. Hoyle, \u201cSpatial forecast methods for terrorist events in urban\n\nenvironments,\u201d Lecture Notes in Computer Science, vol. 3073, pp. 426\u2013435, 2004.\n\n9\n\n\f", "award": [], "sourceid": 564, "authors": [{"given_name": "Sergey", "family_name": "Feldman", "institution": null}, {"given_name": "Maya", "family_name": "Gupta", "institution": null}, {"given_name": "Bela", "family_name": "Frigyik", "institution": null}]}