{"title": "Sparse Bayesian Multi-Task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1755, "page_last": 1763, "abstract": "We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods.", "full_text": "Sparse Bayesian Multi-Task Learning\n\nC\u00e9dric Archambeau, Shengbo Guo, Onno Zoeter\n\n{Cedric.Archambeau, Shengbo.Guo, Onno.Zoeter}@xrce.xerox.com\n\nXerox Research Centre Europe\n\nAbstract\n\nWe propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. 
Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods.\n\n1 Introduction\n\nLearning multiple related tasks is increasingly important in modern applications, ranging from the prediction of test scores in social sciences and the classification of protein functions in systems biology to the categorisation of scenes in computer vision and, more recently, web search and ranking. In many real-life problems multiple related target variables need to be predicted from a single set of input features. A problem that has attracted considerable interest in recent years is to label an image with (text) keywords based on the features extracted from that image [26]. In general, this multi-label classification problem is challenging as the number of classes is equal to the vocabulary size and thus typically very large. While capturing correlations between the labels seems appealing, it is in practice difficult as it rapidly leads to numerical problems when estimating the correlations.\n\nA naive solution is to learn a model for each task separately and to make predictions using the independent models. Of course, this approach is unsatisfactory as it does not take advantage of all the information contained in the data. If the model is able to capture the task relatedness, its generalisation capabilities are expected to be drastically increased. This motivated the introduction of the multi-task learning paradigm, which exploits the correlations amongst multiple tasks by learning them simultaneously rather than individually [12]. More recently, the abundant literature on multi-task learning demonstrated that performance indeed improves when the tasks are related [6, 31, 2, 14, 13].\n\nThe multi-task learning problem encompasses two main settings. 
In the first one, for every input, every task produces an output. If we restrict ourselves to multiple regression for the time being, the most basic multi-task model would consider P correlated tasks^1, the vector of covariates and targets being respectively denoted by x_n \in R^D and y_n \in R^P:\n\ny_n = W x_n + \mu + \epsilon_n,    \epsilon_n \sim N(0, \Sigma),    (1)\n\nwhere W \in R^{P \times D} is the matrix of weights, \mu \in R^P the task offsets and \epsilon_n \in R^P the vector of residual errors with covariance \Sigma \in R^{P \times P}. In this setting, the output of all tasks is observed for every input.\n\n^1 While it is straightforward to show that the maximum likelihood estimate of W would be the same as when considering uncorrelated noise, imposing any prior on W would lead to a different solution.\n\nIn the second setting, the goal is to learn from a set of observed tasks and to generalise to a new task. This approach views the multi-task learning problem as a transfer learning problem, where it is assumed that the various tasks belong in some sense to the same environment and share common properties [23, 5]. In general only a single task output is observed for every input.\n\nA recent trend in multi-task learning is to consider sparse solutions to facilitate interpretation. Many authors formulate the sparse multi-task learning problem in a (relaxed) convex optimisation framework [5, 22, 35, 23]. If the regularisation constant is chosen using cross-validation, regularisation-based approaches often overestimate the support [32], i.e., they select more features than the set that generated the data. Alternatively, one can adopt a Bayesian approach to sparsity in the context of multi-task learning [29, 21]. 
The main advantage of the Bayesian formalism is that it enables us to learn the degree of sparsity supported by the data and does not require the user to specify the type of penalisation in advance.\n\nIn this paper, we adopt the first setting for multi-task learning, but we will consider a hierarchical Bayesian model in which the entries of W are correlated while the residual errors are uncorrelated. This is similar in spirit to the approach taken by [18], where tasks are related through a shared kernel matrix. We will consider a matrix-variate prior to simultaneously model task correlations and group sparsity in W. A matrix-variate Gaussian prior was used in [35] in a maximum likelihood setting to capture task correlations and feature correlations. While we are also interested in task correlations, we will consider matrix-variate Gaussian scale mixture priors centred at zero to drive entire blocks of W to zero. The Bayesian group LASSO proposed in [30] is a special case. Group sparsity [34] is especially useful in the presence of categorical features, which are in general represented as groups of \u201cdummy\u201d variables. Finally, we will allow the covariance to be of low rank so that we can deal with problems involving a very large number of tasks.\n\n2 Matrix-variate Gaussian prior\n\nBefore starting our discussion of the model, we introduce the matrix-variate Gaussian as it plays a key role in our work. 
For a matrix W \in R^{P \times D}, the matrix-variate Gaussian density [16] with mean matrix M \in R^{P \times D}, row covariance \Omega \in R^{D \times D} and column covariance \Sigma \in R^{P \times P} is given by\n\nN(M, \Omega, \Sigma) \propto e^{-\frac{1}{2} vec(W-M)^T (\Omega \otimes \Sigma)^{-1} vec(W-M)} \propto e^{-\frac{1}{2} tr\{\Omega^{-1} (W-M)^T \Sigma^{-1} (W-M)\}}.    (2)\n\nIf we let \Sigma = E(W-M)(W-M)^T, then \Omega = E(W-M)^T(W-M)/c, where c ensures the density integrates to one. While this introduces a scale ambiguity between \Sigma and \Omega (easily removed by means of a prior), the use of a matrix-variate formulation is appealing as it makes explicit the structure of vec(W), the vector formed by the concatenation of the columns of W. This structure is reflected in its covariance matrix, which is constrained to be the Kronecker product of the row and the column covariance matrices.\n\nIt is interesting to compare a matrix-variate prior for W in (1) with the classical multi-level approach to multiple regression from statistics (see e.g. [20]). In a standard multi-level model, the rows of W are drawn iid from a multivariate Gaussian with mean m and covariance S, and m is further drawn from a zero-mean Gaussian with covariance R. Integrating out m then leads to a Gaussian distributed vec(W) with mean zero and a covariance matrix whose diagonal blocks are equal to S + R and whose off-diagonal blocks are equal to R. Hence, the standard multi-level model assumes a very different covariance structure from the one based on (2) and incidentally cannot learn correlated and anti-correlated tasks simultaneously.\n\n3 A general family of group sparsity inducing priors\n\nWe seek a solution for which the expectation of W is sparse, i.e., blocks of W are driven to zero. 
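As a quick aside, the equivalence of the two exponents in (2) can be checked numerically; the following is a minimal sketch, assuming NumPy and the column-stacking convention for vec(·):

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 3, 4

# Toy mean, weights and SPD row/column covariances.
W = rng.standard_normal((P, D))
M = rng.standard_normal((P, D))
A = rng.standard_normal((D, D)); Omega = A @ A.T + D * np.eye(D)  # row covariance (D x D)
B = rng.standard_normal((P, P)); Sigma = B @ B.T + P * np.eye(P)  # column covariance (P x P)

X = W - M
vecX = X.T.reshape(-1)  # vec(X): columns of X stacked into one vector

# Exponent in trace form: tr{Omega^{-1} X^T Sigma^{-1} X}
quad_tr = np.trace(np.linalg.solve(Omega, X.T) @ np.linalg.solve(Sigma, X))
# Exponent in vec/Kronecker form: vec(X)^T (Omega kron Sigma)^{-1} vec(X)
quad_vec = vecX @ np.linalg.solve(np.kron(Omega, Sigma), vecX)

assert np.allclose(quad_tr, quad_vec)
```

Both quadratic forms agree, which is the equivalence used implicitly whenever one switches between the matrix and the vectorised view of W.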
A straightforward way to induce sparsity, and which would be equivalent to \ell_1-regularisation on blocks of W, is to consider a Laplace prior (or double exponential). Although applicable in a penalised likelihood framework, the Laplace prior would be computationally hard in a Bayesian setting as it is not conjugate to the Gaussian likelihood. Hence, naively using this prior would prevent us from computing the posterior in closed form, even in a variational setting. In order to circumvent this problem, we take a hierarchical Bayesian approach.\n\nFigure 1: Graphical model for sparse Bayesian multiple regression (when excluding the dashed arrow) and sparse Bayesian multiple classification (when considering all arrows).\n\nWe assume that the marginal prior, or effective prior, on each block W_i \in R^{P \times D_i} has the form of a matrix-variate Gaussian scale mixture, a generalisation of the multivariate Gaussian scale mixture [3]:\n\np(W_i) = \int_0^\infty N(0, \gamma_i^{-1} \Omega_i, \Sigma) p(\gamma_i) d\gamma_i,    \sum_{i=1}^Q D_i = D,    (3)\n\nwhere \Omega_i \in R^{D_i \times D_i}, \Sigma \in R^{P \times P} and \gamma_i > 0 is the latent precision (i.e., inverse scale) associated to block W_i.\n\nA sparsity inducing prior for W_i can then be constructed by choosing a suitable hyperprior for \gamma_i. We impose a generalised inverse Gaussian prior (see Supplemental Appendix A for a formal definition with special cases) on the latent precision variables:\n\n\gamma_i \sim N^{-1}(\omega, \chi, \phi) = \frac{\chi^{-\omega} (\sqrt{\chi\phi})^{\omega}}{2 K_{\omega}(\sqrt{\chi\phi})} \gamma_i^{\omega-1} e^{-\frac{1}{2}(\chi \gamma_i^{-1} + \phi \gamma_i)},    (4)\n\nwhere K_{\omega}(\cdot) is the modified Bessel function of the second kind, \omega is the index, \sqrt{\chi\phi} defines the concentration of the distribution and \sqrt{\chi/\phi} defines its scale. The effective prior is then a symmetric matrix-variate generalised hyperbolic distribution:\n\np(W_i) \propto \frac{K_{\omega + \frac{P D_i}{2}}\left(\sqrt{\chi(\phi + tr\{\Omega_i^{-1} W_i^T \Sigma^{-1} W_i\})}\right)}{\left(\sqrt{(\phi + tr\{\Omega_i^{-1} W_i^T \Sigma^{-1} W_i\})/\chi}\right)^{\omega + \frac{P D_i}{2}}}.    (5)\n\nThe marginal (5) has fat tails compared to the matrix-variate Gaussian. In particular, the family contains the matrix-variate Student-t, the matrix-variate Laplace and the matrix-variate Variance-Gamma as special cases. Several of the multivariate equivalents have recently been used as priors to induce sparsity in the Bayesian paradigm, both in the context of supervised [19, 11] and unsupervised linear Gaussian models [4].\n\n4 Sparse Bayesian multiple regression\n\nWe view {W_i}_{i=1}^Q, {\Omega_i}_{i=1}^Q and {\gamma_i}_{i=1}^Q as latent variables that need to be marginalised over. This is motivated by the fact that overfitting is avoided by integrating out all parameters whose cardinality scales with the model complexity, i.e., the number of dimensions and/or the number of tasks. We further introduce a latent projection matrix V \in R^{P \times K} and a set of latent matrices {Z_i}_{i=1}^Q to make a low-rank approximation of the column covariance \Sigma as explained below. 
Note also that \Omega_i captures the correlations between the rows of group i.\n\nThe complete probabilistic model is given by\n\ny_n | W, x_n \sim N(W x_n, \sigma^2 I_P),    W_i | V, Z_i, \Omega_i, \gamma_i \sim N(V Z_i, \gamma_i^{-1} \Omega_i, \tau I_P),    Z_i | \Omega_i, \gamma_i \sim N(0, \gamma_i^{-1} \Omega_i, I_K),    V \sim N(0, I_K, \tau I_P),    \Omega_i \sim W^{-1}(\upsilon, \lambda I_{D_i}),    \gamma_i \sim N^{-1}(\omega, \chi, \phi),    (6)\n\nwhere \sigma^2 is the residual noise variance and \tau is the residual variance associated to W. The graphical model is shown in Fig. 1. We reparametrise the inverse Wishart distribution and define it as follows:\n\n\Omega \sim W^{-1}(\upsilon, \Lambda) = \frac{|\Lambda|^{\frac{D+\upsilon-1}{2}} |\Omega^{-1}|^{\frac{2D+\upsilon}{2}}}{2^{\frac{(D+\upsilon-1)D}{2}} \Gamma_D(\frac{D+\upsilon-1}{2})} e^{-\frac{1}{2} tr\{\Lambda \Omega^{-1}\}},    \upsilon > 0,\n\nwhere \Gamma_p(z) = \pi^{\frac{p(p-1)}{4}} \prod_{j=1}^p \Gamma(z + \frac{1-j}{2}).\n\nUsing the compact notations W = (W_1, ..., W_Q), Z = (Z_1, ..., Z_Q), \Omega = diag{\Omega_1, ..., \Omega_Q} and \Gamma = diag{\gamma_1 I_{D_1}, ..., \gamma_Q I_{D_Q}}, we can compute the following marginal:\n\np(W | V, \Omega) \propto \int\int N(VZ, \Gamma^{-1}\Omega, \tau I_P) N(0, \Gamma^{-1}\Omega, I_K) p(\Gamma) dZ d\Gamma = \int N(0, \Gamma^{-1}\Omega, VV^T + \tau I_P) p(\Gamma) d\Gamma.\n\nThus, the probabilistic model induces sparsity in the blocks of W, while taking correlations between the task parameters into account through the random matrix \Sigma \approx VV^T + \tau I_P. This is especially useful when there is a very large number of tasks.\n\nThe latent variables Z = {W, V, Z, \Omega, \Gamma} are inferred by variational EM [27], while the hyperparameters \vartheta = {\sigma^2, \tau, \upsilon, \lambda, \omega, \chi, \phi} are estimated by type II ML [8, 25]. 
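Schematically, this fitting procedure alternates approximate posterior updates with hyperparameter updates until a bound-based stopping criterion is met. The following Python skeleton illustrates only the control flow; all update functions are hypothetical placeholders passed in by the caller, not the paper's actual closed-form updates:

```python
# Sketch of a variational-EM loop with type II ML hyperparameter updates.
# update_posterior, update_hyperparams and lower_bound are placeholders.
def fit(update_posterior, update_hyperparams, lower_bound,
        theta, q, tol=1e-6, max_iter=100):
    prev = -float("inf")
    for _ in range(max_iter):
        q = update_posterior(q, theta)     # E-step: update the factors of q(Z)
        theta = update_hyperparams(q)      # type II ML step for the hyperparameters
        bound = lower_bound(q, theta)      # monotonically non-decreasing
        if bound - prev < tol:
            break
        prev = bound
    return q, theta
```

A toy instantiation (e.g., a scalar "posterior" and "hyperparameter" pulled towards each other) converges in a couple of iterations; in the model itself the E-step would update each factor q(W), q(V), q(Z), q(\Omega), q(\Gamma) in turn.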
Using variational inference is motivated by the fact that deterministic approximate inference schemes converge faster than traditional sampling methods such as Markov chain Monte Carlo (MCMC), and their convergence can easily be monitored. We prefer learning the hyperparameters by type II ML to placing vague priors over them, although the latter would also be a valid option.\n\nIn order to find a tractable solution, we assume that the variational posterior q(Z) = q(W, V, Z, \Omega, \Gamma) factorises as q(W) q(V) q(Z) q(\Omega) q(\Gamma) given the data D = {(y_n, x_n)}_{n=1}^N [7]. The variational EM combined with the type II ML estimation of the hyperparameters cycles through the following two steps until convergence:\n\n1. Update of the approximate posterior of the latent variables and parameters for fixed hyperparameters. The update for W is given by\n\nq(W) \propto e^{\langle \ln p(D, Z | \vartheta) \rangle_{q(Z/W)}},    (7)\n\nwhere Z/W is the set Z with W removed and \langle \cdot \rangle_q denotes the expectation with respect to q. The posteriors of the other latent matrices have the same form.\n\n2. Update of the hyperparameters for fixed variational posteriors:\n\n\vartheta \leftarrow argmax_{\vartheta} \langle \ln p(D, Z | \vartheta) \rangle_{q(Z)}.    (8)\n\nVariational EM converges to a local maximum of the log-marginal likelihood. The convergence can be checked by monitoring the variational lower bound, which increases monotonically during the optimisation. Next, we give the explicit expression of the variational EM steps and the updates for the hyperparameters; the expression of the variational bound is given in Supplemental Appendix D.\n\n4.1 Variational E step (mean field)\n\nAssuming a factorised posterior enables us to compute it in closed form as the priors are each conjugate to the Gaussian likelihood. 
The approximate posterior is given by\n\nq(Z) = N(M_W, \Omega_W, S_W) N(M_V, \Omega_V, S_V) N(M_Z, \Omega_Z, S_Z) \times \prod_i W^{-1}(\upsilon_i, \Lambda_i) N^{-1}(\omega_i, \chi_i, \phi_i).    (9)\n\nThe expressions of the posterior parameters are given in Supplemental Appendix C. The computational bottleneck resides in the inversion of \Omega_W, which is O(D^3) per iteration. When D > N, we can use the Woodbury identity to reduce the matrix inversion to a complexity of O(N^3) per iteration.\n\n4.2 Hyperparameter updates\n\nTo learn the degree of sparsity from the data we optimise the hyperparameters. There are no closed-form updates for {\omega, \chi, \phi}. Hence, we need to find the root of the following expressions, e.g., by line search:\n\n\omega :  Q \ln\sqrt{\phi/\chi} - Q \frac{d \ln K_\omega(\sqrt{\chi\phi})}{d\omega} + \sum_i \langle \ln \gamma_i \rangle = 0,    (10)\n\n\chi :  \frac{Q}{2} \sqrt{\phi/\chi}\, R_\omega(\sqrt{\chi\phi}) - \frac{Q\omega}{\chi} - \frac{1}{2} \sum_i \langle \gamma_i^{-1} \rangle = 0,    (11)\n\n\phi :  \frac{Q}{2} \sqrt{\chi/\phi}\, R_\omega(\sqrt{\chi\phi}) - \frac{1}{2} \sum_i \langle \gamma_i \rangle = 0,    (12)\n\nwhere the moment identities of the generalised inverse Gaussian (Supplemental Appendix A) were invoked and R_\omega(x) = K_{\omega+1}(x)/K_\omega(x). Unfortunately, the derivative in the first equation needs to be estimated numerically. When considering special cases of the mixing density, such as the Gamma or the inverse Gamma, simplified updates are obtained and no numerical differentiation is required. Due to space constraints, we omit the type II ML updates for the other hyperparameters.\n\n4.3 Predictions\n\nPredictions are performed by Bayesian averaging. The predictive distribution is approximated as follows:\n\np(y_* | x_*) \approx \int p(y_* | W, x_*) q(W) dW = N(M_W x_*, (\sigma^2 + x_*^T \Omega_W x_*) I_P).\n\n5 Sparse Bayesian multiple classification\n\nWe restrict ourselves to multiple binary classifiers and consider a probit model in which the likelihood is derived from the Gaussian cumulative density. A probit model is equivalent to Gaussian noise combined with a step function likelihood [1]. Let t_n \in R^P be the class label vectors, with t_{np} \in {-1, +1} for all n. The likelihood is replaced by\n\ny_n | W, x_n \sim N(W x_n, \sigma^2 I_P),    t_n | y_n \sim \prod_p I(t_{np} y_{np}),    (13)\n\nwhere I(z) = 1 for z \geq 0 and 0 otherwise. The rest of the model is as before; we will set \sigma = 1. The latent variables to infer are now Y and Z. Again, we assume a factorised posterior. We further assume the variational posterior q(Y) is a product of truncated Gaussians (see Supplemental Appendix B):\n\nq(Y) \propto \prod_n \prod_p I(t_{np} y_{np}) N(\nu_{np}, 1) = \prod_{t_{np}=+1} N_+(\nu_{np}, 1) \prod_{t_{np}=-1} N_-(\nu_{np}, 1),    (14)\n\nwhere \nu_{np} is the pth entry of \nu_n = M_W x_n. The other variational and hyperparameter updates are unchanged, except that Y is replaced by the matrix \nu_\pm, whose elements are defined in Supplemental Appendix B.\n\n5.1 Bayesian classification\n\nIn Bayesian classification the goal is to predict the label with highest posterior probability. Based on the variational approximation we propose the following classification rule:\n\n\hat{t}_* = argmax_{t_*} P(t_* | T) \approx argmax_{t_*} \prod_p \int N_{t_{*p}}(\nu_{*p}, 1) dy_{*p} = argmax_{t_*} \prod_p \Phi(t_{*p} \nu_{*p}),    (15)\n\nwhere \nu_* = M_W x_*. Hence, to decide whether the label t_{*p} is -1 or +1, it is sufficient to use the sign of \nu_{*p} as the decision rule. 
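Both prediction rules fit in a few lines; the following is a minimal NumPy/SciPy sketch, with a toy posterior mean M_W, row covariance \Omega_W and noise variance standing in for quantities learnt by the variational EM:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
P, D = 3, 5

# Toy stand-ins for the learnt variational posterior over W.
M_W = rng.standard_normal((P, D))        # posterior mean of W
A = rng.standard_normal((D, D))
Omega_W = A @ A.T + np.eye(D)            # posterior row covariance of W
sigma2 = 0.1                             # residual noise variance

x_star = rng.standard_normal(D)

# Regression: p(y* | x*) = N(M_W x*, (sigma^2 + x*^T Omega_W x*) I_P)
mean = M_W @ x_star
var = sigma2 + x_star @ Omega_W @ x_star  # shared across the P tasks

# Probit classification: nu* = M_W x*; predict sign(nu*), with
# per-task confidence Phi(t * nu*) (sigma fixed to 1 in the model).
nu_star = M_W @ x_star
t_hat = np.where(nu_star >= 0, 1, -1)
confidence = norm.cdf(t_hat * nu_star)    # each entry >= 0.5 by construction
```

Here M_W, Omega_W, sigma2 and x_star are synthetic placeholders; in the model they would come out of the E step of Section 4.1.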
However, the probability P(t_{*p} | T) also tells us how confident we are in the prediction we make.\n\nFigure 2: Results for the ground truth data set. Top left: Prediction accuracy on a test set as a function of training set size. Top right: estimated and true \Sigma (top), true underlying sparsity pattern (middle) and inverse of the posterior mean of {\gamma_i}_i showing that the sparsity is correctly captured (bottom). Bottom diagrams: Hinton diagram of the true W (bottom), the ordinary least squares learnt W (middle) and the sparse Bayesian multi-task learnt W (top). The ordinary least squares learnt W contains many non-zero elements.\n\n6 A model study with ground truth data\n\nTo understand the properties of the model we study a regression problem with known parameters. Figure 2 shows the results for 5 tasks and 50 features. Matrix W is drawn using V = [\sqrt{.9}, \sqrt{.9}, \sqrt{.9}, -\sqrt{.9}, -\sqrt{.9}]^T and \tau = 0.1, i.e. the covariance for vec(W) has 1's on the diagonal and \pm.9 on the off-diagonal elements. The first three tasks and the last two tasks are positively correlated. There is a negative correlation between the two groups. The active features are randomly selected among the 50 candidate features. We evaluate the models with 10^4 test points and repeat the experiment 25 times. Gaussian noise was added to the targets (\sigma = 0.1).\n\nIt can be observed that the proposed model performs better and converges faster to the optimal performance when the data set size increases compared to ordinary least squares. 
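For concreteness, the implied task covariance \Sigma = V V^T + \tau I_P can be checked numerically; a small sketch assuming NumPy, with the sign pattern of V following the correlation structure described above:

```python
import numpy as np

# V and tau as in the ground-truth experiment: three positively correlated
# tasks, two positively correlated tasks, negative correlation across groups.
s = np.sqrt(0.9)
V = np.array([[s], [s], [s], [-s], [-s]])   # P x K with P = 5, K = 1
tau = 0.1

Sigma = V @ V.T + tau * np.eye(5)           # task covariance for vec(W)

assert np.allclose(np.diag(Sigma), 1.0)              # unit variances
off_diag = Sigma[~np.eye(5, dtype=bool)]
assert np.allclose(np.abs(off_diag), 0.9)            # +/- 0.9 correlations
assert Sigma[0, 1] > 0 and Sigma[0, 3] < 0           # within vs across groups
```

The assertions confirm the claimed structure: unit diagonal and off-diagonal entries of \pm.9, positive within each group and negative across the two groups.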
Note also that both \Sigma and the sparsity pattern are correctly identified.\n\nTable 1: Performance (with standard deviation) of classification tasks on the Yeast and Scene data sets in terms of accuracy and AUC. LR: Bayesian logistic regression; Pooling: pooling all data and learning a single model; Xue: the matrix stick-breaking process based multi-task learning model proposed in [33]. K = 10 for the proposed models (i.e., Laplace, Student-t, and ARD). Note that the first five rows for the Yeast and Scene data sets are reported in [29]. The reported performances are averaged over five randomised repetitions.\n\nModel | Yeast Accuracy | Yeast AUC | Scene Accuracy | Scene AUC\nLR | 0.5047 | 0.5049 | 0.7362 | 0.6153\nPool | 0.4983 | 0.5112 | 0.7862 | 0.5433\nXue [33] | 0.5106 | 0.5105 | 0.7765 | 0.5603\nModel-1 [29] | 0.5212 | 0.5244 | 0.7756 | 0.6325\nModel-2 [29] | 0.5424 | 0.5406 | 0.7911 | 0.6416\nChen [15] | NA | 0.7987\u00b10.0044 | NA | 0.9160\u00b10.0038\nLaplace | 0.7987\u00b10.0017 | 0.8349\u00b10.0020 | 0.8892\u00b10.0038 | 0.9188\u00b10.0041\nStudent | 0.7988\u00b10.0017 | 0.8349\u00b10.0019 | 0.8897\u00b10.0034 | 0.9183\u00b10.0041\nARD | 0.7987\u00b10.0020 | 0.8349\u00b10.0020 | 0.8896\u00b10.0044 | 0.9187\u00b10.0042\n\n7 Multi-task classification experiments\n\nIn this section, we evaluate the proposed model on two data sets: Yeast [17] and Scene [9], which have been widely used as testbeds to evaluate multi-task learning approaches [28, 29, 15]. 
To demonstrate the superiority of the proposed models, we conduct systematic empirical evaluations, including comparisons with (1) Bayesian logistic regression (BLR), which learns the tasks separately, (2) a pooling model that pools all data together and learns a single model collectively, and (3) the state-of-the-art multi-task learning methods proposed in [33, 29, 15].\n\nWe follow the experimental setting introduced in [29] for fair comparison, and omit the details due to space limitations. We evaluate all methods for the classification task using two metrics: (1) overall accuracy at a threshold of zero and (2) the average area under the curve (AUC). Results on the Yeast and Scene data sets using these two metrics are reported in Table 1. It is interesting to note that even for small values of K (fewer parameters in the column covariance) the proposed model achieves good results. We also study how the performance varies with K on a tuning set, and observe no significant differences for different values of K (not shown in the paper). The results in Table 1 were produced with K = 10.\n\nThe proposed models (Laplace, Student-t, ARD) significantly outperform the Bayesian logistic regression approach that learns each task separately. This observation agrees with previous work [6, 31, 2, 5] demonstrating that the multi-task approach is beneficial over the naive approach of learning tasks separately. For the Yeast data set, the proposed models are significantly better than \u201cXue\u201d [33], Model-1 and Model-2 [29], and the best performing model in [15]. For the Scene data set, our models and the model in [15] show comparable results.\n\nThe advantage of using hierarchical priors is particularly evident in a low data regime. To study the impact of training set size on performance, we report the accuracy and AUC as functions of the training set size in Figure 3. 
For this experiment, we use a single test set of size 1196, which replicates the experimental setup in [29]. Figure 3 shows that the proposed Bayesian methods perform well overall, and that their performance is not significantly impacted when the amount of data is small. Similar results were obtained for the Yeast data set.\n\nFigure 3: Model comparisons in terms of classification accuracy and AUC on the Scene data set for K = 10. Error bars represent 3 times the standard deviation. Results for Bayesian logistic regression (BLR), Model-1 and Model-2 are obtained based on measurements made with a ruler on Figure 2 in [29], for which no error bars are given.\n\n8 Conclusion\n\nIn this work we proposed a Bayesian multi-task learning model able to capture correlations between tasks and to learn the sparsity pattern of the data features simultaneously. We further proposed a low-rank approximation of the covariance to handle a very large number of tasks. Combining low rank and sparsity at the same time has been a long-standing open issue in machine learning. Here, we are able to achieve this goal by exploiting the special structure of the parameter set. Hence, the proposed model combines sparsity and low rank in a different manner than [10], where the sum of a sparse and a low-rank matrix is considered.\n\nBy considering a matrix-variate Gaussian scale mixture prior we extended the Bayesian group LASSO to a more general family of group sparsity inducing priors. This suggests extending current Bayesian methodology to learn structured sparsity from data in the future. A possible extension is to consider the graphical LASSO to learn sparse precision matrices \Omega^{-1} and \Sigma^{-1}. A similar approach was explored in [35].\n\nReferences\n\n[1] J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. J.A.S.A., 88(422):669\u2013679, 1993.\n\n[2] R. K. Ando and T. Zhang. 
A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6:1817\u20131853, 2005.\n\n[3] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society B, 36(1):99\u2013102, 1974.\n\n[4] C. Archambeau and F. Bach. Sparse probabilistic projections. In NIPS. MIT Press, 2008.\n\n[5] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73:243\u2013272, 2008.\n\n[6] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. JMLR, 4:83\u201399, 2003.\n\n[7] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.\n\n[8] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, New York, 1985.\n\n[9] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757\u20131771, 2004.\n\n[10] E. J. Cand\u00e8s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58:1\u201337, June 2011.\n\n[11] F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In ICML, pages 88\u201395. ACM, 2008.\n\n[12] R. Caruana. Multitask learning. Machine Learning, 28(1):41\u201375, 1997.\n\n[13] O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, and B. Tseng. Multi-task learning for boosting with application to web search ranking. In SIGKDD, pages 1189\u20131198, 2010.\n\n[14] R. Chari, W. W. Lockwood, B. P. Coe, A. Chu, D. Macey, A. Thomson, J. J. Davies, C. MacAulay, and W. L. Lam. Sigma: A system for integrative genomic microarray analysis of cancer genomes. 
BMC Genomics, 7:324, 2006.\n\n[15] J. Chen, J. Liu, and J. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. In SIGKDD, pages 1179\u20131188. ACM, 2010.\n\n[16] A. P. Dawid. Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika, 68(1):265\u2013274, 1981.\n\n[17] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In NIPS, 2002.\n\n[18] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. JMLR, 6:615\u2013637, 2005.\n\n[19] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on PAMI, 25:1150\u20131159, 2003.\n\n[20] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.\n\n[21] D. Hern\u00e1ndez-Lobato, J. M. Hern\u00e1ndez-Lobato, T. Helleputte, and P. Dupont. Expectation propagation for Bayesian multi-task feature selection. In ECML-PKDD, pages 522\u2013537, 2010.\n\n[22] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. In NIPS, pages 745\u2013752, 2009.\n\n[23] T. Jebara. Multitask sparsity via maximum entropy discrimination. JMLR, 12:75\u2013110, 2011.\n\n[24] B. J\u00f8rgensen. Statistical Properties of the Generalized Inverse Gaussian Distribution. Springer, 1982.\n\n[25] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415\u2013447, 1992.\n\n[26] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. In ECCV, 2008.\n\n[27] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355\u2013368. MIT Press, 1998.\n\n[28] P. Rai and H. Daum\u00e9 III. Multi-label prediction via sparse infinite CCA. In NIPS, pages 1518\u20131526, 2009.\n\n[29] P. Rai and H. Daum\u00e9 III. 
Infinite predictor subspace models for multitask learning. In AISTATS, pages 613\u2013620, 2010.\n\n[30] S. Raman, T. J. Fuchs, P. J. Wild, E. Dahl, and V. Roth. The Bayesian group-Lasso for analyzing contingency tables. In ICML, pages 881\u2013888, 2009.\n\n[31] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, pages 762\u2013769. IEEE Computer Society, 2004.\n\n[32] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183\u20132202, 2009.\n\n[33] Y. Xue, D. Dunson, and L. Carin. The matrix stick-breaking process for flexible multi-task learning. In ICML, pages 1063\u20131070, 2007.\n\n[34] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 68(1):49\u201367, 2006.\n\n[35] Y. Zhang and J. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In NIPS, pages 2550\u20132558, 2010.\n", "award": [], "sourceid": 990, "authors": [{"given_name": "Shengbo", "family_name": "Guo", "institution": null}, {"given_name": "Onno", "family_name": "Zoeter", "institution": null}, {"given_name": "C\u00e9dric", "family_name": "Archambeau", "institution": null}]}