{"title": "Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2339, "page_last": 2347, "abstract": "We introduce a variational Bayesian inference algorithm which can be widely applied to sparse linear models. The algorithm is based on the spike and slab prior which, from a Bayesian perspective, is the golden standard for sparse inference. We apply the method to a general multi-task and multiple kernel learning model in which a common set of Gaussian process functions is linearly combined with task-specific sparse weights, thus inducing relation between tasks. This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multi-output Gaussian process regression, multi-class classification, image processing applications and collaborative filtering.", "full_text": "Spike and Slab Variational Inference for Multi-Task\n\nand Multiple Kernel Learning\n\nMichalis K. Titsias\n\nUniversity of Manchester\nmtitsias@gmail.com\n\nMiguel L\u00b4azaro-Gredilla\n\nUniv. de Cantabria & Univ. Carlos III de Madrid\n\nmiguel@tsc.uc3m.es\n\nAbstract\n\nWe introduce a variational Bayesian inference algorithm which can be widely\napplied to sparse linear models. The algorithm is based on the spike and slab prior\nwhich, from a Bayesian perspective, is the golden standard for sparse inference.\nWe apply the method to a general multi-task and multiple kernel learning model\nin which a common set of Gaussian process functions is linearly combined with\ntask-speci\ufb01c sparse weights, thus inducing relation between tasks. 
This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multi-output Gaussian process regression, multi-class classification, image processing applications and collaborative filtering.

1 Introduction

Sparse inference has found numerous applications in statistics and machine learning [1, 2, 3]. It is a generic idea that can be combined with popular models such as linear regression, factor analysis and, more recently, multi-task and multiple kernel learning models. In the regularization theory literature, sparse inference is tackled via ℓ1 regularization [2], which requires expensive cross-validation for model selection. From a Bayesian perspective, the spike and slab prior [1, 4, 5], also called the two-groups prior [6], is the gold standard for sparse linear models. However, the discrete nature of the prior makes Bayesian inference a very challenging problem. Specifically, for M linear weights, inference under a spike and slab prior distribution on those weights requires a combinatorial search over 2^M possible models. The difficulties encountered when working with the spike and slab prior led several researchers to consider soft-sparse or shrinkage priors such as the Laplace and other related scale mixtures of normals [3, 7, 8, 9, 10]. However, such priors are not ideal since they assign zero probability mass to events associated with weights having zero value.

In this paper, we introduce a simple and efficient variational inference algorithm based on the spike and slab prior which can be widely applied to sparse linear models.
The novel characteristic of this algorithm is that the variational distribution over the sparse weights has a factorial nature, i.e., it can be written as a mixture of 2^M components, where M is the number of weights. Unlike the standard mean field approximation, which uses a unimodal variational distribution, our variational algorithm can more precisely match the combinatorial nature of the posterior distribution over the weights. We will show that the proposed variational approach is more accurate and more robust to unfavorable initializations than the standard mean field variational approximation.

We apply the variational method to a general multi-task and multiple kernel learning model that expresses the correlation between tasks by letting them share a common set of Gaussian process latent functions. Each task is modeled by linearly combining these latent functions with task-specific weights which are given a spike and slab prior distribution. This model is a spike and slab Bayesian reformulation of previous Gaussian process-based single-task multiple kernel learning methods [11, 12, 13] and multi-task Gaussian processes (GPs) [14, 15, 16, 17]. Further, this model unifies several sparse linear models, such as generalized linear models, factor analysis, probabilistic PCA and matrix factorization with missing values. In the experiments, we apply the variational inference algorithms to all the above models and present results in multi-output regression, multi-class classification, image denoising, image inpainting and collaborative filtering.

2 Spike and slab multi-task and multiple kernel learning

Section 2.1 discusses the spike and slab multi-task and multiple kernel learning (MTMKL) model that linearly combines Gaussian process latent functions.
Spike and slab factor analysis and probabilistic PCA are discussed in Section 2.2, while missing values are dealt with in Section 2.3.

2.1 The model

Let D = {X, Y}, with X ∈ R^{N×D} and Y ∈ R^{N×Q}, be a dataset such that the n-th row of X is an input vector x_n and the n-th row of Y is the set of Q corresponding tasks or outputs. We use y_q to refer to the q-th column of Y and y_nq to the (n, q) entry. Outputs Y are then assumed to be generated according to the following hierarchical Bayesian model:

y_nq ∼ N(y_nq | f_q(x_n), σ_q²),   ∀n,q   (1a)
f_q(x) = Σ_{m=1}^M w_qm φ_m(x) = w_q^⊤ φ(x),   ∀q   (1b)
w_qm ∼ π N(w_qm | 0, σ_w²) + (1 − π) δ_0(w_qm),   ∀q,m   (1c)
φ_m(x) ∼ GP(µ_m(x), k_m(x_i, x_j)),   ∀m.   (1d)

Here, each µ_m(x) is a mean function, k_m(x_i, x_j) a covariance function, w_q = [w_q1, ..., w_qM]^⊤, φ(x) = [φ_1(x), ..., φ_M(x)]^⊤ and δ_0(w_qm) denotes the Dirac delta function centered at zero. Since each of the Q tasks is a linear combination of the same set of latent functions {φ_m(x)}_{m=1}^M (where typically M < Q), correlation is induced in the outputs. Sharing a common set of features means that "knowledge transfer" between tasks can occur and latent functions are inferred more accurately, since data belonging to all tasks are used.

Several linear models can be expressed as special cases of the above. For instance, a generalized linear model is obtained when the GPs are Dirac delta measures (with zero covariance functions) that deterministically assign each φ_m(x) to its mean function µ_m(x). However, the model in (1) has a number of additional features not present in standard linear models. Firstly, the basis functions are no longer deterministic; they are instead drawn from different GPs, so an extra layer of flexibility is added to the model.
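As a concrete illustration of the generative process (1a)–(1d), the following sketch samples tasks from the model. This is our own illustrative code, not the authors' implementation; the squared-exponential choice for each k_m, and all names, are ours.

```python
import numpy as np

def sample_mtmkl(X, Q=3, M=5, pi=0.3, sigma_w=1.0, sigma_y=0.1, ell=1.0, seed=0):
    """Draw outputs from the spike-and-slab MTMKL generative model (1a)-(1d)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # (1d): each latent function phi_m is a zero-mean GP draw
    # (here with a squared-exponential kernel, plus jitter for stability)
    d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
    K = np.exp(-d2 / (2 * ell ** 2)) + 1e-8 * np.eye(N)
    Phi = rng.multivariate_normal(np.zeros(N), K, size=M).T        # N x M
    # (1c): spike-and-slab weights, exactly zero with probability 1 - pi
    S = rng.random((Q, M)) < pi
    W = S * rng.normal(0.0, sigma_w, size=(Q, M))                  # Q x M
    # (1b) + (1a): linear combination of latent functions plus Gaussian noise
    F = Phi @ W.T                                                  # N x Q
    Y = F + rng.normal(0.0, sigma_y, size=(N, Q))
    return Y, W, Phi

X = np.linspace(-5, 5, 50)[:, None]
Y, W, Phi = sample_mtmkl(X)
```

Because the spike places point mass at zero, some entries of W come out exactly zero, which is what allows unused latent functions to be switched off.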
Thus, a posterior distribution over the basis functions of the generalized linear model can be inferred from data. Secondly, a truly sparse prior, the spike and slab prior (1c), is placed over the weights of the model. Specifically, with probability 1 − π each w_qm is zero, and with probability π it is drawn from a Gaussian. This contrasts with previous approaches [3, 7, 8, 9, 13], in which soft-sparse priors that assign zero probability mass to the weights being exactly zero were used. Hyperparameters π and σ_w² are learnable in order to determine the amount of sparsity and the spread of the nonzero weights, respectively. Thirdly, the number of basis functions M can be inferred from data, since the sparse prior on the weights allows basis functions to be "switched off" as necessary by setting the corresponding weights to zero.

Further, the model in (1) can be considered as a spike and slab Bayesian reformulation of previous multi-task [14, 15] and multiple kernel learning [11, 12] methods that learn the weights using maximum likelihood. Assuming the weights w_q are given, each output function y_q(x) is a GP with covariance function

Cov[y_q(x_i), y_q(x_j)] = Σ_{m=1}^M w_qm² k_m(x_i, x_j),

which clearly consists of a conic combination of kernel functions. Therefore, the proposed model can be reinterpreted as multiple kernel learning in which the weights of each kernel are assigned spike and slab priors in a full Bayesian formulation.

2.2 Sparse factor and principal component analysis

An interesting case arises when µ_m(x) = 0 and k_m(x_i, x_j) = δ_ij ∀m, where δ_ij is the Kronecker delta. This says that each latent function is drawn from a white process, so that it consists of independent values, each following the standard normal distribution.
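The conic-combination view is easy to check numerically. The sketch below is our own (the two squared-exponential base kernels are hypothetical stand-ins for the k_m): since each weight enters squared, the combined kernel remains positive semidefinite even when a weight is negative.

```python
import numpy as np

# Two base kernels on a 1-D grid (squared-exponential, different lengthscales)
x = np.linspace(0, 1, 20)
d2 = (x[:, None] - x[None, :]) ** 2
kernels = [np.exp(-d2 / (2 * 0.1 ** 2)), np.exp(-d2 / (2 * 0.5 ** 2))]

w = np.array([0.7, -1.3])  # task-specific weights; one could be zero under the spike
K_combined = sum((wm ** 2) * Km for wm, Km in zip(w, kernels))

# Squared weights give a conic (non-negative) combination, so K_combined
# stays symmetric positive semidefinite up to numerical error.
eigvals = np.linalg.eigvalsh(K_combined)
```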
We first define matrices Φ ∈ R^{N×M} and W ∈ R^{Q×M}, whose elements are, respectively, φ_nm = φ_m(x_n) and w_qm. Then, the model in (1) reduces to

Y = ΦW^⊤ + ξ,   (2a)
w_qm ∼ π N(w_qm | 0, σ_w²) + (1 − π) δ_0(w_qm),   ∀q,m   (2b)
φ_nm ∼ N(φ_nm | 0, 1),   ∀n,m   (2c)
ξ_nq ∼ N(ξ_nq | 0, σ_q²),   ∀n,q,   (2d)

where ξ is an N × Q noise matrix with entries ξ_nq. The resulting model thus corresponds to sparse factor analysis or sparse probabilistic PCA (when the noise is homoscedastic, i.e., σ_q² is constant for all q). Observe that the sparse spike and slab prior is placed on the factor loadings W.

2.3 Missing values

The method can easily handle missing values and can thus be applied to problems involving matrix completion and collaborative filtering. More precisely, in the presence of missing values we have a binary matrix Z ∈ R^{N×Q} that indicates the observed elements in Y. Using Z, the likelihood in (1a) is modified according to y_nq ∼ N(y_nq | f_q(x_n), σ_q²), ∀n,q s.t. [Z]_nq = 1. In the experiments we consider missing values in applications such as image inpainting and collaborative filtering.

3 Efficient variational inference

The presence of the Dirac delta mass function makes the application of variational approximate inference algorithms in spike and slab Bayesian models troublesome. However, there exists a simple reparameterization of the spike and slab prior that is more amenable to approximate inference methods. Specifically, assume a Gaussian random variable w̃_qm ∼ N(w̃_qm | 0, σ_w²) and a Bernoulli random variable s_qm ∼ π^{s_qm}(1 − π)^{1−s_qm}. The product s_qm w̃_qm forms a new random variable that follows the probability distribution in eq. (1c). This allows us to reparameterize w_qm according to w_qm = s_qm w̃_qm and to assign the above prior distributions to s_qm and w̃_qm. Thus, the reparameterized spike and slab prior takes the form

p(w̃_qm, s_qm) = N(w̃_qm | 0, σ_w²) π^{s_qm}(1 − π)^{1−s_qm},   ∀q,m.   (3)

Notice that the presence of w_qm in the likelihood function in (1a) is now replaced by the product s_qm w̃_qm. After the above reparameterization, a standard mean field variational method uses a variational distribution over W̃ = {w̃_q}_{q=1}^Q and S = {s_q}_{q=1}^Q that factorizes as q(W̃, S) = Π_{q=1}^Q q(w̃_q, s_q), where

q(w̃_q, s_q) = q(w̃_q) q(s_q) = N(w̃_q | µ_{w_q}, Σ_{w_q}) Π_{m=1}^M γ_qm^{s_qm}(1 − γ_qm)^{1−s_qm}   (4)

and where (µ_{w_q}, Σ_{w_q}, γ_q) are variational parameters. Such an approach has been extensively used in [18] and was also considered in [19]. However, the above variational distribution leads to a very inefficient approximation. This is because (4) is a unimodal distribution, and therefore has limited capacity when approximating the factorial true posterior distribution, which can have exponentially many modes. To analyze the nature of the true posterior distribution, we consider the following two properties, derived by assuming for simplicity a single output (Q = 1) so that the index q is dropped.

Property 1: The true marginal posterior p(w̃ | Y) can be written as a mixture distribution having 2^M components. This is an obvious fact since p(w̃ | Y) = Σ_s p(w̃ | s, Y) p(s | Y), where the summation involves all 2^M possible values of the binary vector s.

The second property characterizes the nature of each conditional p(w̃ | s, Y) in the above sum.

Property 2: Assume the conditional distribution p(w̃ | s, Y).
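The reparameterization w_qm = s_qm w̃_qm is easy to verify by simulation; a minimal sketch (our own code): the product is exactly zero with probability 1 − π, and its marginal variance is π σ_w².

```python
import numpy as np

rng = np.random.default_rng(1)
pi, sigma_w, n = 0.25, 1.0, 200_000

# Reparameterization of the spike-and-slab prior (1c):
# w_tilde ~ N(0, sigma_w^2), s ~ Bernoulli(pi), and w = s * w_tilde
w_tilde = rng.normal(0.0, sigma_w, n)
s = (rng.random(n) < pi).astype(float)
w = s * w_tilde

frac_zero = np.mean(w == 0.0)  # ~ 1 - pi = 0.75
var_w = np.var(w)              # ~ pi * sigma_w^2 = 0.25
```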
We can write s = s_1 ∪ s_0, where s_1 denotes the elements in s with value one and s_0 the elements with value zero. Using the correspondence between s and w̃, we have w̃ = w̃_1 ∪ w̃_0. Then, p(w̃ | s, Y) factorizes as p(w̃ | s, Y) = p(w̃_1 | Y) N(w̃_0 | 0, σ_w² I_{|w̃_0|}), which says that the posterior over w̃_0 given s_0 = 0 is equal to the prior over w̃_0. This property is obvious because w̃_0 and s_0 appear in the likelihood as an elementwise product w̃_0 ∘ s_0; thus, when s_0 = 0, w̃_0 becomes disconnected from the data.

The standard variational distribution in (4) ignores these properties and approximates the marginal p(w̃ | Y), which is a mixture with 2^M components, with a single Gaussian distribution. Next we present an alternative variational approximation that takes into account the above properties.

3.1 The proposed variational method

In the reparameterized spike and slab prior, each pair of variables {w̃_qm, s_qm} is strongly correlated since their product is the underlying variable that interacts with the data. Thus, a sensible approximation must treat each pair {w̃_qm, s_qm} as a unit, so that {w̃_qm, s_qm} are placed in the same factor of the variational distribution. The simplest factorization that achieves this is

q(w̃_q, s_q) = Π_{m=1}^M q(w̃_qm, s_qm).   (5)

This variational distribution yields a marginal q(w̃_q) which has 2^M components. This can be seen by writing q(w̃_q) = Π_{m=1}^M [q(w̃_qm, s_qm = 1) + q(w̃_qm, s_qm = 0)] and then multiplying out the terms, so that a mixture of 2^M components is obtained. Therefore, Property 1 is satisfied by (5). It turns out that Property 2 is also satisfied. This can be shown by taking the stationary condition for the factor q(w̃_qm, s_qm) when maximizing the variational lower bound (on the true marginal likelihood):

F = ⟨ log [ p(Y | w̃_qm, s_qm, Θ) p(Θ) N(w̃_qm | 0, σ_w²) π^{s_qm}(1 − π)^{1−s_qm} / ( q(w̃_qm, s_qm) q(Θ) ) ] ⟩_{q(w̃_qm, s_qm) q(Θ)},   (6)

where Θ are the remaining random variables in the model (i.e., excluding {w̃_qm, s_qm}) and q(Θ) is their variational distribution. The stationary condition for q(w̃_qm, s_qm) is

q(w̃_qm, s_qm) = (1/Z) e^{⟨log p(Y | w̃_qm, s_qm, Θ)⟩_{q(Θ)}} N(w̃_qm | 0, σ_w²) π^{s_qm}(1 − π)^{1−s_qm},   (7)

where Z is a normalizing constant that does not depend on {w̃_qm, s_qm}. Therefore, we have q(w̃_qm | s_qm = 0) ∝ q(w̃_qm, s_qm = 0) = (C/Z) N(w̃_qm | 0, σ_w²)(1 − π), where C = e^{⟨log p(Y | w̃_qm, s_qm = 0, Θ)⟩_{q(Θ)}} is a constant that does not depend on w̃_qm. From the last expression we obtain q(w̃_qm | s_qm = 0) = N(w̃_qm | 0, σ_w²), which implies that Property 2 is satisfied.

The above remarks regarding the variational distribution (5) are general and hold for many spike and slab probability models, as long as the weights w̃ and the binary variables s interact inside the likelihood function according to w̃ ∘ s.

3.2 Application to the multi-task and multiple kernel learning model

Here, we briefly discuss the variational method applied to the multi-task and multiple kernel model described in Section 2.1 and refer to the supplementary material for the variational EM update equations. The explicit form of the joint probability density function on the training data of model (1) is

p(Y, W̃, S, Φ) = N(Y | Φ(W̃ ∘ S)^⊤, Σ) Π_{q,m} [ N(w̃_qm | 0, σ_w²) π^{s_qm}(1 − π)^{1−s_qm} ] Π_{m=1}^M N(φ_m | µ_m, K_m),

where {W̃, S, Φ} is the whole set of random variables that need to be marginalized out to compute the marginal likelihood. The marginal likelihood is analytically intractable, so we lower bound it using the following variational distribution:

q(W̃, S, Φ) = Π_{q=1}^Q Π_{m=1}^M q(w̃_qm, s_qm) Π_{m=1}^M q(φ_m).   (8)

The stationary conditions of the lower bound result in analytical updates for all factors above. More precisely, q(φ_m) is an N-dimensional Gaussian distribution, and each factor q(w̃_qm, s_qm) leads to a marginal q(w̃_qm) which is a mixture of two Gaussians, where one component is q(w̃_qm | s_qm = 0) = N(w̃_qm | 0, σ_w²), as shown in the previous section.
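For the simplest instance of the model, sparse linear regression y = X(w̃ ∘ s) + noise, the paired factor q(w̃_m, s_m) has a closed-form coordinate update. The following is our own sketch of that update (notation and code are ours; the full MTMKL updates are in the paper's supplementary material): by the stationary condition (7), the slab component q(w̃_m | s_m = 1) is Gaussian, and the Bernoulli weight γ_m follows from the odds obtained by integrating w̃_m out.

```python
import numpy as np

def pmf_spike_slab_regression(X, y, sigma2=0.01, sigma_w2=1.0, pi=0.25, n_sweeps=50):
    """Paired mean-field (PMF) coordinate updates for sparse linear regression,
    keeping {w_tilde_m, s_m} together in one variational factor (a sketch)."""
    N, M = X.shape
    xx = np.sum(X ** 2, axis=0)                      # ||x_m||^2 for each column
    tau2 = 1.0 / (1.0 / sigma_w2 + xx / sigma2)      # var of q(w_tilde_m | s_m = 1)
    mu = np.zeros(M)                                 # mean of q(w_tilde_m | s_m = 1)
    gamma = np.full(M, 0.5)                          # gamma_m = q(s_m = 1)
    for _ in range(n_sweeps):
        for m in range(M):
            Ew = gamma * mu                          # E[w_tilde_j * s_j]
            r = y - X @ Ew + X[:, m] * Ew[m]         # residual excluding factor m
            mu[m] = tau2[m] * (X[:, m] @ r) / sigma2
            # log-odds of s_m = 1: prior odds plus evidence for a nonzero weight
            u = (np.log(pi / (1 - pi))
                 + 0.5 * np.log(tau2[m] / sigma_w2)
                 + 0.5 * mu[m] ** 2 / tau2[m])
            gamma[m] = 1.0 / (1.0 + np.exp(-u))
    return gamma * mu, gamma                         # E[w], inclusion probabilities

rng = np.random.default_rng(0)
N, M = 100, 8
X = rng.normal(size=(N, M))
w_true = np.array([2.0, -3.0, 2.5, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.normal(size=N)
w_hat, gamma = pmf_spike_slab_regression(X, y)
```

On this synthetic problem the inclusion probabilities of the three truly active weights are driven towards one and the recovered posterior mean is close to the generating weights.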
The optimization proceeds using an EM algorithm that at the E-step updates the factors in (8) and at the M-step updates the hyperparameters {{σ_q}_{q=1}^Q, σ_w², π, {θ_m}_{m=1}^M}, where θ_m parameterizes the kernel matrix K_m. There is, however, one surprise in these updates. The GP hyperparameters θ_m are strongly dependent on the factor q(φ_m) of the corresponding GP latent vector, so updating θ_m while keeping the factor q(φ_m) fixed exhibits slow convergence. This problem is efficiently resolved by applying a marginalized variational step [20], which jointly updates the pair (q(φ_m), θ_m). This more advanced update, together with all remaining updates of the EM algorithm, is discussed in detail in the supplementary material.

4 Assessing the accuracy of the approximation

In this section we compare the proposed variational inference method, in the following called paired mean field (PMF), against the standard mean field (MF) approximation. For simplicity, we consider a single-output linear regression problem where the data are generated according to y = (w̃ ∘ s)^⊤ x + ξ. Moreover, to remove the effect of hyperparameter learning from the comparison, (σ², π, σ_w²) are fixed to known values. The objective of the comparison is to measure the accuracy when approximating the true posterior mean value for the parameter vector, w^tr = E[w̃ ∘ s], where the expectation is under the true posterior distribution. w^tr is obtained by running a very long run of Gibbs sampling. PMF and MF provide alternative approximations w^PMF and w^MF, and the absolute errors between these approximations and w^tr are used to measure accuracy. Since initialization is crucial for variational non-convex algorithms, the accuracy of PMF and MF is averaged over many random initializations of their respective variational distributions.

        soft-error            soft-bound               extreme-error         extreme-bound
MF      0.917 [0.002, 1.930]  -628.9 [-554.6, -793.5]  1.880 [0.965, 2.561]  -895.0 [-618.9, -1483.3]
PMF     0.208 [0.002, 0.454]  -560.7 [-557.8, -564.1]  0.204 [0.002, 0.454]  -560.6 [-557.8, -564.0]

Table 1: Comparison of MF and PMF on the Boston-housing data in terms of approximating the ground truth. Average errors (Σ_{m=1}^{13} |w_m^tr − w_m^appr|) together with 95% confidence intervals (given by percentiles) are shown for soft and extreme initializations. Average values of the variational lower bound are also shown.

For the purpose of the comparison we also derived an efficient paired Gibbs sampler that follows exactly the same principle as PMF. This Gibbs sampler iteratively samples the pair (w̃_m, s_m) from the conditional p(w̃_m, s_m | w̃_{\m}, s_{\m}, y) and has been observed to mix much faster than the standard Gibbs sampler that samples w̃ and s separately. More details about the paired Gibbs sampler are given in the supplementary material.

We considered the Boston-housing dataset, which consists of 456 training examples and 13 inputs. Hyperparameters were fixed to the values (σ² = 0.1 × var(y), π = 0.25, σ_w² = 1), where var(y) denotes the variance of the data. We performed two types of experiments, each repeated 300 times. Each repetition of the first type uses a soft random initialization of each q(s_m = 1) = γ_m from the range (0, 1). The second type uses an extreme random initialization, so that each γ_m is initialized to either 0 or 1.
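The paired Gibbs sampler described above can be sketched for the same sparse linear-regression case (our notation and code, not the authors' reference implementation): s_m is drawn with w̃_m integrated out, and w̃_m is then drawn from its exact conditional given the sampled s_m.

```python
import numpy as np

def paired_gibbs_sweep(X, y, w_tilde, s, sigma2, sigma_w2, pi, rng):
    """One sweep of a paired Gibbs sampler: draw (w_tilde_m, s_m) jointly from
    p(w_tilde_m, s_m | rest, y), first sampling s_m with w_tilde_m collapsed."""
    M = X.shape[1]
    for m in range(M):
        r = y - X @ (w_tilde * s) + X[:, m] * (w_tilde[m] * s[m])
        xx = X[:, m] @ X[:, m]
        tau2 = 1.0 / (1.0 / sigma_w2 + xx / sigma2)
        mu = tau2 * (X[:, m] @ r) / sigma2
        # collapsed log-odds of s_m = 1 (w_tilde_m integrated out)
        u = np.log(pi / (1 - pi)) + 0.5 * np.log(tau2 / sigma_w2) + 0.5 * mu**2 / tau2
        s[m] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-u)) else 0.0
        # then draw w_tilde_m from its conditional given the sampled s_m
        w_tilde[m] = rng.normal(mu, np.sqrt(tau2)) if s[m] else rng.normal(0.0, np.sqrt(sigma_w2))
    return w_tilde, s

rng = np.random.default_rng(0)
N, M = 100, 8
X = rng.normal(size=(N, M))
w_true = np.array([2.0, -3.0, 2.5, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.normal(size=N)

w_tilde, s = rng.normal(size=M), np.ones(M)
samples = []
for it in range(300):
    w_tilde, s = paired_gibbs_sweep(X, y, w_tilde, s, 0.01, 1.0, 0.25, rng)
    if it >= 100:
        samples.append(w_tilde * s)
w_mean = np.mean(samples, axis=0)  # Monte Carlo estimate of E[w_tilde * s]
```

Sampling the pair jointly avoids the slow mixing that occurs when w̃_m and s_m are updated separately (w̃_m is ignored by the likelihood whenever s_m = 0).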
For each run, PMF and MF are initialized to the same variational parameters.

Table 1 reports average absolute errors and also average values of the variational lower bounds. Clearly, PMF is more accurate than MF, achieves significantly higher values for the lower bound and exhibits smaller variance under different initializations. Further, for the more difficult case of extreme initializations the performance of MF becomes worse, while the performance of PMF remains unchanged. This shows that optimization in PMF, although non-convex, is very robust to unfavorable initializations. Similar experiments on other datasets have confirmed the above remarks.

5 Experiments

Toy multi-output regression dataset. To illustrate the capabilities of the proposed model, we first apply it to a toy multi-output dataset with missing observations. Toy data is generated as follows: Ten random latent functions are generated by sampling i.i.d. from zero-mean GPs with the following non-stationary covariance function

k(x_i, x_j) = exp(−(x_i² + x_j²)/20) (4 cos(0.5(x_i − x_j)) + cos(2(x_i − x_j))),

at 201 evenly spaced points in the interval x ∈ [−10, 10]. Ten tasks are then generated by adding Gaussian noise with standard deviation 0.2 to those random latent functions, and two additional tasks consist only of Gaussian noise with standard deviations 0.1 and 0.4. Finally, for each of the 12 tasks, we artificially simulate missing data by removing 41 contiguous observations, as shown in Figure 1. Missing data are not available to any learning algorithm, and will be used to test performance only.
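Expanding each cosine with the angle-difference identity writes this covariance as a sum of four separable terms (two per cosine), scaled by a separable Gaussian envelope, so its Gram matrix on any set of inputs has rank at most 4. A quick numerical check (script is ours):

```python
import numpy as np

def toy_kernel(xi, xj):
    """Non-stationary covariance used to generate the toy tasks."""
    return np.exp(-(xi**2 + xj**2) / 20.0) * (
        4 * np.cos(0.5 * (xi - xj)) + np.cos(2 * (xi - xj)))

x = np.linspace(-10, 10, 201)
K = toy_kernel(x[:, None], x[None, :])
rank = np.linalg.matrix_rank(K)  # 4: each cosine contributes a rank-2 term
```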
Note that the above covariance function is rank-4, so ten out of the twelve tasks will be related, though we do not know how, or which ones.

All tasks are then learned using both independent GPs with the squared exponential (SE) covariance function k_SE(x_i, x_j) = exp(−(x_i − x_j)²/(2ℓ)) and the proposed MTMKL with M = 7 latent functions, each of them also using the SE prior. The hyperparameter ℓ, as well as the noise levels, is learned independently for each latent function. Figure 1 shows the inferred posterior means.

Figure 1: Twelve related tasks and predictions according to independent GPs (blue, continuous line) and MTMKL (red, dashed line). Missing data for each task are represented using green circles.

The mean square error (MSE) between predictions and missing observations for each task is displayed in Table 2. MTMKL is able to infer how tasks are related and then exploit that information to make much better predictions. After learning, only 4 out of the 7 available latent functions remain active, while the other ones are pruned by setting the corresponding weights to zero. This is in correspondence with the generating covariance function, which only had 4 eigenfunctions, showing how model order selection is automatic.

Method \ Task #    1     2      3     4     5     6      7     8     9     10    11     12
Independent GPs    6.51  11.70  7.52  2.49  1.53  18.25  0.41  7.43  2.73  1.81  19.93  93.80
MTMKL              2.09  0.41   1.96  1.90  1.57  4.57   7.71  1.94  1.98  1.20  1.97   2.83

Table 2: MSE performance of independent GPs vs.
MTMKL on the missing observations for each task.

Inferred noise standard deviations for the noise-only tasks are 0.10 and 0.45, and the average for the remaining tasks is 0.22, which agrees well with the stated actual values.

The flowers dataset. Though the proposed model has been designed as a tool for regression, it can also be used approximately to solve classification problems by using output values to identify class membership. In this section we will apply it to the challenging flower identification problem posed in [21]. There are 2040 instances of flowers for training and 6149 for testing, mainly acquired from the web, with varying scales, resolutions, etc., which are labeled into 102 categories. In [21], four relevant features are identified: Color, histogram of gradient orientations and the scale invariant feature transform, sampled on both the foreground region and its boundary.
More information is available at http://www.robots.ox.ac.uk/~vgg/data/flowers/.

For this type of dataset, state-of-the-art performance has been achieved using a weighted linear combination of kernels (one per feature) in a support vector machine (SVM) classifier. A different set of weights is learned for each class. In [22] it is shown that these weights can be learned by solving a convex optimization problem. That is, the standard approach to tackle the flower classification problem would correspond to solving 102 independent binary classification problems, each using a linear combination of 4 kernels. We take a different approach: since all the 102 binary classification tasks are related, we learn all of them at once as a multi-task multiple-kernel problem, hoping that knowledge transfer between them will enhance performance.

For each training instance, we set the corresponding output to +1 for the desired task, whereas the output for the remaining tasks is set to -1. Then we consider using both 10 and 13 latent functions per feature (i.e., M = 40 and M = 52). We measure performance in terms of the recognition rate (RR), which is the average of break-even points (where precision equals recall) for each class; the average area under the curve (AUC); and the multi-class accuracy (MA), which is the rate of correctly classified instances. As a baseline, recall that a random classifier would yield an RR and AUC of 0.5 and an MA of 1/102 = 0.0098.
Results are reported in Table 3.

Method          Latent function #   AUC on test set   RR on test set   MA on test set
MTMKL           M = 40              0.944             0.889            0.329
MTMKL           M = 52              0.952             0.893            0.400
MKL from [21]   M = 408             -                 0.728            -
MKL from [13]   M = 408             0.957             -                -

Table 3: Performance of the different multiple kernel learning algorithms on the flowers dataset.

MTMKL significantly outperforms the state-of-the-art method in [21], yielding performance in line with [13], due to its ability to share information across tasks.

Image denoising and dictionary learning. Here we illustrate denoising on the 256 × 256 "house" image used in [19]. Three noise levels (standard deviations 15, 25 and 50) are considered. Following [19], we partition the noisy image into 62,001 overlapping 8 × 8 blocks and regard each block as a different task. MTMKL is then run using M = 64 "latent blocks", also known as "dictionary elements" (bigger dictionaries do not result in a significant performance increase). For the covariance of the latent functions, we consider two possible choices: either a white covariance function (as in [19]) or an exponential covariance of the form k_EXP(x_i, x_j) = e^{−|x_i − x_j|/ℓ}, where x are the pixel coordinates within each block. The first option is equivalent to placing an independent standard normal prior on each pixel of the dictionary. The second one, on the other hand, introduces correlation between neighboring pixels in the dictionary. Results are shown in Table 4. The exponential covariance clearly enhances performance and produces a more structured dictionary, as can be seen in Figure 3(a). The peak signal-to-noise ratio (PSNR) obtained using the proposed approach is comparable to the state-of-the-art results obtained in [19].

Image inpainting and dictionary learning. We now address the inpainting problem in color images.
Following [19], we consider a color image in which a random 80% of the RGB components are missing. Using an analogous partitioning scheme as in the previous section, we obtain 148,836 blocks of size 8 × 8 × 3, each of which is regarded as a different task. A dictionary size of M = 100 and a white covariance function (which is used in [19]) are selected. Note that we do not apply any other preprocessing to the data or any specific initialization, as is done in [19].

Noise std   Noisy image   White   Expon.
σ = 15      24.66         33.98   34.29
σ = 25      20.22         30.98   31.88
σ = 50      14.20         26.14   28.08

Table 4: PSNR (dB) for the noisy and restored image using several noise levels and covariance functions.

Figure 2: Noisy "house" image with σ = 25 and restored version using the exponential covariance function.

Figure 3: Dictionaries inferred from the noisy (σ = 25) "house" image; and "castle" inpainting results. (a) House: dictionaries for the white and exponential covariances. (b) Castle: missing values, restored and original.

The PSNR of the image after it is restored using MTMKL is 28.94 dB; see Figure 3(b). This result is similar to the results reported in [19] and close to the state-of-the-art result of 29.65 dB achieved in [23].

Collaborative filtering. Finally, we performed an experiment on the 10M MovieLens dataset, which consists of 10 million ratings for 71,567 users and 10,681 films, with ratings ranging in {0.5, 1, 1.5, ..., 4.5, 5}. We followed the setup in [24] and used the ra and rb partitions provided with the database, which split the data into a training and a test set such that there are 10 ratings per user in the test set. We applied the sparse factor analysis model (i.e., sparse PCA but with heteroscedastic noise for the columns of the observation matrix Y, which correspond to films) with M = 20 latent dimensions.
The RMSE was 0.88 for the ra partition and 0.85 for the rb partition, i.e., 0.865 on average. This result is slightly better than the 0.8740 RMSE reported in [24] using GP-LVM.

6 Discussion

In this work we have proposed a spike and slab multi-task and multiple kernel learning model, and derived a novel variational algorithm to perform inference in it. The key contribution that explains the good performance of the algorithm is the choice of a joint distribution over w̃qm and sqm in the variational posterior, as opposed to the usual independence assumption. This has the effect of using exponentially many modes to approximate the posterior, thus rendering it more accurate and much more robust to poor initializations of the variational parameters. The relevance and wide applicability of the proposed model have been illustrated by using it on very diverse tasks: multi-output regression, multi-class classification, image denoising, image inpainting and collaborative filtering. Prior structural beliefs were introduced into image dictionaries, which, to the best of our knowledge, is also a novel contribution. Finally, an interesting topic for future research is to optimize the variational distribution proposed here within alternative approximate inference frameworks, such as belief propagation or expectation propagation. This could allow extending current methodologies within such frameworks that assume unimodal approximations [25, 26].

Acknowledgments

We thank the reviewers for insightful comments. MKT was supported by EPSRC Grant No EP/F005687/1 "Gaussian Processes for Systems Identification with Applications in Systems Biology". MLG gratefully acknowledges funding from CAM project CCG10-UC3M/TIC-5511 and CONSOLIDER-INGENIO 2010 CSD2008-00010 (COMONSENS).

References

[1] T.J. Mitchell and J.J. Beauchamp. Bayesian variable selection in linear regression.
Journal of the American Statistical Association, 83(404):1023-1032, 1988.

[2] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.

[3] M.E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.

[4] E.I. George and R.E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881-889, 1993.

[5] M. West. Bayesian factor regression models in the "large p, small n" paradigm. In Bayesian Statistics, pages 723-732. Oxford University Press, 2003.

[6] B. Efron. Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23:1-22, 2008.

[7] C. Archambeau and F. Bach. Sparse probabilistic projections. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 73-80, 2009.

[8] F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In 25th International Conference on Machine Learning (ICML). ACM, 2008.

[9] M.W. Seeger and H. Nickisch. Compressed sensing and Bayesian experimental design. In ICML, pages 912-919, 2008.

[10] C.M. Carvalho, N.G. Polson, and J.G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97:465-480, 2010.

[11] T. Damoulas and M.A. Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, 24:1264-1270, 2008.

[12] M. Christoudias, R. Urtasun, and T. Darrell. Bayesian localized multiple kernel learning. Technical report, EECS Department, University of California, Berkeley, Jul 2009.

[13] C. Archambeau and F. Bach. Multiple Gaussian process models. In NIPS 23 workshop on New Directions in Multiple Kernel Learning, 2010.

[14] Y.W. Teh, M. Seeger, and M.I. Jordan. Semiparametric latent factor models. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 10, 2005.

[15] E.V. Bonilla, K.M.A. Chai, and C.K.I. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 20, 2008.

[16] P. Boyle and M. Frean. Dependent Gaussian processes. In Advances in Neural Information Processing Systems 17, pages 217-224. MIT Press, 2005.

[17] M. Alvarez and N.D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 20, pages 57-64, 2008.

[18] R. Yoshida and M. West. Bayesian learning in sparse graphical factor models via variational mean-field annealing. Journal of Machine Learning Research, 11:1771-1798, 2010.

[19] M. Zhou, H. Chen, J. Paisley, L. Ren, G. Sapiro, and L. Carin. Non-parametric Bayesian dictionary learning for sparse image representations. In Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2295-2303, 2009.

[20] M. Lázaro-Gredilla and M. Titsias. Variational heteroscedastic Gaussian process regression. In 28th International Conference on Machine Learning (ICML-11), pages 841-848, New York, NY, USA, June 2011. ACM.

[21] M.E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.

[22] M. Varma and D. Ray. Learning the discriminative power invariance trade-off. In International Conference on Computer Vision, 2007.

[23] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Trans. Image Processing, 17, 2008.

[24] N.D. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian processes. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 601-608, 2009.

[25] K. Sharp and M. Rattray. Dense message passing for sparse principal component analysis. In 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 725-732, 2010.

[26] J.M. Hernández-Lobato, D. Hernández-Lobato, and A. Suárez. Network-based sparse Bayesian classification. Pattern Recognition, 44(4):886-900, 2011.