{"title": "Learning Multiple Tasks with a Sparse Matrix-Normal Penalty", "book": "Advances in Neural Information Processing Systems", "page_first": 2550, "page_last": 2558, "abstract": "In this paper, we propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. Learning multiple (parametric) models can be viewed as estimating a matrix of parameters, where rows and columns of the matrix correspond to tasks and features, respectively. Following the matrix-variate normal density, we design a penalty that decomposes the full covariance of matrix elements into the Kronecker product of row covariance and column covariance, which characterizes both task relatedness and feature representation. Several recently proposed methods are variants of the special cases of this formulation. To address the overfitting issue and select meaningful task and feature structures, we include sparse covariance selection into our matrix-normal regularization via L-1 penalties on task and feature inverse covariances. We empirically study the proposed method and compare with related models in two real-world problems: detecting landmines in multiple fields and recognizing faces between different subjects. Experimental results show that the proposed framework provides an effective and flexible way to model various different structures of multiple tasks.", "full_text": "Learning Multiple Tasks with a Sparse\n\nMatrix-Normal Penalty\n\nYi Zhang\n\nMachine Learning Department\n\nCarnegie Mellon University\nyizhang1@cs.cmu.edu\n\nJeff Schneider\n\nThe Robotics Institute\n\nCarnegie Mellon University\nschneide@cs.cmu.edu\n\nAbstract\n\nIn this paper, we propose a matrix-variate normal penalty with sparse inverse co-\nvariances to couple multiple tasks. Learning multiple (parametric) models can be\nviewed as estimating a matrix of parameters, where rows and columns of the ma-\ntrix correspond to tasks and features, respectively. 
Following the matrix-variate\nnormal density, we design a penalty that decomposes the full covariance of matrix\nelements into the Kronecker product of row covariance and column covariance,\nwhich characterizes both task relatedness and feature representation. Several re-\ncently proposed methods are variants of the special cases of this formulation. To\naddress the over\ufb01tting issue and select meaningful task and feature structures,\nwe include sparse covariance selection into our matrix-normal regularization via\n\u21131 penalties on task and feature inverse covariances. We empirically study the\nproposed method and compare with related models in two real-world problems:\ndetecting landmines in multiple \ufb01elds and recognizing faces between different\nsubjects. Experimental results show that the proposed framework provides an ef-\nfective and \ufb02exible way to model various different structures of multiple tasks.\n\n1 Introduction\n\nLearning multiple tasks has been studied for more than a decade [6, 24, 11]. Research in the fol-\nlowing two directions has drawn considerable interest: learning a common feature representation\nshared by tasks [1, 12, 30, 2, 3, 9, 23], and directly inferring the relatedness of tasks [4, 26, 21, 29].\nBoth have a natural interpretation if we view learning multiple tasks as estimating a matrix of model\nparameters, where the rows and columns correspond to tasks and features. From this perspective,\nlearning the feature structure corresponds to discovering the structure of the columns in the param-\neter matrix, and modeling the task relatedness aims to \ufb01nd and utilize the relations among rows.\n\nRegularization methods have shown promising results in \ufb01nding either feature or task struc-\nture [1, 2, 12, 21]. In this paper we propose a new regularization approach and show how several\nprevious approaches are variants of special cases of it. 
The key contribution is a matrix-normal\npenalty with sparse inverse covariances, which provides a framework for characterizing and cou-\npling the model parameters of related tasks. Following the matrix normal density, we design a\npenalty that decomposes the full covariance of matrix elements into the Kronecker product of row\nand column covariances, which correspond to task and feature structures in multi-task learning. To\naddress over\ufb01tting and select task and feature structures, we incorporate sparse covariance selection\ntechniques into our matrix-normal regularization framework via \u21131 penalties on task and feature in-\nverse covariances. We compare the proposed method to related models on two real-world data sets:\ndetecting landmines in multiple \ufb01elds and recognizing faces between different subjects.\n\n1\n\n\f2 Related Work\n\nMulti-task learning has been an active research area for more than a decade [6, 24, 11]. For joint\nlearning of multiple tasks, connections need to be established to couple related tasks. One direction\nis to \ufb01nd a common feature structure shared by tasks. Along this direction, researchers proposed to\ninfer task structure via principal components [1, 12], independent components [30] and covariance\n[2, 3] in the parameter space, to select a common subset of features [9, 23], as well as to use shared\nhidden nodes in neural networks [6, 11]. Speci\ufb01cally, learning a shared feature covariance for model\nparameters [2] is a special case of our proposed framework. On the other hand, assuming models\nof all tasks are equally similar is risky. Researchers recently began exploring methods to infer the\nrelatedness of tasks. 
These efforts include using mixtures of Gaussians [4] or Dirichlet processes\n[26] to model task groups, encouraging clustering of tasks via a convex regularization penalty [21],\nidentifying \u201coutlier\u201d tasks by robust t-processes [29], and inferring task similarity from task-speci\ufb01c\nfeatures [8, 27, 28]. The present paper uses the matrix normal density and \u21131-regularized sparse\ncovariance selection to specify a structured penalty, which provides a systematic way to characterize\nand select both task and feature structures in multiple parametric models.\n\nMatrix normal distributions have been studied in probability and statistics for several decades [13,\n16, 18] and applied to predictive modeling in the Bayesian literature. For example, the standard\nmatrix normal can serve as a prior for Bayesian variable selection in multivariate regression [9],\nwhere MCMC is used for sampling from the resulting posterior. Recently, matrix normal distribu-\ntions have also been used in nonparametric Bayesian approaches, especially in learning Gaussian\nProcesses (GPs) for multi-output prediction [7] and collaborative \ufb01ltering [27, 28]. In this case, the\ncovariance function of the GP prior is decomposed as the Kronecker product of a covariance over\nfunctions and a covariance over examples. We note that the proposed matrix-normal penalty with\nsparse inverse covariances in this paper can also be viewed as a new matrix-variate prior, upon which\nBayesian inference can be performed. We will pursue this direction in our future work.\n\n3 Matrix-Variate Normal Distributions\n\n3.1 De\ufb01nition\n\nThe matrix-variate normal distribution is one of the most widely studied matrix-variate distributions\n[18, 13, 16]. Consider an m \u00d7 p matrix W. Since we can vectorize W to be a mp \u00d7 1 vector,\nthe normal distribution on a matrix W can be considered as a multivariate normal distribution on a\nvector of mp dimensions. 
However, such an ordinary multivariate distribution ignores the special structure of W as an m × p matrix, and as a result, the covariance characterizing the elements of W is of size mp × mp. This size is usually prohibitive for modeling and estimation. To utilize the structure of W, matrix normal distributions assume that the mp × mp covariance can be decomposed as the Kronecker product Σ ⊗ Ω, so that the elements of W follow:

Vec(W) ∼ N(Vec(M), Σ ⊗ Ω)    (1)

where Ω is an m × m positive definite matrix indicating the covariance between rows of W, Σ is a p × p positive definite matrix indicating the covariance between columns of W, Σ ⊗ Ω is the Kronecker product of Σ and Ω, M is an m × p matrix containing the expectation of each element of W, and Vec is the vectorization operation which maps an m × p matrix into an mp × 1 vector. Due to the decomposition of the covariance as the Kronecker product, the matrix-variate normal distribution of an m × p matrix W, parameterized by the mean M, row covariance Ω and column covariance Σ, has a compact log-density [18]:

log P(W) = −(mp/2) log(2π) − (p/2) log(|Ω|) − (m/2) log(|Σ|) − (1/2) tr{Ω⁻¹(W − M)Σ⁻¹(W − M)ᵀ}    (2)

where |·| is the determinant of a square matrix, and tr{·} is the trace of a square matrix.

3.2 Maximum likelihood estimation (MLE)

Consider a set of n samples {W_i}_{i=1}^{n}, where each W_i is an m × p matrix generated by a matrix-variate normal distribution, as in eq. (2).
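The Kronecker decomposition in eq. (1) implies that the compact log-density (2) must agree with an ordinary multivariate normal log-density on Vec(W) with covariance Σ ⊗ Ω. A minimal numerical check of this equivalence (an illustrative sketch using NumPy/SciPy; all variable and function names are ours, not the paper's):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, p = 3, 4                                   # rows (tasks) x columns (features)

# random positive definite row and column covariances
A = rng.standard_normal((m, m)); Omega = A @ A.T + m * np.eye(m)
B = rng.standard_normal((p, p)); Sigma = B @ B.T + p * np.eye(p)
M = np.zeros((m, p))
W = rng.standard_normal((m, p))

def matrix_normal_logpdf(W, M, Omega, Sigma):
    """Compact matrix-normal log-density of eq. (2)."""
    m, p = W.shape
    R = W - M
    # tr{Omega^-1 R Sigma^-1 R^T}, computed with solves instead of explicit inverses
    quad = np.trace(np.linalg.solve(Omega, R) @ np.linalg.solve(Sigma, R.T))
    return (-0.5 * m * p * np.log(2 * np.pi)
            - 0.5 * p * np.linalg.slogdet(Omega)[1]
            - 0.5 * m * np.linalg.slogdet(Sigma)[1]
            - 0.5 * quad)

# eq. (1): Vec(W) ~ N(Vec(M), Sigma x Omega), with Vec = column stacking
vec = lambda X: X.flatten(order="F")
full = multivariate_normal(mean=vec(M), cov=np.kron(Sigma, Omega)).logpdf(vec(W))
assert np.isclose(matrix_normal_logpdf(W, M, Omega, Sigma), full)
```

Note that the mp × mp matrix `np.kron(Sigma, Omega)` is materialized only for this check; the point of eq. (2) is precisely that one never needs to form it.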
The maximum likelihood estimate (MLE) of the mean M is [16]:

M̂ = (1/n) Σ_{i=1}^{n} W_i    (3)

The MLE estimators of Ω and Σ are solutions to the following system:

Ω̂ = (1/np) Σ_{i=1}^{n} (W_i − M̂) Σ̂⁻¹ (W_i − M̂)ᵀ
Σ̂ = (1/nm) Σ_{i=1}^{n} (W_i − M̂)ᵀ Ω̂⁻¹ (W_i − M̂)    (4)

It is efficient to iteratively solve (4) until convergence, which is known as the "flip-flop" algorithm [16].

Also, Ω̂ and Σ̂ are not identifiable, and solutions maximizing the log-density in eq. (2) are not unique. If (Ω∗, Σ∗) is an MLE estimate of the row and column covariances, then for any α > 0, (αΩ∗, (1/α)Σ∗) leads to the same log-density and thus is also an MLE estimate. This can be seen from the definition in eq. (1), where only the Kronecker product Σ ⊗ Ω is identifiable.

4 Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Regularization is a principled way to control model complexity [20]. Classical regularization penalties (for single-task learning) can be interpreted as assuming a multivariate prior distribution on the parameter vector and performing maximum a posteriori estimation; e.g., the ℓ2 penalty and the ℓ1 penalty correspond to multivariate Gaussian and Laplacian priors, respectively. For multi-task learning, it is natural to use matrix-variate priors to design regularization penalties.

In this section, we propose a matrix-normal penalty with sparse inverse covariances for learning multiple related tasks. In Section 4.1 we start with learning multiple tasks with a matrix-normal penalty.
In Section 4.2 we study how to incorporate sparse covariance selection into our framework by further imposing ℓ1 penalties on task and feature inverse covariances. In Section 4.3 we outline the algorithm, and in Section 4.4 we discuss other useful constraints in our framework.

4.1 Learning with a Matrix Normal Penalty

Consider a multi-task learning problem with m tasks in a p-dimensional feature space. The training sets are {D_t}_{t=1}^{m}, where each set D_t contains n_t examples {(x_i^{(t)}, y_i^{(t)})}_{i=1}^{n_t}. We want to learn m models for the m tasks but appropriately share knowledge among tasks. Model parameters are represented by an m × p matrix W, where the parameters for a task correspond to a row.

The last term in the matrix-variate normal density (2) provides a structure to couple the parameters of multiple tasks as a matrix W: 1) we set M = 0, indicating a preference for simple models; 2) the m × m row covariance Ω describes the similarity among tasks; 3) the p × p column covariance matrix Σ represents a shared feature structure. This yields the following total loss L to optimize:

L = Σ_{t=1}^{m} Σ_{i=1}^{n_t} L(y_i^{(t)}, x_i^{(t)}, W(t, :)) + λ tr{Ω⁻¹WΣ⁻¹Wᵀ}    (5)

where λ controls the strength of the regularization, (y_i^{(t)}, x_i^{(t)}) is the ith example in the training set of the tth task, W(t, :) is the parameter vector of the tth task, and L() is a convex empirical loss function depending on the specific model we use, e.g., squared loss for linear regression, log-likelihood loss for logistic regression, hinge loss for SVMs, and so forth. When Ω and Σ are known and positive definite, eq. (5) is convex w.r.t. W and thus W can be optimized efficiently [22].

Now we discuss a few special cases of (5) and how previous work relates to them.
When we fix Ω = I_m and Σ = I_p, the penalty term decomposes into standard ℓ2-norm penalties on the m rows of W. In this case, the m tasks in (5) can be learned almost independently using single-task ℓ2 regularization (though the tasks are still tied by sharing the parameter λ).

When we fix Ω = I_m, tasks are linked only by a shared feature covariance Σ. This corresponds to a multi-task feature learning framework [2, 3], which optimizes eq. (5) w.r.t. W and Σ, with an additional constraint tr{Σ} ≤ 1 on the trace of Σ to avoid setting Σ to infinity.

When we fix Σ = I_p, tasks are coupled only by a task similarity matrix Ω. This is used in a recent clustered multi-task learning formulation [21], which optimizes eq. (5) w.r.t. W and Ω, with additional constraints on the singular values of Ω that are motivated by and derived from task clustering. A more recent multi-label classification model [19] essentially optimizes W in eq. (5), with a label correlation Ω given as prior knowledge and the empirical loss L as the max-margin hinge loss.

We usually do not know the task and feature structures in advance. Therefore, we would like to infer Ω and Σ in eq. (5). Note that if we jointly optimize W, Ω and Σ in eq. (5), we will always drive Ω and Σ to infinity. We can impose constraints on Ω and Σ to avoid this, but a more natural way is to further expand eq. (5) to include all relevant terms w.r.t. Ω and Σ from the matrix-normal log-density (2).
As a result, the total loss L is:

L = Σ_{t=1}^{m} Σ_{i=1}^{n_t} L(y_i^{(t)}, x_i^{(t)}, W(t, :)) + λ [p log |Ω| + m log |Σ| + tr{Ω⁻¹WΣ⁻¹Wᵀ}]    (6)

Based on this formula, we can infer the task structure Ω and feature structure Σ given the model parameters W, as the following problem:

min_{Ω,Σ}  p log |Ω| + m log |Σ| + tr{Ω⁻¹WΣ⁻¹Wᵀ}    (7)

This problem is equivalent to maximizing the log-likelihood of a matrix normal distribution as in eq. (2), given W as observations and the expectation M fixed at 0. Following Section 3.2, the MLE of Ω and Σ can be obtained by the "flip-flop" algorithm:

Ω̂ = (1/p) W Σ̂⁻¹ Wᵀ + εI_m
Σ̂ = (1/m) Wᵀ Ω̂⁻¹ W + εI_p    (8)

where ε is a small positive constant to improve numerical stability. As discussed in Section 3.2, only Σ ⊗ Ω is uniquely defined, and Ω̂ and Σ̂ are only identifiable up to a multiplicative constant. This will not affect the optimization of W using eq. (5), since only Σ ⊗ Ω matters for this purpose.

4.2 Sparse Covariance Selection in the Matrix-Normal Penalty

Consider the sparsity of Ω⁻¹ and Σ⁻¹. When Ω has a sparse inverse, task pairs corresponding to zero entries in Ω⁻¹ will not be explicitly coupled in the penalty of (6). Similarly, a zero entry in Σ⁻¹ indicates no direct interaction between the two corresponding features in the penalty. Also, note that a clustering of tasks can be expressed by block-wise sparsity of Ω⁻¹.

Covariance selection aims to select the nonzero entries in the Gaussian inverse covariance and discover conditional independence between variables (indicated by zero entries in the inverse covariance) [14, 5, 17, 15]. The matrix-normal density in eq.
(6) enables us to perform sparse covariance selection to regularize and select task and feature structures.

Formally, we rewrite (6) to include two additional ℓ1 penalty terms on the inverse covariances:

L = Σ_{t=1}^{m} Σ_{i=1}^{n_t} L(y_i^{(t)}, x_i^{(t)}, W(t, :)) + λ[p log |Ω| + m log |Σ| + tr{Ω⁻¹WΣ⁻¹Wᵀ}] + λ_Ω ||Ω⁻¹||_{ℓ1} + λ_Σ ||Σ⁻¹||_{ℓ1}    (9)

where || ||_{ℓ1} is the ℓ1-norm of a matrix, and λ_Ω and λ_Σ control the strength of the ℓ1 penalties and therefore the sparsity of task and feature structures.

Based on the new regularization formula (9), estimating W given Ω and Σ as in (5) is not affected, while inferring Ω and Σ given W, previously shown as (7), becomes a new problem:

min_{Ω,Σ}  p log |Ω| + m log |Σ| + tr{Ω⁻¹WΣ⁻¹Wᵀ} + (λ_Ω/λ)||Ω⁻¹||_{ℓ1} + (λ_Σ/λ)||Σ⁻¹||_{ℓ1}    (10)

As in (8), we can iteratively optimize Ω and Σ until convergence, as follows:

Ω̂ = argmin_Ω  p log |Ω| + tr{Ω⁻¹(W Σ̂⁻¹ Wᵀ)} + (λ_Ω/λ)||Ω⁻¹||_{ℓ1}
Σ̂ = argmin_Σ  m log |Σ| + tr{Σ⁻¹(Wᵀ Ω̂⁻¹ W)} + (λ_Σ/λ)||Σ⁻¹||_{ℓ1}    (11)

Note that both equations in (11) are ℓ1-regularized covariance selection problems, for which efficient optimization has been intensively studied [5, 17, 15]. For example, we can use the graphical lasso [17] as a basic solver and consider (11) as an ℓ1-regularized "flip-flop" algorithm:

Ω̂ = glasso((1/p) W Σ̂⁻¹ Wᵀ, λ_Ω/λ)
Σ̂ = glasso((1/m) Wᵀ Ω̂⁻¹ W, λ_Σ/λ)

Finally, an annoying part of eq.
(9) is the presence of two additional regularization parameters λ_Ω and λ_Σ. Due to the property of matrix normal distributions that only Σ ⊗ Ω is identifiable, we can safely reduce the complexity of choosing regularization parameters by considering the restriction:

λ_Ω = λ_Σ    (12)

The following lemma proves that restricting λ_Ω and λ_Σ to be equal in eq. (9) will not reduce the space of optimal models W we can obtain. As a result, we eliminate one regularization parameter.

Lemma 1. Suppose W∗ belongs to a minimizer (W∗, Ω∗, Σ∗) of eq. (9) with some arbitrary choice of λ, λ_Ω and λ_Σ > 0. Then, W∗ must also belong to a minimizer of eq. (9) with a certain choice of λ′, λ′_Ω and λ′_Σ such that λ′_Ω = λ′_Σ. The proof of Lemma 1 is provided in Appendix A.

4.3 The Algorithm

Based on the regularization formula (9), we study the following algorithm for learning multiple tasks:

1) Estimate W by solving (5), using Ω = I_m and Σ = I_p;
2) Infer Ω and Σ in (9) (by solving (11) until convergence), using the estimated W from step 1);
3) Estimate W by solving (5), using the inferred Ω and Σ from step 2).

One can safely iterate over steps 2) and 3), and convergence to a local minimum of eq. (9) is guaranteed. However, we observed that a single pass yields good results¹. Steps 1) and 3) are linear in the number of data points and step 2) is independent of it, so the method scales well with the number of samples. Step 2) needs to solve ℓ1-regularized covariance selection problems as in (11).
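Steps 1)–3) above can be sketched compactly. The sketch below is an illustration under simplifying assumptions, not the paper's implementation: it uses a squared empirical loss (the experiments in this paper use logistic loss) with plain gradient descent for the W-steps, scikit-learn's `graphical_lasso` as the ℓ1 covariance selection solver for (11), and synthetic data with illustrative hyperparameter values:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(1)
m, p, n = 4, 6, 40                       # tasks, features, examples per task
Xs = [rng.standard_normal((n, p)) for _ in range(m)]
w0 = rng.standard_normal(p)              # tasks share a common direction
ys = [X @ (w0 + 0.3 * rng.standard_normal(p)) for X in Xs]
lam, lam_l1, eps = 1.0, 0.05, 0.1        # lambda, lambda_Omega/lambda, ridge term

def fit_W(Omega, Sigma, iters=2000, lr=2e-3):
    """Steps 1)/3): gradient descent on eq. (5) with squared empirical loss."""
    W = np.zeros((m, p))
    Oi, Si = np.linalg.inv(Omega), np.linalg.inv(Sigma)
    for _ in range(iters):
        G = np.stack([2 * Xs[t].T @ (Xs[t] @ W[t] - ys[t]) for t in range(m)])
        G += 2 * lam * Oi @ W @ Si       # gradient of lam * tr{Omega^-1 W Sigma^-1 W^T}
        W -= lr * G
    return W

# step 1): estimate W with identity structures (plain l2 regularization)
W = fit_W(np.eye(m), np.eye(p))

# step 2): l1-regularized "flip-flop" as in (11), with glasso as the basic solver
Omega, Sigma = np.eye(m), np.eye(p)
for _ in range(5):
    Omega = graphical_lasso(W @ np.linalg.inv(Sigma) @ W.T / p + eps * np.eye(m),
                            alpha=lam_l1)[0]
    Sigma = graphical_lasso(W.T @ np.linalg.inv(Omega) @ W / m + eps * np.eye(p),
                            alpha=lam_l1)[0]

# step 3): re-estimate W under the inferred task and feature structures
W = fit_W(Omega, Sigma)
```

Switching to logistic or hinge loss only changes the per-task gradient inside `fit_W`; the structure-inference step (11) is unchanged.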
We use the state-of-the-art technique [17], but more efficient optimization for large covariances is still desirable.

4.4 Additional Constraints

We can make additional structure assumptions in the matrix-normal penalty. For example, consider:

Ω_ii = 1,  i = 1, 2, . . . , m    (13)
Σ_jj = 1,  j = 1, 2, . . . , p    (14)

In this case, we ignore variances and restrict our attention to correlation structures. For example, the off-diagonal entries of the task covariance Ω characterize the task similarity, while the diagonal entries indicate different amounts of regularization on tasks, which may be fixed as a constant if we prefer tasks to be equally regularized. Similar arguments apply to the feature covariance Σ. We include these restrictions by converting the inferred covariance(s) into correlation(s) in step 2) of the algorithm in Section 4.3. In other words, the restrictions are enforced by a projection step.

If one wants to iterate over steps 2) and 3) of the algorithm in Section 4.3 until convergence, we may instead consider the constraints

Ω_ii = c1,  i = 1, 2, . . . , m    (15)
Σ_jj = c2,  j = 1, 2, . . . , p    (16)

with unknown quantities c1 and c2, and consider eq. (9) in step 2) as a constrained optimization problem w.r.t. W, Ω, Σ, c1 and c2, instead of using a projection step. As a result, the "flip-flop" algorithm in (11) needs to solve ℓ1-penalized covariance selection with the equality constraints (15) or (16), where dual block coordinate descent [5] and the graphical lasso [17] are no longer directly applicable. In this case, one can solve the two steps of (11) as determinant maximization problems with linear constraints [25], but this is inefficient.
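The simpler projection step for the correlation constraints (13)–(14), i.e., rescaling an inferred covariance so its diagonal is 1, can be written as follows (a small illustrative helper; the function name and example matrix are ours):

```python
import numpy as np

def to_correlation(C):
    """Rescale a covariance matrix to a correlation matrix (unit diagonal):
    R = D^{-1/2} C D^{-1/2}, enforcing constraints like (13)-(14) by projection."""
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

# example: a task covariance whose diagonal (per-task regularization) varies
C = np.array([[ 4.0, 1.2, -0.8],
              [ 1.2, 1.0,  0.3],
              [-0.8, 0.3,  2.0]])
R = to_correlation(C)
assert np.allclose(np.diag(R), 1.0)   # unit diagonal, as in (13)
```

Since R = D^{-1/2} C D^{-1/2} is a congruence transform with a diagonal matrix, positive definiteness is preserved, and the zero pattern of the inverse (the sparsity selected in step 2) is preserved as well, because R⁻¹ = D^{1/2} C⁻¹ D^{1/2}.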
We will study this direction (efficient constrained sparse covariance selection) in future work.

5 Empirical Studies

In this section, we present our empirical studies on a landmine detection problem and a face recognition problem, where the multiple tasks correspond to detecting landmines in different landmine fields and classifying faces between different subjects, respectively.

¹Further iterations over steps 2) and 3) will not dramatically change the model estimation. Also, early stopping as regularization might lead to better generalizability.

5.1 Data Sets and Experimental Settings

The landmine detection data set from [26] contains examples collected from different landmine fields. Each example in the data set is represented by a 9-dimensional feature vector extracted from radar imaging, which includes moment-based features, correlation-based features, an energy-ratio feature and a spatial-variance feature. As a binary classification problem, the goal is to predict landmines (positive class) or clutter (negative class). Following [26], we jointly learn 19 tasks from landmine fields 1–10 and 19–24 in the data set. As a result, the model parameters W form a 19 × 10 matrix, corresponding to 19 tasks and 10 coefficients (including the intercept) for each task.

The distribution of examples is imbalanced in each task, with a few dozen positive examples and several hundred negative examples. Therefore, we use the average AUC (Area Under the ROC Curve) over the 19 tasks as the performance measure. We vary the size of the training set for each task as 30, 40, 80 and 160. Note that we intentionally keep the training sets small because the need for cross-task learning diminishes as the training set becomes large relative to the number of parameters being learned. For each training set size, we randomly select training examples for each task and the rest is used as the testing set.
This is repeated 30 times. Task-average AUC scores are collected over the 30 runs, and means and standard errors are reported. Note that for small training sizes (e.g., 30 per task) we often have some task(s) with no positive training sample; it is interesting to see how well multi-task learning handles this case.

The face recognition data set is the Yale face database, which contains 165 images of 15 subjects. The 11 images per subject correspond to different configurations in terms of expression, emotion, illumination, wearing glasses (or not), etc. Each image is scaled to 32 × 32 pixels. We use the first 8 subjects to construct (8 × 7)/2 = 28 binary classification tasks, each classifying between two subjects. We vary the size of the training set as 3, 5 and 7 images per subject. We have 30 random runs for each training size. In each run, we randomly select the training set and use the rest as the testing set. We collect task-average classification errors over the 30 runs, and report means and standard errors.

The choice of features is important for face recognition problems. In our experiments, we use orthogonal Laplacianfaces [10], which have been shown to provide better discriminative power than Eigenfaces (PCA), Fisherfaces (LDA) and Laplacianfaces on several benchmark data sets. In each random run, we extract 30 orthogonal Laplacianfaces using the selected training set of all 8 subjects², and conduct experiments on all 28 classification tasks in the extracted feature space.

5.2 Models and Implementation Details

We use the logistic regression loss as the empirical loss L in (9). We compare the following models.

STL: learn ℓ2-regularized logistic regression for each task separately.

MTL-C: clustered multi-task learning [21], which encourages task clustering in regularization. As discussed in Section 4.1, this is related to eq.
(5) with only a task structure \u2126.\nMTL-F: multi-task feature learning [2], which corresponds to \ufb01xing the task covariance \u2126 as Im\nand optimizing (6) with only the feature covariance \u03a3.\n\nIn addition, we also study various different con\ufb01gurations of the proposed framework:\nMTL(Im&Ip): learn W using (9) with \u2126 and \u03a3 \ufb01xed as identity matrices Im and Ip.\nMTL(\u2126&Ip): learn W and task covariance \u2126 using (9), with feature covariance \u03a3 \ufb01xed as Ip.\nMTL(Im&\u03a3): learn W and feature covariance \u03a3 using (9), with task covariance \u2126 \ufb01xed as Im.\nMTL(\u2126&\u03a3): learn W, \u2126 and \u03a3 using (9), inferring both task and feature structures.\nMTL(\u2126&\u03a3)\u2126ii=\u03a3jj =1: learn W, \u2126 and \u03a3 using (9), with restricted \u2126 and \u03a3 as (13) and (14).\nMTL(\u2126&\u03a3)\u2126ii=1: learn W, \u2126 and \u03a3 using (9), with restricted \u2126 as (13) and free \u03a3. Intuitively,\nfree diagonal entries in \u03a3 are useful when features are of different importance, e.g, components\nextracted as orthogonal Laplacianfaces usually capture decreasing amounts of information [10].\n\nWe use conjugate gradients [22] to optimize W in (5), and infer \u2126 and \u03a3 in (11) using graphical\nlasso [17] as the basic solver. 
Regularization parameters λ and λ_Ω = λ_Σ are chosen by 3-fold cross validation within the range [10⁻⁷, 10³]. The model in [21] uses 4 regularization parameters, and we consider 3 values for each parameter, leading to 3⁴ = 81 combinations chosen by cross validation.

²For experiments with 3 images per subject, we can only extract 23 Laplacianfaces, which is limited by the size of the training examples (3 × 8 = 24) [10].

Avg AUC Score             30 samples     40 samples     80 samples     160 samples
STL                       64.85(0.52)    67.62(0.64)    71.86(0.38)    76.22(0.25)
MTL-C [21]                67.09(0.44)    68.95(0.40)    72.89(0.31)    76.64(0.17)
MTL-F [2]                 72.39(0.79)    74.75(0.63)    77.12(0.18)    78.13(0.12)
MTL(Im&Ip)                66.10(0.65)    69.91(0.40)    73.34(0.28)    76.17(0.22)
MTL(Ω&Ip)                 74.88(0.29)    75.83(0.28)    76.93(0.15)    77.95(0.17)
MTL(Im&Σ)                 72.71(0.65)    74.98(0.32)    77.35(0.14)    78.13(0.14)
MTL(Ω&Σ)                  75.10(0.27)    76.16(0.15)    77.32(0.24)    78.21(0.17)*
MTL(Ω&Σ), Ωii=Σjj=1       75.31(0.26)*   76.64(0.13)*   77.56(0.16)*   78.01(0.12)
MTL(Ω&Σ), Ωii=1           75.19(0.22)    76.25(0.14)    77.22(0.15)    78.03(0.15)

Table 1: Average AUC scores (%) on landmine detection: means (and standard errors) over 30 random runs. For each column, the best model is marked with * and competitive models (by paired t-tests) are shown in bold.

5.3 Results on Landmine Detection

The results on landmine detection are shown in Table 1. Each row of the table corresponds to a model in our experiments, and each column is a training sample size. We have 30 random runs for each sample size. We use the task-average AUC score as the performance measure and report the mean and standard error of this measure over the 30 random runs. The best model is marked with *, and models displayed in bold fonts are statistically competitive models (i.e.
not significantly inferior to the best model in a one-sided paired t-test with α = 0.05).

Overall, MTL(Ω&Σ) and MTL(Ω&Σ) with Ωii = Σjj = 1 lead to the best prediction performance. For small training sizes, the restricted Ω and Σ (Ωii = Σjj = 1) offer better prediction; for the large training size (160 per task), free Ω and Σ give the best performance. The best model performs better than MTL-F [2] and much better than MTL-C [21] with small training sets.

MTL(Im&Ip) performs better than STL, i.e., even the simplest coupling among tasks (sharing the parameter λ) can be helpful when the amount of training data is small. Consider the performance of MTL(Ω&Ip) and MTL(Im&Σ), which learn either a task structure or a feature structure. When the number of training samples is small (i.e., 30 or 40), coupling by task similarity is more effective; as the training size increases, learning a common feature representation is more helpful. Finally, consider MTL(Ω&Σ) and its two restricted variants. The variant with Ωii = Σjj = 1 imposes a strong restriction and leads to better performance when the training size is small. MTL(Ω&Σ) is more flexible and performs well given large numbers of training samples. The variant with Ωii = 1 performs similarly to the variant with Ωii = Σjj = 1, indicating no significant variation of feature importance in this problem.

5.4 Results on Face Recognition

Empirical results on face recognition are shown in Table 2, with the best model in each column marked with * and competitive models displayed in bold. MTL-C [21] performs even worse than STL.
One possible explanation is that, since the tasks are to classify faces between different subjects, there may not be a clustered structure over tasks, and thus a cluster norm is inappropriate. In this case, using a task similarity matrix may be more appropriate than clustering over tasks. In addition, MTL(Ω&Σ) with Ωii = 1 shows advantages over other models, especially given relatively sufficient training data (5 or 7 images per subject). Compared to MTL(Ω&Σ), it imposes restrictions on the diagonal entries of the task covariance Ω: all tasks seem to be similarly difficult and should be equally regularized. Compared to the variant with Ωii = Σjj = 1, it allows the diagonal entries of the feature covariance Σ to capture the varying importance of the Laplacianfaces.

Avg Classification Errors   3 samples per class   5 samples per class   7 samples per class
STL                         10.97(0.46)           7.62(0.30)            4.75(0.35)
MTL-C [21]                  11.09(0.49)           7.87(0.34)            5.33(0.34)
MTL-F [2]                   10.78(0.60)           6.86(0.27)            4.20(0.31)
MTL(Im&Ip)                  10.88(0.48)           7.51(0.28)            5.00(0.35)
MTL(Ω&Ip)                    9.98(0.55)           6.68(0.30)            4.12(0.38)
MTL(Im&Σ)                    9.87(0.59)           6.25(0.27)            4.06(0.34)
MTL(Ω&Σ)                     9.81(0.49)           6.23(0.29)            4.11(0.36)
MTL(Ω&Σ), Ωii=Σjj=1          9.67(0.57)*          6.21(0.28)            4.02(0.32)
MTL(Ω&Σ), Ωii=1              9.67(0.51)*          5.98(0.29)*           3.53(0.34)*

Table 2: Average classification errors (%) on face recognition: means (and standard errors) over 30 random runs.
For each column, the best model is marked with ∗ and competitive models (by paired t-tests) are shown in bold.

6 Conclusion

We propose a matrix-variate normal penalty with sparse inverse covariances to couple multiple tasks. The proposed framework provides an effective and flexible way to characterize and select both task and feature structures for learning multiple tasks. Several recently proposed methods can be viewed as variants of special cases of our formulation, and our empirical results on landmine detection and face recognition show that we consistently outperform previous methods.

Acknowledgement: this work was funded in part by the National Science Foundation under grant NSF-IIS0911032 and the Department of Energy under grant DESC0002607.

Appendix A

Proof of Lemma 1.
We prove Lemma 1 by construction. Given an arbitrary choice of λ, λΩ and λΣ > 0 in eq. (9) and an optimal solution (W∗, Ω∗, Σ∗), we want to prove that W∗ also belongs to an optimal solution for eq. (9) with certain λ′, λ′Ω and λ′Σ. Let's construct λ′, λ′Ω and λ′Σ s.t. λ′Ω = λ′Σ:

(λ′, λ′Ω, λ′Σ) = (λ, √(λΩλΣ), √(λΩλΣ))    (17)

We denote the objective function in eq.
(9) with λ, λΩ and λΣ as Obj_{λ,λΩ,λΣ}(W, Ω, Σ). Also, we denote the objective function with our constructed parameters λ′, λ′Ω and λ′Σ as Obj_{λ′,λ′Ω,λ′Σ}(W, Ω, Σ).

For any (W, Ω, Σ), we further construct an invertible (i.e., one-to-one) transform as follows:

(W′, Ω′, Σ′) = (W, √(λΣ/λΩ) Ω, √(λΩ/λΣ) Σ)    (18)

The key step in our proof is that, by construction, the following equality always holds:

Obj_{λ,λΩ,λΣ}(W, Ω, Σ) = Obj_{λ′,λ′Ω,λ′Σ}(W′, Ω′, Σ′)    (19)

To see this, notice that eq. (9) consists of three parts. The first part is the empirical loss on training examples, depending only on W (and the training data). The second part is the log-density of the matrix normal distribution, which depends on W and Σ ⊗ Ω. The third part is the sum of two ℓ1 penalties. The equality in eq. (19) stems from the fact that all three parts of eq.
(9) are not changed: 1) W′ = W, so the first part remains unchanged; 2) Σ′ ⊗ Ω′ = Σ ⊗ Ω, so the second part, the matrix normal log-density, is the same; 3) by our construction of λ′Ω and λ′Σ, the third part is not changed.

Based on this equality, if (W∗, Ω∗, Σ∗) minimizes Obj_{λ,λΩ,λΣ}(·), we have that (W∗, √(λΣ/λΩ) Ω∗, √(λΩ/λΣ) Σ∗) minimizes Obj_{λ′,λ′Ω,λ′Σ}(·), where λ′ = λ and λ′Ω = λ′Σ = √(λΩλΣ). Therefore, W∗ also belongs to an optimal solution of eq. (9) with the constructed λ′, λ′Ω and λ′Σ, which completes the proof.

References
[1] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2006.
[3] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In NIPS, 2007.
[4] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.
[5] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
[6] J. Baxter. Learning internal representations. In COLT, pages 311–320, 1995.
[7] E. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In NIPS, pages 153–160, 2008.
[8] E. V. Bonilla, F. V. Agakov, and C. K. I. Williams. Kernel multi-task learning using task-specific features. In AISTATS, 2007.
[9] P. J. Brown and M. Vannucci. Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society, Series B, 60(3):627–641, 1998.
[10] D. Cai, X. He, J. Han, and H. Zhang. Orthogonal laplacianfaces for face recognition. IEEE Transactions on Image Processing, 15(11):3608–3614, 2006.
[11] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[12] J. Chen, L. Tang, J. Liu, and J. Ye. A convex formulation for learning shared structures from multiple tasks. In ICML, 2009.
[13] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265–274, 1981.
[14] A. P. Dempster. Covariance selection. Biometrics, 1972.
[15] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), 2008.
[16] P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64:105–123, 1999.
[17] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.
[18] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall, 1999.
[19] B. Hariharan, S. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In ICML, 2010.
[20] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[21] L. Jacob, F. Bach, and J. P. Vert. Clustered multi-task learning: a convex formulation. In NIPS, pages 745–752, 2008.
[22] J. Nocedal and S. Wright. Numerical Optimization. Springer, 2000.
[23] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 2009.
[24] S. Thrun and J. O'Sullivan. Discovering structure in multiple learning tasks: the TC algorithm. In ICML, pages 489–497, 1996.
[25] L. Vandenberghe, S. Boyd, and S.-P. Wu. Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications, 19:499–533, 1996.
[26] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, 2007.
[27] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In NIPS, pages 1553–1560, 2007.
[28] K. Yu, J. Lafferty, S. Zhu, and Y. Gong. Large-scale collaborative prediction using a nonparametric random effects model. In ICML, pages 1185–1192, 2009.
[29] S. Yu, V. Tresp, and K. Yu. Robust multi-task learning with t-processes. In ICML, page 1103, 2007.
[30] J. Zhang, Z. Ghahramani, and Y. Yang. Learning multiple related tasks using latent independent component analysis. In NIPS, pages 1585–1592, 2006.
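The rescaling argument in the proof of Lemma 1 (Appendix A) can be verified numerically. The sketch below is not the paper's implementation; it assumes the ℓ1 part of eq. (9) has the form λΩ‖Ω⁻¹‖₁ + λΣ‖Σ⁻¹‖₁ (eq. (9) itself is not reproduced here), and checks that both the Kronecker-product covariance Σ ⊗ Ω and the weighted ℓ1 penalties are unchanged under the transform of eq. (18) with the constructed parameters of eq. (17):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    # Random symmetric positive-definite matrix (A A^T is PSD; adding n*I makes it PD).
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

m, p = 3, 4                          # number of tasks, number of features
Omega, Sigma = random_spd(m), random_spd(p)
lam_O, lam_S = 0.7, 2.5              # arbitrary positive penalty weights

# Constructed parameters of eq. (17): lam'_Omega = lam'_Sigma = sqrt(lam_Omega * lam_Sigma).
lam_new = np.sqrt(lam_O * lam_S)

# Invertible transform of eq. (18).
Omega2 = np.sqrt(lam_S / lam_O) * Omega
Sigma2 = np.sqrt(lam_O / lam_S) * Sigma

# Second part of the objective: the matrix-normal log-density depends on Sigma ⊗ Omega,
# which is unchanged because the two scalar factors cancel in the Kronecker product.
assert np.allclose(np.kron(Sigma2, Omega2), np.kron(Sigma, Omega))

# Third part (assumed form): lam_O * ||Omega^{-1}||_1 + lam_S * ||Sigma^{-1}||_1.
def l1_part(lo, ls, O, S):
    return lo * np.abs(np.linalg.inv(O)).sum() + ls * np.abs(np.linalg.inv(S)).sum()

assert np.isclose(l1_part(lam_O, lam_S, Omega, Sigma),
                  l1_part(lam_new, lam_new, Omega2, Sigma2))
print("objective parts invariant under the eq. (18) rescaling")
```

The first part of the objective, the empirical loss, depends only on W and is trivially unchanged since W′ = W, so the check covers the two nontrivial parts of the equality in eq. (19).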