{"title": "Probabilistic Multi-Task Feature Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 2559, "page_last": 2567, "abstract": "Recently, some variants of the $l_1$ norm, particularly matrix norms such as the $l_{1,2}$ and $l_{1,\\infty}$ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the $l_{1,2}$ and $l_{1,\\infty}$ norms by considering a family of $l_{1,q}$ norms for $1 < q\\le\\infty$ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the $l_{1,q}$ norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization~(EM) algorithms to learn all model parameters, including $q$, automatically. 
Experiments have been conducted on two cancer classification applications using microarray gene expression data.", "full_text": "Probabilistic Multi-Task Feature Selection\n\nYu Zhang$^1$, Dit-Yan Yeung$^1$, Qian Xu$^2$\n\n$^1$Department of Computer Science and Engineering, $^2$Bioengineering Program\n\nHong Kong University of Science and Technology\n\n{zhangyu,dyyeung}@cse.ust.hk, fleurxq@ust.hk\n\nAbstract\n\nRecently, some variants of the $l_1$ norm, particularly matrix norms such as the $l_{1,2}$ and $l_{1,\\infty}$ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the $l_{1,2}$ and $l_{1,\\infty}$ norms by considering a family of $l_{1,q}$ norms for $1 < q \\le \\infty$ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the $l_{1,q}$ norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including $q$, automatically. Experiments have been conducted on two cancer classification applications using microarray gene expression data.\n\n1 Introduction\n\nLearning algorithms based on $l_1$ regularization have a long history in machine learning and statistics. A well-known property of $l_1$ regularization is its ability to enforce sparsity in the solutions. 
Recently, some variants of the $l_1$ norm, particularly matrix norms such as the $l_{1,2}$ and $l_{1,\\infty}$ norms, were proposed to enforce sparsity via joint regularization [24, 17, 28, 1, 2, 15, 20, 16, 18]. The $l_{1,2}$ norm is the sum of the $l_2$ norms of the rows and the $l_{1,\\infty}$ norm is the sum of the $l_\\infty$ norms of the rows. Regularizers based on these two matrix norms encourage row sparsity, i.e., they encourage entire rows of the matrix to have zero elements. Moreover, these norms have also been used for enforcing group sparsity among features in conventional classification and regression problems, e.g., group LASSO [29]. Recently, they have been widely used in multi-task learning, compressed sensing and other related areas. However, when given a specific application, we often have no idea which norm is the most appropriate choice to use.\n\nIn this paper, we study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection [17, 15]. Instead of choosing between specific choices such as the $l_{1,2}$ and $l_{1,\\infty}$ norms, we consider a family of $l_{1,q}$ norms. We restrict $q$ to the range $1 < q \\le \\infty$ to ensure that all norms in this family are convex, making it easier to solve the optimization problem formulated based on it. Within this family, the $l_{1,2}$ and $l_{1,\\infty}$ norms are just two special cases. Using the $l_{1,q}$ norm, we formulate the general multi-task feature selection problem and give it a probabilistic interpretation. It is noted that the automatic relevance determination (ARD) prior [9, 3, 26] comes as a special case under this interpretation. 
Based on this probabilistic interpretation, we develop a probabilistic formulation using a noninformative prior called the Jeffreys prior [10]. We devise an expectation-maximization (EM) algorithm [8] to learn all model parameters, including $q$, automatically. Moreover, an underlying assumption of existing multi-task feature selection methods is that all tasks are similar to each other and they share the same features. This assumption may not be correct in practice because there may exist outlier tasks or tasks with negative correlation. As another contribution of this paper, we propose to use a matrix variate generalized normal prior [13] for the model parameters to learn the relationships between tasks. The task relationships learned here can be seen as an extension of the task covariance used in [4, 32, 31]. Experiments will be reported on two cancer classification applications using microarray gene expression data.\n\n2 Multi-Task Feature Selection\n\nSuppose we are given $m$ learning tasks $\\{T_i\\}_{i=1}^m$. For the $i$th task $T_i$, the training set $\\mathcal{D}_i$ consists of $n_i$ labeled data points in the form of ordered pairs $(\\mathbf{x}_j^i, y_j^i)$, $j = 1, \\ldots, n_i$, with $\\mathbf{x}_j^i \\in \\mathbb{R}^d$ and its corresponding output $y_j^i \\in \\mathbb{R}$ if it is a regression problem and $y_j^i \\in \\{-1, 1\\}$ if it is a binary classification problem. The linear function for $T_i$ is defined as $f_i(\\mathbf{x}) = \\mathbf{w}_i^T \\mathbf{x} + b_i$. For applications that need feature selection, e.g., document classification, the feature dimensionality is usually very high and it has been found that linear methods usually perform better.\n\nThe objective functions of most existing multi-task feature selection methods [24, 17, 28, 1, 2, 15, 20, 16, 18] can be expressed in the following form:\n\n$$\\sum_{i=1}^m \\sum_{j=1}^{n_i} L(y_j^i, \\mathbf{w}_i^T \\mathbf{x}_j^i + b_i) + \\lambda R(\\mathbf{W}), \\tag{1}$$\n\nwhere $\\mathbf{W} = (\\mathbf{w}_1, \\ldots, \\mathbf{w}_m)$, $L(\\cdot, \\cdot)$ denotes the loss function (e.g., squared loss for regression and hinge loss for classification), $R(\\cdot)$ is the regularization function that enforces feature sparsity under the multi-task setting, and $\\lambda$ is the regularization parameter controlling the relative contribution of the empirical loss and the regularizer. Multi-task feature selection seeks to minimize the objective function above to obtain the optimal parameters $\\{\\mathbf{w}_i, b_i\\}$. Two regularization functions are widely used in existing multi-task feature selection methods. One of them is based on the $l_{1,2}$ norm of $\\mathbf{W}$ [17, 28, 1, 2, 16, 18]: $R(\\mathbf{W}) = \\sum_{k=1}^d \\|\\mathbf{w}^k\\|_2$, where $\\|\\cdot\\|_q$ denotes the $q$-norm (or $l_q$ norm) of a vector and $\\mathbf{w}^k$ denotes the $k$th row of $\\mathbf{W}$. Another one is based on the $l_{1,\\infty}$ norm of $\\mathbf{W}$ [24, 15, 20]: $R(\\mathbf{W}) = \\sum_{k=1}^d \\|\\mathbf{w}^k\\|_\\infty$.\n\nIn this paper, we unify these two cases by using the $l_{1,q}$ norm of $\\mathbf{W}$ to define a more general regularization function:\n\n$$R(\\mathbf{W}) = \\sum_{k=1}^d \\|\\mathbf{w}^k\\|_q, \\quad 1 < q \\le \\infty.$$\n\nNote that when $q < 1$, $R(\\mathbf{W})$ is non-convex with respect to $\\mathbf{W}$. Although $R(\\mathbf{W})$ is convex when $q = 1$, the elements of $\\mathbf{W}$ are then penalized independently of each other and so the regularization function cannot enforce feature sparsity jointly across tasks. Thus we restrict the range to $1 < q \\le \\infty$.\n\nEven though restricting the range to $1 < q \\le \\infty$ can enforce feature sparsity between different tasks, different values of $q$ imply different \u2018group discounts\u2019 for sharing the same feature. Specifically, when $q$ approaches 1, the cost grows almost linearly with the number of tasks that use a feature, and when $q = \\infty$, only the most demanding task matters. So selecting a proper $q$ can potentially have a significant effect on the performance of the learning algorithms.\n\nIn the following, we first give a probabilistic interpretation for multi-task feature selection methods. Based on this probabilistic interpretation, we then develop a probabilistic model which, among other things, can solve the model selection problem automatically by estimating $q$ from data.\n\n3 Probabilistic Interpretation\n\nIn this section, we will show that existing multi-task feature selection methods are related to the maximum a posteriori (MAP) solution of a probabilistic model. 
This probabilistic interpretation sets the stage for introducing our probabilistic model in the next section.\n\nWe first introduce the generalized normal distribution [11] which is useful for the model to be introduced.\n\nDefinition 1 $z$ is a univariate generalized normal random variable iff its probability density function (p.d.f.) is given as follows:\n\n$$p(z) = \\frac{1}{2 \\rho \\Gamma(1 + \\frac{1}{q})} \\exp\\left( -\\frac{|z - \\mu|^q}{\\rho^q} \\right),$$\n\nwhere $\\Gamma(\\cdot)$ denotes the Gamma function and $|\\cdot|$ denotes the absolute value of a scalar.\n\nFor simplicity, if $z$ is a univariate generalized normal random variable, we write $z \\sim \\mathcal{GN}(\\mu, \\rho, q)$. The (ordinary) normal distribution can be viewed as a special case of the generalized normal distribution when $q = 2$ and the Laplace distribution is a special case when $q = 1$. When $q$ approaches $+\\infty$, the generalized normal distribution approaches the uniform distribution in the range $[\\mu - \\rho, \\mu + \\rho]$. The generalized normal distribution has proven useful in Bayesian analysis and robustness studies.\n\nDefinition 2 A standardized $r \\times 1$ multivariate generalized normal random variable $\\mathbf{z} = (z_1, \\ldots, z_r)^T$ consists of $r$ independent and identically distributed (i.i.d.) 
univariate generalized normal random variables.\n\nIf $\\mathbf{z}$ is a standardized $r \\times 1$ multivariate generalized normal random variable, we write $\\mathbf{z} \\sim \\mathcal{MGN}(\\mu, \\rho, q)$ with the following p.d.f.:\n\n$$p(\\mathbf{z}) = \\frac{1}{\\left[ 2 \\rho \\Gamma(1 + \\frac{1}{q}) \\right]^r} \\exp\\left( -\\frac{\\sum_{i=1}^r |z_i - \\mu|^q}{\\rho^q} \\right).$$\n\nWith these definitions, we now begin to present our probabilistic interpretation for multi-task feature selection by proposing a probabilistic model. For notational simplicity, we assume that all tasks perform regression. Extension to include classification tasks will go through similar derivation.\n\nFor a regression problem, we use the normal distribution to define the likelihood for $\\mathbf{x}_j^i$:\n\n$$y_j^i \\sim \\mathcal{N}(\\mathbf{w}_i^T \\mathbf{x}_j^i + b_i, \\sigma^2), \\tag{2}$$\n\nwhere $\\mathcal{N}(\\mu, \\sigma^2)$ denotes the (univariate) normal distribution with mean $\\mu$ and variance $\\sigma^2$.\n\nWe impose the generalized normal prior on each element of $\\mathbf{W}$:\n\n$$w_{ki} \\sim \\mathcal{GN}(0, \\rho_k, q), \\tag{3}$$\n\nwhere $w_{ki}$ is the $(k, i)$th element of $\\mathbf{W}$ (or, equivalently, the $i$th element of $\\mathbf{w}^k$ or the $k$th element of $\\mathbf{w}_i$). Then we can express the prior on $\\mathbf{w}^k$ as\n\n$$(\\mathbf{w}^k)^T \\sim \\mathcal{MGN}(\\mathbf{0}, \\rho_k, q).$$\n\nWhen $q = 2$, this becomes the ARD prior [9, 3, 26] commonly used in Bayesian methods for enforcing sparsity. 
From this view, the generalized normal prior can be viewed as a generalization of the ARD prior.\n\nWith the above likelihood and prior, we can obtain the MAP solution of $\\mathbf{W}$ by solving the following problem:\n\n$$\\min_{\\mathbf{W}, \\mathbf{b}, \\boldsymbol{\\rho}} J = \\frac{1}{\\sigma^2} \\sum_{i=1}^m \\sum_{j=1}^{n_i} L(y_j^i, \\mathbf{w}_i^T \\mathbf{x}_j^i + b_i) + \\sum_{k=1}^d \\left( \\frac{\\|\\mathbf{w}^k\\|_q^q}{\\rho_k^q} + m \\ln \\rho_k \\right), \\tag{4}$$\n\nwhere $\\mathbf{b} = (b_1, \\ldots, b_m)^T$ and $\\boldsymbol{\\rho} = (\\rho_1, \\ldots, \\rho_d)^T$.\n\nWe set the derivative of $J$ with respect to $\\rho_k$ to zero and get\n\n$$\\rho_k = \\left( \\frac{q}{m} \\right)^{1/q} \\|\\mathbf{w}^k\\|_q.$$\n\nPlugging this into problem (4), the optimization problem can be reformulated as\n\n$$\\min_{\\mathbf{W}, \\mathbf{b}} J = \\frac{1}{\\sigma^2} \\sum_{i=1}^m \\sum_{j=1}^{n_i} L(y_j^i, \\mathbf{w}_i^T \\mathbf{x}_j^i + b_i) + m \\sum_{k=1}^d \\ln \\|\\mathbf{w}^k\\|_q. \\tag{5}$$\n\nNote that problem (5) is non-convex since the second term is non-convex with respect to $\\mathbf{W}$. Because $\\ln z \\le z - 1$ for any $z > 0$, problem (5) can be relaxed to problem (1) by setting $\\lambda = m \\sigma^2$. So the solutions of multi-task feature selection methods can be viewed as the solution of the relaxed optimization problem above. 
In many previous works such as [5, 27], $\\ln(x)$ has been used as an approximation of $I(x \\neq 0)$ where $I(\\cdot)$ is an indicator function. Using this view, we can regard the second term in problem (5) as an approximation of the number of rows with nonzero $q$-norms.\n\nNote that we can directly solve problem (5) using a majorization-minimization (MM) algorithm [14]. For numerical stability, we can slightly modify the objective function in problem (5) by replacing the second term with $m \\sum_{k=1}^d \\ln(\\|\\mathbf{w}^k\\|_q + \\alpha)$ where $\\alpha$ can be regarded as a regularization parameter. We denote the solution obtained in the $t$th iteration as $\\mathbf{w}^k_{(t)}$. In the $(t+1)$th iteration, due to the concavity property of $\\ln(\\cdot)$, we can bound the second term in problem (5) as follows\n\n$$\\sum_{k=1}^d \\ln(\\|\\mathbf{w}^k\\|_q + \\alpha) \\le \\sum_{k=1}^d \\left[ \\ln(\\|\\mathbf{w}^k_{(t)}\\|_q + \\alpha) + \\frac{\\|\\mathbf{w}^k\\|_q - \\|\\mathbf{w}^k_{(t)}\\|_q}{\\|\\mathbf{w}^k_{(t)}\\|_q + \\alpha} \\right].$$\n\nThus, in the $(t+1)$th iteration, we need to solve a weighted version of problem (1):\n\n$$\\min_{\\mathbf{W}, \\mathbf{b}} \\frac{1}{\\sigma^2} \\sum_{i=1}^m \\sum_{j=1}^{n_i} L(y_j^i, \\mathbf{w}_i^T \\mathbf{x}_j^i + b_i) + m \\sum_{k=1}^d \\frac{\\|\\mathbf{w}^k\\|_q}{\\|\\mathbf{w}^k_{(t)}\\|_q + \\alpha}.$$\n\nAccording to [14], the MM algorithm is guaranteed to converge to a local 
optimum.\n\n4 A Probabilistic Framework for Multi-Task Feature Selection\n\nIn the probabilistic interpretation above, we use a type II method to estimate $\\{\\rho_k\\}$ in the generalized normal prior which can be viewed as a generalization of the ARD prior. In the ARD prior, according to [19], this approach is likely to lead to overfitting because the hyperparameters in the ARD prior are treated as point estimates. Similar to the ARD prior, the model in the above section may overfit since $\\{\\rho_k\\}$ are estimated via point estimation. In the following, we will present our probabilistic framework for multi-task feature selection by imposing priors on the hyperparameters.\n\n4.1 The Model\n\nAs in the above section, the likelihood for $\\mathbf{x}_j^i$ is also defined based on the normal distribution:\n\n$$y_j^i \\sim \\mathcal{N}(\\mathbf{w}_i^T \\mathbf{x}_j^i + b_i, \\sigma_i^2). \\tag{6}$$\n\nHere we use different noise variances $\\sigma_i^2$ for different tasks to make our model more flexible. The prior on $\\mathbf{W}$ is also defined similarly:\n\n$$w_{ki} \\sim \\mathcal{GN}(0, \\rho_k, q). \\tag{7}$$\n\nThe main difference here is that we treat $\\rho_k$ as a random variable with the noninformative Jeffreys prior:\n\n$$p(\\rho_k) \\propto \\sqrt{I(\\rho_k)} = \\sqrt{E_{\\mathbf{w}^k | \\rho_k}\\left[ \\left( \\frac{\\partial \\ln p(\\mathbf{w}^k | \\rho_k)}{\\partial \\rho_k} \\right)^2 \\right]} \\propto \\frac{1}{\\rho_k}, \\tag{8}$$\n\nwhere $I(\\rho_k)$ denotes the Fisher information for $\\rho_k$ and $E_\\theta[\\cdot]$ denotes the expectation with respect to $\\theta$. One advantage of using the Jeffreys prior is that the distribution has no hyperparameters.\n\n4.2 Parameter Learning and Inference\n\nHere we use the EM algorithm [8] to learn the model parameters. In our model, we denote $\\Theta = \\{\\mathbf{W}, \\mathbf{b}, \\{\\sigma_i\\}, q\\}$ as the model parameters and $\\boldsymbol{\\rho} = (\\rho_1, \\ldots, \\rho_d)^T$ as the hidden variables.\n\nIn the E-step, we construct the so-called $Q$-function as the surrogate for the log-likelihood:\n\n$$Q(\\Theta | \\Theta^{(t)}) = \\int \\ln p(\\Theta | \\mathbf{y}, \\boldsymbol{\\rho}) \\, p(\\boldsymbol{\\rho} | \\mathbf{y}, \\Theta^{(t)}) \\, d\\boldsymbol{\\rho},$$\n\nwhere $\\Theta^{(t)}$ denotes the estimate of $\\Theta$ in the $t$th iteration and $\\mathbf{y} = (y_1^1, \\ldots, y_{n_m}^m)^T$. It is easy to show that\n\n$$\\ln p(\\Theta | \\mathbf{y}, \\boldsymbol{\\rho}) \\propto \\ln p(\\mathbf{y} | \\mathbf{W}, \\{\\sigma_i\\}) + \\ln p(\\mathbf{W} | \\boldsymbol{\\rho}) \\propto -\\sum_{i=1}^m \\left[ \\sum_{j=1}^{n_i} \\frac{(y_j^i - \\mathbf{w}_i^T \\mathbf{x}_j^i - b_i)^2}{2 \\sigma_i^2} + \\frac{n_i \\ln \\sigma_i^2}{2} \\right] - \\sum_{k=1}^d \\sum_{i=1}^m \\frac{1}{\\rho_k^q} |w_{ki}|^q - md \\ln \\Gamma\\left(1 + \\frac{1}{q}\\right)$$\n\nand $p(\\boldsymbol{\\rho} | \\mathbf{y}, \\Theta^{(t)}) \\propto \\prod_{k=1}^d p(\\rho_k) \\, p(\\mathbf{w}^k_{(t)} | \\rho_k)$. We then compute $E[\\frac{1}{\\rho_k^q} | \\mathbf{y}, \\Theta^{(t)}]$ as\n\n$$E\\left[ \\frac{1}{\\rho_k^q} \\Big| \\mathbf{y}, \\Theta^{(t)} \\right] = \\frac{\\int_0^\\infty \\frac{1}{\\rho_k^q} p(\\rho_k) \\, p(\\mathbf{w}^k_{(t)} | \\rho_k) \\, d\\rho_k}{\\int_0^\\infty p(\\rho_k) \\, p(\\mathbf{w}^k_{(t)} | \\rho_k) \\, d\\rho_k} = \\frac{m}{q \\|\\mathbf{w}^k_{(t)}\\|_q^q}.$$\n\nSo we can get\n\n$$Q(\\Theta | \\Theta^{(t)}) = -\\sum_{i=1}^m \\left[ \\sum_{j=1}^{n_i} \\frac{(y_j^i - \\mathbf{w}_i^T \\mathbf{x}_j^i - b_i)^2}{2 \\sigma_i^2} + \\frac{n_i \\ln \\sigma_i^2}{2} \\right] - \\sum_{k=1}^d \\sum_{i=1}^m \\beta_k |w_{ki}|^q - md \\ln \\Gamma\\left(1 + \\frac{1}{q}\\right),$$\n\nwhere $\\beta_k = \\frac{m}{q \\|\\mathbf{w}^k_{(t)}\\|_q^q}$.\n\nIn the M-step, we maximize $Q(\\Theta | \\Theta^{(t)})$ to update the estimates of $\\mathbf{W}$, $\\mathbf{b}$, $\\{\\sigma_i\\}$ and $q$.\n\nFor the estimation of $\\mathbf{W}$, we need to solve $m$ convex optimization problems\n\n$$\\min_{\\mathbf{w}_i} J = \\eta_0 \\|\\tilde{\\mathbf{y}}_i - \\mathbf{X}_i^T \\mathbf{w}_i\\|_2^2 + \\sum_{k=1}^d \\beta_k |w_{ki}|^q, \\quad i = 1, \\ldots, m, \\tag{9}$$\n\nwhere $\\tilde{\\mathbf{y}}_i = (y_1^i - b_i^{(t)}, \\ldots, y_{n_i}^i - b_i^{(t)})^T$, $\\mathbf{X}_i = (\\mathbf{x}_1^i, \\ldots, \\mathbf{x}_{n_i}^i)$, and $\\eta_0 = \\frac{1}{2 (\\sigma_i^{(t)})^2}$. When $q = 2$, this becomes a ridge regression problem with feature-specific weights $\\beta_k$. Here $\\beta_k$ is related to the sparsity of the $k$th row in $\\mathbf{W}^{(t)}$: the smaller the $q$-norm of the $k$th row in $\\mathbf{W}^{(t)}$ (i.e., the sparser the row), the larger the $\\beta_k$. When $\\beta_k$ is large, $w_{ki}$ will be enforced to approach 0. We use a gradient method such as conjugate gradient to optimize problem (9). The subgradient with respect to $\\mathbf{w}_i$ is\n\n$$\\frac{\\partial J}{\\partial \\mathbf{w}_i} = 2 \\eta_0 (\\mathbf{X}_i \\mathbf{X}_i^T \\mathbf{w}_i - \\mathbf{X}_i \\tilde{\\mathbf{y}}_i) + q \\boldsymbol{\\xi},$$\n\nwhere $\\boldsymbol{\\xi} = (\\beta_1 |w_{1i}|^{q-1} \\mathrm{sign}(w_{1i}), \\ldots, \\beta_d |w_{di}|^{q-1} \\mathrm{sign}(w_{di}))^T$ and $\\mathrm{sign}(\\cdot)$ denotes the sign function.\n\nWe set the derivatives of $Q(\\Theta | \\Theta^{(t)})$ with respect to $b_i$ and $\\sigma_i$ to 0 and get\n\n$$b_i^{(t+1)} = \\frac{1}{n_i} \\sum_{j=1}^{n_i} \\left[ y_j^i - (\\mathbf{w}_i^{(t+1)})^T \\mathbf{x}_j^i \\right], \\qquad \\sigma_i^{(t+1)} = \\sqrt{\\frac{1}{n_i} \\sum_{j=1}^{n_i} \\left[ y_j^i - (\\mathbf{w}_i^{(t+1)})^T \\mathbf{x}_j^i - b_i^{(t+1)} \\right]^2}.$$\n\nFor the estimation of $q$, we also use a gradient method. The gradient can be calculated as\n\n$$\\frac{\\partial Q}{\\partial q} = -\\sum_{k=1}^d \\beta_k \\sum_{i=1, \\, w_{ki}^{(t+1)} \\neq 0}^m |w_{ki}^{(t+1)}|^q \\ln |w_{ki}^{(t+1)}| + \\frac{md}{q} + \\frac{md}{q^2} \\psi\\left(\\frac{1}{q}\\right),$$\n\nwhere $\\psi(x) \\equiv \\frac{\\partial \\ln \\Gamma(x)}{\\partial x}$ is the digamma function.\n\n4.3 Extension to Deal with Outlier Tasks and Tasks with Negative Correlation\n\nAn underlying assumption of multi-task feature selection using the $l_{1,q}$ norm is that all tasks are similar to each other and they share the same features. This assumption may not be correct in practice because there may exist outlier tasks (i.e., tasks that are unrelated to the other tasks) or tasks with negative correlation (i.e., tasks that are negatively correlated with some other tasks). 
In this section, we will discuss how to extend our probabilistic model to deal with these tasks.\n\nWe first introduce the matrix variate generalized normal distribution [13] which is a generalization of the generalized normal distribution to random matrices.\n\nDefinition 3 A matrix $\\mathbf{Z} \\in \\mathbb{R}^{s \\times t}$ is a matrix variate generalized normal random variable iff its p.d.f. is given as follows:\n\n$$p(\\mathbf{Z} | \\mathbf{M}, \\boldsymbol{\\Sigma}, \\boldsymbol{\\Omega}, q) = \\frac{1}{\\left[ 2 \\Gamma(1 + \\frac{1}{q}) \\right]^{st} \\det(\\boldsymbol{\\Sigma})^t \\det(\\boldsymbol{\\Omega})^s} \\exp\\left[ -\\sum_{i=1}^s \\sum_{j=1}^t \\left| \\sum_{k=1}^s \\sum_{l=1}^t (\\boldsymbol{\\Sigma}^{-1})_{ik} (Z_{kl} - M_{kl}) (\\boldsymbol{\\Omega}^{-1})_{lj} \\right|^q \\right],$$\n\nwhere $\\boldsymbol{\\Sigma} \\in \\mathbb{R}^{s \\times s}$ and $\\boldsymbol{\\Omega} \\in \\mathbb{R}^{t \\times t}$ are nonsingular, $\\det(\\cdot)$ denotes the determinant of a square matrix, $A_{ij}$ is the $(i, j)$th element of matrix $\\mathbf{A}$ and $(\\mathbf{A}^{-1})_{ij}$ is the $(i, j)$th element of the matrix inverse $\\mathbf{A}^{-1}$.\n\nWe write $\\mathbf{Z} \\sim \\mathcal{MGN}_{s \\times t}(\\mathbf{M}, \\boldsymbol{\\Sigma}, \\boldsymbol{\\Omega}, q)$ for a matrix variate generalized normal random variable $\\mathbf{Z}$. When $q = 2$, the matrix variate generalized normal distribution becomes the (ordinary) matrix variate normal distribution [12] with row covariance matrix $\\boldsymbol{\\Sigma} \\boldsymbol{\\Sigma}^T$ and column covariance matrix $\\boldsymbol{\\Omega} \\boldsymbol{\\Omega}^T$, which has been used before in multi-task learning [4, 32, 31]. 
From this view, $\\boldsymbol{\\Sigma}$ is used to model the relationships between the rows of $\\mathbf{Z}$ and $\\boldsymbol{\\Omega}$ is to model the relationships between the columns.\n\nWe note that the prior on $\\mathbf{W}$ in Eq. (7) can be written as\n\n$$\\mathbf{W} \\sim \\mathcal{MGN}_{d \\times m}(\\mathbf{0}, \\mathrm{diag}((\\rho_1, \\ldots, \\rho_d)^T), \\mathbf{I}_m, q),$$\n\nwhere $\\mathbf{0}$ denotes a zero vector or matrix of proper size, $\\mathbf{I}_m$ denotes the $m \\times m$ identity matrix and $\\mathrm{diag}(\\cdot)$ converts a vector into a diagonal matrix. In this formulation, it can be seen that the columns of $\\mathbf{W}$ (and hence the tasks) are independent of each other. However, the tasks are in general not independent. So we propose to use a new prior on $\\mathbf{W}$:\n\n$$\\mathbf{W} \\sim \\mathcal{MGN}_{d \\times m}(\\mathbf{0}, \\mathrm{diag}((\\rho_1, \\ldots, \\rho_d)^T), \\boldsymbol{\\Omega}, q), \\tag{10}$$\n\nwhere $\\boldsymbol{\\Omega}$ models the pairwise relationships between tasks.\n\nThe likelihood is still based on the normal distribution. Since in practice the relationships between tasks are not known in advance, we also need to estimate $\\boldsymbol{\\Omega}$ from data.\n\nFor parameter learning, we again use the EM algorithm to learn the model parameters. Here the model parameters are denoted as $\\Theta = \\{\\mathbf{W}, \\mathbf{b}, \\{\\sigma_i\\}, q, \\boldsymbol{\\Omega}\\}$. 
It is easy to show that\n\n$$\\ln p(\\Theta | \\mathbf{y}, \\boldsymbol{\\rho}) \\propto -\\sum_{i=1}^m \\left[ \\sum_{j=1}^{n_i} \\frac{(y_j^i - \\mathbf{w}_i^T \\mathbf{x}_j^i - b_i)^2}{2 \\sigma_i^2} + \\frac{n_i \\ln \\sigma_i^2}{2} \\right] - \\sum_{k=1}^d \\sum_{i=1}^m \\frac{1}{\\rho_k^q} \\left| \\sum_{j=1}^m w_{kj} (\\boldsymbol{\\Omega}^{-1})_{ji} \\right|^q - md \\ln \\Gamma\\left(1 + \\frac{1}{q}\\right) - d \\ln \\det(\\boldsymbol{\\Omega}).$$\n\nThen we compute $E[\\frac{1}{\\rho_k^q} | \\mathbf{y}, \\Theta^{(t)}]$ as\n\n$$E\\left[ \\frac{1}{\\rho_k^q} \\Big| \\mathbf{y}, \\Theta^{(t)} \\right] = \\frac{\\int_0^\\infty \\frac{1}{\\rho_k^q} p(\\rho_k) \\, p(\\mathbf{w}^k_{(t)} | \\rho_k) \\, d\\rho_k}{\\int_0^\\infty p(\\rho_k) \\, p(\\mathbf{w}^k_{(t)} | \\rho_k) \\, d\\rho_k} = \\frac{m}{q \\sum_{i=1}^m \\left| \\sum_{j=1}^m w_{kj}^{(t)} ((\\boldsymbol{\\Omega}^{(t)})^{-1})_{ji} \\right|^q} \\equiv \\beta_k^{(t)}.$$\n\nIn the E-step, the $Q$-function can be formulated as\n\n$$Q(\\Theta | \\Theta^{(t)}) = -\\sum_{i=1}^m \\left[ \\sum_{j=1}^{n_i} \\frac{(y_j^i - \\mathbf{w}_i^T \\mathbf{x}_j^i - b_i)^2}{2 \\sigma_i^2} + \\frac{n_i \\ln \\sigma_i^2}{2} \\right] - \\sum_{k=1}^d \\beta_k^{(t)} \\sum_{i=1}^m \\left| \\sum_{j=1}^m w_{kj} (\\boldsymbol{\\Omega}^{-1})_{ji} \\right|^q - md \\ln \\Gamma\\left(1 + \\frac{1}{q}\\right) - d \\ln \\det(\\boldsymbol{\\Omega}).$$\n\nIn the M-step, for $\\mathbf{W}$ and $\\boldsymbol{\\Omega}$, the optimization problem becomes\n\n$$\\min_{\\mathbf{W}, \\boldsymbol{\\Omega}} \\sum_{i=1}^m \\eta_i^{(t)} \\sum_{j=1}^{n_i} (\\tilde{y}_j^i - \\mathbf{w}_i^T \\mathbf{x}_j^i)^2 + \\sum_{k=1}^d \\beta_k^{(t)} \\sum_{i=1}^m \\left| \\sum_{j=1}^m w_{kj} (\\boldsymbol{\\Omega}^{-1})_{ji} \\right|^q + d \\ln \\det(\\boldsymbol{\\Omega}),$$\n\nwhere $\\tilde{y}_j^i = y_j^i - b_i^{(t)}$ and $\\eta_i^{(t)} = \\frac{1}{2 (\\sigma_i^{(t)})^2}$. We define a new variable $\\tilde{\\mathbf{W}} = \\mathbf{W} \\boldsymbol{\\Omega}^{-1}$ to rewrite the above problem as\n\n$$\\min_{\\tilde{\\mathbf{W}}, \\boldsymbol{\\Omega}} J = \\sum_{i=1}^m \\eta_i^{(t)} \\sum_{j=1}^{n_i} (\\tilde{y}_j^i - \\mathbf{e}_i^T \\boldsymbol{\\Omega}^T \\tilde{\\mathbf{W}}^T \\mathbf{x}_j^i)^2 + \\sum_{k=1}^d \\sum_{i=1}^m \\beta_k^{(t)} |\\tilde{w}_{ki}|^q + d \\ln \\det(\\boldsymbol{\\Omega}),$$\n\nwhere $\\mathbf{e}_i$ denotes the $i$th column of the $m \\times m$ identity matrix. We use an alternating method to solve this problem. For a fixed $\\boldsymbol{\\Omega}$, the problem with respect to $\\tilde{\\mathbf{W}}$ is a convex problem and we use conjugate gradient to solve it with the following subgradient\n\n$$\\frac{\\partial J}{\\partial \\tilde{\\mathbf{W}}} = 2 \\sum_{i=1}^m \\eta_i^{(t)} \\sum_{j=1}^{n_i} \\left[ \\mathbf{x}_j^i (\\mathbf{x}_j^i)^T \\tilde{\\mathbf{W}} \\boldsymbol{\\Omega} \\mathbf{e}_i \\mathbf{e}_i^T \\boldsymbol{\\Omega}^T - \\tilde{y}_j^i \\mathbf{x}_j^i \\mathbf{e}_i^T \\boldsymbol{\\Omega}^T \\right] + q \\mathbf{M},$$\n\nwhere $\\mathbf{M}$ is a $d \\times m$ matrix with the $(k, i)$th element $\\beta_k^{(t)} |\\tilde{w}_{ki}|^{q-1} \\mathrm{sign}(\\tilde{w}_{ki})$. For a fixed $\\tilde{\\mathbf{W}}$, we also use conjugate gradient with the following gradient\n\n$$\\frac{\\partial J}{\\partial \\boldsymbol{\\Omega}} = 2 \\sum_{i=1}^m \\eta_i^{(t)} \\sum_{j=1}^{n_i} \\left[ \\tilde{\\mathbf{W}}^T \\mathbf{x}_j^i (\\mathbf{x}_j^i)^T \\tilde{\\mathbf{W}} \\boldsymbol{\\Omega} \\mathbf{e}_i \\mathbf{e}_i^T - \\tilde{y}_j^i \\tilde{\\mathbf{W}}^T \\mathbf{x}_j^i \\mathbf{e}_i^T \\right] + d (\\boldsymbol{\\Omega}^T)^{-1}.$$\n\nAfter obtaining the optimal $\\tilde{\\mathbf{W}}^\\star$ and $\\boldsymbol{\\Omega}^\\star$, we can compute the optimal $\\mathbf{W}^\\star$ as $\\mathbf{W}^\\star = \\tilde{\\mathbf{W}}^\\star \\boldsymbol{\\Omega}^\\star$. The update rules for $\\{b_i\\}$, $\\{\\sigma_i\\}$ and $q$ are similar to those in the above section.\n\n5 Related Work\n\nSome probabilistic multi-task feature selection methods have been proposed before [28, 2]. However, they only focus on the $l_{1,2}$ norm. Moreover, they use point estimation in the ARD prior and hence, as discussed at the beginning of Section 4, are susceptible to overfitting [19].\n\nZhang et al. 
[30] proposed a latent variable model for multi-task learning that uses the Laplace prior to enforce sparsity. This is equivalent to using the $l_{1,1}$ norm in our framework which, as discussed above, cannot enforce group sparsity among different features over all tasks.\n\n6 Experiments\n\nIn this section, we study our methods empirically on two cancer classification applications using microarray gene expression data. We compare our methods with three related methods: multi-task feature learning (MTFL) [1]$^1$, multi-task feature selection using $l_{1,2}$ regularization [16]$^2$, and multi-task feature selection using $l_{1,\\infty}$ regularization [20]$^3$.\n\n6.1 Breast Cancer Classification\n\nWe first conduct an empirical study on a breast cancer classification application. This application consists of three learning tasks with data collected under different platforms [21]. The dataset for the first task, collected at the Koo Foundation Sun Yat-Sen Cancer Centre in Taipei, contains 89 samples with 8948 genes per sample. The dataset for the second task, obtained from the Netherlands Cancer Institute, contains 97 samples with 16360 genes per sample. Most of the patients in this dataset had stage I or II breast cancer. The dataset for the third task, obtained using 22K Agilent oligonucleotide arrays, contains 114 samples with 12065 genes per sample. Even though these three datasets were collected under different platforms, they share 6092 common genes, which are used in our experiments.\nHere we abbreviate the method in Section 4.2 as PMTFS1 and that in Section 4.3 as PMTFS2. For each task, we choose 70% of the data for training and the rest for testing. We perform 10 random splits of the data and report the mean and standard deviation of the classification error over the 10 trials. The results are summarized in Table 1. 
It is clear that PMTFS1 outperforms the three previous methods, showing the effectiveness of our more general formulation with $q$ determined automatically. Moreover, we also note that PMTFS2 is better than PMTFS1. This verifies the usefulness of exploiting the relationships between tasks in multi-task feature selection. Since our methods can estimate $q$ automatically, we compute the mean of the estimated $q$ values over 10 trials. The means for PMTFS1 and PMTFS2 are 2.5003 and 2.6718, respectively, which seem to imply that smaller values of $q$ are preferred for this application. This probably explains why the performance of MTFS$_{1,\\infty}$ is not good when compared with other methods.\n\nTable 1: Comparison of different methods on the breast cancer classification application in terms of classification error rate (in mean\u00b1std-dev). Each column in the table represents one task.\n\nMethod | 1st Task | 2nd Task | 3rd Task\nMTFL | 0.3478\u00b10.1108 | 0.0364\u00b10.0345 | 0.3091\u00b10.0498\nMTFS$_{1,2}$ | 0.3370\u00b10.0228 | 0.0343\u00b10.0134 | 0.2855\u00b10.0337\nMTFS$_{1,\\infty}$ | 0.3896\u00b10.0583 | 0.1136\u00b10.0579 | 0.2909\u00b10.0761\nPMTFS1 | 0.3072\u00b10.0234 | 0.0298\u00b10.0121 | 0.1786\u00b10.0245\nPMTFS2 | 0.2870\u00b10.0228 | 0.0273\u00b10.0102 | 0.1455\u00b10.0263\n\n6.2 Prostate Cancer Classification\n\nWe next study a prostate cancer classification application consisting of two tasks. The Singh dataset [22] for the first task is made up of laser intensity images from each microarray. The RMA preprocessing method was used to produce gene expression values from these images. 
On the other hand, the Welsh dataset [25] for the second task is already in the form of gene expression values. Even though the collection techniques for the two datasets are different, they have 12600 genes in common, which are used in our experiments.\nThe experimental setup for this application is similar to that in the previous subsection, that is, 70% of the data of each task are used for training and the rest for testing, and 10 random splits of the data are performed. We report the mean and standard deviation of the classification error over the 10 trials in Table 2. As in the first set of experiments, PMTFS1 and PMTFS2 are better than the other three methods compared and PMTFS2 slightly outperforms PMTFS1. The means of the estimated $q$ values for PMTFS1 and PMTFS2 are 2.5865 and 2.6319, respectively. So it seems that smaller values of $q$ are also preferred for this application.\n\nTable 2: Comparison of different methods on the prostate cancer classification application in terms of classification error rate (in mean\u00b1std-dev). Each column in the table represents one task.\n\nMethod | 1st Task | 2nd Task\nMTFL | 0.1226\u00b10.0620 | 0.3500\u00b10.0085\nMTFS$_{1,2}$ | 0.1232\u00b10.0270 | 0.3420\u00b10.0067\nMTFS$_{1,\\infty}$ | 0.2216\u00b10.1667 | 0.4200\u00b10.1304\nPMTFS1 | 0.1123\u00b10.0170 | 0.3214\u00b10.0053\nPMTFS2 | 0.1032\u00b10.0136 | 0.3000\u00b10.0059\n\n$^1$ http://ttic.uchicago.edu/~argyriou/code/index.html\n$^2$ http://www.public.asu.edu/~jye02/Software/SLEP/index.htm\n$^3$ http://www.lsi.upc.edu/~aquattoni/\n\n7 Concluding Remarks\n\nIn this paper, we have proposed a probabilistic framework for general multi-task feature selection using the $l_{1,q}$ norm ($1 < q \\le \\infty$). Our model allows the optimal value of $q$ to be determined from data automatically. 
Besides considering the case in which all tasks are similar, we have also considered the more general and challenging case in which there also exist outlier tasks or tasks with negative correlation.\nCompressed sensing aims at recovering the sparse signal $\\mathbf{w}$ from a measurement vector $\\mathbf{b} = \\mathbf{A}\\mathbf{w}$ for a given matrix $\\mathbf{A}$. Compressed sensing can be extended to the multiple measurement vector (MMV) model in which the signals are represented as a set of jointly sparse vectors sharing a common set of nonzero elements [7, 6, 23]. Specifically, joint compressed sensing considers the reconstruction of the signal represented by a matrix $\\mathbf{W}$, which is given by a dictionary (or measurement matrix) $\\mathbf{A}$ and multiple measurement vectors $\\mathbf{B}$ such that $\\mathbf{B} = \\mathbf{A}\\mathbf{W}$. Similar to multi-task feature selection, we can use $\\|\\mathbf{W}\\|_{1,q}$ to enforce the joint sparsity in $\\mathbf{W}$. Since there usually exists noise in the data, the optimization problem of MMV can be formulated as: $\\min_{\\mathbf{W}} \\lambda\\|\\mathbf{W}\\|_{1,q} + \\|\\mathbf{A}\\mathbf{W} - \\mathbf{B}\\|_2^2$. This problem is almost identical to problem (1) except that the loss defines the reconstruction error rather than the prediction error. So we can use the probabilistic model presented in Section 4 to develop a probabilistic model for joint compressed sensing. Besides, we are also interested in developing a full Bayesian version of our model to further exploit the advantages of Bayesian modeling.\n\nAcknowledgment\n\nThis research has been supported by General Research Fund 622209 from the Research Grants Council of Hong Kong.\n\nReferences\n[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243\u2013272, 2008.\n[2] J. Bi, T. Xiong, S. Yu, M. Dundar, and R. B. Rao. An improved multi-task learning approach with applications in medical diagnosis. In ECMLPKDD, 2008.\n[3] C. M. Bishop. Pattern Recognition and Machine Learning. 
Springer, New York, 2006.\n[4] E. Bonilla, K. M. A. Chai, and C. Williams. Multi-task Gaussian process prediction. In NIPS 20, 2008.\n[5] E. J. Cand\u00e8s, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted $l_1$ minimization. Journal of Fourier Analysis and Applications, 14(5):877\u2013905, 2008.\n[6] J. Chen and X. Huo. Theoretical results on sparse representations of multiple-measurement vectors. IEEE Transactions on Signal Processing, 54(12):4634\u20134643, 2006.\n[7] S. F. Cotter, B. D. Rao, K. Engan, and K. Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing, 53(7):2477\u20132488, 2005.\n[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39(1):1\u201338, 1977.\n[9] M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150\u20131159, 2003.\n[10] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, 2nd edition, 2003.\n[11] I. R. Goodman and S. Kotz. Multivariate $\\theta$-generalized normal distributions. Journal of Multivariate Analysis, 3(2):204\u2013219, 1973.\n[12] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall, 2000.\n[13] A. K. Gupta and T. Varga. Matrix variate $\\theta$-generalized normal distribution. Transactions of The American Mathematical Society, 347(4):1429\u20131437, 1995.\n[14] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1\u201359, 2000.\n[15] H. Liu, M. Palatucci, and J. Zhang. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In ICML, 2009.\n[16] J. Liu, S. Ji, and J. Ye. 
Multi-task feature learning via efficient $l_{2,1}$-norm minimization. In UAI, 2009.\n[17] G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Technical report, Department of Statistics, University of California, Berkeley, June 2006.\n[18] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231\u2013252, 2010.\n[19] Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In ICML, 2004.\n[20] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for $l_{1,\\infty}$ regularization. In ICML, 2009.\n[21] A. A. Shabalin, H. Tjelmeland, C. Fan, C. M. Perou, and A. B. Nobel. Merging two gene-expression studies via cross-platform normalization. Bioinformatics, 24(9):1154\u20131160, 2008.\n[22] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203\u2013209, 2002.\n[23] L. Sun, J. Liu, J. Chen, and J. Ye. Efficient recovery of jointly sparse vectors. In NIPS 22, 2009.\n[24] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349\u2013363, 2005.\n[25] J. B. Welsh, L. M. Sapinoso, A. I. Su, S. G. Kern, J. Wang-Rodriguez, C. A. Moskaluk, F. H. Frierson, Jr., and G. M. Hampton. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Research, 61(16):5974\u20135978, 2001.\n[26] D. Wipf and S. Nagarajan. A new view of automatic relevance determination. In NIPS 20, 2007.\n[27] D. P. Wipf and S. Nagarajan. 
Iterative reweighted $l_1$ and $l_2$ methods for finding sparse solutions. Journal of Selected Topics in Signal Processing, 2010.\n[28] T. Xiong, J. Bi, B. Rao, and V. Cherkassky. Probabilistic joint feature selection for multi-task learning. In SDM, 2007.\n[29] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 2006.\n[30] J. Zhang, Z. Ghahramani, and Y. Yang. Flexible latent variable models for multi-task learning. Machine Learning, 73(3):221\u2013242, 2008.\n[31] Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, 2010.\n[32] Y. Zhang and D.-Y. Yeung. Multi-task learning using generalized $t$ process. In AISTATS, 2010.\n", "award": [], "sourceid": 289, "authors": [{"given_name": "Yu", "family_name": "Zhang", "institution": null}, {"given_name": "Dit-Yan", "family_name": "Yeung", "institution": null}, {"given_name": "Qian", "family_name": "Xu", "institution": null}]}