{"title": "Large Margin Multi-Task Metric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1867, "page_last": 1875, "abstract": "Multi-task learning (MTL) improves the prediction performance on multiple, different but related, learning problems through shared parameters or representations. One of the most prominent multi-task learning algorithms is an extension to svms by Evgeniou et al. Although very elegant, multi-task svm is inherently restricted by the fact that support vector machines require each class to be addressed explicitly with its own weight vector which, in a multi-task setting, requires the different learning tasks to share the same set of classes. This paper proposes an alternative formulation for multi-task learning by extending the recently published large margin nearest neighbor (lmnn) algorithm to the MTL paradigm. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule which inherently extends to many classes and becomes a natural fit for multitask learning. We evaluate the resulting multi-task lmnn on real-world insurance data and speech classification problems and show that it consistently outperforms single-task kNN under several metrics and state-of-the-art MTL classifiers.", "full_text": "Large Margin Multi-Task Metric Learning\n\nShibin Parameswaran\n\nKilian Q. Weinberger\n\nDepartment of Electrical and Computer Engineering\n\nDepartment of Computer Science and Engineering\n\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\n\nsparames@ucsd.edu\n\nWashington University in St. Louis\n\nSt. Louis, MO 63130\nkilian@wustl.edu\n\nAbstract\n\nMulti-task learning (MTL) improves the prediction performance on multiple, different but re-\nlated, learning problems through shared parameters or representations. One of the most promi-\nnent multi-task learning algorithms is an extension to support vector machines (svm) by Evge-\nniou et al. [15]. 
Although very elegant, multi-task svm is inherently restricted by the fact that support vector machines require each class to be addressed explicitly with its own weight vector which, in a multi-task setting, requires the different learning tasks to share the same set of classes. This paper proposes an alternative formulation for multi-task learning by extending the recently published large margin nearest neighbor (lmnn) algorithm to the MTL paradigm. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule, which inherently extends to many classes and becomes a natural fit for multi-task learning. We evaluate the resulting multi-task lmnn on real-world insurance data and speech classification problems and show that it consistently outperforms single-task kNN under several metrics and state-of-the-art MTL classifiers.

1 Introduction

Multi-task learning (MTL) [6, 8, 19] refers to the joint training of multiple problems, enforcing a common intermediate parameterization or representation. If the different problems are sufficiently related, MTL can lead to better generalization and benefit all of the tasks. This phenomenon has been examined further by recent papers which have started to build a theoretical foundation that underpins these initial empirical findings [1, 2, 3].

A well-known application of MTL occurs within the realm of speech recognition. The way different people pronounce the same words differs greatly based on their gender, accent, nationality or other individual characteristics. One can view each possible speaker, or clusters of speakers, as different learning problems that are highly related. Ideally, a speech recognition system should be trained only on data from the user it is intended for. However, annotated data is expensive and difficult to obtain.
Therefore, it is highly beneficial to leverage the similarities of data sets from different types of speakers while adapting to the specifics of each particular user [13, 16].

One particularly successful instance of multi-task learning is its adaptation to support vector machines (svm) [14, 15]. Support vector machines are arguably amongst the most successful classification algorithms of all time; however, their multi-class extensions such as one-vs-all [4] or clever refinements of the loss functions [10, 21] all require at least one weight vector per class label. As a consequence, the MTL adaptation of svm [15] requires all tasks to share an identical set of labels (or requires side-information about task dependencies) for meaningful transfer of knowledge. This is a serious limitation in many domains (binary or non-binary) where different tasks might not share the same classes (e.g., identifying multiple diseases from a particular patient's data).

Recently, Weinberger et al. introduced Large Margin Nearest Neighbor (lmnn) [20], an algorithm that translates the maximum margin learning principle behind svms to k-nearest neighbor classification (kNN) [9]. Similar to svms, the solution of lmnn is also obtained through a convex optimization problem that maximizes a large margin between input vectors from different classes. However, instead of positioning a separating hyperplane, lmnn learns a Mahalanobis metric. Weinberger et al. show that the lmnn metric improves the kNN classification accuracy to be on par with kernelized svms [20]. One advantage that the kNN decision rule has over hyperplane classifiers is its agnosticism towards the number of class labels of a particular data set. A new test point is classified by the majority label of its k closest neighbors within a known training data set; additional classes require no special treatment.

We follow the intuition of Evgeniou et al.
[15] and extend lmnn to the multi-task setting. Our algorithm learns one metric that is shared amongst all the tasks and one specific metric unique to each task. We show that the combination is still a well-defined pseudo-metric that can be learned in a single convex optimization problem. We demonstrate on several multi-task settings that these shared metrics significantly reduce the overall classification error. Further, our algorithm tends to outperform multi-task neural networks [6] and svm [15] on tasks with many class-labels. To our knowledge, this paper introduces the first multi-task metric learning algorithm for the kNN rule that explicitly models the commonalities and specifics of different tasks.

2 Large Margin Nearest Neighbor

This section describes the large margin nearest neighbor algorithm as introduced in [20]. For now, we will focus on a single-task learning framework, with a training set consisting of n examples of dimensionality d, {(x_i, y_i)}_{i=1}^n, where x_i \in R^d and y_i \in {1, 2, ..., c}. Here, c denotes the number of classes. The Mahalanobis distance between two inputs x_i and x_j is defined as

    d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M (x_i - x_j)},    (1)

where M is a symmetric positive definite matrix (M ⪰ 0). The definition in eq. (1) reduces to the Euclidean metric if we set M to the identity matrix, i.e. M = I. The lmnn algorithm learns the matrix M for the Mahalanobis metric¹ in eq. (1) explicitly to enhance k-nearest neighbor classification.

Lmnn mimics the non-continuous, non-differentiable leave-one-out classification error of kNN with a convex loss function. The loss function encourages the local neighborhood around every input to stay "pure". Inputs with different labels are pushed away and inputs with a similar label are pulled closer.
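As a concrete illustration of eq. (1) and the pull/push intuition above, the following numpy sketch evaluates the lmnn objective on a toy example. The helper names (`sq_dist`, `lmnn_loss`, `pairs`, `triples`) are our own, not from [20]:

```python
import numpy as np

def sq_dist(xi, xj, M):
    """Squared Mahalanobis distance of eq. (1): (xi - xj)^T M (xi - xj)."""
    d = xi - xj
    return float(d @ M @ d)

def lmnn_loss(X, M, pairs, triples, mu=1.0):
    """Sketch of the lmnn objective: pull target-neighbor pairs (i, j) close,
    and hinge-penalize every triple (i, j, k) whose impostor k violates the
    unit margin of eq. (2)."""
    pull = sum(sq_dist(X[i], X[j], M) for i, j in pairs)
    push = sum(max(0.0, 1.0 + sq_dist(X[i], X[j], M) - sq_dist(X[i], X[k], M))
               for i, j, k in triples)
    return pull + mu * push

# Toy check: two tight same-class points and one far-away impostor.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]])
pairs, triples = [(0, 1)], [(0, 1, 2)]
# Under the Euclidean metric (M = I) the margin of eq. (2) is satisfied,
# so the hinge term vanishes and only the small pull term remains.
loss = lmnn_loss(X, np.eye(2), pairs, triples)
assert np.isclose(loss, 0.01)
```

Learning M then means minimizing this loss subject to M staying positive semidefinite, which is exactly the SDP of Table 1 below.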
One of the advantages of lmnn over related work [12, 17] is that the (global) metric is optimized locally, which allows it to work with multi-modal data distributions and encourages better generalization. To achieve this, the algorithm requires k target neighbors to be identified for every input prior to learning, which should become the k nearest neighbors after the optimization. Usually, these are picked with the help of side-information or, in the absence thereof, as the k nearest neighbors within the same class based on the Euclidean metric. We use the notation j ⇝ i to indicate that x_j is a target neighbor of x_i. Lmnn learns a Mahalanobis metric that keeps each input x_i closer to its target neighbors than other inputs with different class labels (impostors), by a large margin. For an input x_i, target neighbor x_j, and impostor x_k, this relation can be expressed as a linear inequality constraint with respect to the squared distance d_M^2(\cdot,\cdot):

    d_M^2(x_i, x_k) - d_M^2(x_i, x_j) \geq 1.    (2)

Eq. (2) is enforced only for the local target neighbors. See Fig. 1 for an illustration. Here, all points on the circles have equal distance from x_i. Under the Mahalanobis metric this circle is deformed to an ellipsoid, which causes the impostors (marked as squares) to be further away than the target neighbors.

Figure 1: An illustration of a data set before and after lmnn. The circles represent points of equal distance to the vector x_i. The Mahalanobis metric rescales directions to push impostors further away than target neighbors by a large margin.

The semidefinite program (SDP) introduced by [20] moves target neighbors close by minimizing \sum_{j ⇝ i} d_M^2(x_i, x_j) while penalizing violations of the constraint in eq. (2). The latter is achieved through additive slack variables \xi_{ijk} \geq 0. If we define a set of triples S = {(i, j, k) : j ⇝ i, y_k \neq y_i}, the problem can be stated as the SDP shown in Table 1.

Table 1: Convex optimization problem of lmnn.

    \min_M  \sum_{j ⇝ i} d_M^2(x_i, x_j) + \mu \sum_{(i,j,k) \in S} \xi_{ijk}
    subject to: \forall (i, j, k) \in S:
      (1) d_M^2(x_i, x_k) - d_M^2(x_i, x_j) \geq 1 - \xi_{ijk}
      (2) \xi_{ijk} \geq 0
      (3) M ⪰ 0.

This optimization problem has O(kn^2) constraints of type (1) and (2), along with the positive semidefinite constraint on a d × d matrix M. Hence, standard off-the-shelf packages are not particularly suited to solve this SDP. For this paper we use the special purpose subgradient descent solver developed in [20], which can handle data sets on the order of tens of thousands of samples. As the optimization problem is not sensitive to the exact choice of the tradeoff constant \mu [20], we set \mu = 1 throughout this paper.

¹For simplicity we will refer to pseudo-metrics also as metrics, as the distinction has no implications for our algorithm.

3 Multi-Task Learning

In this section, we briefly review the approach presented by Evgeniou et al. [15] that extends svm to multi-task learning (mt-svm). We assume that we are given T different but related tasks. Each input (x_i, y_i) belongs to exactly one of the tasks 1, ..., T, and we let I_t be the set of indices such that i \in I_t if and only if the input-label pair (x_i, y_i) belongs to task t.
For simplification, throughout this section we will assume a binary classification scenario, in particular y_i \in {+1, -1}. Following the original description of [15], mt-svm learns T classifiers w_1, ..., w_T, where each classifier w_t is specifically dedicated to task t. In addition, the authors introduce a global classifier w_0 that captures the commonality among all the tasks. An example x_i \in I_t is classified by the rule \hat{y}_i = sign(x_i^\top (w_0 + w_t)). The joint optimization problem is to minimize the following cost:

    \min_{w_0, ..., w_T}  \sum_{t=0}^{T} \gamma_t \|w_t\|_2^2 + \sum_{t=1}^{T} \sum_{i \in I_t} [1 - y_i (w_0 + w_t)^\top x_i]_+    (3)

where [a]_+ = max(0, a). The constants \gamma_t \geq 0 trade off the regularization of the various tasks. Note that the relative value between \gamma_0 and the other \gamma_{t>0} controls the strength of the connection across tasks. In the extreme case, if \gamma_0 \to +\infty, then w_0 = \vec{0} and all tasks are decoupled; on the other hand, when \gamma_0 is small and \gamma_{t>0} \to +\infty we obtain w_{t>0} = \vec{0} and all the tasks share the same decision function with weights w_0. Although the mt-svm formulation is very elegant, it requires all tasks to share the same class labels. In the remainder of this paper we will introduce an MTL algorithm based on the kNN rule, which does not model each class with its own parameter vector.

4 Multi-Task Large Margin Nearest Neighbor

In this section we combine large margin nearest neighbor classification from section 2 with the multi-task learning paradigm from section 3. We follow the MTL setting with T learning tasks. Our goal is to learn a metric d_t(\cdot,\cdot) for each of the T tasks that minimizes the kNN leave-one-out classification error.
Inspired by the methodology of the previous section, we model the commonalities between various tasks through a shared Mahalanobis metric with M_0 ⪰ 0 and the task-specific idiosyncrasies with additional matrices M_1, ..., M_T ⪰ 0. We define the distance for task t as

    d_t(x_i, x_j) = \sqrt{(x_i - x_j)^\top (M_0 + M_t) (x_i - x_j)}.    (4)

Intuitively, the metric defined by M_0 picks up general trends across multiple data sets and the M_{t>0} specialize the metric further for each particular task. See Fig. 2 for an illustration. If we constrain the matrices M_t to be positive semi-definite (i.e. M_t ⪰ 0), then eq. (4) will result in a well-defined pseudo-metric, as we show in section 4.1.

Figure 2: An illustration of mt-lmnn. The matrix M_0 captures the commonality between the several tasks, whereas M_t for t > 0 adds the task-specific distance transformation.

An important aspect of multi-task learning is the appropriate coupling of the multiple learning tasks. We have to ensure that the learning algorithm does not put too much emphasis onto the shared parameters M_0 or the individual parameters M_1, ..., M_T. To ensure this balance, we use the regularization term stated below:

    \min_{M_0, ..., M_T}  \gamma_0 \|M_0 - I\|_F^2 + \sum_{t=1}^{T} \gamma_t \|M_t\|_F^2.    (5)

The trade-off parameter \gamma_t controls the regularization of M_t for all t = 0, 1, ..., T. If \gamma_0 \to \infty, the shared metric M_0 reduces to the plain Euclidean metric, and if \gamma_{t>0} \to \infty, the task-specific metrics M_{t>0} become irrelevant zero matrices. Therefore, if \gamma_{t>0} \to \infty and \gamma_0 is small, we learn a single metric M_0 across all tasks. In this case we want the result to be equivalent to applying lmnn on the union of all data sets.
In the other extreme case, when \gamma_0 = 0 and \gamma_{t>0} \to \infty, we want our formulation to reduce to T independent lmnn algorithms.

Similar to the set of triples S defined in section 2, let S_t be the set of triples restricted to only vectors for task t, i.e., S_t = {(i, j, k) \in I_t^3 : j ⇝ i, y_k \neq y_i}. We can combine the regularizer in eq. (5) with the objective of lmnn applied to each of the T tasks. To ensure well-defined metrics, we add constraints that each matrix is positive semi-definite, i.e. M_t ⪰ 0 (see next paragraph for more details). We refer to the resulting algorithm as multi-task large margin nearest neighbor (mt-lmnn). The optimization problem is shown in Table 2 and can be solved efficiently after some modifications to the special-purpose solver presented by Weinberger et al. [20].

4.1 Theoretical Properties

In this section we verify that our resulting distances are guaranteed to be well-defined pseudo-metrics and that the optimization is convex.

Theorem 1 If M_t ⪰ 0 for all t = 0, ..., T then the distance functions d_t(\cdot,\cdot), as defined in eq. (4), are well-defined pseudo-metrics for all 0 \leq t \leq T.

The proof of Theorem 1 is completed in two steps: First, as the cone of positive semi-definite matrices is convex, any nonnegative linear combination of positive semidefinite matrices (in particular the sum M_0 + M_t) is also positive semidefinite. This implies that d_t(\cdot,\cdot) is non-negative, and it is also trivially symmetric. The second part of the proof utilizes the fact that any positive semidefinite matrix M can be decomposed as M = L^\top L, for some matrix L \in R^{d \times d}. It therefore follows that there exists some matrix L_t such that L_t^\top L_t = M_0 + M_t. Hence we can rephrase eq. (4) as

    d_t(x_i, x_j) = \sqrt{(x_i - x_j)^\top L_t^\top L_t (x_i - x_j)},    (6)

which is equivalent to the Euclidean distance after the transformation x_i \to L_t x_i.
It follows that eq. (6) preserves the triangle inequality. This completes the requirements for a pseudo-metric. If L_t is full rank, i.e. M_0 + M_t is strictly positive definite, then it also fulfills the identity of indiscernibles, i.e., d(x_i, x_j) = 0 if and only if x_i = x_j, and d(\cdot,\cdot) is a metric.

Table 2: Convex optimization problem of mt-lmnn.

    \min_{M_0, ..., M_T}  \gamma_0 \|M_0 - I\|_F^2 + \sum_{t=1}^{T} [ \gamma_t \|M_t\|_F^2 + \sum_{(i,j) \in I_t, j ⇝ i} d_t^2(x_i, x_j) + \sum_{(i,j,k) \in S_t} \xi_{ijk} ]
    subject to: \forall t, \forall (i, j, k) \in S_t:
      (1) d_t^2(x_i, x_k) - d_t^2(x_i, x_j) \geq 1 - \xi_{ijk}
      (2) \xi_{ijk} \geq 0
      (3) M_0, M_1, ..., M_T ⪰ 0.

One of the advantages of lmnn over alternative distance metric learning algorithms, for example NCA [17], is that it can be stated as a convex optimization problem. This allows the global solution to be found efficiently with special purpose solvers [20] or, for very large data sets, in an online relaxation [7]. It is therefore important to show that our new formulation preserves convexity.

Theorem 2 The mt-lmnn optimization problem in Table 2 is convex.

Constraints of type (2) and (3) are standard linear and positive-semidefinite constraints, which are known to be convex [5]. Convexity remains to be shown for constraints of type (1) and the objective. Both access the matrices M_t exclusively in terms of the squared distance d^2(\cdot,\cdot). This can be expressed as

    d_t^2(x_i, x_j) = trace(M_0 v_{ij} v_{ij}^\top) + trace(M_t v_{ij} v_{ij}^\top),    (7)

where v_{ij} = (x_i - x_j).
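Eq. (7) and the PSD argument of Theorem 1 can be checked numerically. A small numpy sketch (all names are ours; the PSD matrices are built as A^T A for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
M0, Mt = A.T @ A, B.T @ B                  # both PSD by construction
xi, xj = rng.normal(size=d), rng.normal(size=d)
v = xi - xj

# Direct evaluation of the squared distance of eq. (4) ...
direct = v @ (M0 + Mt) @ v
# ... equals the trace form of eq. (7), which is linear in M0 and Mt.
trace_form = np.trace(M0 @ np.outer(v, v)) + np.trace(Mt @ np.outer(v, v))
assert np.isclose(direct, trace_form)

# The sum of PSD matrices stays PSD (Theorem 1), so d_t is a valid
# pseudo-metric: all eigenvalues of M0 + Mt are non-negative.
assert np.all(np.linalg.eigvalsh(M0 + Mt) >= -1e-9)
```

The linearity of the trace form in M_0 and M_t is precisely what makes the constraints of type (1) and the distance terms of the objective convex.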
Eq. (7) is linear in terms of the matrices M_t, and it follows that the constraints of type (1) are also linear and therefore trivially convex. Similarly, it follows that all terms in the objective are also linear, with the exception of the Frobenius norms in the regularization term. The latter term is quadratic (\|M_t\|_F^2 = trace(M_t^\top M_t)) and therefore convex with respect to M_t. The regularization of M_0 can be expanded as trace(M_0^\top M_0 - 2 M_0 + I), which has one quadratic and one linear term. The sum of convex functions is convex [5], hence this concludes the proof.

5 Results

We evaluate mt-lmnn on the Isolet spoken alphabet recognition² and CoIL 2000 dataset³. We first provide a brief overview of the two datasets and then present results in various multi-task and domain adaptation settings.

The Isolet dataset was collected from 150 speakers uttering all characters in the English alphabet twice, i.e., each speaker contributed 52 training examples (in total 7797 examples⁴). The task is to classify which letter has been uttered based on several acoustic features: spectral coefficients, contour-, sonorant- and post-sonorant features. The exact feature description can be found in [16]. The speakers are grouped into smaller sets of 30 similar speakers, giving rise to 5 disjoint subsets called isolet1-5. This representation of Isolet lends itself naturally to the multi-task learning regime. We treat each of the subsets as its own classification task (T = 5) with c = 26 labels. The five tasks differ because the groups of speakers vary greatly in the way they utter the characters of the English alphabet. They are also highly related to each other because all the data is collected from the same utterances (the English alphabet).
To remove low-variance noise and to speed up computation time, we preprocess the Isolet data with PCA [18] and project it onto its leading principal components that capture 95% of the data variance, reducing the dimensionality from 617 to 169.

The CoIL dataset contains information about customers of an insurance company. The customer information consists of 86 variables, including product usage and socio-demographic data. The training set contains 5822 and the test set 4000 examples. Out of the 86 variables, we used 6 categorical features to create different classification problems, leaving the remaining 80 features as the joint data set. Our target variables consist of attributes 1, 4, 5, 6, 44 and 86, which indicate customer subtype, customer age bracket, customer occupation, a discretized percentage of Roman Catholics in that area, contribution from a third-party insurance, and a binary value that signifies whether the customer has a caravan insurance policy. The tasks have a different number of output labels but they share the same input data.

²Available for download from the UCI Machine Learning Repository.
³Available for download at http://kdd.ics.uci.edu/databases/tic/tic.html
⁴Three examples are historically missing.

Table 3: Error rates on label-compatible Isolet tasks when tested with task-specific train sets.

Isolet | Euc    | U-lmnn | st-lmnn | mt-lmnn | st-svm | mt-svm | st-net | mt-net
1      | 13.30% | 6.05%  | 5.32%   | 3.89%   | 4.74%  | 4.52%  | 8.75%  | 5.99%
2      | 18.62% | 6.53%  | 5.03%   | 3.17%   | 4.62%  | 3.81%  | 9.62%  | 5.99%
3      | 21.44% | 8.59%  | 10.09%  | 6.99%   | 6.73%  | 6.92%  | 13.81% | 7.30%
4      | 24.42% | 8.37%  | 9.39%   | 6.31%   | 7.95%  | 6.51%  | 13.62% | 8.39%
5      | 18.91% | 7.30%  | 7.69%   | 5.58%   | 5.74%  | 5.61%  | 13.71% | 7.82%
Avg    | 19.34% | 7.37%  | 7.51%   | 5.19%   | 5.96%  | 5.48%  | 11.90% | 7.10%

Each Isolet subset (task) was divided into randomly selected 60/20/20 splits of train/validation/test sets.
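The PCA preprocessing described above (projecting onto the leading principal components that retain 95% of the variance) can be sketched with a plain SVD. `pca_95` is our own helper, not the preprocessing code used for the experiments:

```python
import numpy as np

def pca_95(X, var=0.95):
    """Project X onto the smallest number of leading principal components
    whose cumulative variance ratio reaches `var`."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(ratio, var)) + 1   # smallest r reaching `var`
    return Xc @ Vt[:r].T, r

rng = np.random.default_rng(2)
# Low-rank signal plus small noise: most variance lives in few directions.
X = (rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
     + 0.01 * rng.normal(size=(200, 50)))
Z, r = pca_95(X)
assert r <= 4 and Z.shape == (200, r)
```

On Isolet, the same criterion selects 169 of the 617 dimensions.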
We randomly picked 20% of the CoIL training examples and set them aside for validation purposes. The results were averaged over 10 runs in both cases. The validation subset was used for model selection for mt-lmnn, i.e. choosing the regularization constants γ_t and the number of iterations for early stopping. Although our model allows different weights γ_t for each task, throughout this paper we only differentiated between γ_0 and γ = γ_{t>0}. The neighborhood size k was fixed to k = 3, which is the setting recommended in the original lmnn publication [20]. For competing algorithms, we performed a thorough parameter sweep and reported the best test set results (thereby favoring them over our method).

These two datasets capture the essence of an ideal mt-lmnn application area. Our algorithm is very effective when the feature space is dense and when dealing with multi-class tasks with or without the same set of output labels. This is demonstrated in the first subsection of results. The second subsection provides a brief demonstration of the use of mt-lmnn in the domain adaptation (or cold start) scenario.

5.1 Multi-task Learning

We categorized the multi-task learning setting into two different scenarios: label-compatible MTL and label-incompatible MTL. In the label-compatible scenario, all the tasks share the same label set. The label-incompatible scenario arises when applying MTL to a group of multi-class classification tasks that do not share the same set of labels. We demonstrate the applicability and effectiveness of mt-lmnn in both of these scenarios in the following sub-sections.

Label-Compatible Multi-task Learning  The experiments in this setting were conducted on the Isolet data, where isolet1-5 are the 5 tasks and all of them share the same 26 labels. We compared the performance of our mt-lmnn algorithm with different baselines in Table 3.
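The kNN decision rule shared by all of the metric-based baselines (with k = 3, as fixed above) can be sketched as follows; `knn_predict` and the toy data are our own illustration, not the experimental code:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, M, k=3):
    """Classify x by the majority label of its k nearest training points
    under the (possibly learned) Mahalanobis metric M of eq. (1)."""
    diffs = X_train - x
    d2 = np.einsum('nd,de,ne->n', diffs, M, diffs)   # squared distances
    nn = np.argsort(d2)[:k]
    return Counter(y_train[nn]).most_common(1)[0][0]

rng = np.random.default_rng(3)
# Two well-separated toy classes in 2D.
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(3, 0.3, size=(10, 2))])
y = np.array([0] * 10 + [1] * 10)
# With the Euclidean metric (M = I), a query near the second cluster
# receives label 1.
assert knn_predict(np.array([3.0, 3.0]), X, y, np.eye(2), k=3) == 1
```

The baselines below differ only in which M is plugged into this rule and in which training set the neighbors are drawn from.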
The first 3 algorithms are kNN classifiers using different metrics: "Euc" represents the Euclidean metric, "U-lmnn" is the metric obtained from lmnn trained on the union of the training data of all tasks (essentially "pooling" all the data and ignoring the multi-task aspect), and "st-lmnn" is single-task lmnn trained independently of the other tasks. As additional comparison we have also included results from linear single-task and multi-task support vector machines [15], denoted as "st-svm" and "mt-svm", and non-linear single-task and multi-task neural networks (48 hidden layers) [6], denoted as "st-net" and "mt-net" respectively.

A special case arises in terms of the kNN-based classifiers in the label-compatible scenario: during the actual classification step, regardless of what metric is used, the kNN training data set can either consist of only task-specific training data or the pooled data from all tasks. The kNN results obtained from using pooled training sets at the classification phase are shown in Table 4.

Table 4: Error rates when tested with the union of train sets from all the tasks.

Isolet | Euc    | U-lmnn | st-lmnn | mt-lmnn
1      | 9.65%  | 4.71%  | 5.51%   | 4.13%
2      | 14.01% | 5.19%  | 5.29%   | 3.94%
3      | 11.06% | 5.32%  | 7.14%   | 3.85%
4      | 12.28% | 5.03%  | 7.89%   | 4.49%
5      | 10.67% | 4.17%  | 7.11%   | 3.65%
Avg    | 11.53% | 4.88%  | 6.59%   | 4.01%

Both sets of results, in Tables 3 and 4, show that mt-lmnn obtains considerable improvement over its single-task counterparts on all 5 tasks and generally outperforms the other multi-task algorithms based on neural networks and support vector machines.

Label-Incompatible Multi-task Learning  To demonstrate mt-lmnn's ability to learn multiple tasks having different sets of class labels, we ran experiments on the CoIL dataset and on artificially incompatible versions of the Isolet tasks. Note that in this setting mt-svm cannot be used, because there is no intuitive way to extend it to the label-incompatible multi-class MTL setting. Also, U-lmnn cannot be used with the CoIL data tasks since all of them share the same input.

For each original subset of Isolet we picked 10 labels at random and reduced the dataset to only examples with these labels (resulting in 600 data points per set and different sets of output labels). Table 5 shows the results of the kNN algorithm under the various metrics along with single-task and multi-task versions of svm and neural networks on these tasks. Mt-lmnn yields the lowest average error across all tasks.

Table 5: Error rates on Isolet label-incompatible tasks with task-specific train sets.

Task | Euc    | U-lmnn | st-lmnn | mt-lmnn | st-svm | st-net | mt-net
1    | 11.25% | 4.27%  | 4.48%   | 3.44%   | 3.92%  | 3.43%  | 7.08%
2    | 10.52% | 3.02%  | 3.96%   | 2.71%   | 2.50%  | 2.78%  | 6.83%
3    | 14.79% | 6.25%  | 6.04%   | 5.83%   | 6.67%  | 6.39%  | 9.58%
4    | 14.79% | 6.25%  | 6.46%   | 5.52%   | 5.83%  | 5.93%  | 9.83%
5    | 9.38%  | 2.71%  | 2.71%   | 1.77%   | 1.58%  | 1.67%  | 6.17%
Avg  | 12.15% | 4.50%  | 4.73%   | 3.85%   | 4.10%  | 4.04%  | 7.90%

The classification error rates on the CoIL data tasks are shown in Table 6. The multi-task neural network and svm have a hard time with most of the tasks and at times perform worse than their single-task versions. Once again, mt-lmnn improves upon its single-task counterparts, demonstrating the sharing of knowledge between tasks.

Table 6: Error rates on CoIL label-incompatible tasks. See text for details.

Task | # classes | Euc    | st-lmnn | mt-lmnn | st-svm | st-net | mt-net
1    | 40        | 24.65% | 13.67%  | 12.75%  | 47.45% | 47.05% | 55.68%
2    | 6         | 6.78%  | 5.72%   | 5.12%   | 17.25% | 19.35% | 36.30%
3    | 10        | 18.48% | 13.28%  | 11.06%  | 23.12% | 27.80% | 40.98%
4    | 10        | 33.18% | 8.23%   | 6.00%   | 19.95% | 17.40% | 32.98%
5    | 4         | 7.83%  | 6.05%   | 7.54%   | 3.63%  | 3.63%  | 5.95%
6    | 2         | 9.25%  | 9.12%   | 9.10%   | 5.95%  | 6.00%  | 3.63%
Avg  |           | 16.70% | 9.35%   | 8.60%   | 19.56% | 20.20% | 29.25%
Both svm and neural networks perform very well on the tasks with the least number of classes, whereas mt-lmnn does very well on tasks with many classes (in particular the 40-way classification of task 1).

5.2 Domain Adaptation

Domain adaptation attempts to learn a severely undersampled target domain with the help of source domains that have plenty of data but may not have the same sample distribution as the target. For instance, in the context of speech recognition, one might have a lot of annotated speech recordings from a set of lab volunteers but not much from the client who will use the system. In such cases, we would like the learned classifier to gracefully adapt its recognition / classification rule to the target domain as more data becomes available.

Unlike the previous setting, we now have one specific target task which can be heavily under-sampled. We evaluate the domain adaptation capability of mt-lmnn with isolet1-4 as the source and isolet5 as the target domain across varying amounts of available labeled target data. The classification errors of kNN under the mt-lmnn and U-lmnn metrics are shown in Figure 3.

In the absence of any training data from isolet5 (also referred to as the cold-start scenario), we used the global metric M_0 learned by mt-lmnn on tasks isolet1-4. The U-lmnn metric and the mt-lmnn global metric perform much better than the Euclidean metric, with U-lmnn giving slightly better classification. With the availability of more data characteristic of the new task, the performance of mt-lmnn improves much faster than U-lmnn.
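The cold-start fallback described above (use the shared metric M_0 when a task has no data of its own, otherwise use M_0 + M_t as in eq. (4)) can be sketched as follows; the helper name and the toy matrices are ours:

```python
import numpy as np

def metric_for_task(t, M0, task_metrics):
    """Cold-start rule: an unseen task falls back to the shared metric M0;
    a seen task uses the combined metric M0 + Mt of eq. (4)."""
    return M0 + task_metrics.get(t, np.zeros_like(M0))

M0 = 2.0 * np.eye(3)
task_metrics = {0: np.eye(3)}          # only task 0 has task-specific data
assert np.allclose(metric_for_task(0, M0, task_metrics), 3.0 * np.eye(3))
assert np.allclose(metric_for_task(7, M0, task_metrics), M0)   # cold start
```

As target data arrives, M_t can then be learned for the new task while M_0 keeps encoding what was shared across the source tasks.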
Note that the Euclidean error actually increases with more target data, presumably because utterances from the same speaker might be close together in Euclidean space even if they are from different classes, leading to additional misclassifications.

Figure 3: mt-lmnn, U-lmnn and Euclidean test error rates (%) on an unseen task with different sizes of train set.

6 Related Work

Caruana was the first to demonstrate results on multi-task learning for k-nearest neighbor regression and locally weighted averaging [6]. The multi-task aspect of their work focused on finding common feature weights across multiple, related tasks. In contrast, our work focuses on classification and learns different metrics with shared components.

Previous work on multi-task learning largely focused on neural networks [6, 8], where a hidden layer is shared between various tasks. This approach is related to our work as it also learns a joint representation across tasks. It differs in the way classification and the optimization are performed. Mt-lmnn uses the kNN rule and can be expressed as a convex optimization problem with the accompanying convergence guarantees.

Most recent work in multi-task learning focuses on linear classifiers [11, 15] or kernel machines [14]. Our work was influenced by these publications, especially in the way the decoupling of joint and task-specific parameters is performed. However, our method uses a different optimization and learns metrics rather than separating hyperplanes.

7 Conclusion

In this paper we introduced a novel multi-task learning algorithm, mt-lmnn. To our knowledge, it is the first metric learning algorithm that embraces the multi-task learning paradigm and goes beyond feature re-weighting for pooled training data. We demonstrated the abilities of mt-lmnn on real-world datasets.
Mt-lmnn consistently outperformed single-task metrics for kNN in almost all of the learning settings and obtained better classification results than multi-task neural networks and support vector machines. Addressing a major limitation of mt-svm, mt-lmnn is applicable (and effective) on multiple multi-class tasks with different sets of classes.
This MTL framework can also be easily adapted to other metric learning algorithms, including the online version of lmnn [7]. A further research extension is to incorporate known structure by introducing additional sub-global metrics that are shared only by a strict subset of the tasks.
The nearest neighbor classification rule is a natural fit for multi-task learning if accompanied by a suitable metric. By extending one of the state-of-the-art metric learning algorithms to the multi-task learning paradigm, mt-lmnn provides a more integrative methodology for metric learning across multiple learning problems.

Acknowledgments

The authors would like to thank Lawrence Saul for helpful discussions. This research was supported in part by the UCSD FWGrid Project, NSF Research Infrastructure Grant Number EIA-0303622.

References
[1] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83-99, 2003.
[2] S. Ben-David, J. Gehrke, and R. Schuller. A theoretical framework for learning from a pool of disparate data sources. In KDD, pages 443-449, 2002.
[3] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. In COLT, pages 567-580, 2003.
[4] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152.
ACM New York, NY, USA, 1992.
[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] R. Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
[7] G. Chechik, U. Shalit, V. Sharma, and S. Bengio. An online algorithm for large scale image similarity learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 306-314. 2009.
[8] R. Collobert and J. Weston. A unified architecture for NLP: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160-167. ACM New York, NY, USA, 2008.
[9] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13:21-27, 1967.
[10] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265-292, 2002.
[11] H. Daumé III. Frustratingly easy domain adaptation. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 256, 2007.
[12] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[13] V. Digalakis, D. Rtischev, and L. Neumeyer. Fast speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Transactions on Speech and Audio Processing, pages 357-366, 1995.
[14] T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637, 2005.
[15] T. Evgeniou and M. Pontil. Regularized multi-task learning. In KDD, pages 109-117, 2004.
[16] M. A. Fanty and R. Cole. Spoken letter recognition. In Advances in Neural Information Processing Systems 4, page 220.
MIT Press, 1990.
[17] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513-520, Cambridge, MA, 2005. MIT Press.
[18] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[19] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. A projected subgradient method for scalable multi-task learning. Massachusetts Institute of Technology, Technical Report, 2008.
[20] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207-244, 2009.
[21] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, page 219, 1999.