{"title": "Discriminative Robust Transformation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1333, "page_last": 1341, "abstract": "This paper proposes a framework for learning features that are robust to data variation, which is particularly important when only a limited number of training samples are available. The framework makes it possible to trade off the discriminative value of learned features against the generalization error of the learning algorithm. Robustness is achieved by encouraging the transform that maps data to features to be a local isometry. This geometric property is shown to improve (K, \\epsilon)-robustness, thereby providing theoretical justification for reductions in generalization error observed in experiments. The proposed optimization framework is used to train standard learning algorithms such as deep neural networks. Experimental results obtained on benchmark datasets, such as Labeled Faces in the Wild, demonstrate the value of being able to balance discrimination and robustness.", "full_text": "Discriminative Robust Transformation Learning\n\nJiaji Huang\n\nQiang Qiu\n\nGuillermo Sapiro\n\nRobert Calderbank\n\nDepartment of Electrical Engineering, Duke University\n\n{jiaji.huang,qiang.qiu,guillermo.sapiro,robert.calderbank}@duke.edu\n\nDurham, NC 27708\n\nAbstract\n\nThis paper proposes a framework for learning features that are robust to data variation, which is particularly important when only a limited number of training samples are available. The framework makes it possible to trade off the discriminative value of learned features against the generalization error of the learning algorithm. Robustness is achieved by encouraging the transform that maps data to features to be a local isometry. This geometric property is shown to improve (K, \u03b5)-robustness, thereby providing theoretical justification for reductions in generalization error observed in experiments. 
The proposed optimization framework is used to train standard learning algorithms such as deep neural networks. Experimental results obtained on benchmark datasets, such as Labeled Faces in the Wild, demonstrate the value of being able to balance discrimination and robustness.\n\n1 Introduction\n\nLearning features that are able to discriminate is a classical problem in data analysis. The basic idea is to reduce the variance within a class while increasing it between classes. One way to implement this is by regularizing a certain measure of the variance while assuming some prior knowledge about the data. For example, Linear Discriminant Analysis (LDA) [4] measures sample covariance and implicitly assumes that each class is Gaussian distributed. The Low Rank Transform (LRT) [10] instead uses the nuclear norm to measure the variance and assumes that each class lies near a low-rank subspace. A different approach is to regularize the pairwise distances between data points. Examples include the seminal work on metric learning [17] and its extensions [5, 6, 16].\nWhile great attention has been paid to designing objectives that encourage discrimination, less effort has been made in understanding and encouraging robustness to data variation, which is especially important when a limited number of training samples are available. One exception is [19], which promotes robustness by regularizing the traditional metric learning objective using prior knowledge from an auxiliary unlabeled dataset.\nIn this paper we develop a general framework for balancing discrimination and robustness. Robustness is achieved by encouraging the learned data-to-features transform to be locally an isometry within each class. We theoretically justify this approach using (K, \u03b5)-robustness [1, 18] and give an example of the proposed formulation, incorporating it in deep neural networks. Experiments validate the capability to trade off discrimination against robustness. 
Our main contributions are the following: 1) we prove that a local near-isometry leads to robustness; 2) we propose a practical framework that can robustify a wide class of learned transforms, both linear and nonlinear; 3) we provide an explicit realization of the proposed framework, achieving competitive results on difficult face verification tasks.\nThe paper is organized as follows. Section 2 motivates the proposed study and proposes a general formulation for learning a Discriminative Robust Transform (DRT). Section 3 provides a theoretical justification for the framework by making an explicit connection to robustness. Section 4 gives a specific example of DRT, denoted Euc-DRT. Section 5 provides experimental validation of Euc-DRT, and Section 6 presents conclusions.1\n\n2 Problem Formulation\nConsider an L-way classification problem. The training set is denoted by T = {(xi, yi)}, where xi \u2208 Rn is the data and yi \u2208 {1, . . . , L} is the class label. We want to learn a feature transform f\u03b1(\u00b7) such that a datum x becomes more discriminative when it is transformed to the feature f\u03b1(x). The transform f\u03b1 is parametrized by a vector \u03b1, a framework that includes linear transforms and neural networks where the entries of \u03b1 are the learned network parameters.\n\n2.1 Motivation\n\nThe transform f\u03b1 promotes discriminability by reducing intra-class variance and enlarging inter-class variance. This aim is expressed in the design of objective functions [5, 10] or the structure of the transform f\u03b1 [7, 11]. However, the robustness of the learned transform is an important issue that is often overlooked. When training samples are scarce, statistical learning theory [15] predicts overfitting to the training data. The result of overfitting is that discrimination achieved on test data will be significantly worse than that on training data. 
Our aim in this paper is the design of robust transforms f\u03b1 for which the training-to-testing degradation is small [18].\nWe formally measure robustness of the learned transform f\u03b1 in terms of (K, \u03b5)-robustness [1]. Given a distance metric \u03c1, a learning algorithm is said to be (K, \u03b5)-robust if the input data space can be partitioned into K disjoint sets Sk, k = 1, ..., K, such that for all training sets T, the learned parameter \u03b1T determines a loss whose value on a pair of training samples taken from sets Sj and Sk is very close to its value on any pair of data samples taken from Sj and Sk.\n(K, \u03b5)-robustness is illustrated in Fig. 1, where S1 and S2 are both of diameter \u03b3 and |d \u2212 d'| \u2264 2\u03b3. If the transform f\u03b1 preserves all distances within S1 and S2, then |e \u2212 e'| cannot deviate much from |d \u2212 d'|, where\n\n|e \u2212 e'| = |\u03c1(f\u03b1(x1), f\u03b1(x2)) \u2212 \u03c1(f\u03b1(x1'), f\u03b1(x2'))|.\n\nFigure 1: (K, \u03b5)-robustness: here d = \u03c1(x1, x2), d' = \u03c1(x1', x2'), e = \u03c1(f\u03b1(x1), f\u03b1(x2)), and e' = \u03c1(f\u03b1(x1'), f\u03b1(x2')). The difference |e \u2212 e'| cannot deviate too much from |d \u2212 d'|.\n\n2.2 Formulation and Discussion\n\nMotivated by the above reasoning, we now present our proposed framework. First we define a pair label \u2113i,j \u225c 1 if yi = yj, and \u22121 otherwise. Given a metric \u03c1, we use the following hinge loss to encourage high inter-class distance and small intra-class distance:\n\n(1/|P|) \u2211_{(i,j)\u2208P} max{0, \u2113i,j [\u03c1(f\u03b1(xi), f\u03b1(xj)) \u2212 t(\u2113i,j)]},   (1)\n\nHere P = {(i, j) | i \u2260 j} is the set of all data pairs. 
t(\u2113i,j) \u2265 0 is a function of \u2113i,j with t(1) < t(\u22121). As in metric learning [17], this loss function connects pairwise distance to discrimination. However, traditional metric learning typically assumes squared Euclidean distance, whereas here the metric \u03c1 can be arbitrary.\nFor robustness, as discussed above, we may want f\u03b1(\u00b7) to be distance-preserving within each small local region. In particular, we define the set of all local neighborhoods as\n\nNB \u225c {(i, j) | \u2113i,j = 1, \u03c1(xi, xj) \u2264 \u03b3}.\n\n1A note on notation: matrices (vectors) are denoted by bold upper (lower) case letters; scalars are denoted by plain letters.\n\nTo encourage this local distance preservation, we minimize the following objective function\n\n(1/|NB|) \u2211_{(i,j)\u2208NB} |\u03c1(f\u03b1(xi), f\u03b1(xj)) \u2212 \u03c1(xi, xj)|.   (2)\n\nNote that we do not need the same metric in both the input and the feature space; in general the two spaces do not even have the same dimension. With a slight abuse of notation we use the same symbol to denote both metrics.\nTo achieve discrimination and robustness simultaneously, we formulate the objective function as a weighted linear combination of the two extreme cases in (1) and (2):\n\n(\u03bb/|P|) \u2211_{(i,j)\u2208P} max{0, \u2113i,j [\u03c1(f\u03b1(xi), f\u03b1(xj)) \u2212 t(\u2113i,j)]} + ((1 \u2212 \u03bb)/|NB|) \u2211_{(i,j)\u2208NB} |\u03c1(f\u03b1(xi), f\u03b1(xj)) \u2212 \u03c1(xi, xj)|,   (3)\n\nwhere \u03bb \u2208 [0, 1]. The formulation (3) balances discrimination and robustness. When \u03bb = 1 it seeks pure discrimination, and as \u03bb decreases it increasingly encourages robustness. We shall refer to a transform that is learned by solving (3) as a Discriminative Robust Transform (DRT). 
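As a concrete (if naive) illustration, the combined objective in (3) can be written down directly in NumPy. This is a sketch, not the paper's implementation: `f` stands in for the learned transform f\u03b1, `t_pos`/`t_neg` play the roles of t(1) and t(\u22121), `gamma` defines the neighborhoods NB, and Euclidean distance is assumed for \u03c1 on both sides.

```python
import numpy as np

def drt_objective(f, X, y, lam, t_pos, t_neg, gamma):
    """Sketch of the DRT objective, Eq. (3): a hinge loss over all pairs P
    plus a local-isometry penalty over same-class neighbors within gamma."""
    n = len(X)
    F = np.array([f(x) for x in X])               # transformed features f_alpha(x)
    hinge, iso = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                          # P = {(i, j) | i != j}
            d_in = np.linalg.norm(X[i] - X[j])    # rho in the input space
            d_out = np.linalg.norm(F[i] - F[j])   # rho in the feature space
            l_ij = 1.0 if y[i] == y[j] else -1.0  # pair label
            t = t_pos if l_ij > 0 else t_neg
            hinge.append(max(0.0, l_ij * (d_out - t)))
            if l_ij > 0 and d_in <= gamma:        # local neighborhood NB
                iso.append(abs(d_out - d_in))
    iso_term = np.mean(iso) if iso else 0.0
    return lam * np.mean(hinge) + (1 - lam) * iso_term
```

With `lam=1` only the discriminative hinge term remains; with `lam=0` only the local-isometry term is penalized, matching the two extremes discussed above.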
The DRT framework leaves free the choice of both the distance measure and the transform family.\n\n3 Theoretical Analysis\n\nIn this section, we provide a theoretical explanation for robustness. In particular, we show that if the solution to (1) yields a transform f\u03b1 that is locally a near isometry, then f\u03b1 is robust.\n\n3.1 Theoretical Framework\nLet X denote the original data space, let Y = {1, ..., L} denote the set of class labels, and let Z = X \u00d7 Y. The training samples are pairs zi = (xi, yi), i = 1, . . . , n, drawn from some unknown distribution D defined on Z. The pair label is defined as \u2113i,j = 1 if yi = yj and \u22121 otherwise. Let f\u03b1 be a transform that maps a low-level feature x to a more discriminative feature f\u03b1(x), and let F denote the space of transformed features.\nFor simplicity we consider an arbitrary metric \u03c1 defined on both X and F (the general case of different metrics is a straightforward extension), and a loss function g(\u03c1(f\u03b1(xi), f\u03b1(xj)), \u2113i,j) that encourages \u03c1(f\u03b1(xi), f\u03b1(xj)) to be small (big) if \u2113i,j = 1 (\u22121). We require the Lipschitz constants of g(\u00b7, 1) and g(\u00b7, \u22121) to be upper bounded by A > 0. Note that the loss function in Eq. (1) has a Lipschitz constant of 1. We abbreviate\n\ng(\u03c1(f\u03b1(xi), f\u03b1(xj)), \u2113i,j) \u225c h\u03b1(zi, zj).\n\nThe empirical loss on the training set is a function of \u03b1 given by\n\nRemp(\u03b1) \u225c (2/(n(n\u22121))) \u2211_{i,j=1, i\u2260j}^{n} h\u03b1(zi, zj),   (4)\n\nand the expected loss on the test data is given by\n\nR(\u03b1) \u225c E_{z1',z2'\u223cD} [h\u03b1(z1', z2')].   (5)\n\nThe algorithm operates on pairs of training samples and finds parameters\n\n\u03b1T \u225c arg min_\u03b1 Remp(\u03b1),   (6)\n\nthat minimize the empirical loss on the training set T. 
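To make these quantities concrete, here is a minimal sketch of h\u03b1 and Remp, assuming the hinge loss of Eq. (1) plays the role of g, Euclidean distance plays the role of \u03c1, and the thresholds are illustrative defaults.

```python
import itertools
import numpy as np

def pairwise_loss(f, zi, zj, t_pos=1.0, t_neg=2.0):
    """h_alpha(zi, zj): the hinge loss of Eq. (1) on one pair of labeled samples."""
    (xi, yi), (xj, yj) = zi, zj
    l_ij = 1.0 if yi == yj else -1.0
    t = t_pos if l_ij > 0 else t_neg
    d = np.linalg.norm(f(xi) - f(xj))
    return max(0.0, l_ij * (d - t))

def empirical_loss(f, samples):
    """R_emp(alpha) of Eq. (4). Since h_alpha is symmetric, averaging over all
    ordered pairs i != j equals the (2/(n(n-1)))-weighted sum in the text."""
    pairs = list(itertools.permutations(samples, 2))
    return sum(pairwise_loss(f, zi, zj) for zi, zj in pairs) / len(pairs)
```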
The difference R \u2212 Remp between the expected loss on the test data and the empirical loss on the training data is the generalization error of the algorithm.\n\n3.2 (K, \u03b5)-robustness and Covering Number\n\nWe work with the following definition of (K, \u03b5)-robustness [1].\nDefinition 1. A learning algorithm is (K, \u03b5)-robust if Z = X \u00d7 Y can be partitioned into K disjoint sets Zk, k = 1, . . . , K, such that for all training sets T \u2208 Z^n, the learned parameter \u03b1T determines a loss function whose value on pairs of training samples taken from sets Zp and Zq is \u201cvery close\u201d to its value on any pair of data samples taken from Zp and Zq. Formally, assume zi, zj \u2208 T, with zi \u2208 Zp and zj \u2208 Zq; if zi' \u2208 Zp and zj' \u2208 Zq, then\n\n|h\u03b1T(zi, zj) \u2212 h\u03b1T(zi', zj')| \u2264 \u03b5.\n\nRemark 1. (K, \u03b5)-robustness means that the loss incurred by a testing pair (zi', zj') in Zp \u00d7 Zq is very close to the loss incurred by any training pair (zi, zj) in Zp \u00d7 Zq. It is shown in [1] that the generalization error of (K, \u03b5)-robust algorithms is bounded as\n\nR(\u03b1T) \u2212 Remp(\u03b1T) \u2264 \u03b5 + O(\u221a(K/n)).   (7)\n\nTherefore the smaller \u03b5, the smaller the generalization error, and the more robust the learning algorithm.\n\nGiven a metric space, the covering number specifies how many balls of a given radius are needed to cover the space. The more complex the metric space, the more balls are needed to cover it. The covering number is formally defined as follows.\nDefinition 2 (Covering number). 
Given a metric space (S, \u03c1), we say that a subset \u02c6S of S is a \u03b3-cover of S if, for every element s \u2208 S, there exists \u02c6s \u2208 \u02c6S such that \u03c1(s, \u02c6s) \u2264 \u03b3. The \u03b3-covering number of S is\n\nN\u03b3(S, \u03c1) = min{|\u02c6S| : \u02c6S is a \u03b3-cover of S}.\n\nRemark 2. The covering number is a measure of the geometric complexity of (S, \u03c1). A set S with covering number N\u03b3/2(S, \u03c1) can be partitioned into N\u03b3/2(S, \u03c1) disjoint subsets, such that any two points within the same subset are separated by no more than \u03b3.\nLemma 1. The metric space Z = X \u00d7 Y can be partitioned into LN\u03b3/2(X, \u03c1) subsets, denoted Z1, . . . , Z_{LN\u03b3/2(X,\u03c1)}, such that any two points z1 \u225c (x1, y1), z2 \u225c (x2, y2) in the same subset satisfy y1 = y2 and \u03c1(x1, x2) \u2264 \u03b3.\nProof. Assuming the metric space (X, \u03c1) is compact, we can partition X into N\u03b3/2(X, \u03c1) subsets, each with diameter at most \u03b3. Since Y is a finite set of size L, we can partition Z = X \u00d7 Y into LN\u03b3/2(X, \u03c1) subsets with the property that two samples (x1, y1), (x2, y2) in the same subset satisfy y1 = y2 and \u03c1(x1, x2) \u2264 \u03b3.\nIt follows from Lemma 1 that we may partition X into subsets X1, . . . , X_{LN\u03b3/2(X,\u03c1)}, such that pairs of points x1, x2 from the same subset have the same label and satisfy \u03c1(x1, x2) \u2264 \u03b3. Before we connect local geometry to robustness we need one more definition. We say that a learned transform f\u03b1 is a \u03b4-isometry if it distorts the metric by at most \u03b4:\nDefinition 3 (\u03b4-isometry). Let A, B be metric spaces with metrics \u03c1A and \u03c1B. A map f : A \u21a6 B is a \u03b4-isometry if for any a1, a2 \u2208 A, |\u03c1B(f(a1), f(a2)) \u2212 \u03c1A(a1, a2)| \u2264 \u03b4.\nTheorem 1. Let f\u03b1 be a transform derived via Eq. (6) and let X1, . . . 
, X_{LN\u03b3/2(X,\u03c1)} be a cover of X as described above. If f\u03b1 is a \u03b4-isometry, then it is (LN\u03b3/2(X, \u03c1), 2A(\u03b3 + \u03b4))-robust.\nProof sketch. Consider training samples zi, zj and testing samples zi', zj' such that zi, zi' \u2208 Zp and zj, zj' \u2208 Zq for some p, q \u2208 {1, . . . , LN\u03b3/2(X, \u03c1)}. Then by Lemma 1,\n\nyi = yi' and yj = yj',   \u03c1(xi, xi') \u2264 \u03b3 and \u03c1(xj, xj') \u2264 \u03b3,\n\nand xi, xi' \u2208 Xp and xj, xj' \u2208 Xq. By the definition of \u03b4-isometry,\n\n|\u03c1(f\u03b1T(xi), f\u03b1T(xi')) \u2212 \u03c1(xi, xi')| \u2264 \u03b4 and |\u03c1(f\u03b1T(xj), f\u03b1T(xj')) \u2212 \u03c1(xj, xj')| \u2264 \u03b4.\n\nRearranging the terms gives\n\n\u03c1(f\u03b1T(xi), f\u03b1T(xi')) \u2264 \u03c1(xi, xi') + \u03b4 \u2264 \u03b3 + \u03b4 and \u03c1(f\u03b1T(xj), f\u03b1T(xj')) \u2264 \u03c1(xj, xj') + \u03b4 \u2264 \u03b3 + \u03b4.\n\nFigure 2: Proof without words.\n\nIn order to bound the generalization error, we need to bound the difference between \u03c1(f\u03b1T(xi), f\u03b1T(xj)) and \u03c1(f\u03b1T(xi'), f\u03b1T(xj')). The details can be found in [9]; here we appeal to the proof schematic in Fig. 2. We need to bound |e \u2212 e'|, and it cannot exceed twice the diameter of a local region in the transformed domain.\nRobustness of the learning algorithm depends on the granularity of the cover and the degree to which the learned transform f\u03b1 distorts distances between pairs of points in the same covering subset. The subsets in the cover constitute regions where the local geometry makes it possible to bound generalization error. 
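The scale of the resulting guarantee is easy to explore numerically. The sketch below evaluates the robustness constant of Theorem 1 together with the bound of Eq. (7); the constant hidden in the O(\u00b7) term is not specified by the theory, so `c` is an illustrative stand-in, as are all numeric values used.

```python
import math

def robustness_eps(A, gamma, delta):
    """epsilon in Theorem 1: a delta-isometry on a gamma-granular cover gives
    (L * N_{gamma/2}(X, rho), 2A(gamma + delta))-robustness."""
    return 2 * A * (gamma + delta)

def generalization_bound(A, gamma, delta, L, covering_number, n, c=1.0):
    """Eq. (7): R - R_emp <= eps + O(sqrt(K/n)), with K = L * N_{gamma/2}.
    `c` stands in for the unspecified constant in the O(.) term."""
    K = L * covering_number
    return robustness_eps(A, gamma, delta) + c * math.sqrt(K / n)
```

Shrinking \u03b4 (a tighter local isometry) shrinks the first term, while a finer cover (larger K) inflates the second, which is exactly the granularity trade-off described above.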
It now follows from [1] that the generalization error satisfies R(\u03b1T) \u2212 Remp(\u03b1T) \u2264 2A(\u03b3 + \u03b4) + O(\u221a(K/n)). The DRT proposed here encourages exactly such a local isometry, and Theorem 1 explains why its generalization error is smaller than that of pure metric learning.\nThe transform described in [9] partitions the metric space X into exactly L subsets, one for each class. The experiments reported in Section 5 demonstrate that the performance improvements derived from working with a finer partition can be worth the cost of learning finer-grained local regions.\n\n4 An Illustrative Realization of DRT\n\nHaving justified robustness, we now provide a realization of the proposed general DRT in which the metric \u03c1 is the Euclidean distance. We use Gaussian random variables to initialize \u03b1; then, on the randomly transformed data, we set t(1) (t(\u22121)) to be the average intra-class (inter-class) pairwise distance. In all our experiments, the solution satisfied the condition t(1) < t(\u22121) required in Eq. (1). We calculate the diameter \u03b3 of the local regions NB indirectly, using the \u03ba nearest neighbors of each training sample to define a local neighborhood. We leave the question of how best to initialize the thresholds t and the diameter \u03b3 for future research.\nWe denote this particular example Euc-DRT and use gradient descent to solve for \u03b1. Denoting the objective by J, we define yi \u225c f\u03b1(xi), \u03b4i,j \u225c f\u03b1(xi) \u2212 f\u03b1(xj), and \u03c10i,j \u225c \u2016xi \u2212 xj\u2016. 
Then\n\n\u2202J/\u2202yi = (\u03bb/|P|) \u2211_{(i,j)\u2208P : \u2113i,j(\u2016\u03b4i,j\u2016 \u2212 t(\u2113i,j)) > 0} \u2113i,j \u03b4i,j/\u2016\u03b4i,j\u2016 + ((1 \u2212 \u03bb)/|NB|) \u2211_{(i,j)\u2208NB} sgn(\u2016\u03b4i,j\u2016 \u2212 \u03c10i,j) \u03b4i,j/\u2016\u03b4i,j\u2016.   (8)\n\nIn general, f\u03b1 defines a D-layer neural network (when D = 1 it defines a linear transform). Let \u03b1(d) be the linear weights at the d-th layer, and let x(d)i be the output of the d-th layer, so that yi = x(D)i. Then the gradients are computed as\n\n\u2202J/\u2202\u03b1(D) = \u2211_i \u2202J/\u2202yi \u00b7 \u2202yi/\u2202\u03b1(D), and \u2202J/\u2202\u03b1(d) = \u2211_i \u2202J/\u2202x(d+1)i \u00b7 \u2202x(d+1)i/\u2202x(d)i \u00b7 \u2202x(d)i/\u2202\u03b1(d) for 1 \u2264 d \u2264 D\u22121.   (9)\n\nAlgorithm 1 provides a summary, and we note that the extension to stochastic training using mini-batches is straightforward.\n\n5 Experimental Results\n\nIn this section we report on experiments that confirm the robustness of Euc-DRT. Recall that the empirical loss is given by Eq. (4), where \u03b1 is learned as \u03b1T from the training set T, and |T| = N. The generalization error is R \u2212 Remp, where the expected loss R is estimated using a large test set.\n\n5.1 Toy Example\n\nThis illustrative example is motivated by the discussion in Section 2.1. We first generate a 2D dataset consisting of two noisy half-moons, then use a random 100 \u00d7 2 matrix to embed the data in a 100-dimensional space. We learn a linear transform f\u03b1 that maps the 100-dimensional data to 2-dimensional features, and we use \u03ba = 5 nearest neighbors to construct the set NB. 
We consider\n\u03bb = 1, 0.5, 0.25, representing the most discriminative, balanced, and more robust scenarios.\nWhen \u03bb = 1 the transformed training samples are rather discriminative (Fig. 3a), but when the\ntransform is applied to testing data, the two classes are more mixed (Fig. 3d). When \u03bb = 0.5, the\n\n5\n\n(cid:88)\n\ni\n\n(cid:88)\n\n\ftransform), stepsize \u03b7, neighborhood size \u03ba.\n\nAlgorithm 1 Gradient descent solver for Euc-DRT\nInput: \u03bb \u2208 [0, 1], training pairs {(xi, xj, (cid:96)i,j)}, a pre-de\ufb01ned D-layer network (D = 1 as linear\nOutput: \u03b1\n1: Randomly initialize \u03b1, compute yi = f\u03b1(xi).\n2: On the yi, compute the average intra and inter-class pairwise distances, assign to t(1), t(\u22121)\n3: For each training datum, \ufb01nd its \u03ba nearest neighbor and de\ufb01ne the set NB.\n4: while stable objective not achieved do\n5:\n6:\n7:\n8:\n9:\n10:\nend for\n11:\n12: end while\n\nCompute yi = f\u03b1(xi) by a forward pass.\nCompute objective J.\nCompute \u2202J\nas Eq. (8).\n\u2202yi\nfor l = D down to 1 do\nCompute\n\u03b1(d) \u2190 \u03b1(d) \u2212 \u03b7 \u2202J\n\n\u2202J\n\n\u2202\u03b1(d) as Eq. (9).\n\n\u2202\u03b1(d) .\n\n(a) \u03bb = 1 Transformed training\nsamples. (discriminative case)\n\n(b) \u03bb = 0.5 transformed training\nsamples. (balanced case)\n\n(c) \u03bb = 0.25 Transformed train-\ning samples. (robust case)\n\n(d) \u03bb = 1 Transformed testing\nsamples. (discriminative case)\n\n(e) \u03bb = 0.5 transformed testing\nsamples. (balanced case)\n\n(f) \u03bb = 0.25 Transformed testing\nsamples. (robust case)\n\nFigure 3: Original and transformed training/testing samples embedded in 2-dimensional space with\ndifferent colors representing different classes.\n\ntransformed training data are more dispersed within each class (Fig. 3b), hence less easily separated\nthan when \u03bb = 1. However Fig. 
3e shows that it is easier to separate the two classes on the test data. When \u03bb = 0.25, robustness is preferred to discriminative power, as shown in Figs. 3c and 3f.\nTab. 1 quantifies the empirical loss Remp, the generalization error, and the classification performance (by 1-nn) for \u03bb = 1, 0.5 and 0.25. As \u03bb decreases, Remp increases, indicating a loss of discrimination on the training set. However, the generalization error decreases, implying more robustness. We conclude that by varying \u03bb, we can balance discrimination and robustness.\n\nTable 1: Varying \u03bb on a toy dataset (1-nn accuracy on the original data: 93.35%).\n\u03bb | 1 | 0.5 | 0.25\nRemp | 1.5983 | 1.6025 | 1.9439\ngeneralization error | 10.5855 | 9.5071 | 8.8040\n1-nn accuracy | 92.20% | 98.30% | 91.55%\n\n5.2 MNIST Classification Using a Very Small Training Set\n\nThe transform f\u03b1 learned in the previous section was linear; we now apply a more sophisticated convolutional neural network to the MNIST dataset. The network structure is similar to LeNet and is made up of alternating convolutional layers and pooling layers, with parameters detailed in Table 3.\n\nTable 2: Classification accuracy on MNIST.\nTraining/class | 30 | 50 | 70 | 100\noriginal pixels | 81.91% | 86.18% | 86.86% | 88.49%\nLeNet | 87.51% | 89.89% | 91.24% | 92.75%\nDML | 92.32% | 94.45% | 95.67% | 96.19%\nEuc-DRT | 94.14% | 95.20% | 96.05% | 96.21%\n\nTable 3: Implementation details of the neural network for MNIST classification.\nname | parameters\nconv1 | size: 5 \u00d7 5 \u00d7 1 \u00d7 20, stride: 1, pad: 0\npool1 | size: 2 \u00d7 2\nconv2 | size: 5 \u00d7 5 \u00d7 20 \u00d7 50, stride: 1, pad: 0\npool2 | size: 2 \u00d7 2\nconv3 | size: 4 \u00d7 4 \u00d7 50 \u00d7 128, stride: 1, pad: 0\n\nWe map the original 784-dimensional pixel values (28\u00d728 images) to 128-dimensional 
features. While state-of-the-art results often use the full training set (6,000 training samples per class), here we are interested in small training sets. We use only 30 training samples per class, and we use \u03ba = 7 nearest neighbors to define local regions in Euc-DRT. We vary \u03bb and study the empirical error, the generalization error, and the classification accuracy (1-nn). We observe in Fig. 4 that when \u03bb decreases, the empirical error also decreases, but the generalization error actually increases. By balancing these two factors, a peak classification accuracy is achieved at \u03bb = 0.25. Next, we use 30, 50, 70, and 100 training samples per class and compare the performance of Euc-DRT with LeNet and Deep Metric Learning (DML) [7].\n\nFigure 4: MNIST test with only 30 training samples per class. We vary \u03bb and assess (a) Remp; (b) the generalization error; and (c) 1-nn classification accuracy. Peak accuracy is achieved at \u03bb = 0.25.\n\nDML minimizes a hinge loss on the squared Euclidean distances; it shares the same spirit as our Euc-DRT with \u03bb = 1. All methods use the same network structure (Tab. 3) to map to the features. For classification, LeNet uses a linear softmax classifier on top of the \u201cconv3\u201d layer and minimizes the standard cross-entropy loss during training. DML and Euc-DRT both use a 1-nn classifier on the learned features. Classification accuracies are reported in Tab. 2, where we see that all the learned features improve upon the original ones. DML is very discriminative and achieves higher accuracy than LeNet. 
However, when the training set is very small, robustness becomes more important and Euc-DRT significantly outperforms DML.\n\n5.3 Face Verification on LFW\n\nWe now present face verification on the more challenging Labeled Faces in the Wild (LFW) benchmark [8], where our experiments will show that there is an advantage to balancing discriminability and robustness. Our goal is not to reproduce the success of deep learning in face verification [7, 14], but to stress the importance of robust training and to compare the proposed Euc-DRT objective with popular alternatives. Note also that it is difficult to compare with deep learning methods when training sets are proprietary [12\u201314].\nWe adopt the experimental framework used in [2], and train a deep network on the WDRef dataset, where each face is described using a high-dimensional LBP feature [3] (available online2) that is reduced to a 5000-dimensional feature using PCA. The WDRef dataset is significantly smaller than the proprietary datasets typical of deep learning, such as the 4.4 million labeled faces from 4,030 individuals in [14], or the 202,599 labeled faces from 10,177 individuals in [12]. It contains 2,995 subjects with about 20 samples per subject.\nWe compare the Euc-DRT objective with DeepFace (DF) [14] and Deep Metric Learning (DML) [7], two state-of-the-art deep learning objectives. For a fair comparison, we employ the same network structure and train on the same input data. DeepFace feeds the output of the last network layer to an L-way soft-max to generate a probability distribution over L classes, then minimizes a cross-entropy loss. The Euc-DRT feature f\u03b1 is implemented as a two-layer fully connected network with tanh as the squash function. 
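The feature map and verification rule just described can be sketched as follows. The layer widths (5000 \u2192 512 \u2192 256), the random weights, and the decision threshold are all illustrative assumptions; the text specifies only a two-layer fully connected network with tanh squashing and thresholding of the cosine distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer fully connected feature map with tanh squashing.
# Widths 5000 -> 512 -> 256 are assumptions, not values from the paper.
W1 = rng.normal(scale=0.01, size=(512, 5000))
W2 = rng.normal(scale=0.01, size=(256, 512))

def f_alpha(x):
    """Two fully connected layers, each followed by a tanh squash."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

def verify(x_a, x_b, threshold=0.5):
    """Verification: declare 'same person' when the cosine distance between
    the two transformed faces falls below a threshold (value assumed here)."""
    fa, fb = f_alpha(x_a), f_alpha(x_b)
    cos_sim = fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb))
    return (1.0 - cos_sim) < threshold
```

In the paper the weights would be trained by minimizing the Euc-DRT objective (3) with Algorithm 1; here they are left random purely to show the shape of the pipeline.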
Weight decay (conventional Frobenius-norm regularization) is employed in both DF and DML, and results are reported only for the best weight-decay factor. After a network is trained on WDRef, it is tested on the LFW benchmark. Verification simply consists of comparing the cosine distance between a given pair of faces to a threshold.\nFig. 5 displays ROC curves, and Table 4 reports the area under the ROC curve (AUC) and the verification accuracy. High-Dim LBP refers to verification using the initial LBP features. DeepFace (DF) optimizes a classification objective by minimizing a softmax loss, and it successfully separates samples from different classes. However, the constraint that assigns similar representations to the same class is weak, and this is reflected in the true positive rate displayed in Fig. 5. In Deep Metric Learning (DML) this same constraint is strong, but robustness is a concern when the training set is small. The proposed Euc-DRT improves upon both DF and DML by balancing discriminability and robustness. It is less conservative than DF, for better discriminability, and more responsive to local geometry than DML, for smaller generalization error. Face verification accuracy for Euc-DRT was obtained by varying the regularization parameter \u03bb between 0.4 and 1 (as shown in Fig. 6), then reporting the peak accuracy, observed at \u03bb = 0.9.\n\nTable 4: Verification accuracy and AUCs on LFW.\nMethod | Accuracy (%) | AUC (\u00d710\u22122)\nHD-LBP | 74.73 | 82.22 \u00b1 1.00\ndeepFace | 88.72 | 95.50 \u00b1 0.29\nDML | 90.28 | 96.74 \u00b1 0.33\nEuc-DRT | 92.33 | 97.77 \u00b1 0.25\n\nFigure 5: Comparison of ROCs for all methods.\n\nFigure 6: Verification accuracy of Euc-DRT as \u03bb varies.\n\n6 Conclusion\n\nWe have proposed an optimization framework within which it is possible to trade off the discriminative value of learned features against the robustness of the learning algorithm. 
Improvements to generalization error predicted by theory are observed in experiments on benchmark datasets. Future work will investigate how to initialize and tune the optimization, and how the Euc-DRT algorithm compares with other methods that reduce generalization error.\n\n7 Acknowledgement\n\nThe work of Huang and Calderbank was supported by AFOSR under FA 9550-13-1-0076 and by NGA under HM017713-1-0006. The work of Qiu and Sapiro is partially supported by NSF and DoD.\n\n2http://home.ustc.edu.cn/chendong/\n\nReferences\n[1] A. Bellet and A. Habrard. Robustness and generalization for metric learning. Neurocomputing, 151(14):259\u2013267, 2015.\n[2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In European Conference on Computer Vision (ECCV), 2012.\n[3] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.\n[4] K. Fukunaga. Introduction to Statistical Pattern Recognition. San Diego: Academic Press, 1990.\n[5] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems (NIPS), 2005.\n[6] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004.\n[7] J. Hu, J. Lu, and Y. Tan. Discriminative deep metric learning for face verification in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1875\u20131882, 2014.\n[8] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 
Technical Report 07-49, University of Massachusetts, Amherst, October 2007.\n[9] J. Huang, Q. Qiu, R. Calderbank, and G. Sapiro. Geometry-aware deep transform. In International Conference on Computer Vision (ICCV), 2015.\n[10] Q. Qiu and G. Sapiro. Learning transformations for clustering and classification. Journal of Machine Learning Research (JMLR), pages 187\u2013225, 2015.\n[11] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 539\u2013546, 2005.\n[12] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems (NIPS), pages 1988\u20131996, 2014.\n[13] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1891\u20131898, 2014.\n[14] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701\u20131708, 2014.\n[15] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988\u2013999, 1999.\n[16] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207\u2013244, 2009.\n[17] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2002.\n[18] H. Xu and S. Mannor. Robustness and generalization. Machine Learning, 86(3):391\u2013423, 2012.\n[19] Z. Zha, T. Mei, M. Wang, Z. Wang, and X. Hua. 
Robust distance metric learning with auxiliary knowledge. In International Joint Conference on Artificial Intelligence (IJCAI), 2009.\n", "award": [], "sourceid": 825, "authors": [{"given_name": "Jiaji", "family_name": "Huang", "institution": "Duke University"}, {"given_name": "Qiang", "family_name": "Qiu", "institution": "Duke University"}, {"given_name": "Guillermo", "family_name": "Sapiro", "institution": null}, {"given_name": "Robert", "family_name": "Calderbank", "institution": "Duke University"}]}