{"title": "Efficient Optimization for Discriminative Latent Class Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1045, "page_last": 1053, "abstract": "Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression; we show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While expectation-maximization (EM) algorithm is commonly used to learn these models, its optimization usually leads to local minimum because it relies on a non-convex cost function with many such local minima. To avoid this problem, we introduce a local approximation of this cost function, which leads to a quadratic non-convex optimization problem over a product of simplices. In order to minimize such functions, we propose an efficient algorithm based on convex relaxation and low-rank representation of our data, which allows to deal with large instances. 
Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods.", "full_text": "Ef\ufb01cient Optimization for Discriminative\n\nLatent Class Models\n\nArmand Joulin\u2217\n\nINRIA\n\n23, avenue d\u2019Italie,\n75214 Paris, France.\n\nFrancis Bach\u2217\n\nINRIA\n\n23, avenue d\u2019Italie,\n75214 Paris, France.\n\nJean Ponce\u2217\n\nEcole Normale Sup\u00b4erieure\n\n45, rue d\u2019Ulm\n\n75005 Paris, France.\n\narmand.joulin@inria.fr\n\nfrancis.bach@inria.fr\n\njean.ponce@ens.fr\n\nAbstract\n\nDimensionality reduction is commonly used in the setting of multi-label super-\nvised classi\ufb01cation to control the learning capacity and to provide a meaningful\nrepresentation of the data. We introduce a simple forward probabilistic model\nwhich is a multinomial extension of reduced rank regression, and show that this\nmodel provides a probabilistic interpretation of discriminative clustering meth-\nods with added bene\ufb01ts in terms of number of hyperparameters and optimization.\nWhile the expectation-maximization (EM) algorithm is commonly used to learn\nthese probabilistic models, it usually leads to local maxima because it relies on\na non-convex cost function. To avoid this problem, we introduce a local approx-\nimation of this cost function, which in turn leads to a quadratic non-convex op-\ntimization problem over a product of simplices. In order to maximize quadratic\nfunctions, we propose an ef\ufb01cient algorithm based on convex relaxations and low-\nrank representations of the data, capable of handling large-scale problems. 
Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods.

1 Introduction

Latent representations of data are widespread tools in supervised and unsupervised learning. They are used to reduce the dimensionality of the data for two main reasons: on the one hand, they provide numerically efficient representations of the data; on the other hand, they may lead to better predictive performance. In supervised learning, latent models are often used in a generative way, e.g., through mixture models on the input variables only, which may not lead to increased predictive performance. This has led to numerous works on supervised dimension reduction (e.g., [1, 2]), where the final discriminative goal of prediction is taken explicitly into account during the learning process.

In this context, various probabilistic models have been proposed, such as mixtures of experts [3] or discriminative restricted Boltzmann machines [4], where a layer of hidden variables is used between the inputs and the outputs of the supervised learning model. Parameters are usually estimated by expectation-maximization (EM), a method that is computationally efficient but whose cost function may have many local maxima in high dimensions.

In this paper, we consider a simple discriminative latent class (DLC) model where inputs and outputs are independent given the latent representation. We make the following contributions:

∗WILLOW project-team, Laboratoire d'Informatique de l'Ecole Normale Supérieure (ENS/INRIA/CNRS UMR 8548).

– We provide in Section 2 a quadratic (non-convex) local approximation of the log-likelihood of our model based on the EM auxiliary function.
This approximation is optimized to obtain robust initializations for the EM procedure.

– We propose in Section 3.3 a novel probabilistic interpretation of discriminative clustering with added benefits, such as fewer hyperparameters than previous approaches [5, 6, 7].

– We design in Section 4 a low-rank optimization method for non-convex quadratic problems over a product of simplices. This method relies on a convex relaxation over completely positive matrices.

– We perform experiments on text documents in Section 5, where we show that our inference technique outperforms existing supervised dimension reduction and clustering methods.

2 Probabilistic discriminative latent class models

We consider a set of N observations xn ∈ Rᵖ, and their labels yn ∈ {1, . . . , M}, n ∈ {1, . . . , N}. We assume that each observation xn has a certain probability to be in one of K latent classes, modeled by introducing hidden variables zn ∈ {1, . . . , K}, and that these classes should be predictive of the label yn. We model directly the conditional probability of zn given the input data xn and the probability of the label yn given zn, while making the assumption that yn and xn are independent given zn (leading to the directed graphical model xn → zn → yn). More precisely, we assume that, given xn, zn follows a multinomial logit model while, given zn, yn is a multinomial variable:

p(zn = k|xn) = exp(wkᵀxn + bk) / Σ_{j=1}^K exp(wjᵀxn + bj)   and   p(yn = m|zn = k) = αkm,   (1)

with wk ∈ Rᵖ, bk ∈ R and Σ_{m=1}^M αkm = 1. We use the notation w = (w1, . . . , wK), b = (b1, . . . , bK) and α = (αkm), 1 ≤ k ≤ K, 1 ≤ m ≤ M. Note that the model defined by (1) can be kernelized by replacing, implicitly or explicitly, x by the image Φ(x) of a non-linear mapping.

Related models. The simple two-layer probabilistic model defined in Eq.
(1), can be interpreted\nand compared to other methods in various ways. First, it is an instance of a mixture of experts [3]\nwhere each expert has a constant prediction. It has thus weaker predictive power than general mix-\ntures of experts; however, it allows ef\ufb01cient optimization as shown in Section 4. It would be inter-\nesting to extend our optimization techniques to the case of experts with non-constant predictions.\nThis is what is done in [8] where a convex relaxation of EM for a similar mixture of experts is con-\nsidered. However, [8] considers the maximization with respect to hidden variables rather than their\nmarginalization, which is essential in our setting to have a well-de\ufb01ned probabilistic model. Note\nalso that in [8], the authors derive a convex relaxation of the softmax regression problems, while we\nderive a quadratic approximation. It is worth trying to combine the two approaches in future work.\n\nAnother related model is a two-layer neural network.\nIndeed, if we marginalize the latent vari-\nable z, we get that the probability of y given x is a linear combination of softmax functions of linear\nfunctions of the input variables x. Thus, the only difference with a two-layer neural network with\nsoftmax functions for the last layer is the fact that our last layer considers linear parameterization in\nthe mean parameters rather than in the natural parameters of the multinomial variable. This change\nallows us to provide a convexi\ufb01cation of two-layer neural networks in Section 4.\n\nAmong probabilistic models, a discriminative restricted Boltzmann machine (RBM) [4, 9] mod-\nels p(y|z) as a softmax function of linear functions of z. Our model assumes instead that p(y|z)\nis linear in z. Again, this distinction between mean parameters and natural parameters allows us to\nderive a quadratic approximation of our cost function. 
It would of course be of interest to extend our optimization technique to the discriminative RBM.

Finally, one may also see our model as a multinomial extension of reduced-rank regression (see, e.g., [10]), which is commonly used with Gaussian distributions and reduces to singular value decomposition in the maximum likelihood framework.

3 Inference

We consider the negative conditional log-likelihood of yn given xn (regularized in w to avoid overfitting), where θ = (α, w, b) and ynm is equal to 1 if yn = m and 0 otherwise:

ℓ(θ) = −(1/N) Σ_{n=1}^N Σ_{m=1}^M ynm log p(ynm = 1|xn) + (λ/2K) ‖w‖²F.   (2)

3.1 Expectation-maximization

A popular tool for solving maximum likelihood problems is the EM algorithm [10]. A traditional way of viewing EM is to add auxiliary variables and minimize the following upper bound of the negative log-likelihood ℓ, obtained by using the concavity of the logarithm:

F(ξ, θ) = −(1/N) Σ_{n=1}^N [ Σ_{k=1}^K ξnk log( (ynᵀαk) exp(wkᵀxn + bk) / ξnk ) − log( Σ_{k=1}^K exp(wkᵀxn + bk) ) ] + (λ/2K) ‖w‖²F,

where αk = (αk1, . . . , αkM)ᵀ ∈ Rᴹ and ξ = (ξ1, . . . , ξK)ᵀ ∈ Rᴺˣᴷ with ξn = (ξn1, . . . , ξnK) ∈ Rᴷ. The EM algorithm can be viewed as a two-step block-coordinate descent procedure [11], where the first step (E-step) consists in finding the optimal auxiliary variables ξ, given the parameters of the model θ. In our case, the result of this step is obtained in closed form as ξnk ∝ ynᵀαk exp(wkᵀxn + bk), with ξnᵀ1K = 1. The second step (M-step) consists of finding the best set of parameters θ, given the auxiliary variables ξ.
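As a concrete illustration, the model (1) and the two closed-form EM updates just described can be sketched in a few lines of NumPy (a minimal sketch under our notation; the function names are ours, not part of the paper):

```python
import numpy as np

def latent_posteriors(X, w, b):
    """p(z_n = k | x_n): multinomial logit (softmax) over the K latent classes, Eq. (1)."""
    s = X @ w + b                          # (N, K) scores w_k^T x_n + b_k
    s -= s.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(s)
    return p / p.sum(axis=1, keepdims=True)

def e_step(X, Y, w, b, alpha):
    """Closed-form E-step: xi_nk proportional to (y_n^T alpha_k) exp(w_k^T x_n + b_k)."""
    xi = latent_posteriors(X, w, b) * (Y @ alpha.T)   # (N, K); entries y_n^T alpha_k
    return xi / xi.sum(axis=1, keepdims=True)         # rows sum to one: xi_n^T 1_K = 1

def m_step_alpha(xi, Y):
    """Closed-form M-step for alpha: alpha_k proportional to sum_n xi_nk y_n."""
    alpha = xi.T @ Y                                  # (K, M)
    return alpha / alpha.sum(axis=1, keepdims=True)   # alpha_k^T 1_M = 1
```

The joint update of (w, b) is the softmax regression problem discussed in the text and is omitted here.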
Optimizing the parameters αk leads to the closed form updates αk ∝ Σ_{n=1}^N ξnk yn with αkᵀ1M = 1, while optimizing jointly over w and b leads to a softmax regression problem, which we solve with Newton's method.

Since F(ξ, θ) is not jointly convex in ξ and θ, this procedure stops when it reaches a local minimum, and its performance strongly depends on its initialization. We propose in the following section a robust initialization for EM given our latent model, based on an approximation of the auxiliary cost function obtained with the M-step.

3.2 Initialization of EM

Minimizing F w.r.t. ξ leads to the original log-likelihood ℓ(θ), depending on θ alone. Minimizing F w.r.t. θ gives a function of ξ alone. In this section, we focus on deriving a quadratic approximation of this function, which will be minimized to obtain an initialization for EM.

We consider second-order Taylor expansions around the value of ξ corresponding to uniformly distributed latent variables zn, independent of the observations xn, i.e., ξ0 = (1/K) 1N 1Kᵀ. This choice is motivated by the lack of a priori information on the latent classes. We briefly explain the calculation of the expansion of the terms depending on (w, b). For the rest of the calculation, see the supplementary material.

Second-order Taylor expansion of the terms depending on (w, b). Assuming uniformly distributed variables zn and independence between zn and xn implies that wkᵀxn + bk = 0.
Therefore, using the second-order expansion of the log-sum-exp function φ(u) = log(Σ_{k=1}^K exp(uk)) around 0 leads to the following approximation of the terms depending on (w, b):

Jwb(ξ) = cst + (K/2N) tr(ξξᵀ) − (1/2K) min_{w,b} [ (1/N) ‖(Kξ − Xw − b)ΠK‖²F + λ‖w‖²F + O(‖Xw + b‖³) ],

where ΠK = I − (1/K) 1K1Kᵀ is the usual centering projection matrix, and X = (x1, . . . , xN)ᵀ. The third-order term O(‖Xw + b‖³F) can be replaced by third-order terms in ‖ξ − ξ0‖, which makes the minimization with respect to w and b correspond to a multi-label classification problem with a square loss [7, 10, 12]. Its solution may be obtained in closed form and leads to:

Jwb(ξ) = cst + (K/2N) tr[ξξᵀ(I − A(X, λ))] + O(‖ξ − ξ0‖³),

where A(X, λ) = ΠN (I − X(NλI + XᵀΠN X)⁻¹Xᵀ) ΠN.

Quadratic approximation. Omitting the terms that are independent of ξ or of order higher than two in ξ, the second-order approximation Japp of the function obtained for the M-step is:

Japp(ξ) = (K/2) tr[ξξᵀ(B(Y) − A(X, λ))],   (3)

where B(Y) = (1/N)(Y(YᵀY)⁻¹Yᵀ − (1/N) 1N1Nᵀ) and Y ∈ Rᴺˣᴹ is the matrix with entries ynm.

Link with ridge regression. The first term, tr(ξξᵀB(Y)), is a concave function in ξ, whose maximum is obtained for ξξᵀ = I (each variable in a different cluster). The second term, A(X, λ), is the matrix obtained in ridge regression [7, 10, 12]. Since A(X, λ) is a positive semi-definite matrix such that A(X, λ)1N = 0, the maximum of the second term is obtained for ξξᵀ = 1N1Nᵀ (all variables in the same cluster).
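To make the approximation concrete, the matrices A(X, λ) and B(Y) and the cost Japp can be computed directly. The sketch below reads the extraction-garbled expressions as the centered ridge-regression matrix, so it should be taken as an assumption-laden illustration rather than a faithful reference implementation:

```python
import numpy as np

def A_ridge(X, lam):
    """A(X, lam) = Pi_N (I - X (N lam I + X^T Pi_N X)^{-1} X^T) Pi_N (our reading)."""
    N, p = X.shape
    Pi = np.eye(N) - np.ones((N, N)) / N                 # centering projection Pi_N
    inv_part = np.linalg.solve(N * lam * np.eye(p) + X.T @ Pi @ X, X.T)
    return Pi @ (np.eye(N) - X @ inv_part) @ Pi

def B_labels(Y):
    """B(Y) = (1/N) (Y (Y^T Y)^{-1} Y^T - (1/N) 1 1^T)."""
    N = Y.shape[0]
    return (Y @ np.linalg.solve(Y.T @ Y, Y.T) - np.ones((N, N)) / N) / N

def J_app(xi, X, Y, lam):
    """Quadratic approximation (3): (K/2) tr[xi xi^T (B(Y) - A(X, lam))]."""
    K = xi.shape[1]
    return 0.5 * K * np.trace(xi @ xi.T @ (B_labels(Y) - A_ridge(X, lam)))
```

As stated in the text, A(X, λ) is positive semi-definite with A(X, λ)1N = 0, which gives a simple sanity check for the sketch.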
Japp(ξ) is thus a combination of a term trying to put every point in the same cluster and a term trying to spread them equally. Note that, in general, Japp is not convex.

Non-linear predictions. Using the matrix inversion lemma, A(X, λ) can be expressed in terms of the Gram matrix K = XXᵀ, which allows us to use any positive definite kernel in our framework [12] and tackle problems that are not linearly separable. Moreover, the square loss gives a natural interpretation of the regularization parameter λ in terms of the implicit number of parameters of the learning procedure [10]. Indeed, the degree of freedom, defined as df = n(1 − tr A), provides an intuitive method for setting the value of λ [7, 10].

Initialization of EM. We optimize Japp(ξ) to get a robust initialization for EM. Since the entries of each vector ξn sum to 1, we optimize Japp over a set of N simplices in K dimensions, S = {v ∈ Rᴷ | v ≥ 0, vᵀ1K = 1}. However, since this function is not convex, minimizing it directly leads to local minima. We propose, in Section 4, a general reformulation of any non-convex quadratic program over a set of N simplices and an efficient algorithm to optimize it.

3.3 Discriminative clustering

The goal of clustering is to find a low-dimensional representation of unlabeled observations by assigning them to K different classes. Xu et al. [5] propose a discriminative clustering framework based on the SVM, and [7] simplifies it by replacing the hinge loss with the square loss, leading to ridge regression. By taking M = N and the labels Y = I, we obtain a formulation similar to [7], where we are looking for a latent representation that can recover the identity matrix. However, unlike [5, 7], our discriminative clustering framework is based on a probabilistic model, which may allow natural extensions.
Moreover, our formulation naturally avoids putting all variables in the same cluster, whereas [5, 7] need to introduce constraints on the size of each cluster. Also, our model leads to a soft assignment of the variables, allowing flexibility in the shape of the clusters, whereas [5, 7] are based on hard assignments. Finally, since our formulation is derived from EM, we obtain a natural rounding by applying the EM algorithm after the optimization, whereas [7] uses a coarse k-means rounding. Comparisons with these algorithms can be found in Section 5.

4 Optimization of quadratic functions over simplices

To initialize the EM algorithm, we must minimize the non-convex quadratic cost function defined by Eq. (3) over a product of N simplices. More precisely, we are interested in the following problems:

min_V f(V) = (1/2) tr(VVᵀB)   s.t. V = (V1, . . . , VN)ᵀ ∈ Rᴺˣᴷ and ∀n, Vn ∈ S,   (4)

where B can be any N×N symmetric matrix. Denoting v = vec(V) ∈ Rᴺᴷ the vector obtained by stacking all the columns of V and defining Q = (Bᵀ ⊗ IK)ᵀ, where ⊗ is the Kronecker product [13], problem (4) is equivalent to:

min_v (1/2) vᵀQv   s.t. v ∈ Rᴺᴷ, v ≥ 0 and (IN ⊗ 1Kᵀ)v = 1N.   (5)

Note that this formulation is general, and that Q could be any NK × NK symmetric matrix. Traditional convex relaxation methods [14] would rewrite the objective function as vᵀQv = tr(QvvᵀT) = tr(QT), where T = vvᵀ is a rank-one matrix which satisfies the set of constraints:

− T ∈ D_{NK} = {T ∈ Rᴺᴷˣᴺᴷ | T ≥ 0, T ≽ 0},   (6)
− ∀ n, m ∈ {1, . . . , N}, 1Kᵀ Tnm 1K = 1,   (7)
− ∀ n, i, j ∈ {1, . . . , N}, Tni 1K = Tnj 1K.   (8)

We denote by F the set of matrices T satisfying (7)-(8). With the unit-rank constraint, optimizing over v is exactly equivalent to optimizing over T.
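The equivalence between the objectives of (4) and (5) can be checked numerically. In the sketch below (an illustration of ours, not part of the paper), v stacks the rows Vn of V, so that the constraint (IN ⊗ 1Kᵀ)v = 1N expresses the N simplex constraints, and for symmetric B the matrix Q = (Bᵀ ⊗ IK)ᵀ is simply B ⊗ IK:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3
B = rng.normal(size=(N, N))
B = (B + B.T) / 2                          # any N x N symmetric matrix
V = rng.dirichlet(np.ones(K), size=N)      # each row V_n lies in the simplex S

v = V.reshape(-1)                          # stack the blocks V_1, ..., V_N
Q = np.kron(B, np.eye(K))                  # Q = (B^T kron I_K)^T = B kron I_K here

# objective of (4) equals objective of (5)
assert np.allclose(0.5 * np.trace(V @ V.T @ B), 0.5 * v @ Q @ v)
# constraints of (5): v >= 0 and (I_N kron 1_K^T) v = 1_N
assert np.all(v >= 0)
assert np.allclose(np.kron(np.eye(N), np.ones((1, K))) @ v, np.ones(N))
```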
The problem is relaxed into a convex problem by removing the rank constraint, leading to a semidefinite programming problem (SDP) [15].

Relaxation. Optimizing T instead of v is computationally inefficient, since the running time complexity of general-purpose SDP toolboxes is in this case O((KN)⁷). On the other hand, for problems without pointwise positivity, [16, 17] have considered low-rank representations of matrices T, of the form T = VVᵀ where V has more than one column. In particular, [17] shows that the non-convex optimization with respect to V leads to the global optimum of the convex problem in T.

In order to apply the same technique here, we need to deal with the pointwise nonnegativity. This can be done by considering the set of completely positive matrices, i.e.,

CP_K = {T ∈ Rᴺᴷˣᴺᴷ | ∃R ∈ N*, ∃V ∈ Rᴺᴷˣᴿ, V ≥ 0, T = VVᵀ}.

This set is strictly included in the set D_{NK} of doubly non-negative matrices (i.e., both pointwise nonnegative and positive semi-definite). For R ≥ 5, it turns out that the intersection of CP_K and F is the convex hull of the matrices vvᵀ such that v is an element of the product of simplices [16]. This implies that the convex optimization problem of minimizing tr(QT) over CP_K ∩ F is equivalent to our original problem (for which no polynomial-time algorithm is known).

However, even if the set CP_K ∩ F is convex, optimizing over it is computationally inefficient [18]. We thus follow [17] and consider the problem through the low-rank pointwise nonnegative matrix V ∈ Rᴺᴷˣᴿ instead of through matrices T = VVᵀ. Note that, following arguments from [16], if R is large enough, there are no local minima. However, because of the positivity constraint, one cannot find in polynomial time a local minimum of a differentiable function.
Nevertheless, any gradient descent algorithm will converge to a stationary point. In Section 5, we compare results obtained with R > 1 to those obtained with R = 1, which corresponds to a gradient descent directly on the simplex.

Problem reformulation. In order to derive a local descent algorithm, we reformulate the constraints (7)-(8) in terms of V (details can be found in the supplementary material). Denoting by Vr the r-th column of V, by Vrⁿ the K-vector such that Vr = (Vr¹, . . . , Vrᴺ)ᵀ, and Vⁿ = (V1ⁿ, . . . , VRⁿ), condition (8) is equivalent to ‖Vrᵐ‖₁ = ‖Vrⁿ‖₁ for all n and m. Substituting this in (7) yields that, for all n, ‖Vⁿ‖²₂₋₁ = 1, where ‖Vⁿ‖²₂₋₁ = Σ_{r=1}^R (1ᵀVrⁿ)² is the squared ℓ2−1 norm. We drop this condition by using a rescaled cost function, which is equivalent. Finally, using the notation D:

D = {W ∈ Rᴺᴷ | W ≥ 0, ∀n, m, ‖Wⁿ‖₁ = ‖Wᵐ‖₁},

we obtain a new equivalent formulation:

min_{V ∈ Rᴺᴷˣᴿ, ∀r, Vr ∈ D} (1/2) tr(V D⁻¹VᵀQ)   with D = Diag((IN ⊗ 1K)ᵀVVᵀ(IN ⊗ 1K)),   (9)

where Diag(A) is the matrix with the diagonal of A and 0 elsewhere. Since the set of constraints for V is convex, we can use a projected gradient method [19] with the projection step we now describe.

Projection on D. Given N K-vectors Zⁿ stacked in an NK-vector Z = [Z¹; . . . ; Zᴺ], we consider the projection of Z on D. For a given positive real number a, the projection of Z on the set of all U ∈ D such that, for all n, ‖Uⁿ‖₁ = a, is equivalent to N independent projections on the ℓ₁ ball with radius a. Thus, projecting Z on D is equivalent to finding the solution of:

min_{a≥0} L(a) = Σ_{n=1}^N max_{λn∈R} min_{Uⁿ≥0} (1/2) ‖Uⁿ − Zⁿ‖₂² + λn(1KᵀUⁿ − a),

where (λn), n ≤ N, are Lagrange multipliers.
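For one block, the inner problem is the Euclidean projection of Zⁿ onto {Uⁿ ≥ 0, 1KᵀUⁿ = a}, i.e., the simplex scaled to radius a, for which a standard sort-based solution is known [20]. A minimal sketch, with our own naming:

```python
import numpy as np

def project_scaled_simplex(z, a):
    """Euclidean projection of z onto {u : u >= 0, sum(u) = a} (sort-based, O(K log K))."""
    mu = np.sort(z)[::-1]                                    # entries in decreasing order
    cssv = np.cumsum(mu) - a
    rho = np.nonzero(mu * np.arange(1, z.size + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)                          # optimal Lagrange multiplier
    return np.maximum(z - theta, 0.0)
```

Projecting Z on D then amounts to N such projections plus the outer one-dimensional search over a.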
The problem of projecting each Zⁿ on the ℓ₁ ball of radius a is well studied [20], with known expressions for the optimal Lagrange multipliers (λn(a)), n ≤ N, and the corresponding projection for a given a. The function L(a) is convex, piecewise-quadratic and differentiable, which yields the first-order optimality condition Σ_{n=1}^N λn(a) = 0 for a. Several algorithms can be used to find the optimal value of a. We use a binary search by looking at the sign of Σ_{n=1}^N λn(a) on the interval [0, λmax], where λmax is found iteratively. This method was found to be empirically faster than gradient descent.

Overall complexity and running time. We use projected gradient descent; the bottleneck of our algorithm is the projection, with a complexity of O(RN²K log(K)). We present experiments on running times in the supplementary material.

Figure 1: Comparison between our algorithm and R independent optimizations, and between two roundings: summing the columns and taking the best column. Average results for K = 2, 3, 5 (best seen in color).

5 Implementation and results

We first compare our algorithm with others to optimize problem (4). We show that the performances are equivalent, but our algorithm can scale up to larger databases. We also consider the problem of supervised and unsupervised discriminative clustering.
In both cases, we show that our algorithm outperforms existing methods.

Implementation. For supervised and unsupervised multi-label classification, we first optimize the second-order approximation Japp, using the reformulation (9). We use a projected gradient descent method with Armijo's rule along the projection arc for backtracking [19]. It is stopped after a maximum number of iterations (500) or if relative updates are too small (10⁻⁸). When the algorithm stops, the matrix V has rank greater than 1 and we use the heuristic v* = Σ_{r=1}^R Vr ∈ S as our final solution (“avg round”). We also compare this rounding with another heuristic obtained by taking v* = argmin_{Vr} f(Vr) (“min round”). v* is then used to initialize the EM algorithm described in Section 2.

Optimization over simplices. We compare our optimization of the non-convex quadratic problem (9) in V to the convex SDP in T = VVᵀ over the set of constraints defined by T ∈ D_{NK}, (7) and (8). To optimize the SDP, we use the generic algorithms CVX [21] and PPXA [22]. CVX uses interior point methods, whereas PPXA uses proximal methods [22]. Both algorithms are computationally inefficient and do not scale well with either the number of points or the number of constraints. Thus, we set N = 10 and K = 2 on discriminative clustering problems (which are described later in this section). We compare the performances of these algorithms after rounding. For the SDP, we take ξ* = T 1_{NK}, and for our algorithm we report performances obtained for both roundings discussed above (“avg round” and “min round”). On these small examples, our algorithm associated with “min round” reaches performance similar to the SDP, whereas, associated with “avg round”, its performance drops.

Study of rounding procedures.
We compare the performances of the two different roundings, “min round” and “avg round”, on discriminative clustering problems. After rounding, we apply the EM algorithm and look at the classification scores. We also compare our algorithm, for a given R, to two baselines where we solve problem (4) independently R times and then apply the same roundings (“ind - min round” and “ind - avg round”). Results are shown in Figure 1. We consider three different problems, N = 100 and K = 2, K = 3 and K = 5. We look at the average performances as the number of noise dimensions increases in discriminative clustering problems. Our method outperforms the baseline whatever rounding we use. Figure 1 shows that on problems with a small number of latent classes (K < 5), we obtain better performances by taking the column associated with the lowest value of the cost function (“min round”) than by summing all the columns (“avg round”).

Figure 2: Classification rate for several binary classification tasks (from top to bottom: 1 vs. 20 and 4 vs. 5) and for different values of K (from left to right: K = 3, 7, 15), as a function of the number N of training samples (best seen in color).
On the other hand, when dealing with a larger number of classes (K ≥ 5), the performance of “min round” drops significantly while “avg round” maintains good results. A potential explanation is that summing the columns of V gives a solution close to (1/K) 1N1Kᵀ in expectation, thus in the region where our quadratic approximation is valid. Moreover, the best column of V is usually a local minimum of the quadratic approximation, which we have found to be close to similar local minima of our original problem, therefore preventing the EM algorithm from converging to another solution. In all other experiments, we choose “avg round”.

Application to classification. We evaluate the optimization performance of our algorithm (DLC) on text classification tasks. For our experiments, we use the 20 Newsgroups dataset (http://people.csail.mit.edu/jrennie/), which contains postings to Usenet newsgroups. The postings are organized by content into 20 categories. We use the five binary classification tasks considered in [23, Chapter 4, page 91]. To set the regularization parameter λ, we use the degree of freedom df (see Section 3.2). Each document has 13312 entries and we take df = 1000. We use 50 random initializations for our algorithm. We compare our method with classifiers such as the linear SVM and the supervised Latent Dirichlet Allocation (sLDA) classifier of Blei et al. [2]. We also compare our results to those obtained by an SVM using the features obtained with dimension-reducing methods such as LDA [1] and PCA. For these models, we select parameters with 5-fold cross-validation. We also compare to the EM without our initialization (“rand-init”), also with 50 random initializations; this is a local descent method close to back-propagation in a two-layer neural network, which in this case strongly suffers from local minima problems.
An interesting result on computational time is that EM without our initialization needs more steps to reach a local minimum. It is therefore slower than with our initialization in this particular set of experiments. We show some results in Figure 2 (others may be found in the supplementary material) for different values of K and with an increasing number N of training samples. In the case of topic models, K represents the number of topics. Our method significantly outperforms all the other classifiers. The comparison with “rand-init” shows the importance of our convex initialization. We also note that our performance increases slowly with K. Indeed, the number of latent classes needed to correctly separate two classes of text is small. Moreover, the algorithm tends to automatically select K. Empirically, we notice that, starting with K = 15 classes, our average final number of active classes is around 3. This explains the relatively small gain in performance as K increases.

Figure 3: Clustering error when increasing the number of noise dimensions. We have taken 50 different problems and 10 random initializations for each of them. K = 2, N = 100 and R = 5 on the left, and K = 5, N = 250 and R = 10 on the right (best seen in color).

Figure 4: Comparison between our method (left) and k-means (right). First, circles with RBF kernels. Second, linearly separable bumps. K = 2, N = 200 and R = 5 in both cases.

Application to discriminative clustering. Figure 3 shows the optimization performance of the EM algorithm with 10 random starting points with (“DLC”) and without (“rand-init”) our initialization method.
We compare their performances to K-means, Gaussian Mixture Model (\u201cGMM\u201d),\nDiffrac [7] and max-margin clustering (\u201cMMC\u201d) [24]. Following [7], we take linearly separable\nbumps in a two-dimensional space and add dimensions containing random independent Gaussian\nnoise (e.g. \u201cnoise dimensions\u201d) to the data. We evaluate the ratio of misclassi\ufb01ed observations over\nthe total number of observations. For the \ufb01rst experiment, we \ufb01x K = 2, N = 100, and R = 5, and\nfor the second K = 5, N = 250, and R = 10. The additional independent noise dimensions are\nnormally distributed. We use linear kernels for all the methods. We set the regularization parame-\nters \u03bb to 10\u22122 for all experiments but we have seen that results do not change much as long as \u03bb\nis not too small (> 10\u22128). Note that we do not show results for the MMC algorithm when K = 5\nsince this algorithm is specially designed for problems with K = 2. It would be interesting to com-\npare to the extension for multi-class problems proposed by Zhang et al. [24]. On both examples, we\nare signi\ufb01cantly better than Diffrac, k-means and MMC. We show in Figure 4 additional examples\nwhich are non linearly separable.\n\n6 Conclusion\n\nWe have presented a probabilistic model for supervised dimension reduction, together with associ-\nated optimization tools to improve upon EM. Application to text classi\ufb01cation has shown that our\nmodel outperforms related ones and we have extended it to unsupervised situations, thus drawing\nnew links between probabilistic models and discriminative clustering. The techniques presented in\nthis paper could be extended in different directions: First, in terms of optimization, while the embed-\nding of the problem to higher dimensions has empirically led to \ufb01nding better local minima, sharp\nstatements might be made to characterize the robustness of our approach. 
In terms of probabilistic models, such techniques should generalize to other latent variable models. Finally, additional structure could be added to the problem to handle more specific settings, such as multiple instance learning [25], multi-label learning, or discriminative clustering for computer vision [26, 27].

Acknowledgments. This paper was partially supported by the Agence Nationale de la Recherche (MGA Project) and the European Research Council (SIERRA Project). We would like to thank Toby Dylan Hocking for his help with the comparison with other methods for the classification task.

References

[1] David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.

[2] David M. Blei and Jon D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), 2007.

[3] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

[4] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In Proceedings of the International Conference on Machine Learning (ICML), 2008.

[5] Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems (NIPS), 2004.

[6] Linli Xu. Unsupervised and semi-supervised multi-class support vector machines. In AAAI, 2005.

[7] F. Bach and Z. Harchaoui. Diffrac: a discriminative and flexible framework for clustering. In Advances in Neural Information Processing Systems (NIPS), 2007.

[8] N. Quadrianto, T. Caetano, J. Lim, and D. Schuurmans. Convex relaxation of mixture regression with efficient algorithms. In Advances in Neural Information Processing Systems (NIPS), 2009.

[9] G. E. Hinton and R. R. Salakhutdinov.
Reducing the dimensionality of data with neural networks. Science, 313(5786):504, 2006.

[10] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

[11] David R. Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, February 2004.

[12] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[13] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, October 1996.

[14] Kurt Anstreicher and Samuel Burer. D.C. versus copositive bounds for standard QP. Journal of Global Optimization, 33(2):299–312, October 2005.

[15] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[16] Samuel Burer. Optimizing a polyhedral-semidefinite relaxation of completely positive programs. Mathematical Programming Computation, 2(1):1–19, March 2010.

[17] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization for semidefinite convex problems. SIAM Journal on Optimization, 20:2327–2351, 2010.

[18] A. Berman and N. Shaked-Monderer. Completely Positive Matrices. World Scientific Publishing Company, 2003.

[19] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.

[20] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Journal of Optimization Theory and Applications, 134:549–554, 1984.

[21] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, August 2010.

[22] Patrick L. Combettes. Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization, 53:475–504, 2004.

[23] Simon Lacoste-Julien. Discriminative Machine Learning with Structure.
PhD thesis, University of California, Berkeley, 2009.

[24] Kai Zhang, Ivor W. Tsang, and James T. Kwok. Maximum margin clustering made practical. In Proceedings of the International Conference on Machine Learning (ICML), 2007.

[25] Thomas G. Dietterich and Richard H. Lathrop. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.

[26] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[27] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2010.