{"title": "Supervised Exponential Family Principal Component Analysis via Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 569, "page_last": 576, "abstract": "Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and strongly suggest important underlying structures in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA and provide a simple but novel form to project new testing data into the embedded space. This convex approach successfully avoids the local optima of the EM learning. Moreover, by introducing a sample-based multinomial approximation to exponential family models, it avoids the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained through a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data.", "full_text": "Supervised Exponential Family Principal Component\n\nAnalysis via Convex Optimization\n\nYuhong Guo\n\nComputer Sciences Laboratory\nAustralian National University\n\nyuhongguo.cs@gmail.com\n\nAbstract\n\nRecently, supervised dimensionality reduction has been gaining attention, owing\nto the realization that data labels are often available and indicate important under-\nlying structure in the data. In this paper, we present a novel convex supervised\ndimensionality reduction approach based on exponential family PCA, which is\nable to avoid the local optima of typical EM learning. 
Moreover, by introduc-\ning a sample-based approximation to exponential family models, it overcomes the\nlimitation of the prevailing Gaussian assumptions of standard PCA, and produces\na kernelized formulation for nonlinear supervised dimensionality reduction. A\ntraining algorithm is then devised based on a subgradient bundle method, whose\nscalability can be gained using a coordinate descent procedure. The advantage of\nour global optimization approach is demonstrated by empirical results over both\nsynthetic and real data.\n\n1 Introduction\n\nPrincipal component analysis (PCA) has been extensively used for data analysis and processing.\nIt provides a closed-form solution for linear unsupervised dimensionality reduction through singu-\nlar value decomposition (SVD) on the data matrix [8]. Probabilistic interpretations of PCA have\nalso been provided in [9, 16], which formulate PCA using a latent variable model with Gaussian\ndistributions. To generalize PCA to better suit non-Gaussian data, many extensions to PCA have\nbeen proposed that relax the assumption of a Gaussian data distribution. Exponential family PCA\nis the most prominent example, where the underlying dimensionality reduction principle of PCA\nis extended to the general exponential family [4, 7, 13]. Previous work has shown that improved\nquality of dimensionality reduction can be obtained by using exponential family models appropri-\nate for the data at hand [4, 13]. Given data from a non-Gaussian distribution these techniques are\nbetter able than PCA to capture the intrinsic low dimensional structure. However, most existing\nnon-Gaussian dimensionality reduction methods rely on iterative local optimization procedures and\nthus suffer from local optima, with the sole exception of [7] which shows a general convex form can\nbe obtained for dimensionality reduction with exponential family models.\n\nRecently, supervised dimensionality reduction has begun to receive increased attention. 
As the goal\nof dimensionality reduction is to identify the intrinsic structure of a data set in a low dimensional\nspace, there are many reasons why supervised dimensionality reduction is a meaningful topic to\nstudy. First, data labels are almost always assigned based on some important intrinsic property of\nthe data. Such information should be helpful to suppress noise and capture the most useful aspects\nof a compact representation of the data. Moreover, there are many high dimensional data sets with\nlabel information available, e.g., face and digit images, and it is unwise to ignore them. A few su-\npervised dimensionality reduction methods based on exponential family models have been proposed\nin the literature. For example, a supervised probabilistic PCA (SPPCA) model was proposed in\n[19]. SPPCA extends probabilistic PCA by assuming that both features and labels have Gaussian\n\n\fdistributions and are generated independently from the latent low dimensional space through linear\ntransformations. The model is learned by maximizing the marginal likelihood of the observed data\nusing an alternating EM procedure. A more general supervised dimensionality reduction approach\nwith generalized linear models (SDR GLM) was proposed in [12]. SDR GLM views both features\nand labels as exponential family random variables and optimizes a weighted linear combination of\ntheir conditional likelihood given latent low dimensional variables using an alternating EM-style\nprocedure with closed-form update rules. SDR GLM is able to deal with different data types by\nusing different exponential family models. Similar to SDR GLM, the linear supervised dimension-\nality reduction method proposed in [14] also takes advantage of exponential family models to deal\nwith different data types. However, it optimizes the conditional likelihood of labels given observed\nfeatures within a mixture model framework using an EM-style optimization procedure. 
Beyond the PCA framework, many other supervised dimensionality reduction methods have been proposed in the literature. Linear (Fisher) discriminant analysis (LDA) is a popular alternative [5], which maximizes between-class variance and minimizes within-class variance. Moreover, a kernelized Fisher discriminant analysis (KDA) has been studied in [10]. Another notable nonlinear supervised dimensionality reduction approach is the colored maximum variance unfolding (MVU) approach proposed in [15], which maximizes the variance aligned with the side information (e.g., label information), while preserving the local distance structures from the data. However, colored MVU has only been evaluated on training data.\n\nIn this paper, we propose a novel supervised exponential family PCA model (SEPCA). In the SEPCA model, observed data x and its label y are assumed to be generated from the latent variables z via conditional exponential family models; dimensionality reduction is conducted by optimizing the conditional likelihood of the observations (x, y). By exploiting convex duality of the sub-problems and eigenvector properties, a solvable convex formulation of the problem can be derived that preserves solution equivalence to the original. This convex formulation allows efficient global optimization algorithms to be devised. Moreover, by introducing a sample-based approximation to exponential family models, SEPCA does not suffer from the limitations of implicit Gaussian assumptions and can be conveniently kernelized to achieve nonlinearity. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained through a coordinate descent procedure. Finally, we present a simple formulation to project new testing data into the embedded space. This projection can be used for other supervised dimensionality reduction approaches as well. 
Our experimental results over both synthetic and real data suggest that a more global, principled probabilistic approach, SEPCA, is better able to capture subtle structure in the data, particularly when good label information is present.\n\nThe remainder of this paper is organized as follows. First, in Section 2 we present the proposed supervised exponential family PCA model and formulate a convex nondifferentiable optimization problem. Then, an efficient global optimization algorithm is presented in Section 3. In Section 4, we present a simple projection method for new testing points. We then present the experimental results in Section 5. Finally, in Section 6 we conclude the paper.\n\n2 Supervised Exponential Family PCA\n\nWe assume we are given a t \u00d7 n data matrix, X, consisting of t observations of n-dimensional feature vectors, Xi:, and a t \u00d7 k indicator matrix, Y, with each row indicating the class label of the corresponding observation Xi:, so that each row of Y sums to one (\u2211j Yij = 1). For simplicity, we assume the features in X are centered; that is, their empirical means are zeros. We aim to recover a d-dimensional re-representation, a t \u00d7 d matrix Z, of the data (d < n). This is typically viewed as discovering a latent low dimensional manifold in the high dimensional feature space. Since the label information Y is exploited in the discovery process, this is called supervised dimensionality reduction. For recovering Z, a key restriction that one would like to enforce is that the features used for coding, Z:j, should be linearly independent; that is, one would like to enforce the constraint Z\u22a4Z = I, which ensures that the codes are expressed by orthogonal features in the low dimensional representation.\n\nGiven the above setup, in this paper, we are attempting to address the problem of supervised dimensionality reduction using a probabilistic latent variable model. 
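The matrix setup above can be sketched with a small NumPy example (a sketch only; all sizes and the random construction of Z are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, k, d = 8, 5, 3, 2          # illustrative sizes: samples, features, classes, latent dims

# t x n feature matrix X, centered so each column has zero empirical mean
X = rng.normal(size=(t, n))
X = X - X.mean(axis=0)

# t x k indicator matrix Y: a single 1 per row, marking the class label
labels = rng.integers(0, k, size=t)
Y = np.zeros((t, k))
Y[np.arange(t), labels] = 1.0
assert np.allclose(Y.sum(axis=1), 1.0)       # each row of Y sums to one

# a t x d code matrix Z with orthonormal columns (Z^T Z = I), e.g. via QR
Z, _ = np.linalg.qr(rng.normal(size=(t, d)))
assert np.allclose(Z.T @ Z, np.eye(d), atol=1e-8)
```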
Our intuition is that the important intrinsic structure (underlying feature representation) of the data should be able to accurately generate/predict the original data features and labels.\n\nIn this section, we formulate the low-dimensional principal component discovering problem as a conditional likelihood maximization problem based on exponential family model representations, which can be reformulated into an equivalent nondifferentiable convex optimization problem. We then exploit a sample-based approximation to unify exponential family models for different data types.\n\n2.1 Convex Formulation of Supervised Exponential Family PCA\n\nAs with the generalized exponential family PCA [4], we attempt to find a low-dimensional representation by maximizing the conditional likelihood of the observation matrices X and Y given the latent matrix Z, log P(X, Y|Z) = log P(X|Z) + log P(Y|Z). Using the general exponential family representation, a regularized version of this maximization problem can be formulated as\n\nmax_{Z: Z\u22a4Z=I} max_{W,\u2126,b} log P(X|Z, W) \u2212 (\u03b2/2) tr(WW\u22a4) + log P(Y|Z, \u2126, b) \u2212 (\u03b2/2)(tr(\u2126\u2126\u22a4) + b\u22a4b)\n= max_{Z: Z\u22a4Z=I} max_{W,\u2126,b} tr(ZWX\u22a4) \u2212 \u2211i (A(Zi:, W) \u2212 log P0(Xi:)) \u2212 (\u03b2/2) tr(WW\u22a4) + tr(Z\u2126Y\u22a4) + 1\u22a4Yb \u2212 \u2211i A(Zi:, \u2126, b) \u2212 (\u03b2/2)(tr(\u2126\u2126\u22a4) + b\u22a4b)    (1)\n\nwhere W is a d \u00d7 n parameter matrix for the conditional model P(X|Z); \u2126 is a d \u00d7 k parameter matrix for the conditional model P(Y|Z) and b is a k \u00d7 1 bias vector; 1 denotes the vector of all 1s; A(Zi:, W) and A(Zi:, \u2126, b) are the log normalization functions that ensure valid probability distributions:\n\nA(Zi:, W) = log \u222b exp(Zi:Wx) P0(x) dx    (2)\n\nA(Zi:, \u2126, b) = log \u2211\u2113=1..k exp(Zi:\u21261\u2113 + 1\u2113\u22a4b)    (3)\n\nwhere 1\u2113 denotes a zero vector with a single 1 in the \u2113th entry. Note that the class variable y is discrete, thus maximizing log P(Y|Z, \u2126, b) is a discriminative classification training. In fact, the second part of the objective function in (1) is simply a multi-class logistic regression. That is why we have incorporated an additional bias term b into the model.\n\nTheorem 1 The optimization problem (1) is equivalent to\n\nmin_{Ux,Uy} max_{M: I \u2ab0 M \u2ab0 0, tr(M)=d} \u2211i (A\u2217(Ux i:) + log P0(Xi:)) + (1/2\u03b2) tr((X \u2212 Ux)(X \u2212 Ux)\u22a4M) + \u2211i A\u2217(Uy i:) + (1/2\u03b2) tr((Y \u2212 Uy)(Y \u2212 Uy)\u22a4(M + E))    (4)\n\nwhere E is a t \u00d7 t matrix with all 1s; Ux is a t \u00d7 n matrix; Uy is a t \u00d7 k matrix; A\u2217(Ux i:) and A\u2217(Uy i:) are the Fenchel conjugates of A(Zi:, W) and A(Zi:, \u2126, b) respectively; M = ZZ\u22a4 and Z can be recovered by taking the top d eigenvectors of M; and the model parameters W, \u2126, b can be recovered by\n\nW = (1/\u03b2) Z\u22a4(X \u2212 Ux),  \u2126 = (1/\u03b2) Z\u22a4(Y \u2212 Uy),  b = (1/\u03b2)(Y \u2212 Uy)\u22a41\n\nProof: The proof is simple and based on standard results. Due to the space limitation, we only provide a summarization of the key steps here. There are three steps. The first step is to derive the Fenchel conjugate dual for each log partition function, A(Z, .), following [18, Section 3.3.3]; this can be used to yield\n\nmax_{Z: Z\u22a4Z=I} min_{Ux,Uy} \u2211i (A\u2217(Ux i:) + log P0(Xi:)) + (1/2\u03b2) tr((X \u2212 Ux)(X \u2212 Ux)\u22a4ZZ\u22a4) + \u2211i A\u2217(Uy i:) + (1/2\u03b2) tr((Y \u2212 Uy)(Y \u2212 Uy)\u22a4(ZZ\u22a4 + E))    (5)\n\nwhich is equivalent to the original problem (1). 
The second step is based on exploiting the strong min-max property [2] and the relationship between the constraint sets\n\n{M : M = ZZ\u22a4 for some Z such that Z\u22a4Z = I} \u2286 {M : I \u2ab0 M \u2ab0 0, tr(M) = d},\n\nwhich allows one to further show that the optimization (4) is an upper bound relaxation of (5). The final equivalence proof is based on the result of [11], which suggests that the substitution of ZZ\u22a4 with the matrix M does not produce a relaxation gap.\n\nNote that (4) is a min-max optimization problem. Moreover, for each fixed M, the outer minimization problem is obviously convex, since the Fenchel conjugates, A\u2217(Ux i:) and A\u2217(Uy i:), are convex functions of Ux and Uy respectively [2]; that is, the objective function for the outer minimization is a pointwise supremum over an infinite set of convex functions. Thus the overall min-max optimization is convex [3], but apparently not necessarily differentiable. We will address the nondifferentiable training issue in Section 3.\n\n2.2 Sample-based Approximation\n\nIn the previous section, we formulated our supervised exponential family PCA as a convex optimization problem (4). However, before attempting to devise a training algorithm to solve it, we have to provide some concrete forms for the Fenchel conjugate functions A\u2217(Ux i:) and A\u2217(Uy i:). For different exponential family models, the Fenchel conjugate functions A\u2217 are different; see [18, Table 2]. For example, since the y variable in our model is a discrete class variable, it takes a multinomial distribution. Thus the Fenchel conjugate function A\u2217(Uy i:) is given by\n\nA\u2217(Uy i:) = A\u2217(\u0398y i:) = tr(\u0398y i: log \u0398y i:\u22a4), where \u0398y \u2265 0, \u0398y1 = 1    (6)\n\nThe specific exponential family model is determined by the data type and distribution.
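As a quick numerical illustration of the multinomial conjugate above (a sketch, not from the paper): the multinomial log-partition function is a log-sum-exp, and its Fenchel conjugate evaluated at a probability vector is the negative entropy \u2211 \u03b8 log \u03b8, with the supremum attained at u = log \u03b8 (up to an additive shift):

```python
import numpy as np

def A(u):
    # log-partition (log-sum-exp) of a multinomial/logistic model
    m = u.max()
    return m + np.log(np.exp(u - m).sum())

theta = np.array([0.5, 0.3, 0.2])      # a point in the probability simplex

# Fenchel conjugate A*(theta) = sup_u { theta.u - A(u) }; for theta in the
# simplex the supremum is attained at u = log(theta) (up to an additive shift)
u_star = np.log(theta)
conjugate_value = theta @ u_star - A(u_star)

neg_entropy = (theta * np.log(theta)).sum()
assert np.isclose(conjugate_value, neg_entropy)
```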
PCA and SPPCA use Gaussian models, thus their performance might be degraded when the data distribution is non-Gaussian. However, it is tedious and sometimes hard to choose the most appropriate exponential family model for each specific application problem. Moreover, the log normalization function A and its Fenchel conjugate A\u2217 might not be easily computable. For these reasons, we propose to use a sample-based approximation to the integral (2), thereby achieving an empirical approximation to the true underlying exponential family model, as follows. If one replaces the integral definition (2) with an empirical definition, A(Zi:, W) = log \u2211j exp(Zi:WXj:\u22a4)/t, then the conjugate function can be given by\n\nA\u2217(Ux i:) = A\u2217(\u0398x i:) = tr(\u0398x i: log \u0398x i:\u22a4) \u2212 log(1/t), where \u0398x \u2265 0, \u0398x1 = 1    (7)\n\nWith this sample-based approximation, problem (4) can be expressed as\n\nmin_{\u0398x,\u0398y} max_{M: I \u2ab0 M \u2ab0 0, tr(M)=d} tr(\u0398x log \u0398x) + (1/2\u03b2) tr((I \u2212 \u0398x)K(I \u2212 \u0398x)\u22a4M) + tr(\u0398y log \u0398y) + (1/2\u03b2) tr((Y \u2212 \u0398y)(Y \u2212 \u0398y)\u22a4(M + E))    (8)\n\nsubject to \u0398x \u2265 0, \u0398x1 = 1; \u0398y \u2265 0, \u0398y1 = 1    (9)\n\nOne benefit of working with this sample-based approximation is that it is automatically kernelized, K = XX\u22a4, which enables non-linearity to be conveniently introduced.\n\n3 Efficient Global Optimization\n\nThe optimization (8) we derived in the previous section is a convex-concave min-max optimization problem. The inner maximization of (8) is a well known problem with a closed-form solution [11]: M\u2217 = Z\u2217Z\u2217\u22a4 with Z\u2217 = Qd_max((I \u2212 \u0398x)K(I \u2212 \u0398x)\u22a4 + (Y \u2212 \u0398y)(Y \u2212 \u0398y)\u22a4), where Qd_max(D) denotes the matrix formed by the top d eigenvectors of D. However, the overall outer minimization problem is nondifferentiable with respect to \u0398x and \u0398y. Thus standard first-order or second-order optimization techniques that rely on standard gradients cannot be applied here. In this section, we deploy a bundle method to solve this nondifferentiable min-max optimization.\n\n3.1 Bundle Method for Min-Max Optimization\n\nThe bundle method is an efficient subgradient method for nondifferentiable convex optimization; it relies on the computation of subgradient terms of the objective function. A vector g is a subgradient of a function f at a point x if f(y) \u2265 f(x) + g\u22a4(y \u2212 x), \u2200y. To adapt standard bundle methods to our specific min-max problem, we need to first address the critical issue of subgradient computation.\n\nProposition 1 Consider a joint function h(x, y) defined over x \u2208 X and y \u2208 Y, satisfying: (1) h(\u00b7, y) is convex for all y \u2208 Y; (2) h(x, \u00b7) is concave for all x \u2208 X. Let f(x) = maxy h(x, y), and q(x0) = arg maxy h(x0, y). 
Assume that g is a gradient of h(\u00b7, q(x0)) at x = x0; then g is a subgradient of f(x) at x = x0.\n\nProof:\n\nf(x) = maxy h(x, y) \u2265 h(x, q(x0))    (by the definition of f(x))\n\u2265 h(x0, q(x0)) + g\u22a4(x \u2212 x0)    (since h(\u00b7, y) is convex for all y \u2208 Y)\n= f(x0) + g\u22a4(x \u2212 x0)    (by the definition of q(x0))\n\nThus g is a subgradient of f(x) at x = x0 according to the definition of subgradient.\n\nAccording to Proposition 1, the subgradients of our outer minimization objective function f in (8) over \u0398x and \u0398y can be given by\n\n\u2202\u0398x f \u220b log \u0398x + 1 \u2212 (1/\u03b2) M\u2217(I \u2212 \u0398x)K,  \u2202\u0398y f \u220b log \u0398y + 1 \u2212 (1/\u03b2) M\u2217(Y \u2212 \u0398y)    (10)\n\nwhere M\u2217 is the optimal inner maximization solution at the current point [\u0398x, \u0398y].\n\nAlgorithm 1 illustrates the bundle method we developed to solve the infinite min-max optimization (8), where the linear constraints (9) over \u0398x and \u0398y can be conveniently incorporated into the quadratic bound optimization. One important issue in this algorithm is how to manage the size of the set of linear lower bound constraints formed from the active set B (defined in Algorithm 1), as it incrementally increases with new points being explored. To solve this problem, we noticed that the Lagrangian dual parameters \u03b1 for the lower bound constraints, obtained by the quadratic optimization in step 1, form a sparse vector, indicating that many lower bound constraints can be turned off. Moreover, any constraint that is turned off will mostly stay off in the later steps. 
Therefore, for the bundle method we developed, whenever the size of B is larger than a given constant b, we keep the active points of B that correspond to the b largest \u03b1 values, and drop the remaining ones.\n\n3.2 Coordinate Descent Procedure\n\nAn important factor affecting the running efficiency is the size of the problem. The convex optimization (8) works in the dual parameter space, where the size of the parameters \u0398 = {\u0398x, \u0398y}, t \u00d7 (t + k), depends only on the number of training samples, t, not on the feature size, n. For high dimensional small data sets (n \u226b t), our dual optimization is certainly a good option. However, as t increases, the problem size grows on the order of O(t\u00b2). It might soon become too large to handle for the quadratic optimization step of the bundle method.\n\nOn the other hand, the optimization problem (8) possesses a nice semi-decomposable structure: each equality constraint in (9) involves only one row of \u0398; that is, \u0398 can be separated into rows without affecting the equality constraints. Based on this observation, we develop a coordinate descent procedure to obtain scalability of the bundle method over large data sets. Specifically, we put an outer loop above the bundle method. Within each iteration of this outer loop, we randomly separate the \u0398 parameters into m groups, with each group containing a subset of the rows of \u0398; we then use the bundle method to sequentially optimize each subproblem defined on one group of \u0398 parameters while keeping the remaining rows of \u0398 fixed. 
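The core computation inside each such subproblem step, the closed-form inner maximizer M* from the top-d eigenvectors and the subgradients in (10), can be sketched as follows (an illustrative NumPy sketch only: sizes and the uniform starting point are invented, and the bundle machinery, simplex constraints, and quadratic subproblem are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
t, k, d, beta = 6, 3, 2, 1.0

K = rng.normal(size=(t, t)); K = K @ K.T           # kernel matrix K = X X^T (PSD)
Y = np.eye(k)[rng.integers(0, k, size=t)]          # t x k label indicator matrix

# feasible starting points: rows of Theta_x, Theta_y on the probability simplex
Tx = np.full((t, t), 1.0 / t)
Ty = np.full((t, k), 1.0 / k)

def inner_M(Tx, Ty):
    # closed-form inner maximizer: M* = Z* Z*^T with Z* the top-d eigenvectors
    C = (np.eye(t) - Tx) @ K @ (np.eye(t) - Tx).T + (Y - Ty) @ (Y - Ty).T
    vals, vecs = np.linalg.eigh(C)                 # eigenvalues in ascending order
    Z = vecs[:, -d:]                               # top-d eigenvectors
    return Z @ Z.T

def subgradients(Tx, Ty):
    # the subgradients of (10) at the current point [Theta_x, Theta_y]
    M = inner_M(Tx, Ty)
    gx = np.log(Tx) + 1.0 - (1.0 / beta) * M @ (np.eye(t) - Tx) @ K
    gy = np.log(Ty) + 1.0 - (1.0 / beta) * M @ (Y - Ty)
    return gx, gy

gx, gy = subgradients(Tx, Ty)
assert gx.shape == (t, t) and gy.shape == (t, k)
```

Note that M* built this way automatically satisfies the inner constraints: its eigenvalues are 0 or 1 (so I \u2ab0 M \u2ab0 0) and its trace is exactly d.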
Although coordinate descent with a nondifferentiable convex objective is not guaranteed to converge to a minimum in general [17], we have found that this procedure performs quite well in practice, as shown in the experimental results.\n\nAlgorithm 1 Bundle Method for Min-Max Optimization in (8)\n\nInput: \u00af\u03b4 > 0, m \u2208 (0, 1), b \u2208 IN, \u00b5 \u2208 IR.\nInitial: Find an initial point \u03b8\u2217 satisfying the linear constraints in (9); compute f(\u03b8\u2217). Let \u2113 = 1, \u03b8\u2113 = \u03b8\u2217; compute g\u2113 \u2208 \u2202\u03b8\u2113 f by (10); e\u2113 = f(\u03b8\u2217) \u2212 f(\u03b8\u2113) \u2212 g\u2113\u22a4(\u03b8\u2217 \u2212 \u03b8\u2113). Let B = {(e\u2113, g\u2113)}, \u02c6\u03b5 = Inf, \u02c6g = 0; \u2113 = \u2113 + 1.\nrepeat\n1. Solve the quadratic minimization for the solution \u02c6\u03b8 and the Lagrangian dual parameters \u03b1 w.r.t. the lower bound linear constraints in B [1]: \u02c6\u03b8 = arg min_\u03b8 \u03c8\u2113(\u03b8) + (\u00b5/2)||\u03b8 \u2212 \u03b8\u2217||\u00b2, subject to the linear constraints in (9), where \u03c8\u2113(\u03b8) = f(\u03b8\u2217) + max{\u2212\u02c6\u03b5 + \u02c6g\u22a4(\u03b8 \u2212 \u03b8\u2217), max_{(ei,gi)\u2208B} {\u2212ei + gi\u22a4(\u03b8 \u2212 \u03b8\u2217)}}.\n2. Define \u03b4\u2113 = f(\u03b8\u2217) \u2212 [\u03c8\u2113(\u02c6\u03b8) + (\u00b5/2)||\u02c6\u03b8 \u2212 \u03b8\u2217||\u00b2] \u2265 0. If \u03b4\u2113 < \u00af\u03b4, return.\n3. Conduct a line search to minimize f(\u03b8\u2113) with \u03b8\u2113 = \u03b3\u03b8\u2217 + (1 \u2212 \u03b3)\u02c6\u03b8, for 0 < \u03b3 < 1.\n4. Compute g\u2113 \u2208 \u2202\u03b8\u2113 f by (10); e\u2113 = f(\u03b8\u2217) \u2212 f(\u03b8\u2113) \u2212 g\u2113\u22a4(\u03b8\u2217 \u2212 \u03b8\u2113); update B = B \u222a {(e\u2113, g\u2113)}.\n5. If f(\u03b8\u2217) \u2212 f(\u03b8\u2113) \u2265 m\u03b4\u2113, then take a serious step: (1) update: ei = ei + f(\u03b8\u2113) \u2212 f(\u03b8\u2217) + gi\u22a4(\u03b8\u2217 \u2212 \u03b8\u2113); (2) update the aggregation: \u02c6g = \u2211i \u03b1i gi, \u02c6\u03b5 = \u2211i \u03b1i ei; (3) update the stored solution: \u03b8\u2217 = \u03b8\u2113, f(\u03b8\u2217) = f(\u03b8\u2113).\n6. If |B| > b, reduce the set B according to \u03b1.\n7. \u2113 = \u2113 + 1.\nuntil the maximum iteration number is reached\n\n4 Projection for Testing Data\n\nOne important issue for supervised dimensionality reduction is to map new testing data into the dimensionality-reduced principal dimensions. We deploy a simple procedure for this purpose. After training, we obtain a low-dimensional representation Z for X, where Z can be viewed as a linear projection of X in some transformed space \u03c8(X) through a parameter matrix U; such that Z = \u03c8(X)U = \u03c8(X)\u03c8(X)\u22a4K+\u03c8(X)U, where K+ denotes the pseudo-inverse of K = \u03c8(X)\u03c8(X)\u22a4. Then a new testing sample x\u2217 can be projected by\n\nz\u2217 = \u03c8(x\u2217)\u03c8(X)\u22a4K+\u03c8(X)U = k(x\u2217, X)K+Z    (11)\n\n5 Experimental Results\n\nIn order to evaluate the performance of the proposed supervised exponential family PCA (SEPCA) approach, we conducted experiments over both synthetic and real data, and compared to supervised dimensionality reduction with generalized linear models (SDR GLM), supervised probabilistic PCA (SPPCA), linear discriminant analysis (LDA), and colored maximum variance unfolding (MVU). The projection procedure (11) is used for colored MVU as well. In all the experiments, we used \u00b5 = 1 for Algorithm 1, and used \u03b1 = 0.0001 for SDR GLM as suggested in [12].\n\n5.1 Experiments on Synthetic Data\n\nTwo synthetic experiments were conducted to compare the five approaches under controlled conditions. 
The \ufb01rst synthetic data set is formed by \ufb01rst generating four Gaussian clusters in a two-\ndimensional space, with each corresponding to one class, and then adding the third dimension to\neach point by uniformly sampling from a \ufb01xed interval. This experiment attempts to compare the\nperformance of the \ufb01ve approaches in the situation where the data distribution does not satisfy the\nGaussian assumption. Figure 1 shows the projection results for each approach in a two dimensional\nspace for 120 testing points after being trained on a set with 80 points. In this case, SEPCA and\nLDA outperform all the other three approaches.\n\nThe second synthetic experiment is designed to test the capability of performing nonlinear dimen-\nsionality reduction. The synthetic data is formed by \ufb01rst generating two circles in a two dimensional\nspace (one circle is located inside the other one), with each circle corresponding to one class, and\nthen the third dimension sampled uniformly from a \ufb01xed interval. As SDR GLM does not provide\na nonlinear form, we conducted the experiment with only the remaining four approaches. For LDA,\nwe used its kernel variant, KDA. A Gaussian kernel with \u03c3 = 1 was used for SEPCA, SPPCA and\nKDA. 
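For concreteness, the two-circles data set described above could be generated along these lines (an illustrative sketch; the radii, point counts, and interval are invented here, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def circle_class(n, radius, z_range=(0.0, 1.0)):
    # points on a circle of the given radius in 2-D, plus a third
    # dimension sampled uniformly from a fixed interval
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    xy = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    z = rng.uniform(*z_range, size=(n, 1))
    return np.hstack([xy, z])

# inner circle = class 0, outer circle = class 1
X = np.vstack([circle_class(100, 1.0), circle_class(100, 2.0)])
y = np.repeat([0, 1], 100)
assert X.shape == (200, 3) and y.shape == (200,)
```

The two classes are not linearly separable in the first two dimensions, which is what makes this a test of nonlinear (kernelized) dimensionality reduction.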
Figure 1: Projection results on test data for synthetic experiment 1 (panels: SEPCA, SDR-GLM, SPPCA, LDA, Colored-MVU). Each color indicates one class.\n\nFigure 2: Projection results on test data for synthetic experiment 2 (panels: SEPCA, SPPCA, KDA, Colored-MVU). Each color indicates one class.\n\nFigure 2 shows the projection results for each approach in a two dimensional space for 120 testing points after being trained on a set with 95 points. 
Again, SEPCA and KDA achieve good class separations and outperform the other two approaches.\n\n5.2 Experiments on Real Data\n\nTo better characterize the performance of dimensionality reduction in a supervised manner, we conducted experiments on a few high dimensional multi-class real world data sets. The left side of Table 1 provides information about these data sets. Our experiments were conducted in the following way. We randomly selected 3 to 5 examples from each class to form the training set and used the remaining examples as the test set. For each approach, we first learned the dimensionality reduction model on the training set. Moreover, we also trained a logistic regression classifier using the projected training set in the reduced low dimensional space. (Note that for SEPCA, a classifier was trained simultaneously during the process of dimensionality reduction optimization.) Then the test data were projected into the low dimensional space according to each dimensionality reduction model. Finally, the projected test set for each approach was classified using the corresponding logistic regression classifier. The right side of Table 1 shows the classification accuracies on the test set for each approach. To better understand the quality of the classification using projected data, we also included the standard classification results, indicated as \u2019FULL\u2019, using the original high dimensional data. (Note that we are not able to obtain any result for SDR GLM on the newsgroup data, as it is inefficient for very high dimensional data.) The results reported here are averages over 20 repeated runs, and the projection dimension is d = 10. Still, the proposed SEPCA presents the best performance among the compared approaches. 
However, different from the synthetic experiments, LDA does not work well on these real data sets.\n\nThe results on both synthetic and real data show that SEPCA outperforms the other four approaches. This might be attributed to its adaptive exponential family model approximation and its global optimization, while SDR GLM and SPPCA apparently suffer from local optima.\n\n6 Conclusions\n\nIn this paper, we propose a supervised exponential family PCA (SEPCA) approach, which can be solved efficiently to find global solutions. Moreover, SEPCA overcomes the limitation of the Gaussian assumption of PCA and SPPCA by using a data-adaptive approximation for exponential family models. A simple, straightforward projection method for new testing data has also been constructed. Empirical study suggests that SEPCA outperforms other supervised dimensionality reduction approaches, such as SDR GLM, SPPCA, LDA and colored MVU.\n\nTable 1: Data set statistics and test accuracy results (%)\n\nDataset     #Data  #Dim   #Class | FULL  SEPCA  SDR GLM  SPPCA  LDA   colored MVU\nYale        165    4096   15     | 65.3  64.4   58.8     51.6   31.0  21.1\nYaleB       2414   1024   38     | 47.0  20.5   19.0     9.8    6.2   2.8\n11 Tumor    174    12533  11     | 77.6  88.9   63.5     63.0   23.7  40.2\nUsps3456    120    256    4      | 82.1  79.7   77.9     78.5   74.3  75.8\nNewsgroup   19928  25284  20     | 32.1  16.9   \u2013        6.9    10.0  10.4\n\nReferences\n\n[1] A. Belloni. Introduction to bundle methods. Technical report, MIT, 2005.\n[2] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[4] M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In Advances in Neural Information Processing Systems (NIPS), 2001.\n[5] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179\u2013188, 1936.\n[6] Y. 
Guo and D. Schuurmans. Convex relaxations of latent variable training. In Advances in Neural Information Processing Systems (NIPS), 2007.\n[7] Y. Guo and D. Schuurmans. Efficient global optimization for exponential family PCA and low-rank matrix factorization. In Allerton Conference on Communication, Control, and Computing, 2008.\n[8] I. Jolliffe. Principal Component Analysis. Springer Verlag, 2002.\n[9] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783\u20131816, 2005.\n[10] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. Muller. Fisher discriminant analysis with kernels. In IEEE Neural Networks for Signal Processing Workshop, 1999.\n[11] M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321\u2013357, 1993.\n[12] I. Rish, G. Grabarnik, G. Cecchi, F. Pereira, and G. Gordon. Closed-form supervised dimensionality reduction with generalized linear models. In Proceedings of the International Conference on Machine Learning (ICML), 2008.\n[13] Sajama and A. Orlitsky. Semi-parametric exponential family PCA. In Advances in Neural Information Processing Systems (NIPS), 2004.\n[14] Sajama and A. Orlitsky. Supervised dimensionality reduction using mixture models. In Proceedings of the International Conference on Machine Learning (ICML), 2005.\n[15] L. Song, A. Smola, K. Borgwardt, and A. Gretton. Colored maximum variance unfolding. In Advances in Neural Information Processing Systems (NIPS), 2007.\n[16] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611\u2013622, 1999.\n[17] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. 
Journal of Optimization Theory and Applications, 109:457\u2013494, 2001.\n[18] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Technical Report TR-649, UC Berkeley, Dept. of Statistics, 2003.\n[19] S. Yu, K. Yu, V. Tresp, H. Kriegel, and M. Wu. Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.", "award": [], "sourceid": 864, "authors": [{"given_name": "Yuhong", "family_name": "Guo", "institution": null}]}