{"title": "Semi-supervised Learning using Sparse Eigenfunction Bases", "book": "Advances in Neural Information Processing Systems", "page_first": 1687, "page_last": 1695, "abstract": "We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices.  It turns out that  when the \\emph{cluster assumption} holds, that is, when the high density regions are sufficiently separated by  low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classification functions. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data  with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier, which has a very sparse representation in this basis. Importantly, the sparsity appears naturally from the  cluster assumption. Experimental results on a number  of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis).", "full_text": "Semi-supervised Learning using Sparse\n\nEigenfunction Bases\n\nDept. of Computer Science and Engineering\n\nDept. of Computer Science and Engineering\n\nMikhail Belkin\n\nOhio State University\nColumbus, OH 43210\n\nKaushik Sinha\n\nOhio State University\nColumbus, OH 43210\n\nsinhak@cse.ohio-state.edu\n\nmbelkin@cse.ohio-state.edu\n\nAbstract\n\nWe present a new framework for semi-supervised learning with sparse eigenfunc-\ntion bases of kernel matrices. 
It turns out that when the data is clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combinations of such eigenvectors (or, more precisely, of their Nyström extensions) provide good candidates for classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption.
Experimental results on a number of real-world data sets show that our method is competitive with state-of-the-art semi-supervised learning algorithms and outperforms the natural baseline algorithm (Lasso in the Kernel PCA basis).

1 Introduction

Semi-supervised learning, i.e., learning from both labeled and unlabeled data, has received considerable attention in recent years due to its potential to reduce the need for expensive labeled data. However, to make effective use of unlabeled examples one needs to make some assumptions about the connection between the process generating the data and the process of assigning labels. Two assumptions are popular in the semi-supervised learning community, the "cluster assumption" [CWS02] and the "manifold assumption" [BNS06]; in addition there are a number of model-based methods, such as Naive Bayes [HTF03]. In particular, the cluster assumption can be interpreted as saying that two points are likely to have the same class label if they can be connected by a path passing through a high density area.
In other words, two high density areas with different class labels must be separated by a low density valley.
In this paper, we develop a framework for semi-supervised learning when the cluster assumption holds. Specifically, we show that when the high density areas are sufficiently separated, a few appropriately chosen eigenfunctions of a convolution operator (the continuous counterpart of the kernel matrix) represent the high density areas reasonably well. Under ideal conditions each high density area can be represented by a single unique eigenfunction, called the "representative" eigenfunction. If the cluster assumption holds, each high density area will correspond to just one class label, and thus a sparse linear combination of these representative eigenfunctions will be a good classifier. Moreover, the basis of such eigenfunctions can be learned using only the unlabeled data by constructing the Nyström extension of the eigenvectors of an appropriate kernel matrix. Thus, given unlabeled data, we construct the basis of eigenfunctions and then apply the L1-penalized optimization procedure Lasso [Tib96] to fit a sparse linear combination of the basis elements to the labeled data. We provide a detailed theoretical analysis of the algorithm and show that it is comparable to the state of the art on several common UCI datasets.
The rest of the paper is organized as follows. In section 2 we present the proposed framework for semi-supervised learning and describe the algorithm. In section 3 we analyze this algorithm and show that it can consistently identify the correct model.
In section 4 we provide experimental results on synthetic and real datasets, and finally we conclude with a discussion in section 5.

2 Semi-supervised Learning Framework

2.1 Outline of the Idea

In this section we present a framework for semi-supervised learning under the cluster assumption. Specifically, we will assume that (i) the data distribution has natural clusters separated by regions of low density and (ii) the label assignment conforms to these clusters.
The recent work of [SBY08a, SBY08b] shows that if the (unlabeled) data is clustered, then for each high density region there is a unique (representative) eigenfunction of a convolution operator which takes positive values for points in the chosen cluster and whose values are close to zero everywhere else (no sign change). Moreover, it can be shown (e.g., [RBV08]) that these eigenfunctions can be approximated from the eigenvectors of a kernel matrix obtained from the unlabeled data.
Thus, if the cluster assumption holds, we expect each cluster to have exactly one label assignment. Therefore the eigenfunctions corresponding to these clusters should produce a natural sparse basis for constructing a classification function.
This suggests the following learning strategy:

1. From unlabeled and labeled data, obtain the eigenvectors of the Gaussian kernel matrix.
2. From these eigenvectors, select a subset of candidate eigenvectors without sign change.
3. Using the labeled data, apply Lasso (sparse linear regression) in the constructed basis to obtain a classifier.
4. Using the Nyström extension (see [BPV03]), extend the eigenvectors to obtain a classification function defined everywhere.

Connection to Kernel PCA ([SSM98]). We note that our method is related to KPCA, where the data is projected onto the space spanned by the top few eigenvectors of the kernel matrix, and the classification or regression task is performed in that projected space.
The important difference is that we choose a subset of the eigenvectors in accordance with the cluster assumption. We note that the method simply using the KPCA basis does not seem to benefit from unlabeled data and, in fact, cannot outperform the standard fully supervised SVM classifier. On the other hand, our algorithm, using a basis subselection procedure, shows results comparable to the state of the art.
This is due to two reasons. We will see that each cluster in the data corresponds to its unique representative eigenvector of the kernel matrix. However, this eigenvector may not be among the top eigenvectors and may thus be omitted when applying KPCA. Alternatively, if the representative eigenvector is included, it will be included along with a number of other uninformative eigenvectors, resulting in poor performance due to overfitting.
We now proceed with the detailed discussion of our algorithm and its analysis.

2.2 Algorithm

The focus of our discussion will be binary classification in the semi-supervised setting. Given $l$ labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$ sampled from an underlying joint probability distribution $P_{X,Y}$, $X \subset \mathbb{R}^d$, $Y = \{-1, 1\}$, where the $x_i$ are the data points and the $y_i$ their corresponding labels, and $u$ unlabeled examples $\{x_i\}_{i=l+1}^{l+u}$ drawn iid from the marginal distribution $P_X$, we choose a Gaussian kernel $k(x,z) = \exp\left(-\frac{\|x-z\|^2}{2\omega^2}\right)$ with kernel bandwidth $\omega$ to construct the kernel matrix $K$, where $K_{ij} = \frac{1}{u}\,k(z_i, z_j)$. Let $(\lambda_i, v_i)_{i=1}^{u}$ be the eigenvalue-eigenvector pairs of $K$ sorted by non-increasing eigenvalues. It has been shown ([SBY08a, SBY08b]) that when the data distribution $P_X$ has clusters, for each high density region there is a unique representative eigenfunction of a convolution operator that takes positive values around the chosen cluster and is close to zero everywhere else.
Moreover, these eigenfunctions can be approximated from the eigenvectors of a kernel matrix obtained from the unlabeled data ([RBV08]); thus for each high density region there is a unique representative eigenvector of the kernel matrix that takes only positive or only negative values in the chosen cluster and is nearly zero everywhere else (no sign change).
If the cluster assumption holds, i.e., each high density region corresponds to a portion of a pure class, then the classifier can be naturally expressed in the representative eigenvector basis: a linear combination of the representative eigenvectors will be a reasonable candidate for a good classification function. However, identifying representative eigenvectors is not trivial, because in real life, depending on the separation between high density clusters, the representative eigenvectors may have no sign change only up to some small precision $\epsilon > 0$. Specifically, we say that a vector $e = (e_1, e_2, \ldots, e_n) \in \mathbb{R}^n$ has no sign change up to precision $\epsilon$ if either $\forall i\;\, e_i > -\epsilon$ or $\forall i\;\, e_i < \epsilon$. Let $N_\epsilon$ be the set of indices of all eigenvectors that have no sign change up to precision $\epsilon$. If $\epsilon$ is chosen properly, $N_\epsilon$ will contain the representative eigenvectors (note that the set $N_\epsilon$ and the set $\{1, 2, \ldots, |N_\epsilon|\}$ are not necessarily the same). Thus, instead of identifying the representative eigenvectors exactly, we carefully select a small set containing them. Our goal is to learn a linear combination of the eigenvectors $\sum_{i \in N_\epsilon} \beta_i v_i$ which minimizes the classification error on the labeled examples and whose coefficients corresponding to non-representative eigenvectors are zero.
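The no-sign-change selection rule above is simple to state concretely; the following is a minimal numpy sketch (our own illustration, not the authors' code), where `eps` plays the role of $\epsilon$, `omega` the role of the kernel bandwidth $\omega$, and the $1/u$ normalization of the kernel matrix follows the definition above.

```python
import numpy as np

def kernel_eigensystem(X, omega):
    """Gaussian kernel matrix K_ij = (1/u) exp(-||x_i - x_j||^2 / (2 omega^2))
    and its eigenpairs sorted by non-increasing eigenvalue."""
    u = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq_dists / (2.0 * omega ** 2)) / u
    lam, V = np.linalg.eigh(K)          # eigh returns ascending eigenvalues
    order = np.argsort(-lam)            # re-sort to non-increasing order
    return lam[order], V[:, order]

def no_sign_change_indices(V, eps):
    """Indices i such that column V[:, i] has no sign change up to
    precision eps: all entries > -eps, or all entries < eps."""
    idx = []
    for i in range(V.shape[1]):
        e = V[:, i]
        if np.all(e > -eps) or np.all(e < eps):
            idx.append(i)
    return idx
```

Note that the condition is symmetric under a global sign flip of the eigenvector, which matters because numerical eigensolvers return eigenvectors with arbitrary sign.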
Thus, the task is one of model selection, or sparse approximation. The standard approach to obtaining a sparse solution is to minimize a convex loss function $V$ on the labeled examples and apply an L1 penalty (on the $\beta_i$). If we select $V$ to be the square loss, we end up solving the L1-penalized least squares problem, the so-called Lasso [Tib96], whose consistency properties were studied in [ZY06]. Thus we seek a solution of the form

$\arg\min_{\beta}\; (y - \Psi\beta)^T (y - \Psi\beta) + \lambda \|\beta\|_1 \qquad (1)$

which is a convex optimization problem, where $\Psi$ is the $l \times |N_\epsilon|$ design matrix whose $i$th column is the first $l$ elements of $v_{N_\epsilon(i)}$, $y \in \mathbb{R}^l$ is the label vector, $\beta$ is the vector of coefficients and $\lambda$ is a regularization parameter. Note that solving the above problem is equivalent to solving

$\arg\min_{\beta}\; (y - \Psi\beta)^T (y - \Psi\beta) \quad \text{s.t.} \quad \sum_{i \in N_\epsilon} |\beta_i| \le t \qquad (2)$

because for any given $\lambda \in [0, \infty)$ there exists a $t \ge 0$ such that the two problems have the same solution, and vice versa [Tib96]. We denote the solution of Equation 2 by $\hat\beta$. To obtain a classification function defined everywhere, we use the Nyström extension of the $i$th eigenvector, defined as $\psi_i(x) = \frac{1}{\lambda_i \sqrt{l+u}} \sum_{j=1}^{l+u} v_i(x_j)\, k(x, x_j)$. Let the set $T$ contain the indices of all nonzero $\hat\beta_i$. Using the Nyström extension, the classification function is given by $f(x) = \sum_{i \in T} \hat\beta_i \psi_i(x) = \sum_{i=1}^{l+u} W_i\, k(x_i, x)$, where $W \in \mathbb{R}^{l+u}$ is a weight vector whose $i$th element is

$W_i = \sum_{j \in T} \frac{\hat\beta_j\, v_j(x_i)}{\lambda_j \sqrt{u}} \qquad (3)$

and can be computed during training.

Algorithm for Semi-supervised Learning

Input: $\{(x_i, y_i)\}_{i=1}^{l}$, $\{x_i\}_{i=l+1}^{l+u}$
Parameters: $\omega$, $t$, $\epsilon$

1. Construct the kernel matrix $K$ from the $l+u$ examples $\{x_i\}_{i=1}^{l+u}$.
2. Select the set $N_\epsilon$ containing the indices of the eigenvectors with no sign change up to precision $\epsilon$.
3. Construct the design matrix $\Psi$ whose $i$th column is the top $l$ rows of $v_{N_\epsilon(i)}$.
4. Solve Equation 2 to get $\hat\beta$ and calculate the weight vector $W$ using Equation 3.
5. Given a test point $x$, predict its label as $y = \mathrm{sign}\left(\sum_{i=1}^{l+u} k(x_i, x)\, W_i\right)$.

3 Analysis of the Algorithm

The main purposes of the analysis are: (i) to estimate the amount of separation required among the high density regions which ensures that each high density region can be well represented by a unique (representative) eigenfunction; (ii) to estimate the number of unlabeled examples required so that the eigenvectors of the kernel matrix approximate the eigenfunctions of a convolution operator (defined below); and (iii) to show that, using few labeled examples, Lasso can consistently identify the correct model consisting of a linear combination of representative eigenvectors.
Before starting the actual analysis, we first note that the continuous counterpart of the Gram matrix is the convolution operator $L_K : L^2(X, P_X) \to L^2(X, P_X)$ defined by

$(L_K f)(x) = \int_X k(x, z) f(z)\, dP_X(z) \qquad (4)$

The eigenfunctions of the symmetric positive definite operator $L_K$ will be denoted by $\phi_i^L$.
Next, we briefly discuss the effectiveness of model selection using Lasso (established by [ZY06]), which will be required for our analysis. Let $\hat\beta_l(\lambda)$ be the solution of Equation 1 for a chosen regularization parameter $\lambda$. In [ZY06] the concept of sign consistency was introduced, which states that Lasso is sign consistent if, as $l$ tends to infinity, the signs of $\hat\beta_l(\lambda)$ match the signs of $\beta^*$ with probability 1, where $\beta^*$ is the coefficient vector of the correct model.
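The L1-penalized least squares problem of Equation 1 can be solved by simple coordinate descent with soft thresholding; below is a small self-contained numpy sketch (our own illustration under assumed names, not the authors' implementation), where `Psi` is the design matrix, `lam` is $\lambda$, and the factor `lam / 2` arises because the squared-error term in Equation 1 is not halved.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(Psi, y, lam, n_iter=200):
    """Coordinate descent for min_b (y - Psi b)^T (y - Psi b) + lam * ||b||_1."""
    n, p = Psi.shape
    beta = np.zeros(p)
    col_sq = np.sum(Psi ** 2, axis=0)           # per-column squared norms
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] < 1e-12:
                continue
            # residual with feature j's current contribution removed
            r = y - Psi @ beta + Psi[:, j] * beta[j]
            z = Psi[:, j] @ r
            beta[j] = soft_threshold(z, lam / 2.0) / col_sq[j]
    return beta
```

With an orthonormal design the solution is available in closed form (each coefficient is a soft-thresholded projection), which makes the routine easy to sanity-check.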
Note that since we are expecting a sparse model, matching the zeros of $\hat\beta_l(\lambda)$ to the zeros of $\beta^*$ is not enough; matching, in addition, the signs of the nonzero coefficients ensures that the true model is selected. Next, assume without loss of generality that $\beta^* = (\beta^*_1, \ldots, \beta^*_q, \beta^*_{q+1}, \ldots, \beta^*_{|N_\epsilon|})$ has only its first $q$ terms nonzero, i.e., only $q$ predictors describe the model and the remaining predictors are irrelevant. Now write the first $q$ and the remaining $|N_\epsilon| - q$ columns of $\Psi$ as $\Psi_{(1)}$ and $\Psi_{(2)}$ respectively, and let $C = \frac{1}{l}\Psi^T\Psi$. Note that, for a random design matrix, sign consistency is equivalent to the irrepresentable condition (see [ZY06]). When $\beta^*$ is unknown, ensuring that the irrepresentable condition holds for all possible sign patterns requires that the L1 norm of the regression coefficients of the irrelevant predictors on the relevant ones be less than 1, which can be written as $\mu_\Psi = \max_j \left\| \left(\Psi_{(1)}^T \Psi_{(1)}\right)^{-1} \Psi_{(1)}^T \psi_j \right\|_1 < 1$, where $\psi_j$ ranges over the columns of $\Psi_{(2)}$. The requirement $\mu_\Psi < 1$ is not new and has also appeared in the context of noisy or noiseless sparse recovery of signals [Tro04, Wai06, Zha08]. Note that Lasso is sign consistent if the irrepresentable condition holds, and a sufficient condition for the irrepresentable condition to hold is given by the following result.
Theorem 3.1. [ZY06] Suppose $\beta^*$ has $q$ nonzero entries.
Let the matrix $C'$ be the normalized version of $C$ such that $C'_{ij} = C_{ij}/C_{ii}$. If $\max_{i,j,\, i \neq j} |C'_{ij}| \le \frac{c}{2q-1}$ for a constant $0 \le c < 1$, then the strong irrepresentable condition holds.
Our main result below shows that this sufficient condition is satisfied with high probability using relatively few labeled examples; as a result the correct model, which in turn describes a good classification function, is identified consistently.
Theorem 3.2. Let $q$ be the minimum number of columns of the design matrix $\Psi \in \mathbb{R}^{l \times |N_\epsilon|}$, constructed from the $l$ labeled examples, that describe the sparse model. Then for any $0 < \delta < 1$, if the number of unlabeled examples satisfies $u > \frac{2048\, q^2 \log(2/\delta)}{g_{N_{\max}}^2 \lambda_{N_{\max}}^2}$, then with probability greater than $1 - \frac{\delta}{2} - 4\exp\left(-\frac{l\, \lambda_{N_{\max}}^2}{50\, q^2}\right)$ we have $\max_{i \neq j} |C'_{ij}| < \frac{1}{2q-1}$, where $\lambda_{N_{\max}}$ is the $N_{\max}$-th largest eigenvalue of $L_K$ ($N_{\max}$ is defined later) and $g_{N_{\max}}$ is the $N_{\max}$-th eigengap.
Note that in our framework unlabeled examples help polynomially fast in estimating the eigenfunctions, while labeled examples help exponentially fast in identifying the sparse model consisting of representative eigenfunctions. Interestingly, a similar division of roles between labeled and unlabeled examples (in reducing classification error) has been reported in the semi-supervised learning literature [CC96, RV95, SB07, SNZ08].

3.1 Brief Overview of the Analysis

As a first step of our analysis, in section 3.2, we estimate the separation requirement among the high density regions which ensures that each high density region (class) can be well represented by a unique eigenfunction.
This allows us to express the classification task in this eigenfunction basis, where we look for a classification function consisting of a linear combination of representative eigenfunctions only, and thus relate the problem to sparse approximation from the model selection point of view, a well studied field [Wai06, ZH06, CP07].
As a second step, in section 3.3, using perturbation results from [RBV08], we estimate the number of unlabeled examples required to ensure that the Nyström extensions of the eigenvectors of $K$ approximate the eigenfunctions of the convolution operator $L_K$ reasonably well with high probability.
Finally, as a third step, in section 3.4, we establish a concentration inequality which, together with the result of the second step, ensures that as more labeled examples are used to fit the eigenfunction basis to the data, the probability that Lasso identifies the correct model consisting of representative eigenfunctions increases exponentially fast.

3.2 Separation Requirement

To motivate our discussion we consider a binary classification problem where the marginal density is a mixture model in which each class has its own probability density function, $p_1(x)$ and $p_2(x)$, with corresponding mixing weights $\pi_1$ and $\pi_2$. Thus the density of the mixture is $p(x) = \pi_1 p_1(x) + \pi_2 p_2(x)$. We will use the following result from [SBY08a] specifying the behavior of the eigenfunction of $L_K$ corresponding to the largest eigenvalue.
Theorem 3.3.
[SBY08a] The top eigenfunction $\phi_0^L(x)$ of $L_K$ corresponding to the largest eigenvalue $\lambda_0$: (1) is the only eigenfunction with no sign change, (2) has multiplicity one, (3) is nonzero on the support of the underlying density, and (4) satisfies $|\phi_0^L(x)| \le \frac{1}{\lambda_0}\sqrt{\int k^2(x,z)\, p(z)\, dz}$ (tail decay property), where $p$ is the underlying probability density function.
Note that the last (tail decay) property is not restricted to the top eigenfunction alone but is satisfied by all eigenfunctions of $L_K$. Now, consider applying $L_K$ in the three cases where the underlying probability distribution is $p_1$, $p_2$ and $p$. The largest eigenvalues and corresponding eigenfunctions in these three cases are $\lambda_0^1, \lambda_0^2, \lambda_0$ and $\phi_0^{L,1}, \phi_0^{L,2}, \phi_0^L$ respectively. To show the explicit dependence on the underlying probability distribution, we denote the corresponding operators by $L_K^{p_1}$, $L_K^{p_2}$ and $L_K^{p}$. Clearly, $L_K^{p} = \pi_1 L_K^{p_1} + \pi_2 L_K^{p_2}$. Then we can write $L_K^{p}\phi_0^{L,1}(x) = \pi_1\lambda_0^1\left(\phi_0^{L,1}(x) + T_1(x)\right)$, where $T_1(x) = \frac{\pi_2}{\pi_1\lambda_0^1}\int k(x,z)\,\phi_0^{L,1}(z)\, p_2(z)\, dz$. In a similar way we can write $L_K^{p}\phi_0^{L,2}(x) = \pi_2\lambda_0^2\left(\phi_0^{L,2}(x) + T_2(x)\right)$, where $T_2(x) = \frac{\pi_1}{\pi_2\lambda_0^2}\int k(x,z)\,\phi_0^{L,2}(z)\, p_1(z)\, dz$. Thus, when $T_1(x)$ and $T_2(x)$ are small enough, $\phi_0^{L,1}$ and $\phi_0^{L,2}$ are (approximately) eigenfunctions of $L_K^{p}$ with corresponding eigenvalues $\pi_1\lambda_0^1$ and $\pi_2\lambda_0^2$ respectively. Note that the "separation condition" requirement refers to $T_1(x)$ and $T_2(x)$ being small, so that the eigenfunctions corresponding to the largest eigenvalues of the convolution operator applied to the individual high density bumps are preserved when the convolution operator is applied to the mixture. Clearly, we cannot expect $T_1(x)$ and $T_2(x)$ to be arbitrarily small if there is substantial overlap between $p_1$ and $p_2$. Thus, we restrict ourselves to the following class of probability distributions, with reasonably fast tail decay, for each individual class.
Assumption 1. For any $1/2 < \eta < 1$, let $M(\eta, R)$ be the class of probability distributions whose density function $p$ satisfies:
1) $\int_R p(x)\, dx = \eta$, where $R$ is the minimum volume ball around the mean of the distribution.
2) For any positive $t > 0$ smaller than the radius of $R$, and for any point $z \in X \setminus R$ with $\mathrm{dist}(z, R) \ge t$, the set $S = \{x \in (X \setminus R) \cap B(z, 3t/\sqrt{2})\}$ has total probability mass $\int_S p(x)\, dx \le C_1 \eta \exp\left(-\frac{\mathrm{dist}^2(z, R)}{2\omega^2}\right)$ for some $C_1 > 0$,
where the distance between a point $x$ and a set $D$ is defined as $\mathrm{dist}(x, D) = \inf_{y \in D} \|x - y\|$. With a slight abuse of notation we write $p \in M(\eta, R)$ to mean that $p$ is the probability density function of a member of $M(\eta, R)$. Now a rough estimate of the separation requirement is given by the following lemma.
Lemma 3.1. Let $p_1 \in M(\eta, R_1)$ and $p_2 \in M(\eta, R_2)$, and let the minimum distance between $R_1$ and $R_2$ be $\Delta$. If $\Delta = \Omega^*\!\left(\omega\sqrt{d}\right)$, then $T_1(x)$ and $T_2(x)$ can be made arbitrarily small for all $x \in X$.
The estimate of $\Delta$ in the above lemma, where we hide the log factor in $\Omega^*$, is by no means tight; nevertheless, it shows that the separation requirement amounts to the existence of a low density valley between two high density regions, each corresponding to one of the classes. This separation requirement is roughly of the same order as that required to learn a mixture of Gaussians [Das99]. Note that, provided the separation requirement is satisfied, $\phi_0^{L,1}$ and $\phi_0^{L,2}$ are not necessarily the top two eigenfunctions of $L_K$ corresponding to the two largest eigenvalues; they can be quite far down the spectrum of $L_K^{p}$, depending on the mixing weights $\pi_1, \pi_2$. Next, the following lemma shows that we can say more about the eigenfunction corresponding to the largest eigenvalue.
Lemma 3.2. For any $\frac{e}{1+e} < \eta < 1$, let $q \in M(\eta, R)$. If $\phi_0^L$ is the eigenfunction of $L_K^{q}$ corresponding to the largest eigenvalue $\lambda_0$, then there exists a $C_1 > 0$ such that:
1) for all $x \in X \setminus R$, $|\phi_0^L(x)| \le \frac{\sqrt{C_1 + \eta}}{\lambda_0} \exp\left(-\frac{\mathrm{dist}^2(x, R)}{2\omega^2}\right)$;
2) for all $z \in R$ and $x \in X \setminus R$, $|\phi_0^L(z)| \ge |\phi_0^L(x)|$.
Thus for each class the top eigenfunction corresponding to the largest eigenvalue represents the high density region reasonably well; outside the high density region it has lower absolute value and decays exponentially fast.

3.3 Finite Sample Results

We start with the following assumption.
Assumption 2. The $N_{\max}$ largest eigenvalues of $L_K$ and $K$, where $N_{\max} = \max\{i : i \in N_\epsilon\}$, are simple and bounded away from zero.
Note that the Nyström extensions $\psi_i$ are eigenfunctions of an operator $L_{K,\mathcal{H}} : \mathcal{H} \to \mathcal{H}$, where $\mathcal{H}$ is the unique RKHS defined by the chosen Gaussian kernel, and all the eigenvalues of $K$ are also eigenvalues of $L_{K,\mathcal{H}}$ ([RBV08]). There are two implications of Assumption 2. The first, due to the "bounded away from zero" part, ensures that if we restrict to the $\psi_i$ corresponding to the largest $N_{\max}$ eigenvalues, then each of them is square integrable and hence belongs to $L^2(X, P_X)$. The second, due to the "simple" part, ensures that the eigenfunctions corresponding to the $N_{\max}$ largest eigenvalues are uniquely defined, and so are the orthogonal projections onto them. Note that if an eigenvalue has multiplicity greater than one, the corresponding eigenspace is well defined but the individual eigenfunctions are not. Thus, Assumption 2 enables us to compare how close each $\psi_i$ is to some other function in $L^2(X, P_X)$, in the $L^2(X, P_X)$ norm sense. Let $g_{N_{\max}}$ be the $N_{\max}$-th eigengap when the eigenvalues of $L_K$ are sorted in non-increasing order. Then we have the following results.
Lemma 3.3. Suppose Assumption 2 holds and the top $N_{\max}$ eigenvalues of $L_K$ and $K$ are sorted in decreasing order. Then for any $0 < \delta < 1$ and for any $i \in N_\epsilon$, with probability at least $1 - \delta$, $\|\psi_i - \phi_i^L\|_{L^2(X, P_X)} \le \frac{2}{g_{N_{\max}}}\sqrt{\frac{2\log(2/\delta)}{u\,\lambda_i}}$.
Corollary 3.1.
Under the above conditions, for any $0 < \delta < 1$ and for any $i, j \in N_\epsilon$, with probability at least $1 - \delta$ the following hold:
1) $\left| \langle \psi_i, \psi_j \rangle_{L^2(X, P_X)} \right| \le \left(\frac{8\log(2/\delta)}{g_{N_{\max}}^2\sqrt{\lambda_i\lambda_j}}\right)\frac{1}{u} + \left(\frac{\sqrt{8\log(2/\delta)}}{g_{N_{\max}}}\left(\frac{1}{\sqrt{\lambda_i}} + \frac{1}{\sqrt{\lambda_j}}\right)\right)\frac{1}{\sqrt{u}}$
2) $1 - \left(\sqrt{\frac{8\log(2/\delta)}{\lambda_i}}\right)\frac{1}{\sqrt{u}} \le \|\psi_i\|_{L^2(X, P_X)} \le 1 + \left(\sqrt{\frac{8\log(2/\delta)}{\lambda_i}}\right)\frac{1}{\sqrt{u}}$

3.4 Concentration Results

Having established that the $\{\psi_i\}_{i \in N_\epsilon}$ approximate the corresponding eigenfunctions of $L_K$ reasonably well, we next consider what happens when each $\psi_i$ is restricted to finitely many labeled examples. Note that the design matrix $\Psi \in \mathbb{R}^{l \times |N_\epsilon|}$ is constructed by restricting the $\{\psi_j\}_{j \in N_\epsilon}$ to the $l$ labeled data points $\{x_i\}_{i=1}^{l}$, so that the $i$th column of $\Psi$ is $\left(\psi_{N_\epsilon(i)}(x_1), \psi_{N_\epsilon(i)}(x_2), \ldots, \psi_{N_\epsilon(i)}(x_l)\right)^T \in \mathbb{R}^l$. Now consider the $|N_\epsilon| \times |N_\epsilon|$ matrix $C = \frac{1}{l}\Psi^T\Psi$, where $C_{ij} = \frac{1}{l}\sum_{k=1}^{l} \psi_{N_\epsilon(i)}(x_k)\, \psi_{N_\epsilon(j)}(x_k)$. First, applying Hoeffding's inequality, we establish:
Lemma 3.4. For all $i, j \in N_\epsilon$ and $\epsilon_1 > 0$ the following two facts hold:
$P\left(\left|\frac{1}{l}\sum_{k=1}^{l} [\psi_i(x_k)]^2 - E\big([\psi_i(X)]^2\big)\right| \ge \epsilon_1\right) \le 2\exp\left(-\frac{l\,\epsilon_1^2\lambda_i^2}{2}\right)$
$P\left(\left|\frac{1}{l}\sum_{k=1}^{l} \psi_i(x_k)\psi_j(x_k) - E\big(\psi_i(X)\psi_j(X)\big)\right| \ge \epsilon_1\right) \le 2\exp\left(-\frac{l\,\epsilon_1^2\lambda_i\lambda_j}{2}\right)$
Next, consider the $|N_\epsilon| \times |N_\epsilon|$ normalized matrix $C'$, where $C'_{ij} = C_{ij}/C_{ii}$ and $C'_{ii} = 1$. To ensure that Lasso consistently chooses the correct model, we need to show (see Theorem 3.1) that $\max_{i \neq j} |C'_{ij}| < \frac{1}{2q-1}$ with high probability. Applying the above concentration results together with the finite sample results yields Theorem 3.2.

4 Experimental Results

4.1 Toy Dataset

Here we present a synthetic example in 2-D. Consider a binary classification problem where the positive examples are generated from a Gaussian distribution with mean $(0, 0)$ and covariance matrix $[2\ 0;\ 0\ 2]$, and the negative examples are generated from a mixture of Gaussians having means and covariance matrices $(5, 5)$, $[2\ 1;\ 1\ 2]$ and $(7, 7)$, $[1.5\ 0;\ 0\ 1.5]$ respectively. The corresponding mixing weights are $0.4$, $0.3$ and $0.3$. The left panel of Figure 1 shows the probability density of the mixture in blue and the representative eigenfunction of each class in green and magenta respectively, using 1000 examples (positive and negative) drawn from this mixture. It is clear that each representative eigenfunction represents the high density area of a particular class reasonably well, so intuitively a linear combination of them will form a good decision function. In fact, the right panel of Figure 1 shows the regularization path for L1-penalized least squares regression with 20 labeled examples. The bold green and magenta lines show the coefficient values for the representative eigenfunctions for different values of the regularization parameter $t$. As can be seen, the regularization parameter $t$ can be chosen so that the decision function consists of a linear combination of representative eigenfunctions only.
Note that these representative eigenfunctions need not be the top two eigenfunctions corresponding to the largest eigenvalues.

[Figure 1: Left panel: Probability density of the mixture in blue and representative eigenfunctions in green and magenta. Right panel: Regularization path. Bold lines correspond to the regularization path associated with the representative eigenfunctions.]

4.2 UCI Datasets

In this set of experiments we tested the effectiveness of our algorithm (which we call SSL SEB) on some common UCI datasets. We compared our algorithm with the state-of-the-art semi-supervised (manifold regularization) method Laplacian SVM (LapSVM) [BNS06], with a fully supervised SVM, and with two other kernel sparse regression methods. In KPCA+L1 we selected the top $|N_\epsilon|$ eigenvectors and applied L1 regularization; in KPCA F+L1 we selected the top 20 (fixed) eigenvectors of $K$ and applied L1 regularization¹; and in KPCA max+L1 we selected the top $\max$ eigenvectors and applied L1 regularization, where $\max$ is the maximum index of the set of eigenvectors in $N_\epsilon$, i.e., the index of the lowest eigenvector chosen by our method. For both SVM and LapSVM we used an RBF kernel. In each experiment a specified number of examples ($l$) were randomly chosen and labeled, and the rest ($u$) were treated as an unlabeled test set. Such random splitting was performed 30 times and the average is reported.
The results are reported in Table 1.
As can be seen, for small numbers of labeled examples our method convincingly outperforms SVM and is comparable to LapSVM. The results also suggest that instead of selecting the top few eigenvectors, as is normally done in KPCA, selecting them by our method and then applying L1 regularization yields better results. In particular, for the IONOSPHERE and BREAST-CANCER data sets the top |N_ε| (5 and 3, respectively) eigenvectors do not contain the representative ones; as a result, in these two cases KPCA+L1 performs very poorly. Table 2 shows that the solution obtained by our method is very sparse, where average sparsity is the average number of non-zero coefficients.
We note that our method does not work equally well for all datasets, and has generally higher variability than LapSVM.

¹We also selected the top 100 eigenvectors and applied the L1 penalty, but it gave worse results.

Table 1: Classification accuracies (mean ± standard deviation over the 30 random splits) of SSL_SEB, KPCA+L1, KPCA_F+L1, KPCA_max+L1, SVM and LapSVM on five UCI datasets: IONOSPHERE (d=33, l+u=351; l=20, 30), HEART (d=13, l+u=303; l=10, 20, 30), WINE (d=13, l+u=178; l=10, 20), BREAST-CANCER (d=30, l+u=569; l=5, 10) and VOTING (d=16, l+u=435; l=10, 15).

DATA SET      IONOSPHERE   HEART        WINE        BREAST-CANCER   VOTING
SSL_SEB       2.83 / 5     4.63 / 9     3.52 / 6    2.10 / 3        2.02 / 3
KPCA+L1       3.23 / 5     5.84 / 9     3.8 / 6     2.78 / 3        2.02 / 3
KPCA_F+L1     6.05 / 20    8.11 / 20    6.12 / 20   4.70 / 20       3.05 / 20
KPCA_max+L1   6.85 / 23    16.42 / 78   6.07 / 16   10.81 / 57      2.02 / 3

Table 2: Average sparsity for different UCI datasets. The notation A / B represents average sparsity A and the number of available eigenvectors B (|N_ε| or 20).

4.3 Handwritten Digit Recognition
In this set of experiments we applied our method to the 45 binary classification problems that arise in pairwise classification of handwritten digits and compared its performance with LapSVM. For each pairwise classification problem, in each trial, 500 images of each digit in the USPS training set were chosen uniformly at random, out of which 20 images were labeled and the rest were set aside for testing. Each trial was repeated 10 times. For LapSVM we set the regularization terms and the kernel as reported by [BNS06] for a similar set of experiments, namely γ_A l = 0.005 and γ_I l / (u+l)² = 0.045, and chose a polynomial kernel of degree 3.
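The average sparsity reported in Table 2 is just the mean count of non-zero Lasso coefficients over the 30 random labeled/unlabeled splits. A sketch of that bookkeeping on synthetic data; the coordinate-descent solver, the dimensions, and the penalty λ below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def lasso_cd(Phi, y, lam, iters=300):
    # Plain coordinate-descent Lasso: min_w 0.5*||y - Phi w||^2 + lam*||w||_1
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        for j in range(Phi.shape[1]):
            r = y - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r
            denom = Phi[:, j] @ Phi[:, j]
            if denom > 0:
                w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / denom
    return w

def average_sparsity(Phi, y, lam, n_trials=30, n_labeled=20, seed=0):
    """Mean number of non-zero Lasso coefficients over random labeled
    subsets: the quantity reported as "average sparsity" in Table 2."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_trials):
        idx = rng.choice(len(y), size=n_labeled, replace=False)
        w = lasso_cd(Phi[idx], y[idx], lam)
        counts.append(int(np.count_nonzero(np.abs(w) > 1e-8)))
    return float(np.mean(counts))

# Synthetic demo: labels depend on 3 of 10 basis functions.
rng = np.random.default_rng(2)
Phi = rng.normal(size=(200, 10))
y = np.sign(Phi[:, 0] - Phi[:, 1] + Phi[:, 2])
print(average_sparsity(Phi, y, lam=6.0))
```

In the paper's setting the columns of Phi would be the selected eigenvectors restricted to the labeled points, rather than random features as in this demo.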
The results are shown² in Figure 2. As can be seen, our method is comparable to LapSVM.

Figure 2: Classification results (test error rates, %) of SSL_SEB and LapSVM on all 45 two-class classification problems for the USPS dataset.

We also performed multi-class classification on the USPS dataset. In particular, we chose all the images of digits 3, 4 and 5 from the USPS training data set (there were 1866 in total) and randomly labeled 10 images from each class. The rest of the 1836 images were set aside for testing. The average prediction accuracy of LapSVM, after repeating this procedure 20 times, was 90.14%, compared to 87.53% for our method.

5 Conclusion
In this paper we have presented a framework for spectral semi-supervised learning based on the cluster assumption. We showed that the cluster assumption is equivalent to the classifier being sparse in a certain appropriately chosen basis and demonstrated how such a basis can be computed using only unlabeled data. We have provided a theoretical analysis of the resulting algorithm and given experimental results demonstrating that it has performance comparable to the state of the art on a number of data sets and dramatically outperforms the natural baseline of KPCA + Lasso.

²It turned out that in the cases where our method performed very poorly, the respective distances between the means of the corresponding two classes were very small.

References
[BNS06] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
[BPV03] Y. Bengio, J-F. Paiement, and P. Vincent.
Out-of-sample Extensions for LLE, Isomap, MDS, Eigenmaps and Spectral Clustering. In NIPS, 2003.
[CC96] V. Castelli and T. M. Cover. The Relative Value of Labeled and Unlabeled Samples in Pattern Recognition with Unknown Mixing Parameters. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.
[CP07] E. J. Candes and Y. Plan. Near Ideal Model Selection by ℓ1 Minimization. eprint arXiv:0801.0345, 2007.
[CWS02] O. Chapelle, J. Weston, and B. Scholkopf. Cluster Kernels for Semi-supervised Learning. In NIPS, 2002.
[Das99] S. Dasgupta. Learning Mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, 1999.
[HTF03] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2003.
[RBV08] L. Rosasco, M. Belkin, and E. De Vito. Perturbation Results for Learning Empirical Operators. Technical Report TR-2008-052, Massachusetts Institute of Technology, Cambridge, MA, August 2008.
[RV95] J. Ratsaby and S. Venkatesh. Learning From a Mixture of Labeled and Unlabeled Examples with Parametric Side Information. In COLT, 1995.
[SB07] K. Sinha and M. Belkin. The Value of Labeled and Unlabeled Examples when the Model is Imperfect. In NIPS, 2007.
[SBY08a] T. Shi, M. Belkin, and B. Yu. Data Spectroscopy: Eigenspace of Convolution Operators and Clustering. Technical report, Dept. of Statistics, Ohio State University, 2008.
[SBY08b] T. Shi, M. Belkin, and B. Yu. Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators. In ICML, 2008.
[SNZ08] A. Singh, R. D. Nowak, and X. Zhu. Unlabeled Data: Now it Helps, Now it Doesn't. In NIPS, 2008.
[SSM98] B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10:1299–1319, 1998.
[Tib96] R. Tibshirani.
Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[Tro04] J. A. Tropp. Greed is Good: Algorithmic Results for Sparse Approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
[Wai06] M. Wainwright. Sharp Thresholds for Noisy and High-dimensional Recovery of Sparsity using ℓ1-constrained Quadratic Programming. Technical Report TR-709, Dept. of Statistics, U. C. Berkeley, September 2006.
[ZH06] C. Zhang and J. Huang. Model Selection Consistency of the Lasso in High-Dimensional Linear Regression. Technical report, Dept. of Statistics, Rutgers University, 2006.
[Zha08] T. Zhang. On Consistency of Feature Selection using Greedy Least Squares Regression. Journal of Machine Learning Research, 2008.
[ZY06] P. Zhao and B. Yu. On Model Selection Consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.