{"title": "Learning with Local and Global Consistency", "book": "Advances in Neural Information Processing Systems", "page_first": 321, "page_last": 328, "abstract": "", "full_text": "Learning with Local and Global Consistency\n\nDengyong Zhou, Olivier Bousquet, Thomas Navin Lal,\nJason Weston, and Bernhard Sch\u00f6lkopf\n\nMax Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany\n{firstname.secondname}@tuebingen.mpg.de\n\nAbstract\n\nWe consider the general problem of learning from labeled and unlabeled\ndata, which is often called semi-supervised learning or transductive\ninference. A principled approach to semi-supervised learning is to design\na classifying function which is sufficiently smooth with respect to the\nintrinsic structure collectively revealed by known labeled and unlabeled\npoints. We present a simple algorithm to obtain such a smooth solution.\nOur method yields encouraging experimental results on a number of\nclassification problems and demonstrates effective use of unlabeled data.\n\n1 Introduction\n\nWe consider the general problem of learning from labeled and unlabeled data. Given a\npoint set $X = \\{x_1, \\ldots, x_l, x_{l+1}, \\ldots, x_n\\}$ and a label set $L = \\{1, \\ldots, c\\}$, the first $l$ points\nhave labels $\\{y_1, \\ldots, y_l\\} \\in L$ and the remaining points are unlabeled. The goal is to predict\nthe labels of the unlabeled points. The performance of an algorithm is measured by the\nerror rate on these unlabeled points only.\nSuch a learning problem is often called semi-supervised or transductive. Since labeling\noften requires expensive human labor, whereas unlabeled data is far easier to obtain,\nsemi-supervised learning is very useful in many real-world problems and has recently attracted\na considerable amount of research [10]. 
A typical application is web categorization, in\nwhich manually classified web pages are always a very small part of the entire web, and\nthe number of unlabeled examples is large.\nThe key to semi-supervised learning problems is the prior assumption of consistency, which\nmeans: (1) nearby points are likely to have the same label; and (2) points on the same structure\n(typically referred to as a cluster or a manifold) are likely to have the same label. This\nargument is akin to that in [2, 3, 4, 10, 15] and often called the cluster assumption [4, 10].\nNote that the first assumption is local, whereas the second one is global. Orthodox supervised\nlearning algorithms, such as k-NN, in general depend only on the first assumption of\nlocal consistency.\nTo illustrate the prior assumption of consistency underlying semi-supervised learning, let us\nconsider a toy dataset generated according to a pattern of two intertwining moons in Figure\n1(a). Every point should be similar to points in its local neighborhood, and furthermore,\npoints in one moon should be more similar to each other than to points in the other moon.\nThe classification results given by the Support Vector Machine (SVM) with an RBF kernel\n\n[Figure 1 panels: (a) Toy Data (Two Moons); (b) SVM (RBF Kernel); (c) k-NN; (d) Ideal Classification. Legend: unlabeled point, labeled point -1, labeled point +1.]\n\nFigure 1: Classification on the two moons pattern. 
(a) toy data set with two labeled points;\n(b) classifying result given by the SVM with an RBF kernel; (c) k-NN with k = 1; (d) ideal\nclassification that we hope to obtain.\n\nand k-NN are shown in Figure 1(b) & 1(c) respectively. According to the assumption of\nconsistency, however, the two moons should be classified as shown in Figure 1(d).\nThe main differences between the various semi-supervised learning algorithms, such as\nspectral methods [2, 4, 6], random walks [13, 15], graph mincuts [3] and transductive SVM\n[14], lie in their way of realizing the assumption of consistency. A principled approach to\nformalize the assumption is to design a classifying function which is sufficiently smooth\nwith respect to the intrinsic structure revealed by known labeled and unlabeled points. Here\nwe propose a simple iteration algorithm to construct such a smooth function inspired by the\nwork on spreading activation networks [1, 11] and diffusion kernels [7, 8, 12], recent work\non semi-supervised learning and clustering [2, 4, 9], and more specifically by the work of\nZhu et al. [15]. The keynote of our method is to let every point iteratively spread its label\ninformation to its neighbors until a global stable state is achieved.\nWe organize the paper as follows: Section 2 shows the algorithm in detail and also discusses\npossible variants; Section 3 introduces a regularization framework for the method; Section\n4 presents the experimental results for toy data, digit recognition and text classification;\nand Section 5 concludes this paper and outlines future research.\n\n2 Algorithm\n\nGiven a point set $X = \\{x_1, \\ldots, x_l, x_{l+1}, \\ldots, x_n\\} \\subset \\mathbb{R}^m$ and a label set $L = \\{1, \\ldots, c\\}$,\nthe first $l$ points $x_i$ ($i \\le l$) are labeled as $y_i \\in L$ and the remaining points $x_u$ ($l+1 \\le u \\le n$)\nare unlabeled. 
The goal is to predict the label of the unlabeled points.\nLet $\\mathcal{F}$ denote the set of $n \\times c$ matrices with nonnegative entries. A matrix\n$F = [F_1^T, \\ldots, F_n^T]^T \\in \\mathcal{F}$ corresponds to a classification on the dataset $X$ by labeling each\npoint $x_i$ as a label $y_i = \\arg\\max_{j \\le c} F_{ij}$. We can understand $F$ as a vectorial function\n$F : X \\to \\mathbb{R}^c$ which assigns a vector $F_i$ to each point $x_i$. Define an $n \\times c$ matrix $Y \\in \\mathcal{F}$ with\n$Y_{ij} = 1$ if $x_i$ is labeled as $y_i = j$ and $Y_{ij} = 0$ otherwise. Clearly, $Y$ is consistent with the\ninitial labels according to the decision rule. The algorithm is as follows:\n\n1. Form the affinity matrix $W$ defined by $W_{ij} = \\exp(-\\|x_i - x_j\\|^2 / 2\\sigma^2)$ if $i \\ne j$\nand $W_{ii} = 0$.\n2. Construct the matrix $S = D^{-1/2} W D^{-1/2}$ in which $D$ is a diagonal matrix with\nits $(i, i)$-element equal to the sum of the $i$-th row of $W$.\n3. Iterate $F(t+1) = \\alpha S F(t) + (1 - \\alpha) Y$ until convergence, where $\\alpha$ is a parameter\nin $(0, 1)$.\n4. Let $F^*$ denote the limit of the sequence $\\{F(t)\\}$. Label each point $x_i$ as a label\n$y_i = \\arg\\max_{j \\le c} F^*_{ij}$.\n\nThis algorithm can be understood intuitively in terms of spreading activation networks\n[1, 11] from experimental psychology. We first define a pairwise relationship $W$ on the\ndataset $X$ with the diagonal elements being zero. We can think that a graph $G = (V, E)$ is\ndefined on $X$, where the vertex set $V$ is just $X$ and the edges $E$ are weighted by $W$. In\nthe second step, the weight matrix $W$ of $G$ is normalized symmetrically, which is necessary\nfor the convergence of the following iteration. The first two steps are exactly the same as\nin spectral clustering [9]. During each iteration of the third step each point receives the\ninformation from its neighbors (first term), and also retains its initial information (second\nterm). 
The parameter $\\alpha$ specifies the relative amount of the information from its neighbors\nand its initial label information. It is worth mentioning that self-reinforcement is avoided\nsince the diagonal elements of the affinity matrix are set to zero in the first step. Moreover,\nthe information is spread symmetrically since $S$ is a symmetric matrix. Finally, the label of\neach unlabeled point is set to be the class of which it has received most information during\nthe iteration process.\nLet us show that the sequence $\\{F(t)\\}$ converges and $F^* = (1 - \\alpha)(I - \\alpha S)^{-1} Y$. Without\nloss of generality, suppose $F(0) = Y$. By the iteration equation $F(t+1) = \\alpha S F(t) + (1 - \\alpha) Y$\nused in the algorithm, we have\n\n$$F(t) = (\\alpha S)^{t} Y + (1 - \\alpha) \\sum_{i=0}^{t-1} (\\alpha S)^i Y. \\qquad (1)$$\n\nSince $0 < \\alpha < 1$ and the eigenvalues of $S$ are in $[-1, 1]$ (note that $S$ is similar to the stochastic\nmatrix $P = D^{-1} W = D^{-1/2} S D^{1/2}$),\n\n$$\\lim_{t \\to \\infty} (\\alpha S)^{t} = 0, \\quad \\text{and} \\quad \\lim_{t \\to \\infty} \\sum_{i=0}^{t-1} (\\alpha S)^i = (I - \\alpha S)^{-1}. \\qquad (2)$$\n\nHence\n\n$$F^* = \\lim_{t \\to \\infty} F(t) = (1 - \\alpha)(I - \\alpha S)^{-1} Y,$$\n\nfor classification, which is clearly equivalent to\n\n$$F^* = (I - \\alpha S)^{-1} Y. \\qquad (3)$$\n\nNow we can compute $F^*$ directly without iterations. This also shows that the iteration\nresult does not depend on the initial value for the iteration. In addition, it is worth noting\nthat $(I - \\alpha S)^{-1}$ is in fact a graph or diffusion kernel [7, 12].\nNow we discuss some possible variants of this method. The simplest modification is to\nrepeat the iteration after convergence, i.e. $F^* = (I - \\alpha S)^{-1} \\cdots (I - \\alpha S)^{-1} Y = (I - \\alpha S)^{-p} Y$,\nwhere $p$ is an arbitrary positive integer. 
In addition, since $S$ is similar to $P$,\nwe can consider substituting $P$ for $S$ in the third step, and then the corresponding closed\nform is $F^* = (I - \\alpha P)^{-1} Y$. It is also interesting to replace $S$ with $P^T$, the transpose of $P$.\nThen the classifying function is $F^* = (I - \\alpha P^T)^{-1} Y$. It is not hard to see this is equivalent\nto $F^* = (D - \\alpha W)^{-1} Y$. We will compare these variants with the original algorithm in the\nexperiments.\n\n3 Regularization Framework\n\nHere we develop a regularization framework for the above iteration algorithm. The cost\nfunction associated with $F$ is defined to be\n\n$$Q(F) = \\frac{1}{2} \\left( \\sum_{i,j=1}^{n} W_{ij} \\left\\| \\frac{1}{\\sqrt{D_{ii}}} F_i - \\frac{1}{\\sqrt{D_{jj}}} F_j \\right\\|^2 + \\mu \\sum_{i=1}^{n} \\left\\| F_i - Y_i \\right\\|^2 \\right), \\qquad (4)$$\n\nwhere $\\mu > 0$ is the regularization parameter. Then the classifying function is\n\n$$F^* = \\arg\\min_{F \\in \\mathcal{F}} Q(F). \\qquad (5)$$\n\nThe first term of the right-hand side in the cost function is the smoothness constraint, which\nmeans that a good classifying function should not change too much between nearby points.\nThe second term is the fitting constraint, which means a good classifying function should\nnot change too much from the initial label assignment. The trade-off between these two\ncompeting constraints is captured by a positive parameter $\\mu$. Note that the fitting constraint\ncontains labeled as well as unlabeled data.\nWe can understand the smoothness term as the sum of the local variations, i.e. the local\nchanges of the function between nearby points. As we have mentioned, the points involving\npairwise relationships can be thought of as an undirected weighted graph, the weights\nof which represent the pairwise relationships. The local variation is then in fact measured\non each edge. 
We do not simply define the local variation on an edge by the difference of\nthe function values on the two ends of the edge. The smoothness term essentially splits\nthe function value at each point among the edges attached to it before computing the local\nchanges, and the value assigned to each edge is proportional to its weight.\nDifferentiating $Q(F)$ with respect to $F$, we have\n\n$$\\left. \\frac{\\partial Q}{\\partial F} \\right|_{F = F^*} = F^* - S F^* + \\mu (F^* - Y) = 0,$$\n\nwhich can be transformed into\n\n$$F^* - \\frac{1}{1 + \\mu} S F^* - \\frac{\\mu}{1 + \\mu} Y = 0.$$\n\nLet us introduce two new variables,\n\n$$\\alpha = \\frac{1}{1 + \\mu}, \\quad \\text{and} \\quad \\beta = \\frac{\\mu}{1 + \\mu}.$$\n\nNote that $\\alpha + \\beta = 1$. Then\n\n$$(I - \\alpha S) F^* = \\beta Y.$$\n\nSince $I - \\alpha S$ is invertible, we have\n\n$$F^* = \\beta (I - \\alpha S)^{-1} Y, \\qquad (6)$$\n\nwhich recovers the closed form expression of the above iteration algorithm.\nSimilarly we can develop the optimization frameworks for the variants $F^* = (I - \\alpha P)^{-1} Y$\nand $F^* = (D - \\alpha W)^{-1} Y$. We omit the discussions due to lack of space.\n\n4 Experiments\n\nWe used k-NN and one-vs-rest SVMs as baselines, and compared our method to its two\nvariants: (1) $F^* = (I - \\alpha P)^{-1} Y$; and (2) $F^* = (D - \\alpha W)^{-1} Y$. We also compared to\nZhu et al.\u2019s harmonic Gaussian field method coupled with the Class Mass Normalization\n(CMN) [15], which is closely related to ours. To the best of our knowledge, there is no\nreliable approach for model selection if only very few labeled points are available. 
Hence\nwe let all algorithms use their respective optimal parameters, except that the parameter $\\alpha$\nused in our method and its variants was simply fixed at 0.99.\n\n[Figure 2 panels: (a) t = 10; (b) t = 50; (c) t = 100; (d) t = 400.]\n\nFigure 2: Classification on the pattern of two moons. The convergence process of our\niteration algorithm with t increasing from 1 to 400 is shown from (a) to (d). Note that the\ninitial label information is diffused along the moons.\n\nFigure 3: The real-valued classifying function becomes flatter and flatter with respect to\nthe two moons pattern with increasing t. 
Note that two clear moons emerge in (d).\n\n[Figure 4 panels: (a) SVM (RBF Kernel); (b) Smooth with Global Consistency. Legend: labeled point -1, labeled point +1.]\n\nFigure 4: Smooth classification results given by supervised classifiers with the global\nconsistency: (a) the classification result given by the SVM with an RBF kernel; (b) the\nresult of the SVM smoothed using the consistency method.\n\n4.1 Toy Problem\n\nIn this experiment we considered the toy problem mentioned in Section 1 (Figure 1).\nThe affinity matrix is defined by an RBF kernel but the diagonal elements are set to zero.\nThe convergence process of our iteration algorithm with t increasing from 1 to 400 is\nshown in Figure 2(a)-2(d). Note that the initial label information is diffused along the\nmoons. The assumption of consistency essentially means that a good classifying function\nshould change slowly on the coherent structure aggregated by a large amount of\ndata. This can be illustrated by this toy problem very clearly. Let us define a function\n$f(x_i) = (F^*_{i1} - F^*_{i2}) / (F^*_{i1} + F^*_{i2})$ and accordingly the decision function is $\\mathrm{sign}(f(x_i))$,\nwhich is equivalent to the decision rule described in Section 2. In Figure 3, we show that\n$f(x_i)$ becomes successively flatter with respect to the two moons pattern from Figure 3(a)-3(d)\nwith increasing t. Note that two clear moons emerge in Figure 3(d).\nThe basic idea of our method is to construct a smooth function. It is natural to consider\nusing this method to improve a supervised classifier by smoothing its classifying result. 
In\nother words, we use the classifying result given by a supervised classifier as the input of\nour algorithm. This conjecture is demonstrated by a toy problem in Figure 4. Figure 4(a) is\nthe classification result given by the SVM with an RBF kernel. This result is then assigned\nto $Y$ in our method. The output of our method is shown in Figure 4(b). Note that the points\nclassified incorrectly by the SVM are successfully smoothed by the consistency method.\n\n4.2 Digit Recognition\n\nIn this experiment, we addressed a classification task using the USPS handwritten 16x16\ndigits dataset. We used digits 1, 2, 3, and 4 in our experiments as the four classes. There\nare 1269, 929, 824, and 852 examples in the respective classes, for a total of 3874.\nThe k in k-NN was set to 1. The width of the RBF kernel for SVM was set to 5, and\nfor the harmonic Gaussian field method it was set to 1.25. In our method and its variants,\nthe affinity matrix was constructed by the RBF kernel with the same width used as in\nthe harmonic Gaussian method, but the diagonal elements were set to 0. The test errors\naveraged over 100 trials are summarized in the left panel of Figure 5. Samples were chosen\nso that they contain at least one labeled point for each class. Our consistency method and\none of its variants are clearly superior to the orthodox supervised learning algorithms k-NN\nand SVM, and also better than the harmonic Gaussian method.\nNote that our approach does not require the affinity matrix $W$ to be positive definite. This\nenables us to incorporate prior knowledge about digit image invariance in an elegant way,\ne.g., by using a jittered kernel to compute the affinity matrix [5]. 
Other kernel methods are\nknown to have problems with this method [5]. In our case, jittering by 1 pixel translation\nleads to an error rate around 0.01 for 30 labeled points.\n\n[Figure 5 panels (left and right): test error versus number of labeled points (4 to 50). Legend: k-NN (k = 1), SVM (RBF kernel), harmonic Gaussian, consistency method, variant consistency (1), variant consistency (2).]\n\nFigure 5: Left panel: the error rates of digit recognition with USPS handwritten 16x16\ndigits dataset for a total of 3874 (a subset containing digits from 1 to 4). Right panel: the\nerror rates of text classification with 3970 document vectors in an 8014-dimensional space.\nSamples are chosen so that they contain at least one labeled point for each class.\n\n4.3 Text Classification\n\nIn this experiment, we investigated the task of text classification using the 20-newsgroups\ndataset. We chose the topic rec which contains autos, motorcycles, baseball, and hockey\nfrom the version 20-news-18828. The articles were processed by the Rainbow software\npackage with the following options: (1) passing all words through the Porter stemmer\nbefore counting them; (2) tossing out any token which is on the stoplist of the SMART\nsystem; (3) skipping any headers; (4) ignoring words that occur in 5 or fewer documents.\nNo further preprocessing was done. Removing the empty documents, we obtained 3970\ndocument vectors in an 8014-dimensional space. 
Finally the documents were normalized\ninto TFIDF representation.\nThe distance between points $x_i$ and $x_j$ was defined to be $d(x_i, x_j) = 1 - \\langle x_i, x_j \\rangle / (\\|x_i\\| \\|x_j\\|)$\n[15]. The k in k-NN was set to 1. The width of the RBF kernel for SVM was set to 1.5, and\nfor the harmonic Gaussian method it was set to 0.15. In our methods, the affinity matrix\nwas constructed by the RBF kernel with the same width used as in the harmonic Gaussian\nmethod, but the diagonal elements were set to 0. The test errors averaged over 100 trials\nare summarized in the right panel of Figure 5. Samples were chosen so that they contain at\nleast one labeled point for each class.\nIt is interesting to note that the harmonic method is very good when the number of labeled\npoints is 4, i.e. one labeled point for each class. We think this is because there are almost\nequal proportions of different classes in the dataset, and so with four labeled points, the proportions\nhappen to be estimated exactly. The harmonic method becomes worse, however, if\nslightly more labeled points are used, for instance, 10 labeled points, which leads to pretty\npoor estimation. As the number of labeled points increases further, the harmonic method\nworks well again and somewhat better than our method, since the proportions of classes\nare estimated successfully again. However, our decision rule is much simpler, which in\nfact corresponds to the so-called naive threshold, the baseline of the harmonic method.\n\n5 Conclusion\n\nThe key to semi-supervised learning problems is the consistency assumption, which essentially\nrequires a classifying function to be sufficiently smooth with respect to the intrinsic\nstructure revealed by a huge amount of labeled and unlabeled points. We proposed a simple\nalgorithm to obtain such a solution, which demonstrated effective use of unlabeled data\nin experiments including toy data, digit recognition and text categorization. 
In future\nresearch, we will focus on model selection and theoretical analysis.\n\nAcknowledgments\n\nWe would like to thank Vladimir Vapnik, Olivier Chapelle, Arthur Gretton, and Andre Elisseeff\nfor their help with this work. We also thank Andrew Ng for helpful discussions about\nspectral clustering, and the anonymous reviewers for their constructive comments. Special\nthanks go to Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty who communicated with\nus on the important post-processing step class mass normalization used in their method and\nalso provided us with their detailed experimental data.\n\nReferences\n[1] J. R. Anderson. The architecture of cognition. Harvard Univ. Press, Cambridge, MA, 1983.\n[2] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. Machine Learning\nJournal, to appear.\n[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts.\nIn ICML, 2001.\n[4] O. Chapelle, J. Weston, and B. Sch\u00f6lkopf. Cluster kernels for semi-supervised learning.\nIn NIPS, 2002.\n[5] D. DeCoste and B. Sch\u00f6lkopf. Training invariant support vector machines. Machine\nLearning, 46:161\u2013190, 2002.\n[6] T. Joachims. Transductive learning via spectral graph partitioning. In ICML, 2003.\n[7] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In\nNIPS, 2002.\n[8] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input\nspaces. In ICML, 2002.\n[9] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm.\nIn NIPS, 2001.\n[10] M. Seeger. Learning with labeled and unlabeled data. Technical report, The University\nof Edinburgh, 2000.\n[11] J. Shrager, T. Hogg, and B. A. Huberman. Observation of phase transitions in spreading\nactivation networks. Science, 236:1092\u20131094, 1987.\n[12] A. Smola and R. I. Kondor. Kernels and regularization on graphs. 
In Learning Theory and Kernel Machines, Berlin - Heidelberg, Germany, 2003. Springer Verlag.\n[13] M. Szummer and T. Jaakkola. Partially labeled classification with markov random\nwalks. In NIPS, 2001.\n[14] V. N. Vapnik. Statistical learning theory. Wiley, NY, 1998.\n[15] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian\nfields and harmonic functions. In ICML, 2003.\n", "award": [], "sourceid": 2506, "authors": [{"given_name": "Dengyong", "family_name": "Zhou", "institution": null}, {"given_name": "Olivier", "family_name": "Bousquet", "institution": null}, {"given_name": "Thomas", "family_name": "Lal", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}