{"title": "Fast Prediction on a Tree", "book": "Advances in Neural Information Processing Systems", "page_first": 657, "page_last": 664, "abstract": "Given an $n$-vertex weighted tree with structural diameter $S$ and a subset of $m$ vertices, we present a technique to compute a corresponding $m \\times m$ Gram matrix of the pseudoinverse of the graph Laplacian in $O(n+ m^2 + m S)$ time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. To this end we present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 nodes and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information.", "full_text": "Fast Prediction on a Tree\n\nMark Herbster, Massimiliano Pontil, Sergio Rojas-Galeano\n\nGower Street, London WC1E 6BT, England, UK\n\n{m.herbster, m.pontil,s.rojas}@cs.ucl.ac.uk\n\nDepartment of Computer Science\n\nUniversity College London\n\nAbstract\n\nGiven an n-vertex weighted tree with structural diameter S and a subset of m ver-\ntices, we present a technique to compute a corresponding m \u00d7 m Gram matrix of\nthe pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss\nthe application of this technique to fast label prediction on a generic graph. We\napproximate the graph with a spanning tree and then we predict with the kernel\nperceptron. We address the approximation of the graph with either a minimum\nspanning tree or a shortest path tree. The fast computation of the pseudoinverse\nenables us to address prediction problems on large graphs. 
We present experi-\nments on two web-spam classi\ufb01cation tasks, one of which includes a graph with\n400,000 vertices and more than 10,000,000 edges. The results indicate that the ac-\ncuracy of our technique is competitive with previous methods using the full graph\ninformation.\n\n1 Introduction\nClassi\ufb01cation methods which rely upon the graph Laplacian (see [3, 20, 13] and references therein),\nhave proven to be useful for semi-supervised learning. A key insight of these methods is that unla-\nbeled data can be used to improve the performance of supervised learners. These methods reduce\nto the problem of labeling a graph whose vertices are associated to the data points and the edges to\nthe similarity between pairs of data points. The labeling of the graph can be achieved either in a\nbatch [3, 20] or in an online manner [13]. These methods can all be interpreted as different kernel\nmethods: ridge regression in the case of [3], minimal semi-norm interpolation in [20] or the per-\nceptron algorithm in [13]. This computation scales in the worst case cubically with the quantity of\nunlabeled data, which may prevent the use of these methods on large graphs.\nIn this paper, we propose a method to improve the computational complexity of Laplacian-based\nlearning algorithms. If an n-vertex tree is given, our method requires an O(n) initialization step and\nafter that any m\u00d7 m block of the pseudoinverse of the Laplacian may be computed in O(m2 + mS)\ntime, where S is the structural diameter of the tree. The pseudoinverse of the Laplacian may then\nbe used as a kernel for a variety of label prediction methods. If a generic graph is given, we \ufb01rst\napproximate it with a tree and then run our method on the tree. The use of a minimum spanning tree\nand shortest path tree is discussed.\nIt is important to note that prediction is also possible using directly the graph Laplacian, without\ncomputing its pseudoinverse. 
For example, this may be achieved by solving a linear system of equations [3, 20] involving the Laplacian, and a solution may be computed in O(|E| log^{O(1)} n) time [18], where E is the edge set. However, computation via the graph kernel allows multiple prediction problems on the same graph to be solved more efficiently. The advantage is even more striking if the data arrive sequentially and we need to predict in an online fashion.\nTo illustrate the advantage of our approach, consider the case in which we are provided with a small subset of ℓ labeled vertices of a large graph and we wish to predict the labels of a different subset of p vertices. Let m = ℓ + p and assume that m ≪ n (typically we will also have ℓ ≪ p). A practical application is the problem of detecting “spam” hosts on the internet. Although the number of hosts on the internet is in the millions, we may only need to detect spam hosts from some limited domain. If the graph is a tree, the total time required to predict with the kernel perceptron using our method is O(n + m² + mS). The promise of our technique is that, if m + S ≪ n and a tree is given, it requires O(n) time versus O(n³) for standard methods.\nTo the best of our knowledge this is the first paper which addresses the problem of fast prediction in semi-supervised learning using tree graphs. Previous work has focused on special prediction methods and graphs. The work in [5] presents a non-Laplacian-based method for predicting the labeling of a tree, based on computing the exact probabilities of a Markov random field. The issue of computation time is not addressed there. In the case of unbalanced bipartite graphs, [15] presents a method which significantly improves the computation time of the pseudoinverse to Θ(k²(n − k)), where k is the size of a minority partition. 
Thus, in the case of a binary tree the computation is still Θ(n³) time.\nThe paper is organized as follows. In Section 2 we review the notions which are needed in order to present our technique in Section 3, concerning the fast computation of a tree graph kernel. In Section 4 we address the issue of tree selection, commenting in particular on a potential advantage of shortest path trees. In Section 5 we present the experimental results and draw our conclusions in Section 6.\n\n2 Background\nIn this paper any graph G is assumed to be connected, to have n vertices, and to have edge weights. The set of vertices of G is denoted V = {1, . . . , n}. Let A = (A_ij)_{i,j=1}^n be the n × n symmetric weight matrix of the graph, where A_ij ≥ 0, and define the edge set E(G) := {(i, j) : A_ij > 0, i < j}. We say that G is a tree if it is connected and has n − 1 edges. The graph Laplacian is the n × n matrix defined as G = D − A, where D is the diagonal matrix with i-th diagonal element D_ii = Σ_{j=1}^n A_ij, the weighted degree of vertex i. Where it is not ambiguous, we will use the notation G to denote both the graph G and the graph Laplacian, and the notation T to denote both a Laplacian of a tree and the tree itself. The Laplacian is positive semi-definite and induces the semi-norm ‖w‖²_G := w⊤Gw = Σ_{(i,j)∈E(G)} A_ij (w_i − w_j)². The kernel associated with the above semi-norm is G+, the pseudoinverse of the matrix G; see e.g. [14] for a discussion. As the graph is connected, it follows from the definition of the semi-norm that the null space of G is spanned by the constant vector 1 only.\nThe analogy between graphs and networks of resistors plays an important role in this paper. That is, the weighted graph may be seen as a network of resistors where edge (i, j) is a resistor with resistance π_ij = A_ij⁻¹. Then the effective resistance r_G(i, j) may be defined as the resistance measured between vertices i and j in this network; it may be calculated using Kirchhoff’s circuit laws or directly from G+ using the formula [16]\n\nr_G(i, j) = G+_ii + G+_jj − 2G+_ij .   (2.1)\n\nThe effective resistance is a metric distance on the graph [16], as are the geodesic and structural distances. The structural distance between vertices i, j ∈ V is defined as s_G(i, j) := min{|P(i, j)| : P(i, j) ∈ P}, where P is the set of all paths in G and P(i, j) is the set of edges in a particular path from i to j, whereas the geodesic distance is defined as d_G(i, j) := min{ Σ_{(p,q)∈P(i,j)} π_pq : P(i, j) ∈ P }. The diameter is the maximum distance between any two points on the graph; hence the resistance, structural, and geodesic diameters are defined as R_G = max_{i,j∈V} r_G(i, j), S_G = max_{i,j∈V} s_G(i, j), and D_G = max_{i,j∈V} d_G(i, j), respectively. Note that, by Kirchhoff’s laws, r_G(i, j) ≤ d_G(i, j) and, so, R_G ≤ D_G.\n\n3 Computing the Pseudoinverse of a Tree Laplacian Quickly\nIn this section we describe our method to compute the pseudoinverse of a tree.\n\n3.1 Inverse Connectivity\nLet us begin by noting that the effective resistance is a better measure of connectivity than the geodesic distance: for example, if there are k edge-disjoint paths of geodesic distance d between two vertices, then the effective resistance is no more than d/k. Thus, the more paths, the closer the vertices.\nIn the following, we will introduce three more global measures of connectivity built on top of the effective resistance, which are useful for our computation below. 
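As a quick numerical check of equation (2.1), the following sketch (our illustration, not part of the original paper; it uses NumPy's dense pseudoinverse, which is only feasible for small graphs) recovers the series resistance of a weighted path:

```python
import numpy as np

# Weighted path 0-1-2 with conductances A_01 = 2 and A_12 = 4,
# i.e. edge resistances 1/2 and 1/4 in series.
A = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 4.0],
              [0.0, 4.0, 0.0]])
G = np.diag(A.sum(axis=1)) - A      # graph Laplacian G = D - A
Gp = np.linalg.pinv(G)              # pseudoinverse G^+

def eff_res(Gp, i, j):
    """Effective resistance r_G(i, j) via equation (2.1)."""
    return Gp[i, i] + Gp[j, j] - 2.0 * Gp[i, j]

print(np.isclose(eff_res(Gp, 0, 2), 1/2 + 1/4))   # True: series sum of resistances
```

On a path the two formulations agree exactly, since the unique path between the endpoints carries all the current.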
The first quantity is the total resistance R_tot = Σ_{i>j} r_G(i, j), which is a measure of the inverse connectivity of the graph: the smaller R_tot, the more connected the graph. The second quantity is R(i) = Σ_{j=1}^n r_G(i, j), which is used as a measure of the inverse centrality of vertex i [6, Def. 3] (see also [17]). The third quantity is G+_ii, which provides an alternate notion of inverse centrality. Summing both sides of equation (2.1) over j gives\n\nR(i) = nG+_ii + Σ_{j=1}^n G+_jj ,   (3.1)\n\nwhere we used the fact that Σ_{j=1}^n G+_ij = 0, which is true because the null space of G is spanned by the constant vector. Summing again over i yields\n\nR_tot = n Σ_{i=1}^n G+_ii ,   (3.2)\n\nwhere we have used Σ_{i=1}^n R(i) = 2R_tot. Combining the last two equations we obtain\n\nG+_ii = R(i)/n − R_tot/n² .   (3.3)\n\n3.2 Method\nThroughout this section we assume that G is a tree with corresponding Laplacian matrix T. The principle of the method to compute T+ is that on a tree there is a unique path between any two vertices and, so, the effective resistance is simply the sum of resistances along that path, see e.g. [16, 13] (for the same reason, on a tree the geodesic distance is the same as the resistance distance).\nWe assume that the root vertex is indexed as 1. The parent and the children of vertex i are denoted by ↑(i) and ↓(i), respectively. The descendants of vertex i are denoted by\n\n↓*(i) := ↓(i) ∪ (∪_{j∈↓(i)} ↓*(j)) if ↓(i) ≠ ∅, and ↓*(i) := ∅ if ↓(i) = ∅ .\n\nWe also let κ(i) be the number of descendants of vertex i, plus i itself; that is, κ(i) = 1 + |↓*(i)|.\nThe method is outlined as follows. We initially compute R(1), . . . , R(n) in O(n) time. This in turn gives us R_tot = (1/2) Σ_{i=1}^n R(i) and G+_11, . . . , G+_nn via equation (3.3), also in O(n) time. As we shall see, with these precomputed values, we may obtain off-diagonal elements G+_ij from equation (2.1) by computing individually r_T(i, j) in O(S_T) time, or an m × m block in O(m² + mS_T) time.\n\nInitialization\nWe split the computation of the inverse centrality R(i) into two terms, namely R(i) = T(i) + S(i), where T(i) and S(i) are the sums of the resistances of vertex i to each descendant and non-descendant, respectively. That is,\n\nT(i) = Σ_{j∈↓*(i)} r_T(i, j) ,   S(i) = Σ_{j∉↓*(i)} r_T(i, j) .\n\nWe compute κ(i) and T(i), i = 1, . . . , n, with the following leaves-to-root recursions\n\nκ(i) := 1 + Σ_{j∈↓(i)} κ(j) if ↓(i) ≠ ∅, and κ(i) := 1 if ↓(i) = ∅ ,\nT(i) := Σ_{j∈↓(i)} (T(j) + π_ij κ(j)) if ↓(i) ≠ ∅, and T(i) := 0 if ↓(i) = ∅ ,\n\nby computing κ(1) then T(1) and caching the intermediate values. We next descend the tree caching each calculated S(i) with the root-to-leaves recursion\n\nS(i) := S(↑(i)) + T(↑(i)) − T(i) + (n − 2κ(i)) π_{i,↑(i)} if i ≠ 1, and S(i) := 0 if i = 1 .\n\nIt is clear that the time complexity of the above recursions is O(n).\n\n1. Input: {v_1, . . . , v_m} ⊆ V\n2. Initialization: visited(all) = ∅\n3. for i = 1, . . . , m do\n4.   p = −1; c = v_i; r_T(c, c) = 0\n5.   repeat\n6.     for w ∈ visited(c) \\ ({p} ∪ ↓*(p)) do\n7.       r_T(v_i, w) = r_T(w, v_i) = r_T(v_i, c) + r_T(c, w)\n8.     end\n9.     visited(c) = visited(c) ∪ {v_i}\n10.    p = c; c = ↑(c)\n11.    r_T(v_i, c) = r_T(c, v_i) = r_T(v_i, p) + π_{p,c}\n12.  until (“p is the root”)\n13. end\n\nFigure 1: Computing an m × m block of a tree Laplacian pseudoinverse.\n\nComputing an m × m block of the Laplacian pseudoinverse\nOur algorithm (see Figure 1) computes the effective resistance matrix of an m × m block, which effectively gives the kernel (via equation (2.1)). The motivating idea is that a single effective resistance r_T(i, j) is simply the sum of resistances along the path from i to j. It may be computed by separately ascending the path from i to root and from j to root in O(S_T) time and summing the resistances along each edge that is either in the i-to-root or the j-to-root path but not in both. However, we may amortize the computation of an m × m block to O(m² + mS_T) time, saving a factor of min(m, S_T). This is realized by additionally caching the cumulative sums of resistances along the path to the root during each ascent from a vertex.\nWe outline the algorithm in further detail as follows: for each vertex v_i in the set V_m = {v_1, . . . , v_m} we perform an ascent to the root (see line 3 in Figure 1). As we ascend, we cache each cumulative resistance (from the starting vertex v_i to the current vertex c) along the path on the way to the root (line 11). If, while ascending from v_i, we enter a vertex c which has previously been visited during the ascent from another vertex w (line 6), then we compute r_T(v_i, w) as r_T(v_i, c) + r_T(c, w). For example, during the ascent from vertex v_k ∈ V_m to the root we will compute {r_T(v_1, v_k), . . . , r_T(v_k, v_k)}. The computational complexity is obtained by noting that every ascent to the root requires O(S_T) steps and along each ascent we must compute up to max(m, S_T) resistances. Thus, the total complexity is O(m² + mS_T), assuming that each step of the algorithm is efficiently implemented. 
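To make the recursions and the block computation above concrete, here is a simplified Python sketch (our illustration, not the authors' implementation). It follows the leaves-to-root and root-to-leaves recursions for the O(n) initialization exactly, but, for brevity, recovers each pairwise resistance by an O(S_T) lowest-common-ancestor ascent instead of the amortized caching of Figure 1, giving O(n + m²S_T) overall rather than O(n + m² + mS_T):

```python
import numpy as np
from collections import defaultdict

def tree_kernel_block(parent, res, block):
    """Compute a block of the tree-Laplacian pseudoinverse T^+.

    parent[i] -- parent of vertex i (parent[0] = -1; vertex 0 is the root)
    res[i]    -- resistance pi_{i,parent(i)} of the edge to the parent
    block     -- list of vertices whose sub-matrix of T^+ is wanted
    """
    n = len(parent)
    children = defaultdict(list)
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)
    order, stack = [], [0]            # DFS order: every parent before its children
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    kappa = np.ones(n)                # kappa(i) = 1 + |descendants of i|
    T = np.zeros(n)                   # T(i): sum of resistances to descendants
    for v in reversed(order):         # leaves-to-root recursion
        p = parent[v]
        if p >= 0:
            kappa[p] += kappa[v]
            T[p] += T[v] + res[v] * kappa[v]
    S = np.zeros(n)                   # S(i): sum of resistances to non-descendants
    for v in order:                   # root-to-leaves recursion
        p = parent[v]
        if p >= 0:
            S[v] = S[p] + T[p] - T[v] + (n - 2 * kappa[v]) * res[v]
    R = T + S                         # inverse centralities R(i)
    Rtot = R.sum() / 2.0              # total resistance, cf. eq. (3.2)
    diag = R / n - Rtot / n ** 2      # diagonal entries T^+_{ii}, eq. (3.3)

    d = np.zeros(n)                   # resistance from the root down to each vertex
    for v in order:
        if parent[v] >= 0:
            d[v] = d[parent[v]] + res[v]

    def r(i, j):                      # r(i, j) = d[i] + d[j] - 2 d[lca(i, j)]
        anc, a = set(), i
        while a >= 0:
            anc.add(a)
            a = parent[a]
        b = j
        while b not in anc:
            b = parent[b]
        return d[i] + d[j] - 2.0 * d[b]

    m = len(block)
    K = np.empty((m, m))
    for a, i in enumerate(block):
        for b, j in enumerate(block):
            # invert eq. (2.1): T^+_{ij} = (T^+_{ii} + T^+_{jj} - r(i, j)) / 2
            K[a, b] = (diag[i] + diag[j] - r(i, j)) / 2.0
    return K
```

The returned block agrees with the corresponding sub-matrix of `numpy.linalg.pinv` of the tree Laplacian, up to floating-point error, while never forming an n × n matrix.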
Toward an efficient implementation, we give two notes. First, each of the effective resistances computed by the algorithm should be stored on the tree, preventing the creation of an n × n matrix. When the computation is completed, the desired m × m Gram matrix may then be directly computed by gathering the cached values via an additional set of ascents. Second, it should be ensured that the “for loop” (line 6) is executed in Θ(|visited(c) \\ ({p} ∪ ↓*(p))|) time by a careful but straightforward implementation of the visited predicate. Finally, this algorithm may be generalized to compute a p × ℓ block in O(pℓ + (p + ℓ)S_T) time, or to operate fully “online.”\nLet us return to the practical scenario described in the introduction, in which we wish to predict p vertices of the tree based on ℓ labeled vertices. Let m = ℓ + p. By the above discussion, computation of an m × m block of the kernel matrix T+ requires O(n + m² + mS_T) time. In many practical applications m ≪ n and S_G will typically be no more than logarithmic in n, which leads to an appealing O(n) time complexity.\n\n4 Tree Construction\nIn the previous discussion, we have assumed that a tree has already been given. In the following, we assume that a graph G or a similarity function is given and the aim is to construct an approximating tree. We will consider both the minimum spanning tree (MST), as a “best”-in-norm approximation, and the shortest path tree (SPT), as an approximation which maintains a mistake-bound [13] guarantee.\nGiven a graph with a “cost” on each edge, an MST is a connected n-vertex subgraph with n − 1 edges such that the total cost is minimized. 
In our set-up the cost of edge (i, j) is the resistance π_ij = 1/A_ij; therefore, a minimum spanning tree of G solves the problem\n\nmin{ Σ_{(i,j)∈E(T)} π_ij : T ∈ T(G) } ,   (4.1)\n\nwhere T(G) denotes the set of spanning trees of G. An MST is also a tree whose Laplacian best approximates the Laplacian of the given graph according to the trace norm; that is, it solves the problem\n\nmin{ tr(G − T) : T ∈ T(G) } .   (4.2)\n\nIndeed, we have tr(G − T) = Σ_{i,j=1}^n A_ij − 2 Σ_{(i,j)∈E(T)} π_ij⁻¹. Then, our claim that problems (4.1) and (4.2) have the same solution follows by noting that the edges in a minimum spanning tree are invariant with respect to any strictly increasing function of the “costs” on the edges in the original graph [8], and the function −π⁻¹ is increasing in π.\nThe above observation suggests another approximation criterion which we may consider for finding a spanning tree. We may use the trace norm between the pseudoinverses of the Laplacians, rather than the Laplacians themselves as in (4.2). This seems a more natural criterion, since our goal is to approximate the kernel well (it is the kernel which is directly involved in the prediction problem). It is interesting to note that the quantity tr(T+ − G+) is related to the total resistance. Specifically, we have by equation (3.2) that tr(T+ − G+) = R_tot(T)/n − R_tot(G)/n. As noted in [10], the total resistance is a convex function of the graph Laplacian. However, we do not know how to minimize R_tot(T) over the set of spanning trees of G. We thus take a different route, which leads us to the notion of shortest path trees. We choose a vertex i and look for a spanning tree which minimizes the inverse centrality R(i) of vertex i; that is, we solve the problem\n\nmin{ R(i) : T ∈ T(G) } .   (4.3)\n\nNote that R(i) is the contribution of vertex i to the total resistance of T and that, by equations (3.1) and (3.2), R(i) = nT+_ii + R_tot/n. The above problem can then be interpreted as minimizing a trade-off between the inverse centrality of vertex i and the inverse connectivity of the tree. In other words, (4.3) encourages trees which are centered at i and, at the same time, have a small diameter. It is interesting to observe that the solution of problem (4.3) is a shortest path tree (SPT) centered at vertex i, namely a spanning tree for which the geodesic distance in “costs” is minimized from i to every other vertex in the graph. This is because the geodesic distance is equivalent to the resistance distance on a tree, and any SPT of G is formed from a set of shortest paths connecting the root to every other vertex in G [8, Ch. 24.1].\nLet us observe a fundamental difference between MST and SPT, which provides a justification for approximating the given graph with an SPT. It relies upon the analysis in [13, Theorem 4.2], where the cumulative number of mistakes of the kernel perceptron with the kernel K = G+ + 11⊤ was upper bounded by (‖u‖²_G + 1)(R_G + 1) for consistent labelings [13] u ∈ {−1, 1}^n. To explain our argument, first note that when we approximate the graph with a tree T, the term ‖u‖²_G is always decreasing, while the term R_G is always increasing, by Rayleigh’s monotonicity law (see for example [13, Corollary 3.1]). 
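Both observations above — the trace identity tr(T+ − G+) = R_tot(T)/n − R_tot(G)/n and the Rayleigh monotonicity of resistances under edge removal — are easy to verify numerically on a toy graph. The sketch below (our example, not from the paper) uses a 4-cycle with unit conductances and the spanning path obtained by deleting one edge:

```python
import numpy as np

def lap(A):
    """Graph Laplacian G = D - A."""
    return np.diag(A.sum(axis=1)) - A

def res_matrix(A):
    """All pairwise effective resistances via eq. (2.1)."""
    Gp = np.linalg.pinv(lap(A))
    d = np.diag(Gp)
    return d[:, None] + d[None, :] - 2.0 * Gp

# 4-cycle with unit conductances; T drops edge (3, 0), leaving a path
A_G = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    A_G[u, v] = A_G[v, u] = 1.0
A_T = A_G.copy(); A_T[3, 0] = A_T[0, 3] = 0.0

rG, rT = res_matrix(A_G), res_matrix(A_T)
n = 4
Rtot = lambda r: r.sum() / 2.0   # total resistance: each pair counted once

# Rayleigh monotonicity: deleting an edge can only increase resistances
print(bool(np.all(rT >= rG - 1e-12)))                      # True

# trace identity tr(T+ - G+) = Rtot(T)/n - Rtot(G)/n
lhs = np.trace(np.linalg.pinv(lap(A_T)) - np.linalg.pinv(lap(A_G)))
print(bool(np.isclose(lhs, (Rtot(rT) - Rtot(rG)) / n)))    # True
```

Here R_tot of the cycle is 5 and of the path is 10, so both sides of the trace identity equal 5/4.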
Now, note that the resistance diameter R_T of an SPT of a graph G is bounded by twice the geodesic diameter of the original graph,\n\nR_T ≤ 2D_G .   (4.4)\n\nIndeed, as an SPT is formed from a set of shortest paths between the root and every other vertex in G, for any pair of vertices p, q in the graph there is in the SPT a path from p to the root and then to q, which can be no longer than 2D_G.\nTo discuss further, consider the case in which G consists of a few dense clusters, each uniquely labeled and with only a few cross-cluster edges. The above mistake bound and the bound (4.4) imply that a tree built with an SPT would still have a non-vacuous mistake bound. No such bound as (4.4) holds for an MST subgraph. For example, consider a bicycle-wheel graph whose edge set is the union of n spoke edges {(0, i) : i = 1, . . . , n} and n rim edges {(i, (i mod n) + 1) : i = 1, . . . , n}, with a cost of 2 on the spoke edges and of 1 on the rim edges. An MST diameter may then be as large as n + 1, while any SPT diameter is at most 8.\nFinally, let us comment on the time and space complexity of constructing such trees. The MST and SPT may be constructed with Prim’s and Dijkstra’s algorithms [8], respectively, in O(n log n + |E|) time. Prim’s algorithm may be further sped up to O(n + |E|) time in the case of small integer weights [12]. In the general case of a non-sparse graph or similarity function the time complexity is Θ(n²); however, as both Prim’s and Dijkstra’s are “greedy” algorithms, their space complexity is O(n), which may be a dominant consideration on a large graph.\n\n5 Web-spam Detection Experiments\nIn this section, we present an experimental study of the feasibility of our method on large graphs (400,000 vertices). 
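The bicycle-wheel example from Section 4 above is easy to check empirically. The sketch below (our illustration; the Prim and Dijkstra routines are textbook implementations with arbitrary tie-breaking, so the particular MST found may have weighted diameter anywhere between n − 1 and n + 1) builds the wheel, extracts an MST and a hub-rooted SPT, and compares their weighted diameters:

```python
import heapq
from collections import defaultdict

def wheel(n):
    """Bicycle wheel: hub 0, rim 1..n; spoke cost 2, rim cost 1."""
    adj = defaultdict(list)
    edges = [((0, i), 2.0) for i in range(1, n + 1)]          # spokes
    edges += [((i, i % n + 1), 1.0) for i in range(1, n + 1)]  # rim cycle
    for (u, v), c in edges:
        adj[u].append((v, c)); adj[v].append((u, c))
    return adj

def prim_mst(adj, root=0):
    tree, seen = defaultdict(list), {root}
    heap = [(c, root, v) for v, c in adj[root]]
    heapq.heapify(heap)
    while heap:
        c, u, v = heapq.heappop(heap)
        if v in seen:
            continue
        seen.add(v)
        tree[u].append((v, c)); tree[v].append((u, c))
        for w, cw in adj[v]:
            if w not in seen:
                heapq.heappush(heap, (cw, v, w))
    return tree

def dijkstra_spt(adj, root=0):
    tree, dist, done = defaultdict(list), {root: 0.0}, set()
    heap = [(0.0, root, -1, 0.0)]
    while heap:
        d, v, p, c = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        if p >= 0:
            tree[p].append((v, c)); tree[v].append((p, c))
        for w, cw in adj[v]:
            if w not in done and d + cw < dist.get(w, float("inf")):
                dist[w] = d + cw
                heapq.heappush(heap, (d + cw, w, v, cw))
    return tree

def diameter(tree):
    """Weighted diameter of a tree via the standard two-sweep trick."""
    def farthest(s):
        best, stack = (0.0, s), [(s, -1, 0.0)]
        while stack:
            v, p, d = stack.pop()
            best = max(best, (d, v))
            for w, c in tree[v]:
                if w != p:
                    stack.append((w, v, d + c))
        return best
    _, u = farthest(next(iter(tree)))
    d, _ = farthest(u)
    return d

n = 30
adj = wheel(n)
mst_diam = diameter(prim_mst(adj))
spt_diam = diameter(dijkstra_spt(adj))
print(mst_diam >= n - 1)   # True: any MST here keeps n - 1 unit rim edges
print(spt_diam)            # 4.0: the hub-rooted SPT is a star of cost-2 spokes
```

The MST diameter grows linearly in n (the rim path dominates), while the SPT diameter stays bounded, matching the discussion above.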
The motivation for our methodology is that on graphs with already 10,000 ver-\ntices it is computationally challenging to use standard graph labeling methods such as [3, 20, 13], as\nthey require the computation of the full graph Laplacian kernel. This computational burden makes\nthe use of such methods prohibitive when the number of vertices is in the millions. On the other\nhand, in the practical scenario described in the introduction the computational time of our method\nscales linearly in the number of vertices in the graph and can be run comfortably on large graphs\n(see Figure 2 below) and at worst quadratically if the full graph needs to be labeled.\nThe aims of the experiments are: (i) to see whether there is a signi\ufb01cant performance loss when using\na tree sub-graph rather than the original graph, (ii) to compare tree construction methods, speci\ufb01cally\nthe MST and the SPT and (iii) to exploit the possibility of improving performance through ensembles\nof trees. The initial results are promising in that the performance of the predictor with a single SPT or\nMST is competitive with that of the existing methods, some of which use the full graph information.\nWe shall also comment on the computational time of the method.\n\n5.1 Datasets and previous methods\nWe applied the Fast Prediction on a Tree (FPT) method to the 2007 web-spam challenge developed\nat the University of Paris VI1. Two graphs are provided. The \ufb01rst one is formed by 9,072 vertices\nand 464,959 edges, which represent computer hosts \u2013 we call this the host-graph. In this graph,\none host is \u201cconnected\u201d to another host if there is at least one link from a web-page in the \ufb01rst host\nto a web-page in the other host. The second graph consists of 400,000 vertices (corresponding to\nweb-pages) and 10,455,545 edges \u2013 we call this the web-graph. 
Again, a web-page is “connected” to another web-page if there is at least one hyperlink from the former to the latter. Note that both graphs are directed. In our experiments we discarded directional information and assigned a weight of 1 to unidirectional edges and a weight w ∈ {1, 2} to bidirectional edges. Each vertex is labeled either as spam or as non-spam. In both graphs there are about 80% non-spam vertices and 20% spam ones. Additional tf-idf feature vectors (determined by the web-pages’ HTML content) are provided for each vertex in the graph, but we have discarded this information for simplicity. Following the web-spam protocol, for both graphs we used 10% of the labeled vertices for training and 90% for testing.\nWe briefly discuss some previous methods which participated in the web-spam challenge. Abernethy et al. [1] used an SVM variant on the tf-idf features with an additional graph-based regularization term, which penalizes predictions with links from non-spam to spam vertices. Tang et al. (see [7]) used linear and Gaussian SVMs combined with Random Forests on the feature vectors, plus new features obtained from link information. The method of Witschel and Biemann [4] consisted of iteratively selecting vertices and classifying them with the predominant class in their neighborhood (hence it is very similar to the label propagation method of [20]). Benczúr et al. (see [7]) used Naive Bayes, C4.5 and SVMs with a combination of content and/or graph-based features. Finally, Filoche et al. (see [7]) applied HTML preprocessing to obtain web-page fingerprints, which were used to obtain clusters; these clusters, along with link and content-based features, were then fed to a modified Naive Bayes classifier.\n\n5.2 Results\nExperimental results are shown in Table 1. 
We report the following performance measures: (i) average accuracy when predicting with a single tree, (ii) average accuracy when each predictor is optimized over a threshold in the range [−1, 1], (iii) area under the curve (AUC), and (iv) the aggregate predictive value given by the trees.\n\n1See http://webspam.lip6.fr/wiki/pmwiki.php for more information.\n\nMethod | Agg. | Agg.-Best | AUC | Single | Single-Best | AUC\nHost-graph:\nMST | 0.907 | 0.907 | 0.950 | 0.857±0.022 | 0.865±0.017 | 0.841±0.045\nSPT | 0.889 | 0.890 | 0.952 | 0.850±0.026 | 0.857±0.018 | 0.804±0.063\nMST (bidir) | 0.912 | 0.915 | 0.944 | 0.878±0.033 | 0.887±0.027 | 0.851±0.100\nSPT (bidir) | 0.913 | 0.913 | 0.960 | 0.873±0.028 | 0.877±0.026 | 0.846±0.065\nAbernethy et al. | 0.896 | 0.906 | 0.952 | . . . | . . . | . . .\nTang et al. | 0.906 | 0.907 | 0.951 | . . . | . . . | . . .\nFiloche et al. | 0.889 | 0.890 | 0.927 | . . . | . . . | . . .\nBenczúr et al. | 0.829 | 0.847 | 0.877 | . . . | . . . | . . .\nWeb-graph:\nMST (bidir) | 0.991 | 0.992 | 1.000 | 0.976±0.011 | 0.980±0.009 | 0.993±0.005\nSPT (bidir) | 0.994 | 0.994 | 0.999 | 0.985±0.002 | 0.985±0.002 | 0.992±0.003\nWitschel et al. | 0.995 | 0.996 | 0.998 | . . . | . . . | . . .\nFiloche et al. | 0.973 | 0.974 | 0.991 | . . . | . . . | . . .\nBenczúr et al. | 0.942 | 0.942 | 0.973 | . . . | . . . | . . .\nTang et al. | 0.296 | 0.965 | 0.989 | . . . | . . . | . . .\n\nTable 1: Results of our FPT method and other competing methods.\n\nFigure 2: AUC and Accuracy vs. number of trees (left and middle) and Runtime vs. number of labeled vertices (right).\n\nIn the case of the host-graph, predictions for the aggregate method were made using 81 trees. MST and SPT were obtained for the weighted graphs with Prim’s and Dijkstra’s algorithms, respectively. 
For the unweighted graphs, every tree is an MST, so we simply used trees generated by a randomized unweighted depth-first traversal of the graph; SPTs may be generated with the breadth-first-search algorithm, all in O(|E|) time. In the table, the tag “Agg.” stands for aggregate and the “bidir” tag indicates that the original graph was modified by setting w = 2 for bidirectional edges. In the case of the larger web-graph, we used 21 trees and the modified graph with bidirectional weights. In all experiments we used a kernel perceptron which was trained for three epochs (e.g. [13]).\nIt is interesting to note that some of the previous methods [1, 4] take the full graph information into account. Thus, the above results indicate that our method is statistically competitive (in fact better than most of the other methods) even though the full graph structure is discarded. Remarkably, in the case of the large web-graph, using just a single tree gives very good accuracy, particularly in the case of SPT. On this graph SPT is also more stable in terms of variance than MST. In the case of the smaller host-graph, using just one tree leads to a decrease in performance. However, by aggregating a few trees our result improves over the state-of-the-art results.\nIn order to better understand the effect of the number of trees on the aggregate prediction, we also ran additional experiments on the host-graph with t = 5, 11, 21, 41, 81 randomly chosen MST or SPT trees. We averaged the accuracy and AUC over 100 trials each. Results are shown in Figure 2. As can be seen, using as few as 11 trees already gives competitive performance. SPT works better than MST in terms of AUC (left plot), whereas the result is less clear in the case of accuracy (middle plot).\nFinally, we report on an experiment evaluating the running time of our method. We chose the web-graph (n = 400,000). 
We then fixed p = 1000 predictive vertices and let the number of labeled vertices ℓ vary in the set {20, 40, 60, 80, 100, 200, 400}. Initialization time (tree construction plus computation of the diagonal elements of the kernel) and initialization-plus-prediction time were measured in seconds on a dual-core 1.8GHz machine with 8GB of memory. As expected, the solid curve in Figure 2 (right), corresponding to initialization time, is the dominant contribution to the computation time.\n\n6 Conclusions\nWe have presented a fast method for labeling a tree. The method is simple to implement and, in the practical regime of small labeled and testing sets and diameters, scales linearly in the number of vertices in the tree. When we are presented with a generic undirected weighted graph, we first extract a spanning tree from it and then run the method. We have studied minimum spanning trees and shortest path trees, both of which can be computed efficiently with standard algorithms. We have tested the method on a web-spam classification problem involving a graph of 400,000 vertices. Our results indicate that the method is competitive with the state of the art. We have also shown how performance may be improved by averaging the predictors obtained from a few spanning trees. Further improvement may involve learning combinations of different trees. This may be obtained following ideas in [2]. 
At the same time, it would be valuable to study connections between our work and other approximation methods such as those in the context of kernel methods [9], Gaussian processes [19] and Bayesian learning [11].\n\nAcknowledgments. We wish to thank A. Argyriou and J.-L. Balcázar for valuable discussions, D. Athanasakis and S. Shankar Raman for useful preliminary experimentation, D. Fernandez-Reyes for both useful discussions and computing facility support, and the anonymous reviewers for useful comments. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by EPSRC Grant EP/D071542/1 and by the DHPA Research Councils UK Scheme.\n\nReferences\n[1] J. Abernethy, O. Chapelle and C. Castillo. Webspam Identification Through Content and Hyperlinks. Proc. Adversarial Information Retrieval on the Web, 2008.\n[2] A. Argyriou, M. Herbster, and M. Pontil. Combining graph Laplacians for semi-supervised learning. Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.\n[3] M. Belkin, I. Matveeva, P. Niyogi. Regularization and Semi-supervised Learning on Large Graphs. Proc. 17th Conference on Learning Theory (COLT 2004), pages 624–638, 2004.\n[4] C. Biemann. Chinese Whispers – an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Proc. HLT-NAACL-06 Workshop on Textgraphs-06, 2006.\n[5] A. Blum, J. Lafferty, M. R. Rwebangira, and R. Reddy. Semi-supervised learning using randomized mincuts. Proc. 21st International Conference on Machine Learning, page 13, 2004.\n[6] U. Brandes and D. Fleischer. Centrality measures based on current flow. Proc. 22nd Annual Symposium on Theoretical Aspects of Computer Science, pages 533–544, 2005.\n[7] C. Castillo, B. D. Davison, L. Denoyer and P. Gallinari. Proc. 
of the Graph Labelling Workshop and Web-spam Challenge (ECML Workshop), 2007.\n[8] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.\n[9] P. Drineas and M. W. Mahoney. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. J. Mach. Learn. Res., 6:2153–2175, 2005.\n[10] A. Ghosh, S. Boyd and A. Saberi. Minimizing Effective Resistance of a Graph. SIAM Review, problems and techniques section, 50(1):37–66, 2008.\n[11] T. Jebara. Bayesian Out-Trees. Proc. Uncertainty in Artificial Intelligence, 2008.\n[12] R. E. Haymond, J. Jarvis and D. R. Shier. Algorithm 613: Minimum Spanning Tree for Moderate Integer Weights. ACM Trans. Math. Softw., 10(1):108–111, 1984.\n[13] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. Advances in Neural Information Processing Systems 19, pages 577–584. MIT Press, 2007.\n[14] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. Proc. 22nd International Conference on Machine Learning (ICML 2005), pages 305–312, 2005.\n[15] N.-D. Ho and P. V. Dooren. On the pseudo-inverse of the Laplacian of a bipartite graph. Appl. Math. Lett., 18(8):917–922, 2005.\n[16] D. Klein and M. Randić. Resistance distance. J. of Mathematical Chemistry, 12(1):81–95, 1993.\n[17] M. E. J. Newman. A measure of betweenness centrality based on random walks. Soc. Networks, 27:39–54, 2005.\n[18] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. Proc. 36th Annual ACM Symposium on Theory of Computing, 2004.\n[19] C. K. I. Williams and M. Seeger. Using the Nyström Method to Speed Up Kernel Machines. Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.\n[20] X. Zhu, J. Lafferty, and Z. Ghahramani. 
Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. Proc. 20th International Conference on Machine Learning, pages 912–919, 2003.", "award": [], "sourceid": 824, "authors": [{"given_name": "Mark", "family_name": "Herbster", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}, {"given_name": "Sergio", "family_name": "Galeano", "institution": null}]}