{"title": "The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels", "book": "Advances in Neural Information Processing Systems", "page_first": 3016, "page_last": 3024, "abstract": "Link prediction and clustering are key problems for network-structured data. While spectral clustering has strong theoretical guarantees under the popular stochastic blockmodel formulation of networks, it can be expensive for large graphs. On the other hand, the heuristic of predicting links to nodes that share the most common neighbors with the query node is much faster, and works very well in practice. We show theoretically that the common neighbors heuristic can extract clusters w.h.p. when the graph is dense enough, and can do so even in sparser graphs with the addition of a ``cleaning'' step. Empirical results on simulated and real-world data support our conclusions.", "full_text": "The Consistency of Common Neighbors for\nLink Prediction in Stochastic Blockmodels\n\nPurnamrita Sarkar\nDepartment of Statistics\n\nUniversity of Texas at Austin\n\npurnamritas@austin.utexas.edu\n\nDeepayan Chakrabarti\n\nIROM, McCombs School of Business\n\nUniversity of Texas at Austin\n\ndeepay@utexas.edu\n\nPeter Bickel\n\nDepartment of Statistics\n\nUniversity of California, Berkeley\n\nbickel@stat.berkeley.edu\n\nAbstract\n\nLink prediction and clustering are key problems for network-structured\ndata. While spectral clustering has strong theoretical guarantees under\nthe popular stochastic blockmodel formulation of networks, it can be ex-\npensive for large graphs. On the other hand, the heuristic of predicting\nlinks to nodes that share the most common neighbors with the query node\nis much faster, and works very well in practice. We show theoretically that\nthe common neighbors heuristic can extract clusters with high probabil-\nity when the graph is dense enough, and can do so even in sparser graphs\nwith the addition of a “cleaning” step. 
Empirical results on simulated and\nreal-world data support our conclusions.\n\n1 Introduction\n\nNetworks are the simplest representation of relationships between entities, and as such have\nattracted significant attention recently. Their applicability ranges from social networks such\nas Facebook, to collaboration networks of researchers, citation networks of papers, trust\nnetworks such as Epinions, and so on. Common applications on such data include ranking,\nrecommendation, and user segmentation, which have seen wide use in industry. Most of\nthese applications can be framed in terms of two problems: (a) link prediction, where the\ngoal is to find a few nodes similar to a given query node, and (b) clustering, where we want\nto find groups of similar individuals, either around a given seed node or as a full partitioning\nof all nodes in the network.\nAn appealing model of networks is the stochastic blockmodel, which posits the existence of\na latent cluster for each node, with link probabilities between nodes being simply functions\nof their clusters. Inference of the latent clusters allows one to solve both the link prediction\nproblem and the clustering problem (predict all nodes in the query node’s cluster). Strong\ntheoretical and empirical results have been achieved by spectral clustering, which uses the\nsingular value decomposition of the network followed by a clustering step on the eigenvectors\nto determine the latent clusters.\nHowever, singular value decomposition can be expensive, particularly when (a) the graph is\nlarge and (b) many eigenvectors are desired. Unfortunately, both of these are common require-\nments. Instead, many fast heuristic methods are often used, and are empirically observed\nto yield good results [8]. 
One particularly common and effective method is to predict links\nto nodes that share many “common neighbors” with the query node q, i.e., rank nodes\nby |CN(q, i)|, where CN(q, i) = {u | q ∼ u ∼ i} (i ∼ j represents an edge between i\nand j). The intuition is that q probably has many links with others in its cluster, and\nhence probably also shares many common friends with others in its cluster. Counting com-\nmon neighbors is particularly fast (it is a join operation supported by all databases and\nMap-Reduce systems). In this paper, we study the theoretical properties of the common\nneighbors heuristic.\nOur contributions are the following:\n(a) We present, to our knowledge, the first theoretical analysis of common neighbors for\nthe stochastic blockmodel.\n(b) We demarcate two regimes, which we call semi-dense and semi-sparse, under which\ncommon neighbors can be successfully used for both link prediction and clustering.\n(c) In particular, in the semi-dense regime, the number of common neighbors between the\nquery node q and another node within its cluster is significantly higher than that with a\nnode outside its cluster. Hence, a simple threshold on the number of common neighbors\nsuffices for both link prediction and clustering.\n(d) However, in the semi-sparse regime, there are too few common neighbors with any node,\nand hence the heuristic by itself does not work. 
However, we show that with a simple additional\n“cleaning” step, we regain the theoretical properties shown for the semi-dense case.\n(e) We empirically demonstrate the effectiveness of counting common neighbors followed by\nthe “cleaning” post-process on a variety of simulated and real-world datasets.\n\n2 Related Work\n\nLink prediction has recently attracted a lot of attention, because of its relevance to important\npractical problems like recommendation systems, predicting future connections in friendship\nnetworks, better understanding of the evolution of complex networks, and the study of missing\nor partial information in networks [9, 8]. Algorithms for link prediction fall into two main\ngroups: similarity-based and model-based.\nSimilarity-based methods: These methods use similarity measures based on network\ntopology for link prediction. Some methods look at nodes two hops away from the query\nnode: counting common neighbors, the Jaccard index, the Adamic-Adar score [1], etc. More\ncomplex methods include nodes farther away, such as the Katz score [7], and methods based\non random walks [16, 2]. These are often intuitive, easily implemented, and fast, but they\ntypically lack theoretical guarantees.\nModel-based methods: The second approach estimates parametric models for predict-\ning links. Many popular network models fall into the latent variable model category [12, 3].\nThese models assign n latent random variables Z := (Z1, Z2, . . . , Zn) to the n nodes in a net-\nwork. These variables take values in a general space Z. The probability of linkage between\ntwo nodes is specified via a symmetric map h : Z × Z → [0, 1]. The Zi’s can be i.i.d.\nUniform(0,1) [3], or positions in some d-dimensional latent space [12]. In [5], a mixture of\nmultivariate Gaussian distributions is used, one for each cluster. 
A Stochastic Block-\nmodel [6] is a special class of these models, where Zi is a binary length-k vector encoding\nthe membership of a node in a cluster. In a well-known special case (the planted partition\nmodel), all nodes in the same cluster connect to each other with probability α, whereas\nall pairs in different clusters connect with probability γ. In fact, under broad parameter\nregimes, the blockmodel approximation of networks has recently been shown to be analogous\nto the use of histograms as non-parametric summaries of an unknown probability distribu-\ntion [11]. Varying the number of bins or the bandwidth corresponds to varying the number\nor size of communities. Thus blockmodels can be used to approximate more complex models\n(under broad smoothness conditions) if the number of blocks is allowed to increase with\nn.\nEmpirical results: As the models become more complex, they also become computation-\nally demanding. It has been commonly observed that simple and easily computable measures\nlike common neighbors often have competitive performance with more complex methods.\nThis behavior has been empirically established across a variety of networks, ranging from\nco-authorship networks [8] to router-level internet connections, protein-protein interaction\nnetworks, and electrical power grid networks [9].\nTheoretical results: Spectral clustering has been shown to asymptotically recover cluster\nmemberships for variations of Stochastic Blockmodels [10, 4, 13]. However, apart from [15],\nthere is little understanding of why simple methods such as common neighbors perform so\nwell empirically.\nGiven its empirical success and computational tractability, the common neighbors heuris-\ntic is widely applied to large networks. 
Understanding the reasons for the accuracy of\ncommon neighbors under the popular stochastic blockmodel setting is the goal of our work.\n\n3 Proposed Work\n\nMany link prediction methods ultimately make two assumptions: (a) each node belongs to\na latent “cluster”, where nodes in the same cluster have similar behavior; and (b) each node\nis very likely to connect to others in its cluster, so link prediction is equivalent to finding\nother nodes in the cluster. These assumptions can be relaxed: instead of belonging to the\nsame cluster, nodes could have “topic distributions”, with links being more likely between\npairs of nodes with similar topical interests. However, we will focus on the assumptions\nstated above, since they are clean and the relaxations appear to be fundamentally similar.\n\nModel. Specifically, consider a stochastic blockmodel where each node i belongs to an\nunknown cluster ci ∈ {C1, . . . , CK}. We assume that the number of clusters K is fixed as\nthe number of nodes n increases. We also assume that each cluster has nπ members, where\nπ ≜ 1/K, though this can be relaxed easily. The probability P(i ∼ j) of a link between nodes\ni and j (i ≠ j) depends only on the clusters of i and j: P(i ∼ j) = B_{ci,cj} ≜ α·1{ci = cj} + γ·1{ci ≠ cj}\nfor some α > γ > 0; in other words, the probability of a link is α between nodes in the same\ncluster, and γ otherwise. By definition, P(i ∼ i) = 0. If the nodes were arranged so that all\nnodes in a cluster are contiguous, then the corresponding adjacency matrix, when plotted, attains a\nblock-like structure, with the diagonal blocks (corresponding to links within a cluster) being\ndenser than off-diagonal blocks (since α > γ).\nUnder these assumptions, we ask the following two questions:\nProblem 1 (Link Prediction and Recommendation). 
Given node i, how can we identify at\nleast a constant number of nodes from ci?\nProblem 2 (Local Cluster Detection). Given node i, how can we identify all nodes in ci?\nProblem 1 can be considered as the problem of finding good recommendations for a given\nnode i. Here, the goal is to find a few good nodes that i could connect to (e.g., recommending\na few possible friends on Facebook, or a few movies to watch next on Netflix). Since within-\ncluster links have higher probability than across-cluster links (α > γ), predicting nodes from\nci gives the optimal answer. Crucially, it is unnecessary to find all good nodes. In contrast,\nProblem 2 requires us to find everyone in the given node’s cluster. This is the problem\nof detecting the entire cluster corresponding to a given node. Note that Problem 2 is clearly\nharder than Problem 1.\nWe next present a summary of our results and the underlying intuition before delving into\nthe details.\n\n3.1 Intuition and Result Summary\n\nCurrent approaches. Standard approaches to inference for the stochastic blockmodel\nattempt to solve an even harder problem:\nProblem 3 (Full Cluster Detection). How can we identify the latent clusters ci for all i?\nA popular solution is via spectral clustering, involving two steps: (a) computing the top-K\neigenvectors of the graph Laplacian, and (b) clustering the projections of each node on the\ncorresponding eigenspace via an algorithm like k-means [13]. A slight variation of this has\nbeen shown to work as long as (α − γ)/α = Ω(log n/√n) and the average degree grows\nfaster than poly-logarithmic powers of n [10].\nHowever, (a) spectral clustering solves a harder problem than Problems 1 and 2, and (b)\neigen-decompositions can be expensive, particularly for very large graphs. Our claim is that\na simpler operation — counting common neighbors between nodes — can yield results that\nare almost as good in a broad parameter regime.\n\nCommon neighbors. Given a node i, link prediction via common neighbors follows a\nsimple prescription: predict a link to the node j such that i and j have the maximum number\n|CN(i, j)| of shared friends CN(i, j) = {u | i ∼ u ∼ j}. The usefulness of common\nneighbors has been observed in practice [8] and justified theoretically for the latent distance\nmodel [15]. However, its properties under the stochastic blockmodel remained unknown.\nIntuitively, we would expect a pair of nodes i and j from the same cluster to have many\ncommon neighbors u from the same cluster, since both the links i ∼ u and u ∼ j occur with\nprobability α, whereas for ci ≠ cj, at least one of the edges i ∼ u and u ∼ j must have the\nlower probability γ:\n\nP(u ∈ CN(i, j) | ci = cj) = α²·P(cu = ci | ci = cj) + γ²·P(cu ≠ ci | ci = cj)\n= πα² + (1 − π)γ²\n\nP(u ∈ CN(i, j) | ci ≠ cj) = αγ·P(cu = ci or cu = cj | ci ≠ cj) + γ²·P(cu ≠ ci, cu ≠ cj | ci ≠ cj)\n= 2παγ + (1 − 2π)γ² = P(u ∈ CN(i, j) | ci = cj) − π(α − γ)²\n≤ P(u ∈ CN(i, j) | ci = cj)\n\nThus the expected number of common neighbors E[|CN(i, j)|] is higher when ci = cj. If\nwe can show that the random variable |CN(i, j)| concentrates around its expectation, node\npairs with the most common neighbors would belong to the same cluster. Thus, common\nneighbors would offer a good solution to Problem 1.\nWe show conditions under which this is indeed the case. 
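The within- versus across-cluster gap of π(α − γ)² derived above can be checked numerically. The following sketch (ours, not from the paper; the parameter values n = 600, K = 3, α = 0.3, γ = 0.05 are illustrative) samples one planted-partition graph with numpy and compares the empirical common-neighbor counts of a query node against the analytic values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative planted-partition parameters (our choices, not the paper's).
n, K = 600, 3
alpha, gamma = 0.30, 0.05
pi = 1.0 / K
labels = np.repeat(np.arange(K), n // K)

# Symmetric adjacency: P(i ~ j) = alpha within a cluster, gamma across.
probs = np.where(labels[:, None] == labels[None, :], alpha, gamma)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(int)

# The number of common neighbors of q and i is the (q, i) entry of A^2.
CN = A @ A
q = 0  # query node, in cluster 0
same = CN[q][labels == labels[q]][1:]   # drop q itself (index 0)
diff = CN[q][labels != labels[q]]

# Analytic per-node probabilities from the displayed calculation:
p_same = pi * alpha**2 + (1 - pi) * gamma**2
p_diff = 2 * pi * alpha * gamma + (1 - 2 * pi) * gamma**2
assert abs((p_same - p_diff) - pi * (alpha - gamma) ** 2) < 1e-12

print("within-cluster mean:", same.mean(), "expected ~", n * p_same)
print("across-cluster mean:", diff.mean(), "expected ~", n * p_diff)
```

With these (dense) parameters the two empirical means sit near nπα² + n(1 − π)γ² and 2nπαγ + n(1 − 2π)γ² respectively, and are well separated, matching the intuition above.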
There are three key points regarding\nour method: (a) handling dependencies between common neighbor counts, (b) defining the\ngraph density regime under which common neighbors is consistent, and (c) proposing a\nvariant of common neighbors which significantly broadens this region of consistency.\n\nDependence. CN(i, j) and CN(i, j′) are dependent; hence, distinguishing between\nwithin-group and outside-group nodes can be complicated even if each CN(i, j) concen-\ntrates around its expectation. We handle this via a careful conditioning step.\n\nDense versus sparse graphs. In general, the parameters α and γ can be functions of\nn, and we can try to characterize parameter settings for which common neighbors consistently\nreturns nodes from the same cluster as the input node. We show that when the graph is\nsufficiently “dense” (average degree growing faster than √(n log n)), common neighbors is\npowerful enough to answer Problem 2. Also, (α − γ)/α can go to zero at a suitable rate.\nOn the other hand, the expected number of common neighbors between nodes tends to\nzero for sparser graphs, irrespective of whether the nodes are in the same cluster or not.\nFurther, the standard deviation is of a higher order than the expectation, so there is no\nconcentration. In this case, counting common neighbors fails, even for Problem 1.\n\nA variant with better consistency properties. However, we show that the addition\nof an extra post-processing step (henceforth, the “cleaning” step) still enables common\nneighbors to identify nodes from the query node’s own cluster, while reducing the number of off-cluster\nnodes to zero with probability tending to one as n → ∞. This requires a stronger separation\ncondition between α and γ. However, such “strong consistency” is only possible when\nthe average degree grows faster than (n log n)^{1/3}. 
Thus, the cleaning step extends the\nconsistency of common neighbors beyond the O(1/√n) range.\n\n4 Main Results\n\nWe first split the edge set of the complete graph on n nodes into two sets: K1 and its\ncomplement K2 (independent of the given graph G). We compute common neighbors on\nG1 = G ∩ K1 and perform a “cleaning” process on G2 = G ∩ K2. The adjacency matrices\nof G1 and G2 are denoted by A1 and A2. We will fix a reference node q, which belongs to\nclass C1 without loss of generality (recall that there are K clusters C1, . . . , CK, each of size\nnπ).\nLet Xi (i ≠ q) denote the number of common neighbors between q and i. Algorithm 1\ncomputes the set S = {i : Xi ≥ tn} of nodes which have at least tn common neighbors with\nq on A1, whereas Algorithm 2 does a further degree thresholding on A2 to refine S into S1.\n\nAlgorithm 1 Common neighbors screening algorithm\n1: procedure Scan(A1, q, tn)\n2: For 1 ≤ i ≤ n, Xi ← A1²(q, i)\n3: Xq ← 0\n4: S ← {i : Xi ≥ tn}\n5: return S\n\nAlgorithm 2 Post Selection Cleaning algorithm\n1: procedure Clean(S, A2, q, sn)\n2: S1 ← {i : Σ_{j∈S} A2(i, j) ≥ sn}\n3: return S1\n\nTo analyze the algorithms, we must specify conditions on graph densities. Recall that α\nand γ represent within-cluster and across-cluster link probabilities. We assume that α/γ\nis constant while α → 0, γ → 0; equivalently, assume that α and γ are both some\nconstant times ρ, where ρ → 0.\nThe analysis of graphs has typically been divided into two regimes. The dense regime\nconsists of graphs with nρ → ∞, where the expected degree nρ is a fraction of n as n grows.\nIn the sparse regime, nρ = O(1), so degree is roughly constant. 
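For concreteness, Algorithms 1 and 2 can be written out in a few lines of numpy. This is our sketch, not the paper's reference implementation: the planted-partition parameters are illustrative, the screening threshold follows the tn of Theorem 4.1, the cleaning threshold is a midpoint heuristic of our choosing, and a single graph is reused for both steps (as in the paper's experiments) rather than the A1/A2 split used in the analysis:

```python
import numpy as np

def scan(A1, q, t_n):
    # Algorithm 1: X_i = A1^2(q, i) counts common neighbors; keep i with X_i >= t_n.
    X = (A1 @ A1)[q].astype(float)
    X[q] = 0
    return np.flatnonzero(X >= t_n)

def clean(S, A2, q, s_n):
    # Algorithm 2: keep nodes with at least s_n edges (in A2) into the candidate set S.
    return np.flatnonzero(A2[:, S].sum(axis=1) >= s_n)

# Illustrative usage on a small planted partition (our parameters, not the paper's).
rng = np.random.default_rng(1)
n, K, alpha, gamma = 400, 2, 0.40, 0.05
pi = 1.0 / K
labels = np.repeat(np.arange(K), n // K)
probs = np.where(labels[:, None] == labels[None, :], alpha, gamma)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(int)

q = 0
t_n = n * (pi * (alpha + gamma) ** 2 / 2 + (1 - 2 * pi) * gamma ** 2)  # tn from Theorem 4.1
S = scan(A, q, t_n)
S1 = clean(S, A, q, s_n=len(S) * (alpha + gamma) / 2)  # illustrative midpoint threshold
print("screened:", len(S), "after cleaning:", len(S1))
```

In this dense setting, scan alone already returns almost exclusively nodes from q's cluster, and clean leaves the recovered set essentially unchanged; the cleaning step matters in the semi-sparse regime analyzed next.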
Our work explores a finer\ngradation, which we call semi-dense and semi-sparse, defined next.\nDefinition 4.1 (Semi-dense graph). A sequence of graphs is called semi-dense if\nnρ²/log n → ∞ as n → ∞.\nDefinition 4.2 (Semi-sparse graph). A sequence of graphs is called semi-sparse if nρ² → 0\nbut n^{2/3}ρ/log n → ∞ as n → ∞.\nOur first result is that common neighbors is enough to solve not only the link-prediction\nproblem (Problem 1) but also the local clustering problem (Problem 2) in the semi-dense\ncase. This is because even though both nodes within and outside the query node’s cluster\nhave a growing number of common neighbors with q, there is a clear distinction in the\nexpected number of common neighbors between the two classes. Also, since the standard\ndeviation is of a smaller order than the expectation, the random variables concentrate.\nThus, we can pick a threshold tn such that SCAN(A1, q, tn) yields just the nodes in the\nsame cluster as q with high probability. Note that the cleaning step (Algorithm 2) is not\nnecessary in this case.\nTheorem 4.1 (Algorithm 1 solves Problem 2 in semi-dense graphs). Let tn = n(π(α + γ)²/2 + (1 − 2π)γ²).\nLet S be the set of nodes returned by SCAN(A1, q, tn). Let nw and no denote the number of\nnodes in S ∩ C1 and S \ C1 respectively. If the graph is semi-dense, and if\n(α − γ)/α ≥ (2/√π)·(log n/(nα²))^{1/4}, then P(nw = nπ) → 1 and P(no = 0) → 1.\n\nProof Sketch. We only sketch the proof here, deferring details to the supplementary mate-\nrial. Let dqa = Σ_{i∈Ca} A1(q, i) be the number of links from the query node q to nodes in\ncluster Ca. Let dq = {dq1, . . . , dqK} and d = Σa dqa. 
We first show that\n\nP(dq ∈ Good) ≜ P(dq1 ∈ nπα(1 ± ψn) and dqa ∈ nπγ(1 ± ψn) ∀ a ≠ 1) ≥ 1 − K/n²,   (1)\n\nwhere\n\nψn ≜ √((6 log n)/(nπγ)) = √(√(log n/n) · Θ(√(log n/(nρ²)))) → 0.   (2)\n\nConditioned on dq, Xi is the sum of K independent Binomial(dqa, B1a) random variables\nrepresenting the number of common neighbors between q and i via nodes in each of the K\nclusters: E[Xi | dq, i ∈ Ca] = dqa·α + (d − dqa)·γ. We have:\n\nη1 ≜ E[Xi | dq ∈ Good, i ∈ C1] ≥ n(πα² + (1 − π)γ²)(1 − ψn) ≜ ℓn(1 − ψn)\nηa ≜ E[Xi | dq ∈ Good, i ∈ Ca, a ≠ 1] ≤ n(2παγ + (1 − 2π)γ²)(1 + ψn) ≜ un(1 + ψn)\n\nNote that tn = (ℓn + un)/2, un ≤ tn ≤ ℓn, and ℓn − un = nπ(α − γ)² ≥ 4 log n·√(nα²/log n) → ∞,\nwhere we applied the condition on (α − γ)/α noted in the theorem statement. We show:\n\nP(Xi ≤ tn | dq ∈ Good, i ∈ C1) ≤ n^{−4/3+o(1)}\nP(Xi ≥ tn | dq ∈ Good, i ∈ Ca, a ≠ 1) ≤ n^{−4/3+o(1)}\n\nConditioned on dq, both nw and no are sums of conditionally independent and identically\ndistributed Bernoullis. Hence,\n\nP(nw = nπ) ≥ P(dq ∈ Good)·P(nw = nπ | dq ∈ Good) ≥ (1 − K/n²)·(1 − nπ·n^{−4/3}) → 1\nP(no = 0) ≥ P(dq ∈ Good)·P(no = 0 | dq ∈ Good) ≥ 1 − Θ(n^{−1/3}) → 1\n\nThere are two major differences between the semi-sparse and semi-dense cases. First, in the\nsemi-sparse case, both expectations η1 and ηa are of the order O(nρ²), which tends to zero.\nSecond, standard deviations of the number of common neighbors are of a larger order than\nthe expectations. 
Together, this means that the number of common neighbors to within-cluster\nand outside-cluster nodes can no longer be separated; hence, Algorithm 1 by itself cannot\nwork. However, after cleaning, the entire cluster of the query node q can still be recovered.\nTheorem 4.2 (Algorithm 1 followed by Algorithm 2 solves Problem 2 in semi-sparse\ngraphs). Let tn = 1 and sn = n²(πα + (1 − π)γ)²(α + γ)/2. Let S = Scan(A1, q, tn)\nand S1 = Clean(S, A2, q, sn). Let nw^{(c)} and no^{(c)} denote the number of nodes in\nS1 ∩ C1 and S1 \ C1 respectively. If the graph is semi-sparse, and πα ≥ 3(1 − π)γ, then\nP(nw^{(c)} = nπ) → 1 and P(no^{(c)} = 0) → 1.\n\nProof Sketch. We only sketch the proof here, with details being deferred to the supplemen-\ntary material. The degree bounds of Eq. 1 and the equations for E[Xi | dq ∈ Good] hold\neven in the semi-sparse case. 
We can also bound the variances of Xi (which are sums of\nconditionally independent Bernoullis):\n\nvar[Xi | dq ∈ Good, i ∈ C1] ≤ E[Xi | dq ∈ Good, i ∈ C1] = η1\n\nSince the expected number of common neighbors vanishes and the standard deviation is an\norder larger than the expectation, there is no hope for concentration; however, there are\nslight differences in the probability of having at least one common neighbor.\nFirst, by an application of the Paley-Zygmund inequality, we find:\n\np1 ≜ P(Xi ≥ 1 | dq ∈ Good, i ∈ C1)\n≥ E[Xi | dq ∈ Good, i ∈ C1]² / (var(Xi | dq ∈ Good, i ∈ C1) + E[Xi | dq ∈ Good, i ∈ C1]²)\n≥ η1²/(η1 + η1²) ≥ ℓn(1 − ψn)(1 − η1)   since η1 → 0\n\nFor a > 1, Markov’s inequality gives:\n\npa ≜ P(Xi ≥ 1 | dq ∈ Good, i ∈ Ca, a ≠ 1) ≤ E[Xi | dq ∈ Good, i ∈ Ca, a ≠ 1] = ηa\n\nEven though pa → 0, nπpa = Θ(n²ρ²) → ∞, so we can use concentration inequalities like\nthe Chernoff bound again to bound nw and no:\n\nP(nw ≥ nπp1(1 − √(6 log n/(nπp1)))) ≥ 1 − n^{−4/3}\nP(no ≤ n(1 − π)pa(1 + √(6 log n/(n(1 − π)pa)))) ≥ 1 − n^{−4/3}\n\nUnlike the denser regime, nw and no can be of the same order here. Hence, the candidate\nset S returned by thresholding the common neighbors has a non-vanishing fraction of nodes\nfrom outside q’s community. However, this fraction is relatively small, which is what we\nexploit in the cleaning step.\nLet θw and θo denote the expected number of edges in A2 from a node to S. The separation\ncondition in the theorem statement gives θw − θo ≥ 4√(θw log n). 
Setting the degree threshold\nsn = (θw + θo)/2, we bound the probability of mistakes in the cleaning step:\n\nP(∃ i ∈ C1 s.t. Σ_{j∈S} A2(i, j) ≤ sn | dq ∈ Good) ≤ n^{−1/3+o(1)}\nP(∃ i ∉ C1 s.t. Σ_{j∈S} A2(i, j) ≥ sn | dq ∈ Good) ≤ n^{−1/3+o(1)}\n\nRemoving the conditioning on dq ∈ Good (as in Theorem 4.1) yields the desired result.\n\n5 Experiments\n\nWe present our experimental results in two parts. First, we use simulations to support our\ntheoretical claims. Next, we present link prediction accuracies on real-world collaborative\nnetworks to show that common neighbors indeed performs close to gold-standard algorithms\nlike spectral clustering and the Katz score.\nImplementation details: Recall that our algorithms are based on thresholding. When\nthere is a large gap between the common-neighbor counts of node q with nodes inside versus\noutside its cluster (e.g., in the semi-dense regime), this is equivalent to using the k-means\nalgorithm with k = 2 to find S in Algorithm 1. The same holds for finding S1 in Algorithm 2.\nWhen the number of nodes with more than two common neighbors is less than ten, we define\nthe set S by finding all nodes with at least one common neighbor (as in the semi-sparse\nregime). On the other hand, since the cleaning step works only when S is sufficiently large\n(so that degrees concentrate), we do not perform any cleaning when |S| < 30. While we used\nthe split-sample graph A2 in the cleaning step for ease of analysis, we did the cleaning using\nthe same network in the experiments.\nExperimental setup for simulations: We use a stochastic blockmodel of 2000 nodes split\ninto 4 equal-sized clusters. 
For each value of (α, γ) we pick 50 query nodes at random, and\ncalculate the precision and recall of the result against nodes from the query node’s cluster\n(for any subset S and true cluster C, precision = |S ∩ C|/|S| and recall = |S ∩ C|/|C|). We\nreport mean precision and recall over 50 randomly generated graph instances.\nAccuracy on simulated data: Figure 1 shows the precision and recall as degree grows,\nwith the parameters (α, γ) satisfying the condition πα ≥ 3(1 − π)γ of Thm. 4.2. We see\nthat cleaning helps both precision and recall, particularly in the medium-degree range (the\nsemi-sparse regime). As a reference, we also plot the precision of spectral clustering, when\nit was given the correct number of clusters (K = 4). Above an average degree of 10, spectral\nclustering gives perfect precision, whereas common neighbors can identify a large fraction of\nthe true cluster once the average degree is above 25. On the other hand, for average degree less\nthan seven, spectral clustering performs poorly, whereas the precision of common neighbors\nis remarkably higher. Precision is relatively higher than recall for a broad degree regime,\nand this explains why common neighbors are a popular choice for link prediction. On a side\nnote, it is not surprising that in a very sparse graph common neighbors cannot identify the\nwhole cluster, since not everyone can be reached in two hops.\n\nFigure 1 (panels A and B): Recall and precision versus average degree: When degree is very small, none of the\nmethods work well. In the medium-degree range (semi-sparse regime), we see that common\nneighbors gets increasingly better precision and recall, and cleaning helps. With high enough\ndegrees (semi-dense regime), just common neighbors is sufficient and gets excellent accuracy.\n\nTable 1: AUC scores for co-authorship networks\nDataset    n     Mean degree  Time-steps  CN    CN-clean  SPEC  Katz  Random\nHepTH      5969  4            6           .70   .74       .82   .82   .49\nCiteseer   4520  5            11          .88   .89       .89   .95   .52\nNIPS       1222  3.95         9           .63   .69       .68   .78   .47\n\nAccuracy on real-world data: We used publicly available co-authorship datasets over\ntime, where nodes represent authors and an edge represents a collaboration between two\nauthors. In particular, we used subgraphs of the High Energy Physics (HepTH) co-\nauthorship dataset (6 timesteps), the NIPS dataset (9 timesteps) and the Citeseer dataset\n(11 timesteps). We obtain the training graph by merging the first T−2 networks, use the\n(T−1)-th step for cross-validation and use the last timestep as the test graph. The number of\nnodes and average degrees are reported in Table 1. We merged 1-2 years of papers to create\none timestep (so that the median degree of the test graph is at least 1).\nWe compare our algorithm (CN and CN-clean) with the Katz score, which is widely used\nin link prediction [8], and spectral clustering of the network. Spectral clustering is carried\nout on the giant component of the network. Furthermore, we cross-validate the number of\nclusters using the held-out graph. Our setup is very similar to link prediction experiments\nin the related literature [14].\nSince these datasets are unlabeled, we cannot calculate precision or recall as before. Instead,\nfor any score or affinity measure, we propose to perform link prediction experiments as\nfollows. For a randomly picked node we calculate the score from the node to everyone else.\nWe compute the AUC score of this vector against the edges in the test graph. 
We report\nthe average AUC for 100 randomly picked nodes. Table 1 shows that even in sparse regimes\ncommon neighbors performs similarly to the benchmark algorithms.\n\n6 Conclusions\n\nCounting common neighbors is a particularly useful heuristic: it is fast and also works\nwell empirically. We prove the effectiveness of common neighbors for link prediction as\nwell as local clustering around a query node, under the stochastic blockmodel setting. In\nparticular, we show the existence of a semi-dense regime where common neighbors yields\nthe right cluster w.h.p., and a semi-sparse regime where an additional “cleaning” step is\nrequired. Experiments with simulated as well as real-world datasets show the efficacy of\nour approach, including the importance of the cleaning step.\n\nReferences\n[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230,\n2003.\n\n[2] L. Backstrom and J. Leskovec. Supervised random walks: Predicting and recommend-\ning links in social networks. In Proceedings of the Fourth ACM International Conference\non Web Search and Data Mining, pages 635–644, New York, NY, USA, 2011. ACM.\n\n[3] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman–Girvan\nand other modularities. Proceedings of the National Academy of Sciences of the United\nStates of America, 106(50):21068–21073, 2009.\n\n[4] K. Chaudhuri, F. C. Graham, and A. Tsiatas. Spectral clustering of graphs with general\ndegrees in the extended planted partition model. Journal of Machine Learning Research\n- Proceedings Track, 23:35.1–35.23, 2012.\n\n[5] M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Model-based clustering for social\nnetworks. Journal of the Royal Statistical Society: Series A (Statistics in Society),\n170(2):301–354, 2007.\n\n[6] P. W. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social\nNetworks, 5(2):109–137, 1983.\n\n[7] L. Katz. 
A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.\n\n[8] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In\nConference on Information and Knowledge Management. ACM, 2003.\n\n[9] L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A,\n390(6):1150–1170, 2011.\n\n[10] F. McSherry. Spectral partitioning of random graphs. In FOCS, pages 529–537, 2001.\n\n[11] S. C. Olhede and P. J. Wolfe. Network histograms and universality of blockmodel\napproximation. Proceedings of the National Academy of Sciences of the United States\nof America, 111(41):14722–14727, 2014.\n\n[12] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social\nnetwork analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.\n\n[13] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional\nstochastic blockmodel. Annals of Statistics, 39:1878–1915, 2011.\n\n[14] P. Sarkar and P. J. Bickel. Role of normalization in spectral clustering for stochastic\nblockmodels. To appear in the Annals of Statistics, 2014.\n\n[15] P. Sarkar, D. Chakrabarti, and A. Moore. Theoretical justification of popular link\nprediction heuristics. In Conference on Learning Theory. ACM, 2010.\n\n[16] P. Sarkar and A. Moore. A tractable approach to finding closest truncated-commute-\ntime neighbors in large graphs. In Proc. UAI, 2007.\n", "award": [], "sourceid": 1692, "authors": [{"given_name": "Purnamrita", "family_name": "Sarkar", "institution": "UT Austin"}, {"given_name": "Deepayan", "family_name": "Chakrabarti", "institution": "UT Austin"}, {"given_name": "Peter", "family_name": "Bickel", "institution": "UC Berkeley"}]}