{"title": "From which world is your graph", "book": "Advances in Neural Information Processing Systems", "page_first": 1469, "page_last": 1479, "abstract": "Discovering statistical structure from links is a fundamental problem in the analysis of social networks. Choosing a misspecified model, or equivalently, an incorrect inference algorithm will result in an invalid analysis or even falsely uncover patterns that are in fact artifacts of the model. This work focuses on unifying two of the most widely used link-formation models: the stochastic block model (SBM) and the small world (or latent space) model (SWM). Integrating techniques from kernel learning, spectral graph theory, and nonlinear dimensionality reduction, we develop the first statistically sound polynomial-time algorithm to discover latent patterns in sparse graphs for both models. When the network comes from an SBM, the algorithm outputs a block structure. When it is from an SWM, the algorithm outputs estimates of each node's latent position.", "full_text": "From which world is your graph?\n\nCheng Li\n\nCollege of William & Mary\n\nFelix M. F. Wong\n\nIndependent Researcher\u2217\n\nZhenming Liu\n\nCollege of William & Mary\n\nVarun Kanade\n\nUniversity of Oxford\n\nAbstract\n\nDiscovering statistical structure from links is a fundamental problem in the anal-\nysis of social networks. Choosing a misspeci\ufb01ed model, or equivalently, an incor-\nrect inference algorithm will result in an invalid analysis or even falsely uncover\npatterns that are in fact artifacts of the model. This work focuses on unifying two\nof the most widely used link-formation models: the stochastic blockmodel (SBM)\nand the small world (or latent space) model (SWM). Integrating techniques from\nkernel learning, spectral graph theory, and nonlinear dimensionality reduction, we\ndevelop the \ufb01rst statistically sound polynomial-time algorithm to discover latent\npatterns in sparse graphs for both models. 
When the network comes from an SBM,\nthe algorithm outputs a block structure. When it is from an SWM, the algorithm\noutputs estimates of each node\u2019s latent position.\n\n1\n\nIntroduction\nDiscovering statistical structures from links is a fundamental problem in the analysis of social\nnetworks. Connections between entities are typically formed based on underlying feature-based\nsimilarities; however these features themselves are partially or entirely hidden. A question of great\ninterest is to what extent can these latent features be inferred from the observable links in the net-\nwork. This work focuses on the so-called assortative setting, the principle that similar individuals\nare more likely to interact with each other. Most stochastic models of social networks rely on this as-\nsumption, including the two most famous ones \u2013 the stochastic blockmodel [1] and the small-world\nmodel [2, 3], described below.\nStochastic Blockmodel (SBM). In a stochastic blockmodel [4, 5, 6, 7, 8, 9, 10, 11, 12, 13], nodes\nare grouped into disjoint \u201ccommunities\u201d and links are added randomly between nodes, with a higher\nprobability if nodes are in the same community. In its simplest incarnation, an edge is added between\nnodes within the same community with probability p, and between nodes in different communities\nwith probability q, for p > q. Despite arguably na\u00a8\u0131ve modelling choices, such as the independence\nof edges, algorithms designed with SBM work well in practice [14, 15].\nSmall-World Model (SWM). In a small-world model, each node is associated with a latent variable\nxi, e.g., the geographic location of an individual. The probability that there is a link between two\nnodes is proportional to an inverse polynomial of some notion of distance, dist(xi, xj), between\nthem. 
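The link rule just described can be simulated directly; below is a minimal sketch (not the paper's algorithm) that samples a 1-dimensional small-world graph. The constants delta and c0 are illustrative and chosen so that the kernel value is always a valid probability:

```python
import numpy as np

def swm_kernel(d, delta=2.0, c0=1.0):
    # Link probability decays as an inverse polynomial of latent distance;
    # with c0 >= 1 the value always lies in (0, 1].
    return 1.0 / (d ** delta + c0)

def sample_swm(n, delta=2.0, c0=1.0, seed=0):
    # Latent positions drawn uniformly from [0, 1]; edges drawn independently.
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    dist = np.abs(x[:, None] - x[None, :])
    p = swm_kernel(dist, delta, c0)
    upper = np.triu(rng.uniform(size=(n, n)) < p, k=1)
    a = (upper | upper.T).astype(int)   # symmetric adjacency, no self-loops
    return x, a

x, A = sample_swm(300)
```

Nodes that are close in latent space are more likely to be linked, which is exactly the assortativity assumption discussed above.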
The presence of a small number of \u201clong-range\u201d connections is essential to some of the most\nintriguing properties of these networks, such as small diameter and fast decentralized routing algo-\nrithms [3]. In general, the latent position may re\ufb02ect geographic location as well as more abstract\nconcepts, e.g., position on a political ideology spectrum.\nThe Inference Problem. Without observing the latent positions, or knowing which model generates\nthe underlying graph, the adjacency matrix of a social graph typically looks like the one shown in\n\n\u2217Currently at Google.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFig. 5(a) (App. A.1). However, if the model generating the graph is known, it is then possible to\nrun a suitable \u201cclustering algorithm\u201d [14, 16] that reveals the hidden structure. When the vertices\nare ordered suitably, the SBM\u2019s adjacency matrix looks like the one shown in Fig. 5(b) (App. A.1)\nand that of the SWM looks like the one shown in Fig. 5(c) (App. A.1). Existing algorithms typically\ndepend on knowing the \u201ctrue\u201d model and are tailored to graphs generated according to one of these\nmodels, e.g., [14, 16, 17, 18].\nOur Contributions. We consider a latent space model that is general enough to include both these\nmodels as special cases.\nIn our model, an edge is added between two nodes with a probability\nthat is a decreasing function of the distance between their latent positions. This model is a fairly\nnatural one, and it is quite likely that a variant has already been studied; however, to the best of\nour knowledge there is no known statistically sound and computationally ef\ufb01cient algorithm for\nlatent-position inference on a model as general as the one we consider.\n\n1. A uni\ufb01ed model. 
We propose a model that is a natural generalization of both the stochastic\nblockmodel and the small-world model that captures some of the key properties of real-world social\nnetworks, such as small out-degrees for ordinary users and large in-degrees for celebrities. We focus\non a simpli\ufb01ed model where we have a modest degree graph only on \u201ccelebrities\u201d; the full paper\nmaterial contains an analysis of the more realistic model using somewhat technical machinery [19].\n\n2. A provable algorithm. We present statistically sound and polynomial-time algorithms for inferring\nlatent positions in our model(s). Our algorithm approximately infers the latent positions of almost all\n\u201ccelebrities\u201d (1 \u2212 o(1)-fraction), and approximately infers a constant fraction of the latent positions\nof ordinary users. We show that it is statistically impossible to err on at most o(1) fraction of\nordinary users by using standard lower bound arguments.\n\n3. Proof-of-concept experiments. We report several experiments on synthetic and real-world data\ncollected on Twitter from Oct 1 and Nov 30, 2016. Our experiments demonstrate that our model and\ninference algorithms perform well on real-world data and reveal interesting structures in networks.\nAdditional Related Work. We brie\ufb02y review the relevant published literature. 1. Graphon &\nLatent-space techniques. Studies using graphons and latent-space models have focused on the sta-\ntistical properties of the estimators [20, 21, 22, 23, 24, 25, 26, 27, 28], with limited attention paid\nto computational ef\ufb01ciency. The \u201cUSVT\u201d technique developed recently [29] estimates the kernel\nwell when the graph is dense. Xu et al. [30] consider a polynomial time algorithm for a sparse\nmodel similar to ours, but focus on edge classi\ufb01cation rather than latent position estimation. 2.\nCorrespondence analysis in political science. 
Estimating the ideology scores of politicians is an important research topic in political science [31, 32, 33, 34, 35, 36, 17, 18]. High-accuracy heuristics developed to analyze dense graphs include [17, 18].\nOrganization. Section 2 describes background, our model and results. Section 3 describes our algorithm and gives an overview of its analysis. Section 4 contains the experiments.\n\n2 Preliminaries and Summary of Results\nBasic Notation. We use c0, c1, etc. to denote constants, which may be different in each case. We use whp to denote with high probability, by which we mean with probability larger than 1 \u2212 1/n^c for any constant c. All notation is summarized in Appendix B for quick reference.\nStochastic Blockmodel. Let n be the number of nodes in the graph, with each node assigned a label from the set {1, . . . , k} uniformly at random. An edge is added between two nodes with the same label with probability p and between nodes with different labels with probability q, with p > q (assortative case). In this work, we focus on the k = 2 case, where p, q = \u2126((log n)^c/n) and the community sizes are exactly the same. (Many studies of the regimes where recovery is possible have been published [37, 9, 5, 8].)\n\nLet A be the adjacency matrix of the realized graph and let M = E[A] = [P Q; Q P] in block form, where P and Q \u2208 R^{n/2 \u00d7 n/2} have every entry equal to p and q, respectively. We next explain the inference algorithm, which uses two key observations. 1. Spectral Properties of M. M has rank 2 and the non-trivial eigenvectors are (1, . . . , 1)^T and (1, . . . , 1, \u22121, . . . , \u22121)^T, corresponding to eigenvalues n(p + q)/2 and n(p \u2212 q)/2, respectively. If one has access to M, the hidden structure in the graph is revealed merely by reading off the second eigenvector. 2. Low Discrepancy between A and M. 
Provided the average degree n(p + q)/2 and the gap p \u2212 q are large enough, the spectrum and eigenspaces of the matrices A and M can be shown to be close using matrix concentration inequalities and the Davis-Kahan theorem [38, 39]. Thus, it is sufficient to look at the projection of the columns of A onto the top two eigenvectors of A to identify the hidden latent structure.\nSmall-World Model (SWM). In a 1-dim. SWM, each node vi is associated with an independent latent variable xi \u2208 [0, 1] drawn from the uniform distribution on [0, 1]. The probability of a link between two nodes is Pr[{vi, vj} \u2208 E] \u221d 1/(|xi \u2212 xj|^\u2206 + c0), where \u2206 > 1 is a hyper-parameter.\n\nThe inference algorithm for small-world models uses different ideas. Each edge in the graph is considered either \u201cshort-range\u201d or \u201clong-range.\u201d Short-range edges are those between nodes that are nearby in latent space, while long-range edges have end-points that are far apart in latent space. After removing the long-range edges, the shortest-path distance between two nodes scales proportionally to the corresponding latent-space distance (see Fig. 6 in App. A.2). After obtaining estimates for pairwise distances, standard building blocks are used to find the latent positions xi [40]. The key observation used to remove the long-range edges is: an edge {vi, vj} is a short-range edge if and only if vi and vj share many neighbors.\nA Unified Model. Both SBM and SWM are special cases of our unified latent space model. We begin by describing the full-fledged bipartite (heterogeneous) model, which is a better approximation of real-world networks but requires sophisticated algorithmic techniques (see [19] for a detailed analysis). Next, we present a simplified (homogeneous) model to explain the key ideas.\n\nBipartite Model. We use a latent-space model to characterize the stochastic interactions between users. 
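Returning to the SBM routine sketched above (sign readout of the second eigenvector of A), a minimal self-contained illustration follows; the sizes and probabilities are illustrative, and this is the textbook routine rather than the paper's full algorithm:

```python
import numpy as np

def sbm_second_eigvec_labels(n=400, p=0.5, q=0.1, seed=1):
    # Sample a 2-community SBM with equal community sizes.
    rng = np.random.default_rng(seed)
    truth = np.repeat([0, 1], n // 2)
    prob = np.where(truth[:, None] == truth[None, :], p, q)
    upper = np.triu(rng.uniform(size=(n, n)) < prob, k=1)
    a = (upper | upper.T).astype(float)
    # eigh returns eigenvalues in ascending order; column -2 is the
    # eigenvector of the second-largest eigenvalue, approx. n(p - q)/2.
    vals, vecs = np.linalg.eigh(a)
    labels = (vecs[:, -2] > 0).astype(int)
    return truth, labels

truth, labels = sbm_second_eigvec_labels()
# Recovery is only defined up to a global label flip.
acc = max((truth == labels).mean(), (truth != labels).mean())
```

With this separation between p and q the second eigenvector recovers the planted partition almost exactly, matching the "read off the second eigenvector" intuition above.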
Each individual is associated with a latent variable in [0, 1]. The bipartite graph model consists of two types of users: the left side of the graph, Y = {y1, . . . , ym}, are the followers (ordinary users) and the right side, X = {x1, . . . , xn}, are the influencers (celebrities). Both yi and xi are i.i.d. random variables from a distribution D. This assumption follows the convention of existing heterogeneous models [41, 42]. The probability that two individuals yi and xj interact is \u03ba(yi, xj)/n, where \u03ba : [0, 1] \u00d7 [0, 1] \u2192 (0, 1] is a kernel function. Throughout this paper we assume that \u03ba is a small-world kernel, i.e., \u03ba(x, y) = c0/(\u2016x \u2212 y\u2016^\u2206 + c1) for some \u2206 > 1 and suitable constants c0, c1, and that m = \u0398(n \u00b7 polylog(n)). Let B \u2208 R^{m\u00d7n} be a binary matrix with Bi,j = 1 if and only if there is an edge between yi and xj. Our goal is to estimate {xi}i\u2208[n] based on B for suitably large n.\nSimplified Model. The graph contains only the node set X = {x1, ..., xn} of celebrity users. Each xi is again an i.i.d. random variable from D. The probability that two users vi and vj interact is \u03ba(xi, xj)/C(n). The denominator is a normalization term that controls the edge density of the graph. We assume C(n) = n/polylog(n), i.e., the average degree is polylog(n). Unlike the SWM, where the xi are drawn uniformly from [0, 1], the unified model allows D to be flexible. When D is the uniform distribution, the model is the standard SWM. When D has discrete support (e.g., xi = 0 with prob. 1/2 and xi = 1 otherwise), the unified model reduces to the SBM. Our distribution-agnostic algorithm can automatically select the most suitable model from SBM and SWM, and infer the latent positions of (almost) all the nodes.\n\nBipartite vs. Simplified Model. 
The simpli\ufb01ed model suffers from the following problem: If the\naverage degree is O(1), then we err on estimating every individual\u2019s latent position with a constant\nprobability (e.g., whp the graph is disconnected), but in practice we usually want a high prediction\naccuracy on the subset of nodes corresponding to high-pro\ufb01le users. Assuming that the average\ndegree is \u03c9(1) mismatches empirical social network data. Therefore, we use a bipartite model that\nintroduces heterogeneity among nodes: By splitting the nodes into two classes, we achieve high\nestimation accuracy on the in\ufb02uencers and the degree distribution more closely matches real-world\ndata. For example, in most online social networks, nodes have O(1) average degree, and a small\nfraction of users (in\ufb02uencers) account for the production of almost all \u201ctrendy\u201d content while most\nusers (followers) simply consume the content.\n\nAdditional Remarks on the Bipartite Model. 1. Algorithmic contribution. Our algorithm com-\nputes BTB and then regularizes the product by shrinking the diagonal entries before carrying out\nspectral analysis. Previous studies of the bipartite graph in similar settings [43, 44, 45] attempt to\nconstruct a regularized product using different heuristics. Our work presents the \ufb01rst theoretically\nsound regularization technique for spectral algorithms. In addition, some studies have suggested run-\nning SVD on B directly (e.g., [28]). We show that the (right) singular vectors of B do not converge\n\n3\n\n\fto the eigenvectors of K (the matrix with entries \u03ba(xi, xj)). Thus, it is necessary to take the product\nand use regularization. 2. Comparison to degree-corrected models (DCM). In DCM, each node vi is\nassociated with a degree parameter D(vi). 
Then we have Pr[{vi, vj} \u2208 E] \u221d D(vi)\u03ba(xi, xj)D(vj). The DCM model implies that the subgraph induced by the highest-degree nodes is dense, which is inconsistent with real-world networks. There is a need for better tools to analyze the asymptotic behavior of such models and we leave this for future work (see, e.g., [41, 42]).\nTheoretical Results. Let F be the cdf of D. We say F and \u03ba are well-conditioned if:\n(1) F has finitely many points of discontinuity, i.e., the closure of the support of F can be expressed as the union of non-overlapping closed intervals I1, I2, ..., Ik for a finite number k.\n(2) F is near-uniform, i.e., for any interval I that has non-empty overlap with F\u2019s support, \u222b_I dF(x) \u2265 c0|I| for some constant c0.\n(3) Decay Condition: the eigenvalues of the integral operator based on \u03ba and F decay sufficiently fast. Define Kf(x) = \u222b \u03ba(x, x\u2032)f(x\u2032) dF(x\u2032) and let (\u03bbi)i\u22651 denote the eigenvalues of K. We require that \u03bbi = O(i^\u22122.5).\n\nIf we use the small-world kernel \u03ba(x, y) = c0/(|x \u2212 y|^\u2206 + c1) and choose an F that gives rise to the SBM or the SWM, in each case the pair F and \u03ba is well-conditioned, as described below. As the decay condition is slightly more involved, we comment upon it. The condition is a mild one. When F is uniformly distributed on [0, 1], it is equivalent to requiring K to be twice differentiable, which is true for the small-world kernel. When F has finite discrete support, there are only finitely many non-zero eigenvalues, i.e., the condition also holds. The decay condition holds in more general settings, e.g., when F is piecewise linear [46] (see [19]). Without the decay condition, we would require much stronger assumptions: either the graph is very dense or \u2206 \u226b 2. Neither assumption is realistic, so without the decay condition our algorithm effectively fails to work. 
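The decay condition can be probed numerically: the eigenvalues of the empirical kernel matrix K/n approximate those of the integral operator K. A small sketch with the small-world kernel and a uniform F on [0, 1] (the grid size and constants delta, c0, c1 are illustrative):

```python
import numpy as np

n = 400
x = (np.arange(n) + 0.5) / n                  # grid approximating uniform F on [0, 1]
delta, c0, c1 = 2.0, 1.0, 1.0
# Small-world kernel matrix K_{ij} = c0 / (|x_i - x_j|^delta + c1).
K = c0 / (np.abs(x[:, None] - x[None, :]) ** delta + c1)
# Eigenvalues of K/n approximate the operator eigenvalues (descending order).
lam = np.linalg.eigvalsh(K / n)[::-1]
```

For this smooth kernel the spectrum is dominated by its first few eigenvalues, consistent with the required polynomial decay; in practice one would inspect lam on a log-log plot as described next.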
In practice, whether the decay condition is satisfied can be checked by making a log-log plot, and it has been observed that for several real-world networks the eigenvalues follow a power-law distribution [47].\n\nNext, we define the notion of latent position recovery for our algorithms.\n\nDefinition 2.1 ((\u03b1, \u03b2, \u03b3)-Approximation Algorithm). Let Ii, F, and K be defined as above, and let Ri = {xj : xj \u2208 Ii}. An algorithm is called an (\u03b1, \u03b2, \u03b3)-approximation algorithm if\n1. It outputs a collection of disjoint sets of points C1, C2, . . . , Ck such that Ci \u2286 Ri, which correspond to subsets of reconstructed latent variables.\n2. For each Ci, it produces a distance matrix D(i). Let Gi \u2286 Ci be such that for any ij, ik \u2208 Gi,\n\nD(i)_{ij,ik} \u2264 |x_{ij} \u2212 x_{ik}| \u2264 (1 + \u03b2) D(i)_{ij,ik} + \u03b3. (1)\n\n3. |\u222ai Gi| \u2265 (1 \u2212 \u03b1)n.\n\nIn bipartite graphs, Eq. (1) is required only for influencers.\n\nWe do not attempt to optimize constants in this paper. We set \u03b1 = o(1), \u03b2 a small constant, and \u03b3 = o(1). Definition 2.1 allows two types of errors: the Ci are not required to form a partition, i.e., some nodes can be left out, and a small fraction of estimation errors is allowed in each Ci, e.g., if xj = 0.9 but x\u0302j = 0.2, then the j-th \u201crow\u201d in D(i) is incorrect. To interpret the definition, consider the blockmodel with 2 communities. Condition 1 means that our algorithm outputs two disjoint groups of points; each group corresponds to one block. Condition 2 means that there are pairwise distance estimates within each group. Since the true distances for nodes within the same block are zero, our estimates must also be zero to satisfy Eq. (1). Condition 3 says that the proportion of misclassified nodes is \u03b1 = o(1). We can also interpret the definition when we consider a small-world graph, in which case k = 1. 
The algorithm outputs pairwise distances for a subset C1. We\nknow that there is a suf\ufb01ciently large G1 \u2286 C1 such that the pairwise distances are all correct in C1.\nOur algorithm does not attempt to estimate the distance between Ci and Cj for i (cid:54)= j. When\nthe support contains multiple disjoint intervals, e.g., in the SBM case, it \ufb01rst pulls apart the nodes in\ndifferent communities. Estimating the distance between intervals, given the output of our algorithm\nis straightforward. Our main result is the following.\nTheorem 2.2. Using the notation above, assume F and \u03ba are well-conditioned, and C(n) and\nm/n are \u2126(logc n) for some suitably large c. The algorithm for the simpli\ufb01ed model shown in\nFigure 1 and that for the bipartite model (appears in [19]) give us an (1/ log2 n, \u0001, O(1/ log n))-\napproximation algorithm w.h.p. for any constant \u0001. Furthermore, the distance estimates D(i) for\neach Ci are constructed using the shortest path distance of an unweighted graph.\n\n4\n\n\f3 // Step 2. Execute isomap algo.\n\nLATENT-INFERENCE(A)\n1 // Step 1. Estimate \u03a6 .\n\n2 (cid:98)\u03a6 = SM-EST(A).\n4 D = ISOMAP-ALGO((cid:98)\u03a6)\nISOMAP-ALGO((cid:98)\u03a6, (cid:96))\n1 Execute S \u2190 DENOISE((cid:98)\u03a6) (See Section 3.2)\n\n5 // Step 3. Find latent variables.\n6 Run a line embedding algorithm [48, 49].\n\n[ \u02dcUA, \u02dcSA, \u02dcVA] = svd(A).\n\nSM-EST(A, t)\n1\n2 Let also \u03bbi be i-th singular value of A.\n3 // let t be a suitable parameter.\n4 d = DECIDETHRESHOLD(t, \u03c1(n)).\n5 SA: diagonal matrix comprised of {\u03bbi}i\u2264d\n6 UA, VA: the singular vectors\n7\n\n8 Let(cid:98)\u03a6 =(cid:112)C(n)UAS1/2\nreturn(cid:98)\u03a6\n\ncorresponding to SA.\n\nA .\n\n9\n\n2 // S is a subset of [n].\n3 Build G = {S, E} s.t. 
{i, j} \u2208 E iff\n4\n5 Compute D such D(i, j) is the shortest\n6\n7\n\n|( \u02dc\u03a6d)i \u2212 ( \u02dc\u03a6d)j| \u2264 (cid:96)/ log n ((cid:96) a constant).\npath distance between i and j when i, j \u2208 S.\n\nDECIDETHRESHOLD(t, \u03c1(n))\n1 // This procedure decides d the number\n2\n3 // t is a tunable parameter. See Proposition 3.1.\n4 d = arg maxd{\u03bbd( A\n\u03c1(n) ) \u2265 \u03b8}.\n5 where \u03b8 = 10(t/\u03c1(n))24/59\nFigure 1: Subroutines of our Latent Inference Algorithm.\n\n\u03c1(n) ) \u2212 \u03bbd+1( A\n\nof Eigenvectors to keep.\n\nreturn D\n\nPairwise Estimation to Line-embedding and High-dimensional Generalization. Our algo-\nrithm builds estimates on pairwise latent distance and uses well-studied metric-embedding meth-\nods [48, 49] as blackboxes to infer latent positions. Our inference algorithm can be generalized to\np becomes increasingly\nd-dimensional space with d being a constant. But the metric-embedding on (cid:96)d\ndif\ufb01cult, e.g., when d = 2, the approximation ratio for embedding a graph is \u2126(\n\nn) [50].\n\n\u221a\n\n3 Our algorithms\n\nK ( \u02dcUA \u02dcSA \u02dcV T\n\nAs previously noted, SBM and SWM are special cases of our uni\ufb01ed model and both require\ndifferent algorithmic techniques. Given that it is not surprising that our algorithm blends ingredients\nfrom both sets of techniques. Before proceeding, we review basics of kernel learning.\nNotation. Let A be the adjacency matrix of the observed graph (simpli\ufb01ed model) and let \u03c1(n) (cid:44)\nn/C(n). Let K be the matrix with entries \u03ba(xi, xj). Let \u02dcUK \u02dcSK \u02dcV T\nA ) be the SVD of K\n(A). Let d be a parameter to be chosen later. Let SK (SA) be a d\u00d7 d diagonal matrix comprising the\nd-largest eigenvalues of K (A). Let UK (UA) and VK (VA) be the corresponding singular vectors of\nK (A). Finally, let \u00afK = UKSKV T\nA ) be the low-rank approximation of K (A). 
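The rank-d truncation used in these definitions is a standard computation, and it is also why truncating helps: projecting a noisy observation onto its top singular directions discards most of the noise. A small sketch with illustrative sizes (the matrices here are synthetic, not the paper's A and K):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 3
X = rng.normal(size=(n, d))
K = X @ X.T                                   # a rank-3 PSD "kernel-like" matrix
A = K + 0.01 * rng.normal(size=(n, n))
A = (A + A.T) / 2                             # noisy symmetric observation of K

U, S, Vt = np.linalg.svd(A)
A_bar = U[:, :d] @ np.diag(S[:d]) @ Vt[:d]    # rank-d approximation of A

err_full = np.linalg.norm(A - K)              # error if we keep all of A
err_lowrank = np.linalg.norm(A_bar - K)       # error after rank-d truncation
```

Keeping only the top d singular directions removes the noise living in the remaining n − d dimensions, so the truncated matrix is a better estimate of K than A itself.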
Note\nthat when a matrix is positive de\ufb01nite and symmetric SVD coincides with eigen-decomposition; as\na consequence UK = VK and UA = VA.\n\n\u03c81, \u03c82, . . . be the eigenfunctions of K and \u03bb1, \u03bb2, . . . be the corresponding eigenvalues such that\n\u03bb1 \u2265 \u03bb2 \u2265 \u00b7\u00b7\u00b7 and \u03bbi \u2265 0 for each i. Also let NH be the number of eigenfunctions/eigenvalues\nof K, which is either \ufb01nite or countably in\ufb01nite. We recall some important properties of K [51, 25].\n\nKernel Learning. De\ufb01ne an integral operator K as Kf (x) = (cid:82) \u03ba(x, x(cid:48))f (x(cid:48))dF (x(cid:48)). Let\nFor x \u2208 [0, 1], de\ufb01ne the feature map \u03a6(x) = ((cid:112)\u03bbj\u03c8j(x) : j = 1, 2, ...), so that (cid:104)\u03a6(x), \u03a6(x(cid:48))(cid:105) =\n\u03ba(x, x(cid:48)). We also consider a truncated feature \u03a6d(x) = ((cid:112)\u03bbj\u03c8j(x) : j = 1, 2, ..., d). Intuitively,\nthe feature map well. Finally, let \u03a6d(X) \u2208 Rn\u00d7d such that its (i, j)-th entry is(cid:112)\u03bbj\u03c8j(xi). Let\u2019s\n\nif \u03bbj is too small for suf\ufb01ciently large j, then the \ufb01rst d coordinates (i.e., \u03a6d) already approximate\n\nK ( \u00afA = UASAV T\n\nfurther write (\u03a6d(X)):,i be the i-th column of \u03a6d(X). Let \u03a6(X) = limd\u2192\u221e \u03a6d(X). When the\ncontext is clear, shorten \u03a6d(X) and \u03a6(X) to \u03a6d and \u03a6, respectively.\n\nThere are two main steps in our algorithm which we explain in the following two subsections.\n\n3.1 Estimation of \u03a6 through K and A\n\nThe mapping \u03a6 : [0, 1] \u2192 RNH is bijective so a (reasonably) accurate estimate of \u03a6(xi) can\nbe used to recover xi. Our main result is the design of a data-driven procedure to choose a suitable\nnumber of eigenvectors and eigenvalues of A to approximate \u03a6 (see SM-EST(A) in Fig. 
1).\n\n5\n\n\f2\n29\n\n(2)\n\nn (t/(\u03c1(n)))\n\n(cid:16)\u221a\n\nwell-conditioned, then with high probability:\n\nLet d be chosen by DECIDETHRESHOLD(\u00b7). Let (cid:98)\u03a6 \u2208 RNH be such that its \ufb01rst d-coordinates are\nequal to(cid:112)C(n)UAS1/2\nProposition 3.1. Let t be a tunable parameter such that t = o(\u03c1(n)) and t2/\u03c1(n) = \u03c9(log n).\nA , and its remaining entries are 0. If \u03c1(n) = \u03c9(log n) and K (F and \u03ba) is\n(cid:17)\nSpeci\ufb01cally, by letting t = \u03c12/3(n), we have (cid:107)(cid:98)\u03a6 \u2212 \u03a6(cid:107)F = O(cid:0)\u221a\n\nn\u03c1\u22122/87(n)(cid:1). We remark that\n\n(cid:107)(cid:98)\u03a6 \u2212 \u03a6(cid:107)F = O\n\nour result is stronger than an analogous result for sparse graphs in [25] as our estimate is close to \u03a6\nrather than the truncated \u03a6d.\nRemark on the Eigengap. In our analysis, there are three groups of eigenvalues: the eigenvalues\nof K, those of K, and those of A. They are in different scales: \u03bbi(K) \u2264 1 (resulting from the\nfact that \u03ba(x, y) \u2264 1 for all x and y), and \u03bbi(A/\u03c1(n)) \u2248 \u03bbi(K/n) \u2248 \u03bbi(K) if n and \u03c1(n) are\nsuf\ufb01ciently large. Thus, \u03bbd(K) are independent of n for a \ufb01xed d and should be treated as \u0398(1).\nAlso \u03b4d (cid:44) \u03bbd(K) \u2212 \u03bbd+1(K) \u2192 0 as d \u2192 \u221e. Since the procedure of choosing d depends on C(n)\n(and thus also on n), \u03b4d depends on n and can be bounded by a function in n. This is the reason why\nProposition 3.1 does not explicitly depend on the eigengap. We also note that we cannot directly\n\ufb01nd \u03b4d based on the input matrix A. But standard interlacing results can give \u03b4d = \u0398(\u03bbd(A/\u03c1(n))\u2212\n\u03bbd+1(A/\u03c1(n))) (cf. [19]).\ntheorem, we have (cid:104)\u03a6(xi), \u03a6(xj)(cid:105) =\nIntuition of\nlimd\u2192\u221e(cid:104)\u03a6d(xi), \u03a6d(xj)(cid:105) = \u03ba(xi, xj). 
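This identity has a direct finite-sample analogue that is easy to check numerically: taking the rows of U_K S_K^{1/2} as feature vectors, their Gram matrix recovers the kernel matrix K exactly when all eigenpairs are kept, and approximately after truncation. A sketch with illustrative constants (small-world kernel, uniform latent positions):

```python
import numpy as np

n = 200
x = np.sort(np.random.default_rng(3).uniform(size=n))
# Small-world kernel matrix on the sampled latent positions.
K = 1.0 / (np.abs(x[:, None] - x[None, :]) ** 2 + 1.0)

vals, vecs = np.linalg.eigh(K)
vals, vecs = vals[::-1], vecs[:, ::-1]        # sort eigenpairs descending
vals = np.clip(vals, 0.0, None)               # guard tiny negative round-off

def phi(d):
    # Truncated feature map: row i is Phi_d(x_i) = (sqrt(lambda_j) psi_j(x_i))_{j<=d}.
    return vecs[:, :d] * np.sqrt(vals[:d])

full_err = np.linalg.norm(phi(n) @ phi(n).T - K)    # exact reconstruction at d = n
trunc_err = np.linalg.norm(phi(5) @ phi(5).T - K)   # tail sum_{i>5} lambda_i^2 remains
```

The truncation error is exactly the tail of the spectrum, which is small here because the kernel's eigenvalues decay quickly; this is the finite-dimensional picture behind approximating Phi by Phi_d.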
Thus, limd\u2192\u221e \u03a6d\u03a6T\nd = K. On the other hand, we have\n( \u02dcUK \u02dcS1/2\nK are approximately the same, up to a uni-\ntary transformation. We need to identify different sources of errors to understand the approximation\nquality.\nError source 1. Finite samples to learn the kernel. We want to infer about \u201ccontinuous objects\u201d \u03ba\nand D (speci\ufb01cally the eigenfunctions of K) but K only contains the kernel values of a \ufb01nite set of\npairs. From standard results in Kernel PCA [52, 25], we have with probability \u2265 1 \u2212 \u0001,\n\nK )T = K. Thus, \u03a6d(X) and \u02dcUK \u02dcS1/2\n\nthe algorithm.\n\nK )( \u02dcUK \u02dcS1/2\n\nUsing Mercer\u2019s\n\n(cid:112)log \u0001\u22121\n\n(cid:112)log \u0001\u22121\n\n\u221a\n= 2\n\n2\n\n(cid:107)UKS\n\n1/2\n\nK W \u2212 \u03a6d(X)(cid:107)F \u2264 2\n\n\u221a\n\n2\n\n\u03bbd(K) \u2212 \u03bbd+1(K)\n\n.\n\n\u03b4d\n\n(cid:13)(cid:13)(cid:13)(cid:112)C(n)UAS1/2\n\nError source 2. Only observe A. We observe only the realized graph A and not K, though it holds\nthat EA = K/C(n). Thus, we can only use singular vectors of C(n)A to approximate \u02dcUK \u02dcS1/2\nK .\n. When A is dense (i.e., C(n) = O(1)),\nWe have:\nthe problem is analyzed in [25]. We generalize the results in [25] for the sparse graph case. See [19]\nfor a complete analysis.\n\nA W \u2212 UKS1/2\n\n\u221a\ndn\n\u03b42\nd\u03c1(n)\n\n(cid:13)(cid:13)(cid:13)F\n\n(cid:16) t\n\n(cid:17)\n\n= O\n\nK\n\n\u221a\nE[(\n\nthrown away.\n\n\u03bbi\u03c8i(x))2] =(cid:80)\n\nError source 3. Truncation error. When i is large, the noise in \u03bbi(A)( \u02dcUA):,i \u201coutweighs\u201d the\nused to approximate \u03a6d. Here, we need to address the truncation error: the tail {\u221a\nsignal. Thus, we need to choose a d such that only the \ufb01rst d eigenvectors/eigenvalues of A are\n\u03bbi\u03c8i(xj)}i>d is\nWe have E(cid:107)\u03a6(x) \u2212 \u03a6d(x)(cid:107)2 =(cid:80)\n\nNext we analyze the magitude of the tail. 
We abuse notation so that \u03a6d(x) refers to both a\nd-dimensional vector and a NH-dimensional vector in which all entries after the d-th one are 0.\n\u221a\ni>d \u03bbi\ni>d \u03bbi.\n(A Chernoff bound is used to obtain that (cid:107)\u03a6 \u2212 \u03a6d(cid:107)F = O(\ni>d \u03bbi))). Using the decay\ncondition, we show that a d can be identi\ufb01ed so that the tail can be bounded by a polynomial in \u03b4d.\nThe details are technical and are provided in [19].\n\nn/((cid:112)(cid:80)\n3.2 Estimating Pairwise Distances from(cid:98)\u03a6(xi) through Isomap\nSee ISOMAP-ALGO(\u00b7) in Fig. 1 for the pseudocode. After we construct our estimate (cid:98)\u03a6d, we\nestimate K by letting (cid:98)K = (cid:98)\u03a6d(cid:98)\u03a6T\nto estimate |xi \u2212 xj| = (c0/(cid:98)Ki,j \u2212 c1)1/\u2206. However, \u03ba(xi, xj) is a convex function in |xi \u2212 xj|.\nd . Recalling Ki,j = c0/(|xi \u2212 xj|\u2206 + c1), a plausible approach is\n\n(cid:82) |\u03c8i(x)|2dF (x) =(cid:80)\n\ni>d\n\n6\n\n\f(a) True features\n\n(b) Estimated features\n\n(c) Isomap w/o denoising (d) Isomap + denoising\n\nFigure 2: Using the Isomap Algorithm to recover pairwise distances. (a) The true curve C = {\u03a6(x)}x\u2208[0,1]\n\n(b) Estimate (cid:98)\u03a6 (c) Shows that an undesirable short-cut may exist when we run the Isomap algorithm and (d)\n\nShows the result of running the Isomap algorithm after removal of the corrupted nodes.\n\nThus, when Ki,j is small, a small estimation error here will result in an ampli\ufb01ed estimation error\nin |xi \u2212 xj| (see also Fig. 7 in App. A.3). But when |xi \u2212 xj| is small, Ki,j is reliable (see the\n\u201creliable\u201d region in Fig. 7 in App. A.3).\n\nC = {\u03a6(x)}x\u2208[0,1] forms a curve in RNH (Fig. 2(a)). Our estimate {(cid:98)\u03a6(xi)}i\u2208[n] will be a noisy\nare connected if and only if(cid:98)\u03a6(xi) and(cid:98)\u03a6(xj) are close (Fig. 2(c-d)). 
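The amplification effect described above can be checked numerically: inverting the kernel via |xi − xj| = (c0/K̂i,j − c1)^{1/Δ} magnifies the same additive kernel error much more where the kernel value is small (large distances) than where it is large. A sketch with illustrative constants c0 = c1 = 1 and Δ = 2; the distances are chosen only to make the effect visible:

```python
def kernel(d, delta=2.0, c0=1.0, c1=1.0):
    # Small-world kernel value at latent distance d.
    return c0 / (d ** delta + c1)

def invert(k, delta=2.0, c0=1.0, c1=1.0):
    # Recover the distance from a kernel value: d = (c0/k - c1)^(1/delta).
    return (c0 / k - c1) ** (1.0 / delta)

def sensitivity(d, eps=1e-6):
    # Distance error produced by an additive error eps on the kernel value.
    return abs(invert(kernel(d) - eps) - d) / eps

s_small, s_mid, s_large = sensitivity(0.5), sensitivity(1.0), sensitivity(2.0)
```

The sensitivity grows as the kernel value shrinks, which is why the algorithm trusts only large entries of K̂ (the "reliable" region) and leaves long-range distances to the shortest-path step.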
Then the shortest path distance\n\nThus, our algorithm only uses large values of Ki,j to construct estimates. The isomap technique\nintroduced in topological learning [53, 54] is designed to handle this setting. Speci\ufb01cally, the set\napproximation of the curve (Fig. 2(b)). Thus, we build up a graph on {\u03a6(xi)}i\u2264n so that xi and xj\non G approximates the geodesic distance on C. By using the fact that \u03ba is a radial basis kernel, the\ngeodesic distance will also be proportional to the latent distance.\nCorrupted nodes. Excessively corrupted nodes may help build up \u201cundesirable bridges\u201d and inter-\nfere with the shortest-path based estimation (cf.Fig. 2(c)). Here, the shortest path between two green\nnodes \u201cjumps through\u201d the excessively corrupted nodes (labeled in red) so the shortest path distance\nis very different from the geodesic distance.\n\ncan serve to build up undesirable shortcuts. Thus, we want to eliminate these nodes.\n\nBelow, we describe a procedure to remove excessively corrupted nodes and then explain how\nto analyze the isomap technique\u2019s performance after their removal. Note that d in this section mostly\nrefers to the shortest path distance.\nStep 1. Eliminate corrupted nodes. Recall that x1, x2, ..., xn are the latent variables. Let zi =\nprojection Proj(z) = arg minz(cid:48)\u2208C (cid:107)z(cid:48) \u2212 z(cid:107), where C is the curve formed by {\u03c6(x)}x\u2208[0,1]. Finally,\nfor any point z \u2208 C, de\ufb01ne \u03a6\u22121(z) such that \u03a6(\u03a6\u22121(z)) = z (i.e., z\u2019s original latent position). For\nthe points that fall outside of C, de\ufb01ne \u03a6\u22121(z) = \u03a6\u22121(Proj(z)). Let us re-parametrize the error\nn/f (n), where f (n) = \u03c12/87(n) =\n\n\u03a6(xi) and(cid:98)zi =(cid:98)\u03a6(xi). For any z \u2208 RNH and r > 0, we let Ball(z, r) = {z(cid:48) : (cid:107)z(cid:48)\u2212z(cid:107) \u2264 r}. De\ufb01ne\nterm in Propostion 3.1. 
Let f (n) be such that (cid:107)(cid:98)\u03a6 \u2212 \u03a6(cid:107)F \u2264 \u221a\n\u2126(log2 n) for suf\ufb01ciently large \u03c1(n). By Markov\u2019s inequality, we have Pri[(cid:107)(cid:98)\u03a6(xi) \u2212 \u03a6(xi)(cid:107)2 \u2265\n1/(cid:112)f (n)] \u2264 1/f (n). Intuitively, when (cid:107)(cid:98)\u03a6(xi)\u2212 \u03a6(xi)(cid:107)2 \u2265 1/(cid:112)f (n), i becomes a candidate that\nLooking at a ball of radius O(1/(cid:112)f (n)) centered at a point(cid:98)zi, consider two cases.\nCase 1. If(cid:98)zi is close to Proj((cid:98)zi), i.e., corresponding to the blue nodes in Figure 2(c). For the purpose\nof exposition, let us assume(cid:98)zi = zi. Now for any point zj, if |xi \u2212 xj| = O(f\u22121/\u2206(n)), then we\nhave (cid:107)(cid:98)zi \u2212(cid:98)zj(cid:107) = O(1/(cid:112)f (n)), which means zj is in Ball(zi, O(1/(cid:112)f (n))). The total number\nCase 2. If(cid:98)zi is far away from any point in C, i.e., corresponding to the red ball in Figure 2(c), any\npoints in Ball((cid:98)zi, O(1/(cid:112)f (n))) will also be far from C. Then the total number of such nodes will\nAs n/f 1/\u2206(n) = \u03c9(n/f (n)) for \u2206 > 1, there is a phase-transition phenomenon: When(cid:98)zi\nis far from C, then a neighborhood of(cid:98)zi contains O(n/f (n)) nodes. When(cid:98)zi is close to C, then a\nneighborhood of(cid:98)zi contains \u03c9(n/f (n)) nodes. We can leverage this intuition to design a counting-\n\nof such nodes will be in the order of \u0398(n/f 1/\u2206(n)), by using the near-uniform density assumption.\n\nbased algorithm to eliminate nodes that are far from C:\n\nDENOISE((cid:98)zi) : If |Ball((cid:98)zi, 3/(cid:112)f (n))| < n/f (n), remove(cid:98)zi.\n\nbe O(n/f (n)).\n\n(3)\n\n7\n\n\fAlgo.\nOurs\n\nMod. 
[55]\nCA [18]\nMaj [56]\nRW [54]\nMDS [49]\n\n\u03c1\n0.53\n0.16\n0.20\n0.13\n0.01\n0.05\n\nSlope of \u03b2\n\n9.54\n1.14\n0.11\n0.09\n1.92\n30.91\n\nS.E.\np-value\n0.28 < 0.001\n0.02 < 0.001\n7e-4 < 0.001\n0.02 < 0.001\n0.65 < 0.001\n120.9\n\n0.09\n\nFigure 3: Latent Estimates vs. Ground-truth.\n\n(a) Inferred kernel\n\n(b) SWM\n\n(c) SBM\n\nFigure 4: Visualization of real and synthetic networks. (a) Our inferred kernel matrix, which is \u201cin-between\u201d\n(b) the small-world model and (c) the stochastic blockmodel.\n\nbut not in Good-I.\n\nTheoretical result. We classify a point i into three groups:\n\n1. Good: Satisfying (cid:107)(cid:98)zi \u2212 Proj((cid:98)zi)(cid:107) \u2264 1/(cid:112)f (n). We further partition the set of good points into\ntwo parts. Good-I are points such that (cid:107)(cid:98)zi\u2212 zi(cid:107) \u2264 1/(cid:112)f (n), while Good-II are points that are good\n2. Bad: when (cid:107)zi \u2212 Proj(zi)(cid:107) > 4/(cid:112)f (n).\n\n3. Unclear: otherwise.\nLemma 3.2. (cf. [19] ) After running DENOISE that uses the counting-based decision rule, all good\npoints are kept, all bad points are eliminated, and all unclear points have no performance guarantee.\nThe total number of eliminated nodes is \u2264 n/f (n).\n\nStep 2. An isomap-based algorithm. Wlog assume there is only one closed interval for\n\nsupport(F ). We build a graph G on [n] so that two nodes (cid:98)zi and (cid:98)zj are connected if and only\nif (cid:107)(cid:98)zi \u2212(cid:98)zj(cid:107) \u2264 (cid:96)/(cid:112)f (n), where (cid:96) is a suf\ufb01ciently large constant (say 10). Consider the shortest path\nthe latent distance, i.e., (d\u2212 1)(cid:0) c\n\ndistance between arbitrary pairs of nodes i and j (that are not eliminated.) Because the corrupted\nnodes are removed, the whole path is around C. 
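The counting-based rule (3) and the neighborhood-graph construction of Step 2 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the choice of $f(n)$, the input points, and the use of Dijkstra's algorithm for the graph shortest paths are all assumptions made here for concreteness.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def denoise_and_isomap(z_hat, f_n, ell=10.0):
    """Sketch of DENOISE (rule (3)) followed by the Step-2 graph.

    z_hat: (n, N_H) array of estimates z_hat[i] ~ Phi(x_i).
    A point is kept iff its ball of radius 3/sqrt(f(n)) contains at
    least n/f(n) points; kept points are linked iff their distance is
    at most ell/sqrt(f(n)).
    """
    n = z_hat.shape[0]
    dist = cdist(z_hat, z_hat)                       # pairwise Euclidean distances
    # DENOISE: count neighbors (including self) within radius 3/sqrt(f(n))
    counts = (dist <= 3.0 / np.sqrt(f_n)).sum(axis=1)
    keep = np.where(counts >= n / f_n)[0]            # survivors of the counting rule
    # Step 2: neighborhood graph on survivors; np.inf marks "no edge"
    sub = dist[np.ix_(keep, keep)]
    adj = np.where(sub <= ell / np.sqrt(f_n), sub, np.inf)
    # graph shortest paths approximate geodesic distances along C
    d_graph = shortest_path(adj, method="D", directed=False)
    return keep, d_graph
```

On points sampled densely along a curve plus a few isolated outliers, the outliers fail the counting rule and the shortest-path distance between kept points tracks the curve's arc length.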
Also, by the uniform density assumption, walking on the shortest path in $G$ is equivalent to walking on $C$ with "uniform speed", i.e., each edge on the path maps to an approximately fixed distance on $C$. Thus, the shortest path distance $d$ scales with the latent distance, i.e.,

$$(d - 1)\, c^{1/\Delta} \left(\frac{\ell - 3}{2\sqrt{f(n)}}\right)^{2/\Delta} \le |x_i - x_j| \le d\, c^{1/\Delta} \left(\frac{\ell + 8}{2\sqrt{f(n)}}\right)^{2/\Delta},$$

which implies Theorem 2.2 (cf. [19] for details).

Discussion: "gluing together" two algorithms? The unified model is much more flexible than SBM and SWM. We were intrigued that the generalized algorithm needs only to "glue together" important techniques used in both models: Step 1 uses the spectral technique inspired by SBM inference methods, while Step 2 resembles techniques used in SWM: the isomap graph $G$ only connects two nodes that are close, which is akin to throwing away the long-range edges.

4 Experiments

We apply our algorithm to a social interaction graph from Twitter to construct users' ideology scores. We assembled a dataset by tracking keywords related to the 2016 US presidential election for 10 million users. First, we note that as of 2016 the Twitter interaction graph behaves "in-between" the small-world and stochastic blockmodels (see Figure 4), i.e., the latent distributions are bi-modal but not as extreme as the SBM.

Ground-truth data. Ideology scores of the US Congress (estimated by third parties [57]) are usually considered as a "ground-truth" dataset, e.g., [18]. We apply our algorithm and other baselines on Twitter data to estimate the ideology scores of politicians (members of the 114th Congress), and observe that our algorithm has the highest correlation with the ground truth (see Fig. 3). Beyond correlation, we also need to estimate the statistical significance of our estimates.
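The slope and standard-error columns of Fig. 3 come from fitting the estimates against the ground truth and bootstrapping the slope's standard error. A minimal sketch of such a bootstrap, on synthetic stand-in data (the function name, resampling scheme, and all numbers here are illustrative assumptions, not the paper's exact pipeline):

```python
import numpy as np

def bootstrap_slope_se(x_hat, y, n_boot=2000, seed=0):
    """Fit y ~ beta1 * x_hat + beta0 by least squares, then bootstrap
    (resampling (x, y) pairs with replacement) to estimate the standard
    error of the slope beta1."""
    rng = np.random.default_rng(seed)
    n = len(x_hat)
    slope = np.polyfit(x_hat, y, 1)[0]       # point estimate of beta1
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # bootstrap resample of indices
        boot[b] = np.polyfit(x_hat[idx], y[idx], 1)[0]
    se = boot.std(ddof=1)                    # bootstrap standard error of beta1
    return slope, se
```

The ratio slope/se then yields an approximate z-statistic from which a p-value can be read off a normal table.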
We set up a linear model $y \sim \beta_1 \hat x + \beta_0$, in which the $\hat x$'s are our estimates and the $y$'s are the ground truth. We use bootstrapping to compute the standard error of our estimator, and use the standard error to estimate the p-value of our estimator. The details of this experiment and additional empirical evaluation are available in [19].

Acknowledgments

The authors thank Amazon for partly providing AWS Cloud Credits for this research.

References

[1] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[2] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440–442, 1998.

[3] Jon Kleinberg. The small-world phenomenon: An algorithmic perspective. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, pages 163–170. ACM, 2000.

[4] Se-Young Yun and Alexandre Proutière. Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems 29, pages 965–973, 2016.

[5] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431–461, 2015.

[6] Emmanuel Abbe and Colin Sandon. Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap. arXiv preprint arXiv:1512.09080, 2015.

[7] Emmanuel Abbe and Colin Sandon. Community detection in the general stochastic block model: Fundamental limits and efficient algorithms for recovery. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science, Berkeley, CA, USA, pages 18–20, 2015.

[8] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 694–703. ACM, 2014.

[9] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115, 2013.

[10] Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.

[11] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th International Conference on World Wide Web, pages 695–704. ACM, 2008.

[12] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[13] Mark E. J. Newman, Duncan J. Watts, and Steven H. Strogatz. Random graph models of social networks. Proceedings of the National Academy of Sciences, 99(suppl 1):2566–2572, 2002.

[14] Frank McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 529–537. IEEE, 2001.

[15] Jure Leskovec, Kevin J. Lang, and Michael Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International Conference on World Wide Web, pages 631–640. ACM, 2010.

[16] Ittai Abraham, Shiri Chechik, David Kempe, and Aleksandrs Slivkins. Low-distortion inference of latent similarities from a multiplex social network. In SODA, pages 1853–1872. SIAM, 2013.

[17] Pablo Barberá. Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. 2012.

[18] Pablo Barberá, John T. Jost, Jonathan Nagler, Joshua A. Tucker, and Richard Bonneau. Tweeting from left to right. Psychological Science, 26(10):1531–1542, 2015.

[19] Cheng Li, Felix M. F. Wong, Zhenming Liu, and Varun Kanade. From which world is your graph? Available on arXiv, 2017.

[20] Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090–1098, 2001.

[21] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

[22] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

[23] Edo M. Airoldi, Thiago B. Costa, and Stanley H. Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems 26, pages 692–700. Curran Associates, Inc., 2013.

[24] Patrick J. Wolfe and Sofia C. Olhede. Nonparametric graphon estimation. 2013.

[25] Minh Tang, Daniel L. Sussman, and Carey E. Priebe. Universally consistent vertex classification for latent positions graphs. The Annals of Statistics, 41(3):1406–1430, 2013.

[26] Patrick J. Wolfe and David Choi. Co-clustering separately exchangeable network data. The Annals of Statistics, 42(1):29–63, 2014.

[27] Varun Kanade, Elchanan Mossel, and Tselil Schramm. Global and local information in clustering labeled block models. IEEE Transactions on Information Theory, 62(10):5906–5917, 2016.

[28] Karl Rohe, Tai Qin, and Bin Yu. Co-clustering directed graphs to discover asymmetries and directional communities. Proceedings of the National Academy of Sciences, 113(45):12679–12684, 2016.

[29] Sourav Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.

[30] Jiaming Xu, Laurent Massoulié, and Marc Lelarge. Edge label inference in generalized stochastic block models: from spectral theory to impossibility results. In Proceedings of the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 903–920. PMLR, 2014.

[31] K. T. Poole and H. Rosenthal. A spatial model for legislative roll call analysis. American Journal of Political Science, 29(2):357–384, 1985.

[32] M. Laver, K. Benoit, and J. Garry. Extracting policy positions from political texts using words as data. American Political Science Review, 97(2), 2003.

[33] J. Clinton, S. Jackman, and D. Rivers. The statistical analysis of roll call data. American Political Science Review, 98(2):355–370, 2004.

[34] S. Gerrish and D. Blei. How they vote: Issue-adjusted models of legislative behavior. In Proc. NIPS, 2012.

[35] S. Gerrish and D. Blei. Predicting legislative roll calls from text. In Proc. ICML, 2011.

[36] J. Grimmer and B. M. Stewart. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 2013.

[37] Emmanuel Abbe. Community detection and the stochastic block model. 2016.

[38] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[39] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. SIAM Journal on Numerical Analysis, 7:1–46, 1970.

[40] Piotr Indyk and Jiří Matoušek. Low-distortion embeddings of finite metric spaces. In Handbook of Discrete and Computational Geometry, page 177, 2004.

[41] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics, 40(4):2266–2292, 2012.

[42] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems 26, pages 3120–3128, 2013.

[43] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 269–274, New York, NY, USA, 2001. ACM.

[44] T. Zhou, J. Ren, M. Medo, and Y.-C. Zhang. Bipartite network projection and personal recommendation. Physical Review E, 76(4):046115, October 2007.

[45] Felix Ming Fai Wong, Chee-Wei Tan, Soumya Sen, and Mung Chiang. Quantifying political leaning from tweets, retweets, and retweeters. IEEE Transactions on Knowledge and Data Engineering, 28(8):2158–2172, 2016.

[46] H. König. Eigenvalue Distribution of Compact Operators. Operator Theory: Advances and Applications. Birkhäuser, 1986.

[47] Milena Mihail and Christos Papadimitriou. On the eigenvalue power law. In International Workshop on Randomization and Approximation Techniques in Computer Science, pages 254–262. Springer, 2002.

[48] Mihai Badoiu, Julia Chuzhoy, Piotr Indyk, and Anastasios Sidiropoulos. Low-distortion embeddings of general metrics into the line. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, pages 225–233, 2005.

[49] I. Borg and P. J. F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 2005.

[50] Piotr Indyk and Jiří Matoušek. Low-distortion embeddings of finite metric spaces. In Handbook of Discrete and Computational Geometry, pages 177–196. CRC Press, 2004.

[51] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

[52] Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11:905–934, 2010.

[53] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319, 2000.

[54] Vin de Silva and Joshua B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in Neural Information Processing Systems 15, pages 705–712. MIT Press, 2003.

[55] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74, 2006.

[56] U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3), 2007.

[57] Joshua Tauberer. Observing the unobservables in the US Congress. Law via the Internet, 2012.