{"title": "Learning a Distance Metric from a Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1899, "page_last": 1907, "abstract": "Many real-world networks are described by both connectivity information and features for every node.  To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network.  Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric.  We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques.  We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges.", "full_text": "Learning a Distance Metric from a Network\n\nBlake Shaw\u2217\n\nComputer Science Dept.\n\nColumbia University\n\nBert Huang\u2217\n\nComputer Science Dept.\n\nColumbia University\n\nTony Jebara\n\nComputer Science Dept.\n\nColumbia University\n\nblake@cs.columbia.edu\n\nbert@cs.columbia.edu\n\njebara@cs.columbia.edu\n\nAbstract\n\nMany real-world networks are described by both connectivity information and\nfeatures for every node. To better model and understand these networks, we\npresent structure preserving metric learning (SPML), an algorithm for learning\na Mahalanobis distance metric from a network such that the learned distances are\ntied to the inherent connectivity structure of the network. 
Like the graph embed-\nding algorithm structure preserving embedding, SPML learns a metric which is\nstructure preserving, meaning a connectivity algorithm such as k-nearest neigh-\nbors will yield the correct connectivity when applied using the distances from\nthe learned metric. We show a variety of synthetic and real-world experiments\nwhere SPML predicts link patterns from node features more accurately than stan-\ndard techniques. We further demonstrate a method for optimizing SPML based\non stochastic gradient descent which removes the running-time dependency on\nthe size of the network and allows the method to easily scale to networks of thou-\nsands of nodes and millions of edges.\n\n1\n\nIntroduction\n\nThe proliferation of social networks on the web has spurred many signi\ufb01cant advances in modeling\nnetworks [1, 2, 4, 12, 13, 15, 16, 26]. However, while many efforts have been focused on modeling\nnetworks as weighted or unweighted graphs [17], or constructing features from links to describe\nthe nodes in a network [14, 25], few techniques have focused on real-world network data which\nconsists of both node features in addition to connectivity information. Many social networks are\nof this form; on services such as Facebook, Twitter, or LinkedIn, there are pro\ufb01les which describe\neach person, as well as the connections they make. The relationship between a node\u2019s features and\nconnections is often not explicit. For example, people \u201cfriend\u201d each other on Facebook for a variety\nof reasons: perhaps they share similar parts of their pro\ufb01le such as their school or major, or perhaps\nthey have completely different pro\ufb01les. We want to learn the relationship between pro\ufb01les and links\nfrom massive social networks such that we can better predict who is likely to connect. 
To model\nthis relationship, one could simply model each link independently, where one simply learns what\ncharacteristics of two pro\ufb01les imply a possible link. However, this approach completely ignores the\nstructural characteristics of the links in the network. We posit that modeling independent links is\ninsuf\ufb01cient, and in order to better model these networks one must account for the inherent topology\nof the network as well as the interactions between the features of nodes. We thus propose structure\npreserving metric learning (SPML), a method for learning a distance metric between nodes that\npreserves the structural network behavior seen in data.\n\n1.1 Background\n\nMetric learning algorithms have been successfully applied to many supervised learning tasks such\nas classi\ufb01cation [3, 23, 24]. These methods \ufb01rst build a k-nearest neighbors (kNN) graph from\n\n\u2217Blake Shaw is currently at Foursquare, and Bert Huang is currently at the University of Maryland.\n\n1\n\n\ftraining data with a \ufb01xed k, and then learn a Mahalanobis distance metric which tries to keep con-\nnected points with similar labels close while pushing away class impostors, pairs of points which\nare connected but of different classes. Fundamentally, these supervised methods aim to learn a dis-\ntance metric such that applying a connectivity algorithm (for instance, k-nearest neighbors) under\nthe metric will produce a graph where no point is connected to others with different class labels. In\npractice, these constraints are enforced with slack. Once the metric is learned, the class label for an\nunseen datapoint can be predicted by the majority vote of nearby points under the learned metric.\nUnfortunately, these metric learning algorithms are not easily applied when we are given a network\nas input instead of class labels for each point. 
Under this new regime, we want to learn a metric such\nthat points connected in the network are close and points which are unconnected are more distant.\nIntuitively, certain features or groups of features should influence how nodes connect, and thus it\nshould be possible to learn a mapping from features to connectivity such that the mapping respects\nthe underlying topological structure of the network. Like previous metric learning methods, SPML\nlearns a metric which reconciles the input features with some auxiliary information such as class\nlabels. In this case, instead of pushing away class impostors, SPML pushes away graph impostors,\npoints which are close in terms of distance but which should remain unconnected in order to preserve\nthe topology of the network. Thus SPML learns a metric where the learned distances are inherently\ntied to the original input connectivity.\nPreserving graph topology is possible by enforcing simple linear constraints on distances between\nnodes [21]. By adapting the constraints from the graph embedding technique structure preserving\nembedding, we formulate simple linear structure preserving constraints for metric learning that\nenforce that neighbors of each node are closer than all others. Furthermore, we adapt these constraints\nfor an online setting similar to PEGASOS [20] and OASIS [3], such that we can apply SPML to\nlarge networks by optimizing with stochastic gradient descent (SGD).\n\n2 Structure preserving metric learning\nGiven as input an adjacency matrix A \u2208 B^{n\u00d7n} and node features X \u2208 R^{d\u00d7n}, structure\npreserving metric learning (SPML) learns a Mahalanobis distance metric parameterized by a positive\nsemidefinite (PSD) matrix M \u2208 R^{d\u00d7d}, where M \u2ab0 0. The distance between two points under the\nmetric is defined as D_M(x_i, x_j) = (x_i \u2212 x_j)^\u22a4 M (x_i \u2212 x_j). 
When the metric is the identity M = I_d,\nD_M(x_i, x_j) represents the squared Euclidean distance between the i\u2019th and j\u2019th points. Learning M\nis equivalent to learning a linear scaling on the input features LX where M = L^\u22a4 L and L \u2208 R^{d\u00d7d}.\nSPML learns an M which is structure preserving, as defined in Definition 1. Given a connectivity\nalgorithm G, SPML learns a metric such that applying G to the input data using the learned metric\nproduces the input adjacency matrix exactly.\u00b9 Possible choices for G include maximum weight\nb-matching, k-nearest neighbors, \u03f5-neighborhoods, or maximum weight spanning tree.\nDefinition 1 Given a graph with adjacency matrix A, a distance metric parametrized by M \u2208 R^{d\u00d7d}\nis structure preserving with respect to a connectivity algorithm G, if G(X, M) = A.\n\n\u00b9 In the remainder of the paper, we interchangeably use G to denote the set of feasible graphs and the\nalgorithm used to find the optimal connectivity within the set of feasible graphs.\n\n2.1 Preserving graph topology with linear constraints\n\nTo preserve graph topology, we use the same linear constraints as structure preserving embedding\n(SPE) [21], but apply them to M, which parameterizes the distances between points. A useful tool\nfor defining distances as linear constraints on M is the transformation\n\nD_M(x_i, x_j) = x_i^\u22a4 M x_i + x_j^\u22a4 M x_j \u2212 x_i^\u22a4 M x_j \u2212 x_j^\u22a4 M x_i,    (1)\n\nwhich allows linear constraints on the distances to be written as linear constraints on the M matrix.\nFor different connectivity schemes below, we present linear constraints which enforce that graph\nstructure is preserved.\n\nNearest neighbor graphs The k-nearest neighbor algorithm (k-nn) connects each node to the k\nneighbors to which the node has shortest distance, where k is an input parameter; therefore, when k is set\nto the true degree of each node, the distances to all disconnected nodes must be larger than the\ndistance to the farthest connected neighbor: D_M(x_i, x_j) > (1 \u2212 A_ij) max_l (A_il D_M(x_i, x_l)), \u2200i, j.\nSimilarly, preserving an \u03f5-neighborhood graph obeys linear constraints on M: D_M(x_i, x_j) \u2264\n\u03f5, \u2200{i, j | A_ij = 1}, and D_M(x_i, x_j) \u2265 \u03f5, \u2200{i, j | A_ij = 0}. If for each node the connected\ndistances are less than the unconnected distances (or some \u03f5), i.e., the metric obeys the above linear\nconstraints, Definition 1 is satisfied, and thus the connectivity computed under the learned metric M\nis exactly A.\n\nMaximum weight subgraphs Unlike nearest neighbor algorithms, which select edges greedily for\neach node, maximum weight subgraph algorithms select edges from a weighted graph to produce\na subgraph which has total maximal weight [6]. Given a metric parametrized by M, let the weight\nbetween two points (i, j) be the negated pairwise distance between them: Z_ij = \u2212D_M(x_i, x_j) =\n\u2212(x_i \u2212 x_j)^\u22a4 M (x_i \u2212 x_j). For example, maximum weight b-matching finds the maximum weight\nsubgraph while also enforcing that every node i has a fixed degree b_i. The formulation\nfor maximum weight spanning tree is similar. 
Unfortunately, preserving structure for these\nalgorithms requires enforcing many linear constraints of the form: tr(Z^\u22a4 A) \u2265 tr(Z^\u22a4 A\u0303), \u2200 A\u0303 \u2208 G.\nThis reveals one critical difference between structure preserving constraints of these algorithms\nand those of nearest-neighbor graphs: there are exponentially many linear constraints. To avoid\nan exponential enumeration, the most violated inequalities can be introduced sequentially using a\ncutting-plane approach as shown in the next section.\n\n2.2 Algorithm derivation\nBy combining the linear constraints from the previous section with a Frobenius norm (denoted ||\u00b7||_F)\nregularizer on M and regularization parameter \u03bb, we have a simple semidefinite program (SDP)\nwhich learns an M that is structure preserving and has minimal complexity. Algorithm 1 summarizes\nthe naive implementation of SPML when the connectivity algorithm is k-nearest neighbors, which is\noptimized by a standard SDP solver. For maximum weight subgraph connectivity (e.g., b-matching),\nwe use a cutting-plane method [10], iteratively finding the worst violating constraint and adding it to\na working set. We can find the most violated constraint at each iteration by computing the adjacency\nmatrix A\u0303 that maximizes tr(Z\u0303^\u22a4 A\u0303) s.t. A\u0303 \u2208 G, which can be done using various methods [6, 7, 8].\nEach added constraint enforces that the total weight along the edges of the true graph is greater\nthan the total weight of any other graph by some margin. 
Algorithm 2 shows the steps for SPML with\ncutting-plane constraints.\n\nAlgorithm 1 Structure preserving metric learning with nearest neighbor constraints\nInput: A \u2208 B^{n\u00d7n}, X \u2208 R^{d\u00d7n}, and parameter \u03bb\n1: K = {M \u2ab0 0, D_M(x_i, x_j) \u2265 (1 \u2212 A_ij) max_l (A_il D_M(x_i, x_l)) + 1 \u2212 \u03be \u2200i, j}\n2: M\u0303 \u2190 argmin_{M \u2208 K} (\u03bb/2) ||M||_F^2 + \u03be {Found via SDP}\n3: return M\u0303\n\nAlgorithm 2 Structure preserving metric learning with cutting-plane constraints\nInput: A \u2208 B^{n\u00d7n}, X \u2208 R^{d\u00d7n}, connectivity algorithm G, and parameters \u03bb, \u03ba\n1: K = {M \u2ab0 0}\n2: repeat\n3:   M\u0303 \u2190 argmin_{M \u2208 K} (\u03bb/2) ||M||_F^2 + \u03be {Found via SDP}\n4:   Z\u0303 \u2190 2 X^\u22a4 M\u0303 X \u2212 diag(X^\u22a4 M\u0303 X) 1^\u22a4 \u2212 1 diag(X^\u22a4 M\u0303 X)^\u22a4\n5:   A\u0303 \u2190 argmax_{A\u0303} tr(Z\u0303^\u22a4 A\u0303) s.t. A\u0303 \u2208 G {Find worst violator}\n6:   if |tr(Z\u0303^\u22a4 A\u0303) \u2212 tr(Z\u0303^\u22a4 A)| \u2265 \u03ba then\n7:     add constraint to K: tr(Z^\u22a4 A) \u2212 tr(Z^\u22a4 A\u0303) > 1 \u2212 \u03be\n8:   end if\n9: until |tr(Z\u0303^\u22a4 A\u0303) \u2212 tr(Z\u0303^\u22a4 A)| \u2264 \u03ba\n10: return M\u0303\n\nUnfortunately, for networks larger than a few hundred nodes or for high-dimensional features, these\nSDPs do not scale adequately. The complexity of the SDP scales with the number of variables and\nconstraints, yielding a worst-case time of O(d^3 + C^3) where C = O(n^2). By temporarily omitting\nthe PSD requirement on M, Algorithm 2 becomes equivalent to a one-class structural support\nvector machine (structural SVM). Stochastic SVM algorithms have been recently developed that\nhave convergence time with no dependence on input size [19]. Therefore, we develop a large-scale\nalgorithm based on projected stochastic subgradient descent. 
The proposed adaptation removes the\ndependence on n: each iteration of the algorithm is O(d^2), sampling one random constraint at\na time. We can rewrite the optimization as unconstrained over an objective function with a hinge-loss\non the structure preserving constraints:\n\nf(M) = (\u03bb/2) ||M||_F^2 + (1/|S|) \u2211_{(i,j,k) \u2208 S} max(D_M(x_i, x_j) \u2212 D_M(x_i, x_k) + 1, 0).\n\nHere the constraints have been written in terms of hinge-losses over triplets, each consisting of a\nnode, its neighbor and its non-neighbor. The set of all such triplets is S = {(i, j, k) | A_ij =\n1, A_ik = 0}. Using the distance transformation in Equation 1, each of the |S| constraints can be\nwritten using a sparse matrix C^(i,j,k), where\n\nC^(i,j,k)_jj = 1, C^(i,j,k)_ik = 1, C^(i,j,k)_ki = 1, C^(i,j,k)_ij = \u22121, C^(i,j,k)_ji = \u22121, C^(i,j,k)_kk = \u22121,\n\nand whose other entries are zero. By construction, sparse matrix multiplication of C^(i,j,k) indexes\nthe proper elements related to nodes i, j, and k, such that tr(C^(i,j,k) X^\u22a4 M X) is equal to\nD_M(x_i, x_j) \u2212 D_M(x_i, x_k). The subgradient of f at M is then\n\n\u2207f = \u03bbM + (1/|S|) \u2211_{(i,j,k) \u2208 S+} X C^(i,j,k) X^\u22a4,\n\nwhere S+ = {(i, j, k) | D_M(x_i, x_j) \u2212 D_M(x_i, x_k) + 1 > 0}.\nIf for all triplets the quantity D_M(x_i, x_j) \u2212 D_M(x_i, x_k) + 1 is\nnegative, there exists no unconnected neighbor of a point which is closer than a point\u2019s farthest\nconnected neighbor, which is precisely the structure preserving criterion for nearest neighbor algorithms. In\npractice, we optimize this objective function via stochastic subgradient descent. We sample a batch\nof triplets, replacing S in the objective function with a random subset of S of size B. If a true metric\nis necessary, we intermittently project M onto the PSD cone. 
Full details about constructing the\nconstraint matrices and minimizing the objective are shown in Algorithm 3.\n\nAlgorithm 3 Structure preserving metric learning with nearest neighbor constraints and optimization\nwith projected stochastic subgradient descent\nInput: A \u2208 B^{n\u00d7n}, X \u2208 R^{d\u00d7n}, and parameters \u03bb, T, B\n1: M_1 \u2190 I_d\n2: for t from 1 to T \u2212 1 do\n3:   \u03b7_t \u2190 1/(\u03bbt)\n4:   C \u2190 0_{n,n}\n5:   for b from 1 to B do\n6:     (i, j, k) \u2190 Sample random triplet from S = {(i, j, k) | A_ij = 1, A_ik = 0}\n7:     if D_{M_t}(x_i, x_j) \u2212 D_{M_t}(x_i, x_k) + 1 > 0 then\n8:       C_jj \u2190 C_jj + 1, C_ik \u2190 C_ik + 1, C_ki \u2190 C_ki + 1\n9:       C_ij \u2190 C_ij \u2212 1, C_ji \u2190 C_ji \u2212 1, C_kk \u2190 C_kk \u2212 1\n10:     end if\n11:   end for\n12:   \u2207_t \u2190 X C X^\u22a4 + \u03bbM_t\n13:   M_{t+1} \u2190 M_t \u2212 \u03b7_t \u2207_t\n14:   Optional: M_{t+1} \u2190 [M_{t+1}]+ {Project onto the PSD cone}\n15: end for\n16: return M_T\n\n2.3 Analysis\n\nIn this section, we provide analysis for the scaling behavior of SPML using SGD. A primary insight\nis that, since Algorithm 3 regularizes with the L2 norm and penalizes with hinge-loss, omitting the\npositive semidefinite requirement for M and vectorizing M makes the algorithm equivalent to a\none-class, linear support vector machine with O(n^3) input vectors. Thus, the stochastic optimization is\nan instance of the PEGASOS algorithm [19], albeit a cleverly constructed one. The running time\nof PEGASOS does not depend on the input size, and instead only scales with the dimensionality,\nthe desired optimization error on the objective function \u03f5 and the regularization parameter \u03bb. The\noptimization error \u03f5 is defined as the difference between the found objective value and the true\noptimal objective value, f(M\u0303) \u2212 min_M f(M).\nTheorem 2 Assume that the data is bounded such that max_{(i,j,k) \u2208 S} ||X C^(i,j,k) X^\u22a4||_F^2 \u2264 R, and\nR \u2265 1. 
During Algorithm 3 at iteration T, with \u03bb \u2264 1/4, and batch-size B = 1, let M\u0304 = (1/T) \u2211_{t=1}^{T} M_t\nbe the average M so far. Then, with probability of at least 1 \u2212 \u03b4,\n\nf(M\u0304) \u2212 min_M f(M) \u2264 84 R^2 ln(T/\u03b4) / (\u03bbT).\n\nConsequently, the number of iterations necessary to reach an optimization error of \u03f5 is \u00d5(1/(\u03bb\u03f5)).\n\nProof The theorem is proven by realizing that Algorithm 3 is an instance of PEGASOS without\na projection step on one-class data, since Corollary 2 in [20] proves this same bound for traditional\nSVM input, also without a projection step. The input to the SVM is the set of all d \u00d7 d matrices\nX C^(i,j,k) X^\u22a4 for each triplet (i, j, k) \u2208 S.\n\nNote that the large size of set S plays no role in the running time; each iteration requires O(d^2) work.\nAssuming the node feature vectors are of bounded norm, the radius of the input data R is constant\nwith respect to n, since each is constructed using the feature vectors of three nodes. In practice, as\nin the PEGASOS algorithm, we propose using M_T as the output instead of the average, as doing\nso performs better on real data, but an averaging version is easily implemented by storing a running\nsum of M matrices and dividing by T before returning.\nFigure 2(b) shows the training and testing prediction performance on one of the experiments\ndescribed in detail in Section 3 as stochastic SPML converges. The area under the receiver operator\ncharacteristic (ROC) curve is measured, which is related to the structure preserving hinge loss, and\nthe plot clearly shows fast convergence and quickly diminishing returns at higher iteration counts.\n\n2.4 Variations\n\nWhile stochastic SPML does not scale with the size of the input graph, evaluating distances using\na full M matrix requires O(d^2) work. 
Thus, for high-dimensional data, one approach is to use\nprincipal component analysis or random projections to first reduce dimensionality. It has been\nshown that n points can be mapped into a space of dimensionality O(log n / \u03b5^2) such that distances\nare distorted by no more than a factor of (1 \u00b1 \u03b5) [5, 11]. Another approach is to limit M to be\nnonzero only along the diagonal. Restricting M to be diagonal reduces the amount of work to O(d).\nIf modeling cross-feature interactions is necessary, another option for reducing the computational\ncost is to perform SPML using a low-rank factorization of M. In this case, all references to M can\nbe replaced with L^\u22a4 L, thus inducing a true metric without projection. The updated gradient with\nrespect to L is simply \u2207_t \u2190 2 X C X^\u22a4 L^\u22a4 + \u03bbL_t. Using a factorization also allows replacing the\nregularizer with the Frobenius norm of the L matrix, which is equivalent to the nuclear norm of M\n[18]. Using this formulation causes the objective to no longer be convex, but seems to work well in\npractice. Finally, when predicting links of new nodes, SPML does not know how many connections\nto predict. To address this uncertainty, we propose a variant to SPML called degree distributional\nmetric learning (DDML), which simultaneously learns the metric as well as parameters for the\nconnectivity algorithm. Details on DDML and low-rank SPML are provided in the Appendix.\n\n3 Experiments\n\nWe present a variety of synthetic and real-world experiments that elucidate the behavior of SPML.\nFirst we show how SPML performs on a simple synthetic dataset that is easily visualized in two\ndimensions and which we believe mimics many traditional network datasets. 
We then demonstrate\nfavorable performance for SPML in predicting links of the Wikipedia document network and the\nFacebook social network.\n\n3.1 Synthetic example\n\nTo better understand the behavior of SPML, consider the following synthetic experiment. First n\npoints are sampled from a d-dimensional uniform distribution. These vectors represent the true fea-\ntures for the n nodes X \u2208 Rd\u00d7n. We then compute an adjacency matrix by performing a minimum-\ndistance b-matching on X. Next, the true features are scrambled by applying a random linear trans-\nformation: RX where R \u2208 Rd\u00d7d. Given RX and A, the goal of SPML is to learn a metric M that\nundoes the linear scrambling, so that when b-matching is applied to RX using the learned distance\nmetric, it produces the input adjacency matrix.\nFigure 1 illustrates the results of the above experiment for d = 2, n = 50, and b = 4. In Figure 1(a),\nwe see an embedding of the graph using the true features for each node as coordinates, and connec-\ntivity generated from b-matching. In Figure 1(b), the random linear transformation has been applied.\nWe posit that many real-world datasets resemble plot 1(b), with seemingly incongruous feature and\nconnectivity information. Applying b-matching to the scrambled data produces connections shown\nin Figure 1(c). 
Finally, by learning M via SPML (Algorithm 2) and computing L by Cholesky\ndecomposition of M, we can recover features LRX (Figure 1(d)) that respect the structure in the\ntarget adjacency matrix and thus more closely resemble the true features used to generate the data.\n\n(a) True network\n\n(b) Scrambled features\n& true connectivity\n\n(c) Scrambled features\n& implied connectivity\n\n(d) Recovered features &\ntrue connectivity\n\nFigure 1: In this synthetic experiment, SPML \ufb01nds a metric that inverts the random transformation applied\nto the features (b), such that under the learned metric (d) the implied connectivity is identical to the original\nconnectivity (a) as opposed to inducing a different connectivity (c).\n\n3.2 Link prediction\n\nWe compare SPML to a variety of methods for predicting links from node features: Euclidean\ndistances, relational topic models (RTM) , and traditional support vector machines (SVM). A simple\nbaseline for comparison is how well the Euclidean distance metric performs at ranking possible\nconnections. Relational topic models learn a link probability function in addition to latent topic\nmixtures describing each node [2]. For the SVM, we construct training examples consisting of the\npairwise differences between node features. Training examples are labeled positive if there exists an\nedge between the corresponding pair of nodes, and negative if there is no edge. Because there are\npotentially O(n2) possible examples, and the graphs are sparse, we subsample the negative examples\nso that we include a randomly chosen equal number of negative examples as positive edges. Without\nsubsampling, the SVM is unable to run our experiments in a reasonable time. 
We use the SVMPerf\nimplementation for our SVM [9], and the authors\u2019 code for RTM [2].\nInterestingly, an SVM with these inputs can be interpreted as an instance of SPML using diagonal\nM and the \u03f5-neighborhood connectivity algorithm, which connects points based on their distance,\ncompletely independently of the rest of the graph structure. We thus expect to see better performance\nusing SPML in cases where the structure is important. The RTM approach is appropriate for data\nthat consists of counts, and is a generative model which recovers a set of topics in addition to\nlink predictions. Despite the generality of the model, RTM does not seem to perform as well as\ndiscriminative methods in our experiments, especially in the Facebook experiment where the data\nis quite different from bag-of-words features. For SPML, we run the stochastic algorithm with\nbatch size 10. We skip the PSD projection step, since these experiments are only concerned with\nprediction, and obtaining a true metric is not necessary. SPML is implemented in MATLAB and\nrequires only a few minutes to converge for each of the experiments below.\n\n(a) Average ROC curve for Wikipedia Experiment:\n\u201cgraph theory topics\u201d\n\n(b) Convergence behavior of SPML optimized\nvia SGD on Facebook Data\n\nFigure 2: Average ROC performance for the \u201cgraph theory topics\u201d Wikipedia experiment (left) shows a strong\nlift for SPML over competing methods. We see that SPML converges quickly with diminishing returns after\nmany iterations (right).\n\nWikipedia articles We apply SPML to predicting links on Wikipedia pages. Imagine the scenario\nwhere an author writes a new Wikipedia entry and then, by analyzing the word counts on the newly\nwritten page, an algorithm is able to suggest which other Wikipedia pages it should link to. We first\ncreate a few subnetworks consisting of all the pages in a given category, their bag-of-words features,\nand their connections. 
We choose three categories: \u201cgraph theory topics\u201d, \u201cphilosophy concepts\u201d,\nand \u201csearch engines\u201d. We use a word dictionary of common words with stop-words removed. For\neach network, we split the data 80/20 for training and testing, where 20% of the nodes are held out\nfor evaluation. On the remaining 80% we cross-validate (\ufb01ve folds) over the parameters for each\nalgorithm (RTM, SVM, SPML), and train a model using the best-scoring regularization parameter.\nFor SPML, we use the diagonal variant of Algorithm 3, since the high-dimensionality of the input\nfeatures reduces the bene\ufb01t of cross-feature weights. On the held-out nodes, we task each algo-\nrithm to rank the unknown edges according to distance (or another measure of link likelihood), and\ncompare the accuracy of the rankings using receiver operator characteristic (ROC) curves. Table 1\nlists the statistics of each category and the average area under the curve (AUC) over three train/test\nsplits for each algorithm. A ROC curve for the \u201cgraph theory\u201d category is shown in Figure 2(a). For\n\u201cgraph theory\u201d and \u201csearch engines\u201d, SPML provides a distinct advantage over other methods, while\nno method has a particular advantage on \u201cphilosophy concepts\u201d. One possible explanation for why\nthe SVM is unable to gain performance over Euclidean distance is that the wide range of degrees\nfor nodes in these graphs makes it dif\ufb01cult to \ufb01nd a single threshold that separates edges from non-\nedges. In particular, the \u201csearch engines\u201d category had an extremely skewed degree distribution, and\nis where SPML shows the greatest improvement.\nWe also apply SPML to a larger subset of the Wikipedia network, by collecting word counts and\nconnections of 100,000 articles in a breadth-\ufb01rst search rooted at the article \u201cPhilosophy\u201d. 
The\nexperimental setup is the same as previous experiments, but we use a 0.5% sample of the nodes for\ntesting. The final training algorithm ran for 50,000 iterations, taking approximately ten minutes on\na desktop computer. The resulting AUC on the edges of the held-out nodes is listed in Table 1 as the\n\u201cPhilosophy Crawl\u201d dataset. The SVM and RTM do not scale to data of this size, whereas SPML\noffers a clear advantage over using Euclidean distance for predicting links.\n\nFacebook social networks Applying SPML to social network data allows us to more accurately\npredict who will become friends based on the profile information for those users. We use Facebook\ndata [22], where we have a small subset of anonymized profile information for each student\nof a university, as well as friendship information. The profile information consists of gender, status\n(meaning student, staff, or faculty), dorm, major, and class year. Similarly to the Wikipedia experiments\nin the previous section, we compared SPML to Euclidean, RTM, and SVM. For SPML, we\nlearn a full M via Algorithm 3. For each person, we construct a sparse feature vector where there\nis one feature corresponding to every possible dorm, major, etc. for each feature type. We select\nonly people who have indicated all five feature types on their profiles. Table 1 shows details of\nthe Facebook networks for the four schools we consider: Harvard, MIT, Stanford, and Columbia.\nWe perform a separate experiment for each school, randomly splitting the data 80/20 for training\nand testing. We use the training data to select parameters via five-fold cross validation, and train a\nmodel. The AUC performance on the held-out edges is also listed in Table 1. It is clear from the\nquantitative results that structural information is contributing to higher performance for SPML as\ncompared to other methods.\n\nTable 1: Wikipedia (top), Facebook (bottom) dataset and experiment information. Shown below: number of\nnodes n, number of edges m, dimensionality d, and AUC performance.\n\nDataset              n        m          d     Euclidean  RTM    SVM    SPML\nGraph Theory         223      917        6695  0.624      0.591  0.610  0.722\nPhilosophy Concepts  303      921        6695  0.705      0.571  0.708  0.707\nSearch Engines       269      332        6695  0.662      0.487  0.611  0.742\nPhilosophy Crawl     100,000  4,489,166  7702  0.547      \u2013      \u2013      0.601\nHarvard              1937     48,980     193   0.764      0.562  0.839  0.854\nMIT                  2128     95,322     173   0.702      0.494  0.784  0.801\nStanford             3014     147,516    270   0.718      0.532  0.784  0.808\nColumbia             3050     118,838    251   0.717      0.519  0.796  0.818\n\nFigure 3: Comparison of Facebook social networks from four schools in terms of feature importance computed\nfrom the learned structure preserving metric.\n\nBy looking at the weight of the diagonal values in M normalized by the total weight, we can\ndetermine which feature differences are most important for determining connectivity. Figure 3 shows\nthe normalized weights averaged by feature types for Facebook data. Here we see the feature types\ncompared across four schools. For all schools except MIT, the graduating year is most important for\ndetermining distance between people. For MIT, dorms are the most important features. 
A possible\nexplanation for this difference is that MIT is the only school in the list that makes it easy for students\nto stay in a residence for all four years of their undergraduate program, and therefore which dorm\none lives in may more strongly affect the people they connect to.\n\n4 Discussion\n\nWe have demonstrated a fast convex optimization for learning a distance metric from a network\nsuch that the distances are tied to the network\u2019s inherent topological structure. The structure preserving\ndistance metrics introduced in this article allow us to better model and predict the behavior\nof large real-world networks. Furthermore, these metrics are as lightweight as independent pairwise\nmodels, but capture structural dependency from features, making them easy to use in practice for\nlink prediction. In future work, we plan to exploit SPML\u2019s lack of dependence on graph size to\nlearn a structure preserving metric on massive-scale graphs, e.g., the entire Wikipedia site. Since\neach iteration requires only sampling a random node, following a link to a neighbor, and sampling\na non-neighbor, this can all be done in an online fashion as the algorithm crawls a network such as\nthe worldwide web, learning a metric that may gradually change over time.\n\nAcknowledgments This material is based upon work supported by the National Science Foundation\nunder Grant No. 1117631, by a Google Research Award, and by the Department of Homeland\nSecurity under Grant No. N66001-09-C-0080.\n\n[Figure 3 plot: relative importance (0 to 0.5) of status, gender, major, dorm, and year for Harvard, MIT, Stanford, and Columbia.]\n\nReferences\n\n[1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. JMLR, 9:1981\u20132014, 2008.\n\n[2] J. Chang and D. Blei. Hierarchical relational models for document networks. Annals of Applied Statistics, 4:124\u2013150, 2010.\n\n[3] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. 
Large scale online learning of image similarity through ranking. J. Mach. Learn. Res., 11:1109\u20131135, March 2010.\n\n[4] J. Chen, W. Geyer, C. Dugan, M. Muller, and I. Guy. Make new friends, but keep the old: recommending people on social networking sites. In CHI, pages 201\u2013210. ACM, 2009.\n\n[5] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22:60\u201365, January 2003.\n\n[6] C. Fremuth-Paeger and D. Jungnickel. Balanced network flows, a unifying framework for design and analysis of matching algorithms. Networks, 33(1):1\u201328, 1999.\n\n[7] B. Huang and T. Jebara. Loopy belief propagation for bipartite maximum weight b-matching. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of JMLR: W&CP, pages 195\u2013202, 2007.\n\n[8] B. Huang and T. Jebara. Fast b-matching via sufficient selection belief propagation. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.\n\n[9] T. Joachims. Training linear SVMs in linear time. In ACM SIG International Conference On Knowledge Discovery and Data Mining (KDD), pages 217\u2013226, 2006.\n\n[10] T. Joachims, T. Finley, and C. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27\u201359, 2009.\n\n[11] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, (26):189\u2013206, 1984.\n\n[12] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. ACM WWW, 2008.\n\n[13] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proc. of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005.\n\n[14] M. Middendorf, E. Ziv, C. Adams, J. Hom, R. Koytcheff, C. Levovitz, and G. 
Woods. Discriminative topological features reveal biological network mechanisms. BMC Bioinformatics, 5:1471\u20132105, 2004.\n\n[15] G. Namata, H. Sharara, and L. Getoor. A survey of link mining tasks for analyzing noisy and incomplete networks. In Link Mining: Models, Algorithms, and Applications. Springer, 2010.\n\n[16] M. Newman. The structure and function of complex networks. SIAM Review, 45:167\u2013256, 2003.\n\n[17] M. Newman. Analysis of weighted networks. Phys. Rev. E, 70(5):056131, Nov 2004.\n\n[18] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the Twenty-Second International Conference, volume 119 of ACM International Conference Proceeding Series, pages 713\u2013719. ACM, 2005.\n\n[19] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ICML \u201907, pages 807\u2013814, New York, NY, USA, 2007. ACM.\n\n[20] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, To appear.\n\n[21] B. Shaw and T. Jebara. Structure preserving embedding. In Proc. of the 26th International Conference on Machine Learning, 2009.\n\n[22] A. Traud, P. Mucha, and M. Porter. Social structure of Facebook networks. CoRR, abs/1102.2166, 2011.\n\n[23] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207\u2013244, 2009.\n\n[24] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, pages 505\u2013512. MIT Press, 2002.\n\n[25] J. Xu and Y. Li. Discovering disease-genes by topological features in human protein-protein interaction network. 
Bioinformatics, 22(22):2800\u20132805, 2006.\n\n[26] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD \u201909, pages 927\u2013936, New York, NY, USA, 2009. ACM.", "award": [], "sourceid": 1070, "authors": [{"given_name": "Blake", "family_name": "Shaw", "institution": null}, {"given_name": "Bert", "family_name": "Huang", "institution": null}, {"given_name": "Tony", "family_name": "Jebara", "institution": null}]}