{"title": "Density estimation from unweighted k-nearest neighbor graphs: a roadmap", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 233, "abstract": "Consider an unweighted k-nearest neighbor graph   on n points that have been sampled i.i.d. from some unknown density p on R^d. We prove how one can estimate the density p just from the unweighted adjacency matrix of the graph, without knowing the points themselves or their distance or similarity scores. The key insights are that local differences in link numbers can be used to estimate some local function of p, and that integrating this function along shortest paths leads to an estimate of the underlying density.", "full_text": "Density estimation from unweighted k-nearest neighbor graphs: a roadmap\n\nUlrike von Luxburg and Morteza Alamgir\n\nDepartment of Computer Science\nUniversity of Hamburg, Germany\n\n{luxburg,alamgir}@informatik.uni-hamburg.de\n\nAbstract\n\nConsider an unweighted k-nearest neighbor graph on n points that have been sampled\ni.i.d. from some unknown density p on R^d. We prove how one can estimate\nthe density p just from the unweighted adjacency matrix of the graph, without\nknowing the points themselves or any distance or similarity scores. The key insights\nare that local differences in link numbers can be used to estimate a local\nfunction of the gradient of p, and that integrating this function along shortest paths\nleads to an estimate of the underlying density.\n\n1 Introduction\n\nThe problem. Consider an unweighted k-nearest neighbor graph that has been built on a random\nsample X1, ..., Xn from some unknown density p on R^d. Assume we are given the adjacency matrix\nof the graph, but we do not know the point locations X1, ..., Xn or any distance or similarity scores\nbetween the points. 
Is it then possible to estimate the underlying density p, just from the adjacency\nmatrix of the unweighted graph?\nWhy is this problem interesting for machine learning? Machine learning algorithms on graphs\nare abundant, ranging from graph clustering methods such as spectral clustering, through label\npropagation methods for semi-supervised learning, to dimensionality reduction methods and manifold\nalgorithms. In the majority of applications, the graphs that are used as input are similarity graphs:\nGiven a set of abstract \u201cobjects\u201d X1, ..., Xn we first compute pairwise similarities s(Xi, Xj) according\nto some suitable similarity function and then build a k-nearest neighbor graph (kNN graph for\nshort) based on this similarity function. The intuition is that the edges in the graph encode the local\ninformation given by the similarity function, whereas the graph as a whole reveals global properties\nof the data distribution such as cluster properties, high- and low-density regions, or manifold structure.\nFrom a computational point of view, kNN graphs are convenient because they lead to a sparse\nrepresentation of the data, even more so when the graph is unweighted. From a statistical point of\nview the key question is whether this sparse representation still contains all the relevant information\nabout the original data, in particular the information about the underlying data distribution. It is easy\nto see that for suitably weighted kNN graphs this is the case: the original density can be estimated\nfrom the degrees in the graph. However, it is completely unclear whether the same holds true for\nunweighted kNN graphs.\nWhy is the problem difficult?\nThe naive attempt to estimate the density from vertex degrees\nobviously has to fail in unweighted kNN graphs because all vertex degrees are (about) k. 
Moreover,\nunweighted kNN graphs are invariant with respect to rescaling of the underlying distribution by\na constant factor (e.g., the unweighted kNN graph on a sample from the uniform distribution on\n[0, 1]^2 is indistinguishable from a kNN graph on a sample from the uniform distribution on [0, 2]^2).\nSo all we can hope for is an estimate of the density up to some multiplicative constant that cannot\nbe determined from the kNN graph alone. The main difficulty, however, is that a kNN graph \u201clooks\nthe same\u201d in every small neighborhood. To see this, consider the case where the underlying density\nis continuous, hence approximately constant in small neighborhoods. Then, if n is large and k/n is\nsmall, local neighborhoods in the kNN graph are all going to look like kNN graphs from a uniform\ndistribution. This intuition raises an important issue. It is impossible to estimate the density in\nan unweighted kNN graph by local quantities alone. We somehow have to make use of global\nproperties if we want to be successful. This makes the problem very different and much harder than\nmore standard density estimation problems.\nOur solution. We show that it is indeed possible to estimate the underlying density from an\nunweighted kNN graph. The construction is fairly involved. In a first step we estimate a pointwise\nfunction of the gradient of the density, and in a second step we integrate these estimates along\nshortest paths in the graph to end up with an approximation of the log-density. Our estimate works\nas long as the kNN graph is reasonably dense (k^(d+2)/(n^2 log^d n) \u2192 \u221e). However, it fails in the\nmore important sparser regime (e.g., k \u2248 log n). Currently we do not know whether this is due to a\nsuboptimal proof or whether density estimation is generally impossible in the sparse regime.\n\n2 Notation and assumptions\nUnderlying space. Let X \u2282 R^d be a compact subset of R^d. Denote by \u2202X the topological boundary\nof X. For \u03b5 > 0 define the \u03b5-interior X\u03b5 := {x \u2208 X | d(x, \u2202X) \u2265 \u03b5}. We assume that X is \u201cfull\ndimensional\u201d in the sense that there exists some \u03b50 > 0 such that X\u03b50 is non-empty and connected.\nBy \u03b7d we denote the volume of a d-dimensional unit ball, and by vd the volume of the intersection\nof two d-dimensional unit balls whose centers have distance 1.\nDensity. Let p be a continuously differentiable density on X. We assume that there exist constants\npmin and pmax such that 0 < pmin \u2264 p(x) \u2264 pmax < \u221e for all x \u2208 X.\nGraph. Given an i.i.d. sample Xn := {X1, ..., Xn} from p, we build a graph Gn = (Vn, En) with\nVn = Xn. We connect Xi by a directed edge to Xj if Xj is among the k-nearest neighbors of Xi.\nThe resulting graph is called the directed, unweighted kNN graph (in the following, we will often\ndrop the words \u201cdirected\u201d and \u201cunweighted\u201d). By r(x) := rn,k(x) we denote the Euclidean distance\nof a point x to its kth nearest neighbor. For any vertex x \u2208 V we define the sets\n\n$$\\mathrm{In}(x) := \\mathrm{In}_{n,k}(x) := \\{y \\in X_n \\mid (y, x) \\in E_n\\} \\quad \\text{(source points of in-links to } x\\text{)}$$\n$$\\mathrm{Out}(x) := \\mathrm{Out}_{n,k}(x) := \\{y \\in X_n \\mid (x, y) \\in E_n\\} \\quad \\text{(target points of out-links from } x\\text{).}$$\n\nTo increase readability we often omit the indices n and k. For a finite set S we denote by |S| its\nnumber of elements.\nPaths. For a rectifiable path \u03b3 : [0, 1] \u2192 X we define its p-weighted length as\n\n$$\\ell_p(\\gamma) := \\int_\\gamma p^{1/d}(x)\\, ds := \\int_0^1 p^{1/d}(\\gamma(t))\\, |\\gamma'(t)|\\, dt$$\n\n(recall the notational convention of writing \u201cds\u201d in a line integral). For two points x, y \u2208 X we\ndefine their p-weighted distance as Dp(x, y) = inf_\u03b3 \u2113p(\u03b3) where the infimum is taken over all\nrectifiable paths \u03b3 that connect x to y. 
As a consequence of the compactness of X, a minimizing\npath that realizes Dp always exists (cf. Burago et al., 2001, Section 2.5.2). We call such a path a\nDp-shortest path. Under the given assumptions on p, the Dp-shortest path between any two points\nx, y \u2208 X\u03b50 is smooth.\nIn an unweighted graph, define the length of a path as its number of edges. For two vertices\nx, y denote by Dsp(x, y) their shortest path distance in the graph. It has been proved in Alamgir\nand von Luxburg (2012) that for unweighted, undirected kNN graphs, (k/(n\u03b7d))^(1/d) Dsp(x, y) \u2192\nDp(x, y) almost surely as n \u2192 \u221e and k \u2192 \u221e appropriately slowly. The proofs extend directly to\nthe case of directed kNN graphs.\n\n3 Warmup: the 1-dimensional case\n\nTo gain some intuition about the problem and its solution, let us consider the 1-dimensional case\nX \u2282 R. For any given point x \u2208 Xn we define the following quantities:\n\n$$\\mathrm{Left}_1(x) := |\\{y \\in \\mathrm{Out}(x) \\mid y < x\\}| \\quad \\text{and} \\quad \\mathrm{Right}_1(x) := |\\{y \\in \\mathrm{Out}(x) \\mid y > x\\}|.$$\n\nFigure 1: Geometric argument (left: 1-dimensional case, right: 2-dimensional case). The difference\nRight \u2212 Left is approximately proportional to the volume of the grey-shaded area.\n\nThe intuition to estimate the density from the directed kNN graph is the following. Consider a point\nx in a region where the density has positive slope. The set Out(x) is approximately symmetric\naround x, that is, it has the form Out(x) = Xn \u2229 [x \u2212 R, x + R] for some R > 0. When the density\nhas an increasing slope at x, there tend to be fewer sample points in [x \u2212 R, x] than in [x, x + R], so\nRight1(x) tends to count more sample points than Left1(x). This is the effect we want\nto exploit. 
The difference Right1(x) \u2212 Left1(x) can be approximated by n \u00b7 (P([x, x + R]) \u2212 P([x \u2212 R, x])),\nand by a simple geometric argument one can see that the latter difference of probabilities is approximately\nR^2 p'(x). See Figure 1 (left side) for an illustration. By standard concentration arguments one can\nsee that if n is large enough and k chosen appropriately, then R \u2248 k/(2np(x)). Plugging these\ntwo things together shows that Right1(x) \u2212 Left1(x) \u2248 (k^2/(4n)) \u00b7 p'(x)/p^2(x), hence gives an\nestimate of p'(x)/p^2(x). But we are not there yet: it is impossible to directly turn an estimate of\np'(x)/p^2(x) into an estimate of p(x). This is in accordance with the intuition we mentioned above:\none cannot estimate the density by just looking at a local neighborhood of x in the kNN graph.\nHere is now the key trick to introduce a global component to the estimate. We fix one data point\nX0 that is going to play the role of an anchor point. To estimate the density at a particular data\npoint Xs, we now sum the estimates of p'(x)/p^2(x) over all data points x that sit between X0 and Xs.\nThis corresponds to integrating the function p'(x)/p^2(x) over the interval [X0, Xs] with respect to\nthe underlying density p, which in turn corresponds to integrating the function p'(x)/p(x) over the\ninterval [X0, Xs] with respect to the standard Lebesgue measure. This latter integral is well known:\nits primitive is log p(x). Hence, for each data point Xs we get an estimate of log p(Xs) \u2212 log p(X0).\nThen we exponentiate and arrive at an estimate of c \u00b7 p(x), where c = 1/p(X0) plays the role of an\nunknown constant.\n\n4 A hypothetical estimate in the d-dimensional case\n\nWe now generalize our approach to the d-dimensional setting. There are two main challenges: First,\nwe need to replace the integral over all sample points between X0 and Xs by something more general\nin R^d. 
Our idea is to consider an integral along a path between X0 and Xs, specifically along a path\nthat corresponds to a shortest path in the graph Gn. Second, we need a generalization of the concept\nof what are \u201cleft\u201d and \u201cright\u201d out-links. Our idea is to use the shortest path as reference. For a point\nx on the shortest path between X0 and Xs, the \u201cleft points\u201d of x should be the ones that are on or\nclose to the subpath from X0 to x and \u201cright points\u201d the ones on or close to the path from x to Xs.\n\n4.1 Gradient estimates based on link differences\n\nFix a point x on a simple, continuously differentiable path \u03b3 and let T(x) be its tangent vector.\nConsider h(y) = \u27e8w, y\u27e9 + b with normal vector w := T(x), where the offset b has been chosen such\nthat the hyperplane H := {y \u2208 R^d | h(y) = 0} goes through x. Define\n\n$$\\mathrm{Left}_d(x) := \\mathrm{Left}_{d,n,k}(x) := |\\{y \\in \\mathrm{Out}(x) \\mid h(y) \\le 0\\}|$$\n$$\\mathrm{Right}_d(x) := \\mathrm{Right}_{d,n,k}(x) := |\\{y \\in \\mathrm{Out}(x) \\mid h(y) > 0\\}|.$$\n\nFigure 2: Definitions of \u201cleft\u201d and \u201cright\u201d in the d-dimensional case.\n\nSee Figure 2 (left side) for an illustration. This definition is a direct generalization of the definition\nof Left1 and Right1 in the 1-dimensional case. It is not yet the end of the story, as the quantities\nLeftd and Rightd cannot be evaluated based on the kNN graph alone, but it is a good starting point to\ndevelop the necessary proof concepts. In this section we prove the consistency of a density estimate\nbased on Leftd and Rightd. 
In Section 5 we will further generalize the definition to our final estimate.\nTheorem 1 (Estimate related to the gradient) Let X and p satisfy the assumptions in Section 2.\nLet \u03b3 be a differentiable, regular, simple path in X\u03b50 and x a sample point on this path. Let T be\nthe tangent direction of \u03b3 at x and p'_T(x) the directional derivative of the density p in direction T at\npoint x. Then, if n \u2192 \u221e, k \u2192 \u221e, k/n \u2192 0, k^(d+2)/n^2 \u2192 \u221e,\n\n$$\\frac{2 n^{1/d} \\eta_d^{1/d}}{k^{(d+1)/d}} \\left( \\mathrm{Right}_{d,n,k}(x) - \\mathrm{Left}_{d,n,k}(x) \\right) \\longrightarrow \\frac{p'_T(x)}{p(x)^{(d+1)/d}} \\quad \\text{a.s.}$$\n\nIf k^(d+2)/(n^2 log^d n) \u2192 \u221e the convergence even holds uniformly over all sample points x \u2208 Xn.\nProof sketch. The key problem in the proof is that the difference Rightd \u2212 Leftd is of a much smaller\norder of magnitude than Rightd and Leftd themselves, so controlling the deviations of Rightd \u2212 Leftd\nis somewhat tricky. Conditioned on rout(x) =: r, Rightd \u223c Bin(k, \u03c0r) and Leftd \u223c Bin(k, \u03c0l),\nwhere \u03c0r = P(right half ball)/P(ball) and \u03c0l analogously (cf. Figure 2). By Hoeffding\u2019s inequality,\nRightd \u2212 Leftd \u2248 E(Rightd \u2212 Leftd) \u00b1 \u0398(\u221ak) with high probability. Note that \u03c0l and \u03c0r tend to\nbe close to 1/2, thus Hoeffding\u2019s inequality is reasonably tight. A simple geometric argument shows\nthat if the density in a neighborhood of x is linear, then E(Rightd \u2212 Leftd) = n \u00b7 (r^d \u03b7d/2) \u00b7 r p'_T(x)\n(n times the probability mass of the grey area in Figure 1). A similar argument holds approximately\nif the density is just differentiable at x. A standard concentration argument for the out-radius shows\nthat with high probability, rout(x) can be approximated by (k/(n\u03b7dp(x)))^(1/d). 
Combining all results\nwe obtain that with high probability,\n\n$$\\frac{2 n^{1/d} \\eta_d^{1/d}}{k^{(d+1)/d}} \\left( \\mathrm{Right}_d - \\mathrm{Left}_d \\right) = \\frac{p'_T(x)}{p(x)^{(d+1)/d}} \\pm \\Theta\\left( \\frac{n^{1/d}}{k^{1/2 + 1/d}} \\right).$$\n\nConvergence takes place if the noise term on the right hand side goes to 0 and the \u201chigh probability\u201d\nconverges to 1, which happens under the conditions on n and k stated in the theorem.\n\n4.2 Integrating the gradient estimates along the shortest path\n\nTo deal with the integration part, let us recap some standard results about line integrals.\nProposition 2 (Line integral) Let \u03b3 : [0, 1] \u2192 R^d be a simple, continuously differentiable path\nfrom x0 = \u03b3(0) to x1 = \u03b3(1) parameterized by arc length. For a point x = \u03b3(t) on the path, denote\nby T(x) the tangent vector to \u03b3 at x, and by p'_T(x) the directional derivative of p in the tangent\ndirection T. Then\n\n$$\\int_\\gamma \\frac{p'_T(x)}{p(x)}\\, ds = \\log(p(x_1)) - \\log(p(x_0)).$$\n\nProof. We define the vector field\n\n$$F : \\mathbb{R}^d \\to \\mathbb{R}^d, \\quad x \\mapsto \\frac{p'(x)}{p(x)} = \\frac{1}{p(x)} \\begin{pmatrix} \\partial p / \\partial x_1 \\\\ \\vdots \\\\ \\partial p / \\partial x_d \\end{pmatrix}.$$\n\nObserve that F is a continuous gradient field with primitive V : R^d \u2192 R, x \u21a6 log(p(x)). Now\nconsider the line integral of F along \u03b3:\n\n$$\\int_\\gamma F(x)\\, dx \\overset{\\text{def}}{=} \\int_0^1 \\langle F(\\gamma(t)), \\gamma'(t) \\rangle\\, dt = \\int_0^1 \\frac{1}{p(\\gamma(t))} \\langle p'(\\gamma(t)), \\gamma'(t) \\rangle\\, dt. \\tag{1}$$\n\nNote that \u03b3'(t) is the tangent vector T(x) of the path \u03b3 at point x = \u03b3(t). 
Hence, the scalar product\n\u27e8p'(\u03b3(t)), \u03b3'(t)\u27e9 coincides with the directional derivative of p in direction T, so the right hand side\nof Equation (1) coincides with the left hand side of the equation in the proposition. On the other\nhand, it is well known that the line integral over a gradient field only depends on the starting and\nend point of \u03b3 and is given by\n\n$$\\int_\\gamma F(x)\\, dx = V(x_1) - V(x_0).$$\n\nThis coincides with the right hand side of the equation in the proposition. \u25a1\nNow we consider the finite sample case. The goal is to approximate the integral along the continuous\npath \u03b3 by a sum along a path \u03b3n in the kNN graph Gn. To achieve this, we need to construct a\nsequence of paths \u03b3n in Gn such that \u03b3n converges to some well-defined path \u03b3 in the underlying\nspace and the lengths of \u03b3n in Gn converge to \u2113p(\u03b3). To this end, we are going to consider paths \u03b3n\nwhich are shortest paths in the graph.\nAdapting the proof of the convergence of shortest paths in unweighted kNN graphs (Alamgir and\nvon Luxburg, 2012) we can derive the following statement for integrals along shortest paths.\nProposition 3 (Integrating a function along a shortest path) Let X and p satisfy the assumptions\nin Section 2. Fix two sample points in X\u03b50, say X0 and Xs, and let \u03b3n be a shortest path between\nX0 and Xs in the kNN graph Gn. Let \u03b3 \u2282 X be a path that realizes Dp(X0, Xs). Assume that it\nis unique and is completely contained in X\u03b50. Let g : X \u2192 R be a continuous function. 
Then, as\nn \u2192 \u221e, k^(1+\u03b1)/n \u2192 0 (for some small \u03b1 > 0), k/log n \u2192 \u221e,\n\n$$\\left( \\frac{k}{n \\eta_d} \\right)^{1/d} \\cdot \\sum_{x \\in \\gamma_n} g(x) \\longrightarrow \\int_\\gamma g(x)\\, p(x)^{1/d}\\, ds \\quad \\text{a.s.}$$\n\nNote that if g(x)p^(1/d)(x) can be written in the form \u27e8F(\u03b3(t)), \u03b3'(t)\u27e9, then the same statement even\nholds if the shortest Dp-path is not unique, because the path integral then only depends on start and\nend point. This is the case for our particular function of interest, g(x) = p'_T(x)/p^(1+1/d)(x).\n\n4.3 Combining everything to obtain a density estimate\nTheorem 4 (Density estimate) Let X and p satisfy the assumptions in Section 2, and let X0 \u2208 X\u03b50 be\nany fixed sample point. For another sample point Xs, let \u03b3n be a shortest path between X0 and Xs\nin the kNN graph Gn. Assume that there exists a path \u03b3 that realizes Dp(X0, Xs) and that is completely\ncontained in X\u03b50. Then, as n \u2192 \u221e, k \u2192 \u221e, k/n \u2192 0, k^(d+2)/(n^2 log^d n) \u2192 \u221e,\n\n$$\\frac{2}{k} \\sum_{x \\in \\gamma_n} \\left( \\mathrm{Right}_{d,n,k}(x) - \\mathrm{Left}_{d,n,k}(x) \\right) \\longrightarrow \\log p(X_s) - \\log p(X_0) \\quad \\text{a.s.}$$\n\nProof sketch. By Proposition 2,\n\n$$\\log(p(X_s)) - \\log(p(X_0)) = \\int_\\gamma \\frac{p'_T(x)}{p(x)}\\, ds = \\int_\\gamma \\frac{p'_T(x)}{p(x)^{(d+1)/d}}\\, p(x)^{1/d}\\, ds.$$\n\nAccording to Proposition 3, the latter can be approximated by\n\n$$\\left( \\frac{k}{n \\eta_d} \\right)^{1/d} \\sum_{x \\in \\gamma_n} \\frac{p'_T(x)}{p(x)^{(d+1)/d}}$$\n\nwhere \u03b3n is a shortest path between X0 and Xs in the kNN graph. 
Theorem 1 shows that this\nquantity gets estimated by\n\n$$\\left( \\frac{k}{n \\eta_d} \\right)^{1/d} \\frac{n^{1/d}}{k^{(d+1)/d}} \\cdot 2 \\eta_d^{1/d} \\sum_{x \\in \\gamma_n} \\left( \\mathrm{Right}_d(x) - \\mathrm{Left}_d(x) \\right) = \\frac{2}{k} \\sum_{x \\in \\gamma_n} \\left( \\mathrm{Right}_d(x) - \\mathrm{Left}_d(x) \\right).$$\n\n\u25a1\n\n5 The final d-dimensional density estimate\n\nIn this section, we finally introduce an estimate that solely uses quantities available from the kNN\ngraph. Let x be a vertex on a shortest path \u03b3n,k in the kNN graph Gn. Let xl and xr be the\npredecessor and successor vertices of x on this path (in particular, xl and xr are sample points as\nwell). Define\n\n$$\\mathrm{Left}_{\\gamma_{n,k}}(x) := |\\mathrm{Out}(x) \\cap \\mathrm{In}(x_l)| \\quad \\text{and} \\quad \\mathrm{Right}_{\\gamma_{n,k}}(x) := |\\mathrm{Out}(x) \\cap \\mathrm{In}(x_r)|.$$\n\nSee Figure 2 (right side) for an illustration. At first glance, these quantities look quite different from\nLeftd and Rightd. But the intuition is that whenever we find two sets on the left and right side of\nx that have approximately the same volume, then the difference Left\u03b3n,k \u2212 Right\u03b3n,k should be a\nfunction of p'_T(x). For a second intuition consider the special case d = 1 and recall the definition of\nR of Section 3. One can show that in expectation, [x \u2212 R, x] coincides with Out(x) \u2229 In(xl) and\n[x, x + R] with Out(x) \u2229 In(xr), so in case d = 1 the definitions coincide in expectation with the\nones in Section 3. Another insight is that the quantity Left\u03b3n,k(x) counts the number of directed paths of\nlength 2 from x to xl, and Right\u03b3n,k(x) analogously.\nWe conjecture that the difference Right\u03b3n,k \u2212 Left\u03b3n,k can be used as before to construct a density\nestimate. Specifically, if \u03b3n,k is a shortest path from the anchor point X0 to Xs, we believe that\nunder similar conditions on k and n as before,\n\n$$\\frac{\\eta_d}{k v_d} \\sum_{x \\in \\gamma_n} \\left( \\mathrm{Right}_{\\gamma_{n,k}}(x) - \\mathrm{Left}_{\\gamma_{n,k}}(x) \\right) \\tag{$\\star$}$$\n\nis a consistent estimator of the quantity log p(Xs) \u2212 log p(X0). Our simulations in Section 6 show\nthat the estimate works, even surprisingly well. We do not have a formal proof yet, due\nto two technical difficulties. The first problem is that the set In(xl) is not a ball, but an \u201cegg-shaped\u201d\nset. As n \u2192 \u221e, one can sandwich In(x) between two concentric balls that converge to\neach other, but this approximation is too weak to carry the proof. To compute the expected value\nE(Right\u03b3n,k(x) \u2212 Left\u03b3n,k(x)) we would have to integrate the intersection of the \u201cegg\u201d In(xl) with\nthe ball Out(x), and so far we have no closed form solution. The second difficulty is related to the\nshortest path in the graph. While it is clear that \u201cmost edges\u201d in this path have approximately the\nmaximal length (that is, (k/(n\u03b7dp(x)))^(1/d) for an edge in the neighborhood of x), this is not true for\nall edges. Intuitively it is clear that the contribution of the few violating edges will be washed out in\nthe integral along the shortest path, but we don\u2019t have a formal proof yet.\nWhat we can prove is the following weaker version. Consider a Dp-shortest path \u03b3 \u2282 R^d and a point\nx on this path with out-radius rout(x). Define the points xl and xr as the two points where the path \u03b3\nenters resp. leaves the ball B(x, rout(x)), and define the sets Ln,k := Out(x) \u2229 B(xl, rout(x)) and\nRn,k := Out(x) \u2229 B(xr, rout(x)). Then it can be proved that\n\n$$\\frac{\\eta_d^{1/d}}{k\\, v_d^{1/d}} \\sum_{x \\in \\gamma_n} \\left( R_{n,k}(x) - L_{n,k}(x) \\right) \\longrightarrow \\log p(X_s) - \\log p(X_0).$$\n\nThe proof is similar to the one in Section 4. It circumvents the\nproblems mentioned above by using well defined balls instead of In-sets and the continuous path \u03b3\nrather than the finite sample shortest path \u03b3n, but the quantities cannot be estimated from the kNN\ngraph alone.\n\n6 Simulations\n\nAs a proof of concept, we ran simple experiments to evaluate the behavior of estimator (\u2605). We\ndraw n = 2000 points according to a couple of simple densities on R, R^2 and R^10, then we build\nthe directed, unweighted kNN graph with k = 50. We fix a random point as anchor point X0,\ncompute the quantities Right\u03b3n,k and Left\u03b3n,k for all sample points, and then sum the differences\nRight\u03b3n,k \u2212 Left\u03b3n,k along shortest paths to X0. Rescaling by the constant \u03b7d/(kvd) and exponentiating\nthen leads to our estimate for p(x)/p(X0). In order to nicely plot our results, we multiply\nthe resulting estimate by p(X0) to get rid of the scaling constant (this step would not be possible in\napplications, but it merely serves for illustration purposes). The results are shown in Figure 3. It is\nobvious from these figures that our estimate \u201cworks\u201d, even surprisingly well (note that the sample\nsize is not very large and we did not perform any parameter tuning). Even in the case of a step\nfunction the estimate recovers the structure of the density. Note that this is a particularly difficult\ncase in our setting, because within the constant parts of the two steps, the kNN graphs of the left and\nright step are indistinguishable. It is only in a small strip around the boundary between the two steps\nthat the kNN graph will reveal non-uniform behavior. The simulations show that this is already enough\nto reveal the overall structure of the step function.\n\n7 Extensions\n\nWe have seen how to estimate the density in an unweighted, directed kNN graph. 
It is even possible\nto extend this result to more general cases. Here is a sketch of the main ideas.\nEstimating the dimension from the graph. The current density estimate requires that we know\nthe dimension d of the underlying space because we need to be able to compute the constants \u03b7d\n(volume of the unit ball) and vd (intersection of two unit balls). The dimension can be estimated\nfrom the directed, unweighted kNN graph as follows. Denote by r the distance of x to its kth-nearest\nneighbor, and by K the number of vertices that can be reached from x by a directed shortest path\nof length 2. Then k/n \u2248 P(B(x, r)) and K/n \u2248 P(B(x, 2r)). If n is large enough and k small\nenough, the density on these balls is approximately constant, which implies K/k \u2248 2^d where d is\nthe dimension of the underlying space.\nRecovering the directed graph from the undirected one. The current estimate is based on the\ndirected kNN graph, but many applications use undirected kNN graphs. However, it is possible to\nrecover the directed, unweighted kNN graph from the undirected, unweighted graph. Denote by\nN(x) the vertices that have an undirected edge to x. If n is large and k small, then for any two\nvertices x and y we can approximate |N(x) \u2229 N(y)|/n \u2248 P(B(x, r) \u2229 B(y, r)). The latter is\nmonotonically decreasing in \u2016x \u2212 y\u2016. To estimate the set Out(x) in order to recover the directed\nkNN graph, we rank all points y \u2208 N(x) according to |N(x) \u2229 N(y)| and choose Out(x) as the\nfirst k vertices in this ranking.\nPoint embedding. In this paper we focus on estimating the density from the unweighted kNN graph.\nAnother interesting problem is to recover an embedding of the vertices in R^d such that the kNN graph\nbased on the embedded vertices corresponds to the given kNN graph. 
This problem is closely related\nto a classic problem in statistics, namely non-metric multidimensional scaling (Shepard, 1966, Borg\nand Groenen, 2005), and more specifically to learning distances and embeddings from ranking and\ncomparison data (Schultz and Joachims, 2004, Agarwal et al., 2007, Ouyang and Gray, 2008, McFee\nand Lanckriet, 2009, Shaw and Jebara, 2009, Shaw et al., 2011, Jamieson and Nowak, 2011) as well\nas to ordinal (monotone) embeddings (Bilu and Linial, 2005, Alon et al., 2008, B\u0103doiu et al., 2008,\nGutin et al., 2009). However, we are not aware of any approach in the literature that can faithfully\nembed unweighted kNN graphs and comes with performance guarantees. Based on our density\nestimate, such an embedding can now easily be constructed. Given the unweighted kNN graph,\nwe assign edge weights w(Xi, Xj) = (p\u0302^(\u22121/d)(Xi) + p\u0302^(\u22121/d)(Xj))/2 where p\u0302 is the estimate of the\nunderlying density. Then the shortest paths in this weighted kNN graph converge to the Euclidean\ndistances in the underlying space, and standard metric multidimensional scaling can be used to\nconstruct an appropriate embedding. In the limit of n \u2192 \u221e, this approach is going to recover the\noriginal point embedding up to similarity transformations (translation, rotation or rescaling).\n\nFigure 3: Densities and their estimates. Density model in the first row: the first dimension is sampled\nfrom a mixture of Gaussians, the other dimensions from a uniform distribution. The figures plot the\nfirst dimension of the data points versus the true (black) and estimated (green) density values. From\nleft to right, they show the case of 1, 2, and 10 dimensions, respectively. Second and third row:\n2-dimensional densities. The left plots show the true log-density (a Gaussian and a step function),\nthe middle plots show the estimated log-density. 
The right figures plot the first coordinate of the\ndata points against the true (black) and estimated (green) density values. The black star in the left\nplot depicts the anchor point X0 of the integration step.\n\n8 Conclusions\n\nIn this paper we show how a density can be estimated from the adjacency matrix of an unweighted,\ndirected kNN graph, provided the graph is dense enough (k^(d+2)/(n^2 log^d n) \u2192 \u221e). In this case, the\ninformation about the underlying density is implicitly contained in unweighted kNN graphs, and,\nat least in principle, accessible by machine learning algorithms. However, in most applications, k\nis chosen much, much smaller, typically on the order k \u2248 log(n). For such sparse graphs, our\ndensity estimate fails because it is dominated by sampling noise that does not disappear as n \u2192 \u221e.\nThis raises the question whether this failure is just an artifact of our particular construction or of\nour proof, or whether a similar phenomenon is true more generally. If yes, then machine learning\nalgorithms on sparse unweighted kNN graphs would be highly problematic: If the information about\nthe underlying density is not present in the graph, it is hard to imagine how machine learning algorithms\n(for example, spectral clustering) could still be statistically consistent. General lower bounds\nproving or disproving these speculations are an interesting open problem.\n\nAcknowledgements\n\nWe would like to thank Gabor Lugosi for help with the proof of Theorem 1. 
This research was\npartly supported by the German Research Foundation (grant LU1718/1-1 and Research Unit 1735\n\u201cStructural Inference in Statistics: Adaptation and Efficiency\u201d).\n\nReferences\nS. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In AISTATS, 2007.\n\nM. Alamgir and U. von Luxburg. Shortest path distance in random k-nearest neighbor graphs. In International Conference on Machine Learning (ICML), 2012.\n\nN. Alon, M. B\u0103doiu, E. Demaine, M. Farach-Colton, M. Hajiaghayi, and A. Sidiropoulos. Ordinal embeddings of minimum relaxation: general properties, trees, and ultrametrics. ACM Transactions on Algorithms, 4(4):46, 2008.\n\nM. B\u0103doiu, E. Demaine, M. Hajiaghayi, A. Sidiropoulos, and M. Zadimoghaddam. Ordinal embedding: approximation algorithms and dimensionality reduction. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques. Springer, 2008.\n\nY. Bilu and N. Linial. Monotone maps, sphericity and bounded second eigenvalue. Journal of Combinatorial Theory, Series B, 95(2):283\u2013299, 2005.\n\nI. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer, 2005.\n\nD. Burago, Y. Burago, and S. Ivanov. A course in metric geometry. American Mathematical Society, 2001.\n\nG. Gutin, E. Kim, M. 
Mnich, and A. Yeo. Ordinal embedding relaxations parameterized above tight lower bound. arXiv preprint arXiv:0907.5427, 2009.\n\nK. Jamieson and R. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Conference on Communication, Control, and Computing, pages 1077\u20131084, 2011.\n\nB. McFee and G. Lanckriet. Partial order embedding with multiple kernels. In International Conference on Machine Learning (ICML), 2009.\n\nH. Ouyang and A. Gray. Learning dissimilarities by ranking: from SDP to QP. In International Conference on Machine Learning (ICML), pages 728\u2013735, 2008.\n\nM. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Neural Information Processing Systems (NIPS), 2004.\n\nB. Shaw and T. Jebara. Structure preserving embedding. In International Conference on Machine Learning (ICML), 2009.\n\nB. Shaw, B. Huang, and T. Jebara. Learning a distance metric from a network. Neural Information Processing Systems (NIPS), 2011.\n\nR. Shepard. Metric structures in ordinal data. Journal of Mathematical Psychology, 3(2):287\u2013315, 1966.\n", "award": [], "sourceid": 205, "authors": [{"given_name": "Ulrike", "family_name": "Von Luxburg", "institution": "University of Hamburg"}, {"given_name": "Morteza", "family_name": "Alamgir", "institution": "University of Hamburg"}]}