{"title": "Data Skeletonization via Reeb Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 837, "page_last": 845, "abstract": "Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since  many important aspects of  data are often intrinsically low-dimensional.   Furthermore, there are many scenarios where the underlying structure is graph-like, e.g, river/road networks or various trajectories.   In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional \"skeleton\" from unorganized data using the Reeb graph.   Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or  proximity graphs.   It can also represent arbitrary graph structures in the data.  We also give  theoretical results to justify our method.  We  provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves.  We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.", "full_text": "Data Skeletonization via Reeb Graphs\n\nXiaoyin Ge\n\nIssam Safa\n\nMikhail Belkin\n\nYusu Wang\n\nComputer Science and Engineering Department\n\nThe Ohio State University\n\ngex,safa,mbelkin,yusu@cse.ohio-state.edu\n\nAbstract\n\nRecovering hidden structure from complex and noisy non-linear data is one of the\nmost fundamental problems in machine learning and statistical inference. 
While\nsuch data is often high-dimensional, it is of interest to approximate it with a low-dimensional\nor even one-dimensional space, since many important aspects of data\nare often intrinsically low-dimensional. Furthermore, there are many scenarios\nwhere the underlying structure is graph-like, e.g., river/road networks or various\ntrajectories. In this paper, we develop a framework to extract, as well as to simplify,\na one-dimensional \u201cskeleton\u201d from unorganized data using the Reeb graph.\nOur algorithm is very simple, does not require complex optimizations and can\nbe easily applied to unorganized high-dimensional data such as point clouds or\nproximity graphs. It can also represent arbitrary graph structures in the data. We\nalso give theoretical results to justify our method. We provide a number of experiments\nto demonstrate the effectiveness and generality of our algorithm, including\ncomparisons to existing methods, such as principal curves. We believe that the\nsimplicity and practicality of our algorithm will help to promote skeleton graphs\nas a data analysis tool for a broad range of applications.\n\n1 Introduction\n\nLearning or inferring a hidden structure from discrete samples is a fundamental problem in data\nanalysis, ubiquitous in a broad range of application fields. With the rapid generation of diverse data\nall across science and engineering, extracting geometric structure is often a crucial first step towards\ninterpreting the data at hand, as well as the underlying process or phenomenon. Recently, there has\nbeen a large amount of research in this direction, especially in the machine learning community.\n\nIn this paper, we consider a simple but important scenario, where the hidden space has a graph-like\ngeometric structure, such as the branching filamentary structures formed by blood vessels. Our\ngoal is to extract such structures from points sampled on and around them. 
Graph-like geometric\nstructures arise naturally in many fields, both in modeling natural phenomena and in understanding\nabstract procedures and simulations. However, there has been only limited work on obtaining a\ngeneral-purpose algorithm to automatically extract skeleton graph structures [2]. In this paper, we\npresent such an algorithm by bringing in a topological concept called the Reeb graph to extract\nskeleton graphs. Our algorithm is simple, efficient and easy to use. We demonstrate the generality\nand effectiveness of our algorithm via several applications in both low and high dimensions.\n\nMotivation. Geometric graphs are the underlying structures for modeling many natural phenomena,\nfrom river/road networks and root systems of trees to blood vessels and particle trajectories. For\nexample, if we are interested in obtaining the road network of a city, we may send out cars to explore\nvarious streets of the city, with each car recording its position using a GPS device. The resulting\ndata is a set of potentially noisy points sampled from the roads in a city. Given these data, the goal\nis to automatically reconstruct the road network, which is a graph embedded in a two-dimensional\nspace. Indeed, abundant data of this type are available at the OpenStreetMap project website [1].\n\nGeometric graphs also arise from many modeling processes, such as molecular simulations. They\ncan sometimes provide a natural platform to study a collection of time-series data, where each time-series\ncorresponds to a trajectory in the feature space. These trajectories converge and diverge,\nwhich can be represented by a graph. This graph in turn can then be used as a starting point for\nfurther processing (such as matching) or inference tasks.\n\nGenerally, there are a number of scenarios where we wish to extract a one-dimensional skeleton\nfrom an input space. 
The goal in this paper is to develop, as well as to demonstrate the use of, a\npractical and general algorithm to extract a graph structure from input data of any dimension.\n\nNew work. Given a set of points P sampling a hidden domain X, we present a simple and practical\nalgorithm to extract a skeleton graph G for X. The input points P do not have to be embedded \u2013 we\nonly need their distance matrix or simply a proximity graph as input to our algorithm.\n\nOur algorithm is based on using the so-called Reeb graph to model skeleton graphs. Given a continuous\nfunction f : X \u2192 \u211d, the Reeb graph tracks the connected components in the level set f^{-1}(a)\nof f as we vary the value a. It provides a meaningful abstraction of the scalar field f, and has been\nwidely used in graphics, visualization, and computer vision (see [6] for a survey). However, it has\nnot yet been used as a tool to analyze unorganized high-dimensional data. By\nbringing the concept of the Reeb graph to machine learning applications, we can leverage the recent\nalgorithms developed to compute and process Reeb graphs [15, 9]. Moreover, combining the Reeb\ngraph with the so-called Rips complex allows us to obtain theoretical guarantees for our algorithm.\n\nOur algorithm is simple and efficient. There is only one parameter involved, which intuitively specifies\nthe scale at which we look at the data. Our algorithm always outputs a graph G given data.\nFurthermore, it also computes a map \u03a6 : P \u2192 G, which maps each sample point to G. Hence we\ncan decompose the input data into sets, each corresponding to a single branch in the skeleton graph.\nFinally, there is a canonical way to measure the importance of features in the Reeb graph, which allows\nus to easily simplify the resulting graph. We summarize our contributions as follows:\n\n(1) We bring in Reeb graphs to the learning community for analyzing high dimensional unorganized\ndata sets. 
We developed accompanying software to not only extract, but also process skeleton\ngraphs from data. Our algorithm is simple and robust, always extracting a graph from the input. Our\nalgorithm complements principal curve algorithms and can be used in combination with them.\n\n(2) We provide certain theoretical guarantees for our algorithm. We also demonstrate both the effectiveness\nof our software and the usefulness of skeleton graphs via a set of experiments on diverse\ndatasets. Experimental results show that despite being simple and general, our algorithm compares\nfavorably to existing graph-extracting algorithms in various settings.\n\nRelated work. At a broad level, the graph-extraction problem is related to manifold learning and\nnon-linear dimensionality reduction, which has a rich literature; see e.g. [4, 24, 25, 27]. Manifold\nlearning methods typically assume that the hidden domain has a manifold structure. An even more\ngeneral scenario is that the hidden domain is a stratified space, which intuitively can be thought of\nas a collection of manifolds (strata) glued together. Recently, there have been several approaches to\nlearn stratified spaces [5, 14]. However, this general problem is hard and requires algorithms both\nmathematically sophisticated and computationally intensive. In our case, we aim to learn a graph\nstructure, which is simply a one-dimensional stratified space, allowing for simple approaches.\n\nThe most relevant previous work related to our graph-extraction problem is a series of results on an\nelegant concept of principal curves, originally proposed by Hastie and Stuetzle [16, 17]. Intuitively,\nprincipal curves are \u201cself-consistent\u201d curves that pass through the middle of the data. Since its original\nintroduction, there has been much work on analyzing and extending the concept and algorithms\nas well as on numerous applications. 
See, e.g., [7, 11, 10, 19, 22, 26, 28, 29] among many others.\nBelow we discuss the results most relevant to the current work.\n\nOriginal principal curves are simple smooth curves with no self-intersections. In [19], K\u00e9gl et al.\nrepresented principal curves as polygonal lines, and proposed a regularized version of principal\ncurves. They gave a practical algorithm to compute such a polygonal principal curve. This algorithm\nwas later extended in [18] into a principal graph algorithm to compute the skeleton graph\nof hand-written digits and characters. To the best of our knowledge, this was the first algorithm to\nexplicitly allow self-intersections in the output principal curves. However, this principal graph algorithm\ncould only handle 2D images. Very recently in [22], Ozertem and Erdogmus proposed a new\ndefinition for the principal curve associated to the probability density function. Intuitively, imagining\nthe probability density function as a terrain, their principal curves are the mountain ridges. A\nrigorous definition can be made in terms of the Hessian of the probability density. Their approach\nhas several nice properties, including connections to the popular mean-shift clustering algorithm. It\nalso allows for certain bifurcations and self-intersections. However, the output of the algorithm is\nonly a collection of points with neither connectivity information, nor the information about which\npoints are junction points (graph nodes) and which points belong to the same arc in the principal\ngraph. Furthermore, the algorithm depends on reliable density estimation from input data, which is\na challenging task for high dimensional data.\n\nAanjaneya et al. [2] recently proposed perhaps the first general algorithm to approximate a hidden\nmetric graph from an input graph with theoretical guarantees. While the goal of [2] is to approximate\na metric graph, their algorithm can also be used to skeletonize data. 
The algorithm relies on\ninspecting the local neighborhood of each point to first classify whether it should be a \u201cbranching\npoint\u201d or an \u201cedge point\u201d. Although this approach has theoretical guarantees when the sampling is\nnice and the parameters are chosen correctly, it is often hard to find suitable parameters in practice,\nand such local decisions tend to be less reliable when the input data are not as nice (such as a \u201cfat\u201d\njunction region). In the section on experimental results we show that our algorithm tends to be more\nrobust in practical applications.\n\nFinally we note that the concept of the Reeb graph has been used in a number of applications in\ngraphics, visualization, and computer vision (see [6] for a survey). However, it has been typically\nused with mesh structures rather than as a tool for analyzing unorganized point cloud data, especially\nin high dimensions, where constructing meshes is prohibitively expensive. An exception is the very\nrecent work [20], where the authors propose to use the Reeb graph for point cloud data and show\napplications for several data sets, still in 3D. The advantage of our approach is that it is based on\nthe Rips complex, which allows for a general and cleaner Reeb graph reconstruction algorithm with\ntheoretical justification (see [9, 15] and Theorem 3.1).\n\n2 Reeb Graphs\n\nWe now give a very brief description of the Reeb graph; see Section VI.4 of [12] for a more formal\ndiscussion of it. Let f : X \u2192 \u211d be a continuous function defined on a domain X. For each\nscalar value a \u2208 \u211d, the level set f^{-1}(a) = {x \u2208 X | f(x) = a} may have multiple connected\ncomponents. The Reeb graph of f, denoted by R_f(X), is obtained by continuously identifying\nevery connected component in a level set to a single point. 
In other words, R_f(X) is the image of a\ncontinuous surjective map \u03a6 : X \u2192 R_f(X) where \u03a6(x) = \u03a6(y) if and only if x and y come from\nthe same connected component of a level set of f.\n\nIntuitively, as the value a increases, connected components in\nthe level set f^{-1}(a) appear, disappear, split and merge, and the\nReeb graph of f tracks such changes. The Reeb graph is an abstract\ngraph. Its nodes indicate changes in the connected components\nin level sets, and each arc represents the evolution of a\nconnected component before it is merged, killed, or split. See\nthe right figure for an example, where we show (an embedding\nof) the Reeb graph of the height function f defined on a topological\ntorus. The Reeb graph R_f(X) provides a simple yet\nmeaningful abstraction of the input domain X w.r.t. the function f.\n\n[Figure: the Reeb graph R_f(X) of the height function f on a torus X; the map \u03a6 sends points x and y in the same level-set component to \u03a6(x) = \u03a6(y), and z to \u03a6(z).]\n\nComputation in discrete setting. Assume the input domain is modeled by a simplicial complex\nK. Specifically, a k-dimensional simplex \u03c3 is simply the convex hull of k + 1 affinely independent\npoints {v_0, ..., v_k}, and any simplex formed by a subset of its vertices is called a face of \u03c3. A\nsimplicial complex K is a collection of simplices with the property that if a simplex \u03c3 is in K, then\nany face of it is also in K. A piecewise-linear (PL) function f defined on K is a function with values\ngiven at vertices of K and linearly interpolated within each simplex in K. Given a PL function f on\nK, its Reeb graph R_f(K) is determined by all the 0-, 1- and 2-simplices from K, which are the vertices,\nedges, and triangles of K. 
Hence from now on we use only 2-dimensional simplicial complexes.\n\nGiven a PL function defined on a simplicial complex\ndomain K, its Reeb graph can be computed\nefficiently in O(n log n) expected time by a simple\nrandomized algorithm [15], where n is the size\nof K. 
In fact, the algorithm outputs the so-called\naugmented Reeb graph R, which contains the image\nof all vertices in K under the surjection map\n\u03a6 : K \u2192 R introduced earlier. See the figure on the right: the Reeb graph (middle) is an abstract graph\nwith four nodes, while the augmented Reeb graph (on the right) shows the image of all vertices (i.e.,\nthe points ~p_i). From the augmented Reeb graph R, we can easily extract junction points (graph nodes), the\nset of points from the input data that should be mapped to each graph arc, as well as the connectivity\nbetween these points along the Reeb graph (e.g., ~p_1, ~p_4, ~p_7 form one arc between ~p_1 and ~p_7).\n\n[Figure: a PL function f on a complex with vertices p_0, ..., p_8 (left), its Reeb graph (middle), and its augmented Reeb graph with vertex images ~p_0, ..., ~p_8 (right).]\n\n3 Method\n\n3.1 Basic algorithm\n\nStep 1: Set up complex K. The input data we consider can be a set of points sampled from a\nhidden domain or a probabilistic distribution, or it can be the distance matrix, or simply the proximity\ngraph, among a set of points. (So the input points do not have to be embedded.) Our goal is to\ncompute (possibly an embedding of) a skeleton graph from the input data. First, we construct an\nappropriate space approximating the hidden domain that input points are sampled from. We use a\nsimplicial complex K to model such a space.\n\nSpecifically, given input sampled points P and the distance matrix of P, we first construct a proximity\ngraph based on either r-neighborhood or k-nearest neighbors (NN) information; that is, a point\np \u2208 P is connected either to all its neighbors within distance r of p, or to its k-NNs. We add all\npoints in P and all edges from this proximity graph to the simplicial complex K we are building.\nNext, for any three vertices p_1, p_2, p_3 \u2208 P, if they are pairwise connected in the proximity graph,\nwe add the triangle \u25b3p_1p_2p_3 to K. 
Note that if the proximity graph is already given as the input,\nthen we simply fill in a triangle whenever all its three edges are in the proximity graph to obtain\nthe target simplicial complex K. We remark that there is only one parameter involved in the basic\nalgorithm, which is the parameter r (if we use r-neighborhood) or k (if we use k-NN) to specify the\nscale with which we look at the input data.\n\nMotivation behind this construction. If the proximity graph is built based on r-neighborhood,\nthen the above construction is simply that of the so-called Vietoris-Rips complex, which has been\nwidely used in the manifold reconstruction (especially surface reconstruction) community to recover the\nhidden domain from its point samples. Intuitively, imagine that we grow a ball of radius r around\neach sample point. The union of these balls roughly captures the hidden domain at scale r. On the\nother hand, the topological structure of the union of these balls is captured by the so-called \u010cech\ncomplex, which mathematically is the nerve of this union of balls. Hence the \u010cech complex captures\nthe topology of the hidden domain when the sampling is reasonable (see e.g., [8, 21]). However, the\n\u010cech complex is hard to compute, and the Vietoris-Rips complex is a practical approximation of the\n\u010cech complex that is much easier to construct. Furthermore, it has been shown that the Reeb graph\nof a hidden manifold can be approximated with theoretical guarantees from the Rips complex [9].\n\nStep 2: Reeb graph computation. Now we have a simplicial complex K that approximates\nthe hidden domain. In order to extract the skeleton graph using the Reeb\ngraph, we need to define a function g on K that respects its shape. It is also desirable\nthat this function is intrinsic, given that input points may not be embedded. To this\nend, we construct the function g as the geodesic distance in K to a certain base point\nb \u2208 K. 
We compute the base point by taking an arbitrary point v \u2208 K and choosing b\nas the point farthest away from v. Intuitively, this base point is an extreme point. If\nthe underlying domain indeed has a branching filamentary structure, then the geodesic distance to b\ntends to progress along each filament, and branch out at junction points. See the right figure for an\nexample, where the thin curves are level sets of the geodesic distance function to the base point b.\n\nFigure 1: Overview of the algorithm. The input points are light (yellow) shades beneath dark curves.\n(Left): the augmented Reeb graph output by our algorithm. (Center): after iterative smoothing.\n(Right): final output after repairing missing links (e.g., top box) and simplification (lower box).\n\nSince the Reeb graph tracks the evolution of the connected components in the level sets, a branching\n(splitting in the level set) will happen when the level set passes through point v.\nIn our algorithm, the geodesic distance function g to b in K is approximated by the shortest distance\nin the proximity graph (i.e., the set of edges in K) to b. We then perform the algorithm from [15] to\ncompute the Reeb graph of K with respect to g, and denote the resulting Reeb graph as R. Recall\nthat this algorithm in fact outputs the augmented Reeb graph R. Hence we not only obtain a graph\nstructure, but also the set of input points (together with their connectivity) that are mapped to every\ngraph arc in this graph structure.\n\nTime complexity. The time complexity of the basic algorithm is the sum of the times to compute\n(A) the proximity graph, (B) the complex K from the proximity graph, (C) the geodesic distance and\n(D) the Reeb graph. (A) is O(n^2) for high dimensional data (and can be made near-linear for data\nin very low dimensions) where n is the number of input points. (B) is O(k^3 n) if each point takes\nk neighbors. 
(C) and (D) take time O(m log n) = O(k^3 n log n) where m is the size of K. Hence\noverall, the time complexity is O(n^2 + k^3 n log n). For high dimensional data sets, this is dominated\nby the O(n^2) computation of the proximity graph.\n\nTheoretical guarantees. Given a domain X and a function f : X \u2192 \u211d defined on it, the topology\n(i.e., the number of independent loops) of the Reeb graph R_f(X) may not reflect that of the given\ndomain X. However, in our case, we have the following result which offers a partial theoretical\nguarantee for the basic algorithm. Intuitively, the theorem states that if the hidden space is a graph G,\nand if our simplicial complex K approximates G both in terms of topology (as captured by homotopy\nequivalence) and metric (as captured by the \u03b5-approximation), then the Reeb graph captures all loops\nin G. Below, d_Y(\u00b7, \u00b7) denotes the geodesic distance in domain Y.\n\nTheorem 3.1 Suppose K is homotopy equivalent to a graph G, and h : K \u2192 G is the corresponding\nhomotopy. Assume that the metric is \u03b5-approximated under h; that is, |d_K(x, y) \u2212 d_G(h(x), h(y))| \u2264 \u03b5\nfor any x, y \u2208 K. Let R be the Reeb graph of K w.r.t. the geodesic distance function to an arbitrary\nbase point b \u2208 K. If \u03b5 < l/4, where l is the length of the shortest arc in G, then there is a\none-to-one correspondence between loops in R and loops in G.\n\nThe proof can be found in the full version [13]. It relies on results and observations from [9]. The\nabove result can be made even stronger: (i) There is not only a one-to-one correspondence between\nloops in R and in G; the ranges of each pair of corresponding loops are also close. Here, the range\nof a loop \u03b3 w.r.t. a function f is the interval [min_{x \u2208 \u03b3} f(x), max_{x \u2208 \u03b3} f(x)]. (ii) The condition\n\u03b5 < l/4 can be relaxed. 
Furthermore, even when \u03b5 does not satisfy this condition, the reconstructed\nReeb graph R can still preserve all loops in G whose range is larger than 2\u03b5.\n\n3.2 Embedding and Visualization\n\nThe Reeb graph is an abstract graph. To visualize the skeleton graph, we need to embed it in a\nreasonable way that reflects the geometry of the hidden domain. To this end, if points are not already\nembedded in 2D or 3D, we project the input points P to \u211d^3 using any standard dimensionality\nreduction algorithm. We then connect projected points based on their connectivity given in the\naugmented Reeb graph R. Each arc of the Reeb graph is now embedded as a polygonal curve. To\nfurther improve the quality of this curve, we fix its endpoints, and iteratively smooth it by repeatedly\nassigning a point\u2019s position to be the average of its neighbors\u2019 positions. See Figure 1 for an example.\n\n3.3 Further Post-processing\n\nIn practice, data can be noisy, and there may be spurious branches or loops in the Reeb graph R\nconstructed no matter how we choose parameter r or k to decide the scale. Following [3], there is\na natural way to define \u201cfeatures\u201d in a Reeb graph and measure their \u201cimportance\u201d. Specifically,\ngiven a function f : X \u2192 \u211d, imagine we plot its Reeb graph R_f(X) such that the height of each\npoint z \u2208 R_f(X) is the function value of all those points in X mapped to z. Now we sweep the\nReeb graph bottom-up in increasing order of the function values. As we sweep through a point z,\nwe inspect what happens to the part of the Reeb graph that we already swept, denoted by\nR_f^z := {w \u2208 R_f(X) | f(w) \u2264 f(z)}. When we sweep past a down-fork saddle s, there are two possibilities:\n\n(i). The two branches merged by s belong to different connected components, say\nC_1 and C_2, in R_f^s. In such case, we have a branch-feature, where two disjoint\nlower-branches in R_f^s will be merged at s. 
The importance of this feature is\nthe smaller height of the lower-branches being merged. Intuitively, this is the\namount we have to perturb function f in order to remove this branch-feature. See\nthe right figure, where the height h of C_2 is the importance of this branch-feature.\n\n(ii). The two branches merged by s are already connected below s in R_f^s. In such\ncase, when s connects them again, we create a family of new loops. This is called a\nloop-feature. Its size is measured as the smallest height of any loop formed by s in R_f^s,\nwhere the height of a loop \u03b3 is defined as max_{z \u2208 \u03b3} f(z) \u2212 min_{z \u2208 \u03b3} f(z). See the right\nfigure, where the dashed loop \u03b3 is the thinnest loop created by s.\n\nNow if we sweep R_f(X) top-down, we will also obtain branch-features and loop-features captured\nby up-fork saddles in a symmetric manner. It turns out that these features (and their sizes) correspond\nto the so-called extended persistence of the Reeb graph R_f(X) with respect to function f [12]. The\nsize of each feature is called its persistence, as it indicates how persistent this feature is as we perturb\nthe function f. These features and their persistence can be computed in O(n log^2 n) time, where\nn is the number of nodes and arcs in the Reeb graph [3]. We can now simplify the Reeb graph by\nmerging features whose persistence value is smaller than a given threshold. This simplification step\nnot only removes noise, but can also be used as a way to look at features at larger scales.\n\nFinally, there may also be missing data causing missing links in the constructed skeleton graph.\nHence in post-processing the user can also choose to first fill some missing links before the simplification\nstep. This is achieved by connecting pairs of degree-1 nodes (x, y) in the Reeb graph\nwhose distance d(x, y) is smaller than a given distance threshold. 
Here d(x, y) is the input distance\nbetween x and y (if the input points are embedded, or the distance matrix is given), not the distance\nin the simplicial complex K constructed by our algorithm. Connecting x and y may either connect\ntwo disjoint components in the Reeb graph, thus creating new branch-features, or form new loop-features.\nSee Figure 1. We do not check the size of the new features created when connecting pairs\nof vertices. Small newly-created features will be removed in the subsequent simplification step.\n\n4 Experimental Results\n\nIn this section we first provide comparisons of our algorithm to three existing methods. We then\npresent three sets of experiments to demonstrate the effectiveness of our software and show potential\napplications of skeleton graph extraction for data analysis.\n\nExperimental Comparisons. We compare our approach with three existing comparable algorithms:\n(1) the principal graph algorithm (PGA) [18]; (2) the local-density principal curve algorithm\n(LDPC) [22]; and (3) the metric-graph reconstruction algorithm (MGR) [2]. Note that PGA only\nworks for 2D images. LDPC only outputs a point cloud at the center of the input data with no connectivity\ninformation.\n\nIn the figure on the right, we show the skeleton\ngraph of the image of a hand-written\nChinese character. Our result is shown in\n(a). PGA [18] is shown in (b), while the\noutput of (the KDE version of) LDPC [22]\nis shown in (c). We see that the algorithm\nfrom [18], specifically designed for\nthese 2-D applications, provides the best\noutput. However, the results of our algorithm,\nwhich is completely generic, are\ncomparable. On the other hand, the output of LDPC is a point cloud (rather than a graph). In this\nexample, many points do not belong to the 1D structure (see footnote 1). We do not show the results from MGR [2]\nas we were not able to produce a satisfactory result for this data using MGR even after tuning the\nparameters. 
However, note that the goal of their algorithm is to approximate a graph metric, which\nis different from extracting a skeleton graph.\n\n[Figure panels: (a) Our output; (b) PGA [18]; (c) LDPC [22].]\n\nFor the second set of comparisons\nwe build a skeleton\ngraph out of an input\nmetric graph. Note that\nPGA and LDPC cannot\nhandle such graph-type input,\nand the only comparable\nalgorithm is MGR\n[2]. We use the image-web\ndata previously used in [2]. Figure (a) on the right is our output and (b) is the output by MGR [2].\nThe input image web graph is shown in light (gray) color in the background. Finally, to provide an\nadditional comparison we apply MGR to image edge detection: (c) shows the reconstructed\nedges for the nautilus image used earlier in Figure 1. To be fair, MGR does not provide an embedding,\nso we should focus on comparing graph structure. Still, MGR collapses the center of the nautilus\ninto a single point, while our algorithm is able to recover the structure more accurately (see footnote 2).\n\n[Figure panels: (a) Our output; (b) MGR [2]; (c) MGR on nautilus.]\n\nFigure 2: (a) & (b): Edge detection in images. (c) Sharp feature curves detection.\n\nWe now proceed with three examples of our algorithms applied to different datasets.\n\nExample 1: Image edge detection and surface feature curve reconstruction.\nIn edge detection\nfrom images, it is often easy to identify (disconnected) points potentially lying on an edge. We can\nthen use our Reeb-graph algorithm to connect them into curves. See Figure 1 and 2 (a), (b) for\nsome examples. The yellow (light) shades are input potential edge-points computed by a standard\nedge-detection algorithm based on the Roberts edge filter. Original images are given in the full version\n[13]. In Figure 2 (c), we are given a set of points sampled from a hidden surface model (gray points),\nand the goal is to extract (sharp) feature curves from it automatically. 
We first identify points lying\naround sharp feature lines using a local differential operator (yellow points) and apply our algorithm\nto connect them into feature lines/graphs (dark curves).\n\nFootnote 1: Larger \u03c3 for kernel density estimation fixes that problem but causes important features to disappear.\nFootnote 2: Tuning the parameters of MGR does not seem to help, see the full version [13] for details.\n\nExample 2: Speech data. The input speech data contains utterances of single digits by different\nspeakers. Each utterance is sampled every 10 msec with a moving frame. Each sample is represented\nby the first 13 coefficients resulting from running the standard Perceptual Linear Prediction (PLP)\nalgorithm on the wave file. Given this setup, each utterance of a digit is a trajectory in this 13D\nfeature space.\n\nIn the left panel, we show the trajectory of an utterance\nof digit \u20181\u2019 projected to \u211d^3. The right panel\nshows the graph reconstructed by our algorithm by\ntreating the input simply as a set of points (i.e., removing\nthe time sequence information). No post-processing\nis performed. Note the main portion of\nthe utterance (the large loop) is well-reconstructed.\nThe cluster of points in the right side corresponds to sampling of silence at the beginning and end of\nthe utterance. 
This indicates that our algorithm can potentially be used to automatically reconstruct\ntrajectories when the time information is lost.\n\nNext, we combine three utterances of the digit \u20181\u2019\nand construct the graph from the resulting point\ncloud shown in the left panel. Each color represents\nthe point cloud coming from one utterance of \u20181\u2019.\nAs shown in the right panel, the graph reconstructed\nby our algorithm automatically aligns these three utterances\n(curves) in the feature space: well-aligned\nsubcurves are merged into single pieces along the graph skeleton, while divergent portions\nappear as branches and loops in the graph (see the loops on the left side of this picture). We expect that\nour methods can be used to produce a summary representation for multiple similar trajectories\n(low- and high-dimensional curves), both to align trajectories with no time information and to discover\nconvergent and divergent portions of the trajectories.\n\nExample 3: Molecular simulation. The input is molecular simulation data generated using the replica-exchange\nmolecular dynamics method [23]. 
It contains 250K protein conformations, generated by\n20 simulation runs, each of which produces a trajectory in the protein conformational space.\n\nThe figure on the right shows a 3D projection\nof the Reeb graph constructed by our algorithm.\nInterestingly, filamentary structures can be seen at\nthe beginning of the simulation, corresponding to\nthe 20 trajectories at a high energy level. As the\nsimulation proceeds, these different simulation\nruns converge and about 40% of the data points\nare concentrated in the oval on the right of the\nfigure, which corresponds to low-energy conformations. Ideally, simulations at low energy should\nprovide a good sampling of the protein conformational space around the native structure of this\nprotein. However, it turns out that there are several large loops in the Reeb graph close to the native\nstructure (the conformation with the lowest energy). Such loop features could be of interest for further\ninvestigation.\n\nCombining with principal curve algorithms. Finally, our algorithm can be used in combination\nwith principal curve algorithms. In particular, one way is to use our algorithm to first decompose\nthe input data into different arcs of a graph structure, and then use a principal curve algorithm to\ncompute an embedding of each arc in the center of the points contributing to it. Alternatively, we can\nfirst use the LDPC algorithm [22] to move points to the center of the data, and then apply our\nalgorithm to connect them into a graph structure. Some preliminary results on such a combination\napplied to hand-written Chinese characters can be found in the full version [13].\n\nAcknowledgments. The authors thank D. Chen and U. Ozertem for kindly providing their software\nand for help with using the software. This work was supported in part by the NSF under\nCCF-0747082, CCF-1048983, CCF-1116258, IIS-1117707, IIS-0643916.\n\nReferences\n[1] Open street map. http://www.openstreetmap.org/.\n[2] M. 
Aanjaneya, F. Chazal, D. Chen, M. Glisse, L. Guibas, and D. Morozov. Metric graph reconstruction from noisy data. In Proc. 27th Sympos. Comput. Geom., 2011.\n[3] P. K. Agarwal, H. Edelsbrunner, J. Harer, and Y. Wang. Extreme elevation on a 2-manifold. Discrete and Computational Geometry (DCG), 36(4):553\u2013572, 2006.\n[4] M. Belkin and P. Niyogi. Laplacian Eigenmaps for dimensionality reduction and data representation. Neural Comp, 15(6):1373\u20131396, 2003.\n[5] P. Bendich, B. Wang, and S. Mukherjee. Local homology transfer and stratification learning. In ACM-SIAM Sympos. Discrete Alg., 2012. To appear.\n[6] S. Biasotti, D. Giorgi, M. Spagnuolo, and B. Falcidieno. Reeb graphs for shape analysis and applications. Theor. Comput. Sci., 392:5\u201322, February 2008.\n[7] K. Chang and J. Grosh. A unified model for probabilistic principal surfaces. IEEE Trans. Pattern Anal. Machine Intell., 24(1):59\u201364, 2002.\n[8] F. Chazal, D. Cohen-Steiner, and A. Lieutier. A sampling theory for compact sets in Euclidean space. Discrete Comput. Geom., 41(3):461\u2013479, 2009.\n[9] T. K. Dey and Y. Wang. Reeb graphs: Approximation and persistence. In Proc. 27th Sympos. Comput. Geom., pages 226\u2013235, 2011.\n[10] D. Dong and T. J. McAvoy. Nonlinear principal component analysis based on principal curves and neural networks. Computers & Chemical Engineering, 20:65\u201378, 1996.\n[11] T. Duchamp and W. Stuetzle. Extremal properties of principal curves in the plane. The Annals of Statistics, 24(4):1511\u20131520, 1996.\n[12] H. Edelsbrunner and J. Harer. Computational Topology, An Introduction. Amer. Math. Society, 2010.\n[13] X. Ge, I. Safa, M. Belkin, and Y. Wang. Data skeletonization via Reeb graphs, 2011. Full version at www.cse.ohio-state.edu/~yusu.\n[14] G. Haro, G. Randall, and G. Sapiro. Translated Poisson mixture model for stratification learning. 
International Journal of Computer Vision, 80(3):358\u2013374, 2008.\n[15] W. Harvey, Y. Wang, and R. Wenger. A randomized O(m log m) time algorithm for computing Reeb graphs of arbitrary simplicial complexes. In Proc. 26th Sympos. Comput. Geom., pages 267\u2013276, 2010.\n[16] T. J. Hastie. Principal curves and surfaces. PhD thesis, Stanford University, 1984.\n[17] T. J. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502\u2013516, 1989.\n[18] B. K\u00e9gl and A. Krzy\u017cak. Piecewise linear skeletonization using principal curves. IEEE Trans. Pattern Anal. Machine Intell., 24:59\u201374, January 2002.\n[19] B. K\u00e9gl, A. Krzy\u017cak, T. Linder, and K. Zeger. Learning and design of principal curves. IEEE Trans. Pattern Anal. Machine Intell., 22:281\u2013297, 2000.\n[20] M. Natali, S. Biasotti, G. Patan\u00e8, and B. Falcidieno. Graph-based representations of point clouds. Graphical Models, 73(5):151\u2013164, 2011.\n[21] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete Comput. Geom., 39(1-3):419\u2013441, 2008.\n[22] U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12:1249\u20131286, 2011.\n[23] I.-H. Park and C. Li. Dynamic ligand-induced-fit simulation via enhanced conformational samplings and ensemble dockings: A survivin example. J. Phys. Chem. B., 114:5144\u20135153, 2010.\n[24] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323\u20132326, 2000.\n[25] B. Sch\u00f6lkopf, A. Smola, and K.-R. M\u00fcller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10:1299\u20131319, 2000.\n[26] Derek Stanford and Adrian E. Raftery. 
Finding curvilinear features in spatial point patterns: Principal curve clustering with noise. IEEE Trans. Pattern Anal. Machine Intell., 22(6):601\u2013609, 2000.\n[27] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319\u20132323, 2000.\n[28] R. Tibshirani. Principal curves revisited. Statistics and Computing, 2:183\u2013190, 1992.\n[29] J. J. Verbeek, N. Vlassis, and B. Kr\u00f6se. A k-segments algorithm for finding principal curves. Pattern Recognition Letters, 23(8):1009\u20131017, 2002.\n", "award": [], "sourceid": 559, "authors": [{"given_name": "Xiaoyin", "family_name": "Ge", "institution": null}, {"given_name": "Issam", "family_name": "Safa", "institution": null}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": null}, {"given_name": "Yusu", "family_name": "Wang", "institution": null}]}