{"title": "Learning metrics for persistence-based summaries and applications for graph classification", "book": "Advances in Neural Information Processing Systems", "page_first": 9859, "page_last": 9870, "abstract": "Recently a new feature representation and data analysis methodology based on a topological tool called persistent homology (and its persistence diagram summary) has gained much momentum. A series of methods have been developed to map a persistence diagram to a vector representation so as to facilitate the downstream use of machine learning tools. In these approaches, the importance (weight) of different persistence features are usually pre-set. However often in practice, the choice of the weight-function should depend on the nature of the specific data at hand. It is thus highly desirable to learn a best weight-function (and thus metric for persistence diagrams) from labelled data. We study this problem and develop a new weighted kernel, called WKPI, for persistence summaries, as well as an optimization framework to learn the weight (and thus kernel). We apply the learned kernel to the challenging task of graph classification, and show that our WKPI-based classification framework obtains similar or (sometimes significantly) better results than the best results from a range of previous graph classification frameworks on a collection of benchmark datasets.", "full_text": "Learning metrics for persistence-based summaries\n\nand applications for graph classi\ufb01cation\n\nQi Zhao\n\nYusu Wang\n\nzhao.2017@osu.edu\n\nyusu@cse.ohio-state.edu\n\nComputer Science and Engineering Department\n\nThe Ohio State University\n\nColumbus, OH 43221\n\nAbstract\n\nRecently a new feature representation framework based on a topological tool called\npersistent homology (and its persistence diagram summary) has gained much\nmomentum. 
A series of methods have been developed to map a persistence diagram to a vector representation so as to facilitate the downstream use of machine learning tools. In these approaches, the importance (weight) of different persistence features is usually pre-set. However, often in practice the choice of the weight-function should depend on the nature of the specific data at hand. It is thus highly desirable to learn a best weight-function (and thus a metric for persistence diagrams) from labelled data. We study this problem and develop a new weighted kernel, called WKPI, for persistence summaries, as well as an optimization framework to learn the weight (and thus the kernel). We apply the learned kernel to the challenging task of graph classification, and show that our WKPI-based classification framework obtains similar or (sometimes significantly) better results than the best results from a range of previous graph classification frameworks on benchmark datasets.

1 Introduction

In recent years a new data analysis methodology based on a topological tool called persistent homology has started to gain momentum. Persistent homology is one of the most important developments in the field of topological data analysis, and there have been fundamental developments both on the theoretical front (e.g, [23, 10, 13, 8, 14, 5]) and on algorithms / implementations (e.g, [43, 4, 15, 20, 29, 3]). At a high level, given a domain X with a function f : X → R on it, persistent homology summarizes "features" of X across multiple scales simultaneously in a single summary called the persistence diagram (see the second picture in Figure 1). A persistence diagram consists of a multiset of points in the plane, where each point p = (b, d) intuitively corresponds to the birth-time (b) and death-time (d) of some (topological) feature of X w.r.t. f.
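As a concrete illustration of the birth/death bookkeeping just described (and of the bottom-up sweep shown later in Figure 2(a)), the 0-dimensional persistence pairs of a function sampled along a path can be computed with a union-find sweep. The sketch below is our own illustration, not the paper's code; the function name and the convention of pairing the last surviving component with the global maximum are assumptions.

```python
def sublevel_pairs_0d(f):
    """0-dim persistence pairs of the sublevel-set sweep of a function on a
    path x1 - x2 - ... - xn: a local minimum gives birth to a component, and
    when two components merge the younger one (larger birth value) dies."""
    n = len(f)
    order = sorted(range(n), key=lambda i: f[i])   # sweep in increasing f
    parent = list(range(n))
    birth = {}
    active = [False] * n
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        active[i] = True
        birth[i] = f[i]
        for j in (i - 1, i + 1):                   # neighbours on the path
            if 0 <= j < n and active[j]:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                old, young = (ri, rj) if birth[ri] <= birth[rj] else (rj, ri)
                if birth[young] < f[i]:            # skip zero-persistence pairs
                    pairs.append((birth[young], f[i]))
                parent[young] = old
    # pair the surviving component with the global maximum, as in Figure 2(a)
    pairs.append((f[order[0]], f[order[-1]]))
    return sorted(pairs)
```

For example, for the sampled values f = [0, 3, 1, 4, 2, 5] the sweep yields the pairs (0, 5), (1, 3) and (2, 4).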
Hence it provides a concise representation of X, capturing multi-scale features of it simultaneously. Furthermore, the persistent homology framework can be applied to complex data (e.g, 3D shapes, or graphs), and different summaries can be constructed by putting different descriptor functions on the input data. For these reasons, a new persistence-based feature vectorization and data analysis framework (Figure 1) has become popular. Specifically, given a collection of objects, say a set of graphs modeling chemical compounds, one can first convert each object to a persistence-based representation. The input data can now be viewed as a set of points in a persistence-based feature space. Equipping this space with an appropriate distance or kernel, one can then perform downstream data analysis tasks (e.g, clustering).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: A persistence-based data analysis framework.

The original distances for persistence diagram summaries unfortunately do not lend themselves easily to machine learning tasks. Hence in the last few years, starting from the persistence landscape [7], a series of methods have been developed to map a persistence diagram to a vector representation to facilitate machine learning tools [41, 1, 33, 12, 35]. Recent ones include the Persistence Scale-Space kernel [41], Persistence Images [1], the Persistence Weighted Gaussian kernel (PWGK) [33], the Sliced Wasserstein kernel [12], and the Persistence Fisher kernel [34].
In these approaches, when computing the distance or kernel between persistence summaries, the importance (weight) of different persistence features is often pre-determined.
In persistence images [1] and PWGK [33], the importance of having a weight-function for the birth-death plane (containing the persistence points) has been emphasized and explicitly included in the formulation of their kernels. However, before using these kernels, the weight-function needs to be pre-set.
On the other hand, as recognized by [26], the choice of the weight-function should depend on the nature of the specific type of data at hand. For example, for the persistence diagrams computed from atomic configurations of molecules, features with small persistence could capture the local packing patterns which are of utmost importance and thus should be given a larger weight; while in many other scenarios, small persistence corresponds to noise of low importance. However, in general, researchers performing data analysis tasks may not have such prior insights on the input data. Thus it is natural and highly desirable to learn a best weight-function from labelled data.

Our work. We study the problem of learning an appropriate metric (kernel) for persistence summaries from labelled data, and apply the learnt kernel to the challenging graph classification task.
(1) Metric learning for persistence summaries: We propose a new weighted kernel (called WKPI) for persistence summaries, based on the persistence-image representation. Our WKPI kernel is positive semi-definite and its induced distance is stable. The weight-function used in this kernel directly encodes the importance of different locations in the persistence diagram. We next model the metric-learning problem for persistence summaries as the problem of learning (the parameters of) this weight-function from a certain function class. In particular, the metric-learning is formulated as an optimization problem over a specific cost function we propose.
This cost function has a simple matrix view which helps both conceptually clarify its meaning and simplify the implementation of its optimization.
(2) Graph classification application: Given a set of objects with class labels, we first learn a best WKPI-kernel as described above, and then use the learned WKPI to further classify objects. We implemented this WKPI-classification framework, and apply it to a range of graph data sets. Graph classification is an important problem, and there has been a large literature on developing effective graph representations (e.g, [25, 40, 2, 32, 44, 47, 38], including the very recent persistent-homology enhanced WL-kernel [42]) and graph neural networks (e.g, [48, 39, 46, 45, 35, 31]) to classify graphs. The problem is challenging as graph data are less structured. We apply our WKPI-classification framework to various benchmark graph data sets as well as new neuron-cell data sets. Our learnt WKPI performs consistently better than other persistence-based kernels. Most importantly, when compared with existing state-of-the-art graph classification frameworks, our framework shows similar or (sometimes significantly) better performance than the best results by existing approaches in almost all cases.
We note that [26] is the first to recognize the importance of using labelled data to learn a task-optimal representation of topological signatures. They developed an end-to-end deep neural network for this purpose, using a novel and elegant design of the input layer to implicitly learn a task-specific

Figure 2: (a): As we sweep the curve bottom-up in increasing f-values, at certain critical moments new 0-th homological features (connected components) are created, or destroyed (i.e, components merge).
For example, a component is created when passing x4 and killed when passing x6, giving\nrise to the persistence-point (f4, f6) in the persistence diagram (fi := f (xi)). (b) shows the graph of\na persistence surface (where z-axis is the function \u03c1A), and (c) is its corresponding persistence image.\n\nrepresentation. Very recently, in a parallel and independent development of our work, Carri\u00e8re\net al. [11] built an interesting new neural network based on the DeepSet architecture [49], which\ncan achieve an end-to-end learning for multiple persistence representations in a uni\ufb01ed manner.\nCompared to these developments, we instead explicitly formulate the metric-learning problem for\npersistence-summaries, and decouple the metric-learning (which can also be viewed as representation-\nlearning) component from the downstream data analysis tasks. Also as shown in Section 4, our\nWKPI-classi\ufb01cation framework (using SVM) achieves better results on graph classi\ufb01cation datasets.\n\n2 Persistence-based framework\n\nWe \ufb01rst give an informal description of persistent homology below. See [22] for more detailed\nexposition on the subject.\nSuppose we are given a shape X (in our later graph classi\ufb01cation application, X is a graph).\nImagine we inspect X through a \ufb01ltration of X, which is a sequence of growing subsets of X:\nX1 \u2286 X2 \u2286 \u00b7\u00b7\u00b7 \u2286 Xn = X. As we scan X, sometimes a new feature appears in Xi, and\nsometimes an existing feature disappears upon entering Xj. Using the topological object called\nhomology classes to describe these features (intuitively components, independent loops, voids, and\ntheir high dimensional counter-parts), the birth and death of topological features can be captured by\nthe persistent homology, in the form of a persistence diagram DgX. 
Speci\ufb01cally, for each dimension\nk, DgkX consists of a multi-set of points in the plane (which we call the birth-death plane R2): each\npoint (b, d) in it, called a persistence-point, indicates that a certain k-dimensional homological feature\nis created upon entering Xb and destroyed upon entering Xd. In the remainder of the paper, we often\nomit the dimension k for simplicity: when multiple dimensions are used for persistence features, we\nwill apply our construction to each dimension and concatenate the resulting vector representations.\nA common way to obtain a meaningful \ufb01ltration of X is via the sublevel-set \ufb01ltration induced by\na descriptor function f on X. More speci\ufb01cally, given a function f : X \u2192 R, let X\u2264a := {x \u2208\nX | f (x) \u2264 a} be its sublevel-set at a. Let a1 < a2 < \u00b7\u00b7\u00b7 < an be n real values. The sublevel-set\n\ufb01ltration w.r.t. f is: X\u2264a1 \u2286 X\u2264a2 \u2286 \u00b7\u00b7\u00b7 \u2286 X\u2264an ; and its persistence diagram is denoted by Dgf.\nEach persistence-point p = (ai, aj) \u2208 Dgf indicates the function values when some topological\nfeatures are created (when entering X\u2264ai) and destroyed (in X\u2264aj ), and the persistence of this\nfeature is its life-time pers(p) = |aj \u2212 ai|. See Figure 2 (a) for a simple example where X = R. If\none sweeps X top-down in decreasing function values, one gets the persistence diagram induced by\nthe super-levelset \ufb01ltration of X w.r.t. f in an analogous way. Finally, if one tracks the change of\ntopological features in the levelset f\u22121(a), one obtains the so-called levelset zigzag persistence [9]\n(which contains the information captured by the extended persistence [17]).\n\nGraph Setting. Given a graph G = (V, E), a descriptor function f de\ufb01ned on V or E will induce\na \ufb01ltration and its persistence diagrams. Suppose f : V \u2192 R is de\ufb01ned on the node set of G\n(e.g, the node degree). 
Then we can extend f to E by setting f(u, v) = max{f(u), f(v)}, and the sublevel-set at a is defined as G≤a := {σ ∈ V ∪ E | f(σ) ≤ a}. Similarly, if we are given f : E → R, then we can extend f to V by setting f(u) = min_{u∈e, e∈E} f(e). When scanning G via the sublevel-set filtration of f, connected components in the swept subgraphs will be created and merged, and new cycles will be created. The former events are encoded in the 0-dimensional persistence diagram. For the 1-dimensional features (cycles), however, we note that cycles, once created, are never killed, as they are present in the total space X = G. To this end, we use the so-called extended persistence introduced in [17], which can record information about cycles.
Now given a collection of shapes Ξ, we can compute a persistence diagram DgX for each X ∈ Ξ, which maps the set Ξ to a set of points in the space of persistence diagrams. There are natural distances defined for persistence diagrams, including the bottleneck distance and the Wasserstein distance, both of which have been well studied (e.g, stability under them [16, 18, 14]) with efficient implementations available [27, 28]. However, to facilitate downstream machine learning tasks, it is desirable to further map the persistence diagrams to another "vector" representation. Below we introduce one such representation, called the persistence image [1], as our new kernel is based on it.

Let A be a persistence diagram (containing a multiset of persistence-points). Following [1], set T : R2 → R2 to be the linear transformation¹ where for each (x, y) ∈ R2, T(x, y) = (x, y − x). Let T(A) be the transformed diagram of A.
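To make the graph setting concrete, the following sketch (ours, under the stated max-extension; not the paper's implementation) computes the 0-dimensional persistence pairs of the sublevel-set filtration of a node function. The essential component (which never dies) is omitted, and 1-dimensional cycle information would require the extended persistence mentioned above.

```python
def graph_sublevel_pairs_0d(f, edges):
    """0-dim persistence of a graph sublevel-set filtration, with a node
    function f (dict: node -> value) extended to edges by
    f(u, v) = max(f(u), f(v)).  Union-find over edges in increasing value."""
    parent = {v: v for v in f}
    birth = dict(f)                       # each node is born at its own value

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    pairs = []
    for u, v in sorted(edges, key=lambda e: max(f[e[0]], f[e[1]])):
        t = max(f[u], f[v])               # value at which the edge enters G_<=a
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                      # edge closes a cycle (1-dim feature)
        old, young = (ru, rv) if birth[ru] <= birth[rv] else (rv, ru)
        if birth[young] < t:              # skip zero-persistence pairs
            pairs.append((birth[young], t))
        parent[young] = old
    return sorted(pairs)
```

For instance, on a 4-cycle with node values {a: 0, b: 2, c: 1, d: 3}, the component born at c (value 1) dies when edge (b, c) enters at value 2, giving the single pair (1, 2); the last edge of the cycle creates a 1-dimensional feature instead.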
Let φu : R2 → R be a differentiable probability distribution with mean u ∈ R2 (e.g, the normalized Gaussian where for any z ∈ R2, φu(z) = (1/(2πτ^2)) e^(−‖z−u‖^2/(2τ^2))).

Definition 2.1 ([1]) Let α : R2 → R be a non-negative weight-function for the persistence plane R2. Given a persistence diagram A, its persistence surface ρA : R2 → R (w.r.t. α) is defined as: for any z ∈ R2, ρA(z) = Σ_{u∈T(A)} α(u) φu(z).
The persistence image is a discretization of the persistence surface. Specifically, fix a grid on a rectangular region in the plane with a collection P of N rectangles (pixels). The persistence image for a diagram A is PIA = { PI[p] }_{p∈P}, consisting of N numbers (i.e, a vector in R^N), one for each pixel p in the grid P, with PI[p] := ∫∫_p ρA dy dx.

The persistence image can be viewed as a vector in R^N. One can then compute the distance between two persistence diagrams A1 and A2 by the L2-distance ‖PI1 − PI2‖2 between their persistence images (vectors) PI1 and PI2. The persistence images have several nice properties, including stability guarantees; see [1] for more details.

3 Metric learning frameworks
Suppose we are given a set of n objects Ξ (sampled from a hidden data space S), classified into k classes. We want to use these labelled data to learn a good distance for (persistence image representations of) objects from Ξ which hopefully is more appropriate for classifying objects in the data space S.
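As an illustration, the persistence-image construction of Definition 2.1 can be sketched as below; the grid size, bounds, default τ, and the one-sample-per-pixel approximation of the pixel integral are our own simplifying choices, not those of [1].

```python
import math

def persistence_image(diagram, grid=(20, 20), bounds=(0.0, 1.0, 0.0, 1.0),
                      tau=0.05, weight=lambda x, y: 1.0):
    """Transform each (b, d) to (b, d - b), spread a Gaussian of width tau
    (weighted by the weight-function alpha) around each transformed point,
    and integrate the resulting persistence surface over every pixel
    (approximated here by one sample at the pixel center)."""
    nx, ny = grid
    x0, x1, y0, y1 = bounds
    dx, dy = (x1 - x0) / nx, (y1 - y0) / ny
    pts = [(b, d - b) for (b, d) in diagram]        # the transformation T
    img = []
    for j in range(ny):
        for i in range(nx):
            cx = x0 + (i + 0.5) * dx                # pixel center
            cy = y0 + (j + 0.5) * dy
            rho = sum(weight(ux, uy)
                      * math.exp(-((cx - ux) ** 2 + (cy - uy) ** 2)
                                 / (2 * tau ** 2))
                      / (2 * math.pi * tau ** 2)
                      for (ux, uy) in pts)          # persistence surface
            img.append(rho * dx * dy)               # pixel integral (midpoint)
    return img
```

With the constant weight α = 1, each diagram point contributes unit mass, so the entries of the image of a one-point diagram (away from the boundary) sum to roughly 1.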
To do so, below we propose a new persistence-based kernel for persistence images, and then formulate an optimization problem to learn the best weight-function so as to obtain a good distance metric for Ξ (and the data space S).

3.1 Weighted persistence image kernel (WKPI)
From now on, we fix the grid P (of size N) used to generate persistence images (so a persistence image is a vector in R^N). Let ps be the center of the s-th pixel in P, for s ∈ {1, 2, ..., N}. We now propose a new kernel for persistence images. A weight-function refers to a non-negative real-valued function on R2.

¹In fact, we can define our kernel without transforming the persistence diagram. We use the transformation simply to follow the same convention as persistence images.

Definition 3.1 Let ω : R2 → R be a weight-function. Given two persistence images PI and PI′, the (ω-)weighted persistence image kernel (WKPI) is defined as: kω(PI, PI′) := Σ_{s=1..N} ω(ps) e^(−(PI(s)−PI′(s))^2/(2σ^2)).

Remark 0: We could use persistence surfaces (instead of persistence images) to define the kernel (with the summation replaced by an integral). Since for computational purposes one still needs to approximate the integral in the kernel via some discretization, we choose to present our work using persistence images directly. Our Lemma 3.2 and Theorem 3.4 still hold (with a slightly different stability bound) if we use the kernel defined for persistence surfaces.
Remark 1: One can choose the weight-function from different function classes. Two popular choices are: mixtures of m 2D Gaussians; and degree-d polynomials in two variables.
Remark 2: There are other natural choices for defining a weighted kernel for persistence images. For example, we could use k(PI, PI′) = Σ_{s=1..N} e^(−ω(ps)(PI(s)−PI′(s))^2/(2σ^2)), which we refer to as altWKPI. Alternatively, one could use the weight function used in the PWGK kernel [33] directly. Indeed, we have implemented all these choices, and our experiments show that our WKPI kernel leads to better results than these choices for almost all datasets (see Supplement Section 2). In addition, note that the PWGK kernel [33] contains cross terms ω(x) · ω(y) in its formulation, meaning that there is a quadratic number of terms (w.r.t. the number of persistence points) in the kernel, making it more expensive to compute and learn for complex objects (e.g, for the neuron data set, a single neuron tree could produce a persistence diagram with hundreds of persistence points).

Lemma 3.2 The WKPI kernel is positive semi-definite.

The rather simple proof of the above lemma is in Supplement Section 1.1. By Lemma 3.2, the WKPI kernel gives rise to a Hilbert space. We can now introduce the WKPI-distance, which is the pseudo-metric induced by the inner product on this Hilbert space.

Definition 3.3 Given two persistence diagrams A and B, let PIA and PIB be their corresponding persistence images. Given a weight-function ω : R2 → R, the (ω-weighted) WKPI-distance is: Dω(A, B) := sqrt( kω(PIA, PIA) + kω(PIB, PIB) − 2 kω(PIA, PIB) ).

Stability of WKPI-distance. Given two persistence diagrams A and B, two traditional distances between them are the bottleneck distance dB(A, B) and the p-th Wasserstein distance dW,p(A, B). The stability of these two distances w.r.t. changes of the input objects or of functions defined on them has been studied [16, 18, 14]. Similar to the stability study on persistence images, below we prove that the WKPI-distance is stable w.r.t. small perturbations of persistence diagrams as measured by dW,1.
(Very informally, view two persistence diagrams A and B as two (appropriate) measures (with special care taken for the diagonals); dW,1(A, B) is then roughly the "earth-mover" distance to convert the measure corresponding to A into that for B.)
To simplify the presentation of Theorem 3.4, we use unweighted persistence images w.r.t. the Gaussian, meaning that in Definition 2.1, (1) the weight function α is the constant function α = 1; and (2) the distribution φu is the Gaussian φu(z) = (1/(2πτ^2)) e^(−‖z−u‖^2/(2τ^2)). (Our result below can be extended to the case where φu is not Gaussian.) The proof of the theorem below follows from results of [1] and can be found in Supplement Section 1.2.

Theorem 3.4 Given a weight-function ω : R2 → R, set cw = ‖ω‖∞ = sup_{z∈R2} ω(z). Given two persistence diagrams A and B, with corresponding persistence images PIA and PIB, we have that: Dω(A, B) ≤ sqrt(20 cw / π) · (1/(σ·τ)) · dW,1(A, B), where σ is the width of the Gaussian used to define our WKPI kernel (Def. 3.1), and τ is that of the Gaussian φu used to define persistence images (Def. 2.1).

Remarks: We can obtain a more general bound for the case where the distribution φu is not Gaussian. Furthermore, we can obtain a similar bound when our WKPI-kernel and its induced WKPI-distance are defined using persistence surfaces instead of persistence images.

3.2 Optimization problem for metric-learning
Suppose we are given a collection of objects Ξ = {X1, . . . , Xn} (sampled from some hidden data space S), already classified (labeled) into k classes C1, . . . , Ck. In what follows, we say that i ∈ Cj if Xi has class-label j. We first compute the persistence diagram Ai for each object Xi ∈ Ξ.
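As a sketch of how the learned metric is evaluated on such a collection, the WKPI kernel (Definition 3.1) and its induced distance (Definition 3.3) can be written directly from the persistence images; the function names and the default σ are illustrative, not the paper's code.

```python
import math

def wkpi_kernel(PI1, PI2, centers, omega, sigma=0.1):
    """k_omega(PI, PI') = sum_s omega(p_s) * exp(-(PI(s)-PI'(s))^2 / (2 sigma^2)),
    where centers[s] is the center p_s of the s-th pixel."""
    return sum(omega(c) * math.exp(-(a - b) ** 2 / (2 * sigma ** 2))
               for c, a, b in zip(centers, PI1, PI2))

def wkpi_distance(PI1, PI2, centers, omega, sigma=0.1):
    """The pseudo-metric induced by the WKPI kernel (Definition 3.3)."""
    k11 = wkpi_kernel(PI1, PI1, centers, omega, sigma)
    k22 = wkpi_kernel(PI2, PI2, centers, omega, sigma)
    k12 = wkpi_kernel(PI1, PI2, centers, omega, sigma)
    return math.sqrt(max(0.0, k11 + k22 - 2 * k12))
```

Identical images are at distance 0; with the constant weight ω = 1, a pixel whose values differ by much more than σ contributes close to its full weight to the squared distance.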
(The precise filtration we use to do so will depend on the specific type of objects. Later in Section 4, we will describe the filtrations used for graph data.) Let {A1, . . . , An} be the resulting set of persistence diagrams. Given a weight-function ω, its induced WKPI-distance between Ai and Aj can also be thought of as a distance for the original objects Xi and Xj; that is, we can set Dω(Xi, Xj) := Dω(Ai, Aj). Our goal is to learn a good distance metric for the data space S (where Ξ is sampled from) from the labels. We will formulate this as learning a best weight-function ω* so that its induced WKPI-distance fits the class-labels of the Xi's best. Specifically, for any t ∈ {1, 2, ..., k}, set:

cost_ω(t, t) = Σ_{i,j∈Ct} Dω^2(Ai, Aj);   and   cost_ω(t, ·) = Σ_{i∈Ct, j∈{1,2,...,n}} Dω^2(Ai, Aj).

Intuitively, cost_ω(t, t) is the total in-class (square) distance for Ct; while cost_ω(t, ·) is the total distance from objects in class Ct to all objects in Ξ. A good metric should lead to relatively smaller distance between objects from the same class, but larger distance between objects from different classes. We thus propose the following optimization problem, which is related to k-way spectral clustering where the distance for an edge (Ai, Aj) is Dω^2(Ai, Aj):

Definition 3.5 (Optimization problem) Given a weight-function ω : R2 → R, the total-cost of its induced WKPI-distance over Ξ is defined as: TC(ω) := Σ_{t=1..k} cost_ω(t, t) / cost_ω(t, ·). The optimal distance problem aims to find the best weight-function ω* from a certain function class F so that the total-cost is minimized; that is: TC* = min_{ω∈F} TC(ω); and ω* = argmin_{ω∈F} TC(ω).

Matrix view of optimization problem. We observe that our cost function can be re-formulated into a matrix form. This provides us with a perspective from the Laplacian matrix of certain graphs to understand the cost function, and helps to simplify the implementation of our optimization problem, as several programming languages popular in machine learning (e.g Python and Matlab) handle matrix operations more efficiently (than using loops). More precisely, recall our input is a set Ξ of n objects with labels from k classes. We set up the following matrices:

Λ = [Λij]_{n×n}, where Λij = Dω^2(Ai, Aj) for i, j ∈ {1, 2, ..., n};
G = [gij]_{n×n}, where gij = Σ_{ℓ=1..n} Λiℓ if i = j, and gij = 0 if i ≠ j;
L = G − Λ;
H = [hti]_{k×n}, where hti = 1/sqrt(cost_ω(t, ·)) if i ∈ Ct, and hti = 0 otherwise.

Viewing Λ as the distance matrix of objects {X1, . . . , Xn}, L is then its Laplacian matrix. We have the following main theorem, which essentially is similar to the trace-minimization view of k-way spectral clustering (see e.g, Section 6.5 of [30]). The proof for our specific setting is in Supplement 1.3.
Theorem 3.6 The total-cost can also be represented by TC(ω) = k − Tr(HLH^T), where Tr(·) is the trace of a matrix.
Furthermore, HGH^T = I, where I is the k × k identity matrix.

Note that all matrices, L, G, Λ, and H, are dependent on the (parameters of the) weight-function ω, and in the following corollary of Theorem 3.6, we use the subscript ω to emphasize this dependence.

Corollary 3.7 The Optimal distance problem is equivalent to: min_ω ( k − Tr(Hω Lω Hω^T) ), subject to Hω Gω Hω^T = I.

Solving the optimization problem. In our implementation, we use (stochastic) gradient descent to find a (locally) optimal weight-function ω* for the minimization problem. Specifically, given a collection of objects Ξ with labels from k classes, we first compute their persistence diagrams via appropriate filtrations, and obtain a resulting set of persistence diagrams {A1, . . . , An}. We then aim to find the best parameters for the weight-function ω* to minimize Tr(HLH^T) = Σ_{t=1..k} ht L ht^T subject to HGH^T = I (via Corollary 3.7). For example, assume that the weight-function ω is from the class F of mixtures of m 2D non-negatively weighted (spherical) Gaussians. Each weight-function ω : R2 → R in F is thus determined by the 4m parameters {xr, yr, σr, wr | r ∈ {1, 2, ..., m}}, with ω(z) = Σ_{r=1..m} wr e^(−((zx−xr)^2 + (zy−yr)^2)/σr^2). We then use (stochastic) gradient descent to find the best parameters to minimize Tr(HLH^T) subject to HGH^T = I. Note that the set of persistence diagrams / images is fixed through the optimization process.
From the proof of Theorem 3.6 (in Supplement 1.3), it turns out that the condition HGH^T = I is satisfied as long as the multiplicative weight wr of each Gaussian in the mixture is non-negative. Hence during the gradient descent, we only need to make sure that this holds.²
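A minimal, self-contained sketch of this procedure (assuming the Gaussian-mixture weight class, a finite-difference gradient in place of the closed-form matrix gradients, and clamping of the wr in place of the soft penalty; all names and defaults are illustrative, not the paper's code):

```python
import math

def mixture_weight(params):
    """omega(z) = sum_r w_r * exp(-((z_x-x_r)^2 + (z_y-y_r)^2) / sigma_r^2);
    params is a flat list [x_r, y_r, sigma_r, w_r, ...] (4 per Gaussian)."""
    def omega(z):
        s = 0.0
        for r in range(0, len(params), 4):
            xr, yr, sr, wr = params[r:r + 4]
            s += wr * math.exp(-((z[0] - xr) ** 2 + (z[1] - yr) ** 2) / sr ** 2)
        return s
    return omega

def total_cost(params, images, centers, labels, k, sigma=0.1):
    """TC(omega) = sum_t cost(t,t)/cost(t,.) over squared WKPI-distances,
    with D^2 = 2 * sum_s omega(p_s) * (1 - exp(-(PI(s)-PI'(s))^2/(2 sigma^2)))."""
    omega = mixture_weight(params)
    w = [omega(c) for c in centers]
    n = len(images)
    D2 = [[2 * sum(ws * (1 - math.exp(-(a - b) ** 2 / (2 * sigma ** 2)))
                   for ws, a, b in zip(w, images[i], images[j]))
           for j in range(n)] for i in range(n)]
    tc = 0.0
    for t in range(k):
        idx = [i for i in range(n) if labels[i] == t]
        within = sum(D2[i][j] for i in idx for j in idx)
        total = sum(D2[i][j] for i in idx for j in range(n))
        tc += within / total
    return tc

def gradient_step(params, images, centers, labels, k, lr=0.1, eps=1e-5):
    """One finite-difference gradient-descent step; keeps each w_r >= 0."""
    g = []
    for i in range(len(params)):
        p_hi = list(params)
        p_lo = list(params)
        p_hi[i] += eps
        p_lo[i] -= eps
        g.append((total_cost(p_hi, images, centers, labels, k)
                  - total_cost(p_lo, images, centers, labels, k)) / (2 * eps))
    new = [p - lr * gi for p, gi in zip(params, g)]
    for r in range(3, len(new), 4):       # clamp the multiplicative weights w_r
        new[r] = max(new[r], 0.0)
    return new
```

In practice one would use the matrix form and stochastic mini-batches discussed in the text; this loop only illustrates the objective being minimized.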
It is easy to write out the gradient of TC(ω) w.r.t. each parameter {xr, yr, σr, wr | r ∈ {1, 2, ..., m}} in matrix form. For example, ∂TC(ω)/∂xr = −Σ_{t=1..k} ( (∂ht/∂xr) L ht^T + ht (∂L/∂xr) ht^T + ht L (∂ht^T/∂xr) ), where ht = [ht1, ht2, ..., htn] is the t-th row vector of H. While this does not improve the asymptotic complexity of computing the gradient (compared to using the formulation of the cost function in Definition 3.5), these matrix operations can be implemented much more efficiently than using loops in languages such as Python and Matlab. For large data sets, we use stochastic gradient descent, by sampling a subset of s ≪ n input persistence images, and computing the matrices H, D, L, G as well as the cost using the subsampled data points. The time complexity of one iteration in updating the parameters is O(s^2 N), where N is the size of a persistence image (recall, each persistence image is a vector in R^N). In our implementation, we use the Armijo-Goldstein line search scheme to update the parameters in each (stochastic) gradient descent step. The optimization procedure terminates when the cost function converges or the number of iterations exceeds a threshold. Overall, the time complexity of our optimization procedure is O(R s^2 N), where R is the number of iterations, s is the minibatch size, and N is the size (# pixels) of a single persistence image.

4 Experiments

We show the effectiveness of our metric-learning framework and the use of the learned metric via graph classification applications. In particular, given a set of graphs Ξ = {G1, . . .
, Gn} coming from k classes, we first compute the unweighted persistence images Ai for each graph Gi, and apply the framework from Section 3.1 to learn the "best" weight-function ω* : R2 → R on the birth-death plane R2 using these persistence images {A1, . . . , An} and their labels. We then perform graph classification using kernel-SVM with the learned ω*-WKPI kernel. We refer to this framework as the WKPI-classification framework. We show two sets of experiments. Section 4.1 shows that our learned WKPI kernel significantly outperforms existing persistence-based representations. In Section 4.2, we compare the performance of the WKPI-classification framework with various state-of-the-art methods for the graph classification task over a range of data sets. More details / results can be found in Supplement Section 2.

Setup for our WKPI-based framework. In all our experiments, we assume that the weight-function comes from the class F of mixtures of m 2D non-negatively weighted Gaussians, as described at the end of Section 3.2. We take m and the width σ in our WKPI kernel as hyperparameters: specifically, we search among m ∈ {3, 4, 5, 6, 7, 8} and σ ∈ {0.001, 0.01, 0.1, 1, 10, 100}. A 10 × 10-fold nested cross-validation is applied to evaluate our algorithm: there are 10 folds in the outer loop for evaluation of the model with the selected hyperparameters, and 10 folds in the inner loop for hyperparameter tuning. We then repeat this process 10 times (although the results are extremely close whether repeating 10 times or not).
Our optimization procedure terminates when the change of the cost function remains ≤ 10^−4 or the iteration number exceeds 2000.

² In our implementation, we add a penalty term Σ_{r=1..m} c/exp(wr) to the total-cost k − Tr(HLH^T), to achieve this in a "soft" manner.

Table 1: Classification accuracy on the neuron datasets (PWGK, PI-PL, SW: existing approaches; trainPWGK, altWKPI: alternative metric learning; WKPI-kM, WKPI-kC: our results).

Datasets | PWGK | PI-PL | SW | trainPWGK | altWKPI | WKPI-kM | WKPI-kC
NEURON-BINARY | 80.5±0.4 | 83.7±0.3 | 85.3±0.7 | 82.1±2.1 | 84.6±2.4 | 89.6±2.2 | 86.4±2.4
NEURON-MULTI | 45.1±0.3 | 44.2±0.3 | 57.6±0.6 | 54.3±2.3 | 49.7±2.4 | 56.6±2.7 | 59.3±2.3
Average | 62.80 | 63.95 | 71.45 | 68.20 | 67.15 | 73.10 | 72.85

One important question is how to initialize the centers of the Gaussians in our mixture. There are three strategies that we consider. (1) We simply sample m centers in the domain of the persistence images randomly. (2) We collect all points in the persistence diagrams {A1, . . . , An} derived from the training data Ξ, and perform a k-means algorithm to identify m means. (3) We perform a k-center algorithm on those points to identify m centers. Strategies (2) and (3) usually outperform strategy (1). Thus in what follows we only report results from using k-means and k-centers as initialization, referred to as WKPI-kM and WKPI-kC, respectively.

4.1 Comparison with other persistence-based methods

We compare our methods with state-of-the-art persistence-based representations, including the Persistence Weighted Gaussian Kernel (PWGK) [33], the original Persistence Image (PI) [1], and the Sliced Wasserstein (SW) Kernel [12].
Furthermore, as mentioned in Remark 2 after Definition 3.1, we can learn weight functions in PWGK by optimizing the same cost function (replacing our WKPI-distance with the one computed from the PWGK kernel); we refer to this as trainPWGK. We can also use an alternative kernel for persistence images, as described in Remark 2, and then optimize the same cost function using the distance computed from this kernel; we refer to this as altWKPI. We compare our methods both with the existing approaches and with these two alternative metric-learning approaches (trainPWGK and altWKPI).

Generation of persistence diagrams. Neuron cells have a natural tree morphology, rooted at the cell body (soma) with dendrites and axon branching out, and are commonly modeled as geometric trees; see Figure 1 in the Supplement for an example. Given a neuron tree T, following [36], we use the descriptor function f : T → R where f(x) is the geodesic distance from x to the root of T along the tree. To differentiate the dendrite and axon parts of a neuron cell, we further negate the function value if a point x is in the dendrite. We then use the union of the persistence diagrams AT induced by both the sublevel-set and superlevel-set filtrations w.r.t. f. Under these filtrations, intuitively, each point (b, d) in the birth-death plane R2 corresponds to the creation and death of a certain branch feature of the input neuron tree. The set of persistence diagrams obtained this way (one per neuron tree) is the input to our WKPI-classification framework.

Results on neuron datasets. The Neuron-Binary dataset consists of 1126 neuron trees from two classes, while Neuron-Multi contains 459 neurons from four classes. As the number of trees is not large, we use all training data to compute the gradients in the optimization process instead of mini-batch sampling.
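The neuron descriptor function described above can be sketched as follows; the tree encoding (adjacency lists, per-edge lengths, and a per-node dendrite flag) is a hypothetical representation chosen for illustration.

```python
from collections import deque

def neuron_descriptor(adj, lengths, root, in_dendrite):
    """Geodesic distance from each node to the root, negated on the dendrite.

    adj:         node -> list of neighbor nodes (tree edges)
    lengths:     (u, v) -> edge length (stored for both orientations)
    in_dendrite: node -> True if the node lies in the dendrite part
    """
    f = {root: 0.0}
    queue = deque([root])
    while queue:  # BFS down the tree, accumulating edge lengths
        u = queue.popleft()
        for v in adj[u]:
            if v not in f:
                f[v] = f[u] + lengths[(u, v)]
                queue.append(v)
    # Negate on the dendrite side so the two parts are distinguished.
    return {v: (-d if in_dendrite[v] else d) for v, d in f.items()}
```

Sublevel- and superlevel-set filtrations of this signed function then sweep the dendrite and axon sides separately, which is what makes the two parts distinguishable in the resulting diagrams.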
Persistence images are needed both for the methodology of [1] and as input to our WKPI-distance; their resolution is fixed at roughly 40 × 40 (see Supplement 2.2 for details). For the persistence image (PI) approach of [1], we experimented both with unweighted persistence images (PI-CONST) and with a variant, denoted PI-PL, where the weight function α : R2 → R is a simple piecewise-linear (PL) function adapted from the one proposed in [1]; see Supplement 2.2 for details. Since PI-PL performs better than PI-CONST on both datasets, Table 1 only shows the results of PI-PL. The classification accuracy of the various methods is given in Table 1. Our results are consistently better than those of the other topology-based approaches as well as the alternative metric-learning approaches, not only on the neuron datasets in Table 1 but also on the graph benchmark datasets shown in Table 3 of Supplement Section 2.2, and often by a large margin. In Supplement Section 2.1, we also show heatmaps of the learned weight function ω : R2 → R.

4.2 Graph classification task

We use a range of benchmark datasets: (1) several datasets of graphs derived from small chemical compounds or protein molecules: NCI1 and NCI109 [44], PTC [24], PROTEIN [6], DD [21] and MUTAG [19]; (2) two datasets of graphs representing the response relations between users in Reddit: REDDIT-5K (5 classes) and REDDIT-12K (11 classes) [48]; and (3) two datasets on

Table 2: Graph classification accuracy + standard deviation.
Our results are in the last two columns.

              |                           Previous approaches                                    |    Our approaches
Dataset       | RetGK     WL        DGK       P-WL-UC   PF        PSCN      GIN       | WKPI-kM   WKPI-kC
NCI1          | 84.5±0.2  85.4±0.3  80.3±0.5  85.6±0.3  81.7±0.8  76.3±1.7  82.7±1.6  | 84.5±0.5  87.5±0.5
NCI109        | -         84.5±0.2  80.3±0.3  85.1±0.3  78.5±0.5  -         -         | 85.9±0.4  87.4±0.3
PTC           | 62.5±1.6  55.4±1.5  60.1±2.5  63.5±1.6  62.4±1.8  62.3±5.7  66.6±6.9  | 62.7±2.7  68.1±2.4
PROTEIN       | 75.8±0.6  71.2±0.8  75.7±0.5  75.9±0.8  75.2±2.1  75.0±2.5  76.2±2.6  | 78.5±0.4  75.2±0.4
DD            | 81.6±0.3  78.6±0.4  -         78.5±0.4  79.4±0.8  76.2±2.6  -         | 80.3±0.4  82.0±0.5
MUTAG         | 90.3±1.1  84.4±1.5  87.4±2.7  85.2±0.3  85.6±1.7  89.0±4.4  90.0±8.8  | 88.3±2.6  85.8±2.5
IMDB-BINARY   | 71.9±1.0  70.8±0.5  67.0±0.6  73.0±1.0  71.2±1.0  71.0±2.3  75.1±5.1  | 75.1±1.1  70.7±1.1
IMDB-MULTI    | 47.7±0.3  49.8±0.5  44.6±0.4  -         48.6±0.7  45.2±2.8  52.3±2.8  | 49.5±0.4  46.4±0.5
REDDIT-5K     | 56.1±0.5  51.2±0.3  41.3±0.2  -         56.2±1.1  49.1±0.7  57.5±1.5  | 59.1±0.5  59.5±0.6
REDDIT-12K    | 48.7±0.2  32.6±0.3  32.2±0.1  -         47.6±0.5  41.3±0.4  -         | 47.4±0.6  48.4±0.5

IMDB networks of actors/actresses: IMDB-BINARY (2 classes) and IMDB-MULTI (3 classes). See Supplement Section 2.2 for descriptions of these datasets and their statistics (sizes of graphs, etc.). Many graph classification methods have been proposed in the literature, with different methods performing better on different datasets.
Thus we compare with multiple approaches, so as to cover the state-of-the-art results on the different datasets: six graph-kernel based approaches, namely RetGK [50], the Weisfeiler-Lehman kernel (WL) [44], the Weisfeiler-Lehman optimal assignment kernel (WL-OA) [32], the Deep Graphlet kernel (DGK) [48], the very recent persistent Weisfeiler-Lehman kernel (P-WL-UC) [42], and the Persistence Fisher kernel (PF) [34]; and two graph neural networks, PATCHYSAN (PSCN) [39] and the Graph Isomorphism Network (GIN) [46].

Classification results. To generate persistence summaries, we need a meaningful descriptor function on the input graphs. We consider two choices: (a) the Ricci-curvature function fc : G → R, where fc(x) is the discrete Ricci curvature for graphs as introduced in [37]; and (b) the Jaccard-index function fJ : G → R, which measures edge similarities in a graph. See Supplement 2.2 for details. Graph classification results are in Table 2: the Ricci curvature function is used for the small chemical compound datasets (NCI1, NCI109, PTC and MUTAG), while the Jaccard function is used for the protein datasets (PROTEIN and DD) and the social/IMDB networks (IMDB's and REDDIT's). Results of previous methods are taken from their respective papers. Comparisons with more methods (including other topology-based methods such as SW [12]) are in Supplement Section 2.2. We also rerun the two best-performing approaches, GIN and RetGK, using exactly the same nested cross-validation setup as ours; the results, also in Supplement Section 2.2, are similar to those in Table 2.

Except for MUTAG and IMDB-MULTI, the performance of our WKPI-framework is similar to or better than the best of the other methods.
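The Jaccard-index descriptor mentioned above can be sketched as follows for a single edge; using closed neighborhoods is an assumption made here for illustration (the exact variant used in the experiments is specified in Supplement 2.2).

```python
def jaccard_index(adj, u, v):
    """Jaccard similarity of the neighborhoods of an edge's endpoints.

    adj: node -> set of neighbor nodes. Closed neighborhoods (each endpoint
    included in its own neighborhood) are one common convention; this choice
    is an assumption for this sketch.
    """
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)
```

Edges inside densely connected regions score close to 1, while bridges between weakly overlapping neighborhoods score close to 0, which is what makes this function a useful sweep direction for a filtration.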
Our WKPI-framework performs well on both chemical graphs and social graphs, while some earlier work tends to work well on only one type of graph. Furthermore, note that the chemical/molecular graphs usually have attributes associated with them, and some existing methods use these attributes in their classification [48, 39, 50]. Our results, however, are obtained purely from the graph structure, without using any attributes. In terms of variance, the standard deviations of our methods tend to be on par with those of the previous graph-kernel based approaches, and are usually much better (smaller) than those of the GNN-based approaches (i.e., PSCN and GIN).

5 Concluding remarks

This paper introduces a new weighted kernel for persistence images (WKPI), together with a metric-learning framework to learn the best weight-function for the WKPI-kernel from labelled data. We apply the learned WKPI-kernel to the task of graph classification, and show that our new framework achieves similar or better results than the best results among a range of previous approaches.

In our current framework, only a single descriptor function of each input object is used to derive a persistence-based representation. It would be interesting to extend our framework to leverage multiple descriptor functions (so as to capture different types of information) effectively; recent work on multidimensional persistence could be useful in this effort. Another interesting question is how to incorporate categorical attributes associated with graph nodes effectively. Real-valued attributes can be used as descriptor functions to generate persistence-based summaries, but handling categorical attributes via a topological summary is much more challenging, especially when there is no (prior-known) correlation between the attributes (e.g., when the attribute is simply a number from {1, 2, · · · , s}, coming from s categories.
The indices of these categories may carry no meaning).

Acknowledgments
The authors would like to thank Chao Chen and Justin Eldridge for useful discussions related to this project. We would also like to thank Giorgio Ascoli for helping provide the neuron dataset. This work is partially supported by the National Science Foundation via grants CCF-1740761, CCF-1733798, and RI-1815697, as well as by the National Institutes of Health under grant R01EB022899.

References
[1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier. Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research, 18:218–252, 2017.

[2] L. Bai, L. Rossi, A. Torsello, and E. R. Hancock. A quantum Jensen-Shannon graph kernel for unattributed graphs. Pattern Recognition, 48(2):344–355, 2015.

[3] U. Bauer. Ripser. https://github.com/Ripser/ripser, 2016.

[4] U. Bauer, M. Kerber, J. Reininghaus, and H. Wagner. PHAT – persistent homology algorithms toolbox. In H. Hong and C. Yap, editors, Mathematical Software – ICMS 2014, pages 137–143, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[5] S. Bhatia, B. Chatterjee, D. Nathani, and M. Kaul. Understanding and predicting links in graphs: A persistent homology perspective. arXiv preprint arXiv:1811.04049, 2018.

[6] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.

[7] P. Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16(1):77–102, 2015.

[8] G. Carlsson and V. de Silva. Zigzag persistence. Foundations of Computational Mathematics, 10(4):367–405, 2010.

[9] G. Carlsson, V. de Silva, and D. Morozov.
Zigzag persistent homology and real-valued functions. In Proc. 25th Annu. ACM Sympos. Comput. Geom., pages 247–256, 2009.

[10] G. Carlsson and A. Zomorodian. The theory of multidimensional persistence. Discrete & Computational Geometry, 42(1):71–93, 2009.

[11] M. Carriere, F. Chazal, Y. Ike, T. Lacombe, M. Royer, and Y. Umeda. A general neural network architecture for persistence diagrams and graph classification. arXiv preprint arXiv:1904.09378, 2019.

[12] M. Carrière, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. International Conference on Machine Learning, pages 664–673, 2017.

[13] F. Chazal, D. Cohen-Steiner, M. Glisse, L. J. Guibas, and S. Oudot. Proximity of persistence modules and their diagrams. In Proc. 25th ACM Sympos. on Comput. Geom., pages 237–246, 2009.

[14] F. Chazal, V. de Silva, M. Glisse, and S. Oudot. The structure and stability of persistence modules. SpringerBriefs in Mathematics. Springer, 2016.

[15] C. Maria, J.-D. Boissonnat, M. Glisse, and M. Yvinec. The GUDHI library: simplicial complexes and persistent homology. http://gudhi.gforge.inria.fr/python/latest/index.html, 2014.

[16] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

[17] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Extending persistence using Poincaré and Lefschetz duality. Foundations of Computational Mathematics, 9(1):79–103, 2009.

[18] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko. Lipschitz functions have Lp-stable persistence. Foundations of Computational Mathematics, 10(2):127–139, 2010.

[19] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds.
Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.

[20] T. K. Dey, D. Shi, and Y. Wang. SimBa: An efficient tool for approximating Rips-filtration persistence via simplicial batch-collapse. In 24th Annual European Symposium on Algorithms (ESA 2016), volume 57 of Leibniz International Proceedings in Informatics (LIPIcs), pages 35:1–35:16, 2016.

[21] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.

[22] H. Edelsbrunner and J. Harer. Computational Topology: an Introduction. American Mathematical Society, 2010.

[23] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.

[24] C. Helma, R. D. King, S. Kramer, and A. Srinivasan. The predictive toxicology challenge 2000–2001. Bioinformatics, 17(1):107–108, 2001.

[25] S. Hido and H. Kashima. A linear-time graph kernel. In Ninth IEEE International Conference on Data Mining (ICDM '09), pages 179–188. IEEE, 2009.

[26] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pages 1634–1644, 2017.

[27] M. Kerber, D. Morozov, and A. Nigmetov. Geometry helps to compare persistence diagrams. J. Exp. Algorithmics, 22:1.4:1–1.4:20, Sept. 2017.

[28] M. Kerber, D. Morozov, and A. Nigmetov. HERA: software to compute distances for persistence diagrams. https://bitbucket.org/grey_narn/hera, 2018.

[29] M. Kerber and H. Schreiber. Barcodes of towers and a streaming algorithm for persistent homology. In 33rd International Symposium on Computational Geometry (SoCG 2017), page 57. Schloss Dagstuhl-Leibniz-Zentrum für Informatik GmbH, 2017.

[30] E. Kokiopoulou, J.
Chen, and Y. Saad. Trace optimization and eigenproblems in dimension reduction methods. Numerical Linear Algebra with Applications, 18(3):565–602, 2011.

[31] R. Kondor, H. T. Son, H. Pan, B. M. Anderson, and S. Trivedi. Covariant compositional networks for learning graphs. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018.

[32] N. M. Kriege, P.-L. Giscard, and R. C. Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.

[33] G. Kusano, K. Fukumizu, and Y. Hiraoka. Kernel method for persistence diagrams via kernel embedding and weight factor. Journal of Machine Learning Research, 18(189):1–41, 2018.

[34] T. Le and M. Yamada. Persistence Fisher kernel: A Riemannian manifold kernel for persistence diagrams. In Advances in Neural Information Processing Systems (NIPS), pages 10028–10039, 2018.

[35] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Processing, 67(1):97–109, 2019.

[36] Y. Li, D. Wang, G. A. Ascoli, P. Mitra, and Y. Wang. Metrics for comparing neuronal tree shapes based on persistent homology. PLoS ONE, 12(8):e0182184, 2017.

[37] Y. Lin, L. Lu, and S.-T. Yau. Ricci curvature of graphs. Tohoku Mathematical Journal, Second Series, 63(4):605–627, 2011.

[38] M. Neumann, N. Patricia, R. Garnett, and K. Kersting. Efficient graph kernels by randomization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 378–393. Springer, 2012.

[39] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. International Conference on Machine Learning, pages 2014–2023, 2016.

[40] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. Artificial Intelligence and Statistics, pages 488–495, 2009.

[41] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multi-scale kernel for topological machine learning. In Computer Vision & Pattern Recognition, pages 4741–4748, 2015.

[42] B. Rieck, C. Bock, and K. Borgwardt. A persistent Weisfeiler-Lehman procedure for graph classification. International Conference on Machine Learning, 2019.

[43] D. Sheehy. Linear-size approximations to the Vietoris-Rips filtration. In Proc. 28th Annu. Sympos. Comput. Geom., pages 239–248, 2012.

[44] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.

[45] S. Verma and Z.-L. Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. Advances in Neural Information Processing Systems, pages 88–98, 2017.

[46] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? International Conference on Learning Representations, 2019.

[47] L. Xu, X. Jin, X. Wang, and B. Luo. A mixed Weisfeiler-Lehman graph kernel. In International Workshop on Graph-based Representations in Pattern Recognition, pages 242–251, 2015.

[48] P. Yanardag and S. Vishwanathan. Deep graph kernels. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.

[49] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.

[50] Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai. RetGK: Graph kernels based on return probabilities of random walks.
In Advances in Neural Information Processing Systems, pages 3968–3978, 2018.