{"title": "The Perturbed Variation", "book": "Advances in Neural Information Processing Systems", "page_first": 1934, "page_last": 1942, "abstract": "We introduce a new discrepancy score between two distributions that gives an indication of their \\emph{similarity}. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of these procedures is presented in simulations. We also compare the score's capacity to detect similarity with that of other known measures on real data.", "full_text": "The Perturbed Variation\n\nMaayan Harel\nDepartment of Electrical Engineering\nTechnion, Haifa, Israel\nmaayanga@tx.technion.ac.il\n\nShie Mannor\nDepartment of Electrical Engineering\nTechnion, Haifa, Israel\nshie@ee.technion.ac.il\n\nAbstract\n\nWe introduce a new discrepancy score between two distributions that gives an indication of their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples.
We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of these procedures is presented in simulations. We also compare the score's capacity to detect similarity with that of other known measures on real data.\n\n1 Introduction\n\nThe question of similarity between two sets of examples is common to many fields, including statistics, data mining, machine learning and computer vision. For example, in machine learning, a standard assumption is that the training and test data are generated from the same distribution. However, in some scenarios, such as Domain Adaptation (DA), this is not the case and the distributions are only assumed similar. It is often intuitively clear when two inputs are similar in nature, yet the following question remains open: given two sets of examples, how do we test whether or not they were generated by similar distributions? The main focus of this work is providing a similarity score and a corresponding statistical procedure that gives one possible answer to this question.\n\nDiscrepancy between distributions has been studied for decades, and a wide variety of distance scores have been proposed. However, not all proposed scores can be used for testing similarity. The main difficulty is that most scores have been designed for statistical testing not of similarity but of equality, known as the Two-Sample Problem (TSP). Formally, let P and Q be the generating distributions of the data; the TSP tests the null hypothesis H0 : P = Q against the general alternative H1 : P ≠ Q. This is one of the classical problems in statistics. However, sometimes, like in DA, the interesting question is with regards to similarity rather than equality.
By design, most equality tests may not be transformed to test similarity; see Section 3 for a review of representative works.\n\nIn this work, we quantify similarity using a new score, the Perturbed Variation (PV). We propose that similarity is related to some predefined value of permitted variations. Consider the gait of two male subjects as an example. If their physical characteristics are similar, we expect their walk to be similar, and thus assume the examples representing the two are from similar distributions. This intuition applies when the distribution of our measurements undergoes only small changes for people with similar characteristics. Put more generally, similarity depends on what “small changes” are in a given application, and implies that similarity is domain specific. The PV, as hinted by its name, measures the discrepancy between two distributions while allowing for some perturbation of each distribution; that is, it allows small differences between the distributions. What accounts for small differences is a parameter of the PV, and may be defined by the user with regard to a specific domain.\n\nFigure 1: X and O identify samples from two distributions; dotted circles denote allowed perturbations. Samples marked in red are matched with neighbors, while the unmatched samples indicate the PV discrepancy.\n\nFigure 1 illustrates the PV. Note that, like perceptual similarity, the PV turns a blind eye to variations of some rate.\n\n2 The Perturbed Variation\n\nThe PV on continuous distributions is defined as follows:\n\nDefinition 1. Let P and Q be two distributions on a Banach space X, and let M(P, Q) be the set of all joint distributions on X × X with marginals P and Q.
The PV, with respect to a distance function d : X × X → R and ε, is defined by\n\nPV(P, Q, ε, d) := inf_{μ ∈ M(P,Q)} P_μ[d(X, Y) > ε],  (1)\n\nover all pairs (X, Y) ∼ μ, such that the marginal of X is P and the marginal of Y is Q.\n\nPut into words, Equation (1) defines the joint distribution μ that couples the two distributions such that the probability of the event of a pair (X, Y) ∼ μ being within a distance greater than ε is minimized.\n\nThe solution to (1) is a special case of the classical mass transport problem of Monge [1] and its version by Kantorovich: inf_{μ ∈ M(P,Q)} ∫_{X×X} c(x, y) dμ(x, y), where c : X × X → R is a measurable cost function. When c is a metric, the problem describes the 1st Wasserstein metric. Problem (1) may be rephrased as the optimal mass transport problem with the cost function c(x, y) = 1[d(x,y)>ε], and may be rewritten as inf_μ ∫∫ 1[d(x,y)>ε] μ(y|x) dy P(x) dx. The probability μ(y|x) defines the transportation plan of x to y. The PV optimal transportation plan is obtained by perturbing the mass of each point x in its ε neighborhood so that it redistributes to the distribution of Q. These small perturbations do not add any cost, while transportation of mass to further areas is equally costly. Note that when P = Q the PV is zero, as the optimal plan is simply the identity mapping. Due to its cost function, the PV is not a metric: it is symmetric but does not comply with the triangle inequality and may be zero for distributions P ≠ Q.
Despite this limitation, this cost function fully quantifies the intuition that small variations should not be penalized when similarity is considered. In this sense, similarity is not unique by definition, as more than one distribution can be similar to a reference distribution.\n\nThe PV is also closely related to the Total Variation distance (TV), which may be written, using a coupling characterization, as TV(P, Q) = inf_{μ ∈ M(P,Q)} P_μ[X ≠ Y] [2]. This formulation argues that any transportation plan, even to a close neighbor, is costly. Due to this property, the TV is known to be an overly sensitive measure that overestimates the distance between distributions. For example, consider two distributions defined by the Dirac delta functions δ(a) and δ(a + ε). For any ε, the TV between the two distributions is 1, while they are intuitively similar. The PV resolves this problem by adding perturbations, and therefore is a natural extension of the TV. Notice, however, that the ε used to compute the PV need not be infinitesimal, and is defined by the user.\n\nThe PV can be seen as a compromise between the Wasserstein distance and the TV. As explained, it relaxes the sensitivity of the TV; however, it does not “over optimize” the transportation plan. Specifically, distances larger than the allowed perturbation are discarded.
This aspect also contributes to the efficiency of estimation of the PV from samples; see Section 2.2.\n\nFigure 2.1: Illustration of the PV score between two discrete distributions μ1 and μ2 on the support a1 = 0, a2 = 1, a3 = 2, a4 = 2.1, for which PV(μ1, μ2, ε) = 1/2: the leftover masses are w1 = w2 = 1/4, w3 = w4 = 0 and v4 = 1/2, v1 = v2 = v3 = 0, and the transportation matrix Z is zero except for Z22 = 1/4 and Z43 = 1/4.\n\n2.1 The Perturbed Variation on Discrete Distributions\n\nIt can be shown that for two discrete distributions Problem (1) is equivalent to the following problem.\n\nDefinition 2. Let μ1 and μ2 be two discrete distributions on the unified support {a1, ..., aN}. Define the neighborhood of ai as ng(ai, ε) = {z ; d(z, ai) ≤ ε}. The PV(μ1, μ2, ε, d) between the two distributions is:\n\nmin_{wi ≥ 0, vj ≥ 0, Zij ≥ 0}  (1/2) Σ_{i=1}^{N} wi + (1/2) Σ_{j=1}^{N} vj  (2)\ns.t.  Σ_{aj ∈ ng(ai, ε)} Zij + wi = μ1(ai), ∀i\n      Σ_{ai ∈ ng(aj, ε)} Zij + vj = μ2(aj), ∀j\n      Zij = 0, ∀(i, j) s.t. aj ∉ ng(ai, ε).\n\nEach row in the matrix Z ∈ R^{N×N} corresponds to a point mass in μ1, and each column to a point mass in μ2. For each i, Z(i, :) is zero in columns corresponding to non-neighboring elements, and non-zero only for columns j for which transportation between μ2(aj) → μ1(ai) is performed. The discrepancies between the distributions are depicted by the scalars wi and vj that count the “leftover” mass in μ1(ai) and μ2(aj). The objective is to minimize these discrepancies; therefore matrix Z describes the optimal transportation plan constrained to ε-perturbations. An example of an optimal
An example of an optimal\nplan is presented in Figure 2.1.\n\n2.2 Estimation of the Perturbed Variation\n\nTypically, we are given samples from which we would like to estimate the PV. Given two sam-\nples S1 = {x1, ..., xn} and S2 = {y1, ..., ym}, generated by distributions P and Q respectively,\n\ufffdPV(S1, S2, \ufffd, d) is:\n(3)\n\nwi\u22650,vi\u22650,Zij\u22650\n\n1\n2n\n\nmin\n\nvj\n\nwi +\n\n1\n2m\n\nn\ufffdi=1\n\nm\ufffdj=1\nZij + wi = 1, \ufffdxi\u2208ng(yj ,\ufffd)\n\n\u2200(i, j) \ufffd\u2208 ng(xi, \ufffd),\n\ns.t. \ufffdyj\u2208ng(xi,\ufffd)\n\nZij = 0 ,\n\nZij + vj = 1,\n\n\u2200i, j\n\nwhere Z \u2208 Rn\u00d7m. When n = m, the optimization in (3) is identical to (2), as in this case the\nsamples de\ufb01ne a discrete distribution. However, when n \ufffd= m Problem (3) also accounts for the\ndifference in the size of the two samples.\nProblem (3) is a linear program with constraints that may be written as a totally unimodular matrix.\nIt follows that one of the optimal solutions of (3) is integral [3]; that is, the mass of each sample\nis transferred as a whole. This solution may be found by solving the optimal assignment on an\nappropriate bipartite graph [3]. Let G = (V = (A, B), E) de\ufb01ne this graph, with A = {xi, wi ; i =\n1, ..., n} and B = {yj, vj ; j = 1, ..., m} as its bipartite partition. The vertices xi \u2208 A are linked\n\n3\n\n\fAlgorithm 1 Compute\ufffdPV(S1, S2, \ufffd, d)\nInput: S1 = {x1, ..., xn} and S2 = {y1, ..., ym}, \ufffd rate, and distance measure d.\n1. De\ufb01ne \u02c6G = ( \u02c6V = ( \u02c6A, \u02c6B), \u02c6E): \u02c6A = {xi \u2208 S1}, \u02c6B = {yj \u2208 S2},\n2. Compute the maximum matching on \u02c6G.\n3. De\ufb01ne Sw and Sv as number of unmatched edges in sets S1 and S2 respectively.\n\nConnect an edge eij \u2208 \u02c6E if d(xi, yj) \u2264 \ufffd.\nOutput: \ufffdP V (S1, S2, \ufffd, d) = 1\nwith edge weight zero to yj \u2208 ng(xi) and with weight \u221e to yj \ufffd\u2208 ng(xi). 
In addition, every vertex xi (yj) is linked with weight 1 to wi (vj). To make the graph complete, assign zero cost edges between all vertices xi and wk for k ≠ i (and vertices yj and vk for k ≠ j).\n\nWe note that the Earth Mover Distance (EMD) [4], a sampled version of the transportation problem, is also formulated by a linear program that may be solved by optimal assignment. For the EMD and other typical assignment problems, the computational complexity is more demanding; for example, using the Hungarian algorithm it has an O(N^3) complexity, where N = n + m is the number of vertices [5]. Contrarily, graph G, which describes PV̂, is a simple bipartite graph for which maximum cardinality matching, a much simpler problem, can be applied to find the optimal assignment. To find the optimal assignment, first solve the maximum matching on the partial graph between vertices xi, yj that have zero weight edges (corresponding to neighboring vertices). Then, assign vertices xi and yj for whom a match was not found with wi and vj respectively; see Algorithm 1 and Figure 1 for an illustration of a matching. It is easy to see that the solution obtained solves the assignment problem associated with PV̂.\n\nThe complexity of Algorithm 1 amounts to the complexity of the maximal matching step and of setting up the graph, i.e., an additional O(nm) complexity of computing distances between all points. Let k be the average number of neighbors of a sample; then the average number of edges in the bipartite graph Ĝ is |Ê| = n × k. The maximal cardinality matching of this graph is obtained in O(kn√(n + m)) steps, in the worst case [5].\n\n3 Related Work\n\nMany scores have been defined for testing discrepancy between distributions. We focus on representative works for nonparametric tests that are most related to our work.
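Before turning to related scores, the matching construction of Algorithm 1 above can be made concrete with a minimal Python sketch. The function name `perturbed_variation` is our own; for the maximum matching step it uses Kuhn's augmenting-path algorithm, a simpler alternative to the Hopcroft-Karp variant behind the O(kn√(n+m)) bound, and it assumes the Euclidean distance as d:

```python
import numpy as np

def perturbed_variation(S1, S2, eps):
    """Estimate PV(S1, S2, eps) via maximum bipartite matching (sketch
    of Algorithm 1 with the Euclidean ground distance)."""
    S1, S2 = np.atleast_2d(S1), np.atleast_2d(S2)
    n, m = len(S1), len(S2)
    # x_i and y_j are neighbors iff d(x_i, y_j) <= eps.
    dist = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=-1)
    adj = dist <= eps
    match_of_y = -np.ones(m, dtype=int)  # index of the x matched to each y

    def try_augment(i, seen):
        # Kuhn's augmenting-path step: try to match x_i, possibly
        # re-routing a previously matched x to another neighbor.
        for j in np.flatnonzero(adj[i]):
            if not seen[j]:
                seen[j] = True
                if match_of_y[j] < 0 or try_augment(match_of_y[j], seen):
                    match_of_y[j] = i
                    return True
        return False

    matched = sum(try_augment(i, np.zeros(m, dtype=bool)) for i in range(n))
    S_w, S_v = n - matched, m - matched  # unmatched samples in S1, S2
    return 0.5 * (S_w / n + S_v / m)
```

For instance, two identical samples give PV̂ = 0, while two samples with no pair of points within ε of each other give PV̂ = 1, matching the output formula (1/2)(Sw/n + Sv/m).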
First, we consider statistics for the Two Sample Problem (TSP), i.e., equality testing, that are based on the asymptotic distribution of the statistic conditioned on the equality. Among these tests is the well-known Kolmogorov-Smirnov test (for one dimensional distributions), and its generalization to higher dimensions by minimal spanning trees [6]. A different statistic is defined by the portion of k-nearest neighbors of each sample that belong to the other distribution; larger portions mean the distributions are closer [7]. These scores are well known in the statistical literature but cannot be easily changed to test similarity, as their analysis relies on testing equality.\n\nAs discussed earlier, the 1st Wasserstein metric and the TV metric have some relation to the PV. The EMD and the histogram-based L1 distance are the sample-based estimates of these metrics, respectively. In both cases, the distance is not estimated directly on the samples, but on a higher level partition of the space: histogram bins or signatures (cluster centers). It is impractical to use the EMD to estimate the Wasserstein metric between the continuous distributions, as convergence would require the number of bins to be exponentially dependent on the dimension. As a result, it is commonly used to rate distances and not for statistical testing. Contrarily, the PV is estimated directly on the samples and converges to its value between the underlying continuous distributions. We note that after a good choice of signatures, the EMD captures perceptual similarity, similar to that of the PV. It is possible to consider the PV as a refinement of the EMD notion of similarity; instead of clustering the data to signatures and moving the signatures, it perturbs each sample.
In this manner, it captures a finer notion of similarity better suited for statistical testing.\n\nFigure 2: Two distributions on R, shown as histograms in three panels with (a) PV(ε = 0.1) = 0, (b) PV(ε = 0.1) = 0, and (c) PV(ε = 0.1) = 1. The PV captures the perceptual similarity of (a),(b) against the dissimilarity in (c). The L1 distance on the partition I1 = {(0, 0.1), (0.1, 0.2), ...} is 1 for all cases; on I2 = {(0, 0.2), (0.2, 0.4), ...} it is L1(Pa, Qa) = 0, L1(Pb, Qb) = 1, L1(Pc, Qc) = 1; and on I3 = {(0, 0.3), (0.3, 0.6), ...} it is L1(Pa, Qa) = 0, L1(Pb, Qb) = 0, L1(Pc, Qc) = 0.\n\nThe partition of the support to bins allows some relaxation of the TV notion. Therefore, instead of the TV, it may be interesting to consider the L1 as a similarity distance on the measures after discretization. The example in Figure 2 shows that this relaxation is quite rigid and that there is no single partition that captures the perceptual similarity. In general, the problem would remain even if bins with varying width were permitted. Namely, the problem is the choice of a single partition to measure similarity of a reference distribution to multiple distributions, while choosing multiple partitions would make the distances incomparable. Also note that defining a “good” partition is a difficult task, which is exacerbated in higher dimensions.\n\nThe last group of statistics are scores established in machine learning: the dA distance presented by Kifer et al., which is based on the maximum discrepancy on a chosen subset of the support [8], and the Maximum Mean Discrepancy (MMD) by Gretton et al., which defines discrepancy after embedding the distributions into a Reproducing Kernel Hilbert Space (RKHS) [9].
These scores have corresponding statistical tests for the TSP; however, since their analysis is based on finite convergence bounds, in principle they may be modified to test similarity. The dA captures some intuitive notion of similarity; however, to our knowledge, it is not known how to compute it for a general subset class.¹ The MMD captures the distance between the samples in some RKHS. The MMD may be used to define a similarity test, yet this would require defining two parameters, σ and the similarity rate, whose dependency is not intuitive. Namely, for any similarity rate the result of the test is highly dependent on the choice of σ, but it is not clear how this choice should be made. Contrarily, the PV's parameter ε is related to the data's input domain and may be chosen accordingly.\n\n4 Analysis\n\nWe present sample rate convergence analysis of the PV. The proofs of the theorems are provided in the supplementary material. When no clarity is lost, we omit d from the notation. Our main theorem is stated as follows:\n\nTheorem 3. Suppose we are given two i.i.d. samples S1 = {x1, ..., xn} ⊂ R^d and S2 = {y1, ..., ym} ⊂ R^d generated by distributions P and Q, respectively. Let the ground distance be d = ‖·‖∞ and let N(ε) be the cardinality of a disjoint cover of the distributions' support. Then, for any δ ∈ (0, 1), N = min(n, m), and η = √(2(log(2(2^{N(ε)} − 2)) + log(1/δ))/N), we have that\n\nP(|PV̂(S1, S2, ε) − PV(P, Q, ε)| ≤ η) ≥ 1 − δ.\n\nThe theorem is defined using ‖·‖∞, but can be rewritten for other metrics (with a slight change of constants). The proof of the theorem exploits the form of the optimization Problem (3). We use the bound of Theorem 3 to construct hypothesis tests. A weakness of this bound is its strong dependency on the dimension.
Specifically, it is dependent on N(ε), which for ‖·‖∞ is O((1/ε)^d): the number of disjoint boxes of volume ε^d that cover the support. Unfortunately, this convergence rate is inherent; namely, without making any further assumptions on the distribution, this rate is unavoidable and is an instance of the “curse of dimensionality”. In the following theorem, we present a lower bound on the convergence rate.\n\n¹Most work with the dA has been with the subset of characteristic functions, and it is approximated by the error of a classifier.\n\nTheorem 4. Let P = Q be the uniform distribution on S^{d−1}, a unit (d − 1)-dimensional hypersphere. Let S1 = {x1, ..., xN} ∼ P and S2 = {y1, ..., yN} ∼ Q be two i.i.d. samples. For any ε, ε′, δ ∈ (0, 1), 0 ≤ η < 2/3 and sample size log(1/δ)/(2(1 − 3η/2)^2) ≤ N ≤ (η/2)e^{d(1 − ε²/2)/2}, we have PV(P, Q, ε′) = 0 and\n\nP(PV̂(S1, S2, ε) > η) ≥ 1 − δ.\n\nFor example, for δ = 0.01 and η = 0.5, for any 37 ≤ N ≤ 0.25e^{d(1 − ε²/2)/2} we have that PV̂ > 0.5 with probability at least 0.99. The theorem shows that, for this choice of distributions, for a sample size that is smaller than O(e^d), there is a high probability that the value of PV̂ is far from PV.\n\nIt can be observed that the empirical estimate PV̂ is stable; that is, it is almost identical for two data sets differing on one sample. Due to its stability, applying the McDiarmid inequality yields the following.\n\nTheorem 5. Let S1 = {x1, ..., xn} ∼ P and S2 = {y1, ..., ym} ∼ Q be two i.i.d. samples.
Let n ≥ m; then for any η > 0,\n\nP(|PV̂(S1, S2, ε) − E[PV̂(n, m, ε)]| ≥ η) ≤ e^{−η²m²/(4n)},  (4)\n\nwhere E[PV̂(n, m, ε)] is the expectation of PV̂ for a given sample size.\n\nThis theorem shows that the sample estimate of the PV converges to its expectation without dependence on the dimension. By combining this result with Theorem 3, it may be deduced that only the convergence of the bias – the difference |E[PV̂(n, m, ε)] − PV(P, Q, ε)| – may be exponential in the dimension. This convergence is distribution dependent. However, intuitively, slow convergence is not always the case, for example when the support of the distributions lies in a lower dimensional manifold of the space. To remedy this dependency we propose a bootstrapping bias correcting technique, presented in Section 5. A different possibility is to project the data to one dimension; due to space limitations, this extension of the PV is left out of the scope of this paper and presented in Appendix A.2 in the supplementary material.\n\n5 Statistical Inference\n\nWe construct two types of complementary procedures for hypothesis testing of similarity and dissimilarity.² In the first type of procedures, given 0 ≤ θ < 1, we distinguish between the null hypothesis H0(1) : PV(P, Q, ε, d) ≤ θ, which implies similarity, and the alternative hypothesis H1(1) : PV(P, Q, ε, d) > θ. Notice that when θ = 0, this test is a relaxed version of the TSP. Using PV(P, Q) = 0 instead of P = Q as the null allows for some distinction between the distributions, which gives the needed relaxation to capture similarity. In the second type of procedures, we test whether two distributions are similar. To do so, we flip the role of the null and the alternative.
Note that there is no equivalent of this form for the TSP; therefore we cannot infer similarity using the TSP test, but only reject equality. Our hypothesis tests are based on the finite sample analysis presented in Section 4; see Appendix A.1 in the supplementary material for the procedures.\n\nTo provide further inference on the PV, we apply bootstrapping for approximations of Confidence Intervals (CI). The idea of bootstrapping for estimating CIs is based on a two step procedure: approximation of the sampling distribution of the statistic by resampling with replacement from the initial sample – the bootstrap stage – followed by a computation of the CI based on the resulting distribution. We propose to estimate the CI by the Bootstrap Bias-Corrected accelerated (BCa) interval, which adjusts the simple percentile method to correct for bias and skewness [10]. The BCa is known for its high accuracy; particularly, it can be shown that the BCa interval converges to the theoretical CI with rate O(N^{−1}), where N is the sample size. Using the CI, a hypothesis test may be formed: the null H0(1) is rejected with significance α if the range [0, θ] is not contained in the estimated interval [CI_low, CI_high].
Also, for the second test, we apply the principle of CI inclusion [11], which states that if [CI_low, CI_high] ⊂ [0, θ], dissimilarity is rejected and similarity deduced.\n\n²The two procedures are distinct, as, in general, lacking evidence to reject similarity is not sufficient to infer dissimilarity, and vice versa.\n\nFigure 3: (a) The Type-2 error for varying perturbation sizes and ε values (ε = 0.1, 0.2, 0.3, 0.4, 0.5), as a function of sample size. (b) Precision-Recall: Gait data. (c) Precision-Recall: Video clips. Panels (b) and (c) compare the PV, MMD, FR and KNN methods.\n\n6 Experiments\n\n6.1 Synthetic Simulations\n\nIn our first experiment, we examine the effect of the choice of ε on the statistical power of the test. For this purpose, we apply significance testing for similarity on two univariate uniform distributions: P ∼ U[0, 1] and Q ∼ U[Δ(ε), 1 + Δ(ε)], where Δ(ε) is a varying size of perturbation. We considered values of ε = [0.1, 0.2, 0.3, 0.4, 0.5] and sample sizes up to 5000 samples from each distribution. For each value ε′, we test the null hypothesis H0(1) : PV(P, Q, ε′) = 0 for ten equally spaced values of Δ(ε′) in the range [0, 2ε′]. In this manner, we test the ability of the PV to detect similarity for different sizes of perturbations. The percentage of times the null hypothesis was falsely rejected, i.e., the type-1 error, was kept at a significance level α = 0.05.
The percentage of times the null hypothesis was correctly rejected, the power of the test, was estimated as a function of the sample size and averaged over 500 repetitions. We repeated the simulation using the tests based on the bounds as well as using BCa confidence intervals.\n\nThe results in Figure 3(a) show the type-2 error of the bound-based simulations. As expected, the power of the test increases as the sample size grows. Also, when finer perturbations need to be detected, more samples are needed to gain statistical power. For the BCa CI we obtained type-1 and type-2 errors smaller than 0.05 for all the sample sizes. This shows that the convergence of the estimated PV to its value is clearly faster than the bounds. Note that, given a sufficient sample size, any statistic for the TSP would have rejected similarity for any Δ > 0.\n\n6.2 Comparing Distance Measures\n\nNext, we test the ability of the PV to measure similarity on real data. To this end, we test the ranking performance of the PV score against other known distributional distances. We compare the PV to the multivariate extension of the Wald-Wolfowitz score of Friedman & Rafsky (FR) [6], Schilling's nearest neighbors score (KNN) [7], and the Maximum Mean Discrepancy score of Gretton et al. [9] (MMD).³ We rank similarity for the applications of video retrieval and gait recognition.\n\nThe ranking performance of the methods was measured by precision-recall curves and the Mean Average Precision (MAP). Let r be the number of samples similar to a query sample. For each 1 ≤ i ≤ r of these observations, define ri ∈ [1, T − 1] as its similarity rank, where T is the total number of observations. The Average Precision is AP = (1/r) Σ_i i/ri, and the MAP is the average of the AP over the queries.
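As a small illustration of the AP and MAP formulas just defined, here is a sketch in Python (the helper names are our own; `average_precision` takes the 1-based similarity ranks ri of the r truly similar items for one query):

```python
def average_precision(similar_ranks):
    """AP = (1/r) * sum over i of i / r_i, with the r similarity
    ranks taken in increasing order (sketch of the formula above)."""
    ranks = sorted(similar_ranks)
    r = len(ranks)
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / r

def mean_average_precision(ranks_per_query):
    """MAP: the AP averaged over all queries."""
    return sum(average_precision(r) for r in ranks_per_query) / len(ranks_per_query)
```

For example, a query whose r = 2 similar items are retrieved at ranks 1 and 4 gets AP = (1/1 + 2/4)/2 = 0.75; a perfect retrieval (ranks 1, ..., r) gets AP = 1.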
The tuning parameters for the methods – k for the KNN, σ for the MMD (with RBF kernel), and ε for the PV – were chosen by cross-validation. The Euclidean distance was used in all methods.\n\n³Note that the statistical tests of these measures test equality while the PV tests similarity, and therefore our experiments are not of statistical power but of ranking similarity. Even in the case of the distances that may be transformed for similarity, like the MMD, there is no known function between the PV similarity and other forms of similarity. As a result, there is no basis on which to compare which similarity test has better performance.\n\nIn our first experiment, we tested ranking for video-clip retrieval. The data we used was collected and generated by [12], and includes 1,083 videos of commercials, each of about 1,500 frames (25 fps). Twenty unique videos were selected as query videos, each of which has one similar clip in the collection, to which 8 more similar clips were generated by different transformations: brightness increased/decreased, saturation increased/decreased, borders cropped, logo inserted, randomly dropped frames, and added noise frames. Lastly, each frame of a video was transformed to a 32-RGB representation. We computed the similarity rate for each query video to all videos in the set, and ranked the position of each video.\n\nTable 1: MAP for Auslan, Video, and Gait data sets. Average MAP (± standard deviation) computed on a random selection of 75% of the queries, repeated 100 times.\n\nDATA SET  |  PV̂            |  KNN           |  MMD           |  FR\nVIDEO     |  0.758 ± 0.009 |  0.741 ± 0.014 |  0.689 ± 0.008 |  0.563 ± 0.019\nGAIT      |  0.792 ± 0.021 |  0.736 ± 0.014 |  0.722 ± 0.017 |  0.698 ± 0.017\nGAIT-F    |  0.844 ± 0.017 |  0.750 ± 0.015 |  0.729 ± 0.017 |  0.666 ± 0.016\nGAIT-M    |  0.679 ± 0.024 |  0.712 ± 0.017 |  0.716 ± 0.031 |  0.799 ± 0.016
The results show that the PV and the KNN score are invariant to most of the transformations, and outperform the FR and MMD methods (Table 1 and Figure 3(c)). We found that brightness changes were most problematic for the PV. For this type of distortion, the simple RGB representation is not sufficient to capture the similarity.\n\nWe also tested gait similarity of female and male subjects; same gender samples are assumed similar. We used gait data that was recorded by a mobile phone, available at [13]. The data consists of two sets of 15 min walks of 20 individuals, 10 women and 10 men. As features we used the magnitude of the triaxial accelerometer. We cut the raw data to intervals of approximately 0.5 sec, without identification of gait cycles. In this manner, each walk is represented by a collection of about 1500 intervals. An initial scaling to [0, 1] was performed once for the whole set. The comparison was done by ranking by gender the 39 samples with respect to a reference walk.\n\nThe precision-recall curves in Figure 3(b) show that the PV is able to retrieve with higher precision in the mid-recall range. For the early recall points the PV did not show optimal performance; interestingly, we found that with a smaller ε, the PV had better performance on early recall points. This behavior reflects the flexibility of the PV: smaller ε should be chosen when the goal is to find very similar instances, and larger when the goal is to find higher level similarity. The MAP results presented in Table 1 show that the PV had better performance on the female subjects. From examination of the subject information sheet we found that the range of weight and height within the female group is 50-77 kg and 1.6-1.8 m, while within the male group it is 47-100 kg and 1.65-1.93 m; that is, there is much more variability in the male group.
The greater variability within the male group offers a reasonable explanation for the PV results: a subject from the male group may have a gait that is as dissimilar to the gait of a female subject as it is to that of a different male. In the female group the subjects are more similar to one another, and therefore the precision is higher.

7 Discussion

We proposed a new score that measures the similarity between two multivariate distributions and assigns it a value in the range [0,1]. The sensitivity of the score, reflected by the parameter ε, allows for the flexibility that is essential for quantifying the notion of similarity. The PV is efficiently estimated from samples. Its low computational complexity relies on its simple binary classification of points as neighbors or non-neighbors, so that distances between faraway points need not be optimized. In this manner, the PV captures only the information essential to describe similarity. Although it is not a metric, our experiments show that it captures the distance between similar distributions as well as well-known distributional distances do. Our work also includes a convergence analysis of the PV. Based on this analysis, we provide hypothesis tests that give statistical significance to the resulting score. While our bounds depend on the dimension, when the intrinsic dimension of the data is smaller than the domain's dimension, statistical power can be gained by bootstrapping. In addition, the PV has an intuitive interpretation that makes it an attractive score for meaningful statistical testing of similarity. Lastly, an added value of the PV is that its computation also gives insight into the areas of discrepancy, namely the areas of the unmatched samples. In future work we plan to further explore this information, which may be valuable on its own merits.

Acknowledgements

This research was supported in part by the Israel Science Foundation (grant No.
920/12).

References

[1] G. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, avec les Mémoires de Mathématique et de Physique pour la même année, 1781.
[2] L. Rüschendorf. Monge-Kantorovich transportation problem and optimal couplings. Jahresbericht der DMV, 3:113-137, 2007.
[3] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
[4] Y. Rubner, C. Tomasi, and L.J. Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision, pages 59-66. IEEE, 1998.
[5] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network Flows: Theory, Algorithms, and Applications, chapter 12, pages 469-473. Prentice Hall, 1993.
[6] J.H. Friedman and L.C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Annals of Statistics, 7:697-717, 1979.
[7] M.F. Schilling. Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, pages 799-806, 1986.
[8] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 180-191. VLDB Endowment, 2004.
[9] A. Gretton, K. Borgwardt, B. Schölkopf, M. Rasch, and A. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, 2007.
[10] B. Efron and R. Tibshirani. An Introduction to the Bootstrap, chapter 14, pages 178-188. Chapman & Hall/CRC, 1993.
[11] S. Wellek. Testing Statistical Hypotheses of Equivalence and Noninferiority, 2nd edition. Chapman and Hall/CRC, 2010.
[12] J. Shao, Z. Huang, H. Shen, J. Shen, and X. Zhou. Distribution-based similarity measures for multi-dimensional point set retrieval applications. In Proceedings of the 16th ACM International Conference on Multimedia (MM '08), 2008.
[13] J. Frank, S. Mannor, and D. Precup. Data sets: Mobile phone gait recognition data, 2010.
[14] S. Boyd and L. Vandenberghe. Convex Optimization, chapter 5, pages 258-261. Cambridge University Press, New York, NY, USA, 2004.
[15] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M.J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.