{"title": "Informative Features for Model Comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 808, "page_last": 819, "abstract": "Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.", "full_text": "Informative Features for Model Comparison\n\nWittawat Jitkrittum\n\nMax Planck Institute for Intelligent Systems\n\nwittawat@tuebingen.mpg.de\n\nHeishiro Kanagawa\nGatsby Unit, UCL\n\nheishirok@gatsby.ucl.ac.uk\n\nPatsorn Sangkloy\n\nGeorgia Institute of Technology\npatsorn_sangkloy@gatech.edu\n\nJames Hays\n\nGeorgia Institute of Technology\n\nhays@gatech.edu\n\nBernhard Sch\u00f6lkopf\n\nMax Planck Institute for Intelligent Systems\nbernhard.schoelkopf@tuebingen.mpg.de\n\nArthur Gretton\u2217\nGatsby Unit, UCL\n\narthur.gretton@gmail.com\n\nAbstract\n\nGiven two candidate models, and a set of target observations, we address the prob-\nlem of measuring the relative goodness of \ufb01t of the two models. We propose two\nnew statistical tests which are nonparametric, computationally ef\ufb01cient (runtime\ncomplexity is linear in the sample size), and interpretable. As a unique advantage,\nour tests can produce a set of examples (informative features) indicating the regions\nin the data domain where one model \ufb01ts signi\ufb01cantly better than the other. 
In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.

1 Introduction

One of the most fruitful areas in recent machine learning research has been the development of effective generative models for very complex and high dimensional data. Chief among these have been the generative adversarial networks [Goodfellow et al., 2014, Arjovsky et al., 2017, Nowozin et al., 2016], where samples may be generated without an explicit generative model or likelihood function. A related thread has emerged in the statistics community with the advent of Approximate Bayesian Computation, where simulation-based models without closed-form likelihoods are widely applied in bioinformatics applications [see Lintusaari et al., 2017, for a review]. In these cases, we might have several competing models, and wish to evaluate which is the better fit for the data.

The problem of model criticism is traditionally defined as follows: how well does a model Q fit a given sample Z_n := {z_i}_{i=1}^n drawn i.i.d. from R? This task can be addressed in two ways: by comparing samples Y_n := {y_i}_{i=1}^n from the model Q and data samples, or by directly evaluating the goodness of fit of the model itself. In both of these cases, the tests have a null hypothesis (that the model agrees with the data), which they will reject given sufficient evidence. Two-sample tests fall into the first category: there are numerous nonparametric tests which may be used [Alba Fernández et al., 2008, Gretton et al., 2012a, Friedman and Rafsky, 1979, Székely and Rizzo, 2004, Rosenbaum, 2005, Harchaoui et al., 2008, Hall and Tajvidi, 2002, Jitkrittum et al., 2016], as well as recent work applying two-sample tests to the problem of model criticism [Lloyd and Ghahramani, 2015]. A second approach requires the model density q explicitly. 
In the case of simple models for which normalisation is not an issue (e.g., checking for Gaussianity), several tests exist [Baringhaus and Henze, 1988, Székely and Rizzo, 2005]; when a model density is known only up to a normalisation constant, tests of goodness of fit have been developed using a Stein-based divergence [Chwialkowski et al., 2016, Liu et al., 2016, Jitkrittum et al., 2017b].

∗Arthur Gretton's ORCID ID: 0000-0003-3169-7624.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

An issue with the above notion of model criticism, particularly in the case of modern generative models, is that any hypothetical model Q that we design is likely a poor fit to the data. Indeed, as noted in Yamada et al. [2018, Section 5.5], comparing samples from various Generative Adversarial Network (GAN) models [Goodfellow et al., 2014] to the reference sample Z_n by a variant of the Maximum Mean Discrepancy (MMD) test [Gretton et al., 2012a] leads to the trivial conclusion that all models are wrong [Box, 1976], i.e., H0 : Q = R is rejected by the test in all cases. A more relevant question in practice is thus: "Given two models P and Q, which is closer to R, and in what ways?" This is the problem we tackle in this work.

To our knowledge, the only nonparametric statistical test of relative goodness of fit is the Rel-MMD test of Bounliphone et al. [2015], based on the maximum mean discrepancy [MMD, Gretton et al., 2012a]. While shown to be practical (e.g., for comparing network architectures of generative networks), two issues remain to be addressed. Firstly, its runtime complexity is quadratic in the sample size n, meaning that it can be applied only to problems of moderate size. Secondly and more importantly, it does not give an indication of where one model is better than the other. 
This is essential\nfor model comparison: in practical settings, it is highly unlikely that one model will be uniformly\nbetter than another in all respects: for instance, in hand-written digit generation, one model might\nproduce better \u201c3\u201ds, and the other better \u201c6\u201ds. The ability to produce a few examples which indicate\nregions (in the data domain) in which one model \ufb01ts better than the other will be a valuable tool for\nmodel comparison. This type of interpretability is useful especially in learning generative models\nwith GANs, where the \u201cmode collapse\u201d problem is widespread [Salimans et al., 2016, Srivastava\net al., 2017]. The idea of generating such distinguishing examples (so called test locations) was\nexplored in Jitkrittum et al. [2016, 2017b] in the context of model criticism and two-sample testing.\nIn this work, we propose two new linear-time tests for relative goodness-of-\ufb01t. In the \ufb01rst test, the two\nmodels P, Q are represented by their two respective samples Xn and Yn, and the test generalises that\nof Jitkrittum et al. [2016]. In the second, the test has access to the probability density functions p, q\nof the two respective candidate models P, Q (which need only be known up to normalisation), and is\na three-way analogue of the test of Jitkrittum et al. [2017b]. In both cases, the tests return locations\nindicating where one model outperforms the other. We emphasise that the practitioner must choose\nthe model ordering, since as noted earlier, this will determine the locations that the test prioritises. We\nfurther note that the two tests complement each other, as both address different aspects of the model\ncomparison problem. The \ufb01rst test simply \ufb01nds the location where the better model produces mass\nclosest to the test sample: a worse model can produce too much mass, or too little. 
The second test\ndoes not address the overall probability mass, but rather the shape of the model density: speci\ufb01cally,\nit penalises the model whose derivative log density differs most from the target (the interpretation\nis illustrated in our experiments). In the experiment on comparing two GAN models, we \ufb01nd that\nthe performance of our new test matches that of Rel-MMD while being one order of magnitude\nfaster. Further, unlike the popular Fr\u00e9chet Inception Distance (FID) [Heusel et al., 2017] which can\ngive a wrong conclusion when two GANs have equal goodness of \ufb01t, our proposed method has a\nwell-calibrated threshold, allowing the user to \ufb02exibly control the false positive rate.\n2 Measures of Goodness of Fit\nIn the proposed tests, we test the relative goodness of \ufb01t by comparing the relative magnitudes of\ntwo distances, following Bounliphone et al. [2015]. More speci\ufb01cally, let D(P, R) be a discrepancy\nmeasure between P and R. Then, the problem can be formulated as a hypothesis test proposing\nH0 : D(P, R) \u2264 D(Q, R) against H1 : D(P, R) > D(Q, R). This is the approach taken by Bounli-\nphone et al. who use the MMD as D, resulting in the relative MMD test (Rel-MMD). The proposed\nRel-UME and Rel-FSSD tests are based on two recently proposed discrepancy measures for D:\nthe Unnormalized Mean Embeddings (UME) statistic [Chwialkowski et al., 2015, Jitkrittum et al.,\n2016], and the Finite-Set Stein Discrepancy (FSSD) [Jitkrittum et al., 2017b], for the sample-based\nand density-based settings, respectively. We \ufb01rst review UME and FSSD. We will extend these two\nmeasures to construct two new relative goodness-of-\ufb01t tests in Section 3. 
We assume throughout that the probability measures P, Q, R have a common support X ⊆ R^d.

The Unnormalized Mean Embeddings (UME) Statistic   UME is a (random) distance between two probability distributions [Chwialkowski et al., 2015], originally proposed for two-sample testing with H0 : Q = R and H1 : Q ≠ R. Let k_Y : X × X → R be a positive definite kernel. Let μ_Q be the mean embedding of Q, defined such that μ_Q(w) := E_{y∼Q} k_Y(y, w) (assumed to exist) [Smola et al., 2007]. Gretton et al. [2012a] show that when k_Y is characteristic [Sriperumbudur et al., 2011], the Maximum Mean Discrepancy (MMD) witness function wit_{Q,R}(w) := μ_Q(w) − μ_R(w) is the zero function if and only if Q = R. Based on this fact, the UME statistic evaluates the squared witness function at J_q test locations W := {w_j}_{j=1}^{J_q} ⊂ X to determine whether it is zero. Formally, the population squared UME statistic is defined as U²(Q, R) := (1/J_q) Σ_{j=1}^{J_q} (μ_Q(w_j) − μ_R(w_j))². For our purpose, it will be useful to rewrite the UME statistic as follows. Define the feature function ψ_W(y) := (1/√J_q) (k_Y(y, w_1), …, k_Y(y, w_{J_q}))^⊤ ∈ R^{J_q}. Let ψ^Q_W := E_{y∼Q}[ψ_W(y)], with empirical estimate ψ̂^Q_W := (1/n) Σ_{i=1}^n ψ_W(y_i). The squared population UME statistic is then equivalent to U²(Q, R) = ‖ψ^Q_W − ψ^R_W‖²₂. For W ∼ η, where η is a distribution with a density, Theorem 2 in Chwialkowski et al. [2015] states that if k_Y is real analytic, integrable, and characteristic, then η-almost surely ‖ψ^Q_W − ψ^R_W‖²₂ = 0 if and only if Q = R. 
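To make linear-time estimation of this quantity concrete, the following is a minimal NumPy sketch of an unbiased estimate of U²(Q, R) from paired samples, using a Gaussian kernel and random test locations (our illustrative choices, not the authors' implementation; the bandwidth `sigma2` and the location count `J` are placeholders):

```python
import numpy as np

def psi(a, W, sigma2):
    """Feature map psi_W(a)_j = k(a, w_j) / sqrt(J), Gaussian kernel; shape (n, J)."""
    d2 = ((a[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2)) / np.sqrt(W.shape[0])

def ume2_unbiased(Y, Z, W, sigma2):
    """Unbiased O(nJ) estimate of U^2(Q, R) from paired samples Y ~ Q, Z ~ R.

    With d_i = psi_W(y_i) - psi_W(z_i), the estimate is
    (||sum_i d_i||^2 - sum_i ||d_i||^2) / (n(n-1)).
    """
    D = psi(Y, W, sigma2) - psi(Z, W, sigma2)   # (n, J)
    n = D.shape[0]
    s = D.sum(axis=0)
    return float((s @ s - (D * D).sum()) / (n * (n - 1)))

rng = np.random.default_rng(0)
n, d, J = 2000, 2, 3
Z = rng.normal(0.0, 1.0, size=(n, d))        # data sample from R
Y_good = rng.normal(0.0, 1.0, size=(n, d))   # model Q equal to R
Y_bad = rng.normal(1.0, 1.0, size=(n, d))    # model Q far from R
W = rng.normal(0.0, 1.0, size=(J, d))        # random test locations
print(ume2_unbiased(Y_good, Z, W, 1.0))      # close to 0
print(ume2_unbiased(Y_bad, Z, W, 1.0))       # clearly positive
```

Since the estimator is unbiased, its value can be slightly negative when Q = R; the cost is O(nJ) per evaluation, matching the linear-time claim.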
In words, under the stated conditions, U(Q, R) := U_Q defines a distance between Q and R (almost surely).² A consistent unbiased estimator is Û²_Q = (1/(n(n−1))) [‖Σ_{i=1}^n [ψ_W(y_i) − ψ_W(z_i)]‖² − Σ_{i=1}^n ‖ψ_W(y_i) − ψ_W(z_i)‖²], which clearly can be computed in O(n) time. Jitkrittum et al. [2016] proposed optimizing the test locations W and k_Y so as to maximize the test power (i.e., the probability of rejecting H0 when it is false) of the two-sample test with the normalized version of the UME statistic. It was shown that the optimized locations give an interpretable indication of where Q and R differ in the input domain X.

The Finite-Set Stein Discrepancy (FSSD)   FSSD is a discrepancy between two density functions q and r. Let X ⊆ R^d be a connected open set. Assume that Q, R have probability density functions denoted by q, r respectively. Given a positive definite kernel k_Y, the Stein witness function [Chwialkowski et al., 2016, Liu et al., 2016] g_{q,r} : X → R^d between q and r is defined as g_{q,r}(w) := E_{z∼r}[ξ_q(z, w)] = (g^{q,r}_1(w), …, g^{q,r}_d(w))^⊤, where ξ_q(z, w) := k_Y(z, w) ∇_z log q(z) + ∇_z k_Y(z, w). Under appropriate conditions (see Chwialkowski et al. [2016, Theorem 2.2] and Liu et al. [2016, Proposition 3.3]), it can be shown that g_{q,r} = 0 (i.e., the zero function) if and only if q = r. An implication of this result is that the deviation of g_{q,r} from the zero function can be used as a measure of mismatch between q and r. Different ways to characterize this deviation have led to different measures of goodness of fit.

The FSSD characterizes the deviation from 0 by evaluating g_{q,r} at J_q test locations. Formally, given a set of test locations W = {w_j}_{j=1}^{J_q}, the squared FSSD is defined as FSSD²_q(r) := (1/(d J_q)) Σ_{j=1}^{J_q} ‖g_{q,r}(w_j)‖²₂ =: F²_q [Jitkrittum et al., 2017b]. Under appropriate conditions, it is known that almost surely F²_q = 0 if and only if q = r. 
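Analogously, here is a minimal plug-in sketch of FSSD², assuming a Gaussian kernel and a model with a tractable score function ∇_z log q (both illustrative choices; the paper's actual estimator is an unbiased second-order U-statistic, not this plug-in):

```python
import numpy as np

def fssd2_plugin(Z, W, sigma2, grad_log_q):
    """Plug-in estimate of FSSD_q^2(r) = ||E_{z~r} tau_q(z)||^2.

    Stein feature: xi_q(z, w) = k(z, w) grad_z log q(z) + grad_z k(z, w), with
    Gaussian kernel k(z, w) = exp(-||z - w||^2 / (2 sigma2)), whose gradient is
    grad_z k(z, w) = k(z, w) (w - z) / sigma2. Features are scaled by 1/sqrt(dJ).
    """
    n, d = Z.shape
    J = W.shape[0]
    diff = W[None, :, :] - Z[:, None, :]                      # (n, J, d): w - z
    K = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma2))   # (n, J)
    score = grad_log_q(Z)                                     # (n, d)
    xi = K[:, :, None] * (score[:, None, :] + diff / sigma2)  # (n, J, d)
    tau = xi.reshape(n, J * d) / np.sqrt(d * J)
    mu = tau.mean(axis=0)                                     # estimate of mu_q
    return float(mu @ mu)

rng = np.random.default_rng(2)
n, d = 4000, 2
Z = rng.normal(0.0, 1.0, size=(n, d))              # sample from r = N(0, I)
W = rng.normal(0.0, 1.0, size=(3, d))
good = fssd2_plugin(Z, W, 1.0, lambda z: -z)       # q = N(0, I): shape matches r
bad = fssd2_plugin(Z, W, 1.0, lambda z: 1.0 - z)   # q = N(1, I): shape mismatch
print(good, bad)                                   # good should be far smaller
```

Note that only the score ∇_z log q enters, so the normalizer of q is never needed, as the text emphasizes.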
Using the notation of Jitkrittum et al. [2017b], one can write F²_q = E_{z∼r} E_{z′∼r} Δ_q(z, z′), where Δ_q(z, z′) := τ_q(z)^⊤ τ_q(z′), τ_q(z) := vec(Ξ_q(z)) ∈ R^{dJ_q}, vec(M) concatenates the columns of M into a column vector, and Ξ_q(z) ∈ R^{d×J_q} is defined such that [Ξ_q(z)]_{i,j} := ξ^q_i(z, w_j)/√(dJ_q) for i = 1, …, d and j = 1, …, J_q. Equivalently, F²_q = ‖μ_q‖²₂, where μ_q := E_{z∼r}[τ_q(z)]. Similarly to the UME statistic described previously, given a sample Z_n = {z_i}_{i=1}^n ∼ r, an unbiased estimator of F²_q, denoted by F̂²_q, can be straightforwardly written as a second-order U-statistic, which can be computed in O(J_q n) time. It was shown in Jitkrittum et al. [2017b] that the test locations W can be chosen by maximizing the test power of the goodness-of-fit test proposing H0 : q = r against H1 : q ≠ r, using F̂²_q as the statistic. We note that, unlike UME, F̂²_q requires access to the density q. Another way to characterize the deviation of g_{q,r} from the zero function is to use the norm in the reproducing kernel Hilbert space (RKHS) that contains g_{q,r}. This measure is known as the Kernel Stein Discrepancy, and has a runtime complexity of O(n²) [Chwialkowski et al., 2016, Liu et al., 2016, Gorham and Mackey, 2015].

3 Proposal: Rel-UME and Rel-FSSD Tests

Relative UME (Rel-UME)   Our first proposed relative goodness-of-fit test, based on UME, tests H0 : U²(P, R) ≤ U²(Q, R) versus H1 : U²(P, R) > U²(Q, R). 
The test uses √n Ŝ^U_n := √n(Û²_P − Û²_Q) as the statistic, and rejects H0 when it is larger than the threshold T_α. The threshold is given by the (1 − α)-quantile of the asymptotic distribution of √n Ŝ^U_n when H0 holds (i.e., the null distribution), and the pre-chosen α is the significance level. It is well known that this choice of threshold asymptotically controls the false rejection rate to be bounded above by α, yielding a level-α test [Casella and Berger, 2002, Definition 8.3.6]. In the full generality of Rel-UME, two sets of test locations can be used: V = {v_j}_{j=1}^{J_p} for computing Û²_P, and W = {w_j}_{j=1}^{J_q} for Û²_Q. The feature function for Û²_P is denoted by ψ_V(x) := (1/√J_p) (k_X(x, v_1), …, k_X(x, v_{J_p}))^⊤ ∈ R^{J_p}, for some kernel k_X which can be different from the k_Y used in ψ_W. The asymptotic distribution of the statistic is stated in Theorem 1.

²In this work, since the distance is always measured relative to the data generating distribution R, we write U_Q instead of U(Q, R) to avoid cluttering the notation.

Theorem 1 (Asymptotic distribution of Ŝ^U_n). Define C^Q_W := cov_{y∼Q}[ψ_W(y), ψ_W(y)], C^P_V := cov_{x∼P}[ψ_V(x), ψ_V(x)], and C^R_{VW} := cov_{z∼R}[ψ_V(z), ψ_W(z)] ∈ R^{J_p×J_q}. Let S^U := U²_P − U²_Q, and let M ∈ R^{(J_p+J_q)×2} have first column (ψ^P_V − ψ^R_V; 0) and second column (0; ψ^Q_W − ψ^R_W). Assume that 1) P, Q and R are all distinct, 2) (k_X, V) are chosen such that U²_P > 0, and (k_Y, W) are chosen such that U²_Q > 0, and 3) [ζ²_P, ζ_PQ; ζ_PQ, ζ²_Q] := M^⊤ [C^P_V + C^R_V, C^R_{VW}; (C^R_{VW})^⊤, C^Q_W + C^R_W] M is positive definite. 
Then, √n (Ŝ^U_n − S^U) → N(0, 4(ζ²_P − 2ζ_PQ + ζ²_Q)) in distribution.

A proof of Theorem 1 can be found in Section C.1 (appendix). Let ν := 4(ζ²_P − 2ζ_PQ + ζ²_Q). Theorem 1 states that the asymptotic distribution of Ŝ^U_n is normal with mean given by S^U := U²_P − U²_Q. It follows that under H0, S^U ≤ 0, and the (1 − α)-quantile is S^U + √ν Φ⁻¹(1 − α), where Φ⁻¹ is the quantile function of the standard normal distribution. Since S^U is unknown in practice, we adjust the threshold to be √ν Φ⁻¹(1 − α), and use this as the test threshold T_α. The adjusted threshold can be estimated easily by replacing ν with ν̂_n, a consistent estimate based on samples. It can be shown that the test with the adjusted threshold is still level-α (more conservative in rejecting H0). We note that the same approach of adjusting the threshold is used in Rel-MMD [Bounliphone et al., 2015].

Better Fit of Q in Terms of W.   When specifying V and W, the model comparison is done by comparing the goodness of fit of P (to R), as measured in the regions specified by V, to the goodness of fit of Q, as measured in the regions specified by W. By specifying V and setting W = V, testing with Rel-UME is equivalent to posing the question "Does Q fit the data better than P does, as measured in the regions of V?" For instance, the observed sample from R might contain smiling and non-smiling faces, and P, Q are candidate generative models for face images. 
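Under the simplification V = W and k_X = k_Y adopted later in the paper, the full decision rule can be sketched as follows. This is a hedged sketch: the plug-in estimates of U²_P, U²_Q and of the ζ terms entering ν are our simplifications of Theorem 1, not the authors' exact estimators.

```python
import numpy as np
from statistics import NormalDist

def feat(a, W, sigma2):
    """Gaussian-kernel features psi_W(a), scaled by 1/sqrt(J); shape (n, J)."""
    d2 = ((a[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2)) / np.sqrt(W.shape[0])

def rel_ume_test(X, Y, Z, W, sigma2, alpha=0.05):
    """Level-alpha Rel-UME test of H0: U^2(P, R) <= U^2(Q, R), with V = W.

    Uses plug-in estimates of U^2_P, U^2_Q and of the asymptotic variance
    nu = 4(zeta_P^2 - 2 zeta_PQ + zeta_Q^2) from Theorem 1 (with V = W).
    Returns (reject, statistic, threshold); rejection favors model Q.
    """
    fx, fy, fz = feat(X, W, sigma2), feat(Y, W, sigma2), feat(Z, W, sigma2)
    n = fz.shape[0]
    a = fx.mean(axis=0) - fz.mean(axis=0)    # estimate of psi_P - psi_R
    b = fy.mean(axis=0) - fz.mean(axis=0)    # estimate of psi_Q - psi_R
    Cx, Cy, Cz = np.cov(fx.T), np.cov(fy.T), np.cov(fz.T)
    zeta_p2 = a @ (Cx + Cz) @ a
    zeta_q2 = b @ (Cy + Cz) @ b
    zeta_pq = a @ Cz @ b                     # cross term involves only R
    nu = max(4.0 * (zeta_p2 - 2.0 * zeta_pq + zeta_q2), 1e-12)
    stat = np.sqrt(n) * (a @ a - b @ b)      # sqrt(n)(U^2_P - U^2_Q), plug-in
    thresh = np.sqrt(nu) * NormalDist().inv_cdf(1.0 - alpha)
    return bool(stat > thresh), float(stat), float(thresh)

rng = np.random.default_rng(1)
n, d = 2000, 2
Z = rng.normal(0.0, 1.0, size=(n, d))        # data from R
X = rng.normal(1.0, 1.0, size=(n, d))        # sample from P (worse fit)
Y = rng.normal(0.0, 1.0, size=(n, d))        # sample from Q (here Q = R)
W = rng.normal(0.0, 1.0, size=(5, d))
reject, stat, thr = rel_ume_test(X, Y, Z, W, sigma2=1.0)
print(reject)                                # expect True: Q fits better
```

Swapping the roles of P and Q flips the sign of the statistic, so the same call with (Y, X, Z) should fail to reject, illustrating that the user's chosen model ordering matters.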
If we are interested in checking the relative fit in the regions of smiling faces, V can be a set of smiling faces. In the following, we will assume V = W and k := k_X = k_Y for interpretability. Investigating the general case without these constraints will be an interesting topic of future study. Importantly, we emphasize that test results are always conditioned on the specified V. To be precise, let U²_V be the squared UME statistic defined by V. It is entirely realistic that the test rejects H0 in favor of H1 : U²_{V1}(P, R) > U²_{V1}(Q, R) (i.e., Q fits better) for some V1, and also rejects H0 in favor of the opposite alternative H1 : U²_{V2}(Q, R) > U²_{V2}(P, R) (i.e., P fits better) for another setting of V2. This is because the regions in which the model comparison takes place are different in the two cases. Although not discussed in Bounliphone et al. [2015], the same behaviour can be observed for Rel-MMD, i.e., test results are conditioned on the choice of kernel.

In some cases, it is not known in advance which features are better represented by one model vs. the other, and it becomes necessary to learn these features from the model outputs. In this case, we propose setting V to contain the locations which maximize the probability that the test can detect the better fit of Q, as measured at those locations. Following the same principle as in Gretton et al. [2012b], Sutherland et al. [2016], Jitkrittum et al. [2016, 2017a,b], this goal can be achieved by finding (k, V) which maximize the test power, while ensuring that the test is level-α. By Theorem 1, for large n the test power P(√n Ŝ^U_n > T_α) is approximately Φ((√n S^U − T_α)/√ν) = Φ(√n S^U/√ν − Φ⁻¹(1 − α) √ν̂_n/√ν). Under H1, S^U > 0. For large n, Φ⁻¹(1 − α) √ν̂_n/√ν approaches a constant, and √n S^U/√ν dominates. It follows that, for large n, (k*, V*) = arg max_{(k,V)} P(√n Ŝ^U_n > T_α) ≈ arg max_{(k,V)} S^U/√ν. 
We can thus use Ŝ^U_n/(γ + √ν̂_n) as an estimate of the power criterion objective S^U/√ν for the test power, where γ > 0 is a small regularization parameter added to promote numerical stability, following Jitkrittum et al. [2017b, p. 5]. To control the false rejection rate, the maximization is carried out on held-out training data which are independent of the data used for testing. In the experiments (Section 4), we hold out 20% of the data for the optimization. A unique consequence of this procedure is that we obtain an optimized V* which indicates where Q fits significantly better than P. We note that this interpretation only holds if the test, using the optimized hyperparameters (k*, V*), decides to reject H0. The optimized locations may not be interpretable if the test fails to reject H0.

Relative FSSD (Rel-FSSD)   The proposed Rel-FSSD tests H0 : F²_p ≤ F²_q versus H1 : F²_p > F²_q. The test statistic is √n Ŝ^F_n := √n(F̂²_p − F̂²_q). We note that the feature functions τ_p (for F²_p) and τ_q (for F²_q) depend on (k_X, V) and (k_Y, W) respectively, and play the same role as the feature functions ψ_V and ψ_W of the UME statistic. Due to space limitations, we only state the salient facts of Rel-FSSD. The rest of the derivations closely follow Rel-UME. 
These include the interpretation that the relative fit is measured at the specified locations given in V and W, and the derivation of Rel-FSSD's power criterion (which can be obtained using the asymptotic distribution of Ŝ^F_n given in Theorem 2, following the same line of reasoning as in the case of Rel-UME). A major difference is that Rel-FSSD requires explicit (gradients of the log) density functions of the two models, allowing it to exploit structural information about the models that may not be as easily observed in finite samples. We next state the asymptotic distribution of the statistic (Theorem 2), which is needed for obtaining the threshold and for deriving the power criterion. The proof closely follows that of Theorem 1, and is omitted.

Theorem 2 (Asymptotic distribution of Ŝ^F_n). Define S^F := F²_p − F²_q. Let Σ^{ss′} := cov_{z∼r}[τ_s(z), τ_{s′}(z)] for s, s′ ∈ {p, q}, so that Σ^{pq} ∈ R^{dJ_p×dJ_q}, Σ^{qp} := (Σ^{pq})^⊤, Σ^{pp} = Σ_p ∈ R^{dJ_p×dJ_p}, and Σ^{qq} = Σ_q ∈ R^{dJ_q×dJ_q}. Assume that 1) p, q, and r are all distinct, 2) (k_X, V) are chosen such that F²_p > 0, and (k_Y, W) are chosen such that F²_q > 0, and 3) [σ²_p, σ_pq; σ_pq, σ²_q] := [μ_p^⊤ Σ_p μ_p, μ_p^⊤ Σ^{pq} μ_q; μ_p^⊤ Σ^{pq} μ_q, μ_q^⊤ Σ_q μ_q] is positive definite. Then, √n (Ŝ^F_n − S^F) → N(0, 4(σ²_p − 2σ_pq + σ²_q)) in distribution.

4 Experiments

In this section, we demonstrate the two proposed tests on both toy and real problems. We start with an illustration of the behaviors of Rel-UME and Rel-FSSD's power criteria using simple one-dimensional problems. In the second experiment, we examine the test powers of the two proposed tests using three toy problems. 
In the third experiment, we compare two hypothetical generative models on the CIFAR-10 dataset [Krizhevsky and Hinton, 2009] and demonstrate that the learned test locations (images) can clearly indicate the types of images that are better modeled by one of the two candidate models. In the last two experiments, we consider the problem of determining the relative goodness of fit of two given Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. Code to reproduce all the results is available at https://github.com/wittawatj/kernel-mod.

1. Illustration of Rel-UME and Rel-FSSD Power Criteria   We consider k = k_X = k_Y to be a Gaussian kernel, and set V = W = {v} (one test location). The power criterion of Rel-UME as a function of v can be written as (1/2) (wit²_{P,R}(v) − wit²_{Q,R}(v)) / (ζ²_P(v) − 2ζ_PQ(v) + ζ²_Q(v))^{1/2}, where wit(·) is the MMD witness function (see Section 2), and we explicitly indicate the dependency on v. To illustrate, we consider two Gaussian models p, q with different means but the same variance, and set r to be a mixture of p and q. Figure 1a shows that when each component in r has the same mixing proportion, the power criterion of Rel-UME is a zero function, indicating that p and q have the same goodness of fit to r everywhere. To understand this, notice that at the left mode of r, p has excessive probability mass (compared to r), while q has almost no mass at all. Both models are thus wrong at the left mode of r. However, since the extra probability mass of p is equal to the missing mass of q, Rel-UME considers p and q as having the same goodness of fit. In Figure 1b, the left mode of r now has a mixing proportion of only 30%, and r more closely matches q. The power criterion is thus positive at the left mode, indicating that q has a better fit.

The power criterion of Rel-FSSD indicates that q fits better at the right mode of r in the case of equal mixing proportions (see Figure 1c). 
Figure 1: One-dimensional plots (in green) of Rel-UME's power criterion (panels (a), (b)) and Rel-FSSD's power criterion (panels (c), (d)). The dashed lines in (a), (b) indicate the MMD witness functions used in Rel-UME, and the dashed lines in (c), (d) indicate the FSSD Stein witness functions.

In one dimension, the Stein witness function g_{q,r} (defined in Section 2) can be written as g_{q,r}(w) = E_{z∼r}[k_Y(z, w) ∇_z(log q(z) − log r(z))], which is the expectation under r of the difference in the derivative log of q and r, weighted by the kernel k_Y. The Stein witness thus only captures the matching of the shapes of the two densities (as given by the derivative log). Unlike the MMD witness, the Stein witness is insensitive to mismatches of probability mass, i.e., it is independent of the normalizer of q. In Figure 1c, since the shape of q and the shape of the right mode of r match, the Stein witness g_{q,r} (dashed blue curve) vanishes at the right mode of r, indicating a good fit of q in that region. The mismatch between the shape of q and the shape of r at the left mode of r is what creates the peak of g_{q,r}. The same reasoning holds for the Stein witness g_{p,r}. The power criterion of Rel-FSSD, which is given by (1/2) (g_{p,r}(w)² − g_{q,r}(w)²) / (σ²_p(w) − 2σ_pq(w) + σ²_q(w))^{1/2}, is thus positive at the right mode of r (the shapes of q and r match there), and negative at the left mode of r (the shapes of p and r match there). To summarize, Rel-UME measures the relative fit by checking probability mass, while Rel-FSSD does so by matching the shapes of the densities.

2. Test Powers on Toy Problems   The goal of this experiment is to investigate the rejection rates of several variations of the two proposed tests. 
To this end, we study three toy problems, each having its own characteristics. All three distributions in each problem have density functions, to allow comparison with Rel-FSSD.

1. Mean shift: All three distributions are isotropic multivariate normal distributions: p = N([0.5, 0, …, 0], I), q = N([1, 0, …, 0], I), and r = N(0, I), defined on R^50. The two candidate models p and q differ in the mean of the first dimension. In this problem, the null hypothesis H0 is true, since p is closer to r.

2. Blobs: Each distribution is given by a mixture of four Gaussian distributions organized in a grid in R². Samples from p, q and r are shown in Figure 4. In this problem, q is closer to r than p is, i.e., H1 is true. One characteristic of this problem is that the difference between p and q takes place at a small scale relative to the global structure of the data. This problem was studied in Gretton et al. [2012b], Chwialkowski et al. [2015].

3. RBM: Each of the three distributions is given by a Gaussian-Bernoulli Restricted Boltzmann Machine (RBM) model with density function p′_{B,b,c}(x) = Σ_h p′_{B,b,c}(x, h), where p′_{B,b,c}(x, h) := (1/Z) exp(x^⊤ B h + b^⊤ x + c^⊤ h − (1/2)‖x‖²), h ∈ {−1, 1}^{d_h} is a latent vector, Z is the normalizer, and B, b, c are model parameters. Let r(x) := p′_{B,b,c}(x), p(x) := p′_{B_p,b,c}(x), and q(x) := p′_{B_q,b,c}(x). Following a similar setting as in Liu et al. [2016], Jitkrittum et al. [2017b], we set the parameters of the data generating density r by uniformly randomly setting entries of B to be from {−1, 1}, and drawing entries of b and c from the standard normal distribution. Let δ be a matrix of the same size as B such that δ_{1,1} = 1 and all other entries are 0. 
We set B_q = B + 0.3δ and B_p = B + εδ, where the perturbation constant ε is varied. We fix the sample size n to 2000. Perturbing only one entry of B creates a problem in which the difference of distributions can be difficult to detect. This serves as a challenging benchmark to measure the sensitivity of statistical tests [Jitkrittum et al., 2017b]. We set d = 20 and d_h = 5.

We compare three kernel-based tests: Rel-UME, Rel-FSSD, and Rel-MMD (the relative MMD test of Bounliphone et al. [2015]), all using a Gaussian kernel. For Rel-UME and Rel-FSSD we set k_X = k_Y = k, where the Gaussian width of k and the test locations are chosen by maximizing their respective power criteria described in Section 3 on 20% of the data. The optimization procedure is described in Section A (appendix). Following Bounliphone et al. [2015], the Gaussian width of Rel-MMD is chosen by the median heuristic as implemented in the code by the authors. In the RBM problem, all problem parameters B, b, and c are drawn only once and fixed. Only the samples vary across trials.

Figure 2: (a) Mean shift, d = 50. (b) Blobs, d = 2. (c) Blobs (runtime). (d) RBM, d = 20. Panels (a), (b), (d) show rejection rates (estimated from 300 trials) of the five tests with α = 0.05; in the RBM problem, n = 2000. Panel (c) shows the runtime in seconds for one trial in the Blobs problem.

Figure 3: P = {airplane, cat}, Q = {automobile, cat}, and R = {automobile, cat}. (a) Histogram of Rel-UME power criterion values. (b), (c) Images as sorted by the criterion values in ascending and descending orders, respectively.

Figure 4: Blobs problem samples: p, q, r.

Figure 2 shows the test powers of all the tests. 
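For reference, the median heuristic used to set Rel-MMD's bandwidth is easy to state: the Gaussian bandwidth is derived from the median pairwise (squared) distance of a sample. A small sketch of one common convention follows (implementations, including the Rel-MMD authors' code, may differ in details such as which samples are pooled and whether distances are squared):

```python
import numpy as np

def median_heuristic_sigma2(X):
    """Bandwidth sigma^2 set to the median of squared pairwise distances
    over distinct pairs of points in X (one common convention)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return float(np.median(d2[np.triu_indices_from(d2, k=1)]))

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(500, 2))
sigma2 = median_heuristic_sigma2(X)
print(sigma2)   # for N(0, I_2), squared distances ~ 2*chi^2_2; median near 2.77
```

As the Blobs experiment illustrates, this bandwidth tracks the global length scale of the data, which is exactly why it can miss small-scale differences.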
When H0 holds, all tests have false rejection rates (type-I errors) bounded above by α = 0.05 (Figure 2a). In the Blobs problem (Figure 2b), it can be seen that Rel-UME achieves larger power at all sample sizes, compared to Rel-MMD. Since the relative goodness of fit of p and q must be compared locally, the optimized test locations of Rel-UME are suitable for detecting such local differences. The poor performance of Rel-MMD is caused by unsuitable choices of the kernel bandwidth. The bandwidth chosen by the median heuristic is only appropriate for capturing the global length scale of the problem. It is thus too large to capture small-scale differences. No existing work has proposed a kernel selection procedure for Rel-MMD. Regarding the number J of test locations, we observe that changing J from 1 to 5 drastically increases the test power of Rel-UME, since more regions characterizing the differences can be pinpointed. Rel-MMD exhibits a quadratic-time profile (Figure 2c) as a function of n.

Figure 2d shows the rejection rates against the perturbation strength ε in p in the RBM problem. When ε ≤ 0.3, p is closer to r than q is (i.e., H0 holds). We observe that all the tests have well-controlled false rejection rates in this case. At ε = 0.35, while q is closer (i.e., H1 holds), the relative amount by which q is closer to r is so small that a significant difference cannot be detected when p and q are represented by samples of size n = 2000; hence the low powers of Rel-UME and Rel-MMD. Structural information provided by the density functions allows Rel-FSSD (both J = 1 and J = 5) to detect the difference even at ε = 0.35, as can be seen from the high test powers. The fact that Rel-MMD has higher power than Rel-UME, and the fact that changing J from 1 to 5 increases the power only slightly, suggest that the differences may be spatially diffuse (rather than local).
3.
Informative Power Objective In this part, we demonstrate that test locations having positive (negative) values of the power criterion correctly indicate the regions in which Q has a better (worse) fit. We consider image samples from three categories of the CIFAR-10 dataset [Krizhevsky and Hinton, 2009]: airplane, automobile, and cat. We partition the images, and assume that the sample from P consists of 2000 airplane and 1500 cat images, the sample from Q consists of 2000 automobile and 1500 cat images, and the reference sample from R consists of 2000 automobile and 1500 cat images. All samples are independent. We consider a held-out random sample consisting of 1000 images from each category, serving as a pool of test location candidates.

Table 1: Rejection rates of the proposed Rel-UME, Rel-MMD, KID and FID, in the GAN model comparison problem. "FID diff." refers to the average of FID(P, R) − FID(Q, R) estimated in each trial. Significance level α = 0.01 (for Rel-UME, Rel-MMD, and KID).

     P    Q    R     Rel-UME J10   J20    J40    Rel-MMD   KID    FID    FID diff.
  1. S    S    RS    0.0           0.0    0.0    0.0       0.0    0.53   −0.045 ± 0.52
  2. RS   RS   RS    0.0           0.0    0.0    0.03      0.02   0.7    0.04 ± 0.19
  3. N    S    RS    0.0           0.0    0.0    0.0       0.0    0.0    −15.22 ± 0.83
  4. N    S    RN    0.57          0.97   1.0    1.0       1.0    1.0    5.25 ± 0.75
  5. N    S    RM    0.0           0.0    0.0    0.0       0.0    0.0    −4.55 ± 0.82
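Given such a pool of candidate test locations, scoring every candidate with a per-location power criterion and ranking them can be sketched as follows. Here `criterion_at` is a hypothetical stand-in for the Rel-UME power criterion evaluated at one location (positive where Q fits better, negative where P fits better), and the feature dimensions are only illustrative.

```python
import numpy as np

def rank_pool(pool, criterion_at):
    """Score each candidate test location with a per-location power
    criterion and return (candidates, scores) sorted in ascending order.
    `criterion_at` stands in for the Rel-UME power criterion at a single
    location; it is NOT the actual criterion, which depends on samples
    from P, Q and R."""
    scores = np.array([criterion_at(v) for v in pool])
    order = np.argsort(scores)
    return pool[order], scores[order]

# Toy illustration: 1000 candidates with 2048-dim (pool3-like) features.
rng = np.random.default_rng(0)
pool = rng.standard_normal((1000, 2048))
ranked, scores = rank_pool(pool, lambda v: float(v[0]))  # placeholder criterion
top_for_Q = ranked[::-1][:15]   # the 15 highest-criterion locations (Q better)
```

Sorting ascending surfaces the locations where P fits better (or where the models are indistinguishable, scores near zero); reading the ranking from the other end surfaces the locations where Q fits better.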
We set the kernel to be the Gaussian kernel on 2048 features extracted by the Inception-v3 network at the pool3 layer [Szegedy et al., 2016]. We evaluate the power criterion of Rel-UME at each of the test locations in the pool individually. The histogram of the criterion values is shown in Figure 3a. We observe that all the power criterion values are non-negative, confirming that Q is better than P everywhere. Figure 3c shows the top 15 test locations as sorted in descending order by the criterion, consisting of automobile images. These indicate the regions in the data domain where Q fits better. Notice that cat images do not have high positive criterion values: because they can be modeled equally well by P and Q, they have scores close to zero, as shown in Figure 3b.

4. Testing GAN Models In this experiment, we apply the proposed Rel-UME test to comparing two generative adversarial networks (GANs) [Goodfellow et al., 2014]. We consider the CelebA dataset [Liu et al., 2015], in which each data point is an image of a celebrity with 40 binary attributes annotated, e.g., pointy nose, smiling, mustache, etc. We partition the images on the smiling attribute, thereby creating two disjoint subsets of smiling and non-smiling images. A set of 30000 images from each subset is held out for subsequent relative goodness-of-fit testing, and the rest are used for training two GAN models: a model for smiling images, and a model for non-smiling images. Generated samples and details of the trained models can be found in Section B (appendix). The two models are trained once and fixed throughout.

In addition to Rel-MMD, we compare the proposed Rel-UME to the Kernel Inception Distance (KID) [Bińkowski et al., 2018] and the Fréchet Inception Distance (FID) [Heusel et al., 2017], which are distances between two samples (originally proposed for comparing a sample of generated images and a reference sample).
All images are represented by 2048 features extracted from the Inception-v3 network [Szegedy et al., 2016] at the pool3 layer, following Bińkowski et al. [2018]. When adapted for three samples, KID is in fact a variant of Rel-MMD in which a third-order polynomial kernel is used instead of a Gaussian kernel (on top of the pool3 features). Following Bińkowski et al. [2018], we construct a bootstrap estimator for FID (10 subsamples with 1000 points in each). For the proposed Rel-UME, the J ∈ {10, 20, 40} test locations are randomly set to contain J/2 smiling images and J/2 non-smiling images drawn from a held-out set of real images. We create problem variations by setting P, Q, R ∈ {S, N, RS, RN, RM}, where S denotes generated smiling images (from the trained model), N denotes generated non-smiling images, M denotes an equal mixture of smiling and non-smiling images, and the prefix R indicates that real images are used (as opposed to generated ones). The sample size is n = 2000, and each problem variation is repeated for 10 trials for FID (due to its high complexity) and 100 trials for the other methods. The rejection rates from all the methods are shown in Table 1. Here, the test result for FID in each trial is considered "reject H0" if FID(P, R) > FID(Q, R). Heusel et al. [2017] did not propose FID as a statistical test. That said, there is a generic way of constructing a relative goodness-of-fit test based on repeated permutation of samples of P and Q to simulate from the null distribution. However, FID requires computing the square root of the feature covariance matrix (2048 × 2048), and is computationally too expensive for permutation testing.

Overall, we observe that the proposed test does at least as well as existing approaches in identifying the better model in each case. In problems 1 and 2, P and Q have the same goodness of fit, by design.
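The generic permutation construction mentioned above can be sketched as follows. The null hypothesis is "P fits R at least as well as Q"; under equal fit, randomly swapping each paired sample point of P and Q leaves the statistic's distribution unchanged. The statistic `mean_diff_stat` below is a deliberately crude stand-in (distance between sample means, not the MMD- or FID-based statistics of the text), and all sizes are illustrative.

```python
import numpy as np

def relative_fit_permutation_test(Xp, Xq, Xr, statistic, n_perm=200, seed=0):
    """Permutation test of H0: "P fits R at least as well as Q".
    Large values of statistic(Xp, Xq, Xr) should indicate that Q fits
    better than P; the null distribution is simulated by swapping each
    paired point of Xp and Xq with probability 1/2."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(Xp, Xq, Xr)
    null = np.empty(n_perm)
    for i in range(n_perm):
        swap = rng.random(len(Xp)) < 0.5        # which pairs to swap
        Pp = np.where(swap[:, None], Xq, Xp)    # P sample with swaps applied
        Qq = np.where(swap[:, None], Xp, Xq)    # Q sample with swaps applied
        null[i] = statistic(Pp, Qq, Xr)
    p_value = float(np.mean(null >= t_obs))
    return t_obs, p_value

# Crude stand-in statistic: d(P, R) - d(Q, R), with d = distance of means.
def mean_diff_stat(Xp, Xq, Xr):
    return (np.linalg.norm(Xp.mean(0) - Xr.mean(0))
            - np.linalg.norm(Xq.mean(0) - Xr.mean(0)))

rng = np.random.default_rng(0)
Xp = rng.standard_normal((300, 2)) + np.array([3.0, 0.0])  # P far from R
Xq = rng.standard_normal((300, 2))                          # Q matches R
Xr = rng.standard_normal((300, 2))
t, p = relative_fit_permutation_test(Xp, Xq, Xr, mean_diff_stat)
```

This is exactly the kind of calibration that a raw FID comparison lacks: the p-value supplies the threshold that a bare difference of distances does not.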
In these cases, all the tests correctly yield low rejection rates, staying roughly at the design level (α = 0.01). Without a properly chosen threshold, the (false) rejection rates of FID fluctuate around the expected value of 0.5. This means that simply comparing FIDs (or other distances) to the reference sample without a calibrated threshold can lead to a wrong conclusion on the relative goodness of fit. The FID is further complicated by the fact that its estimator suffers from bias in ways that are hard to model and correct for (see Bińkowski et al. [2018, Section D.1]). Problem 4 is a case where the model Q is better. We notice that increasing the number of test locations of Rel-UME helps detect the better fit of Q. In problem 5, the reference sample is bimodal, and each model can capture only one of the two modes (analogous to the synthetic problem in Figure 1a). All the tests correctly indicate that no model is better than another.

CelebA dataset: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.

Figure 5: Examining the training of an LSGAN model with Rel-UME. (a) Sample from P = LSGAN trained for 15 epochs. (b) Sample from Q = LSGAN trained for 17 epochs. (c) Distributions of power criterion values computed over 200 trials. Each distribution is formed by randomly selecting J = 40 test locations from real images of a digit type. (d) Test locations showing where Q is better (maximization of the power criterion), and test locations showing where P is better (minimization).

5. Examining GAN Training In the final experiment, we show that the power criterion of Rel-UME can be used to examine the relative change of the distribution of a GAN model after training further for a few epochs.
To illustrate, we consider training an LSGAN model [Mao et al., 2017] on MNIST,\na dataset in which each data point is an image of a handwritten digit. We set P and Q to be LSGAN\nmodels after 15 epochs and 17 epochs of training, respectively. Details regarding the network\narchitecture, training, and the kernel (chosen to be a Gaussian kernel on features extracted from a\nconvolutional network) can be found in Section D. Samples from P and Q are shown in Figures 5a\nand 5b (see Figure 8 in the appendix for more samples).\nWe set the test locations V to be the set Vi containing J = 40 randomly selected real images of digit i,\nfor i \u2208 {0, . . . , 9}. We then draw n = 2000 points from P, Q and the real data (R), and use V = Vi\nto compute the power criterion for i \u2208 {0, . . . , 9}. The procedure is repeated for 200 trials where V\nand the samples are redrawn each time. The results are shown in Figure 5c. We observe that when\nV = V3 (i.e., box plot at the digit 3) or V9, the power criterion values are mostly negative, indicating\nthat P is better than Q, as measured in the regions indicated by real images of the digits 3 or 9. By\ncontrast, when V = V6, the large mass of the box plot in the positive orthant shows that Q is better\nin the regions of the digit 6. For other digits, the criterion values spread around zero, showing that\nthere is no difference between P and Q, on average. We further con\ufb01rm that the class proportions\nof the generated digits from both models are roughly correct (i.e., uniform distribution), meaning\nthat the difference between P and Q in these cases is not due to the mismatch in class proportions\n(see Section D). These observations imply that after the 15th epoch, training this particular LSGAN\nmodel two epochs further improves generation of the digit 6, and degrades generation of digits 3\nand 9. A non-monotonic improvement during training is not uncommon since at the 15th epoch the\ntraining has not converged. 
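The per-digit diagnostic above can be sketched as follows. Here `power_criterion` and `draw_samples` are hypothetical stand-ins (the real objective is the Rel-UME power criterion computed from fresh samples of P, Q and the data R); only the loop structure, the choice of J = 40 locations per digit, and the repetition over trials mirror the text.

```python
import numpy as np

def digit_criterion_distributions(power_criterion, draw_samples, real_by_digit,
                                  J=40, trials=200, seed=0):
    """For each digit i, repeatedly pick J random real images of that digit
    as test locations V_i, redraw samples from P, Q and R, and record the
    power criterion (positive => Q fits better on V_i, negative => P does)."""
    rng = np.random.default_rng(seed)
    results = {}
    for digit, pool in real_by_digit.items():
        vals = []
        for _ in range(trials):
            idx = rng.choice(len(pool), size=J, replace=False)
            V = pool[idx]                      # J locations of this digit
            Xp, Xq, Xr = draw_samples()        # fresh draws from P, Q, R
            vals.append(power_criterion(Xp, Xq, Xr, V))
        results[digit] = np.array(vals)
    return results

# Toy stand-ins so the sketch runs end to end (10 "digits", 50-dim features).
rng = np.random.default_rng(1)
real = {i: rng.standard_normal((100, 50)) for i in range(10)}
draw = lambda: tuple(rng.standard_normal((200, 50)) for _ in range(3))
crit = lambda Xp, Xq, Xr, V: float(np.mean(V))  # placeholder criterion
dists = digit_criterion_distributions(crit, draw, real, trials=20)
```

The resulting per-digit arrays are what the box plots in Figure 5c summarize: a distribution concentrated on the negative side flags a digit where P is better, one on the positive side a digit where Q is better.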
More experimental results from comparing different GAN variants on MNIST can be found in Section E in the appendix.

We note that the set V does not need to contain test locations of the same digit. In fact, the notion of class labels may not even exist in general. It is up to the user to define V to contain examples which capture the relevant concept of interest. For instance, to compare the ability of models to generate straight strokes, one might include digits 1 and 7 in the set V. An alternative to manual specification of V is to optimize the power criterion to find the locations that best distinguish the two models (as done in experiment 2). To illustrate, we consider greedily optimizing the power criterion by iteratively selecting a test location (from real images) which best improves the objective. Maximizing the objective yields locations that indicate the better fit of Q, whereas minimization gives locations which show the better fit of P (recall from Figure 1). The optimized locations are shown in Figure 5d. The results largely agree with our previous observations, and do not require manually specifying V. This optimization procedure is applicable to any models which can be sampled.

Acknowledgments

HK and AG thank the Gatsby Charitable Foundation for the financial support.

References

V. Alba Fernández, M. Jiménez-Gamero, and J. Muñoz Garcia. A test for the two-sample problem based on empirical characteristic functions. Computational Statistics and Data Analysis, 52:3730–3748, 2008.

B. Amos, B. Ludwiczuk, and M. Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16-118, CMU School of Computer Science, 2016.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.

L. Baringhaus and N. Henze.
A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35:339–348, 1988.

M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. In ICLR, 2018.

W. Bounliphone, E. Belilovsky, M. B. Blaschko, I. Antonoglou, and A. Gretton. A test of relative similarity for model selection in generative models. In ICLR, 2015.

G. E. P. Box. Science and statistics. Journal of the American Statistical Association, 71:791–799, 1976.

G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury Pacific Grove, CA, 2002.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, pages 2172–2180, 2016.

K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In NIPS, pages 1972–1980, 2015.

K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In ICML, pages 2606–2615, 2016.

J. Friedman and L. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In NIPS, pages 226–234, 2015.

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012a.

A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NIPS, pages 1205–1213, 2012b.

I.
Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, pages 5767–5777, 2017.

P. Hall and N. Tajvidi. Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2):359–374, 2002.

Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In NIPS, pages 609–616. MIT Press, Cambridge, MA, 2008.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.

W. Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math. Statist., 19(3):293–325, 1948.

W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton. Interpretable distribution features with maximum testing power. In NIPS, pages 181–189, 2016.

W. Jitkrittum, Z. Szabó, and A. Gretton. An adaptive test of independence with analytic kernel embeddings. In ICML, 2017a.

W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In NIPS, 2017b.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ArXiv e-prints, Dec. 2014.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

J. Lintusaari, M. Gutmann, R. Dutta, S. Kaski, and J. Corander. Fundamentals and recent developments in approximate Bayesian computation. Systematic Biology, 66(1):e66–e82, 2017.

Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In ICML, pages 276–284, 2016.

Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

J. R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests.
In NIPS, pages 829–837, 2015.

X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2813–2821. IEEE, 2017.

S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

P. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society B, 67(4):515–530, 2005.

T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. ArXiv e-prints, June 2016.

R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 2009.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory (ALT), pages 13–31, 2007.

B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. ArXiv e-prints, May 2017.

D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR, 2016.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

G.
Sz\u00e9kely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.\n\n11\n\n\fG. J. Sz\u00e9kely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate\n\nAnalysis, 93(1):58\u201380, 2005.\n\nM. Yamada, D. Wu, Y.-H. H. Tsai, I. Takeuchi, R. Salakhutdinov, and K. Fukumizu. Post selection\ninference with incomplete maximum mean discrepancy estimator. arXiv preprint arXiv:1802.06226,\n2018.\n\n12\n\n\f", "award": [], "sourceid": 444, "authors": [{"given_name": "Wittawat", "family_name": "Jitkrittum", "institution": "Max Planck Institute for Intelligent Systems"}, {"given_name": "Heishiro", "family_name": "Kanagawa", "institution": "Gatsby Unit, University College London"}, {"given_name": "Patsorn", "family_name": "Sangkloy", "institution": "Georgia Institute of Technology"}, {"given_name": "James", "family_name": "Hays", "institution": "Georgia Institute of Technology, USA"}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": "MPI for Intelligent Systems"}, {"given_name": "Arthur", "family_name": "Gretton", "institution": "Gatsby Unit, UCL"}]}