
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
== Summary ==
The paper presents nonparametric two-sample tests with a cost linear in the sample size. The proposed MMD formulation relies on the L2(R) distance between the characteristic functions of the corresponding distributions, where R is the density specified by the inverse Fourier transform of the kernel. This formulation was previously studied in Zhao & Meng (2014), for example. They showed that the original formulation fails to distinguish a large class of measures. To alleviate this problem, the paper presents a "smoothed" version of the formulation using an analytic smoothing kernel. The resulting metric is shown to be a "random metric", which satisfies all the conditions for a metric with the qualification "almost surely". The linear-time tests are obtained by approximating the integral with a sum over random samples from the inverse Fourier transform of the kernel. The new representation is a mapping of the space of probability measures into a finite-dimensional Euclidean space that is almost surely an injection, and as a result almost surely a metrization.
Interestingly, the proposed linear-time test can outperform the quadratic-time MMD in terms of test power.
== Quality ==
The paper is technically sound. Although the general idea is not completely new, the proposed
method is supported by both theoretical guarantee and empirical results, which is a key strength of this paper.
comments/questions:
How does the proposed method depend on the kernel function? Intuitively, it seems there is an interaction between the behaviour of the characteristic function and the kernel. For example, what if the distribution is heavy-tailed? Is there any difference when using Gaussian or Laplace kernels?
== Clarity ==
The paper is clearly written. Minor comments below:
comments/questions:
- Line 065: "... can be though as ..." => "... can be thought of as ..." ?
- Line 065: "... characteristics functions" => "... characteristic functions ..."
- Line 164: please check the last condition.
- Line 273: "... as ..." => "... are ..." ?
- Figure 1 is quite difficult to read as the colors are not distinguishable.
- Line 358: "... faster then ..." => "... faster than ..."
- Perhaps ref [32] can be updated to "J. Zhao and D. Meng. FastMMD: Ensemble of circular discrepancy for efficient two-sample test. Neural Computation, 27(6):1345-1372, June 2015."
== Originality ==
The proposed idea is similar to that presented in
J. Zhao and D. Meng. FastMMD: Ensemble of circular discrepancy for efficient two-sample test. Neural Computation, 27(6):1345-1372, June 2015.
The novelty of this work lies in the improvement of the empirical characteristic function formulation via "smoothing", which has been shown to improve the test statistic.
== Significance ==
Large-scale two-sample testing has become increasingly important in both machine learning and statistics. This work provides an efficient tool for such a problem.
Q2: Please summarize your review in 1-2 sentences
A neat idea for large-scale two-sample testing with concrete contributions, both theoretical and empirical. Accept.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s).
weak review
The authors propose a class of nonparametric two-sample tests, and show theoretically and empirically that they are faster than baseline approaches.
The paper contains a lot of theory (I didn't check the math) and several experiments indicating increased statistical power.
It looks like an accept (7)
Q2: Please summarize your review in 1-2 sentences
Seems to be a solid paper (extensive math, neat idea, several first experiments).
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s).
This paper addresses an important problem: testing whether two distributions are different (i.e. two-sample tests, which are ubiquitous in many data mining/machine learning applications).
It first introduces the maximum mean discrepancy (MMD) method, which is equivalent to the RKHS distance between the mean embeddings of the two probability distributions. Naively computing the MMD between two distributions is quadratic in the sample size, and a finite-dimensional approximation based on random Fourier feature sampling is suggested. $J$ random Fourier features are sampled from the kernel and evaluated at the characteristic functions of the two distributions. This distance, however, fails to distinguish a large class of probability distributions (Proposition 1), i.e. it lacks the injectivity property. Both of the newly proposed estimators turn the random Fourier feature finite-dimensional approximation into a random metric. The difference is: given the random Fourier features (i.e. the frequencies at which to test the difference), whether to evaluate the difference in characteristic functions or the difference in the mean embeddings.
The first proposed estimator evaluates the difference in characteristic functions. The key idea is to smooth the characteristic functions of the two distributions with an analytic kernel. Theorem 1, whose proof is in the appendix, claims that this increases the distinguishing power of the modified distance measure. The authors then demonstrate that the smoothed characteristic function can be expressed as an expected value of some function with respect to the given measure (yielding a linear-time estimator).
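To make the "expected value" observation concrete, here is a minimal 1-D sketch of our own construction (not code from the paper): smoothing the characteristic function with a Gaussian kernel l is, by Fubini, the same as a single expectation involving the Fourier transform of l, so one O(n) pass over the sample estimates it. The function name, bandwidth sigma, and the Gaussian choice of l are all assumptions for illustration.

```python
import numpy as np

# Smoothed characteristic function as a single expectation:
#   (phi_P * l)(t) = E_X[ exp(i t X) * l_hat(X) ],
# where l_hat is the Fourier transform of the smoothing kernel l.
def smoothed_cf(x, t, sigma=1.0):
    l_hat = np.exp(-0.5 * (sigma * x) ** 2)      # FT of a Gaussian is Gaussian
    return np.mean(np.exp(1j * t * x) * l_hat)   # O(n) per frequency t

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10000)
val = smoothed_cf(x, t=0.0)
# For X ~ N(0,1) and t = 0 this estimates E[exp(-X^2/2)] = 1/sqrt(2).
```

Evaluating this at J random frequencies gives the J-dimensional representation at linear cost.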
The second proposed estimator evaluates the difference in mean embeddings at the sampled random Fourier frequencies.
Hypothesis testing based on both of these estimators involves the computation of test statistics requiring the inversion of a $J \times J$ matrix (which the authors claim is small).
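A hedged sketch of how such a statistic can be formed, illustrating why only a $J \times J$ inversion is needed: evaluate a kernel at J random test locations for both samples, take the mean difference, and normalize by the empirical covariance, Hotelling-style. The function names, the Gaussian kernel choice, and gamma are our assumptions, not the paper's exact recipe.

```python
import numpy as np

def me_test_statistic(X, Y, T, gamma=1.0):
    def feat(Z):
        # k(z, t_j) = exp(-gamma * ||z - t_j||^2) for each test location t_j
        sq = ((Z[:, None, :] - T[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)                     # shape (n, J)
    D = feat(X) - feat(Y)                              # paired differences
    n, J = D.shape
    w = D.mean(axis=0)                                 # empirical mean, (J,)
    S = np.cov(D.T) + 1e-8 * np.eye(J)                 # J x J covariance
    return n * w @ np.linalg.solve(S, w)               # ~ chi^2_J under H0

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (500, 2))
Y = rng.normal(1.0, 1.0, (500, 2))    # mean-shifted alternative
T = rng.normal(0.0, 1.0, (3, 2))      # J = 3 random test locations
stat = me_test_statistic(X, Y, T)
# Reject at level 0.05 if stat exceeds the chi^2_3 critical value (~7.81).
```

The cost is O(nJ) to build the features and O(J^3) for the solve, which is negligible for small J.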
One minor note is that I initially found it confusing how to compute the difference in smoothed characteristic functions in linear time, after the metric based on mean embeddings had been introduced. The authors should reorganize the paper regarding this point.
Q2: Please summarize your review in 1-2 sentences
An interesting paper addressing fundamental problems in machine learning
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s).
Summary: This paper proposes two fast methods for nonparametric two-sample testing. These are based on the notion of random metrics on distributions. The authors propose two approaches to constructing random metrics: one based on smooth characteristic functions, and one based on kernel mean embeddings with analytic kernels. Both approaches essentially make use of representations of probabilities as analytic functions (in an RKHS). Analytic functions differ from each other almost surely w.r.t. any absolutely continuous distribution. Therefore, almost surely unique representations of probabilities can be defined by evaluating the analytic representations of those probabilities on a random sample from an absolutely continuous distribution. By doing such evaluations on J random samples, the probabilities can be a.s. uniquely represented as J-dimensional vectors. The random metrics are defined as distances between these J-dimensional vectors. The resulting tests are fast to compute and have high statistical power.
I think this paper is nice and should be accepted. It is clever to represent distributions by finite-dimensional vectors via analytic functions. I believe this approach can also speed up other methods based on kernel mean embeddings, and would therefore have significant impact.
Quality: Both the theoretical analysis and experiments of the paper are well conducted. The following are some comments.
As J goes to infinity, Eq. (10) tends to $\| \mu_P - \mu_Q \|_{L^2(\nu)}$, where $\nu$ is the absolutely continuous distribution of $T_j$. This further equals $\| T^{1/2} (\mu_P - \mu_Q) \|_H$, where $T$ denotes the integral operator defined by the kernel and the distribution $\nu$. This can be seen as a smoothed version of the MMD, as $T^{1/2}$ has a smoothing effect. Therefore Eq. (10) may be seen as an estimator of the smoothed MMD. This might explain why the test based on mean embeddings performs well empirically in high-dimensional settings. Is this understanding correct? And if the space of the rebuttal permits, please explain how the performance changes when J is increased.
Clarity: This paper is well organized and easy to understand.
Originality: I believe the clever idea of using analytic functions to represent distributions is original.
Significance: The paper proposes approaches for representing distributions a.s. uniquely as finite-dimensional vectors. I believe these are beneficial not only to two-sample testing, but also to other problems of nonparametric statistics. Examples include independence testing, Bayesian inference with graphical models, and reinforcement learning. These have been effectively tackled with kernel mean embeddings, but such methods are known to be slow. The approaches proposed in this paper can be used to speed up those kernel methods. Therefore I believe this paper would have significant impact.
Minor points: Line 164: if $P \neq Q$ then $d(P,Q) > 0$ a.s.?
Q2: Please summarize your review in 1-2 sentences
This paper proposes two fast methods for nonparametric two-sample testing, based on almost surely unique representations of distributions as finite-dimensional vectors derived from analytic functions. I believe the proposed approach will have significant impact, as it is beneficial not only to two-sample testing, but also to other methods based on kernel mean embeddings.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers for the positive assessment of the paper and for the useful feedback provided. We apologize for the typos and instances of poor clarity which were raised; these will be corrected in the final version.
Reviewer 1:
The relationship between kernel choice and test power will depend on the properties of the (unknown) distributions P and Q we are attempting to tell apart. One insight comes from looking at the differences between characteristic functions. The Laplace kernel has a Fourier transform which decays as 1/(1+t^2), and will give greater emphasis to differences in characteristic functions occurring at higher frequencies. The Gaussian kernel has a Fourier transform that decays exponentially, and focuses on differences at low frequencies. If such properties of P and Q are known a priori, the kernel may be chosen accordingly (of course, as [7] point out, if P and Q are fully known, then we will know precisely where their characteristic functions have the largest difference, but this is much stronger prior knowledge).
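The decay contrast described above can be illustrated numerically. This is our own 1-D, unit-bandwidth sketch with normalizing constants dropped; the function names are assumptions for illustration only.

```python
import numpy as np

# Spectral decay of the two kernels discussed above:
#   Gaussian kernel exp(-x^2/2)  ->  Gaussian spectrum exp(-w^2/2)
#   Laplace  kernel exp(-|x|)    ->  Cauchy spectrum   1/(1 + w^2)
gauss_spec = lambda w: np.exp(-0.5 * w ** 2)
laplace_spec = lambda w: 1.0 / (1.0 + w ** 2)

for w in (0.0, 2.0, 5.0):
    print(f"w={w}: gaussian={gauss_spec(w):.2e}, laplace={laplace_spec(w):.2e}")
# At w = 5 the Gaussian spectrum (~3.7e-06) lies four orders of magnitude
# below the Cauchy spectrum (~3.8e-02): the Laplace kernel keeps far more
# weight at high frequencies, hence its emphasis on high-frequency CF
# differences.
```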
We will update the reference to Zhao and Meng (FastMMD).
Reviewer 3:
It is true that sometimes the Smooth CF test outperforms the Mean Embeddings test, and indeed there is no reason in theory for either test to be uniformly more powerful. This is demonstrated in practice in our examples, where in some cases the smooth CF test is the more powerful test. The key benefit of our work, however, is that our new tests always give better power *per unit of computation* than other MMD-like quadratic or sub-quadratic tests. This performance/cost improvement is shown in experiments to be very substantial.
Reviewer 4:
The interpretation of equation 10 for large J is right; thank you for the new insight ($\| T^{1/2} (\mu_P - \mu_Q) \|_H$). Indeed, it might be the reason for the good performance in high-dimensional settings; however, since our J is small, it is hard to argue that we approximate the integral $\| \mu_P - \mu_Q \|_{L^2(\nu)}$ well. Experiments suggest that increasing J increases the power but eventually decreases the power/time ratio.
Reviewer 5:
The matrices that we invert are at most 10x10, since we use J at most 10. As outlined above, a larger J does not increase the power much but increases the computational time substantially.
We will reorganize the paper so that it is clear from the beginning that the difference in smooth characteristic functions can be calculated in linear time.

