| Paper ID | 1324 |
|---|---|
| Title | Error Analysis of Generalized Nyström Kernel Regression |
This paper studies Nyström kernel regression based on coefficient regularization. The main contribution is to show that the Nyström algorithm also works, with guaranteed error bounds, when the kernel is indefinite. The results seem to be novel and interesting.
It seems that the definition of the C^s kernel is not precise. The authors should check the equation in Definition 1 for the case s > 1, which is considered in Theorem 1. Also, since the dimension may be larger than 1, the absolute value in Definition 1 should be replaced by a concrete norm. The Epanechnikov kernel is not used on the real data; as the main contribution of the paper is on the side of indefinite kernels, it might be helpful to also report results for the Epanechnikov kernel on the real-world data. Minor comments: "real-word data" at the end of the introduction should be "real-world data"; LY in the paragraph after Eq. (9) does not appear to be defined beforehand.
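For concreteness, one common way a C^s-type smoothness condition is written in the learning-theory literature, with an explicit norm on R^d, is sketched below; this is only an illustrative formulation, not necessarily the paper's Definition 1.

```latex
% Illustrative C^s smoothness condition for a kernel on X \subset \mathbb{R}^d
% (a sketch of a standard formulation; not necessarily the paper's Definition 1).
% Case 0 < s \le 1 (Hölder-type condition with an explicit norm):
\[
  |K(x,t) - K(x',t)| \;\le\; c_s \,\|x - x'\|_2^{\,s}
  \qquad \text{for all } x, x', t \in X .
\]
% Case s > 1: K(\cdot,t) \in C^{\lfloor s \rfloor}(X) for every t \in X, and every
% partial derivative of order \lfloor s \rfloor satisfies the condition above
% with exponent s - \lfloor s \rfloor.
```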
1-Less confident (might not have understood significant parts)
The authors focus on the kernel ridge regression task with l_2-coefficient regularization [Eq.~(3)], on a compact domain of R^d. The primary goal is to analyze the performance of the Nyström technique in this task with 'Mercer kernels' (continuous bounded functions), which are not necessarily symmetric or positive definite. The main contribution is a generalization error bound on the Nyström-based estimator (Theorem 1), which can give rise to fast rates (Theorem 2) if the true regression function satisfies an exponential approximation requirement (Definition 2) with a C^\infty kernel. The authors numerically illustrate that in the studied task the Nyström method with column-norm sampling typically performs better than the Nyström technique relying on uniform sampling. The paper is 1) novel: it bridges the gap between two popular directions, l_2-coefficient regularized objective functions and approximation with the Nyström scheme; 2) well-organized and clearly written. There are a few statements in the paper which should be proved and some bugs to be corrected. I detail these comments, and my minor points, in the 'Qualitative assessment' section.
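To make the setup above concrete, here is a minimal numerical sketch of an l_2-coefficient-regularized least-squares fit over m Nyström landmarks, with either uniform or column-norm subsampling. All names (gnkr_fit, the Gaussian kernel choice, the bandwidth, lam) and the exact scaling of the penalty are illustrative assumptions, not the paper's actual formulation or code.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(x, t) = exp(-||x - t||^2 / (2 sigma^2)); any bounded continuous kernel could
    # be plugged in here, since the studied setting requires neither symmetry nor
    # positive definiteness of K.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def sample_landmarks(X, m, kernel, scheme="uniform", seed=None):
    # Uniform sampling needs no kernel evaluations; column-norm sampling draws index i
    # with probability proportional to the squared 2-norm of the i-th column of the
    # full n x n Gram matrix, which costs O(n^2) time and memory to form.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if scheme == "uniform":
        p = np.full(n, 1.0 / n)
    elif scheme == "column-norm":
        K = kernel(X, X)                          # full Gram matrix, O(n^2)
        p = (K ** 2).sum(axis=0)
        p = p / p.sum()
    else:
        raise ValueError(scheme)
    idx = rng.choice(n, size=m, replace=False, p=p)
    return X[idx]

def gnkr_fit(X, y, m, lam, kernel=gaussian_kernel, scheme="uniform", seed=None):
    # Coefficient-regularized Nystrom estimator: f(x) = sum_j alpha_j K(x, xtilde_j),
    # with alpha minimizing (1/n) ||y - K_nm alpha||_2^2 + lam ||alpha||_2^2
    # (a penalty on the coefficient vector, not an RKHS-norm penalty).
    landmarks = sample_landmarks(X, m, kernel, scheme, seed)
    K_nm = kernel(X, landmarks)                   # n x m
    n = X.shape[0]
    alpha = np.linalg.solve(K_nm.T @ K_nm + n * lam * np.eye(m), K_nm.T @ y)
    return landmarks, alpha

# Toy usage on synthetic 1-D data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 1))
y = np.sinc(3 * X[:, 0]) + 0.1 * rng.standard_normal(500)
landmarks, alpha = gnkr_fit(X, y, m=50, lam=1e-3, scheme="column-norm", seed=0)
y_hat = gaussian_kernel(X, landmarks) @ alpha
print("training MSE:", np.mean((y - y_hat) ** 2))
```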
I use the 'lX' = 'line X' shorthand below. 'X->Y' will mean "Please change 'X' to 'Y'".

Major comments:
==============

A) Main part:
-------------
1) I found the 'Introduction' section redundant: the issues and the contributions are quite similar; this part can be significantly compressed. Throughout the paper the notion of 'Mercer kernel' is left a bit undefined, even in the 'exact' definition ('line 96'); currently, it has to be decoded from fragments of the abstract and the introduction. I also suggest adding i) more emphasis on the motivation of general kernels (not necessarily symmetric and positive definite): What are the properties (a few illustrative examples can always help) which are not captured by the classical reproducing kernels? Further potential impact, ...; ii) Nystrom alternatives in the introduction, such as the random Fourier feature technique [1-5] or the sketching method [6].
References:
[1] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS-2007, p. 1177-1184.
[2] D. J. Sutherland and J. Schneider. On the Error of Random Fourier Features. In UAI-2015, p. 862-871.
[3] B. Sriperumbudur and Z. Szabo. Optimal Rates for Random Fourier Features. In NIPS-2015, p. 1144-1152.
[4] F. Bach. On the equivalence between quadrature rules and random features. Technical Report, 2015. (http://arxiv.org/abs/1502.06800)
[5] A. Rudi, R. Camoriano, and L. Rosasco. Generalization Properties of Learning with Random Features. Technical Report, 2016. (https://arxiv.org/abs/1602.04474)
[6] Y. Yang, M. Pilanci, and M. J. Wainwright. Randomized Sketches for Kernels: Fast and Optimal Non-Parametric Regression. Annals of Statistics (to appear; http://arxiv.org/abs/1501.06195).
2) There are some statements/conjectures in the paper without proof:
i) l103-106: Solution of (1). Note: the cited [18] paper does not include 'inf_{\alpha}' in the regularization term. In fact, without this 'inf_{\alpha}' the regularization is not even a function of \alpha (unless the k(\cdot,x_i) points are assumed to be linearly independent).
ii) l110-112: Solution of (2).
iii) l130-132: (F, ||\cdot||_F) is a Banach space.
iv) l203: Equivalent form of (9).
Please add proofs for these statements.
3) The computational complexity of the p = (p_i) vector in the column-norm sampling scheme seems to be O(n^2). This should also increase the complexity of (4) from O(mn+m^3), shouldn't it?
4) Sections 3-4: In these sections there are several function definitions (f_\rho, f_z, \pi(f_z), f_\lambda, g_\lambda, \hat{g}_\lambda). Based on the main text it is not obvious how these quantities will be connected, i.e., what the bridges will be. If there is space left [there are many displayed equations which can be made inline, or collected (see for example H, ||g||_H), to save space], it would be great to give a few-lines-long proof idea in the main part of the paper, 'the intuition of the bridges'.
5) Could the authors provide some concrete/illustrative examples, i.e., easy-to-check sufficient conditions for f_\rho in Def. 2? It would be nice to show that there actually are problems where this assumption holds.
6) The interpretation of Theorem 2 could be made more accessible by a quick discussion on the value of p and its impact, such as "p \in (0,2), m = n^{1/(2+p)} means that m can be chosen between n^{1/4} and n^{1/2}" (see the short worked illustration after these detailed comments). Similarly, noting that 'The smoothness of K is encoded in Def. 2.' can help understanding.
7) l195-206: This part should be improved.
The displayed equation is hard to interpret; one could write instead "S(p_1,\ldots,p_n) = ..., task: \min_{p_1,\ldots,p_n} S(p_1,\ldots,p_n).". In line 203 "L" is undefined. "h_ii" is undefined; based on the supplement it should be "L_ii". "the minimizer of (9) can be approximated by": from the main text it is not clear in which sense this holds. A short sentence/discussion on the intuition of the "L_ii -> 0" assumption would also be useful to add.

B) Supplement:
--------------
1) Unproved statements:
l25: \phi'(0) = 0.
l26-27: "according to the arbitrariness of h, we get \lambda f_\lambda(t) = ...". Note: I suppose that the authors wanted to construct some (h_n) \subset L^2_{\rho_X} sequence and apply a limiting argument.
l42-43: last inequality in (2).
l102-103: order of "\sqrt{\frac{\log(2/\delta)}{m \lambda}}", see 2) below.
Lemma 5: 'R can be large enough', see 3) below.
Please prove these gaps.
2) l102-103: Check the constants of the 3rd term on the r.h.s. [= D(\lambda) x (...)]. A "\sqrt{\frac{\log(2/\delta)}{m \lambda}}" type term is missing (it comes from the E_2 bound, l37-38); if it is dominated by some other term, justify why. The same '\sqrt{...}' comment holds for Theorem 2.
3) l94-95: It looks like an 'R is large enough' assumption has been implicitly used here. Note: by assumption D(\lambda) \le c_\beta \lambda^{\beta}, where \beta \in (0,1]. R = \sqrt{D(\lambda)}/\lambda \le \sqrt{c_\beta \lambda^{\beta}}/\lambda is just an exploding upper bound as \lambda -> 0. Show that 'R is large enough' for sufficiently small \lambda, or assume it in Theorems 1-2.

Minor points:
=============

A) Main part:
-------------
1) l96-97: The boundedness of y is assumed; please make this assumption explicit -- otherwise one cannot assume y \in [-1,1] w.l.o.g. In fact, it would be nice to keep the general kernel bound (\kappa) and the bound on y, to make it explicit how these constants appear in the final performance guarantee (in a linear/quadratic/... form).
2) l123: This part of the sentence reads a bit strange.
3) l127-128: 'Let F be a square integrable space on X...' -> 'Let F be the square integrable space on X...'. Note: for a subset of L^2_{\rho_X}, the range of L_K has to be studied separately.
4) l130-132: in the definition of H: 'g(\cdot)' -> 'g', 'f(\cdot)' -> 'f'.
5) l130-132: in the definition of ||\cdot||_H: 'g(x)' -> 'g', 'f(x)' -> 'f', 'x \in X' -> ''.
6) l143, l165: 'the generalization error bound' -> 'our generalization error bound', 'we provide the main results' -> 'we provide our main results' [reading the text, it is not really explicit that these are the authors' results].
7) After Definition 2: a quick note could be useful on the meaning of '\beta', such as 'larger \beta means an easier task'.
8) line 215: in the K_E definition: '(...)' -> '\left(...\right)'.
9) Section 5.1: The authors could note, when listing the f_i-s, that these are the true f_\rho-s.
10) l241: 'Figure 1 shows the performance of the three sampling strategies' -> 'Figure 1 shows the performance of the two sampling strategies' [3 -> 2]. Note: the subfigures in Fig. 1 could be arranged in a 3x3 format to save space.
11) References: [1,2,8,14]: missing page numbers. [3]: 'monte carlo' -> 'Monte Carlo'. [6]: 'reguularization' -> 'regularization', 'bouunds' -> 'bounds'. [20,25]: missing URLs.
B) Supplement:
--------------
l3: 'provides the detail proof' -> 'provides the detailed proof'.
l12-13: the inequality seems to hold with equality as well.
l22: 'E_z(0) = 1' -> 'E_z(0) \le 1' since |y_i| \le 1 [i.e., this equality should be an inequality].
l23: 'a auxiliary function' -> 'an auxiliary function'.
l24: 'h is an any fixed function': at least h \in L^2_{\rho_X} seems to be needed to make the integral well-defined.
l57: "N_{2,u}(F,\epsilon) = N_{2,u}(...)": on the r.h.s. the '2,u' subscript should be deleted (to be compatible with Def. 1).
l62: "Ef \le c (Ef)^\alpha" -> "Ef^2 \le c (Ef)^\alpha", i.e., f should be squared on the l.h.s.
l77: the r.h.s. of |g_1(z)-g_2(z)| should be '| |2y-\pi(f_1)-f_\rho(x)| |\pi(f_1)(x)-f_\rho(x)| - |2y-\pi(f_2)-f_\rho(x)| |\pi(f_2)(x)-f_\rho(x)| |'.
l80: "true for and g" -> "true for any g".
l81: please add the "\alpha = 1" choice.
l87: 'It is easy to check that \hat{g}_{\lambda} \in G' -> 'By the definition of \hat{g}_{\lambda}, using v_i = \bar{x_i}, it follows that \hat{g}_{\lambda} \in G.'
l98-99: 'E_{11}' -> 'E_1'.
l102-103: '\pi(f_{\rho})' -> 'f_{\rho}' [2x]; l106-109: similarly [3x]. In other words, the \pi clipping is superfluous on f_\rho.
l112-113: I guess a \sum_i is missing on the r.h.s. of the inequality.
l113: 'inequality is follows from Hoder inequality' -> 'inequality follows from the Cauchy-Schwarz inequality' ['is' is superfluous; 'Hoder' -> 'Holder' -> 'Cauchy-Schwarz' (it is sufficient)].
l114: the Cauchy-Schwarz inequality holds with equality when the arguments are equal up to multipliers: \sqrt{1-L_{ii}} / \sqrt{p_i} ||K_i||_2 k^{-1} = \sqrt{p_i}; the conclusion is OK.
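As a small worked illustration of the point raised in major comment A.6 above (the numbers below are mine, not the paper's):

```latex
% Range of the subsampling size implied by m = n^{1/(2+p)} with p \in (0,2):
%   p \to 2 gives m \to n^{1/4}, while p \to 0 gives m \to n^{1/2}.
% For example, with n = 10^6 training points,
\[
  n^{1/4} \approx 32 \;\le\; m = n^{1/(2+p)} \;\le\; n^{1/2} = 1000 ,
\]
% so the landmark set contains between roughly 32 and 1000 points, far fewer than n.
```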
3-Expert (read the paper in detail, know the area, quite certain of my opinion)
This paper studies the Nyström approximation for kernel ridge regression. The main contributions of the paper, as emphasized by the authors with the word "generalized" in the title, are results for kernels that are not symmetric or positive semi-definite.
As presented by the authors, a novelty of this paper is the use of coefficient regularization, as opposed to norm regularization. Coefficient regularization has been widely used in the literature; indeed, the connection between a given function and its coefficients is straightforward thanks to the connections between the primal and the dual optimization problems. The experiments are disappointing: while the main motivation of this work is indefinite kernels, most of the experiments are done with the conventional Gaussian kernel on real and simulated data. The only indefinite kernel used is the Epanechnikov kernel, and only on simulated data. We think that extensive experiments should be done in order to examine the relevance of this work. There are some typos, such as "bouunds". UPDATE: The authors have addressed our major concerns. We have adjusted the scores accordingly.
2-Confident (read it all; understood it all reasonably well)
This paper studies the convergence of the kernel-based regression estimator under the Nystr\"{o}m approximation. The Nystr\"{o}m approximation is a classic method that aims at scaling up kernel methods by reducing their computational cost. In the kernel ridge regression setup, the generalization ability of the kernel ridge regression estimator with the Nystr\"{o}m approximation has been well understood. Here, it is frequently assumed that the kernel of interest is a Mercer kernel. In this paper, instead of assuming the positive definiteness of the kernel, the authors consider a more general setup where the kernels are not necessarily positive definite. Learning schemes within this setup are usually termed coefficient-based schemes and were originally introduced in Scholkopf and Smola, MIT Press, 2001. Without the positive definiteness restriction on the kernels, more flexibility of the considered learning scheme can therefore be pursued. Concerning the theoretical results of this paper, the generalization bounds of the $\ell^2$-norm regularized regression scheme under the Nystr\"{o}m approximation are derived; this is the main theoretical result of this paper. Besides the generalization bounds, this paper also discusses two different subsampling schemes in the Nystr\"{o}m approximation, namely uniform subsampling and column-norm subsampling.
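For readers less familiar with coefficient-based schemes, a generic form of such an $\ell^2$-coefficient-regularized estimator is sketched below; this is the standard formulation from the literature, and the normalization and constants may differ from the paper's Eq. (3).

```latex
% Generic coefficient-based (l^2-coefficient-regularized) kernel regression scheme
% (a sketch of the standard formulation; constants/normalization may differ from
% the paper's Eq. (3)):
\[
  f_{\mathbf z,\lambda} = \sum_{j=1}^{n} \alpha_j K(\cdot, x_j),
  \qquad
  \boldsymbol\alpha
  = \arg\min_{\boldsymbol\alpha \in \mathbb{R}^{n}}
    \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{n}\alpha_j K(x_i,x_j)\Bigr)^{2}
    + \lambda \sum_{j=1}^{n}\alpha_j^{2} .
\]
% Only the coefficient vector is penalized, so K need not be symmetric or positive
% semi-definite; a Nystrom variant replaces the n expansion points by m subsampled
% landmarks.
```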
This paper is presented clearly and is easy to follow. It is interesting to see the Nystr\"{o}m approximation studied in a different setup, which should draw some interest from researchers in this field. The analysis conducted in this paper seems sound. My only suggestion is that, in my view, the motivation for conducting the present study should be explained more clearly.
2-Confident (read it all; understood it all reasonably well)
The authors consider the problem of generalized Nyström kernel regression (GNKR) with $\ell_2$-norm regularization: the idea is to consider Nyström regression with general kernels (not necessarily symmetric or positive semi-definite) and to combine it with $\ell_2$-norm regularization. They provide a generalization error bound (Theorem 1), a consequent oracle inequality for the regression function (Eq. (8)), and the convergence rate of GNKR, underlining its dependence on the subsampling size (Theorem 2). The subsampling size is chosen as a trade-off between approximation performance and computational complexity. Experiments on synthetic and real data are provided.
The paper is well structured: the problem is clearly and exhaustively explained. Their main contributions are well described with respect to previous works.
Additional comments:
> line 22: "simplicity" should be replaced by "simple"
> line 56: "integrating" should be replaced by "combining"
> line 131: confusing notation: $\| \cdot\|_{\mathcal F}$ should be replaced by $\| \cdot\|_{L^2_{\rho_{\mathcal X}}}$
2-Confident (read it all; understood it all reasonably well)
This work proposes a novel theoretical analysis of Nyström-based kernel ridge regression (KRR) with general kernel functions (not necessarily positive definite), called generalized Nyström kernel regression (GNKR). The analysis includes generalization bounds and convergence rates with polynomial decay. The experimental section first evaluates the performance of GNKR on synthetic data, both with uniform and non-uniform sampling schemes, showing that for some datasets a general kernel (in this case, the Epanechnikov kernel) performs better than a positive-definite one (the Gaussian kernel). Secondly, the effect of different sampling schemes on GNKR with the Gaussian kernel (which is positive definite) is considered for real-world datasets, showing that column-norm subsampling performs better than uniform subsampling.
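For reference, a minimal sketch of the two kernel functions compared above; the bandwidth parametrization and scaling are my assumptions and may differ from the paper's.

```python
import numpy as np

def gaussian_kernel(x, t, sigma=1.0):
    # Positive-definite kernel: exp(-||x - t||^2 / (2 sigma^2)).
    return np.exp(-np.sum((x - t) ** 2) / (2 * sigma ** 2))

def epanechnikov_kernel(x, t, h=1.0):
    # Compactly supported kernel, used in the paper as an example of a general
    # (not positive-definite) kernel: (3/4) (1 - ||x - t||^2 / h^2) on ||x - t|| <= h.
    u2 = np.sum((x - t) ** 2) / h ** 2
    return 0.75 * (1.0 - u2) if u2 <= 1.0 else 0.0

x, t = np.array([0.2, -0.1]), np.array([0.5, 0.3])
print(gaussian_kernel(x, t), epanechnikov_kernel(x, t, h=1.0))
```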
The work is an interesting extension of previous ones to general kernels (e.g., kernel functions which are not positive definite). The structure of the paper presents quite clearly the reasoning behind the work, which seems sound. Unfortunately, the exposition is not as clear in terms of grammar, and this makes the paper somewhat difficult to read smoothly. For this reason, please have your manuscript proof-read for English grammar and style issues. A list of some of the most frequent typos (not all of them) can be found below.

Theory
------
The theoretical results look correct. However, a deeper discussion of Theorem 2 would be required. For instance, it seems that the presented bound is not a generalization of the one in [14]. In fact, in the case of positive-definite kernel functions, the optimal bound in [14] is not recovered by Theorem 2. This open question should be addressed in depth in the discussion of the theoretical results, by including a precise comparison of the bounds.

Experiments
-----------
The experiments on toy synthetic data presented in Subsection 5.1 are interesting, as they show that for a non-continuous and a non-flat regression function a general kernel function (the Epanechnikov kernel) yields higher test accuracy than the positive-definite Gaussian kernel. This provides an empirical motivation for the proposed analysis. On the other hand, the experiments on real data in Subsection 5.2 appear to be out of scope, since they compare the predictive performance of GNKR under different sampling schemes, but with a positive-definite Gaussian kernel instead of a general kernel. Therefore, the actual algorithm used here seems to be Nyström KRR. Consequently, the experimental results do not provide insight into the influence of sampling schemes in the general setting treated in the core of the paper.

Some examples of frequent typos (not an exhaustive list)
--------------------------------------------------------
- necessary to be --> necessarily
18 - subsampling matrix --> subsampled matrix
22 - most simplicity strategy --> most simple strategy
- "the" is often misused
87 - Different from the previous works --> Differently from previous works
93 - independently --> independently and identically
94 - square --> squares
98 - good --> well
104 - semi-definite --> semi-definiteness
105 - kernel --> the kernel
107 - implement --> implementation
107, 108 (and beyond) - computation burden, computation complexity --> computational burden, computational complexity
108 - computation requirement --> computational requirements
112 - $O(mn+m^3)$ --> it might be $O(mn^2+m^3)$
117, 119 - subset are drawn --> subset is drawn
121 - similar --> a similar
122 - [3, 4] --> [3, 4],
122 - Gram matrix --> the Gram matrix
185 - Different --> Differently
195 - simply --> simple
Table 3 - #Feature --> # Features
Table 3 - #Instance --> # Instances
Figure 1 - Subsapling --> Subsampling
Figure 1 - Number of sampling samples --> Subsampling level
3-Expert (read the paper in detail, know the area, quite certain of my opinion)