Reviews: Solving Interpretable Kernel Dimensionality Reduction

Summary: [19] has proposed recently an efficient iterative spectral (eigendecomposition) method (ISM) for the non-convex interpretable kernel dimensionality reduction (IKDR) objective in the context of alternative clustering. It established theoretical guarantees of ISM for the Gaussian kernel. The paper extends the theoretical guarantees of ISM to a family of kernels [Definition 1]. Each kernel in the ISM family has an associated surrogate matrix \Phi and the optimal projection is formed by the most dominant eigenvectors of \Phi [Theorem 1 and 2]. They showed that any conic combination of the ISM kernels is still an ISM kernel [Proposition 1] and therefore ISM can be extend to conic combination of kernels. They further showed how a range of different learning paradigms (i.e. supervised, unsupervised, semi-supervised dimension reduction and alternative clustering) can all be formulated as a constraint optimisation problem (equation 1) which can be solved by ISM. Hence the realisation of ISM’s potential through the generalisation into a wider range of kernels impacts a wide range of applications. The performance of the ISM using a range of kernels from the ISM family is illustrated in 5 real data experiments under different learning paradigms. Quality: The theoretical claims are well supported by theoretical analysis. A variety of experimental results are presented showcasing the application of the ISM algorithm on a variety of learning paradigms using a variety of kernels. Clarity: Overall, the paper is written with clear structure. However, there are a few places that confuses me. - The paragraph starting line 153, the authors starts to introduce the matrices used and using the language from group theory I believe? Can the authors elaborate on the reasons and connections (if any)? - Can the authors provide some intuitions on the ISM kernels and the associated phi matrix? All the proofs seems to fall magically into place using this phi matrix, so in some sense this phi matrix capture all the informations about the corresponding kernel matrix? Is this correspondence unique? - The matrix A was first defined on p.16 in the proof of Lemma 1, however was first mentioned in Theorem 3. - For the notation in Definition 1 and later the proof of Lemma 1, I am a little confused by what a(x_i, x_j) is? A function with two arguments? Can it be any functions? (I notice the authors discussed when a = x_i, or when a = x_i - x_j, but when do we use which (some derivation of the phi matrix used the former while others used the later)? Some clear explanation is needed here.) Also, what does {\bf a } = a(x_i, x_j ) mean? - In Table 4, MNIST dataset the row with NMI, why does it suddenly change into percentage and why are both bolded? Also, for the wine example, why are NMI with 0.86 and 0.84 all bolded? Originality: The main contribution of the work is the extension of the theoretical guarantees of ISM [19] to a family of kernels and any conic combination of kernels from this family of kernels. While showing that different objective functions from a variety of IKDR problems can be formulated into the same form, the paper illustrates the wider applicability of the ISM algorithm. Though established some new/improved results on the existing datasets, I feel the main contributions of the paper are theoretical. Significance: The realisation of ISM’s potential through the generalisation into a wider range of kernels impacts a wide range of applications. We saw in the experiments, ISM can now be applied to solve a variety of IKDR problems with better computational efficiency comparing to other state of art manifold optimisations. Comparable and sometimes better performance are observed by utilising a different kernel than the Gaussian kernel. Practitioners are likely to use the ideas. ======================= Authors response read. I am happy with the response provided.

Paper ID:	4340
Title:	Solving Interpretable Kernel Dimensionality Reduction

Reviewer 1

Reviewer 2

Reviewer 3