NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1185
Title: Examples are not enough, learn to criticize! Criticism for Interpretability

Reviewer 1

Summary

This submission presents a model criticism approach for machine learning methods, inspired by Bayesian model criticism.

Qualitative Assessment

A nice and simple model criticism method, derived the right way.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

The paper proposes a method for identifying examples where a model performs badly, in order to help understand what the model does.

Qualitative Assessment

The paper argues that examples of where a model performs well are insufficient to understand it completely and that, in addition, examples of where it does not perform well should be given. The authors propose a methodology for this, leveraging existing techniques. Providing means to understand machine learning-generated models is an important research area with a large potential impact. The proposed approach is justified and explained well. It would help understanding to present the overall algorithm used rather than just the part given in Algorithm 1. One aspect I particularly like about this paper is the thorough evaluation, which includes user studies. The figures illustrate the approach nicely and demonstrate the usefulness of the returned prototypes and criticisms.

The presentation of the paper could be improved. For example, the notation is sometimes used inconsistently: on page 3, line 105, R is not in the font used in the rest of the paper; page 4, line 119.5 uses u and v, which are not defined; on page 5, lines 188 and 190, k-medoids is misspelled; and on page 9, lines 341-342, the citation uses a different style than the rest. In general, though, the paper is well written. In summary, I feel that this paper would make a valuable addition to the NIPS programme.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 3

Summary

The paper presents an approach for selecting prototypes that best describe a dataset, as well as criticism examples, which indicate which aspects of the dataset are not captured by the prototypes. The goal is to improve the interpretability of the data. The selection of the prototypes and criticisms is based on the maximum mean discrepancy (MMD) criterion and a greedy algorithm. This is facilitated by the submodularity of the cost function, which is proven for a larger class of problems. Experimental results show that the method performs comparably to related nearest-prototype methods in terms of classification performance, and is considered interpretable in a small user study.
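For reference, the empirical squared MMD between the dataset X = {x_1, ..., x_n} and a candidate prototype subset S = {s_1, ..., s_m} under a kernel k has the standard form (the notation here is generic rather than the paper's):

\[
\mathrm{MMD}^2(X, S) \;=\; \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j)
\;-\; \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x_i, s_j)
\;+\; \frac{1}{m^2}\sum_{i,j=1}^{m} k(s_i, s_j),
\]

and the prototypes are the subset that makes this quantity small; the greedy algorithm and the submodularity argument mentioned above concern the corresponding set function.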

Qualitative Assessment

The paper tackles the underexplored but very important problem of interpretable machine learning, and has many potential applications. Although I am not able to understand the details of the proofs in Section 3, using MMD to select prototypes and criticisms seems to be an intuitive and well-motivated approach. The experiments are appropriate and it is refreshing to see a user study included. The explanation of the procedure and the presentation of the results for the user study could be a bit clearer (ideally with an example of a question as shown in the user interface, and a table for the results).

Minor concerns:
- Line 225: the earliest references to "nearest prototype classifier" are from the 90s, including, among others, work by Kuncheva et al.
- Line 234: local kernel – isn't it required to know the labels of the test data in order to compute this? It might be good to mention that this is an oracle / upper limit.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

Given a set of examples, the objective of this paper is to select a subset of examples that are representative of the distribution (called prototypes) as well as the outliers (called critics). To achieve this objective, the authors propose the MMD-critic algorithm, which is based on the recently proposed maximum mean discrepancy (MMD) measure. The authors formulate optimization problems for prototype and critic selection and show that, for RBF kernels, both are submodular and admit greedy selection algorithms. Due to the difficulties associated with direct evaluation of the algorithm, the experiments use two indirect methods: nearest prototype classification and human subject evaluation. The results indicate that the algorithm serves its original purpose.
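To make the selection step concrete, here is a minimal sketch of greedy prototype selection under an RBF kernel, written as I understand the idea from the summary above: repeatedly add the point that most reduces the squared MMD between the data and the selected subset. The function names, the normalization, and the use of plain NumPy are my own illustrative assumptions, not the authors' implementation or their exact (submodular) objective.

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix between the rows of X and the rows of Y.
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def greedy_prototypes(X, m, gamma=1.0):
    # Greedily pick m prototype indices that approximately minimize the
    # squared MMD between the full dataset and the selected subset.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    colsum = K.sum(axis=0)  # enters the data-to-prototype cross term
    selected = []
    for _ in range(m):
        best_j, best_gain = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            s = np.array(selected + [j])
            # Negative squared MMD, up to the constant data-data term.
            cross = 2.0 * colsum[s].sum() / (n * len(s))
            self_term = K[np.ix_(s, s)].sum() / (len(s) ** 2)
            gain = cross - self_term
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
    return selected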

Qualitative Assessment

Positives:
+ The authors give an elegant, principled approach for the selection of representative examples.
+ The experiments on human subjects are inventive.
+ The results are promising.
+ The paper is well written.

Negatives:
- The proposed algorithm is limited to RBF kernels, and the prototype selection (described in Section 3.1) does not qualitatively differ from many previously proposed distance-based prototype selection algorithms.
- The critic part of the algorithm (Section 3.2) is essentially an outlier selection algorithm. The paper does not discuss its connections with outlier selection, and no comparisons are made with any existing outlier selection algorithm. Since all the baselines used in the experiments are prototype-based algorithms and not outlier selection algorithms, the baselines can be considered straw men.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

The authors present the MMD-critic algorithm, which selects prototypes and criticisms for human-interpretable machine learning by optimizing a statistic called the maximum mean discrepancy (MMD). This is a novel application of a method that was previously applied to model comparison and to comparing models to input data. They describe the model and give a proof of submodularity that ensures the method is scalable. They tested the method against other prototype-based machine learning baselines from Bien and Tibshirani (2011) for different numbers of prototypes on the USPS handwritten digit database. The local version of MMD-critic showed performance superior to the baselines, mainly for fewer than 1000 prototypes. A qualitative test was done on images of dogs. Finally, they ran a pilot study to collect measures of interpretability for MMD-critic and concluded that people performed best when given both prototypes and criticisms.
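As a concrete reading of the nearest-prototype evaluation mentioned above: once prototypes have been selected, each test point is assigned the label of its closest prototype. A minimal sketch follows; the Euclidean distance and the variable names are my assumptions, since the paper may instead measure proximity in the kernel-induced space.

import numpy as np

def nearest_prototype_predict(X_test, prototypes, proto_labels):
    # 1-NN classification restricted to the prototype set: each test point
    # receives the label of its nearest prototype under squared Euclidean distance.
    # proto_labels: np.ndarray of labels aligned with the rows of `prototypes`.
    d2 = (np.sum(X_test**2, axis=1)[:, None]
          + np.sum(prototypes**2, axis=1)[None, :]
          - 2.0 * X_test @ prototypes.T)
    return proto_labels[np.argmin(d2, axis=1)]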

Qualitative Assessment

The authors explore the compelling question of how to develop interpretable machine learning methods using prototypes and criticisms. The paper was well written and clear, even for a non-expert in the field like myself. The mathematical results appear to be sound. It is hard for me to assess the originality of the work in the field of machine learning, but I imagine that there is work on training with both positive and negative examples. At the very least, within the human category learning literature, the issue of learning a concept through examples and non-examples of the concept has been explored. Given that the authors are interested in human-interpretable machine learning methods, they should investigate this work in psychology. Relatedly, the methods they compare MMD-critic to do not produce criticisms (negative examples), which makes a direct comparison between the algorithms difficult. This is also the case for the human experiments: participants may have done well if they were simply given randomly chosen criticisms rather than those chosen by MMD-critic. In short, the authors have certainly shown that computer and human performance is better given prototypes and criticisms rather than prototypes alone. It would have been interesting to see how MMD-critic performs against other algorithms that also provide both types of data.

Smaller points:
- Eq. 4 seems to be missing "k(".
- In Figure 1, does the value on the x axis include both prototypes and criticisms for the MMD-critic model?

Confidence in this Review

1-Less confident (might not have understood significant parts)


Reviewer 6

Summary

The Maximum Mean Discrepancy (MMD) is a measure of the difference between two probability distributions. Rather than use this measure for pairwise statistical testing of two models or for comparing a model to a given input, the authors propose a scheme for using MMD to select subsets of a dataset that in some sense best summarize or explain the full dataset, as well as 'criticism' examples that are not well explained by the prototypes. They demonstrate the utility of the method both by using it to produce a nearest-prototype classifier and comparing performance against other baseline models, and by appealing to a pilot human subject study in which people were asked to perform classification tasks with and without prototypes and criticism examples. In the former case, the authors' method performs as well as or better than benchmark methods at classification; in the latter, people found classification significantly easier (though slower) with both prototype and criticism examples.
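To illustrate the criticism side described above, one can evaluate the MMD witness function at every data point: it is large in magnitude where the empirical data distribution and the prototype distribution disagree, so the highest-scoring points are natural 'criticism' candidates. The sketch below is my paraphrase of that idea; the paper's actual procedure is a greedy selection with a regularizer, which is omitted here, and the names are illustrative.

import numpy as np

def witness_scores(K, proto_idx):
    # K: (n, n) kernel matrix over the dataset; proto_idx: indices of the prototypes.
    # Witness function w(x) = mean_i k(x, x_i) - mean_j k(x, s_j); points the
    # prototypes represent poorly have a large |w(x)|.
    data_term = K.mean(axis=1)
    proto_term = K[:, proto_idx].mean(axis=1)
    return np.abs(data_term - proto_term)

# Example use: take the c highest-scoring points as criticisms.
# crit_idx = np.argsort(-witness_scores(K, proto_idx))[:c]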

Qualitative Assessment

Technical quality: The authors provide proofs of the consistency and computational tractability of the proposed method in a wide range of practically salient conditions. The tests described are relevant and appropriate for assessing performance.

Novelty/originality: So far as I am aware, this is the first time that systematic use of 'criticism' examples has been proposed.

Potential impact or usefulness: There are many contexts in which interpretability of classification models is important. In these cases, the method is likely to be adopted.

Clarity and presentation: The paper is well written and clearly organized.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)