{"title": "Examples are not enough, learn to criticize! Criticism for Interpretability", "book": "Advances in Neural Information Processing Systems", "page_first": 2280, "page_last": 2288, "abstract": "Example-based explanations are widely used in the effort to improve the interpretability of highly complex distributions. However, prototypes alone are rarely sufficient to represent the gist of the complexity. In order for users to construct better mental models and understand complex data distributions, we also need {\\em criticism} to explain what are \\textit{not} captured by prototypes. Motivated by the Bayesian model criticism framework, we develop \\texttt{MMD-critic} which efficiently learns prototypes and criticism, designed to aid human interpretability. A human subject pilot study shows that the \\texttt{MMD-critic} selects prototypes and criticism that are useful to facilitate human understanding and reasoning. We also evaluate the prototypes selected by \\texttt{MMD-critic} via a nearest prototype classifier, showing competitive performance compared to baselines.", "full_text": "Examples are not Enough, Learn to Criticize!\n\nCriticism for Interpretability\n\nBeen Kim\u21e4\n\nAllen Institute for AI\n\nbeenkim@csail.mit.edu\n\nRajiv Khanna\n\nUT Austin\n\nrajivak@utexas.edu\n\nOluwasanmi Koyejo\n\nUIUC\n\nsanmi@illinois.edu\n\nAbstract\n\nExample-based explanations are widely used in the effort to improve the inter-\npretability of highly complex distributions. However, prototypes alone are rarely\nsuf\ufb01cient to represent the gist of the complexity. In order for users to construct\nbetter mental models and understand complex data distributions, we also need\ncriticism to explain what are not captured by prototypes. Motivated by the Bayesian\nmodel criticism framework, we develop MMD-critic which ef\ufb01ciently learns pro-\ntotypes and criticism, designed to aid human interpretability. 
A human subject pilot\nstudy shows that the MMD-critic selects prototypes and criticism that are useful\nto facilitate human understanding and reasoning. We also evaluate the prototypes\nselected by MMD-critic via a nearest prototype classifier, showing competitive\nperformance compared to baselines.\n\n1 Introduction and Related Work\n\nAs machine learning (ML) methods have become ubiquitous in human decision making, their\ntransparency and interpretability have grown in importance (Varshney, 2016). Interpretability is\nparticularly important in domains where decisions can have significant consequences. For example,\nthe pneumonia risk prediction case study in Caruana et al. (2015) showed that a more interpretable\nmodel could reveal important but surprising patterns in the data that complex models overlooked.\nStudies of human reasoning have shown that the use of examples (prototypes) is fundamental to the\ndevelopment of effective strategies for tactical decision-making (Newell and Simon, 1972; Cohen\net al., 1996). Example-based explanations are widely used in the effort to improve interpretability.\nA popular research program along these lines is case-based reasoning (CBR) (Aamodt and Plaza,\n1994), which has been successfully applied to real-world problems (Bichindaritz and Marling, 2006).\nMore recently, the Bayesian framework has been combined with CBR-based approaches in the\nunsupervised-learning setting, leading to improvements in user interpretability (Kim et al., 2014). In\na supervised learning setting, example-based classifiers have been shown to achieve comparable\nperformance to non-interpretable methods, while offering a condensed view of a dataset (Bien and\nTibshirani, 2011).\nHowever, examples are not enough. Relying only on examples to explain the models\u2019 behavior\ncan lead to over-generalization and misunderstanding. 
Examples alone may be sufficient when the\ndistribution of data points is \u2018clean\u2019 \u2013 in the sense that there exists a set of prototypical examples\nwhich sufficiently represent the data. However, this is rarely the case in real world data. For instance,\nfitting models to complex datasets often requires the use of regularization. While the regularization\nadds bias to the model to improve generalization performance, this same bias may conflict with the\ndistribution of the data. Thus, to maintain interpretability, it is important, along with prototypical\nexamples, to deliver insights signifying the parts of the input space where prototypical examples\ndo not provide good explanations. We call the data points that do not quite fit the model criticism\nsamples. Together with prototypes, criticism can help humans build a better mental model of the\ncomplex data space.\n\n\u21e4All authors contributed equally.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nBayesian model criticism (BMC) is a framework for evaluating fitted Bayesian models, and was\ndeveloped to aid model development and selection by helping to identify where and how a particular\nmodel may fail to explain the data. It has quickly developed into an important part of model design,\nand Bayesian statisticians now view model criticism as an important component in the cycle of\nmodel construction, inference and criticism (Gelman et al., 2014). Lloyd and Ghahramani (2015)\nrecently proposed an exploratory approach for statistical model criticism using the maximum mean\ndiscrepancy (MMD) two sample test, and explored the use of the witness function to identify the\nportions of the input space where the model most misrepresents the data. 
Instead of using the MMD to\ncompare two models as in classic two sample testing (Gretton et al., 2008), or to compare the model\nto input data as in the Bayesian model criticism of Lloyd and Ghahramani (2015), we consider a novel\napplication of the MMD and its associated witness function as a principled approach for selecting\nprototype and criticism samples.\nWe present the MMD-critic, a scalable framework for prototype and criticism selection to improve\nthe interpretability of machine learning methods. To the best of our knowledge, ours is the first work which\nleverages the BMC framework to generate explanations for machine learning methods. MMD-critic\nuses the MMD statistic as a measure of similarity between points and potential prototypes, and\nefficiently selects prototypes that maximize the statistic. In addition to prototypes, MMD-critic\nselects criticism samples, i.e. samples that are not well-explained by the prototypes, using a regularized\nwitness function score. The scalability follows from our analysis, where we show that under certain\nconditions, the MMD for prototype selection is a supermodular set function. Our supermodularity\nproof is general and may be of independent interest. While we are primarily concerned with prototype\nselection and criticism, we quantitatively evaluate the performance of MMD-critic as a nearest\nprototype classifier, and show that it achieves comparable performance to existing methods. We also\npresent results from a human subject pilot study which shows that including the criticism together\nwith prototypes is helpful for an end-task that requires the data-distributions to be well-explained.\n\n2 Preliminaries\n\nThis section includes notation and a few important definitions. Vectors are denoted by lower case\n$x$ and matrices by capital $X$. The Euclidean inner product between matrices $A$ and $B$ is given by\n$\\langle A, B \\rangle = \\sum_{i,j} a_{i,j} b_{i,j}$. Let $\\det(X)$ denote the determinant of $X$. Sets are denoted by sans serif e.g.\nS. 
The reals are denoted by $\\mathbb{R}$. $[n]$ denotes the set of integers $\\{1, \\ldots, n\\}$, and $2^{V}$ denotes the power\nset of $V$. The indicator function $\\mathbf{1}_{[a]}$ takes the value of 1 if its argument $a$ is true and is 0 otherwise.\nWe denote probability distributions by either $P$ or $Q$. The notation $|\\cdot|$ will denote cardinality when\napplied to sets, or absolute value when applied to real values.\n\n2.1 Maximum Mean Discrepancy (MMD)\nThe maximum mean discrepancy (MMD) is a measure of the difference between distributions $P$\nand $Q$, given by the supremum over a function space $\\mathcal{F}$ of differences between the expectations with\nrespect to two distributions. The MMD is given by:\n\n$$\\mathrm{MMD}(\\mathcal{F}, P, Q) = \\sup_{f \\in \\mathcal{F}} \\left( \\mathbb{E}_{X \\sim P}[f(X)] - \\mathbb{E}_{Y \\sim Q}[f(Y)] \\right). \\tag{1}$$\n\nWhen $\\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS) with kernel function $k : \\mathcal{X} \\times \\mathcal{X} \\mapsto \\mathbb{R}$, the\nsupremum is achieved at (Gretton et al., 2008):\n\n$$f(x) = \\mathbb{E}_{X' \\sim P}[k(x, X')] - \\mathbb{E}_{X' \\sim Q}[k(x, X')]. \\tag{2}$$\n\nThe function (2) is also known as the witness function as it measures the maximum discrepancy\nbetween the two expectations in $\\mathcal{F}$. Observe that the witness function is positive whenever $Q$ underfits\nthe density of $P$, and negative wherever $Q$ overfits $P$. We can substitute (2) into (1) and square the\nresult, leading to:\n\n$$\\mathrm{MMD}^2(\\mathcal{F}, P, Q) = \\mathbb{E}_{X, X' \\sim P}[k(X, X')] - 2\\,\\mathbb{E}_{X \\sim P, Y \\sim Q}[k(X, Y)] + \\mathbb{E}_{Y, Y' \\sim Q}[k(Y, Y')]. \\tag{3}$$\n\nIt is clear that $\\mathrm{MMD}^2(\\mathcal{F}, P, Q) \\ge 0$ and $\\mathrm{MMD}^2(\\mathcal{F}, P, Q) = 0$ iff $P$ is indistinguishable from\n$Q$ on the RKHS $\\mathcal{F}$. 
This population definition can be approximated using sample expectations.\nIn particular, given $n$ samples from $P$ as $X = \\{x_i \\sim P, i \\in [n]\\}$, and $m$ samples from $Q$ as\n$Z = \\{z_i \\sim Q, i \\in [m]\\}$, the following is a finite sample approximation:\n\n$$\\mathrm{MMD}^2_b(\\mathcal{F}, X, Z) = \\frac{1}{n^2} \\sum_{i,j \\in [n]} k(x_i, x_j) - \\frac{2}{nm} \\sum_{i \\in [n], j \\in [m]} k(x_i, z_j) + \\frac{1}{m^2} \\sum_{i,j \\in [m]} k(z_i, z_j), \\tag{4}$$\n\nand the witness function is approximated as:\n\n$$f(x) = \\frac{1}{n} \\sum_{i \\in [n]} k(x, x_i) - \\frac{1}{m} \\sum_{j \\in [m]} k(x, z_j). \\tag{5}$$\n\n3 MMD-critic for Prototype Selection and Criticism\nGiven $n$ samples from a statistical model $X = \\{x_i, i \\in [n]\\}$, let $S \\subseteq [n]$ represent a subset of the\nindices, so that $X_S = \\{x_i \\; \\forall i \\in S\\}$. Given an RKHS with the kernel function $k(\\cdot, \\cdot)$, we can measure the\nmaximum mean discrepancy between the samples and any selected subset using $\\mathrm{MMD}^2(\\mathcal{F}, X, X_S)$.\nMMD-critic selects prototype indices $S$ which minimize $\\mathrm{MMD}^2(\\mathcal{F}, X, X_S)$. For our purposes, it\nwill be convenient to pose the problem as a normalized discrete maximization. To this end, consider\nthe following cost function, given by the negation of $\\mathrm{MMD}^2(\\mathcal{F}, X, X_S)$ with an additive bias:\n\n$$J_b(S) = \\frac{1}{n^2} \\sum_{i,j=1}^{n} k(x_i, x_j) - \\mathrm{MMD}^2(\\mathcal{F}, X, X_S) = \\frac{2}{n|S|} \\sum_{i \\in [n], j \\in S} k(x_i, y_j) - \\frac{1}{|S|^2} \\sum_{i,j \\in S} k(y_i, x_j). \\tag{6}$$\n\nNote that the additive bias $\\mathrm{MMD}^2(\\mathcal{F}, X, \\emptyset) = \\frac{1}{n^2} \\sum_{i,j=1}^{n} k(x_i, x_j)$ is a constant with respect to $S$.\nFurther, $J_b(S)$ is normalized, since, when evaluated on the empty set, we have that:\n\n$$J_b(\\emptyset) = \\min_{S \\in 2^{[n]}} J_b(S) = \\frac{1}{n^2} \\sum_{i,j=1}^{n} k(x_i, x_j) - \\frac{1}{n^2} \\sum_{i,j=1}^{n} k(x_i, x_j) = 0.$$\n\nMMD-critic selects $m_*$ prototypes as the subset of indices $S \\subseteq [n]$ which optimize:\n\n$$\\max_{S \\in 2^{[n]}, |S| \\le m_*} J_b(S). \\tag{7}$$\n\nFor the purposes of optimizing the cost function (6), it will prove useful to exploit its linearity with\nrespect to the kernel entries. The following Lemma is easily shown by enumeration.\nLemma 1. 
Let $J_b(\\cdot)$ be defined as in (6), then $J_b(\\cdot)$ is a linear function of $k(x_i, x_j)$. In particular,\ndefine $K \\in \\mathbb{R}^{n \\times n}$, with $k_{i,j} = k(x_i, x_j)$, and $A(S) \\in \\mathbb{R}^{n \\times n}$ with entries $a_{i,j}(S) = \\frac{2}{n|S|} \\mathbf{1}_{[j \\in S]} - \\frac{1}{|S|^2} \\mathbf{1}_{[i \\in S]} \\mathbf{1}_{[j \\in S]}$, then: $J_b(S) = \\langle A(S), K \\rangle$.\n\n3.1 Submodularity and Efficient Prototype Selection\n\nWhile the discrete optimization problem (6) may be quite complicated to optimize, we show that the\ncost function $J_b(S)$ is monotone submodular under conditions on the kernel matrix which are often\nsatisfied in practice, and which can be easily checked given a kernel matrix. Based on this result, we\ndescribe the greedy forward selection algorithm for efficient prototype selection.\nLet $F : 2^{[n]} \\mapsto \\mathbb{R}$ represent a set function. $F$ is normalized if $F(\\emptyset) = 0$. $F$ is monotonic, if for all\nsubsets $U \\subset V \\subseteq [n]$ it holds that $F(U) \\le F(V)$. $F$ is submodular, if for all subsets $U, V \\in 2^{[n]}$ it\nholds that $F(U \\cup V) + F(U \\cap V) \\le F(U) + F(V)$. Submodular functions have a diminishing returns\nproperty (Nemhauser et al., 1978) i.e. the marginal gain of adding elements decreases with the size\nof the set. When $F$ is submodular, $-F$ is supermodular (and vice versa).\n\nWe prove submodularity for a larger class of problems, then show submodularity of (6) as a special\ncase. Our proof for the larger class may be of independent interest. In particular, the following\nTheorem considers general discrete optimization problems which are linear matrix functionals, and\nshows sufficient conditions on the matrix for the problem to be monotone and/or submodular.\nTheorem 2 (Monotone Submodularity for Linear Forms). Let $H \\in \\mathbb{R}^{n \\times n}$ (not necessarily symmetric)\nbe element-wise non-negative and bounded, with upper bound $h_* = \\max_{i,j \\in [n]} h_{i,j} > 0$. 
Further,\nconstruct the binary matrix representation of the indices that achieve the maximum as $E \\in [0, 1]^{n \\times n}$\nwith $e_{i,j} = 1$ if $h_{i,j} = h_*$ and $e_{i,j} = 0$ otherwise, and its complement $E' = 1 - E$ with the\ncorresponding set $\\mathsf{E}' = \\{(i, j) \\text{ s.t. } e_{i,j} = 0\\}$. Given the ground set $\\mathcal{S} \\subseteq 2^{[n]}$ consider the linear form:\n$F(H, S) = \\langle A(S), H \\rangle \\; \\forall S \\in \\mathcal{S}$. Given $m = |S|$, define the functions:\n\n$$\\alpha(n, m) = \\frac{a(S \\cup \\{u\\}) - a(S)}{b(S)}, \\qquad \\beta(n, m) = \\frac{a(S \\cup \\{u\\}) + a(S \\cup \\{v\\}) - a(S \\cup \\{u, v\\}) - a(S)}{b(S \\cup \\{u, v\\}) + d(S)}, \\tag{8}$$\n\nwhere $a(S) = F(E, S)$, $b(S) = F(E', S)$ for all $u, v \\in \\mathcal{S}$ (additional notation suppressed in $\\alpha(\\cdot)$\nand $\\beta(\\cdot)$ for clarity). Let $m_* = \\max_{S \\in \\mathcal{S}} |S|$ be the maximal cardinality of any element in the ground\nset.\n\n1. If $h_{i,j} \\le h_* \\alpha(n, m) \\; \\forall \\, 0 \\le m \\le m_*, \\; \\forall (i, j) \\in \\mathsf{E}'$, then $F(H, S)$ is monotone.\n2. If $h_{i,j} \\le h_* \\beta(n, m) \\; \\forall \\, 0 \\le m \\le m_*, \\; \\forall (i, j) \\in \\mathsf{E}'$, then $F(H, S)$ is submodular.\n\nFinally, we consider a special case of Theorem 2 for the MMD.\nCorollary 3 (Monotone Submodularity for MMD). Let the kernel matrix $K \\in \\mathbb{R}^{n \\times n}$ be element-wise\nnon-negative, with equal diagonal terms $k_{i,i} = k_* > 0 \\; \\forall i \\in [n]$, and be diagonally dominant. If the\noff-diagonal terms $k_{i,j} \\; \\forall i, j \\in [n], i \\ne j$ satisfy $0 \\le k_{i,j} \\le \\frac{k_*}{n^3 + 2n^2 - 2n - 3}$, then $J_b(S)$ given by (6)\nis monotone submodular.\nThe diagonal dominance condition expressed by Corollary 3 is easy to check given a kernel matrix.\nWe also note that the conditions can be significantly weakened if one determines the required number\nof prototypes $m_* = \\max |S| \\le n$ a priori. This is further simplified for the MMD since the bounds\n(8) are both monotonically decreasing functions of $m$, so the condition need only be checked for\n$m_*$. Observe that diagonal dominance is not a necessary condition, as the more general approach in\nTheorem 2 allows arbitrarily indexed maximal entries in the kernel. 
Diagonal dominance is assumed\nto simplify the resulting expressions.\nPerhaps more important in practice is our observation that the diagonal dominance condition\nexpressed by Corollary 3 is satisfied by parametrized kernels with appropriately selected parameters.\nWe provide an example for radial basis function (RBF) kernels and powers of positive standardized\nkernels. Further examples and more general conditions are left for future work.\nExample 4 (Radial basis function Kernel). Consider the radial basis function kernel $K$ with entries\n$k_{i,j} = k(x_i, x_j) = \\exp(-\\gamma \\|x_i - x_j\\|)$ evaluated on a sample $X$ with non-duplicate points i.e.\n$x_i \\ne x_j \\; \\forall x_i, x_j \\in X$. The off-diagonal kernel entries $k_{i,j}$, $i \\ne j$, monotonically decrease with respect\nto increasing $\\gamma$. Thus, $\\exists \\, \\gamma_*$ such that Corollary 3 is satisfied for $\\gamma \\ge \\gamma_*$.\nExample 5 (Powers of Positive Standardized Kernels). Consider an element-wise positive kernel\nmatrix $G$ standardized to be element-wise bounded $0 \\le g_{i,j} < 1$ with unitary diagonal $g_{i,i} = 1 \\; \\forall i \\in [n]$. Define the kernel power $K$ with $k_{i,j} = g_{i,j}^{p}$. The off-diagonal kernel entries $k_{i,j}$, $i \\ne j$,\nmonotonically decrease with respect to increasing $p$. Thus, $\\exists \\, p_*$ such that Corollary 3 is satisfied for\n$p \\ge p_*$.\nBeyond the examples outlined here, similar conditions can be enumerated for a wide range of\nparametrized kernel functions, and are easily checked for model-based kernels e.g. the Fisher kernel\n(Jaakkola et al., 1999) \u2013 useful for comparing data points based on similarity with respect to a\nprobabilistic model. Our interpretation of these examples is that the conditions of Corollary 3\nare not excessively restrictive. While constrained maximization of submodular functions is generally\nNP-hard, the simple greedy forward selection heuristic has been shown to perform almost as well as\nthe optimal in practice, and is known to have strong theoretical guarantees.\nTheorem 6 (Nemhauser et al. (1978)). 
In the case of any normalized, monotonic submodular function\n$F$, the set $S_*$ obtained by the greedy algorithm achieves at least a constant fraction $\\left(1 - \\frac{1}{e}\\right)$ of the\nobjective value obtained by the optimal solution i.e. $F(S_*) \\ge \\left(1 - \\frac{1}{e}\\right) \\max_{|S| \\le m} F(S)$.\n\nIn addition, no polynomial time algorithm can provide a better approximation guarantee unless P\n= NP (Feige, 1998). An additional benefit of the greedy approach is that it does not require the\ndecision of the number of prototypes $m_*$ to be made at training time, so assuming the kernel satisfies\nappropriate conditions, training can be stopped at any $m_*$ based on computational constraints, while\nstill returning meaningful results. The greedy algorithm is outlined in Algorithm 1.\n\nAlgorithm 1 Greedy algorithm, $\\max F(S)$ s.t. $|S| \\le m_*$\n\nInput: $m_*$, $S = \\emptyset$\nwhile $|S| < m_*$ do\n  foreach $i \\in [n] \\setminus S$, $f_i = F(S \\cup i) - F(S)$\n  $S = S \\cup \\{\\arg\\max f_i\\}$\nend while\nReturn: $S$.\n\n3.2 Model Criticism\n\nIn addition to selecting prototype samples, MMD-critic characterizes the data points not well\nexplained by the prototypes \u2013 which we call the model criticism. These data points are selected as\nthe largest values of the witness function (5) i.e. where the similarity between the dataset and the\nprototypes deviates the most. Consider the cost function:\n\n$$L(C) = \\sum_{l \\in C} \\left| \\frac{1}{n} \\sum_{i \\in [n]} k(x_i, x_l) - \\frac{1}{m} \\sum_{j \\in S} k(x_j, x_l) \\right|. \\tag{9}$$\n\nThe absolute value ensures that we measure both positive deviations $f(x) > 0$ where the prototypes\nunderfit the density of the samples, and negative deviations $f(x) < 0$, where the prototypes overfit\nthe density of the samples. Thus, we focus primarily on the magnitude of deviation, rather than its\nsign. The following theorem shows that (9) is a linear function of $C$.\nTheorem 7. 
The criticism function $L(C)$ is a linear function of $C$.\n\nWe found that the addition of a regularizer which encourages a diverse selection of criticism points\nimproved performance. Let $r : 2^{[n]} \\mapsto \\mathbb{R}$ represent a regularization function. We select the criticism\npoints as the maximizers of this cost function:\n\n$$\\max_{C \\subseteq [n] \\setminus S, |C| \\le c_*} L(C) + r(K, C) \\tag{10}$$\n\nwhere $[n] \\setminus S$ denotes all indexes which do not include the prototypes, and $c_*$ is the number of criticism\npoints desired. Fortunately, due to the linearity of (5), the optimization function (10) is submodular\nwhen the regularization function is submodular. We encourage the use of regularizers which incorporate\ndiversity into the criticism selection. We found the best qualitative performance using the\nlog-determinant regularizer (Krause et al., 2008). Let $K_{C,C}$ be the sub-matrix of $K$ corresponding to\nthe pair of indexes in $C \\times C$, then the log-determinant regularizer is given by:\n\n$$r(K, C) = \\log \\det K_{C,C} \\tag{11}$$\n\nwhich is known to be submodular. Further, several researchers have found, both in theory and practice\n(Sharma et al., 2015), that greedy optimization is an effective strategy for optimization. We apply the\ngreedy algorithm for criticism selection with the function $F(C) = L(C) + r(K, C)$.\n\n4 Related Work\n\nThere is a large literature on techniques for selecting prototypes that summarize a dataset, and a full\nliterature survey is beyond the scope of this manuscript. Instead, we overview a few of the most\nrelevant references. The K-medoid clustering (Kaufman and Rousseeuw, 1987) is a classic technique\nfor selecting a representative subset of data points, and can be solved using various iterative algorithms.\nK-medoid clustering is quite similar to K-means clustering, with the additional condition that the\npresented prototypes must be in the dataset. 
The ubiquity of large datasets has led to a resurgence\nof interest in the data summarization problem, also known as the set cover problem. Progress has\nincluded novel cost functions and algorithms for several domains including image summarization\n(Simon et al., 2007) and document summarization (Lin and Bilmes, 2011). Recent innovations also\ninclude highly scalable and distributed algorithms (Badanidiyuru et al., 2014; Mirzasoleiman et al.,\n2015). There is also a large literature on variations of the set cover problem tuned for classification,\nsuch as the cover digraph approach of Priebe et al. (2003) and prototype selection for interpretable\nclassification (Bien and Tibshirani, 2011), which involves selecting prototypes that maximize the\ncoverage within the class, but minimize the coverage across classes.\nSubmodular / Supermodular functions are well studied in the combinatorial optimization literature,\nwith several scalable algorithms that come with optimization theoretic optimality guarantees\n(Nemhauser et al., 1978). In the Bayesian modeling literature, submodular optimization has\npreviously been applied for approximate inference by Koyejo et al. (2014). The technical conditions\nrequired for submodularity of (6) are due to averaging of the kernel similarity scores \u2013 as the average\nrequires a division by the cardinality $|S|$. In particular, the analogue of (6) which replaces all the averages\nby sums (i.e. removes all division by $|S|$) is equivalent to the well known submodular functions\npreviously used for scene (Simon et al., 2007) and document (Lin and Bilmes, 2011) summarization,\ngiven by: $\\frac{2}{n} \\sum_{i \\in [n], j \\in S} k(x_i, y_j) + \\lambda \\sum_{i,j \\in S} k(y_i, x_j)$, where $\\lambda > 0$ is a regularization parameter.\nThe function that results is known to be submodular when the kernel is element-wise positive i.e.\nwithout the need for additional diagonal dominance conditions. On the other hand, the averaging\nhas a desirable built-in balancing effect. 
When using the sum, practitioners must tune the additional\nregularization parameter to achieve a similar balance.\n\n5 Results\n\nWe present results for the proposed technique MMD-critic using USPS hand written digits (Hull,\n1994) and Imagenet (Deng et al., 2009) datasets. We quantitatively evaluate the prototypes in terms\nof predictive quality as compared to related baselines on USPS hand written digits dataset. We also\npresent preliminary results from a human subject pilot study. Our results suggest that the model\ncriticism \u2013 which is unique to the proposed MMD-critic \u2013 is especially useful to facilitate human\nunderstanding. For all datasets, we employed the radial basis function (RBF) kernel with entries\n$k_{i,j} = k(x_i, x_j) = \\exp(-\\gamma \\|x_i - x_j\\|)$, which satisfies the conditions of Corollary 3 for sufficiently\nlarge $\\gamma$ (c.f. Example 4, see Example 5 and following discussion for alternative feasible kernels).\n\nThe Nearest Prototype Classifier: While our primary interest is in interpretable prototype selection\nand criticism, prototypes may also be useful for speeding up memory-based machine learning\ntechniques such as the nearest neighbor classifier by restricting the neighbor search to the prototypes,\nsometimes known as the nearest prototype classifier (Bien and Tibshirani, 2011; Kuncheva and\nBezdek, 1998). This classification provides an objective (although indirect) evaluation of the quality\nof the selected prototypes, and is useful for setting hyperparameters. We employ a 1-nearest neighbor\nclassifier using the Hilbert space distance induced by the kernels. Let $y_i \\in [k]$ denote the label\nassociated with each prototype $i \\in S$, for $k$ classes. As we employ normalized kernels (where the\ndiagonal is 1), it is sufficient to measure the pairwise kernel similarity. 
Thus, for a test point $\\hat{x}$, the\nnearest neighbor classifier reduces to:\n\n$$\\hat{y} = y_{i_*}, \\quad \\text{where } i_* = \\arg\\min_{i \\in S} \\|\\hat{x} - x_i\\|^2_{\\mathcal{H}_K} = \\arg\\max_{i \\in S} k(\\hat{x}, x_i).$$\n\n5.1 MMD-critic evaluated on USPS Digits Dataset\nThe USPS hand written digits dataset (Hull, 1994) consists of $n = 7291$ training (and 2007 test)\ngreyscale images of 10 handwritten digits from 0 to 9. We consider two kinds of RBF kernels\n(i) global: where the pairwise kernel is computed between all data points, and (ii) local: given\nby $\\exp(-\\gamma \\|x_i - x_j\\|) \\mathbf{1}_{[y_i = y_j]}$, i.e. points in different classes are assigned a similarity score of\nzero. The local approach has the effect of pushing points in different classes further apart. The\nkernel hyperparameter $\\gamma$ was chosen to maximize the average cross-validated classification\nperformance, then fixed for all other experiments.\n\nFigure 1: Classification error vs. number of prototypes $m = |S|$. MMD-critic shows comparable\n(or improved) performance as compared to other models (left). Random subset of prototypes and\ncriticism from the USPS dataset (right).\n\nClassification: We evaluated nearest prototype classifiers using MMD-critic, and compared to\nbaselines (and reported performance) from Bien and Tibshirani (2011) (abbreviated as PS) and their\nimplementation of K-medoids. Figure 1(left) compares MMD-critic with global and local kernels,\nto the baselines for different numbers of selected prototypes $m = |S|$. Our results show comparable\n(or improved) performance as compared to other models. In particular, we observe that the global\nkernels out-perform the local kernels (see footnote 2) by a small margin. We note that MMD is particularly effective\nat selecting the first few prototypes (i.e. speed of error reduction as number of prototypes increases)\nsuggesting its utility for rapidly summarising the dataset.\nSelected Prototypes and Criticism: Fig. 
1 (right) presents a randomly selected subset of the\nprototypes and criticism from the MMD-critic using the local kernel. We observe that the prototypes\ncapture many of the common ways of writing digits, while the criticisms clearly capture outliers.\n\n5.2 Qualitative Measure: Prototypes and Criticisms of Images\n\nIn this section, we learn prototypes and criticisms from the Imagenet dataset (Russakovsky et al.,\n2015) using image embeddings from He et al. (2015). Each image is represented by a 2048-dimensional\nvector embedding, and each image belongs to one of 1000 categories. We select two breeds of one\ncategory (e.g., Blenheim spaniel) and run MMD-critic to learn prototypes and criticism. As shown\nin Figure 2, MMD-critic learns reasonable prototypes and criticisms for two types of dog breeds. On\nthe left, criticisms picked out the different coloring (the second criticism is a black and white picture),\nas well as pictures capturing movements of dogs (first and third criticisms). Similarly, on the right,\ncriticisms capture the unusual, but potentially frequent pictures of dogs in costumes (first and second\ncriticisms).\n\n5.3 Quantitative measure: Prototypes and Criticisms improve interpretability\n\nWe conducted a human pilot study to collect objective and subjective measures of interpretability\nusing MMD-critic. The experiment used the same dataset as Section 5.2. We define \u2018interpretability\u2019\nin this work as the following: a method is interpretable if a user can correctly and efficiently predict\nthe method\u2019s results. Under this definition, we designed a predictive task to quantitatively evaluate\nthe interpretability. Given a randomly sampled data point, we measure how well a human can predict\na group it belongs to (accuracy), and how fast they can perform the task (efficiency). 
We chose this\ndataset as the task of assigning a new image to a group requires groups to be well-explained but does\nnot require specialized training.\nWe presented four conditions in the experiment: 1) raw images condition (Raw Condition), 2)\nPrototypes Only (Proto Only Condition), 3) Prototypes and criticisms (Proto and Criticism Condition), and\n4) Uniformly sampled data points per group (Uniform Condition). Raw Condition contained 100\nimages per species (e.g., if a group contains 2 species, there are 200 images). Proto Only Condition,\nProto and Criticism Condition and Uniform Condition contain the same number of images.\n\n2 Note that the local kernel trivially achieves perfect accuracy. Thus, in order to measure generalization\nperformance, we do not use class labels for local kernel test instances i.e. we use the global kernel instead of\nlocal kernel for test instances \u2013 regardless of training.\n\n[Figure 1 (left) plot: test error vs. number of prototypes; legend: MMD-global, MMD-local, PS, K-medoids]\n\nFigure 2: Learned prototypes and criticisms from Imagenet dataset (two types of dog breeds)\n\nWe used within-subject design to minimize the effect of inter-participant variability, with a balanced\nLatin square to account for a potential learning effect. The four conditions were assigned to four\nparticipants (four males) in a balanced manner. Each subject answered 21 questions, where the first\nthree questions are practice questions and not included in the analysis. Each question showed six\ngroups (e.g., red fox, kit fox) of a species (e.g., fox), and a randomly sampled data point that belongs\nto one of the groups. Subjects were encouraged to answer the questions as quickly and accurately\nas possible. A break was imposed after each question to mitigate the potential effect of fatigue. We\nmeasured the accuracy of answers as well as the time they took to answer each question. 
Participants\nwere also asked to respond to 10 5-point Likert scale survey questions about their subjective measure\nof accuracy and ef\ufb01ciency. Each survey question compared a pair of conditions (e.g., Condition A\nwas more helpful than condition B to correctly (or ef\ufb01ciently) assign the image to a group).\nSubjects performed the best using Proto and Criticism Condition (M=87.5%, SD=20%). The\nperformance with Proto Only Condition was relatively similar (M=75%, SD=41%), while that with\nUniform Condition (M=55%, SD=38%, 37% decrease) and Raw Condition (M=56%, SD=33%, 36%\ndecrease) was substantially lower. In terms of speed, subjects were most ef\ufb01cient using Proto Only\nCondition (M=1.04 mins/question, SD=0.28, 44% decrease compared to Raw Condition), followed\nby Uniform Condition (M=1.31 mins/question, SD=0.59) and Proto and Criticism Condition (M=1.37\nmins/question, SD=0.8). Subjects spent the most time with Raw Condition (M=1.86 mins/question,\nSD=0.67).\nSubjects indicated their preference of Proto and Criticism Condition over Raw Condition and\nUniform Condition. In a survey question that asks to compare Proto and Criticism Condition and\nRaw Condition, a subject added that \u201c[Proto and Criticism Condition resulted in] less confusion\nfrom trying to discover hidden patterns in a ton of images, more clues indicating what features are\nimportant\". In particular, in a question that asks to compare Proto and Criticism Condition and\nProto Only Condition, a subject said that \u201cThe addition of criticisms made it easier to locate the\nde\ufb01ning features of the cluster within the prototypical images\". 
The humans\u2019 superior performance\nwith prototypes and criticism in this preliminary study shows that providing criticisms together with\nprototypes is a promising direction to improve the interpretability.\n\n6 Conclusion\n\nWe present the MMD-critic, a scalable framework for prototype and criticism selection to improve\nthe interpretability of complex data distributions. To the best of our knowledge, ours is the first work which\nleverages the BMC framework to generate explanations. Further, MMD-critic shows competitive\nperformance as a nearest prototype classifier compared to existing methods. When criticism is\ngiven together with prototypes, a human pilot study suggests that humans are better able to perform a\npredictive task that requires the data-distributions to be well-explained. This suggests that criticism\nand prototypes are a step towards improving interpretability of complex data distributions. For future\nwork, we hope to further explore the properties of MMD-critic such as the effect of the choice of\nkernel, and weaker conditions on the kernel matrix for submodularity. We plan to explore applications\nto larger datasets, aided by recent work on distributed algorithms for submodular optimization. We\nalso intend to complete a larger scale user study on how criticism and prototypes presented together\naffect human understanding.\n\nReferences\nA. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI communications, 1994.\n\nA. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In KDD. ACM, 2014.\n\nI. Bichindaritz and C. Marling. Case-based reasoning in the health sciences: What\u2019s next? AI in medicine, 2006.\n\nJ. Bien and R. Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, pages 2403\u20132424, 2011.\n\nR. Caruana, Y. 
Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.\n\nM.S. Cohen, J.T. Freeman, and S. Wolf. Metarecognition in time-stressed decision making: Recognizing, critiquing, and correcting. Human Factors, 1996.\n\nJ. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.\n\nU. Feige. A threshold of ln n for approximating set cover. JACM, 1998.\n\nA. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian data analysis. Taylor & Francis, 2014.\n\nA. Gretton, K.M. Borgwardt, M.J. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the two-sample problem. JMLR, 2008.\n\nK. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.\n\nJ.J. Hull. A database for handwritten text recognition research. TPAMI, 1994.\n\nT.S. Jaakkola, D. Haussler, et al. Exploiting generative models in discriminative classifiers. In NIPS, pages 487\u2013493, 1999.\n\nL. Kaufman and P. Rousseeuw. Clustering by means of medoids. North-Holland, 1987.\n\nB. Kim, C. Rudin, and J.A. Shah. The Bayesian Case Model: A generative approach for case-based reasoning and prototype classification. In NIPS, 2014.\n\nO.O. Koyejo, R. Khanna, J. Ghosh, and R. Poldrack. On prior distributions and approximate inference for structured variables. In NIPS, 2014.\n\nA. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008.\n\nL.I. Kuncheva and J.C. Bezdek. Nearest prototype classification: clustering, genetic algorithms, or random search? IEEE Transactions on Systems, Man, and Cybernetics, 28(1):160\u2013164, 1998.\n\nH. Lin and J. Bilmes. A class of submodular functions for document summarization. In ACL, 2011.\n\nJ.R. Lloyd and Z. Ghahramani. 
Statistical model criticism using kernel two sample tests. In NIPS, 2015.\n\nB. Mirzasoleiman, A. Karbasi, A. Badanidiyuru, and A. Krause. Distributed submodular cover: Succinctly summarizing massive data. In NIPS, 2015.\n\nG.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 1978.\n\nA. Newell and H.A. Simon. Human problem solving. Prentice-Hall Englewood Cliffs, 1972.\n\nC.E. Priebe, D.J. Marchette, J.G. DeVinney, and D.A. Socolinsky. Classification using class cover catch digraphs. Journal of classification, 2003.\n\nO. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.\n\nD. Sharma, A. Kapoor, and A. Deshpande. On greedy maximization of entropy. In ICML, 2015.\n\nI. Simon, N. Snavely, and S.M. Seitz. Scene summarization for online image collections. In ICCV, 2007.\n\nK.R. Varshney. Engineering safety in machine learning. arXiv:1601.04126, 2016.\n", "award": [], "sourceid": 1185, "authors": [{"given_name": "Been", "family_name": "Kim", "institution": "Allen Institute of Artificial Intelligence"}, {"given_name": "Rajiv", "family_name": "Khanna", "institution": "rajivak@utexas.edu"}, {"given_name": "Oluwasanmi", "family_name": "Koyejo", "institution": "UIUC"}]}