{"title": "On the Accuracy of Influence Functions for Measuring Group Effects", "book": "Advances in Neural Information Processing Systems", "page_first": 5254, "page_last": 5264, "abstract": "Influence functions estimate the effect of removing a training point on a model without the need to retrain. They are based on a first-order Taylor approximation that is guaranteed to be accurate for sufficiently small changes to the model, and so are commonly used to study the effect of individual points in large datasets. However, we often want to study the effects of large groups of training points, e.g., to diagnose batch effects or apportion credit between different data sources. Removing such large groups can result in significant changes to the model. Are influence functions still accurate in this setting? In this paper, we find that across many different types of groups and for a range of real-world datasets, the predicted effect (using influence functions) of a group correlates surprisingly well with its actual effect, even if the absolute and relative errors are large. Our theoretical analysis shows that such strong correlation arises only under certain settings and need not hold in general, indicating that real-world datasets have particular properties that allow the influence approximation to be accurate.", "full_text": "On the Accuracy of In\ufb02uence Functions\n\nfor Measuring Group Effects\n\nPang Wei Koh\u2217\n\nKai-Siang Ang\u2217\nDepartment of Computer Science\n\nHubert H. K. Teo\u2217\n\n{pangwei@cs, kaiang@, hteo@, pliang@cs}.stanford.edu\n\nStanford University\n\nPercy Liang\n\nAbstract\n\nIn\ufb02uence functions estimate the effect of removing a training point on a model\nwithout the need to retrain. 
They are based on a \ufb01rst-order Taylor approximation\nthat is guaranteed to be accurate for suf\ufb01ciently small changes to the model, and\nso are commonly used to study the effect of individual points in large datasets.\nHowever, we often want to study the effects of large groups of training points,\ne.g., to diagnose batch effects or apportion credit between different data sources.\nRemoving such large groups can result in signi\ufb01cant changes to the model. Are\nin\ufb02uence functions still accurate in this setting? In this paper, we \ufb01nd that across\nmany different types of groups and for a range of real-world datasets, the predicted\neffect (using in\ufb02uence functions) of a group correlates surprisingly well with its\nactual effect, even if the absolute and relative errors are large. Our theoretical anal-\nysis shows that such strong correlation arises only under certain settings and need\nnot hold in general, indicating that real-world datasets have particular properties\nthat allow the in\ufb02uence approximation to be accurate.\n\n1\n\nIntroduction\n\nIn\ufb02uence functions (Jaeckel, 1972; Hampel, 1974; Cook, 1977) estimate the effect of removing an\nindividual training point on a model\u2019s predictions without the computationally-prohibitive cost of\nretraining the model. Tracing a model\u2019s output back to its training data can be useful: in\ufb02uence\nfunctions have been recently applied to explain predictions (Koh and Liang, 2017), produce con\ufb01dence\nintervals (Schulam and Saria, 2019), investigate model bias (Brunet et al., 2018; Wang et al., 2019),\nimprove human trust (Zhou et al., 2019), and even craft data poisoning attacks (Koh et al., 2019).\nIn\ufb02uence functions are based on \ufb01rst-order Taylor approximations that are accurate for estimating\nsmall perturbations to the model, which makes them suitable for predicting the effects of removing\nindividual training points on the model. 
However, we often want to study the effects of removing groups of points, which represent large perturbations to the data. For example, we might wish to analyze the effect of data collected from different experimental batches (Leek et al., 2010) or demographic groups (Chen et al., 2018); apportion credit between crowdworkers, each of whom generated part of the data (Arrieta-Ibarra et al., 2018); or, in a multi-party learning setting, ensure that no individual user has too much influence on the joint model (Hayes and Ohrimenko, 2018). Are influence functions still accurate when predicting the effects of (removing) these larger groups?

In this paper, we first show empirically that on real datasets and across a broad variety of groups of data, the predicted and actual effects are strikingly correlated (Spearman ρ of 0.8 to 1.0), such that the groups with the largest actual effect also tend to have the largest predicted effect. Moreover, the predicted effect tends to underestimate the actual effect, suggesting that it could be an approximate lower bound in practice. Using influence functions to predict the actual effect of removing large, coherent groups of data can therefore still be useful, even though the violation of the small-perturbation assumption can result in high absolute and relative errors between the predicted and actual effects.

∗Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

What explains these phenomena of correlation and underestimation? Prior theoretical work focused on establishing the conditions under which this influence approximation is accurate, i.e., the error between the actual and predicted effects is small (Giordano et al., 2019b; Rad and Maleki, 2018). However, in our setting of removing large, coherent groups of data, this error can be quite large. 
As a first step towards understanding the behavior of the influence approximation in this regime, we characterize the relationship between the predicted and actual effects of a group via the one-step Newton approximation (Pregibon et al., 1981), which we find is a surprisingly accurate approximation in practice. We show that correlation and underestimation arise under certain settings (e.g., removing multiple copies of a single training point) but need not hold in general, which opens up the intriguing question of why we observe those phenomena across a wide range of empirical settings.

Finally, we exploit the correlation of predicted and actual group effects in two example case studies: a chemical-disease relationship (CDR) task, where the groups correspond to different labeling functions (Hancock et al., 2018), and a natural language inference (NLI) task (Williams et al., 2018), where the groups come from different crowdworkers. On the CDR task, we find that the influence of each labeling function correlates with its size (the number of examples it labels) but not with its average accuracy, which suggests that practitioners should focus on the coverage of the labeling functions they construct. In contrast, on the NLI task, we find that the influence of each crowdworker is uncorrelated with the number of examples they contribute, which suggests that practitioners should prioritize eliciting high-quality examples from crowdworkers over increasing quantity.

2 Background and problem setup

Consider learning a predictive model with parameters θ ∈ Θ that maps from an input space X to an output space Y. We are given n training points {(x_1, y_1), . . . , (x_n, y_n)} and a loss function ℓ(x, y, θ) that is twice-differentiable and convex in θ. 
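To make this setup concrete, the quantity being studied (the change in the model from removing a group, measured by explicit retraining) can be sketched in code. This is a minimal illustration on synthetic data with our own variable names; it is not one of the paper's datasets or experiments.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic binary classification data (illustrative only).
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.where(X @ rng.normal(size=d) + 0.5 * rng.normal(size=n) > 0, 1.0, -1.0)

def objective(theta, sample_weights):
    # Weighted logistic loss plus the L2 term (lam / 2) * ||theta||^2.
    margins = y * (X @ theta)
    return sample_weights @ np.log1p(np.exp(-margins)) + 0.5 * lam * theta @ theta

def fit(sample_weights):
    return minimize(objective, np.zeros(d), args=(sample_weights,), method="L-BFGS-B").x

theta_full = fit(np.ones(n))   # train with all points at weight 1
w = np.zeros(n)
w[:20] = 1.0                   # a group W of 20 points to remove
theta_minus = fit(1.0 - w)     # retrain with the group excluded

x_test = rng.normal(size=d)
# Actual effect of removing W on the test prediction f(theta) = theta^T x_test:
actual_effect = x_test @ theta_minus - x_test @ theta_full
```

Retraining like this is exactly what influence functions are designed to avoid; the sketch only pins down the quantity they approximate.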
To train the model, we select the model parameters

θ̂(1) = argmin_{θ ∈ Θ} [ Σ_{i=1}^n ℓ(x_i, y_i; θ) ] + (λ/2) ‖θ‖₂²   (1)

that minimize the L2-regularized empirical risk, where λ > 0 controls the regularization strength. The all-ones vector 1 in θ̂(1) denotes that the initial training points all have uniform sample weights.

Our goal is to measure the effects of different groups of training data on the model: if we removed a subset of training points W, how much would the model θ̂ change? Concretely, we define a vector w ∈ {0, 1}^n of sample weights with w_i = I((x_i, y_i) ∈ W) and consider the modified parameters

θ̂(1 − w) = argmin_{θ ∈ Θ} [ Σ_{i=1}^n (1 − w_i) ℓ(x_i, y_i; θ) ] + (λ/2) ‖θ‖₂²   (2)

corresponding to retraining the model after excluding W. We refer to w as the subset (corresponding to W); the number of removed points as ‖w‖₁; and the fraction of removed points as α = ‖w‖₁/n.

The actual effect I*_f : [0, 1]^n → R of the subset w is

I*_f(w) = f(θ̂(1 − w)) − f(θ̂(1)),   (3)

where the evaluation function f : Θ → R measures a quantity of interest. Specifically, we study:

• The change in test prediction, with f(θ) = θ⊤x_test. 
Linear models (for regression or binary classification) make predictions that are functions of θ⊤x_test, so this measures the effect that removing a subset will have on the model's prediction for some test point x_test.

• The change in test loss, with f(θ) = ℓ(x_test, y_test; θ), which is similar to the test prediction.

• The change in self-loss, with f(θ) = Σ_{i=1}^n w_i ℓ(x_i, y_i; θ), which measures the increase in loss on the removed points w. Its average over all subsets of size ‖w‖₁ is the estimated extra loss that leave-‖w‖₁-out cross-validation (CV) measures over the training loss.

2.1 Influence functions

The issue with computing the actual effect I*_f(w) is that retraining the model to compute θ̂(1 − w) for each subset w can be prohibitively expensive. Influence functions provide a relatively efficient first-order approximation to I*_f(w) that avoids retraining.

Consider the function q_w : [0, 1] → R with q_w(t) = f(θ̂(1 − tw)), such that the actual effect I*_f(w) can be written as q_w(1) − q_w(0). We define the predicted effect of the subset w to be its influence I_f(w) = q'_w(0) ≈ q_w(1) − q_w(0); in this paper, we use the term predicted effect interchangeably with influence. Intuitively, influence measures the effect of removing an infinitesimal weight from each point in w and then linearly extrapolates to removing all of w.2 By taking a Taylor approximation (see, e.g., Hampel et al. (1986) for details), the influence can be computed as

I_f(w) def= q'_w(0) = ∇_θ f(θ̂(1))⊤ [ (d/dt) θ̂(1 − tw) |_{t=0} ] = ∇_θ f(θ̂(1))⊤ H_{λ,1}⁻¹ g_1(w),   (4)

where g_1(w) = Σ_{i=1}^n w_i ∇_θ ℓ(x_i, y_i; θ̂(1)), H_1 = Σ_{i=1}^n ∇²_θ ℓ(x_i, y_i; θ̂(1)), and H_{λ,1} = H_1 + λI.

When measuring the change in test prediction or test loss, influence is additive: if w = w_1 + w_2, then I_f(w) = I_f(w_1) + I_f(w_2), i.e., the influence of a subset is the sum of the influences of its constituent points, and we can efficiently compute the influence of any subset by pre-computing the influence of each individual point (e.g., by taking a single inverse Hessian-vector product, as in Koh and Liang (2017)). However, when measuring the change in self-loss, influence is not additive and requires a separate calculation for each subset removed.

2.2 Relation to prior work

Influence functions, introduced in the seminal work of Hampel (1974) and in Jaeckel (1972) (where it was called the infinitesimal jackknife), have a rich history in robust statistics. The use of influence functions in the ML community is more recent, though growing; in Section 1, we provide references for several recent applications of influence functions in ML.

Removing a single training point, especially when the total number of points n is large, represents a small perturbation to the training distribution, so we expect the first-order influence approximation to be accurate. Indeed, prior work on the accuracy of influence has focused on this regime: e.g., Debruyne et al. (2008); Liu et al. (2014); Rad and Maleki (2018); Giordano et al. 
(2019b) give evidence that the influence on self-loss can approximate LOOCV, and Koh and Liang (2017) similarly examined the accuracy of estimating the change in test loss after removing single training points.

However, removing a constant fraction α of the training data represents a large perturbation to the training distribution. To the best of our knowledge, this setting has not been empirically studied; perhaps the closest work is Khanna et al. (2019)'s use of Bayesian quadrature to estimate a maximally influential subset. Instead, older references have alluded to the phenomena of correlation and underestimation we observe: Pregibon et al. (1981) note that influence tends to be conservative, while Hampel et al. (1986) say that "bold extrapolations" (i.e., large perturbations) are often still useful.

On the theoretical front, Giordano et al. (2019b) established finite-sample error bounds that apply to groups, e.g., showing that the leave-k-out approximation is consistent as the fraction of removed points α → 0. Our focus is instead on the relationship between the actual effect I*_f(w) and the predicted effect (influence) I_f(w) in the regime where α is constant and the error |I*_f(w) − I_f(w)| is large.

3 Empirical accuracy of influence functions on constructed groups

How well do influence functions estimate the effect of (removing) a group of training points? If n is large and we remove a subset w uniformly at random, the new parameters θ̂(1 − w) should remain close to θ̂(1) even when the fraction of removed points α is non-negligible, so the influence error |I*_f(w) − I_f(w)| should be small. However, we are usually interested in removing coherent, non-random groups, e.g., all points from a data source or all points that share some feature. 
In such settings, the parameters θ̂(1 − w) and θ̂(1) might differ substantially, and the error |I*_f(w) − I_f(w)| could be large. Put another way, there could be a cluster of points such that removing one of those points would not change the model by much (so influence could be low) but removing all of them would.

2 In the statistics literature, influence typically refers to the effect of adding weight, so the sign is flipped.

Dataset     Classes        n        d    λ/n           Test acc.   Source
Diabetes       2        20,000     127   2.2 × 10^-4    68.2%      Strack et al. (2014)
Enron          2         4,137   3,289   1.0 × 10^-3    96.1%      Metsis et al. (2006)
Dogfish        2         1,800   2,048   2.2 × 10^-2    98.5%      Koh and Liang (2017)
MNIST         10        55,000     784   1.0 × 10^-3    92.1%      LeCun et al. (1998)
CDR            2        24,177     328   1.0 × 10^-4    67.4%      Hancock et al. (2018)
MultiNLI       3       392,702     600   1.0 × 10^-4    50.4%      Williams et al. (2018)

Table 1: Dataset characteristics and the test accuracies that logistic regression achieves (with regularization λ selected by cross-validation). n is the training set size and d is the number of features.

Surprisingly (to us), we found that even when removing large and coherent groups of points, the influence I_f(w) behaved consistently relative to the actual effect I*_f(w) on test predictions, test losses, and self-loss, with two broad phenomena emerging:

1. Correlation: I_f(w) and I*_f(w) rank subsets of points w similarly (e.g., high Spearman ρ).
2. Underestimation: I_f(w) and I*_f(w) tend to have the same sign, with |I_f(w)| < |I*_f(w)|.3

Here, we report results on 5 datasets chosen to span a range of applications, training set size n, and number of features d (Table 1).4 In an attempt to make the influence approximation as inaccurate as possible, we constructed a variety of subsets, from small (α = 0.25%) to large (α = 25%), to be coherent and to have considerable influence on the model. On each dataset, we trained an L2-regularized logistic regression model (or a softmax model for the multiclass tasks) and compared the influences and actual effects of these subsets.

Group construction. Our aim is to construct coherent groups that, when removed, will substantially change the model. To do so, we need to choose points that are similar in some way. Specifically, for each dataset, we grouped points in 7 ways: 1) points that share feature values; 2) points that cluster on their features or 3) on their gradients ∇_θ ℓ(x, y, θ̂(1)); 4) random points within the same class; 5) random points from any class. We also grouped 6) points with large positive and 7) large negative influence on the test loss ℓ(x_test, y_test, θ̂(1)), since intuitively, training points that all have high influence on a test point should act together to change the model substantially. Overall, for each dataset, we constructed 1,700 subsets ranging in size from 0.25% to 25% of the training points. See Appendix A for more details.

Results. 
Figure 1 shows that the influences and actual effects of all of these subsets on test prediction (Top), test loss (Mid), and self-loss (Bot) are highly correlated (Spearman ρ of 0.89 to 0.99 across all plots), even though the absolute and relative errors of the influence approximation can be quite large. Moreover, the influence of a group tends to underestimate its actual effect in all settings except for groups with negative influence on test loss (the left side of each plot in Figure 1-Mid). These trends held across a wide range of regularizations λ, though correlation increased with λ (Appendix C.2).

In Section 5, we will use the CDR dataset (Hancock et al., 2018) and the MultiNLI dataset (Williams et al., 2018) to show that correlation and underestimation also apply to groups of data that arise naturally, and that influence functions can therefore be used to derive insights about real datasets and applications. Before that, we first attempt to develop some theoretical insight into the results above.

3 This holds with one exception: when measuring the change in test loss, f(θ) = ℓ(x_test, y_test; θ), underestimation only holds when the actual effect I*_f(w) is positive (Figure 1-Mid).

4 The first 4 datasets involve hospital readmission prediction, spam classification, and object recognition, and were used in Koh and Liang (2017) to study the influence of individual points. The fifth dataset is a chemical-disease relationship (CDR) dataset (Hancock et al., 2018). In Section 5, we will also study the MultiNLI language inference dataset (Williams et al., 2018), which was omitted from the experiments here because its large size makes repeated retraining to compute the actual effect too expensive. See Appendix B for dataset details.

Figure 1: Influences vs. actual effects of coherent groups of points ranging from 0.25% to 25% in size. 
Each point corresponds to a group, and its color re\ufb02ects how that group was constructed.\nIn Top and Mid, we show results for the test point with highest loss; other test points are similar\n(Appendix C.1), though with more curvature for test loss (Appendix C.3). The grey reference line\nhas slope 1, and the red borders represent points that are not plotted because they are outside the x- or\ny-axis range. We omit the top row for MNIST, as \u03b8(cid:62)xtest is not meaningful in the multi-class setting.\n\n4 Theoretical analysis\n\nThe experimental results above show that there is consistent underestimation and high correlation\nbetween the predicted effects, based on in\ufb02uence functions, and the actual effects of groups across a\nvariety of datasets, despite the in\ufb02uence approximation incurring large absolute and relative error. As\nwe discussed in Section 2.2, this is outside the regime of existing theory.\nAs an initial step towards understanding the high-error regime, we establish conditions under which\nthe actual effect I\u2217\nf (w) lies approximately between If (w) and CmaxIf (w) for some Cmax > 0.\nThis cone constraint\u2014so called because it implies that all points on the graph of in\ufb02uence vs. actual\neffect lie within a cone\u2014implies underestimation and, if Cmax is small, some degree of correlation.\nWe \ufb01rst show that this constraint holds in restricted settings\u2014when measuring self-loss, or when\nremoving multiple copies of the same point\u2014and that Cmax varies inversely with the regularization\nterm \u03bb, which is expected since stronger regularization reduces the change in the model. 
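As a toy numerical illustration of the two quantities compared in this section, one can compute the influence I_f(w) from the Hessian and per-point gradients and check it against the retraining-based actual effect I*_f(w). The sketch below uses our own synthetic setup and names; it is not the paper's experiments.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, d, lam = 120, 3, 1.0
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)

def obj(theta, sw):
    # Weighted logistic loss with L2 regularization.
    return sw @ np.log1p(np.exp(-y * (X @ theta))) + 0.5 * lam * theta @ theta

def fit(sw):
    return minimize(obj, np.zeros(d), args=(sw,), method="L-BFGS-B").x

theta = fit(np.ones(n))
p = 1.0 / (1.0 + np.exp(-y * (X @ theta)))                 # sigma(y x^T theta)
grads = ((p - 1.0) * y)[:, None] * X                       # per-point loss gradients
H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(d)   # regularized Hessian
x_test = rng.normal(size=d)
v = np.linalg.solve(H, x_test)                             # H^{-1} x_test

predicted, actual = [], []
for _ in range(5):
    mask = rng.random(n) < 0.2                             # a random ~20% group
    predicted.append(grads[mask].sum(axis=0) @ v)          # influence: x_test^T H^{-1} g(w)
    actual.append(x_test @ (fit(1.0 - mask.astype(float)) - theta))
```

On well-behaved data, the two lists tend to agree in sign and rank, which is the phenomenon studied here; nothing in the sketch guarantees it.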
However, the cone constraint is stronger than necessary because it bounds the degree of underestimation, and we construct counterexamples to show that it need not hold in more general settings.

Our analysis centers on the one-step Newton approximation, which estimates the change in parameters

θ̂(1 − w) − θ̂(1) ≈ Δθ_Nt(w) def= (H_{λ,1}(1 − w))⁻¹ g_1(w),

where H_{λ,1}(1 − w) = (Σ_{i=1}^n (1 − w_i) ∇²_θ ℓ(x_i, y_i; θ̂(1))) + λI is the regularized empirical Hessian at θ̂(1) but reweighted after removing the subset w. This change in parameters gives the Newton approximation of the effect I^Nt_f(w) = f(θ̂(1) + Δθ_Nt(w)) − f(θ̂(1)) and the corresponding Newton error Err_Nt-act(w) = I*_f(w) − I^Nt_f(w), which measures its gap from the actual effect.

Specifically, we decompose the error between the actual effect I*_f(w) and the influence I_f(w) as

I*_f(w) − I_f(w) = [I*_f(w) − I^Nt_f(w)] + [I^Nt_f(w) − I_f(w)] = Err_Nt-act(w) + Err_Nt-inf(w).   (5)

Figure 2: The Newton approximation accurately captures the actual effect for our datasets (though there is more error on the Diabetes dataset), with the same test point as in Figure 1-Top. We omit MNIST and MultiNLI for computational reasons. 
See Figure C.4 for plots of test loss and self-loss.

In Section 4.1, we first show that the Newton-actual error Err_Nt-act(w) decays at a rate of O(1/(σ_min + λ)³), where λ is the regularization strength and σ_min is the smallest eigenvalue of the empirical Hessian H_1. Empirically, this error is small on our datasets, so we focus on characterizing the Newton-influence error Err_Nt-inf(w) in Section 4.2. We use this characterization to study the behavior of influence relative to the actual effect on self-loss (Section 4.3) and test prediction (Section 4.4). For margin-based models, the test loss is a monotone function of the test prediction, so the analysis is similar (Appendix D.3).

4.1 Bounding the error of the one-step Newton approximation

The Newton approximation is computationally expensive because it computes (H_{λ,1}(1 − w))⁻¹ for each w (instead of the fixed H_{λ,1}⁻¹ in the influence calculation). However, it provides more accurate estimates (e.g., Pregibon et al. (1981), Rad and Maleki (2018)), and we show that its error can be bounded as follows (all proofs in Appendix E):

Proposition 1. Let the Newton error be Err_Nt-act(w) def= I*_f(w) − I^Nt_f(w). Assume that the evaluation function f(θ) is C_f-Lipschitz and that the Hessian ∇²_θ ℓ(x, y, θ) is C_H-Lipschitz. Then

|Err_Nt-act(w)| ≤ n ‖w‖₁² C_f C_H C_ℓ² / (σ_min + λ)³,

where we define C_ℓ def= max_{1≤i≤n} ‖∇_θ ℓ(x_i, y_i, θ̂(1))‖₂ to be the largest norm of a training point's gradient at θ̂(1), and σ_min to be the smallest eigenvalue of H_1. 
Err_Nt-act(w) only involves third-order or higher derivatives of the loss, so it is 0 for quadratic losses.

Proposition 1 tells us that the Newton approximation is accurate when λ is large or the third derivative of ℓ(x, y; ·) (controlled by C_H) is small. Empirically, the Newton error Err_Nt-act(w) is strikingly small in most of our settings (Figure 2), even though the overall error of the influence approximation I*_f(w) − I_f(w) is still large. In the remainder of this section, we therefore focus on characterizing the Newton-influence error Err_Nt-inf(w), under the assumption that the Newton approximation is similar to the actual effect (within a factor of O(1/λ³)).

4.2 Characterizing the difference between the Newton approximation and influence

We next characterize the Newton-influence error Err_Nt-inf(w) = I^Nt_f(w) − I_f(w):

Proposition 2. Under the assumptions of Proposition 1 and the additional assumption that the third derivative of f(θ) exists and is bounded in norm by C_{f,3}, the Newton-influence error Err_Nt-inf(w) is

Err_Nt-inf(w) = ∇_θ f(θ̂(1))⊤ H_{λ,1}^{-1/2} D(w) H_{λ,1}^{-1/2} g_1(w) + (1/2) Δθ_Nt(w)⊤ ∇²_θ f(θ̂(1)) Δθ_Nt(w) + Err_{f,3}(w),

where the middle term is the error from the curvature of f(·), and where D(w) def= (I − H_{λ,1}^{-1/2} H_1(w) H_{λ,1}^{-1/2})⁻¹ − I and H_1(w) def= Σ_{i=1}^n w_i ∇²_θ ℓ(x_i, y_i; θ̂(1)). The error matrix D(w) has eigenvalues between 0 and σ_max/λ, where σ_max is the largest eigenvalue of H_1. The residual term Err_{f,3}(w) captures the error due to third-order derivatives of f(·) and is bounded by |Err_{f,3}(w)| ≤ ‖w‖₁³ C_{f,3} C_ℓ³ / (6(σ_min + λ)³).

We can interpret Proposition 2 as a formalization of Hampel et al. (1986)'s observation that influence approximations are accurate when the model is robust and the curvature of the loss is low. In general, the error decreases as λ increases and f(·) becomes less curved; in Figure C.2, we show that increasing λ reduces error and increases correlation in our experiments.

Figure 3: Influence I_f(w) vs. Newton approximation I^Nt_f(w) on the test prediction on two counterexamples detailed in Appendix D.1. Left: We adversarially choose a set of w's such that I_f(w) and I^Nt_f(w) can have different signs and need not correlate. Right: When we only remove copies of single points, underestimation holds. However, we can control the scaling factor d(w) between I_f(w) and I^Nt_f(w) on different groups, so correlation need not hold.

4.3 The relationship between influence and actual effect on self-loss

Let us now apply Proposition 2 to analyze the behavior of influence under different choices of the evaluation function f(·). We start with the self-loss f(θ) = Σ_{i=1}^n w_i ℓ(x_i, y_i; θ), as its influences and actual effects are always non-negative, and it is the cleanest to characterize:

Proposition 3. 
Under the assumptions of Proposition 2, the influence on the self-loss obeys

I_f(w) + Err_{f,3}(w) ≤ I^Nt_f(w) ≤ (1 + 3σ_max/(2λ) + σ_max²/(2λ²)) I_f(w) + Err_{f,3}(w).

The constraint in Proposition 3 implies that, up to O(1/λ³) terms, influence underestimates the Newton approximation and therefore the actual effect. This explains the previously-unexplained downward bias observed when using influence to approximate LOOCV (Debruyne et al., 2008; Giordano et al., 2019b). Equivalently, all points on the graph of influences vs. actual effects lie within the cone bounded by the lines with slope 1 and slope λ/(λ + 3σ_max/2), up to O(1/λ³) terms. As λ grows, these lines will converge, and the error terms Err_{f,3}(w) and Err_Nt-act(w) will decay at a rate of O(1/λ³), forcing the influences and actual effects to be equal.

However, λ/σ_max is quite small in our experiments in Section 3, so the actual correlation of influence is better than predicted by this theory: in Figure 1-Bot, the sizes of the theoretically-permissible cones can be quite large, but the points in the graphs nevertheless trace a tight curve through the cone.

4.4 The relationship between influence and actual effect on a test point

We now turn to measuring the test prediction f(θ) = θ⊤x_test. Here, we show that correlation and underestimation need not hold, and that we cannot obtain a cone constraint similar to Proposition 3 except in a restricted setting. Define v_test = H_{λ,1}^{-1/2} x_test and v_w = H_{λ,1}^{-1/2} g_1(w). Proposition 2 gives:

Corollary 1. Suppose f(θ) = θ⊤x_test. 
Then I^Nt_f(w) = I_f(w) + v_test⊤ D(w) v_w, where D(w) = (I − H_{λ,1}^{-1/2} H_1(w) H_{λ,1}^{-1/2})⁻¹ − I is the error matrix from Proposition 2.

Unfortunately, Corollary 1 implies that no cone constraint applies: in general, we can find x_test such that the influence I_f(w) = v_test⊤ v_w = 0 but the Newton approximation I^Nt_f(w) = v_test⊤ D(w) v_w is large. As a counterexample, Figure 3-Left shows that on synthetic data, I_f(w) and I^Nt_f(w) can even have opposite signs on some subsets w.

We can recover a cone constraint similar to Proposition 3 if we restrict our attention to the special case where we use a margin-based model and remove (possibly multiple) copies of a single point:

Proposition 4. Consider a binary classification setting with y ∈ {−1, +1} and a margin-based model with loss ℓ(x, y; θ) = φ(yθ⊤x) for some φ : R → R₊. Suppose f(θ) = θ⊤x_test and that the subset w comprises ‖w‖₁ identical copies of the training point (x_w, y_w). Then under the assumptions of Proposition 1, the Newton approximation I^Nt_f(w) is related to the influence I_f(w) according to

I^Nt_f(w) = I_f(w) / (1 − ‖w‖₁ · φ''(y_w θ̂(1)⊤x_w) · x_w⊤ H_{λ,1}⁻¹ x_w).

This implies that the Newton approximation I^Nt_f(w) is bounded between I_f(w) and (1 + σ_max/λ) I_f(w).

Similar to Proposition 3, Proposition 4 shows that up to O(1/λ³) terms, the influence underestimates the actual effect when removing copies of a single point. Moreover, all points on the graph of influences vs. 
actual effects lie within the cone bounded by the lines with slope 1 and slope λ/(λ + σ_max), up to O(1/λ³) terms. As λ/σ_max grows, the cone shrinks, and correlation increases. However, if λ/σ_max is small (as in our experiments in Section 3), the cone is wide, and the scaling factor

$$ d(w) = \frac{1}{1 - \|w\|_1 \cdot \phi''\!\left(y_w\, \hat\theta(1)^{\top} x_w\right) \cdot x_w^{\top} H_{\lambda,1}^{-1} x_w} $$

in Proposition 4 can be quite large for some subsets w but not for others. In particular, d(w) is large when there are few remaining points in the direction of the removed points. In Figure 3-Right, we exploit this fact to show that the influence I_f(w) and the Newton approximation I_f^Nt(w) can exhibit low correlation (e.g., low I_f(w) need not mean low I_f^Nt(w)), even in the simplified setting of removing copies of single points. We comment on the analogue of d(w) in the general multiple-point setting in Appendix D.2, and on the influence on test loss (instead of test prediction) in Appendix D.3.

5 Applications of influence functions on natural groups of data

The analysis in Section 4 shows that the cone constraint between predicted and actual group effects need not always hold. Nonetheless, our experiments in Section 3 demonstrate that on real datasets, the correlation is much stronger than the theory predicts. We now turn to using influence functions to predict group effects in two case studies where groups arise naturally.

Chemical-disease relation (CDR). The CDR dataset tackles the following task: given text about the relationship between a chemical and a disease, predict whether the chemical causes the disease. It was collected via data programming, in which users provide labeling functions (LFs) instead of labels; each LF takes in an unlabeled point and either abstains or outputs a heuristic label (Ratner et al., 2016). Specifically, Hancock et al.
(2018) collected natural language explanations of provided classifications; parsed those explanations into LFs; and used those LFs to label a large pool of data (Appendix B.1). We used influence functions to study two important properties of LFs: coverage, the fraction of unlabeled points for which an LF outputs a non-abstaining label; and precision, the proportion of the labels it outputs that are correct. We associated each LF with the group of points that it labeled and computed its influence; as expected, these influences correlated with actual effects on overall test loss (Spearman ρ = 1; Figure C.5). LFs with higher coverage had more influence (Figure 4-Left; see also Figure C.6), but surprisingly, LFs with higher precision did not (Figure 4-Mid). The association with coverage stems at least partially from class balance: each LF outputs either all positive or all negative labels, so removing an LF with high coverage changes the class balance and consequently improves test performance on one class at the expense of the other (Figure 4-Left). While these findings are not causal claims, they suggest that the coverage of an LF, rather than its precision, might have a stronger effect on its overall contribution to test performance.

MultiNLI. The MultiNLI dataset deals with natural language inference: determining whether a pair of sentences agree, contradict each other, or are neutral. Williams et al. (2018) presented crowdworkers with initial sentences from five genres and asked them to generate follow-on sentences that were neutral or in agreement/contradiction (Appendix B.2).
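In both case studies, the predicted effect of removing a group is its influence: one inverse-Hessian-vector product against the summed loss gradients of the group's training points. The sketch below is our own minimal illustration for an L2-regularized logistic regression; the function names and the sign convention are our assumptions, not the paper's released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam, steps=50):
    """Newton's method for L2-regularized logistic regression (y in {0, 1})."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = (p - y) @ X + lam * theta
        H = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])
        theta -= np.linalg.solve(H, grad)
    return theta

def group_influence(X, y, theta, lam, group_idx, x_test, y_test):
    """First-order prediction of the change in a test point's loss when the
    rows in group_idx are removed and the model is retrained. Positive means
    removal is predicted to increase the test loss (our sign convention)."""
    p = sigmoid(X @ theta)
    # Hessian of the full regularized training objective at theta.
    H = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])
    # Summed loss gradients of the removed points: the group influence is
    # linear, i.e., the sum of the individual points' influences.
    g_group = (p[group_idx] - y[group_idx]) @ X[group_idx]
    # Gradient of the test loss at theta.
    g_test = (sigmoid(x_test @ theta) - y_test) * x_test
    return g_group @ np.linalg.solve(H, g_test)
```

Because the estimate is linear in the removed points, the influence of a labeling function's group or a crowdworker's group is simply the sum of its members' influences, and every group reuses the same Hessian.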
We studied the effect that each crowdworker had on the model's test set performance by computing the influence of the examples they created on overall test loss (Spearman ρ of 0.77 to 0.86 with actual effects across different genres; see Figure C.8).

Figure 4: In CDR, the influence of a labeling function (LF) on test performance is predicted by its coverage (Left) but not its precision (Mid). However, in MultiNLI, the number of examples contributed by a crowdworker is not predictive of their influence (Right). For CDR, LFs output either all + or all − labels; we plot the influence of each LF on the test points of the same class.

Studying the influence of each crowdworker reveals that the number of examples a crowdworker created was not predictive of influence on test performance: e.g., the most prolific crowdworker contributed 35,000 examples but had negative influence, and we verified that removing all of those examples and retraining the model indeed made overall test performance worse (Figure 4-Right). Curiously, this effect was genre-specific: crowdworkers who improved performance on some genres would lower performance on others (Figure C.10), even though the number of examples they contributed to a genre did not correlate with their influence on it (Figure C.11). We note that these results are obtained on a baseline logistic regression model built on top of a continuous bag-of-words representation.
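The retraining check above (remove a group, refit, and compare against the prediction) can be sketched exactly with a ridge-regression surrogate, where the refit has a closed form. This is our own illustration under that simplifying assumption; the paper's case studies use logistic regression.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of sum_i (x_i' theta - y_i)^2 / 2 + (lam/2)||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
theta = ridge_fit(X, y, lam)
H = X.T @ X + lam * np.eye(d)          # Hessian of the full objective
x_test = rng.standard_normal(d)

group = rng.choice(n, size=40, replace=False)   # remove 20% of the data
keep = np.setdiff1d(np.arange(n), group)

# Actual effect on the test prediction: exact refit without the group.
actual = x_test @ (ridge_fit(X[keep], y[keep], lam) - theta)

# Predicted effect: first-order influence, one inverse-Hessian-vector
# product with the summed residual gradients of the removed points.
g_group = (X[group] @ theta - y[group]) @ X[group]
predicted = x_test @ np.linalg.solve(H, g_group)

print(predicted, actual)
```

For a single removed point i, the two quantities are related exactly by the leverage factor 1/(1 − x_i⊤H⁻¹x_i) in this ridge setting, the one-point analogue of the scaling factor d(w) in Proposition 4.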
Identifying precisely what makes a crowdworker's contributions useful, especially on higher-performing models, could help us improve dataset collection and credit attribution, as well as better understand the biases due to annotator effects (Geva et al., 2019).

6 Discussion

In this paper, we showed empirically that the influences of groups of points are highly correlated with, and consistently underestimate, their actual effects across a range of datasets, types of groups, and group sizes. These phenomena allow us to use influence functions to better understand the "different stories that different parts of the data tell," in the words of Hampel et al. (1986). We showed that we can gain insight into the effects of a labeling function in data programming, or of a crowdworker in a crowdsourced dataset, by computing the influence of their corresponding groups of points.

While these applications involved predefined groups, influence functions could potentially also discover coherent, semantically-relevant groups in the data. They can also be used to approximate Shapley values, which are a different but related way of measuring the effect of data points; see, e.g., Jia et al. (2019) and Ghorbani and Zou (2019). Separately, influence functions can also estimate the effects of adding training points. In this context, underestimation turns into overestimation, i.e., the influence of adding a group of training points tends to overestimate the actual effect of adding that group. This raises the possibility of using influence functions to evaluate the vulnerability of a given dataset and model to data poisoning attacks (Steinhardt et al., 2017).

Our theoretical analysis showed that while correlation and underestimation hold in some restricted settings, they need not hold in general, realistic settings.
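The failure mode in Corollary 1 can be checked numerically: choose x_test so that v_test is orthogonal to v_w; the influence v_test⊤v_w then vanishes while the Newton correction v_test⊤D(w)v_w generally does not. Below is our own synthetic sketch, with a random PSD contraction standing in for H_{λ,1}^{-1/2} H_1(w) H_{λ,1}^{-1/2} (an assumption for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# A random PSD contraction (eigenvalues in [0, 0.9]) standing in for
# H^{-1/2} H_1(w) H^{-1/2}, giving the error matrix D(w) of Corollary 1.
B = rng.standard_normal((d, d))
A = B @ B.T
A *= 0.9 / np.linalg.eigvalsh(A).max()
D = np.linalg.inv(np.eye(d) - A) - np.eye(d)

v_w = rng.standard_normal(d)
# Project a random vector to be orthogonal to v_w: the influence is 0.
u = rng.standard_normal(d)
v_test = u - (u @ v_w) / (v_w @ v_w) * v_w

influence = v_test @ v_w               # zero by construction
newton = influence + v_test @ D @ v_w  # generally nonzero

print(influence, newton)
```
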
This gap between theory and experiments opens up important directions for future work: Why do we observe such striking correlation between predicted and actual effects on real data? To what extent is this due to the specific model, datasets, or subsets used? Do these trends hold for non-convex models like neural networks? Our work suggests that there could be distributional assumptions that hold for real data and give rise to the broad phenomena of correlation and underestimation. One promising lead is the surprising observation that the Newton approximation is much more accurate than influence at predicting group effects, which holds out the hope that we can understand group effects using just low-order terms (since the Newton approximation only uses the first and second derivatives of the loss) without needing to account for the whole loss function through higher order terms (as in Giordano et al. (2019a)).

Reproducibility

The code for replicating our experiments is available in the GitHub repository https://github.com/kohpangwei/group-influence-release. An executable version of this paper is also available on CodaLab at https://worksheets.codalab.org/worksheets/0xfed2ae0b9e5b44b7a1af8096365592a5.

Acknowledgments

We are grateful to Zhenghao Chen, Brad Efron, Jean Feng, Tatsunori Hashimoto, Robin Jia, Stephen Mussmann, Aditi Raghunathan, Marco Túlio Ribeiro, Noah Simon, Jacob Steinhardt, and Jian Zhang for helpful discussions and comments. We are further indebted to Ryan Giordano, Ruoxi Jia, and Will Stephenson for discussion about prior work, and Samuel Bowman, Braden Hancock, Emma Pierson, and Pranav Rajpurkar for their assistance with applications and datasets. This work was funded by an Open Philanthropy Project Award.
PWK was supported by the Facebook Fellowship Program.

References

I. Arrieta-Ibarra, L. Goff, D. Jiménez-Hernández, J. Lanier, and E. G. Weyl. Should we treat data as labor? Moving beyond "free". In American Economic Association Papers and Proceedings, volume 108, pages 38–42, 2018.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

M. Brunet, C. Alkalay-Houlihan, A. Anderson, and R. Zemel. Understanding the origins of bias in word embeddings. arXiv preprint arXiv:1810.03611, 2018.

I. Chen, F. D. Johansson, and D. Sontag. Why is my classifier discriminatory? In Advances in Neural Information Processing Systems (NeurIPS), pages 3539–3550, 2018.

R. D. Cook. Detection of influential observation in linear regression. Technometrics, 19:15–18, 1977.

M. Debruyne, M. Hubert, and J. A. Suykens. Model selection in kernel based regression using the influence function. Journal of Machine Learning Research (JMLR), 9:2377–2400, 2008.

M. Geva, Y. Goldberg, and J. Berant. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Empirical Methods in Natural Language Processing (EMNLP), 2019.

A. Ghorbani and J. Zou. Data Shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019.

R. Giordano, M. I. Jordan, and T. Broderick. A higher-order Swiss Army infinitesimal jackknife. arXiv preprint arXiv:1907.12116, 2019a.

R. Giordano, W. Stephenson, R. Liu, M. Jordan, and T. Broderick. A Swiss Army infinitesimal jackknife. In Artificial Intelligence and Statistics (AISTATS), pages 1139–1147, 2019b.

F. R. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.

F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel.
Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986.

B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré. Training classifiers with natural language explanations. In Association for Computational Linguistics (ACL), 2018.

J. Hayes and O. Ohrimenko. Contamination attacks and mitigation in multi-party machine learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 6604–6615, 2018.

L. A. Jaeckel. The infinitesimal jackknife. Unpublished memorandum, Bell Telephone Laboratories, Murray Hill, NJ, 1972.

R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gurel, B. Li, C. Zhang, D. Song, and C. Spanos. Towards efficient data valuation based on the Shapley value. arXiv preprint arXiv:1902.10275, 2019.

R. Khanna, B. Kim, J. Ghosh, and O. Koyejo. Interpreting black box predictions using Fisher kernels. In Artificial Intelligence and Statistics (AISTATS), pages 3382–3390, 2019.

P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML), 2017.

P. W. Koh, J. Steinhardt, and P. Liang. Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741, 2019.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10), 2010.

Y. Liu, S. Jiang, and S. Liao. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In International Conference on Machine Learning (ICML), pages 324–332, 2014.

V. Metsis, I.
Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes – which naive Bayes? In CEAS, volume 17, pages 28–69, 2006.

D. Pregibon. Logistic regression diagnostics. Annals of Statistics, 9(4):705–724, 1981.

K. R. Rad and A. Maleki. A scalable estimate of the extra-sample prediction error via approximate leave-one-out. arXiv preprint arXiv:1801.10243, 2018.

A. J. Ratner, C. M. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems (NeurIPS), pages 3567–3575, 2016.

P. Schulam and S. Saria. Can you trust this prediction? Auditing pointwise reliability after learning. In Artificial Intelligence and Statistics (AISTATS), pages 1022–1031, 2019.

J. Steinhardt, P. W. Koh, and P. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore. Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014, 2014.

H. Wang, B. Ustun, and F. P. Calmon. Repairing without retraining: Avoiding disparate impact with counterfactual distributions. arXiv preprint arXiv:1901.10501, 2019.

C. Wei, Y. Peng, R. Leaman, A. P. Davis, C. J. Mattingly, J. Li, T. C. Wiegers, and Z. Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pages 154–166, 2015.

A. Williams, N. Nangia, and S. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Association for Computational Linguistics (ACL), pages 1112–1122, 2018.

J. Zhou, Z. Li, H. Hu, K. Yu, F. Chen, Z. Li, and Y. Wang.
Effects of influence on user trust in predictive decision making. In Conference on Human Factors in Computing Systems (CHI), 2019.