{"title": "To Trust Or Not To Trust A Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 5541, "page_last": 5552, "abstract": "Knowing when a classifier's prediction can be trusted is useful in many applications and critical for safely using AI. While the bulk of the effort in machine learning research has been towards improving classifier performance, understanding when a classifier's predictions should and should not be trusted has received far less attention. The standard approach is to use the classifier's discriminant or confidence score; however, we show there exists an alternative that is more effective in many situations. We propose a new score, called the {\\it trust score}, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines. Further, under some mild distributional assumptions, we show that if the trust score for an example is high (low), the classifier will likely agree (disagree) with the Bayes-optimal classifier. Our guarantees consist of non-asymptotic rates of statistical consistency under various nonparametric settings and build on recent developments in topological data analysis.", "full_text": "To Trust Or Not To Trust A Classi\ufb01er\n\nHeinrich Jiang\u2217\nGoogle Research\n\nheinrichj@google.com\n\nBeen Kim\nGoogle Brain\n\nbeenkim@google.com\n\nMelody Y. Guan\u2020\nStanford University\n\nmguan@stanford.edu\n\nMaya Gupta\n\nGoogle Research\n\nmayagupta@google.com\n\nAbstract\n\nKnowing when a classi\ufb01er\u2019s prediction can be trusted is useful in many applications\nand critical for safely using AI. 
While the bulk of the effort in machine learning\nresearch has been towards improving classi\ufb01er performance, understanding when\na classi\ufb01er\u2019s predictions should and should not be trusted has received far less\nattention. The standard approach is to use the classi\ufb01er\u2019s discriminant or con\ufb01dence\nscore; however, we show there exists an alternative that is more effective in many\nsituations. We propose a new score, called the trust score, which measures the\nagreement between the classi\ufb01er and a modi\ufb01ed nearest-neighbor classi\ufb01er on\nthe testing example. We show empirically that high (low) trust scores produce\nsurprisingly high precision at identifying correctly (incorrectly) classi\ufb01ed examples,\nconsistently outperforming the classi\ufb01er\u2019s con\ufb01dence score as well as many other\nbaselines. Further, under some mild distributional assumptions, we show that if the\ntrust score for an example is high (low), the classi\ufb01er will likely agree (disagree)\nwith the Bayes-optimal classi\ufb01er. Our guarantees consist of non-asymptotic rates\nof statistical consistency under various nonparametric settings and build on recent\ndevelopments in topological data analysis.\n\n1\n\nIntroduction\n\nMachine learning (ML) is a powerful and widely-used tool for making potentially important decisions,\nfrom product recommendations to medical diagnosis. However, despite ML\u2019s impressive performance,\nit makes mistakes, with some more costly than others. As such, ML trust and safety is an important\ntheme [1, 2, 3]. While improving overall accuracy is an important goal that the bulk of the effort in\nML community has been focused on, it may not be enough: we need to also better understand the\nstrengths and limitations of ML techniques.\nThis work focuses on one such challenge: knowing whether a classi\ufb01er\u2019s prediction for a test example\ncan be trusted or not. Such trust scores have practical applications. 
They can be directly shown to\nusers to help them gauge whether they should trust the AI system. This is crucial when a model\u2019s\nprediction in\ufb02uences important decisions such as a medical diagnosis, but can also be helpful even\nin low-stakes scenarios such as movie recommendations. Trust scores can be used to override the\nclassi\ufb01er and send the decision to a human operator, or to prioritize decisions that human operators\nshould be making. Trust scores are also useful for monitoring classi\ufb01ers to detect distribution shifts\nthat may mean the classi\ufb01er is no longer as useful as it was when deployed.\n\n\u2217All authors contributed equally.\n\u2020Work done while intern at Google Research.\nAn open-source implementation of Trust Scores can be found here: https://github.com/google/TrustScore\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fA standard approach to deciding whether to trust a classi\ufb01er\u2019s decision is to use the classi\ufb01ers\u2019 own\nreported con\ufb01dence or score, e.g. probabilities from the softmax layer of a neural network, distance\nto the separating hyperplane in support vector classi\ufb01cation, mean class probabilities for the trees in\na random forest. While using a model\u2019s own implied con\ufb01dences appears reasonable, it has been\nshown that the raw con\ufb01dence values from a classi\ufb01er are poorly calibrated [4, 5]. Worse yet, even if\nthe scores are calibrated, the ranking of the scores itself may not be reliable. In other words, a higher\ncon\ufb01dence score from the model does not necessarily imply higher probability that the classi\ufb01er is\ncorrect, as shown in [6, 7, 8]. A classi\ufb01er may simply not be the best judge of its own trustworthiness.\nIn this paper, we use a set of labeled examples (e.g. training data or validation data) to help determine\na classi\ufb01er\u2019s trustworthiness for a particular testing example. 
First, we propose a simple procedure that reduces the training data to a high-density set for each class. Then we define the trust score, the ratio between the distance from the testing sample to the nearest class different from the predicted class and the distance to the predicted class, to determine whether to trust that classifier prediction.
Theoretically, we show that high/low trust scores correspond to high probability of agreement/disagreement with the Bayes-optimal classifier. We show finite-sample estimation rates both when the data is full-dimensional and when it is supported on or near a low-dimensional manifold. Interestingly, we attain bounds that depend only on the lower manifold dimension and are independent of the ambient dimension, without any changes to the procedure or knowledge of the manifold. To our knowledge, these results are new and may be of independent interest.
Experimentally, we found that the trust score identifies correctly classified points better than the model itself for low- and medium-dimensional feature spaces. However, high-dimensional feature spaces were more challenging, and we demonstrate that the trust score's utility depends on the vector space used to compute the distances in the trust score.

2 Related Work

One related line of work is that of confidence calibration, which transforms classifier outputs into values that can be interpreted as probabilities, e.g. [9, 10, 11, 4]. In recent work, [5] explore the structured prediction setting, and [12] obtain confidence estimates by using ensembles of networks. These calibration techniques typically only use the model's reported score (and the softmax layer in the case of a neural network) for calibration, which notably preserves the rankings of the classifier scores. 
Similarly, [13] considered using the softmax probabilities for the related problem of\nidentifying misclassi\ufb01cations and mislabeled points.\nRecent work explored estimating uncertainty for Bayesian neural networks and returning a distribution\nover the outputs [14, 15]. The proposed trust score does not change the network structure (nor does\nit assume any structure) and gives a single score, rather than a distribution over outputs as the\nrepresentation of uncertainty.\nThe problem of classi\ufb01cation with a reject option or learning with abstention [16, 17, 18, 19, 20, 21,\n22] is a highly related framework where the classi\ufb01er is allowed to abstain from making a prediction\nat a certain cost. Typically such methods jointly learn the classi\ufb01er and the rejection function. Note\nthat the interplay between classi\ufb01cation rate and reject rate is studied in many various forms e.g.\n[23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. Our paper assumes an already trained and possibly black-box\nclassi\ufb01er and learns the con\ufb01dence scores separately, but we do not explicitly learn the appropriate\nrejection thresholds.\nWhether to trust a classi\ufb01er also arises in the setting where one has access to a sequence of classi\ufb01ers,\nbut there is some cost to evaluating each classi\ufb01er, and the goal is to decide after evaluating each\nclassi\ufb01er in the sequence if one should trust the current classi\ufb01er decision enough to stop, rather than\nevaluating more classi\ufb01ers in the sequence (e.g. [33, 34, 35]). 
Those confidence decisions are usually based on whether the current classifier score will match the classification of the full sequence.
Experimentally we find that the vector space used to compute the distances in the trust score matters, and that computing trust scores on more-processed layers of a deep model generally works better. This observation is similar to the work of Papernot and McDaniel [36], who use k-NN regression on the intermediate representations of the network, which they showed enhances robustness to adversarial attacks and leads to better-calibrated uncertainty estimates.
Our work builds on recent results in topological data analysis. Our method to filter low-density points estimates a particular density level-set given a parameter α, which aims at finding the level-set that contains a 1 − α fraction of the probability mass. Level-set estimation has a long history [37, 38, 39, 40, 41, 42]. However, such works assume knowledge of the density level, which is difficult to determine in practice. We provide rates for Algorithm 1 in estimating the appropriate level-set corresponding to α without knowledge of the level; the proxy α, which is used for level-set estimation, is a more intuitive parameter than the density value. Our analysis is also done under various settings, including when the data lies near a lower-dimensional manifold, in which case we provide rates that depend only on the lower dimension.

3 Algorithm: The Trust Score

Our approach proceeds in two steps, outlined in Algorithms 1 and 2. We first pre-process the training data, as described in Algorithm 1, to find the α-high-density-set of each class, which is defined as the training samples within that class after filtering out the α-fraction of the samples with lowest density (which may be outliers):
Definition 1 (α-high-density-set). 
Let 0 ≤ α < 1 and let f be a continuous density function with compact support X ⊆ R^D. Then define H_α(f), the α-high-density-set of f, to be the λ_α-level set of f, defined as {x ∈ X : f(x) ≥ λ_α}, where

    λ_α := inf{ λ ≥ 0 : ∫_X 1[f(x) ≤ λ] f(x) dx ≥ α }.

In order to approximate the α-high-density-set, Algorithm 1 filters out the α-fraction of the sample points with lowest empirical density, based on k-nearest neighbors. This data-filtering step is independent of the given classifier h.
Then, the second step: given a testing sample, we define its trust score to be the ratio between the distance from the testing sample to the α-high-density-set of the nearest class different from the predicted class, and the distance from the testing sample to the α-high-density-set of the class predicted by h, as detailed in Algorithm 2. The intuition is that if the classifier h predicts a label whose high-density-set is considerably farther from the testing sample than that of the closest label, this is a warning that the classifier may be making a mistake.
Our procedure can thus be viewed as a comparison to a modified nearest-neighbor classifier, where the modification lies in the initial filtering of points not in the α-high-density-set of each class.
Remark 1. The distances can be computed with respect to any representation of the data: for example, the raw inputs, an unsupervised embedding of the space, or the activations of the intermediate representations of the classifier. 
Moreover, the nearest-neighbor distance can be replaced by other distance measures, such as k-nearest neighbors or distance to a centroid.

Algorithm 1 Estimating the α-high-density-set
Parameters: α (density threshold), k.
Inputs: Sample points X := {x_1, ..., x_n} drawn from f.
Define the k-NN radius r_k(x) := inf{r > 0 : |B(x, r) ∩ X| ≥ k} and let ε := inf{r > 0 : |{x ∈ X : r_k(x) > r}| ≤ α · n}.
Return Ĥ_α(f) := {x ∈ X : r_k(x) ≤ ε}.

Algorithm 2 Trust Score
Parameters: α (density threshold), k.
Inputs: Classifier h : X → Y. Training data (x_1, y_1), ..., (x_n, y_n). Test example x.
For each ℓ ∈ Y, let Ĥ_α(f_ℓ) be the output of Algorithm 1 with parameters α, k and sample points {x_j : 1 ≤ j ≤ n, y_j = ℓ}. Then return the trust score, defined as

    ξ(h, x) := d(x, Ĥ_α(f_{h̃(x)})) / d(x, Ĥ_α(f_{h(x)})),  where h̃(x) := argmin_{ℓ ∈ Y, ℓ ≠ h(x)} d(x, Ĥ_α(f_ℓ)).

The method has two hyperparameters used to compute the empirical densities: k (the number of neighbors, as in k-NN) and α (the fraction of data to filter). We show in theory that k can lie in a wide range and still give us the desired consistency guarantees. Throughout our experiments, we fix k = 10 and use cross-validation to select α, as it is data-dependent.
Remark 2. We observed that the procedure was not very sensitive to the choice of k and α. As will be shown in the experimental section, for efficiency on larger datasets, we skipped the initial filtering step of Algorithm 1 (leading to a hyperparameter-free procedure) and obtained reasonable results. This initial filtering step can also be replaced by other strategies. 
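For concreteness, the two procedures above can be sketched in plain Python with brute-force nearest-neighbor search. This is a minimal sketch under our reading of the pseudocode, not the official open-source release (which, among other things, can use approximate neighbor search); `knn_radius`, `high_density_set`, and `trust_score` are names we introduce here.

```python
import math

def knn_radius(points, x, k):
    # r_k(x): distance from x to its k-th nearest neighbor in `points`
    # (a sample point counts as its own neighbor, as in Algorithm 1).
    return sorted(math.dist(x, p) for p in points)[k - 1]

def high_density_set(points, k, alpha):
    # Algorithm 1 (sketch): filter out the alpha-fraction of sample points
    # with the largest k-NN radius, i.e. the lowest empirical density.
    radii = {p: knn_radius(points, p, k) for p in points}
    eps = sorted(radii.values())[max(0, math.ceil((1 - alpha) * len(points)) - 1)]
    return [p for p in points if radii[p] <= eps]

def trust_score(x, predicted, filtered):
    # Algorithm 2: distance to the nearest *other* class's high-density set
    # divided by the distance to the predicted class's high-density set.
    def dist(label):
        return min(math.dist(x, p) for p in filtered[label])
    d_pred = dist(predicted)
    d_other = min(dist(l) for l in filtered if l != predicted)
    return d_other / max(d_pred, 1e-12)
```

On two well-separated clusters, a test point deep inside the predicted class's high-density set receives a score well above 1, while predicting the other class for the same point yields a score below 1.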
One such example is \ufb01ltering\nexamples whose labels have high disagreement amongst its neighbors, which is implemented in the\nopen-source code release but not experimented with here.\n\n4 Theoretical Analysis\n\nIn this section, we provide theoretical guarantees for Algorithms 1 and 2. Due to space constraints,\nall the proofs are deferred to the Appendix. To simplify the main text, we state our results treating \u03b4,\nthe con\ufb01dence level, as a constant. The dependence on \u03b4 in the rates is made explicit in the Appendix.\nWe show that Algorithm 1 is a statistically consistent estimator of the \u03b1-high-density-level set with\n\ufb01nite-sample estimation rates. We analyze Algorithm 1 in three different settings: when the data lies\non (i) a full-dimensional RD; (ii) an unknown lower dimensional submanifold embedded in RD; and\n(iii) an unknown lower dimensional submanifold with full-dimensional noise.\nFor setting (i), where the data lies in RD, the estimation rate has a dependence on the dimension D,\nwhich may be unattractive in high-dimensional situations: this is known as the curse of dimensionality,\nsuffered by density-based procedures in general. However, when the data has low intrinsic dimension\nin (ii), it turns out that, remarkably, without any changes to the procedure, the estimation rate depends\non the lower dimension d and is independent of the ambient dimension D. However, in realistic\nsituations, the data may not lie exactly on a lower-dimensional manifold, but near one. This re\ufb02ects\nthe setting of (iii), where the data essentially lies on a manifold but has general full-dimensional noise\nso the data is overall full-dimensional. 
Interestingly, we show that we still obtain estimation rates depending only on the manifold dimension and independent of the ambient dimension; moreover, we do not require knowledge of the manifold nor its dimension to attain these rates.
We then analyze Algorithm 2, and establish the culminating result of Theorem 4: for labeled data distributions with well-behaved class margins, when the trust score is large, the classifier likely agrees with the Bayes-optimal classifier, and when the trust score is small, the classifier likely disagrees with the Bayes-optimal classifier. If it turns out that even the Bayes-optimal classifier has high error in a certain region, then any classifier will have difficulties in that region. Thus, Theorem 4 does not guarantee that the trust score can predict misclassification, but rather that it can predict when the classifier is making an unreasonable decision.

4.1 Analysis of Algorithm 1

We require the following regularity assumptions on the boundaries of H_α(f), which are standard in analyses of level-set estimation [40]. Assumption 1.1 ensures that the density around H_α(f) has both smoothness and curvature. The upper bound gives smoothness, which is important to ensure that our density estimators are accurate for our analysis (we only require this smoothness near the boundaries and not globally). The lower bound ensures curvature: this ensures that H_α(f) is salient enough to be estimated. Assumption 1.2 ensures that H_α(f) does not get arbitrarily thin anywhere.
Assumption 1 (α-high-density-set regularity). Let β > 0. There exist constants Č_β, Ĉ_β, r_c, r_0, ρ > 0 s.t.
1. Č_β · d(x, H_α(f))^β ≤ |λ_α − f(x)| ≤ Ĉ_β · d(x, H_α(f))^β for all x ∈ ∂H_α(f) + B(0, r_c).
2. For all 0 < r < r_0 and x ∈ H_α(f), we have Vol(B(x, r) ∩ H_α(f)) ≥ ρ · r^D,
where ∂A denotes the boundary of a set A, d(x, A) := inf_{x′ ∈ A} ‖x − x′‖, B(x, r) := {x′ : ‖x − x′‖ ≤ r}, and A + B(0, r) := {x : d(x, A) ≤ r}.
Our statistical guarantees are under the Hausdorff metric, which ensures a uniform guarantee over our estimator: it is a stronger notion of consistency than other common metrics [41, 43].
Definition 2 (Hausdorff distance). d_H(A, B) := max{sup_{x ∈ A} d(x, B), sup_{x ∈ B} d(x, A)}.
We now give the following result for Algorithm 1. It says that as long as our density function satisfies the regularity assumptions stated earlier, and the parameter k lies within a certain range, then we can bound the Hausdorff distance between what Algorithm 1 recovers and H_α(f), the true α-high-density-set, from an i.i.d. sample drawn from f of size n. Then, as n goes to ∞ and k grows as a function of n, the quantity goes to 0.
Theorem 1 (Algorithm 1 guarantees). Let 0 < δ < 1 and suppose that f is continuous, has compact support X ⊆ R^D, and satisfies Assumption 1. There exist constants C_l, C_u, C > 0 depending on f and δ such that the following holds with probability at least 1 − δ. Suppose that k satisfies C_l · log n ≤ k ≤ C_u · (log n)^{D/(2β+D)} · n^{2β/(2β+D)}. Then we have

    d_H(H_α(f), Ĥ_α(f)) ≤ C · ( n^{−1/(2D)} + log(n)^{1/(2β)} · k^{−1/(2β)} ).

Remark 3. The condition on k can be simplified by ignoring log factors: log n ≲ k ≲ n^{2β/(2β+D)}, which is a wide range. Setting k to its allowed upper bound, we obtain our consistency guarantee of

    d_H(H_α(f), Ĥ_α(f)) ≲ max{n^{−1/(2D)}, n^{−1/(2β+D)}}.

The first term is due to the error from estimating the appropriate level given α (i.e. identifying the level λ_α) and the second term corresponds to the error in recovering the level set given knowledge of the level. The latter term matches the lower bound for level-set estimation up to log factors [39].

4.2 Analysis of Algorithm 1 on Manifolds

One of the disadvantages of Theorem 1 is that the estimation errors have a dependence on D, the dimension of the data, which may be highly undesirable in high-dimensional settings. We next improve these rates when the data has a lower intrinsic dimension. Interestingly, we are able to show rates that depend only on the intrinsic dimension of the data, without explicit knowledge of that dimension nor any changes to the procedure. As is common in the manifold setting, we make the following regularity assumptions, which are standard among works in manifold learning (e.g. [44, 45, 46]).
Assumption 2 (Manifold regularity). M is a d-dimensional smooth compact Riemannian manifold without boundary embedded in a compact subset X ⊆ R^D with bounded volume. M has finite condition number 1/τ, which controls the curvature and prevents self-intersection.
Theorem 2 (Manifold analogue of Theorem 1). Let 0 < δ < 1. Suppose that the density function f is continuous and supported on M, and that Assumptions 1 and 2 hold. Suppose also that there exists λ_0 > 0 such that f(x) ≥ λ_0 for all x ∈ M. Then there exist constants C_l, C_u, C > 0 depending on f and δ such that the following holds with probability at least 1 − δ. Suppose that k satisfies C_l · log n ≤ k ≤ C_u · (log n)^{d/(2β′+d)} · n^{2β′/(2β′+d)}, where β′ := max{1, β}. 
Then we have

    d_H(H_α(f), Ĥ_α(f)) ≤ C · ( n^{−1/(2d)} + log(n)^{1/(2β)} · k^{−1/(2β)} ).

Remark 4. Setting k to its allowed upper bound, we obtain (ignoring log factors)

    d_H(H_α(f), Ĥ_α(f)) ≲ max{n^{−1/(2d)}, n^{−1/(2 max{1, β}+d)}}.

The first term can be compared to that of the previous result, where D is replaced with d. The second term is the error for recovering the level set on manifolds, which matches recent rates [42].

4.3 Analysis of Algorithm 1 on Manifolds with Full-Dimensional Noise

In realistic settings, the data may not lie exactly on a low-dimensional manifold, but near one. We next present a result where the data is distributed along a manifold with additional full-dimensional noise. We make mild assumptions on the noise distribution. Thus, in this situation, the data has intrinsic dimension equal to the ambient dimension. Interestingly, we are still able to show that the rates depend only on the dimension of the manifold and not the dimension of the entire data.
Theorem 3. Let 0 < η < α < 1 and 0 < δ < 1. Suppose that the distribution F is a weighted mixture (1 − η) · F_M + η · F_E, where F_M is a distribution with continuous density f_M supported on a d-dimensional manifold M satisfying Assumption 2, and F_E is a (noise) distribution with continuous density f_E with compact support over R^D with d < D. Suppose also that there exists λ_0 > 0 such that f_M(x) ≥ λ_0 for all x ∈ M, and that H_α̃(f_M) (where α̃ := (α − η)/(1 − η)) satisfies Assumption 1 for density f_M. Let Ĥ_α be the output of Algorithm 1 on a sample X of size n drawn i.i.d. from F. Then there exist constants C_l, C_u, C > 0 depending on f_M, f_E, η, M and δ such that the following holds with probability at least 1 − δ. 
Suppose that k satisfies C_l · log n ≤ k ≤ C_u · (log n)^{d/(2β′+d)} · n^{2β′/(2β′+d)}, where β′ := max{1, β}. Then we have

    d_H(H_α̃(f_M), Ĥ_α) ≤ C · ( n^{−1/(2d)} + log(n)^{1/(2β)} · k^{−1/(2β)} ).

The above result is compelling because it shows why our methods can work, even in high dimensions, despite the curse of dimensionality of non-parametric methods. In typical real-world data, even if the data lies in a high-dimensional space, there may be far fewer degrees of freedom. Thus, our theoretical results suggest that when this is true, our methods will enjoy far better convergence rates, even when the data overall has full intrinsic dimension due to factors such as noise.

4.4 Analysis of Algorithm 2: the Trust Score

We now provide a guarantee about the trust score, making the same assumptions as in Theorem 3 for each of the label distributions. We additionally assume that the class distributions are well-behaved in the following sense: the high-density regions of the classes satisfy the property that for any point x ∈ X, if the ratio of the distance to one class's high-density region to that of another is smaller than 1 by some margin γ, then it is more likely that x's label corresponds to the former class.
Theorem 4. Let 0 < η < α < 1. Let us have labeled data (x_1, y_1), ..., (x_n, y_n) drawn from distribution D, which is a joint distribution over X × Y where Y are the labels, |Y| < ∞, and X ⊆ R^D is compact. Suppose that for each ℓ ∈ Y, the conditional distribution for label ℓ satisfies the conditions of Theorem 3 for some manifold and noise level η. Let f_{M,ℓ} be the density of the portion of the conditional distribution for label ℓ supported on M. Define M_ℓ := H_α̃(f_{M,ℓ}), where α̃ := (α − η)/(1 − η), and let ε_n be the maximum Hausdorff error from estimating M_ℓ over each ℓ ∈ Y in Theorem 3. Assume that min_{ℓ ∈ Y} P_D(y = ℓ) > 0, to ensure we have samples from each label.
Suppose also that for each x ∈ X, if d(x, M_i)/d(x, M_j) < 1 − γ then P(y = i|x) > P(y = j|x) for i, j ∈ Y. That is, if we are closer to M_i than to M_j by a ratio of less than 1 − γ, then the point is more likely to be from class i. Let h* be the Bayes-optimal classifier, defined by h*(x) := argmax_{ℓ ∈ Y} P(y = ℓ|x). Then the trust score ξ of Algorithm 2 satisfies the following with high probability, uniformly over all x ∈ X and all classifiers h : X → Y simultaneously, for n sufficiently large depending on D:

    ξ(h, x) < 1 − γ − (ε_n / (d(x, M_{h(x)}) + ε_n)) · ( d(x, M_{h̃(x)}) / d(x, M_{h(x)}) + 1 )  ⇒  h(x) ≠ h*(x),

    1/ξ(h, x) < 1 − γ − (ε_n / (d(x, M_{h̃(x)}) + ε_n)) · ( d(x, M_{h(x)}) / d(x, M_{h̃(x)}) + 1 )  ⇒  h(x) = h*(x).

5 Experiments

In this section, we empirically test whether trust scores can both detect examples that are incorrectly classified with high precision and be used as a signal to determine which examples are likely correctly classified. We perform this evaluation across (i) different datasets (Sections 5.1 and 5.3), (ii) different families of classifiers (neural network, random forest and logistic regression) (Section 5.1), (iii) classifiers with varying accuracy on the same task (Section 5.2), and (iv) different representations of the data, e.g. 
input data or activations of various intermediate layers in a neural network (Section 5.3).

Figure 1: Two example datasets and models. For predicting correctness (top row) the vertical dotted black line indicates the error level of the trained classifier. For predicting incorrectness (bottom row) the vertical black dotted line is the accuracy rate of the classifier. For detecting trustworthy examples, for each percentile level, we take the test examples whose trust score was above that percentile level and plot the percentage of those test points that were correctly classified by the classifier, and do the same for the model confidence and the 1-nn ratio. For detecting suspicious examples, we take the negative of each signal and plot the precision of identifying incorrectly classified examples. Shown are averages of 20 runs with a shaded standard-error band. The trust score consistently attains a higher precision for each given percentile of classifier decision-rejection. Furthermore, the trust score generally shows increasing precision as the percentile level increases, but surprisingly, many of the comparison baselines do not. See the Appendix for the full results.

First, we test whether testing examples with high trust scores correspond to examples on which the model is correct ("identifying trustworthy examples"). Each method produces a numeric score for each testing example. For each method, we bin the data points by percentile value of the score (i.e. 100 bins). Given a recall percentile level (i.e. the x-axis on our plots), we take the performance of the classifier on the bins above the percentile level as the precision (i.e. the y-axis). Then, we take the negative of each signal and test whether low trust scores correspond to the model being wrong ("identifying suspicious examples"). 
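In code, the precision-at-percentile curve described above can be sketched as follows (our own simplified rendering of the protocol; `precision_at_percentile` is a name we introduce, not from the paper's code release):

```python
def precision_at_percentile(scores, correct, level):
    """Precision of the classifier on the test points whose signal
    (trust score, model confidence, ...) is at or above the given
    percentile level in [0, 100)."""
    ranked = sorted(zip(scores, correct))    # ascending by signal
    cutoff = int(len(ranked) * level / 100)  # drop the lowest-scoring fraction
    kept = ranked[cutoff:]
    return sum(c for _, c in kept) / len(kept)

# Identifying trustworthy examples: precision of "classifier correct" among
# high-signal points. Identifying suspicious examples: negate the signal and
# score the precision of "classifier incorrect" instead.
```

For instance, with signals [0.1, 0.2, 0.3, 0.4] and correctness flags [0, 0, 1, 1], the precision is 0.5 at the 0th percentile (the classifier's raw accuracy) and 1.0 at the 50th.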
Here the y-axis is the misclassification rate and the x-axis corresponds to decreasing trust score or model confidence.
In both cases, the higher the precision-vs-percentile curve, the better the method. The vertical black dotted lines in the plots represent the omniscient ideal: for identifying trustworthy examples it is the error rate of the classifier, and for identifying suspicious examples it is the accuracy rate.
The main baseline we use is the model's own confidence score, which is similar to the approach of [13]. While calibrating the classifiers' confidence scores (i.e. transforming them into probability estimates of correctness) is important related work [4, 9], such techniques typically do not change the rankings of the scores, at least in the binary case. Since we evaluate the trust score on its precision at a given recall percentile level, we are interested in the relative ranking of the scores rather than their absolute values. Thus, we do not compare against calibration techniques. There are surprisingly few methods aimed at identifying correctly or incorrectly classified examples with precision at a recall percentile level, as noted in [13].
Choosing Hyperparameters: The two hyperparameters for the trust score are α and k. Throughout the experiments, we fix k = 10 and choose α using cross-validation over (negative) powers of 2 on the training set. The metric for cross-validation was optimal performance at detecting suspicious examples at the percentile corresponding to the classifier's accuracy. The bulk of the computational cost for the trust score is in k-nearest-neighbor computations for training and 1-nearest-neighbor searches for evaluation. 
To speed things up for the larger datasets MNIST, SVHN, CIFAR-10 and CIFAR-100, we skipped the initial filtering step of Algorithm 1 altogether and reduced the intermediate layers down to 20 dimensions using PCA before computing the trust score, which showed similar performance. We note that any approximation method (such as approximate instead of exact nearest neighbors) could have been used instead.

5.1 Performance on Benchmark UCI Datasets

In this section, we show performance on five benchmark UCI datasets [47], each for three kinds of classifiers (neural network, random forest and logistic regression). Due to space, we only show two data sets and two models in Figure 1. The rest can be found in the Appendix. For each method and dataset, we evaluated with multiple runs. For each run we took a random stratified split of the dataset into two halves: one portion was used for training the trust score and the other for evaluation, and the standard error is shown in addition to the average precision across the runs at each percentile level. The results show that our method consistently has a higher precision-vs-percentile curve than the rest of the methods across the datasets and models. This suggests the trust score considerably improves upon known methods as a signal for identifying trustworthy and suspicious testing examples for low-dimensional data.

Figure 2: We show the performance of the trust score on the Digits dataset for a neural network as we increase the accuracy. As we go from left to right, we train the network with more iterations (each with batch size 50), thus increasing the accuracy, indicated by the dotted vertical lines. While the trust score still performs better than model confidence, the amount of improvement diminishes.

In addition to the model's own confidence score, we try one additional baseline, which we call the nearest neighbor ratio (1-nn ratio). 
It is the ratio between the 1-nearest-neighbor distances to the closest and second-closest classes, which can be viewed as an analogue to the trust score without knowledge of the classifier's hard prediction.

5.2 Performance as Model Accuracy Varies

In Figure 2, we show how the performance of the trust score changes as the accuracy of the classifier changes (averaged over 20 runs for each condition). We observe that as the accuracy of the model increases, the trust score still performs better than model confidence, but the amount of improvement diminishes. This suggests that as the model improves, the information the trust score can provide in addition to the model confidence decreases. However, as we show in Section 5.3, the trust score can still add value even when the classifier is known to perform well, on some benchmark larger-scale datasets.

5.3 Performance on MNIST, SVHN, CIFAR-10 and CIFAR-100 Datasets

The MNIST handwritten digit dataset [48] consists of 60,000 28×28-pixel training images and 10,000 testing images in 10 classes. The SVHN dataset [49] consists of 73,257 32×32-pixel colour training images and 26,032 testing images, also in 10 classes. The CIFAR-10 and CIFAR-100 datasets [50] both consist of 60,000 32×32-pixel colour images, with 50,000 training images and 10,000 test images, split evenly between 10 classes and 100 classes respectively.

Figure 3: Trust score results using convolutional neural networks on MNIST, SVHN, and CIFAR-10 datasets. Top row is detecting trustworthy; bottom row is detecting suspicious.
The full chart, including CIFAR-100 (which was essentially a negative result), is shown in the Appendix.

We used a pretrained VGG-16 [51] architecture with adaptation to the CIFAR datasets based on [52]. The CIFAR-10 VGG-16 network achieves a test accuracy of 93.56% while the CIFAR-100 network achieves a test accuracy of 70.48%. We used pretrained, smaller CNNs for MNIST and SVHN. The MNIST network achieves a test accuracy of 99.07% and the SVHN network achieves a test accuracy of 95.45%. All architectures were implemented in Keras [53].
One simple generalization of our method is to use intermediate layers of a neural network as input instead of the raw x. Much prior work suggests that a neural network may learn a different representation of x at each layer. As input to the trust score, we tried using 1) the logit layer, 2) the preceding fully connected layer with ReLU activation, and 3) that same fully connected layer (which has 128 dimensions in the MNIST network and 512 dimensions in the other networks) reduced down to 20 dimensions by applying PCA.
The trust score results on the various layers are shown in Figure 3. They suggest that for high-dimensional datasets, the trust score may provide little or no improvement over the model confidence at detecting trustworthy and suspicious examples. All plots were made using α = 0; using cross-validation to select a different α did not improve trust score performance. We also did not see much difference across layers.

Conclusion

In this paper, we propose the trust score: a new, simple, and effective way to judge whether one should trust the prediction from a classifier. The trust score provides information about the relative positions of the datapoints, which may be lost in common approaches such as the model confidence when the model is trained using SGD.
We show high-probability non-asymptotic statistical guarantees that high (low) trust scores correspond to agreement (disagreement) with the Bayes-optimal classifier under various nonparametric settings, building on recent results in topological data analysis. Our empirical results across many datasets, classifiers, and representations of the data show that our method consistently outperforms the classifier's own reported confidence in identifying trustworthy and suspicious examples on low- to mid-dimensional datasets. The theoretical and empirical results suggest that this approach may have important practical implications in low- to mid-dimensional settings.

https://github.com/geifmany/cifar-vgg
https://github.com/EN10/KerasMNIST
https://github.com/tohinz/SVHN-Classifier

References

[1] Kush R. Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big Data, 5(3):246–255, 2017.

[2] John D. Lee and Katrina A. See. Trust in automation: Designing for appropriate reliance. Human Factors, 46(1):50–80, 2004.

[3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.

[4] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

[5] Volodymyr Kuleshov and Percy S. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pages 3474–3482, 2015.

[6] Foster J. Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy estimation for comparing induction algorithms. In ICML, volume 98, pages 445–453, 1998.

[7] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[8] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.

[9] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[10] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699. ACM, 2002.

[11] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM, 2005.

[12] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6405–6416, 2017.

[13] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.

[14] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[15] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5580–5590, 2017.

[16] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008.

[17] Ming Yuan and Marten Wegkamp.
Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(Jan):111–130, 2010.

[18] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In International Conference on Algorithmic Learning Theory, pages 67–82. Springer, 2016.

[19] Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. In Advances in Neural Information Processing Systems, pages 537–544, 2009.

[20] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems, pages 1660–1668, 2016.

[21] Radu Herbei and Marten H. Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709–721, 2006.

[22] Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with abstention. arXiv preprint arXiv:1703.03478, 2017.

[23] C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.

[24] Bernard Dubuisson and Mylene Masson. A statistical decision rule with incomplete knowledge about classes. Pattern Recognition, 26(1):155–165, 1993.

[25] Giorgio Fumera, Fabio Roli, and Giorgio Giacinto. Multiple reject thresholds for improving classification reliability. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 863–871. Springer, 2000.

[26] Carla M. Santos-Pereira and Ana M. Pires. On optimal reject rules and ROC curves. Pattern Recognition Letters, 26(7):943–952, 2005.

[27] Francesco Tortorella. An optimal reject rule for binary classifiers. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 611–620. Springer, 2000.

[28] Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines, pages 68–82. Springer, 2002.

[29] Thomas C. W. Landgrebe, David M. J. Tax, Pavel Paclík, and Robert P. W. Duin. The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recognition Letters, 27(8):908–917, 2006.

[30] Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(May):1605–1641, 2010.

[31] Yair Wiener and Ran El-Yaniv. Agnostic selective classification. In Advances in Neural Information Processing Systems, pages 1665–1673, 2011.

[32] David M. J. Tax and Robert P. W. Duin. Growing a multi-class classifier with a reject option. Pattern Recognition Letters, 29(10):1565–1570, 2008.

[33] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. Efficient learning by directed acyclic graph for resource constrained prediction. In Advances in Neural Information Processing Systems (NIPS), 2015.

[34] Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, and Dun Yu Hsaio. Classifying with confidence from incomplete information. Journal of Machine Learning Research, 14(December):3561–3589, 2013.

[35] Wei Fan, Fang Chu, Haixun Wang, and Philip S. Yu. Pruning and dynamic scheduling of cost-sensitive ensembles. In AAAI, 2002.

[36] Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.

[37] John A. Hartigan. Clustering algorithms.
1975.

[38] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.

[39] Alexandre B. Tsybakov et al. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948–969, 1997.

[40] Aarti Singh, Clayton Scott, Robert Nowak, et al. Adaptive Hausdorff estimation of density level sets. The Annals of Statistics, 37(5B):2760–2782, 2009.

[41] Philippe Rigollet, Régis Vert, et al. Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154–1178, 2009.

[42] Heinrich Jiang. Density level set estimation on manifolds with DBSCAN. In International Conference on Machine Learning, pages 1684–1693, 2017.

[43] Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.

[44] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419–441, 2008.

[45] Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13(May):1263–1291, 2012.

[46] Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679–2687, 2013.

[47] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. 2001.

[48] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[49] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[50] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[51] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[52] Shuying Liu and Weihong Deng. Very deep convolutional neural network based image classification using small training sample size. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734, 2015.

[53] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

[54] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.

[55] Luc Devroye, Laszlo Gyorfi, Adam Krzyzak, and Gábor Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics, pages 1371–1385, 1994.

[56] Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems, pages 2555–2563, 2014.

[57] Frédéric Chazal. An upper bound for the volume of geodesic balls in submanifolds of Euclidean spaces. https://geometrica.saclay.inria.fr/team/Fred.Chazal/BallVolumeJan2013.pdf, 2013.

[58] Heinrich Jiang. Uniform convergence rates for kernel density estimation. In International Conference on Machine Learning, pages 1694–1703, 2017.