Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

Advances in Neural Information Processing Systems (NeurIPS 2019), pp. 1396–1408

Stephan Rabanser* (AWS AI Labs, rabans@amazon.com)
Stephan Günnemann (Technical University of Munich, guennemann@in.tum.de)
Zachary C. Lipton (Carnegie Mellon University, zlipton@cmu.edu)

Abstract

We might hope that when faced with unexpected inputs, well-designed software systems would fire off warnings. Machine learning (ML) systems, however, which depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to fail silently. This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift, identifying exemplars that most typify the shift, and quantifying shift malignancy. We focus on several datasets and various perturbations to both covariates and label distributions, with varying magnitudes and fractions of data affected. Interestingly, we show that across the dataset shifts we explore, a two-sample-testing-based approach using pre-trained classifiers for dimensionality reduction performs best. Moreover, we demonstrate that domain-discriminating approaches tend to be helpful for characterizing shifts qualitatively and determining if they are harmful.

1 Introduction

Software systems employing deep neural networks are now applied widely in industry, powering the vision systems in social networks [47] and self-driving cars [5], providing assistance to radiologists [24], underpinning recommendation engines used by online platforms [9, 12], enabling the best-performing commercial speech recognition software [14, 21], and automating translation between languages [50]. In each of these systems, predictive models are integrated into conventional human-interacting software systems, leveraging their predictions to drive consequential decisions.

The reliable functioning of software depends crucially on tests. Many classic software bugs can be caught when software is compiled, e.g. that a function receives input of the wrong type, while other problems are detected only at run-time, triggering warnings or exceptions. In the worst case, if the errors are never caught, software may behave incorrectly without alerting anyone to the problem.

Unfortunately, software systems based on machine learning are notoriously hard to test and maintain [42]. Despite their power, modern machine learning models are brittle. Seemingly subtle changes in the data distribution can destroy the performance of otherwise state-of-the-art classifiers, a phenomenon exemplified by adversarial examples [51, 57].
When decisions are made under uncertainty, even shifts in the label distribution can significantly compromise accuracy [29, 56]. Unfortunately, in practice, ML pipelines rarely inspect incoming data for signs of distribution shift. Moreover, best practices for detecting shift in high-dimensional real-world data have not yet been established.²

In this paper, we investigate methods for detecting and characterizing distribution shift, with the hope of removing a critical stumbling block obstructing the safe and responsible deployment of machine learning in high-stakes applications. Faced with distribution shift, our goals are three-fold: (i) detect when distribution shift occurs from as few examples as possible; (ii) characterize the shift, e.g. by identifying those samples from the test set that appear over-represented in the target data; and (iii) provide some guidance on whether the shift is harmful or not.

*Work done while a Visiting Research Scholar at Carnegie Mellon University.
²TensorFlow's data validation tools compare only summary statistics of source vs target data: https://tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Our pipeline for detecting dataset shift. Source and target data are fed through a dimensionality reduction process and subsequently analyzed via statistical hypothesis testing. We consider various choices for how to represent the data and how to perform two-sample tests.
As part of this paper, we principally focus on goal (i) and explore preliminary approaches to (ii) and (iii).

We investigate shift detection through the lens of statistical two-sample testing. We wish to test the equivalence of the source distribution p (from which training data is sampled) and the target distribution q (from which real-world data is sampled). For simple univariate distributions, such hypothesis testing is a mature science. However, best practices for two-sample tests with high-dimensional (e.g. image) data remain an open question. While off-the-shelf methods for kernel-based multivariate two-sample tests are appealing, they scale poorly with dataset size, and their statistical power is known to decay badly in high ambient dimension [37].

Recently, Lipton et al. [29] presented results for a method called black box shift detection (BBSD), showing that if one possesses an off-the-shelf label classifier f with an invertible confusion matrix, then detecting that the source distribution p differs from the target distribution q requires only detecting that p(f(x)) ≠ q(f(x)). Building on their idea of combining black-box dimensionality reduction with subsequent two-sample testing, we explore a range of dimensionality-reduction techniques and compare them under a wide variety of shifts (Figure 1 illustrates our general framework). We show (empirically) that BBSD works surprisingly well under a broad set of shifts, even when the label shift assumption is not met. Furthermore, we provide an empirical analysis of the performance of domain-discriminating classifier-based approaches (i.e. classifiers explicitly trained to discriminate between source and target samples), which has so far not been characterized for the complex high-dimensional data distributions on which modern machine learning is routinely deployed.

2 Related work

Given just one example from the test data, our problem simplifies to anomaly detection, surveyed thoroughly by Chandola et al. [8] and Markou and Singh [33]. Popular approaches to anomaly detection include density estimation [6], margin-based approaches such as the one-class SVM [40], and the tree-based isolation forest method due to [30]. Recently, GANs have also been explored for this task [39]. Given simple streams of data arriving in a time-dependent fashion, where the signal is piece-wise stationary with abrupt changes, this is the classic time-series problem of change point detection, surveyed comprehensively by Truong et al. [52]. An extensive literature addresses dataset shift in the context of domain adaptation. Owing to the impossibility of correcting for shift absent assumptions [3], these papers often assume either covariate shift q(x, y) = q(x)p(y|x) [15, 45, 49] or label shift q(x, y) = q(y)p(x|y) [7, 29, 38, 48, 56]. Schölkopf et al. [41] provides a unifying view of these shifts, associating the assumed invariances with the corresponding causal assumptions.

Several recent papers have proposed outlier detection mechanisms, dubbing the task out-of-distribution (OOD) sample detection. Hendrycks and Gimpel [19] propose to threshold the maximum softmax entry of a neural network classifier, which already contains a relevant signal. Liang et al. [28] and Lee et al. [26] extend this idea by either adding temperature scaling and adversarial-like perturbations on the input or by explicitly adapting the loss to aid OOD detection. Choi and Jang [10] and Shalev et al. [44] employ model ensembling to further improve detection reliability. Alemi et al.
[2] motivate use of the variational information bottleneck. Hendrycks et al. [20] expose the model to OOD samples, exploring heuristics for discriminating between in-distribution and out-of-distribution samples. Shafaei et al. [43] survey numerous OOD detection techniques.

3 Shift Detection Techniques

Given labeled data {(x₁, y₁), ..., (xₙ, yₙ)} ∼ p and unlabeled data {x′₁, ..., x′ₘ} ∼ q, our task is to determine whether p(x) equals q(x′). Formally, H₀ : p(x) = q(x′) vs HA : p(x) ≠ q(x′). Chiefly, we explore the following design considerations: (i) what representation to run the test on; (ii) which two-sample test to run; (iii) when the representation is multidimensional, whether to run multivariate or multiple univariate two-sample tests; and (iv) how to combine their results.

3.1 Dimensionality Reduction

We now introduce the multiple dimensionality reduction (DR) techniques that we compare vis-a-vis their effectiveness in shift detection (in concert with two-sample testing). Note that, absent assumptions on the data, these mappings, which reduce the data dimensionality from D to K (with K ≪ D), are in general surjective, with many inputs mapping to the same output. Thus, it is trivial to construct pathological cases where the distribution of inputs shifts while the distribution of low-dimensional latent representations remains fixed, yielding false negatives. However, we speculate that in a non-adversarial setting, such shifts may be exceedingly unlikely.
Thus our approach is (i) empirically motivated; and (ii) not put forth as a defense against worst-case adversarial attacks.

No Reduction (NoRed): To justify the use of any DR technique, our default baseline is to run tests on the original raw features.

Principal Components Analysis (PCA): Principal components analysis is a standard tool that finds an optimal orthogonal transformation matrix R such that points are linearly uncorrelated after transformation. This transformation is learned in such a way that the first principal component accounts for as much of the variability in the dataset as possible, and each succeeding principal component captures as much of the remaining variance as possible, subject to the constraint that it be orthogonal to the preceding components. Formally, we wish to learn R given X under the mentioned constraints such that X̂ = XR yields a more compact data representation.

Sparse Random Projection (SRP): Since computing the optimal transformation might be expensive in high dimensions, random projections are a popular DR technique which trade a controlled amount of accuracy for faster processing times. Specifically, we make use of sparse random projections, a more memory- and computationally-efficient modification of standard Gaussian random projections. Formally, we generate a random projection matrix R and use it to reduce the dimensionality of a given data matrix X, such that X̂ = XR. The elements of R are generated using the following rule set [1, 27]:

    R_ij = +√(v/K)   with probability 1/(2v)
            0         with probability 1 − 1/v        where v = √D.     (1)
           −√(v/K)   with probability 1/(2v)

Autoencoders (TAE and UAE): We compare the above-mentioned linear models to non-linear reduced-dimension representations using both trained (TAE) and untrained autoencoders (UAE). Formally, an autoencoder consists of an encoder function φ : X → H and a decoder function ψ : H → X, where the latent space H has lower dimensionality than the input space X. As part of the training process, both the encoding function φ and the decoding function ψ are learned jointly to reduce the reconstruction loss: φ, ψ = arg min_{φ,ψ} ‖X − (ψ ∘ φ)(X)‖².

Label Classifiers (BBSDs and BBSDh): Motivated by recent results achieved by black box shift detection (BBSD) [29], we also propose to use the outputs of a (deep network) label classifier trained on source data as our dimensionality-reduced representation. We explore variants using either the softmax outputs (BBSDs) or the hard-thresholded predictions (BBSDh) for subsequent two-sample testing. Since the two variants provide differently sized output (BBSDs an entire softmax vector, BBSDh a one-dimensional class prediction), different statistical tests are carried out on these representations.

Domain Classifier (Classif): Here, we attempt to detect shift by explicitly training a domain classifier to discriminate between data from the source and target domains. To this end, we partition both the source data and target data into two halves, using the first to train a domain classifier to distinguish source (class 0) from target (class 1) data.
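To make the projection rule concrete, here is a minimal numpy sketch of generating R per Eq. (1) and applying it (the function name is ours; the paper's released pipeline uses sklearn's SparseRandomProjection instead):

```python
import numpy as np

def sparse_random_projection(X, K, seed=0):
    """Reduce an (N, D) matrix X to (N, K) using the sparse rule of Eq. (1):
    entries are +-sqrt(v / K) with probability 1 / (2v) each and 0 with
    probability 1 - 1 / v, where v = sqrt(D)."""
    N, D = X.shape
    v = np.sqrt(D)
    rng = np.random.default_rng(seed)
    # Draw the sign pattern first, then scale the nonzero entries.
    signs = rng.choice([-1.0, 0.0, 1.0], size=(D, K),
                       p=[1 / (2 * v), 1 - 1 / v, 1 / (2 * v)])
    R = signs * np.sqrt(v / K)
    return X @ R
```

With D = 784 (MNIST) and K = 32, only about a 1/√D ≈ 3.6% fraction of the entries of R are nonzero, which is what makes the projection cheap.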
We then apply this model to the second half and subsequently conduct a significance test to determine if the classifier's performance is statistically different from random chance.

3.2 Statistical Hypothesis Testing

The DR techniques each yield a representation, either uni- or multi-dimensional, and either continuous or discrete, depending on the method. The next step is to choose a suitable statistical hypothesis test for each of these representations.

Multivariate Kernel Two-Sample Tests: Maximum Mean Discrepancy (MMD): For all multi-dimensional representations, we evaluate the Maximum Mean Discrepancy [16], a popular kernel-based technique for multivariate two-sample testing. MMD allows us to distinguish between two probability distributions p and q based on the mean embeddings μ_p and μ_q of the distributions in a reproducing kernel Hilbert space F, formally

    MMD(F, p, q) = ‖μ_p − μ_q‖²_F .     (2)

Given m samples from p and n samples from q, we can calculate an unbiased estimate of the squared MMD statistic as follows:

    MMD² = 1/(m² − m) Σ_{i=1..m} Σ_{j≠i} κ(x_i, x_j) + 1/(n² − n) Σ_{i=1..n} Σ_{j≠i} κ(x′_i, x′_j) − 2/(mn) Σ_{i=1..m} Σ_{j=1..n} κ(x_i, x′_j)     (3)

where we use a squared exponential kernel κ(x, x̃) = e^{−‖x − x̃‖²/σ} and set σ to the median distance between points in the aggregate sample over p and q [16]. A p-value can then be obtained by carrying out a permutation test on the resulting kernel matrix.

Multiple Univariate Testing: Kolmogorov-Smirnov (KS) Test + Bonferroni Correction: As a simple baseline alternative to MMD, we consider the approach of testing each of the K dimensions separately (instead of testing over all dimensions jointly).
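A minimal sketch of the unbiased estimator in Eq. (3) together with a permutation test (our own simplified implementation; it uses the median of pairwise squared distances as one common variant of the median heuristic for σ):

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma):
    """Unbiased MMD^2 estimate of Eq. (3) with kernel exp(-||x - y||^2 / sigma)."""
    def sq_dists(A, B):
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    Kxx = np.exp(-sq_dists(X, X) / sigma)
    Kyy = np.exp(-sq_dists(Y, Y) / sigma)
    Kxy = np.exp(-sq_dists(X, Y) / sigma)
    m, n = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)  # drop the i == j terms of the first two sums
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (m * m - m) + Kyy.sum() / (n * n - n)
            - 2.0 * Kxy.sum() / (m * n))

def mmd_permutation_pvalue(X, Y, n_perm=200, seed=0):
    """p-value from randomly re-partitioning the aggregate sample."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    sigma = np.median(d2[np.triu_indices(len(Z), k=1)])  # median heuristic variant
    obs = mmd2_unbiased(X, Y, sigma)
    m = len(X)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        hits += mmd2_unbiased(Z[perm[:m]], Z[perm[m:]], sigma) >= obs
    return (hits + 1) / (n_perm + 1)
```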
Here, for continuous data, we adopt the Kolmogorov-Smirnov (KS) test, a non-parametric test whose statistic is the largest difference Z between the cumulative distribution functions (CDFs) over all values z:

    Z = sup_z |F_p(z) − F_q(z)|     (4)

where F_p and F_q are the empirical CDFs of the source and target data, respectively. Under the null hypothesis, Z follows the Kolmogorov distribution.

Since we carry out a KS test on each of the K components, we must subsequently combine the p-values from each test, raising the issue of multiple hypothesis testing. As we cannot make strong assumptions about the (in)dependence among the tests, we rely on a conservative aggregation method, notably the Bonferroni correction [4], which rejects the null hypothesis if the minimum p-value among all tests is less than α/K (where α is the significance level of the test). While several less conservative aggregation methods have been proposed [18, 32, 46, 53, 55], they typically require assumptions on the dependencies among the tests.

Categorical Testing: Chi-Squared Test: For the hard-thresholded label classifier (BBSDh), we employ Pearson's chi-squared test, a parametric test designed to evaluate whether the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. Specifically, we use a test of homogeneity between the class distributions (expressed in a contingency table) of the source and target data.
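The K univariate KS tests plus Bonferroni aggregation amount to only a few lines with scipy (a minimal sketch; the helper name is ours):

```python
import numpy as np
from scipy.stats import ks_2samp

def multi_ks_shift_test(S, T, alpha=0.05):
    """Two-sample KS test per dimension of (N, K) source S and target T;
    reject H0 (no shift) iff the minimum p-value falls below the
    Bonferroni threshold alpha / K."""
    K = S.shape[1]
    pvals = np.array([ks_2samp(S[:, k], T[:, k]).pvalue for k in range(K)])
    return bool(pvals.min() < alpha / K), pvals
```

For example, with K = 32 reduced dimensions, a shift is flagged only if some dimension's p-value drops below 0.05/32 ≈ 0.0016.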
The testing problem can be formalized as follows: given a contingency table with 2 rows (one for absolute source and one for absolute target class frequencies) and C columns (one for each of the C classes) containing observed counts O_ij, the expected frequency under the independence hypothesis for a particular cell is E_ij = N_sum p_i• p_•j, with N_sum being the sum of all cells in the table, p_i• = O_i•/N_sum = Σ_{j=1..C} O_ij / N_sum being the fraction of row totals, and p_•j = O_•j/N_sum = Σ_{i=1..2} O_ij / N_sum being the fraction of column totals. The relevant test statistic X² can be computed as

    X² = Σ_{i=1..2} Σ_{j=1..C} (O_ij − E_ij)² / E_ij     (5)

which, under the null hypothesis, follows a chi-squared distribution with C − 1 degrees of freedom: X² ∼ χ²_{C−1}.

Binomial Testing: For the domain classifier, we simply compare its accuracy (acc) on held-out data to random chance via a binomial test. Formally, we set up the testing problem H₀ : acc = 0.5 vs HA : acc ≠ 0.5. Under the null hypothesis, the accuracy of the classifier follows a binomial distribution: acc ∼ Bin(N_hold, 0.5), where N_hold corresponds to the number of held-out samples.

3.3 Obtaining Most Anomalous Samples

As our detection framework does not detect outliers but rather aims at capturing top-level shift dynamics, it is not possible for us to decide whether any given sample is in- or out-of-distribution. However, we can still provide an indication of what typical samples from the shifted distribution look like by harnessing the domain assignments from the domain classifier. Specifically, we can identify the exemplars which the classifier was most confident in assigning to the target domain.
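Both categorical tests map directly onto scipy, sketched here on toy counts (the prediction arrays are made-up illustrations, not the paper's data):

```python
import numpy as np
from scipy.stats import chi2_contingency, binomtest

# Chi-squared test of homogeneity on hard predictions (BBSDh), C = 3 classes.
src_preds = np.array([0] * 50 + [1] * 30 + [2] * 20)  # toy source predictions
tgt_preds = np.array([0] * 20 + [1] * 30 + [2] * 50)  # toy target predictions
table = np.stack([np.bincount(src_preds, minlength=3),
                  np.bincount(tgt_preds, minlength=3)])  # 2 x C contingency table
chi2, p_chi2, dof, expected = chi2_contingency(table)
# dof = (2 - 1) * (C - 1) = C - 1 = 2, matching Eq. (5).

# Binomial test comparing the domain classifier's held-out accuracy to chance.
n_hold, n_correct = 200, 130
p_bin = binomtest(n_correct, n_hold, p=0.5).pvalue
```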
Since the domain classifier assigns a class-assignment confidence score to each incoming sample via the softmax layer at its output, it is easy to create a ranking of the samples most confidently believed to come from the target domain (or, alternatively, from the source domain). Hence, whenever the binomial test signals a statistically significant deviation of accuracy from chance, we can use the domain classifier to obtain the most anomalous samples and present them to the user.

In contrast to the domain classifier, the other shift detectors do not base their shift detection potential on explicitly deciding which domain a single sample belongs to, instead comparing entire distributions against each other. While we did explore initial ideas on identifying samples which, if removed, would lead to a large increase in the overall p-value, the results we obtained were unremarkable.

3.4 Determining the Malignancy of a Shift

Theoretically, absent further assumptions, distribution shifts can cause arbitrarily severe degradation in performance. However, in practice distributions shift constantly, and often these changes are benign. Practitioners should therefore be interested in distinguishing malignant shifts that damage predictive performance from benign shifts that negligibly impact performance. Although prediction quality can be assessed easily on the source data on which the black-box model f was trained, we are unable to compute the target error directly without labels.

We therefore explore a heuristic method for approximating the target performance by making use of the domain classifier's class assignments as follows: given access to a labeling function that can correctly label samples, we can feed in those examples predicted by the domain classifier as likely to come from the target domain.
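The ranking step is essentially a one-liner given the domain classifier's softmax outputs; a small sketch (function and variable names are ours):

```python
import numpy as np

def most_anomalous(domain_probs, candidates, top_k=8):
    """Return the top_k samples the domain classifier most confidently
    assigns to the target domain (column 1 of its softmax output),
    together with the corresponding confidence scores."""
    conf_target = domain_probs[:, 1]       # P(domain = target | x)
    order = np.argsort(-conf_target)       # most confident first
    return candidates[order[:top_k]], conf_target[order[:top_k]]
```

Sorting by the source-domain column instead yields the "most similar" exemplars.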
We can then compare these (true) labels to the labels returned by the black-box model f when fed the same anomalous samples. If our model is inaccurate on these examples (where the exact threshold can be user-specified to account for varying sensitivities to accuracy drops), then we ought to be concerned that the shift is malignant. Put simply, we suggest evaluating the accuracy of our models on precisely those examples which are most confidently assigned to the target domain.

4 Experiments

Our main experiments were carried out on the MNIST (Ntr = 50000; Nval = 10000; Nte = 10000; D = 28 × 28 × 1; C = 10 classes) [25] and CIFAR-10 (Ntr = 40000; Nval = 10000; Nte = 10000; D = 32 × 32 × 3; C = 10 classes) [23] image datasets. For the autoencoder (UAE & TAE) experiments, we employ a convolutional architecture with 3 convolutional layers and 1 fully-connected layer. For both the label and the domain classifier we use a ResNet-18 [17]. We train all networks (TAE, BBSDs, BBSDh, Classif) using stochastic gradient descent with momentum in batches of 128 examples over 200 epochs with early stopping.

For PCA, SRP, UAE, and TAE, we reduce dimensionality to K = 32 latent dimensions, which for PCA explains roughly 80% of the variance in the CIFAR-10 dataset. The label classifier BBSDs reduces dimensionality to the number of classes C. Both the hard label classifier BBSDh and the domain classifier Classif reduce dimensionality to a one-dimensional class prediction, where BBSDh predicts label assignments and Classif predicts domain assignments.

To challenge our detection methods, we simulate a variety of shifts, affecting both the covariates and the label proportions. For all shifts, we evaluate the various methods' abilities to detect shift at a significance level of α = 0.05.
We also include the no-shift case to check against false positives. We randomly split all of the data into training, validation, and test sets according to the indicated proportions Ntr, Nval, and Nte, and then apply a particular shift to the test set only. To quantify the robustness of our findings, shift detection performance is averaged over a total of 5 random splits, which ensures that we apply the same type of shift to different subsets of the data. The selected training data used to fit the DR methods is kept constant across experiments, with only the splits between validation and test changing across the random runs. Note that the DR methods are learned using training data, while shift detection is performed on dimensionality-reduced representations of the validation and test sets. We evaluate the models with varying numbers of samples from the test set, s ∈ {10, 20, 50, 100, 200, 500, 1000, 10000}. Because of the unfavorable dependence of kernel methods on the dataset size, we run these methods only until 1000 target samples have been acquired.

For each shift type (as appropriate) we explored three levels of shift intensity (e.g. the magnitude of added noise) and various percentages of affected data δ ∈ {0.1, 0.5, 1.0}.
Specifically, we explore the following types of shifts:

(a) Adversarial (adv): We turn a fraction δ of samples into adversarial samples via FGSM [13];

(b) Knock-out (ko): We remove a fraction δ of samples from class 0, creating class imbalance [29];

(c) Gaussian noise (gn): We corrupt the covariates of a fraction δ of test set samples with Gaussian noise of standard deviation σ ∈ {1, 10, 100} (denoted s gn, m gn, and l gn);

(d) Image (img): We also explore more natural shifts to images, modifying a fraction δ of images with combinations of random rotations {10, 40, 90}, (x, y)-axis-translation percentages {0.05, 0.2, 0.4}, and zoom-in percentages {0.1, 0.2, 0.4} (denoted s img, m img, and l img);

(e) Image + knock-out (m img+ko): We apply a fixed medium image shift with δ₁ = 0.5 and a variable knock-out shift δ;

(f) Only-zero + image (oz+m img): Here, we only include images from class 0 in combination with a variable medium image shift affecting only a fraction δ of the data;

(g) Original splits: We evaluate our detectors on the original source/target splits provided by the creators of the MNIST, CIFAR-10, Fashion MNIST [54], and SVHN [35] datasets (assumed to be i.i.d.);

(h) Domain adaptation datasets: Data from the domain adaptation task transferring from MNIST (source) to USPS (target) (Ntr = Nval = Nte = 1000; D = 16 × 16 × 1; C = 10 classes) [31], as well as the COIL-100 dataset (Ntr = Nval = Nte = 2400; D = 32 × 32 × 3; C = 100 classes) [34], where images between 0° and 175° are sampled by the source and images between 180° and 355° by the target distribution.

We provide a sample implementation of our experiment pipeline written in Python, making use of sklearn [36] and Keras [11], located at: https://github.com/steverab/failing-loudly.

5 Discussion

Univariate vs Multivariate Tests: We first evaluate whether
we can detect shifts more easily using multiple univariate tests aggregated via the Bonferroni correction or using multivariate kernel tests. We were surprised to find that, despite the heavy correction, multiple univariate testing seems to offer comparable performance to multivariate testing (see Table 1a).

Dimensionality Reduction Methods: For each testing method and experimental setting, we evaluate which DR technique is best suited to shift detection. Specifically, in the multiple-univariate-testing case (and overall), BBSDs was the best-performing DR method. In the multivariate-testing case, UAE performed best. In both cases, these methods consistently outperformed the others across sample sizes. The domain classifier, a popular shift detection approach, performs badly in the low-sample regime (≤ 100 samples) but catches up as more samples are obtained. Notably, the multivariate test performs poorly in the no-reduction case, which is also widely used as a shift-detection baseline. Table 1a summarizes these results.

We note that BBSDs being the best overall method for detecting shift is good news for ML practitioners. When building black-box models with the main purpose of classification, said model can be

Table 1: Dimensionality reduction methods (a) and shift-type (b) comparison. Underlined entries indicate accuracy values larger than 0.5.

(a) Detection accuracy of different dimensionality reduction techniques across all simulated shifts on MNIST and CIFAR-10. Green bold entries indicate the best DR method at a given sample size, red italic the worst. Results for χ² and Bin tests are only reported once under the univariate category.
BBSDs performs best for univariate testing, while both UAE and TAE perform best for multivariate testing.

Test     DR        10    20    50    100   200   500   1,000 10,000
Univ.    NoRed     0.03  0.15  0.26  0.36  0.41  0.47  0.54  0.72
tests    PCA       0.11  0.15  0.30  0.36  0.41  0.46  0.54  0.63
         SRP       0.15  0.15  0.23  0.27  0.34  0.42  0.55  0.68
         UAE       0.12  0.16  0.27  0.33  0.41  0.49  0.56  0.77
         TAE       0.18  0.23  0.31  0.38  0.43  0.47  0.55  0.69
         BBSDs     0.19  0.28  0.47  0.47  0.51  0.65  0.70  0.79
χ²       BBSDh     0.03  0.07  0.12  0.22  0.22  0.40  0.46  0.57
Bin      Classif   0.01  0.03  0.11  0.21  0.28  0.42  0.51  0.67
Multiv.  NoRed     0.14  0.15  0.22  0.28  0.32  0.44  0.55  –
tests    PCA       0.15  0.18  0.33  0.38  0.40  0.46  0.55  –
         SRP       0.12  0.18  0.23  0.31  0.31  0.44  0.54  –
         UAE       0.20  0.27  0.40  0.43  0.45  0.53  0.61  –
         TAE       0.18  0.26  0.37  0.38  0.45  0.52  0.59  –
         BBSDs     0.16  0.20  0.25  0.35  0.35  0.47  0.50  –
(columns: number of samples from test)
Green bold shifts are identified as harmless, red italic shifts as harmful.

Test        Shift      10    20    50    100   200   500   1,000 10,000
Univariate  s gn       0.00  0.00  0.03  0.03  0.07  0.10  0.10  0.10
BBSDs       m gn       0.00  0.00  0.10  0.13  0.13  0.13  0.23  0.37
            l gn       0.17  0.27  0.53  0.63  0.67  0.83  0.87  1.00
            s img      0.00  0.00  0.23  0.30  0.40  0.63  0.70  0.93
            m img      0.30  0.37  0.60  0.67  0.70  0.80  0.90  1.00
            l img      0.30  0.50  0.70  0.70  0.77  0.87  0.97  1.00
            adv        0.13  0.27  0.40  0.43  0.53  0.77  0.83  0.90
            ko         0.00  0.00  0.07  0.07  0.07  0.33  0.40  0.70
            m img+ko   0.13  0.40  0.87  0.93  0.90  1.00  1.00  1.00
            oz+m img   0.67  1.00  1.00  1.00  1.00  1.00  1.00  1.00
Multivar.   s gn       0.03  0.03  0.03  0.03  0.03  0.07  0.07  –
UAE         m gn       0.03  0.03  0.03  0.03  0.17  0.27  0.30  –
            l gn       0.50  0.57  0.67  0.70  0.80  0.90  1.00  –
            s img      0.17  0.20  0.27  0.30  0.40  0.47  0.63  –
            m img      0.23  0.33  0.37  0.40  0.47  0.60  0.70  –
            l img      0.30  0.30  0.37  0.47  0.60  0.77  0.87  –
            adv        0.03  0.20  0.27  0.27  0.33  0.40  0.40  –
            ko         0.10  0.13  0.13  0.13  0.17  0.17  0.30  –
            m img+ko   0.20  0.30  0.37  0.53  0.54  0.63  0.87  –
            oz+m img   0.27  0.63  0.77  1.00  1.00  1.00  1.00  –
(columns: number of samples from test)

Table 2: Shift detection performance based on shift intensity (a) and perturbed sample percentages (b) using the best-performing DR technique (univariate: BBSDs, multivariate: UAE).
Underlined entries indicate accuracy values larger than 0.5.

(a) Detection accuracy of varying shift intensities.

Test     Intensity  10    20    50    100   200   500   1,000 10,000
Univ.    Small      0.00  0.00  0.14  0.14  0.18  0.36  0.40  0.54
         Medium     0.14  0.21  0.39  0.38  0.42  0.57  0.66  0.76
         Large      0.32  0.54  0.78  0.82  0.83  0.92  0.96  1.00
Multiv.  Small      0.11  0.11  0.12  0.14  0.20  0.23  0.33  –
         Medium     0.11  0.19  0.23  0.27  0.32  0.42  0.44  –
         Large      0.34  0.45  0.57  0.68  0.72  0.82  0.93  –
(columns: number of samples from test)

(b) Detection accuracy of varying shift percentages.

Test     Percentage 10    20    50    100   200   500   1,000 10,000
Univ.    10%        0.11  0.15  0.24  0.25  0.28  0.44  0.54  0.66
         50%        0.14  0.28  0.52  0.53  0.60  0.68  0.72  0.85
         100%       0.26  0.41  0.61  0.64  0.70  0.82  0.84  0.86
Multiv.  10%        0.12  0.13  0.21  0.26  0.27  0.31  0.44  –
         50%        0.19  0.27  0.41  0.41  0.47  0.57  0.60  –
         100%       0.29  0.41  0.44  0.53  0.60  0.70  0.78  –
(columns: number of samples from test)

easily extended to also double as a shift detector. Moreover, black-box models with soft predictions that were built and trained in the past can be turned into shift detectors retrospectively.

Shift Types: Table 1b lists shift detection accuracy values for each distinct shift as an increasing number of samples is obtained from the target domain. Specifically, we see that l gn, m img, l img, m img+ko, oz+m img, and even adv are easily detectable, many of them even with few samples, while s gn, m gn, and ko are hard to detect even with many samples.
With a few exceptions, the best DR technique (BBSDs for multiple univariate tests, UAE for multivariate tests) is significantly faster and more accurate at detecting shift than the average of all dimensionality reduction methods.

Shift Strength: Based on the results in Table 2a, we can conclude that small shifts (s gn, s img, and ko) are harder to detect than medium shifts (m gn, m img, and adv), which in turn are harder to detect than large shifts (l gn, l img, m img+ko, and oz+m img). Specifically, we see that large shifts can on average already be detected with better-than-chance accuracy at only 20 samples using BBSDs, while medium and small shifts require orders of magnitude more samples to achieve similar accuracy. Moreover, the results in Table 2b show that while target data containing only 10% anomalous samples is hard to detect, suggesting that this setting might be better addressed via outlier detection, perturbation percentages of 50% and 100% can already be detected with better-than-chance accuracy using 50 samples.

(a) Shift test (univ.) with 10% perturbed test data.

(b) Shift test (univ.) with 50% perturbed test data.

(c) Shift test (univ.) with 100% perturbed test data.

(d) Top different.

(e) Classification accuracy on 10% perturbed data.

(f) Classification accuracy on 50% perturbed data.

(g) Classification accuracy on 100% perturbed data.

(h) Top similar.

Figure 2: Shift detection results for medium image shift on MNIST. Subfigures (a)-(c) show the p-value evolution of the different DR methods with varying percentages of perturbed data, while subfigures (e)-(g) show the obtainable accuracies over the same perturbations. Subfigures (d) and (h) show the most different and most similar exemplars returned by the domain classifier across perturbation percentages.
Plots show mean values obtained over 5 random runs with a 1-σ error bar.

(a) Shift test (univ.) with shuffled sets containing images from all angles.

(b) Shift test (univ.) with angle-partitioned source and target sets.

(c) Top different.

(d) Classification accuracy on randomly shuffled sets containing images from all angles.

(e) Classification accuracy on angle-partitioned source and target sets.

(f) Top similar.

Figure 3: Shift detection results on COIL-100 dataset. Subfigure organization is similar to Figure 2.

Most Anomalous Samples and Shift Malignancy: Across all experiments, we observe that the most different and most similar examples returned by the domain classifier are useful in characterizing the shift. Furthermore, we can successfully distinguish malignant from benign shifts (as reported in Table 1b) by using the framework proposed in Section 3.4. While we recognize that having access to an external labeling function is a strong assumption and that accessing all true labels would be prohibitive at deployment, our experimental results also showed that, compared to the total sample size, two to three orders of magnitude fewer labeled examples suffice to obtain a good approximation of the (usually unknown) target
accuracy.

[Figure 4 panels: "Training set average for 6", "Test set average for 6", "Training set 6s — test set 6s".]

Figure 4: Difference plot for training and test set sixes.

Individual Examples: While full results with exact p-value evolution and anomalous samples are documented in the supplementary material, we briefly present two illustrative results in detail:

(a) Synthetic medium image shift on MNIST (Figure 2): From subfigures (a)-(c), we see that most methods are able to detect the simulated shift, with BBSDs being the quickest method for all tested perturbation percentages. We further observe in subfigures (e)-(g) that the (true) accuracy on samples from q increasingly deviates from the model's performance on source data from p as more samples are perturbed.
Since true target accuracy is usually unknown, we use the accuracy obtained on the top anomalous labeled instances returned by the domain classifier Classif. As we can see, these values significantly deviate from accuracies obtained on p, which is why we consider this shift harmful to the label classifier's performance.

(b) Rotation angle partitioning on COIL-100 (Figure 3): Subfigures (a) and (b) show that our testing framework correctly finds that the randomly shuffled dataset containing images from all angles exhibits no shift, while identifying the partitioned dataset as noticeably different. However, as we can see from subfigure (e), this shift does not harm the classifier's performance, meaning that the classifier can safely be deployed even when encountering this specific dataset shift.

Original Splits: According to our tests, the original split from the MNIST dataset appears to exhibit a dataset shift. After inspecting the most anomalous samples returned by the domain classifier, we observed that many of these samples depicted the digit 6. A mean-difference plot (see Figure 4) between sixes from the training set and sixes from the test set revealed that the training instances are rotated slightly to the right, while the test samples are drawn more open and centered. To further support this claim, we also carried out a two-sample KS test between the two sets of sixes in the input space and found that the two sets can conclusively be regarded as different with a p-value of 2.7 · 10⁻¹⁰, significantly undercutting the respective Bonferroni threshold of 6.3 · 10⁻⁵.
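This aggregated per-dimension KS test with a Bonferroni-corrected threshold can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact implementation; the function and variable names are our own, and we assume NumPy and SciPy. For flattened 28×28 MNIST images the corrected threshold is 0.05/784 ≈ 6.4 · 10⁻⁵, consistent with the value quoted above.

```python
import numpy as np
from scipy import stats


def multiple_univariate_ks_test(source, target, alpha=0.05):
    """Aggregate per-dimension two-sample KS tests via Bonferroni correction.

    source, target: arrays of shape (n_samples, n_dims), e.g. flattened
    28x28 images (n_dims = 784) or softmax outputs (n_dims = #classes).
    A shift is declared if the smallest per-dimension p-value falls below
    the Bonferroni-corrected significance level alpha / n_dims.
    """
    n_dims = source.shape[1]
    p_values = np.array([
        stats.ks_2samp(source[:, d], target[:, d]).pvalue
        for d in range(n_dims)
    ])
    threshold = alpha / n_dims  # e.g. 0.05 / 784 ~ 6.4e-5 for raw MNIST pixels
    return p_values.min(), bool(p_values.min() < threshold)
```

When the representation is a classifier's softmax output (as in BBSDs), n_dims equals the number of classes, so the Bonferroni correction is far less severe than in the raw input space.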
While this specific shift does not look particularly significant to the human eye (and is also declared harmless by our malignancy detector), it nevertheless shows that the original MNIST split is not i.i.d.

6 Conclusions

In this paper, we put forth a comprehensive empirical investigation, examining the ways in which dimensionality reduction and two-sample testing might be combined to produce a practical pipeline for detecting distribution shift in real-life machine learning systems. Our results yielded the surprising insights that (i) black-box shift detection with soft predictions works well across a wide variety of shifts, even when some of its underlying assumptions do not hold; (ii) aggregated univariate tests performed separately on each latent dimension offer shift detection performance comparable to multivariate two-sample tests; and (iii) harnessing predictions from domain-discriminating classifiers enables characterization of a shift's type and its malignancy. Moreover, we made the surprising observation that the MNIST dataset, despite ostensibly representing a random split, exhibits a significant (although not worrisome) distribution shift.

Our work suggests several open questions that might offer promising paths for future work, including (i) shift detection for online data, which would require us to account for and exploit the high degree of correlation between adjacent time steps [22]; and, since we have mostly explored a standard image classification setting for our experiments, (ii) applying our framework to other machine learning domains such as natural language processing or graphs.

Acknowledgements

We thank the Center for Machine Learning and Health, a joint venture of Carnegie Mellon University, UPMC, and the University of Pittsburgh, for supporting our collaboration with Abridge AI to develop robust models for machine learning in healthcare.
We are also grateful to Salesforce Research, Facebook AI Research, and Amazon AI for their support of our work on robust deep learning under distribution shift.

References

[1] Dimitris Achlioptas. Database-Friendly Random Projections: Johnson-Lindenstrauss with Binary Coins. Journal of Computer and System Sciences, 66, 2003.

[2] Alexander A Alemi, Ian Fischer, and Joshua V Dillon. Uncertainty in the Variational Information Bottleneck. arXiv Preprint arXiv:1807.00906, 2018.

[3] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility Theorems for Domain Adaptation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[4] J Martin Bland and Douglas G Altman. Multiple Significance Tests: The Bonferroni Method. BMJ, 1995.

[5] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to End Learning for Self-Driving Cars. arXiv Preprint arXiv:1604.07316, 2016.

[6] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: Identifying Density-Based Local Outliers. In ACM SIGMOD Record, 2000.

[7] Yee Seng Chan and Hwee Tou Ng. Word Sense Disambiguation with Distribution Estimation. In International Joint Conference on Artificial Intelligence (IJCAI), 2005.

[8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys (CSUR), 2009.

[9] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.

[10] Hyunsun Choi and Eric Jang. Generative Ensembles for Robust Anomaly Detection.
arXiv Preprint arXiv:1810.01392, 2018.

[11] François Chollet et al. Keras. https://keras.io, 2015.

[12] Paul Covington, Jay Adams, and Emre Sargin. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.

[13] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations (ICLR), 2014.

[14] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.

[15] Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate Shift by Kernel Mean Matching. Journal of Machine Learning Research (JMLR), 2009.

[16] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research (JMLR), 2012.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

[18] Nicholas A Heard and Patrick Rubin-Delanchy. Choosing Between Methods of Combining p-Values. Biometrika, 2018.

[19] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-Of-Distribution Examples in Neural Networks. In International Conference on Learning Representations (ICLR), 2017.

[20] Dan Hendrycks, Mantas Mazeika, and Thomas G Dietterich. Deep Anomaly Detection with Outlier Exposure. In International Conference on Learning Representations (ICLR), 2019.

[21] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al.
Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29, 2012.

[22] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, Nonparametric, Non-Asymptotic Confidence Sequences. arXiv Preprint arXiv:1810.08240, 2018.

[23] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, Citeseer, 2009.

[24] Paras Lakhani and Baskaran Sundaram. Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology, 284, 2017.

[25] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86, 1998.

[26] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training Confidence-Calibrated Classifiers for Detecting Out-Of-Distribution Samples. In International Conference on Learning Representations (ICLR), 2018.

[27] Ping Li, Trevor J Hastie, and Kenneth W Church. Very Sparse Random Projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2006.

[28] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the Reliability of Out-Of-Distribution Image Detection in Neural Networks. In International Conference on Learning Representations (ICLR), 2018.

[29] Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and Correcting for Label Shift with Black Box Predictors. In International Conference on Machine Learning (ICML), 2018.

[30] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation Forest. In International Conference on Data Mining (ICDM), 2008.

[31] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer Feature Learning with Joint Distribution Adaptation.
In International Conference on Computer Vision (ICCV), 2013.

[32] Thomas M Loughin. A Systematic Comparison of Methods for Combining p-Values from Independent Tests. Computational Statistics & Data Analysis, 2004.

[33] Markos Markou and Sameer Singh. Novelty Detection: A Review: Part 1: Statistical Approaches. Signal Processing, 2003.

[34] Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-100). 1996.

[35] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. 2011.

[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[37] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A Wasserman. On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions. In Association for the Advancement of Artificial Intelligence (AAAI), 2015.

[38] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 2002.

[39] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery. In International Conference on Information Processing in Medical Imaging, 2017.

[40] Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John Shawe-Taylor, and John C Platt. Support Vector Method for Novelty Detection. In Advances in Neural Information Processing Systems (NIPS), 2000.

[41] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On Causal and Anticausal Learning. In International Conference on Machine Learning (ICML), 2012.

[42] D Sculley, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine Learning: The High-Interest Credit Card of Technical Debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.

[43] Alireza Shafaei, Mark Schmidt, and James J Little. Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of Outlier Detectors. arXiv Preprint arXiv:1809.04729, 2018.

[44] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-Of-Distribution Detection Using Multiple Semantic Label Representations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[45] Hidetoshi Shimodaira. Improving Predictive Inference Under Covariate Shift by Weighting the Log-Likelihood Function. Journal of Statistical Planning and Inference, 2000.

[46] R John Simes. An Improved Bonferroni Procedure for Multiple Tests of Significance. Biometrika, 1986.

[47] Zak Stone, Todd Zickler, and Trevor Darrell. Autotagging Facebook: Social Network Context Improves Photo Annotation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2008.

[48] Amos Storkey. When Training and Test Sets Are Different: Characterizing Learning Transfer. Dataset Shift in Machine Learning, 2009.

[49] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation. In Advances in Neural Information Processing Systems (NIPS), 2008.

[50] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural Networks.
In Advances in Neural Information Processing Systems (NIPS), 2014.

[51] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing Properties of Neural Networks. In International Conference on Learning Representations (ICLR), 2014.

[52] Charles Truong, Laurent Oudre, and Nicolas Vayatis. A Review of Change Point Detection Methods. arXiv Preprint arXiv:1801.00718, 2018.

[53] Vladimir Vovk and Ruodu Wang. Combining p-Values via Averaging. arXiv Preprint arXiv:1212.4966, 2018.

[54] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017.

[55] Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated Product Method for Combining p-Values. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 2002.

[56] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain Adaptation Under Target and Conditional Shift. In International Conference on Machine Learning (ICML), 2013.

[57] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial Attacks on Neural Networks for Graph Data. In International Conference on Knowledge Discovery & Data Mining (KDD), 2018.