{"title": "A Simple Cache Model for Image Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 10107, "page_last": 10116, "abstract": "Training large-scale image recognition models is computationally expensive. This raises the question of whether there might be simple ways to improve the test performance of an already trained model without having to re-train or fine-tune it with new data. Here, we show that, surprisingly, this is indeed possible. The key observation we make is that the layers of a deep network close to the output layer contain independent, easily extractable class-relevant information that is not contained in the output layer itself. We propose to extract this extra class-relevant information using a simple key-value cache memory to improve the classification performance of the model at test time. Our cache memory is directly inspired by a similar cache model previously proposed for language modeling (Grave et al., 2017). This cache component does not require any training or fine-tuning; it can be applied to any pre-trained model and, by properly setting only two hyper-parameters, leads to significant improvements in its classification performance. Improvements are observed across several architectures and datasets. In the cache component, using features extracted from layers close to the output (but not from the output layer itself) as keys leads to the largest improvements. Concatenating features from multiple layers to form keys can further improve performance over using single-layer features as keys. 
The cache component also has a regularizing effect, a simple consequence of which is that it substantially increases the robustness of models against adversarial attacks.", "full_text": "A Simple Cache Model for Image Recognition\n\nEmin Orhan\n\naeminorhan@gmail.com\n\nBaylor College of Medicine & New York University\n\nAbstract\n\nTraining large-scale image recognition models is computationally expensive.\nThis raises the question of whether there might be simple ways to improve\nthe test performance of an already trained model without having to re-train\nor \ufb01ne-tune it with new data. Here, we show that, surprisingly, this is\nindeed possible. The key observation we make is that the layers of a deep\nnetwork close to the output layer contain independent, easily extractable\nclass-relevant information that is not contained in the output layer itself.\nWe propose to extract this extra class-relevant information using a simple\nkey-value cache memory to improve the classi\ufb01cation performance of the\nmodel at test time. Our cache memory is directly inspired by a similar\ncache model previously proposed for language modeling (Grave et al., 2017).\nThis cache component does not require any training or \ufb01ne-tuning; it can\nbe applied to any pre-trained model and, by properly setting only two\nhyper-parameters, leads to signi\ufb01cant improvements in its classi\ufb01cation\nperformance. Improvements are observed across several architectures and\ndatasets. In the cache component, using features extracted from layers close\nto the output (but not from the output layer itself) as keys leads to the\nlargest improvements. 
Concatenating features from multiple layers to form\nkeys can further improve performance over using single-layer features as keys.\nThe cache component also has a regularizing e\ufb00ect, a simple consequence\nof which is that it substantially increases the robustness of models against\nadversarial attacks.\n\nIntroduction\n\n1\nDeep neural networks are currently the state of the art models in a wide range of image\nrecognition problems. In the standard supervised learning setting, training these models\ntypically requires a large number of labeled examples. This causes at least two potential\nproblems. First, the large model and training set sizes make it computationally expensive to\ntrain these models. Thus, it would be desirable to ensure that the model\u2019s performance is as\ngood as it can be, given a particular budget of training data and training time. Secondly,\nthese models might have di\ufb03culties in cases where correct classi\ufb01cation depends on the\ndetection of rare but distinctive features in an image that do not occur frequently enough in\nthe training data. In this paper, we propose a method that addresses both of these problems.\nOur key observation is that the layers of a deep neural network close to the output layer contain\nindependent, easily extractable class-relevant information that is not already contained in\nthe output layer itself. We propose to extract this extra class-relevant information with a\nsimple key-value cache memory that is directly inspired by Grave et al. [3], where a similar\ncache model was introduced in the context of language modeling.\nOur model addresses the two problems described above. 
First, by properly setting only\ntwo hyper-parameters, we show that a pre-trained model\u2019s performance at test time can\nbe improved significantly without having to re-train or even fine-tune it with new data.\nSecondly, storing rare but potentially distinctive features in a cache memory enables our\nmodel to successfully retrieve the correct class labels even when those features do not appear\nvery frequently in the training data.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFinally, we show that the cache memory also has a regularizing effect on the model. It\nachieves this effect by increasing the size of the input region over which the model behaves\nsimilarly to the way it behaves near training data, and prior work has shown that trained\nneural networks behave more regularly near training data than elsewhere in the input\nspace [11]. A useful consequence of this effect is the substantially improved robustness of\ncache models against adversarial attacks.\n\n2 Results\nOur cache component is conceptually very similar to the cache component proposed by Grave\net al. [3] in the context of language modeling. Following [3], we define a cache component by\na pair of key and value matrices (\u00b5, \u03c5). Here, \u00b5 is a d \u00d7 K matrix of keys, where K is the\nnumber of items stored in the cache and d is the dimensionality of the key vectors, and \u03c5 is a\nC \u00d7 K matrix of values, where C is the number of classes. We use \u00b5k and \u03c5k to denote the\nk-th column of \u00b5 and \u03c5, respectively.\nTo build a key matrix \u00b5, we pass the training data (or a subset of it) through an already\ntrained network. 
The key vector \u00b5k for a particular item xk in the training set is obtained\nby taking the activities of one or more layers of the network when xk is input to the network,\nvectorizing those activities, and concatenating them if more than one layer is used (Figure 1).\nWe then normalize the resulting key vector to have unit norm. The value vector \u03c5k is simply\nthe one-hot encoding of the class label for xk.\nGiven a test item x with label y (considered as a one-hot vector), its similarity with the\nstored items in the cache is computed as follows:\n\n\u03c3k(x) \u221d exp(\u03b8\u03c6(x)\u22a4\u00b5k)\n\n(1)\n\nWe use \u03c6(x) to denote the representation of x in the layers used in the cache component. A\ndistribution over labels is obtained by taking a weighted average of the values stored in the\ncache:\n\npmem(y|x) = (\u2211k \u03c5k\u03c3k(x)) / (\u2211k \u03c3k(x))\n\n(2)\n\nThe hyper-parameter \u03b8 in Equation 1 controls the sharpness of this distribution, with larger\n\u03b8 values producing sharper distributions. This distribution is then combined with the usual\nforward model implemented in the final softmax layer of the trained network:\n\np(y|x) = (1 \u2212 \u03bb)pnet(y|x) + \u03bbpmem(y|x)\n\n(3)\n\nHere, \u03bb controls the overall weight of the cache component in the model. The model thus\nhas only two hyper-parameters, i.e. \u03b8 and \u03bb, which we optimize through a simple grid-search\nprocedure on held-out validation data (we search over the ranges 10 \u2264 \u03b8 \u2264 90 and\n0.1 \u2264 \u03bb \u2264 0.9).\n\nFigure 1: Schematic diagram of the cache model.\n\nExtracting additional class-relevant information from a pre-trained model\nAs our baseline models, we used deep ResNet [5] and DenseNet [7] models trained on the\nCIFAR-10, CIFAR-100, and ImageNet (ILSVRC2012) datasets. Standard data augmentation\nwas used to double the training set size for the CIFAR-10 and CIFAR-100 datasets. 
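The cache computation in Equations 1-3 amounts to a normalized dot-product lookup. A minimal NumPy sketch follows; the function names and the feature extractor are illustrative placeholders, not code from the paper:

```python
import numpy as np

def build_cache(features, labels, num_classes):
    """Build the cache's (key, value) matrices from stored training items.

    features: (K, d) array of layer activations, one row per cached item.
    labels:   (K,) array of integer class labels.
    Returns keys as a (d, K) matrix with unit-norm columns and values as a
    (C, K) matrix of one-hot columns, mirroring the paper's (mu, upsilon).
    """
    keys = features.T / np.linalg.norm(features, axis=1)   # normalize each column
    values = np.eye(num_classes)[labels].T                 # one-hot value columns
    return keys, values

def cache_predict(phi_x, keys, values, theta):
    """Equations 1-2: similarity-weighted average of the cached values."""
    sims = np.exp(theta * (keys.T @ phi_x))   # sigma_k(x), up to normalization
    return (values @ sims) / sims.sum()       # p_mem(y | x)

def combined_predict(p_net, phi_x, keys, values, theta, lam):
    """Equation 3: linear combination of network and cache predictions."""
    return (1.0 - lam) * p_net + lam * cache_predict(phi_x, keys, values, theta)
```

In this sketch `phi_x` stands for the (unit-normalized) layer activations of the test item; `theta` and `lam` would be chosen by the grid search over 10 \u2264 \u03b8 \u2264 90 and 0.1 \u2264 \u03bb \u2264 0.9 described above.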
For\nCIFAR-10 and CIFAR-100, we used the training set to train the models, the validation set\nto optimize the hyper-parameters of the cache models and finally reported the error rates\non the test set.\n\nModel | Params | C-10+ | C-100+ | ImageNet\nResNet20 (\u03bb = 0) | 0.27M | 8.33 | 32.66 | \u2013\nResNet20-Cache3 | 0.27M | 7.58 | 29.18 | \u2013\nResNet20-Cache3-CacheOnly (\u03bb = 1) | 0.27M | 11.87 | 39.18 | \u2013\nResNet32 (\u03bb = 0) | 0.46M | 7.74 | 32.94 | \u2013\nResNet32-Cache3 | 0.46M | 7.01 | 29.36 | \u2013\nResNet32-Cache3-CacheOnly (\u03bb = 1) | 0.46M | 11.13 | 38.40 | \u2013\nResNet56 (\u03bb = 0) | 0.85M | 8.74 | 31.11 | \u2013\nResNet56-Cache3 | 0.85M | 8.11 | 27.99 | \u2013\nResNet56-Cache3-CacheOnly (\u03bb = 1) | 0.85M | 13.05 | 34.06 | \u2013\nDenseNet40 (\u03bb = 0) | 1M | 5.75 | 27.08 | \u2013\nDenseNet40-Cache2 | 1M | 5.44 | 25.25 | \u2013\nDenseNet40-Cache2-CacheOnly (\u03bb = 1) | 1M | 8.68 | 37.64 | \u2013\nDenseNet100 (\u03bb = 0) | 7M | 5.08 | 22.62 | \u2013\nDenseNet100-Cache1 | 7M | 4.92 | 21.95 | \u2013\nDenseNet100-Cache1-CacheOnly (\u03bb = 1) | 7M | 7.44 | 32.74 | \u2013\nResNet50 (\u03bb = 0) | 25.6M | \u2013 | \u2013 | 30.98\nResNet50-Cache1 | 25.6M | \u2013 | \u2013 | 30.42\nResNet50-Cache1-CacheOnly (\u03bb = 1) | 25.6M | \u2013 | \u2013 | 41.24\n\nTable 1: Error rates of different models on CIFAR-10, CIFAR-100, and ImageNet datasets\n(+ indicates the standard data augmentation for the CIFAR datasets). In the cache models,\nthe number next to \u201cCache\u201d represents the number of layers used for constructing the key\nvectors: e.g. \u201cCache3\u201d means 3 different layers were concatenated in creating the key vectors.\nThe results for ImageNet are top-1 error rates. We did not run separate layer searches for the\ncache-only models (\u201cCacheOnly\u201d); these models used the same layers as the corresponding\nlinear-combination cache models (the hyper-parameter \u03b8, however, was optimized separately\nfor these models).\n\n
For the ImageNet dataset, we took a pre-trained ResNet50 model, split the\nvalidation set into two, used the \ufb01rst half to optimize the hyper-parameters of the cache\nmodels and reported the error rates on the second half.\nWe compared the performance of the baseline models with the performance of two types of\ncache model. The \ufb01rst one is the model described in Equation 3 above, where the predictions\nof the cache component and the network are linearly combined. The second type of model is\na cache-only model where we just use the predictions of the cache component, i.e. we set\n\u03bb = 1 in Equation 3. This cache-only model thus has a single hyper-parameter, \u03b8, that has\nto be optimized (Equation 1).\nFor the CIFAR-10 and CIFAR-100 datasets, we used all items in the (augmented) training\nset to generate the keys stored in the cache (90K items in total). For the ImageNet dataset,\nusing the entire training set was not computationally feasible, hence we used a random\nsubset of 275K items (275 items per class) from the training set to generate the keys. This\ncorresponds to approximately 22% of the training set. For the cache models where activations\nin only a single layer were used as keys, the optimal layer was chosen with cross-validation\non held-out data by sweeping through all layers in the network. In cases where activations in\nmultiple layers were concatenated to generate the keys, sweeping through all combinations\nof layers was not feasible, hence we used manual exploration to search over the space of\nrelevant combinations of layers. 
Whenever possible, we tried combinations of up to 3 layers\nin the network and report the test performance of the model that yielded the best accuracy\non the validation data.\nTable 1 shows the error rates of the models with or without a cache component on the three\ndatasets.1 In all cases, the cache model that linearly combines the predictions of the network\nand the cache component has the highest accuracy. The fact that cache models perform\nsignificantly better than the baseline models, which only use the output layer of the network\nto make predictions, suggests that layers other than the output layer contain independent\nclass-relevant information and that this extra information can be easily read out using a\nsimple continuous key-value memory (Equations 1-3).\n\n1Table 1 reports the error rates to make comparison with previous results on these benchmark\ndatasets easier. In the remainder of the paper, we will report classification accuracies instead.\n\nFigure 2: When used as key vectors in the cache component, layers close to the output, but\nnot the output layer itself, lead to the largest improvements in accuracy. Here, test accuracy\nis plotted against the normalized layer index (with 0 indicating the input layer, i.e. the\nimage, and 1 indicating the output layer) for different models. The best value for each model\nis indicated with a star symbol (\u22c6). Results are shown for the (a) CIFAR-10, (b) CIFAR-100,\nand (c) ImageNet datasets. In (c), only the higher layers were low-dimensional enough to be\nused as key vectors, hence results are shown only for mid-network and upper layers. In these\nexperiments, we set the hyper-parameters to the middle of the ranges over which they were\nallowed to vary.\n\nTo find out which layers contained this extra class-relevant information, we tried using\ndifferent layers of the network, all the way from the input layer, i.e. 
the image itself, to\nthe output layer, as key vectors in the cache component. This analysis showed that layers\nclose to the output contained most of the extra class-relevant information (Figure 2). Using\nthe output layer itself in the cache component resulted in test performance comparable to\nthat of the baseline model. Using the input layer, on the other hand, generally resulted in\nworse test performance than the baseline (Figure 2a-b). This is presumably because when\nthe input layer, or other low-level features in general, are used as key vectors, the model\nbecomes susceptible to surface similarities that are not relevant for the classi\ufb01cation task.\nHigher-level features, on the other hand, are less likely to be a\ufb00ected by such super\ufb01cial\nsimilarities.\nThe cache-only models performed worse than the baseline models (Table 1, \u201cCacheOnly\u201d).\nHowever, as we show below, these models turned out to be more robust to input perturbations\nthan both the baseline models and the linear-combination cache models. Hence, they may be\npreferred in cases where it is desirable to trade o\ufb00 a certain amount of accuracy for improved\nrobustness.\nTo investigate the e\ufb00ect of cache size on the models\u2019 performance, we varied the cache\nsize from 0% of the training data (i.e. no cache) to 100% of the training data (i.e. using\nthe entire training data for the cache). Importantly, we optimized the hyper-parameters\nof the cache models separately for each cache size. Representative results are shown in\nFigure 3 for ResNet32 models trained on CIFAR-10 and CIFAR-100. 
We observed significant\nimprovements in test accuracy over the baseline model even with small cache sizes and the\nperformance of the cache models increased steadily with the cache size.\n\nFigure 3: The effect of cache size on the test accuracy of a representative model trained on\nCIFAR-10 (a) and CIFAR-100 (b). Shaded regions represent \u00b11 s.e.m. over 2 independent\nruns. Note that 0% on the x-axis corresponds to the baseline model with no cache component.\n\nCache component improves the robustness of image recognition models\n\nOne possible way to view the cache component is as implementing a prior over images that\nfavors images similar to those stored in the cache memory, i.e. training images, where the\nrelevant notion of similarity is based on some high-level features of the images. The cache\ncomponent makes a model\u2019s response to new images more similar to its response to the\ntraining images. It has been shown that trained deep neural networks behave more regularly\nin the neighborhood of training data than elsewhere in the input space [11]: they have\nsmaller Jacobian norms and a smaller number of \u201clinear regions\u201d near the training data,\nwhich enables decreased sensitivity to perturbations and hence better generalization in the\nneighborhood of training points in the input space. This suggests that by effectively imposing\na prior over the input space that favors the training data, a cache component may extend the\nrange over which the model behaves regularly and hence improve its generalization behavior\noutside the local neighborhood of the training data. Potentially, this also includes improved\nrobustness against adversarially generated inputs.\nTo test this hypothesis, we conducted several experiments. 
First, we ran a number of\ngradient-based and decision-based adversarial attacks against a baseline model (ResNet32)\nand tested the efficacy of the resulting adversarial images against the cache models (a similar\nprocedure was recently used in [12]). Specifically, we considered the following four adversarial\nattacks.\nFast gradient sign method (FGSM): This is a standard gradient-based attack [2] where,\nstarting from a test image, we take a single step in the direction of the componentwise sign of\nthe gradient vector scaled by a step size parameter \u03b5. The step size \u03b5 is gradually increased\nfrom 0 to a maximum value of 0.5 until the model\u2019s prediction of the class label of the\nperturbed image changes. The image is discarded if no \u03b5 yields an adversarial image.\nIterative FGSM (I-FGSM): This attack is similar to FGSM, but instead of taking a\nsingle gradient step, we take 10 steps for each \u03b5 value [9].\nSingle pixel attack (SP): In this attack, a single pixel of the image is set to the maximum\nor minimum pixel value across the image [15]. All pixels are tried one by one until the\nperturbed image is classified differently than the original image by the model. If no such\npixel is found, the image is discarded.\nGaussian blur attack (GB): The image is blurred with a Gaussian filter with standard\ndeviation \u03b5 that is increased gradually from 0 to a maximum value of max(w, h), where w and\nh are the dimensions of the image in pixels, until the blurred image is classified differently\nby the model.\nWe applied each attack to 250 randomly selected test images from CIFAR-10. The attacks\nwere all implemented with Foolbox [14]. For each image, we measured the relative size of\nthe minimum perturbation required to generate an adversarial image as follows [10]:\n\n\u03c1adv(x) = ||xadv \u2212 x|| / ||x||\n\n(4)\n\nwhere x denotes the original test image, xadv is the adversarial image generated from x and\nthe norm represents the Euclidean norm. Table 2 reports the mean \u03c1adv values for the four\nattacks. A smaller \u27e8\u03c1adv\u27e9 value indicates that a smaller perturbation is sufficient to fool the\nbaseline model.\n\nModel | FGSM | I-FGSM | SP | GB\n\u27e8\u03c1adv\u27e9 | 0.064 | 0.014 | 0.044 | 0.224\nResNet32 (\u03bb = 0) | 5.2 | 5.2 | 9.8 | 2.8\nResNet32-Cache3 | 58.8 | 68.4 | 72.8 | 56.0\nResNet32-Cache3-CacheOnly (\u03bb = 1) | 72.0 | 83.2 | 68.6 | 61.3\n\nTable 2: Classification accuracies of different models on adversarial images generated from\nthe CIFAR-10 test set by four different attack methods applied to the baseline ResNet32\nmodel. The \u27e8\u03c1adv\u27e9 row gives the mean minimum perturbation size for each attack.\n\nModel | FGSM | I-FGSM | SP | GB\nResNet32 (\u03bb = 0) | 0.064 | 0.014 | 0.044 | 0.224\nResNet32-Cache3 | 0.083 | 0.020 | 0.149 | 0.299\nResNet32-Cache3-CacheOnly (\u03bb = 1) | \u2013 | \u2013 | \u2013 | \u2013\n\nTable 3: Mean minimum perturbation sizes, \u27e8\u03c1adv\u27e9, needed for generating adversarial images\nfrom the CIFAR-10 test set using direct white-box attacks on the baseline and the cache\nmodels. For the cache-only model, we were not able to generate any adversarial images using\nany of the attacks. This is indicated with a \u2018\u2013\u2019 in the table above.\n\nThe low classification accuracies for the baseline model (ResNet32) in Table 2 demonstrate\nthat the attacks are indeed effective at generating adversarial images that fool the baseline\nmodel. However, different attack methods require different minimum perturbation sizes, as\nindicated by the varying \u27e8\u03c1adv\u27e9 values. When the same baseline model is combined with a\ncache component (ResNet32-Cache3), the classification accuracy on the adversarial images\nincreases significantly. Remarkably, a cache-only model (ResNet32-Cache3-CacheOnly) based\non the same baseline model performs even better. 
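The attack protocol and the perturbation measure of Equation 4 can be made concrete with a schematic FGSM loop. The `predict`/`loss_grad` interface below is a hypothetical stand-in for a real model; the experiments above used Foolbox implementations:

```python
import numpy as np

def fgsm_attack(x, y, predict, loss_grad, eps_grid):
    """Single-step FGSM search as described above: perturb test image x in the
    direction of the sign of the input gradient, increasing the step size eps
    until the predicted label changes; return None if no eps works.

    predict(x)      -> predicted class label (hypothetical model interface).
    loss_grad(x, y) -> gradient of the loss w.r.t. the input pixels.
    """
    direction = np.sign(loss_grad(x, y))
    for eps in eps_grid:                              # e.g. 0 ... 0.5
        x_adv = np.clip(x + eps * direction, 0.0, 1.0)
        if predict(x_adv) != y:
            return x_adv                              # first eps that fools the model
    return None                                       # image is discarded

def rho_adv(x, x_adv):
    """Equation 4: relative size of the adversarial perturbation."""
    return np.linalg.norm(x_adv - x) / np.linalg.norm(x)
```

The iterative variant (I-FGSM) would replace the single step inside the loop with 10 smaller gradient steps per eps value.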
This is presumably because the cache-only\nmodel alters the decision boundaries of the underlying baseline model more drastically than\nthe linear-combination model. These results show that adversarial images generated for the\nbaseline model do not transfer over to the cache models.\nSecondly, we also ran direct white-box attacks on the cache models. Table 3 compares the\nmean minimum perturbation sizes, \u27e8\u03c1adv\u27e9, needed for generating adversarial images from\nthe CIFAR-10 test set in different models. In general, generating adversarial images for the\ncache models proved to be much more difficult than generating adversarial images for the\nbaseline model. For the cache-only model (ResNet32-Cache3-CacheOnly), remarkably, we\nwere not able to generate any adversarial images using direct white-box attacks. For the\nlinear-combination cache model (ResNet32-Cache3), we were able to generate adversarial\nimages, but the resulting adversarial images had larger minimum perturbation sizes than the\nadversarial images generated from the baseline model (Table 3). These results demonstrate\nthe enhanced robustness of the cache models against white-box adversarial attacks.\nAdversarial examples often transfer between models [2]: an example created for one model\noften fools another model as well. To determine whether adding a cache component would\nimpede the transferability of adversarial examples, we presented the adversarial examples\ngenerated for the baseline ResNet32 model above to ResNet20 models with or without a cache\ncomponent. The results are presented in Table 4 and show that adding a cache component\nto the transfer model significantly reduces the transferability of the adversarial examples (cf.\nResNet20 vs. 
ResNet20-Cache3).\n\nModel | FGSM | I-FGSM | SP | GB\nTransfer: ResNet20 (\u03bb = 0) | 77.6 | 89.2 | 70.6 | 54.8\nTransfer: ResNet20-Cache3 | 80.8 | 91.6 | 80.4 | 59.7\nTransfer: ResNet20-Cache3-CacheOnly (\u03bb = 1) | 73.5 | 88.0 | 75.6 | 55.6\n\nTable 4: Classification accuracies of ResNet20 variants with or without a cache component\non the adversarial examples generated for the ResNet32 baseline model. Adding a cache\ncomponent reduces the transferability of the adversarial examples.\n\nCache models behave more regularly near test points in the input space\nAs mentioned above, we hypothesized that adding a cache component to a model has a\nregularizing effect on the model\u2019s behavior, because the cache component extends the range\nin the input space over which the model behaves similarly to the way it behaves near training\npoints, and previous work has shown that neural networks behave more regularly near\ntraining points than elsewhere in the input space [11]. The results demonstrating the\nimproved robustness of the cache models against various types of adversarial attacks provide\nevidence for this hypothesis. To test this idea more directly, we calculated the input-output\nJacobian, J(x) \u2261 \u2202p(y|x)/\u2202x, of different models with or without a cache component at all\ntest points of the CIFAR-10 dataset.\nWe found that, as predicted, both cache models lead to an overall decrease in the Jacobian\nnorm, ||J(x)||, averaged over all test points (Figure 4a, inset). The Jacobian norm at a\ngiven point can be considered as an estimate of the average sensitivity of the model to\nperturbations around that point and it was previously shown to be correlated with the\ngeneralization gap in neural networks, with smaller norms indicating smaller generalization\ngaps, hence better generalization performance [11].\nFigure 4a shows the mean singular values of the Jacobian for different models averaged over\nall test points in CIFAR-10. 
Compared to the baseline model, the linear-combination cache\nmodel (ResNet32-Cache3; green) signi\ufb01cantly reduces the \ufb01rst singular value but slightly\nincreases the lower-order singular values. On the other hand, although the cache-only model\ndoes not reduce the \ufb01rst singular value as much as the linear-combination cache model does,\nit produces a more consistent reduction in the singular values (ResNet32-CacheOnly; black).\nFigure 4b-c shows this di\ufb00erential behavior of the two cache models by plotting the Jacobian\nnorms at individual test points. In Figure 4b, the points are clustered below the diagonal\nfor high ||J(x)|| values and above the diagonal for low ||J(x)|| values, indicating that the\nlinear-combination cache model reduces the Jacobian norm in the \ufb01rst case, but increases it\nin the second case. In Figure 4c, on the other hand, the points are more consistently below\nthe diagonal, suggesting that the cache-only model reduces the Jacobian norm in both cases.\nThis pattern is consistent with the singular value pro\ufb01les shown in Figure 4a.\nWe conjecture that the di\ufb00erent singular value pro\ufb01les of the two cache models may be\nrelated to their di\ufb00erent generalization patterns. We have observed above that the linear-\ncombination cache model has a better test accuracy than the baseline and cache-only models\n(Table 1), but the cache-only model is more robust to adversarial perturbations (Table 2)\nthan the linear-combination cache model. If one considers the test accuracy as measuring\nthe within-sample or on-the-data-manifold generalization performance and the adversarial\naccuracy as measuring the out-of-sample or o\ufb00-the-data-manifold generalization performance\nof a model, the small \ufb01rst singular value may explain the superior test accuracy of the\nlinear-combination cache model, whereas the small lower order singular values may explain\nthe superior adversarial accuracy of the cache-only model. 
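The Jacobian quantities analyzed here can be reproduced for any classifier that maps inputs to probability vectors. Below is a small finite-difference sketch, a stand-in for whatever autodiff machinery one would actually use on the trained models:

```python
import numpy as np

def input_output_jacobian(f, x, eps=1e-5):
    """Central-difference estimate of J(x) = d p(y|x) / d x for a model f that
    maps an input vector to a probability vector. Returns a (C, d) matrix."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2.0 * eps))  # column for input dim i
    return np.stack(cols, axis=1)

def jacobian_stats(f, x):
    """Frobenius norm ||J(x)|| and singular values, as summarized in Figure 4."""
    J = input_output_jacobian(f, x)
    return np.linalg.norm(J), np.linalg.svd(J, compute_uv=False)
```

The Frobenius norm equals the square root of the sum of the squared singular values, which is why the per-point norms in Figure 4b-c and the singular value profiles in Figure 4a carry complementary information.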
We leave a fuller exploration of\nthis hypothesis to future work.\n\n3 Discussion\nIn this paper, we proposed a simple method to improve the classi\ufb01cation performance of\nlarge-scale image recognition models at test time. Our method relies on the observation that\nhigher layers of deep neural networks contain independent class-relevant information that is\nnot already contained in the output layer of the network and this extra information is in\nan easily extractable format. In this work, we have used a simple continuous cache-based\nkey-value memory to extract this information. This particular method has the advantage\nthat it does not require any re-training or \ufb01ne-tuning of the model. Moreover, we showed\nthat it also signi\ufb01cantly improves the robustness of the underlying model against adversarial\nattacks.\n\n7\n\n\fFigure 4: Cache models behave more regularly near test points in the input space. (a) Mean\nsingular values of the Jacobian at test points for three di\ufb00erent models. The mean Jacobian\nnorm (averaged over test points) for each model is indicated in the inset. (b) and (c) show\nthe Jacobian norms at individual test points.\n\nHowever, there may be alternative ways of extracting this extra class-relevant information.\nFor example, linear read-outs can be added on top of the layers containing this extra\ninformation and then trained on the training data either separately or in conjunction with\nread-outs from other layers, while the rest of the network remains \ufb01xed.\nWe used deep ResNet and DenseNet architectures in our simulations. These architectures were\nspeci\ufb01cally designed to make it easy to pass information unobstructed between layers [5\u20137].\nIt is thus noteworthy that we were able to extract a signi\ufb01cant amount of extra class-relevant\ninformation from the non-\ufb01nal layers of these networks with a relatively simple read-out\nmethod. 
We expect the gains from our method to be even larger in deep networks that are\nnot speci\ufb01cally designed to ease the information \ufb02ow between layers, e.g. networks without\nskip connections.\n\nRelated work\n\nOur cache model was directly inspired by the work of Grave et al. [3], where a conceptually\nvery similar model was \ufb01rst introduced in the context of language modeling. In language\nmodeling, prediction of rare words that do not occur frequently in the training corpus poses\na problem. Grave et al. [3] proposed extending the standard recurrent sequence-to-sequence\nmodels with a continuous key-value cache that stored the recent history of the recurrent\nactivations in the network as keys and the next words in the sequence as values. The primary\nmotivation in this work was to address the rare words problem using the basic idea that if a\nword that is overall rare in the corpus appeared recently, it is likely to appear again in the\nnear future. Hence, by storing the recent history of activations and the corresponding target\nwords, one can quickly retrieve the correct label from the recent context when the rare word\nreappears again.\nA similar motivation might apply in image recognition tasks as well. Although there is no\ntemporal context in these tasks, a similar \u201crare features\u201d problem arises in image recognition\ntoo: if correct classi\ufb01cation of an item depends on the detection of a set of distinctive features\nthat do not occur very frequently in the training data, standard image recognition models\ntrained end-to-end with gradient descent might have a di\ufb03culty, since these models would\ntypically require a large enough number of examples to learn the association between those\nfeatures and the correct label. 
By storing those features and the corresponding labels in a\ncache instead, we can quickly retrieve the correct label upon detection of the corresponding\nfeatures and hence circumvent the sample inefficiency of end-to-end training with gradient\ndescent.\nSimilar cache models have recently been proposed to improve the sample efficiency in\nreinforcement learning [13] and one-shot learning problems [8] as well. In these cases,\nhowever, the cache component (i.e. its key-value pairs) was trained jointly with the rest\nof the model, hence these models are different from our model and the model of [3] in this\nrespect.\n\n[Figure 4 appears here; its panel (a) inset reports E[||J(xtest)||] = 1.22, 0.86, and 1.06 for\nResNet32, ResNet32-Cache3, and ResNet32-CacheOnly, respectively.]\n\nTwo recent papers have used cache memories to improve the robustness of image recognition\nmodels against adversarial attacks [12, 18]. In [18], inputs are projected onto the data\nmanifold approximated by the convex hull of a set of features. These features, in turn, are\nformed from a set of \u201ccandidate\u201d items retrieved from a cache. Unlike our model, however,\nboth the features and the projection operator are trained jointly with the rest of the model\nand the model is further constrained to behave linearly on the convex hull through mixup\ntraining [17]. Overall, this model is significantly more complicated than the simple cache\nmodel proposed in this paper. However, the basic mechanism behind its improvement of\nrobustness against adversarial examples is similar to ours.\nOur model is more similar to the deep k-nearest neighbor (k-NN) model introduced in [12].\nIn this model, for a given test item, k nearest neighbors from the training data are retrieved\nbased on their representations at each layer of a deep network. 
The model\u2019s prediction for\nthe test item as well as a con\ufb01dence score are then computed based on the retrieved nearest\nneighbors and their labels. There are two main di\ufb00erences between our model and the deep\nk-NN model. First, the deep k-NN model retrieves k nearest neighbors (typically, k = 75\nnearest neighbors were used in the paper) and weighs them equally in the prediction and\ncon\ufb01dence computations, whereas our model uses a continuous cache that utilizes all the\nitems in the cache and weighs them by their similarity to the representation of the test item.\nThis is an important di\ufb00erence for the adversarial robustness of these two models, since\nk-NN models with a large k are known to be more robust against adversarial examples than\nk-NN models with a small k [16]. Secondly, the deep k-NN model uses representations at\nall layers in retrieving the nearest neighbors, whereas our model uses only a small number\nof layers close to the output of the network. We have presented evidence suggesting that\nusing the earlier layers might adversely a\ufb00ect the generalization performance of the model\nby making it vulnerable to surface similarities that are not relevant for the classi\ufb01cation task\n(Figure 2).\n\nFuture directions\n\nFor the sake of simplicity, we have used the entire layer activations as key vectors in our cache\nmodel. Since these vectors are likely to be redundant, a more e\ufb03cient alternative would be\nto apply a dimensionality reduction method \ufb01rst before storing these vectors as keys in the\ncache component. The simplest such method would be using random projections [1], which\nhas favorable theoretical properties and is easy to implement. This would allow us to test\nlarger cache sizes and more layers in the key vectors. 
Using large cache sizes is especially important in problems with large training sets, such as ImageNet, as we empirically observed the cache size to be a more important factor affecting the generalization accuracy than the number of layers used in the key vectors. Relatedly, efficient nearest neighbor methods can be utilized to scale our model to effectively unbounded cache sizes, which would be useful under online learning and/or evaluation scenarios [4].

When more than one layer was used in the cache component, we selected the layers based on manual exploration. More principled ways of searching for combinations of layers to be used in the cache component should also improve the generalization performance of the cache models.

We have only considered classification tasks in this paper, but a continuous key-value cache component can be added to models performing other image-based tasks, such as object detection or segmentation, as well. However, different tasks might require different similarity measures (Equation 1) and different ways of combining the predictions of the cache component with the predictions of the underlying model (Equation 3). Video-based tasks are also obvious candidates for the application of a continuous cache component, as the original paper [3] that motivated our work also used it in a temporal task, i.e. sequence-to-sequence modeling.

Acknowledgments

I thank the staff at the High Performance Computing cluster at NYU, especially Shenglong Wang, for their excellent maintenance efforts and for their help with troubleshooting.

References

[1] Candès E, Tao T (2006) Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans Inf Theory 52, 5406–5425.

[2] Goodfellow IJ, Shlens J, Szegedy C (2014) Explaining and harnessing adversarial examples.
arXiv:1412.6572.

[3] Grave E, Joulin A, Usunier N (2017) Improving neural language models with a continuous cache. ICLR 2017, arXiv:1612.04426.

[4] Grave E, Cisse M, Joulin A (2017) Unbounded cache model for online language modeling with open vocabulary. NIPS 2017, arXiv:1711.02604.

[5] He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. arXiv:1512.03385.

[6] He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. arXiv:1603.05027.

[7] Huang G, Liu Z, Weinberger KQ, van der Maaten L (2016) Densely connected convolutional networks. CVPR 2017, arXiv:1608.06993.

[8] Kaiser L, Nachum O, Roy A, Bengio S (2017) Learning to remember rare events. ICLR 2017, arXiv:1703.03129.

[9] Kurakin A, Goodfellow IJ, Bengio S (2016) Adversarial examples in the physical world. arXiv:1607.02533.

[10] Moosavi-Dezfooli S-M, Fawzi A, Frossard P (2015) DeepFool: A simple and accurate method to fool deep neural networks. CVPR 2016, arXiv:1511.04599.

[11] Novak R, Bahri Y, Abolafia DA, Pennington J, Sohl-Dickstein J (2018) Sensitivity and generalization in neural networks: An empirical study. ICLR 2018, arXiv:1802.08760.

[12] Papernot N, McDaniel P (2018) Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv:1803.04765.

[13] Pritzel A, et al. (2017) Neural episodic control. ICML 2017, arXiv:1703.01988.

[14] Rauber J, Brendel W, Bethge M (2017) Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv:1707.04131.

[15] Su J, Vargas DV, Sakurai K (2017) One pixel attack for fooling deep neural networks. arXiv:1710.08864.

[16] Wang Y, Jha S, Chaudhuri K (2018) Analyzing the robustness of nearest neighbors to adversarial examples. ICML 2018, arXiv:1706.03922.

[17] Zhang H, Cisse M, Dauphin Y, Lopez-Paz D (2017) mixup: Beyond empirical risk minimization.
ICLR 2018, arXiv:1710.09412.

[18] Zhao J, Cho K (2018) Retrieval-augmented convolutional neural networks for improved robustness against adversarial examples. arXiv:1802.09502.