{"title": "Neural Nearest Neighbors Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1087, "page_last": 1098, "abstract": "Non-local methods exploiting the self-similarity of natural signals have been well studied, for example in image analysis and restoration. Existing approaches, however, rely on k-nearest neighbors (KNN) matching in a fixed feature space. The main hurdle in optimizing this feature space w.r.t. application performance is the non-differentiability of the KNN selection rule. To overcome this, we propose a continuous deterministic relaxation of KNN selection that maintains differentiability w.r.t. pairwise distances, but retains the original KNN as the limit of a temperature parameter approaching zero. To exploit our relaxation, we propose the neural nearest neighbors block (N3 block), a novel non-local processing layer that leverages the principle of self-similarity and can be used as building block in modern neural network architectures. We show its effectiveness for the set reasoning task of correspondence classification as well as for image restoration, including image denoising and single image super-resolution, where we outperform strong convolutional neural network (CNN) baselines and recent non-local models that rely on KNN selection in hand-chosen features spaces.", "full_text": "Neural Nearest Neighbors Networks\n\nTobias Pl\u00f6tz\n\nStefan Roth\n\nDepartment of Computer Science, TU Darmstadt\n\nAbstract\n\nNon-local methods exploiting the self-similarity of natural signals have been well\nstudied, for example in image analysis and restoration. Existing approaches,\nhowever, rely on k-nearest neighbors (KNN) matching in a \ufb01xed feature space.\nThe main hurdle in optimizing this feature space w. r. t. application performance\nis the non-differentiability of the KNN selection rule. 
To overcome this, we\npropose a continuous deterministic relaxation of KNN selection that maintains\ndifferentiability w. r. t. pairwise distances, but retains the original KNN as the limit\nof a temperature parameter approaching zero. To exploit our relaxation, we propose\nthe neural nearest neighbors block (N3 block), a novel non-local processing layer\nthat leverages the principle of self-similarity and can be used as building block\nin modern neural network architectures.1 We show its effectiveness for the set\nreasoning task of correspondence classi\ufb01cation as well as for image restoration,\nincluding image denoising and single image super-resolution, where we outperform\nstrong convolutional neural network (CNN) baselines and recent non-local models\nthat rely on KNN selection in hand-chosen features spaces.\n\n1\n\nIntroduction\n\nThe ongoing surge of convolutional neural networks (CNNs) has revolutionized many areas of ma-\nchine learning and its applications by enabling unprecedented predictive accuracy. Most network\narchitectures focus on local processing by combining convolutional layers and element-wise op-\nerations. In order to draw upon information from a suf\ufb01ciently broad context, several strategies,\nincluding dilated convolutions [49] or hourglass-shaped architectures [27], have been explored to\nincrease the receptive \ufb01eld size. Yet, they trade off context size for localization accuracy. Hence, for\nmany dense prediction tasks, e. g. in image analysis and restoration, stacking ever more convolutional\nblocks has remained the prevailing choice to obtain bigger receptive \ufb01elds [20, 22, 31, 39, 50].\nIn contrast, traditional algorithms in image restoration increase the receptive \ufb01eld size via non-local\nprocessing, leveraging the self-similarity of natural signals. They exploit that image structures tend to\nre-occur within the same image [53], giving rise to a strong prior for image restoration [28]. 
Hence,\nmethods like non-local means [6] or BM3D [9] aggregate information across the whole image to\nrestore a local patch. Here, matching patches are usually selected based on some hand-crafted notion\nof similarity, e. g. the Euclidean distance between patches of input intensities. Incorporating this kind\nof non-local processing into neural network architectures for image restoration has only very recently\nbeen considered [23, 47]. These methods replace the \ufb01ltering of matched patches with a trainable\nnetwork, while the feature space on which k-nearest neighbors selection is carried out is taken to be\n\ufb01xed. But why should we rely on a prede\ufb01ned matching space in an otherwise end-to-end trainable\nneural network architecture? In this paper, we demonstrate that we can improve non-local processing\nconsiderably by also optimizing the feature space for matching.\nThe main technical challenge is imposed by the non-differentiability of the KNN selection rule. To\novercome this, we make three contributions. First, we propose a continuous deterministic relaxation\n\n1Code and pretrained models are available at https://github.com/visinf/n3net/.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(a) Query and database\n\n(b) KNN selection (Eq. 2)\n\n(c) Stochastic NN (Eqs. 4 to 7)\n\n(d) Continuous NN (Eqs. 8 to 11)\n\nFigure 1. Illustration of nearest neighbors selection as paths on the simplex. The traditional KNN rule (b)\nselects corners of the simplex deterministically based on the distance of the database items xi to the query item\nq (a). Stochastic neighbors selection (c) performs a random walk on the corners, while our proposed continuous\nnearest neighbors selection (d) relaxes the weights of the database items into the interior of the simplex and\ncomputes a deterministic path. 
Depending on the temperature parameter this path can interpolate between a\nmore uniform weighting (red) and the original KNN selection (blue).\n\nof the KNN rule, which allows differentiating the output w. r. t. pairwise distances in the input\nspace, such as between image patches. The strength of the novel relaxation can be controlled by\na temperature parameter whose gradients can be obtained as well. Second, from our relaxation\nwe develop a novel neural network layer, called neural nearest neighbors block (N3 block), which\nenables end-to-end trainable non-local processing based on the principle of self-similarity. Third, we\ndemonstrate that the accuracy of image denoising and single image super-resolution (SISR) can be\nimproved signi\ufb01cantly by augmenting strong local CNN architectures with our novel N3 block, also\noutperforming strong non-local baselines. Moreover, for the task of correspondence classi\ufb01cation,\nwe obtain signi\ufb01cant improvements by simply augmenting a recent neural network baseline with our\nN3 block, showing its effectiveness on set-valued data.\n\n2 Related Work\n\nAn important branch of image restoration techniques is comprised of non-local methods [6, 9, 28, 54],\ndriven by the concept of self-similarity. They rely on similar structures being more likely to encounter\nwithin an image than across images [53]. For denoising, the non-local means algorithm [6] averages\nnoisy pixels weighted by the similarity of local neighborhoods. The popular BM3D method [9]\ngoes beyond simple averaging by transforming the 3D stack of matching patches and employing a\nshrinkage function on the resulting coef\ufb01cients. Such transform domain \ufb01ltering is also used in other\nimage restoration tasks, e. g. single image super-resolution [8]. More recently, Yang and Sun [47]\npropose to learn the domain transform and activation functions. Lefkimmiatis [23, 24] goes further\nby chaining multiple stages of trained non-local modules. 
All of these methods, however, keep the\nstandard KNN matching in \ufb01xed feature spaces. In contrast, we propose to relax the non-differentiable\nKNN selection rule in order to obtain a fully end-to-end trainable non-local network.\nRecently, non-local neural networks have been proposed for higher-level vision tasks such as object\ndetection or pose estimation [42] and, with a recurrent architecture, for low-level vision tasks [26].\nWhile also learning a feature space for distance calculation, their aggregation is restricted to a single\nweighted average of features, a strategy also known as (soft) attention. Our differentiable nearest\nneighbors selection generalizes this; our method can recover a single weighted average by setting k=1.\nAs such, our novel N3 block can potentially bene\ufb01t other methods employing weighted averages, e. g.\nfor visual question answering [45] and more general learning tasks like modeling memory access\n[14] or sequence modeling [40]. Weighted averages have also been used for building differentiable\nrelaxations of the k-nearest neighbors classi\ufb01er [13, 35, 41]. Note that the crucial difference to our\nwork is that we propose a differentiable relaxation of the KNN selection rule where the output is\na set of neighbors, instead of a single aggregation of the labels of the neighbors. Without using\nrelaxations, Weinberger and Saul [44] learn the distance metric underlying KNN classi\ufb01cation using\na max-margin approach. They rely on prede\ufb01ned target neighbors for each query item, a restriction\nthat we avoid.\nImage denoising. Besides improving the visual quality of noisy images, the importance of image\ndenoising also stems from the fact that image noise severely degrades the accuracy of downstream\ncomputer vision tasks, e. g. detection [10]. 
Moreover, denoising has been recognized as a core module for density estimation [2] and serves as a sub-routine for more general image restoration tasks in a flurry of recent work, e.g. [5, 36, 51]. Besides classical approaches [11, 37], CNN-based methods [18, 31, 50] have shown strong denoising accuracy over the past years.

3 Differentiable k-Nearest Neighbors

We first detail our continuous and differentiable relaxation of the k-nearest neighbors (KNN) selection rule. Here, we make only few assumptions on the data in order to derive a very general result that can be used with many kinds of data, including text or sets. In the next section, we will then define a non-local neural network layer based on our relaxation. Let us start by precisely defining KNN selection. Assume that we are given a query item q, a database of candidate items (x_i)_{i \in I} with indices I = {1, ..., M} for matching, and a distance metric d(\cdot,\cdot) between pairs of items. Assuming that q is not in the database, d yields a ranking of the database items according to their distance to the query. Let \pi_q : I \to I be a permutation that sorts the database items by increasing distance to q:

    \pi_q(i) < \pi_q(i') \Rightarrow d(q, x_i) \le d(q, x_{i'}),  \forall i, i' \in I.    (1)

The KNN of q are then given by the set of the first k items w.r.t. the permutation \pi_q:

    KNN(q) \equiv \{ x_i \mid \pi_q(i) \le k \}.    (2)

The KNN selection rule is deterministic but not differentiable, which effectively hinders deriving gradients w.r.t. the distances d(\cdot,\cdot). We will alleviate this problem in two steps. 
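For concreteness, the hard selection rule of Eqs. (1) and (2) can be stated in a few lines. The following NumPy sketch (toy data and function names are illustrative) makes the non-differentiable step explicit: the argsort realizes the permutation \pi_q, and no gradient flows through it.

```python
import numpy as np

def knn_select(q, X, k):
    """Hard KNN rule (Eqs. 1-2): sort database items by distance to the
    query and keep the k closest. The argsort plays the role of pi_q and
    is the non-differentiable operation."""
    d = np.linalg.norm(X - q, axis=1)       # d(q, x_i) for all i in I
    order = np.argsort(d, kind="stable")    # pi_q as an index permutation
    return X[order[:k]]

# toy example: database of three 1-D items, query near 0
X = np.array([[3.0], [0.0], [1.0]])
q = np.array([0.2])
print(knn_select(q, X, 2))                  # the two closest items
```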
First, we interpret the deterministic KNN rule as the limit of a parametric family of discrete stochastic sampling processes. Second, we derive continuous relaxations for the discrete variables, thus allowing gradients to be backpropagated through the neighborhood selection while still preserving the KNN rule as a limit case.
KNN rule as limit distribution. We proceed by interpreting the KNN selection rule as the limit distribution of k categorical distributions that are constructed as follows. As in Neighborhood Component Analysis [13], let Cat(w^1 | \alpha^1, t) be a categorical distribution over the indices I of the database items, obtained by deriving logits \alpha^1_i from the negative distances to the query item d(q, x_i), scaled with a temperature parameter t. The probability of w^1 taking a value i \in I is given by

    P[w^1 = i | \alpha^1, t] \equiv Cat(\alpha^1, t) = \frac{\exp(\alpha^1_i / t)}{\sum_{i' \in I} \exp(\alpha^1_{i'} / t)},    (3)

    where \alpha^1_i \equiv -d(q, x_i).    (4)

Here, we treat w^1 as a one-hot coded vector and denote by w^1 = i that the i-th entry is set to one while the others are zero. In the limit of t \to 0, Cat(w^1 | \alpha^1, t) converges to a deterministic ("Dirac delta") distribution centered at the index of the database item with the smallest distance to q. Thus we can regard sampling from Cat(w^1 | \alpha^1, t) as a stochastic relaxation of 1-NN [13]. We now generalize this to arbitrary k by proposing an iterative scheme that constructs further conditional distributions Cat(w^{j+1} | \alpha^{j+1}, t). Specifically, we compute \alpha^{j+1} by setting the w^j-th entry of \alpha^j to negative infinity, thus ensuring that this index cannot be sampled again:

    \alpha^{j+1}_i \equiv \alpha^j_i + \log(1 - w^j_i) = \begin{cases} \alpha^j_i, & \text{if } w^j \neq i \\ -\infty, & \text{if } w^j = i. \end{cases}    (5)

The updated logits are used to define a new categorical distribution for the next index to be sampled:

    P[w^{j+1} = i | \alpha^{j+1}, t] \equiv Cat(\alpha^{j+1}, t) = \frac{\exp(\alpha^{j+1}_i / t)}{\sum_{i' \in I} \exp(\alpha^{j+1}_{i'} / t)}.    (6)

From the index vectors w^j, we define the stochastic nearest neighbors {X^1, . . . , X^k} of q as

    X^j \equiv \sum_{i \in I} w^j_i x_i.    (7)

When the temperature parameter t approaches zero, the distribution over the {X^1, . . . , X^k} becomes a deterministic distribution centered on the k nearest neighbors of q. Using these stochastic nearest neighbors directly within a deep neural network is problematic, since gradient estimators for expectations over discrete variables are known to suffer from high variance [33]. Hence, in the following we consider a continuous deterministic relaxation of the discrete random variables.
Continuous deterministic relaxation. Our basic idea is to replace the one-hot coded weight vectors with their continuous expectations. This yields a deterministic and continuous relaxation of the stochastic nearest neighbors that still converges to the hard KNN selection rule in the limit of t \to 0. Concretely, the expectation \bar{w}^1 of the first index vector w^1 is given by

    \bar{w}^1_i \equiv E[w^1_i | \alpha^1, t] = P[w^1 = i | \alpha^1, t].    (8)

We can now relax the update of the logits (Eq. 5) by using the expected weight vector instead of the discrete sample:

    \bar{\alpha}^{j+1}_i \equiv \bar{\alpha}^j_i + \log(1 - \bar{w}^j_i),  with  \bar{\alpha}^1_i \equiv \alpha^1_i.    (9)

The updated logits are then used in turn to calculate the expectation of the next index vector:

    \bar{w}^{j+1}_i \equiv E[w^{j+1}_i | \bar{\alpha}^{j+1}, t] = P[w^{j+1} = i | \bar{\alpha}^{j+1}, t].    (10)

Analogously to Eq. (7), we define the continuous nearest neighbors {\bar{X}^1, . . . , \bar{X}^k} of q using the \bar{w}^j as

    \bar{X}^j \equiv \sum_{i \in I} \bar{w}^j_i x_i.    (11)

In the limit of t \to 0, the expectation \bar{w}^1 of the first index vector approaches a one-hot encoding of the index of the closest neighbor. As a consequence, the logit update in Eq. (9) also converges to the hard update of Eq. (5). By induction, each \bar{w}^j converges to a one-hot encoding of the index of the j-th nearest neighbor. In summary, our continuous deterministic relaxation thus still contains the hard KNN selection rule as a limit case.
Discussion. Figure 1 shows the relation between deterministic KNN selection, stochastic nearest neighbors, and our proposed continuous nearest neighbors. Note that the continuous nearest neighbors are differentiable w.r.t. the pairwise distances as well as the temperature t, which allows making the temperature a trainable parameter. Moreover, the temperature can depend on the query item q, thus allowing the model to learn for which query items it is beneficial to average more uniformly across the database items, i.e. by choosing a high temperature, and for which query items the continuous nearest neighbors should stay close to the discrete nearest neighbors, i.e. by choosing a low temperature. Both cases have their justification. 
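The iteration of Eqs. (8) to (11) is short enough to state directly. The following NumPy sketch (function names and toy data are our own; Euclidean distance is assumed) computes the k continuous neighbors and, for small t, recovers the hard k nearest neighbors:

```python
import numpy as np

def softmax(a, t):
    # temperature-scaled softmax, shifted by the max for numerical stability
    z = np.exp((a - a.max()) / t)
    return z / z.sum()

def continuous_knn(q, X, k, t):
    """Continuous deterministic relaxation of KNN selection (Eqs. 8-11).

    Returns the k weighted averages Xbar^j = sum_i wbar^j_i * x_i; for
    t -> 0 they approach the hard k nearest neighbors of q."""
    alpha = -np.linalg.norm(X - q, axis=1)      # logits alpha^1_i = -d(q, x_i)
    neighbors = []
    for _ in range(k):
        w = softmax(alpha, t)                   # expected weights (Eqs. 8, 10)
        neighbors.append(w @ X)                 # continuous neighbor (Eq. 11)
        # relaxed logit update (Eq. 9); clamp to avoid log(0) when w_i -> 1
        alpha = alpha + np.log(np.maximum(1.0 - w, 1e-12))
    return np.stack(neighbors)

X = np.array([[0.0], [1.0], [3.0]])
q = np.array([0.1])
print(continuous_knn(q, X, k=2, t=1e-3))        # close to the two hard nearest neighbors
print(continuous_knn(q, X, k=2, t=10.0))        # more uniform weighted averages
```

Note how the low-temperature call reproduces hard KNN selection, while the high-temperature call yields interior points of the simplex, mirroring the two regimes of Figure 1.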
A more uniform averaging effectively allows to aggregate information\nfrom many neighbors at once. On the other hand, the more distinct neighbors obtained with a low\ntemperature allow to \ufb01rst non-linearly process the information before eventually fusing it.\nFrom Eq. (11) it becomes apparent that the continuous nearest neighbors effectively take k weighted\naverages over the database items. Thus, prior work such as non-local networks [42], differentiable\nrelaxations of the KNN classi\ufb01er [41], or soft attention-based architectures [14] can be realized as\na special case of our architecture with k = 1. We also experimented with a continuous relaxation\nof the stochastic nearest neighbors based on approximating the discrete distributions with Concrete\ndistributions [19, 30]. This results in a stochastic sampling of weighted averages as opposed to our\ndeterministic nearest neighbors. For the dense prediction tasks considered in our experiments, we\nfound the deterministic variant to give signi\ufb01cantly better results, see Sec. 5.1.\n\n4 Neural Nearest Neighbors Block\n\nIn the previous section we made no assumptions about the source of query and database items. Here,\nwe propose a new network block, called neural nearest neighbors block (N3 block, Fig. 2a), which\nintegrates our continuous and differentiable nearest neighbors selection into feed-forward neural\nnetworks based on the concept of self-similarity, i. e. query set and database are derived from the\nsame features (e. g., feature patches of an intermediate layer within a CNN). An N3 block consists of\ntwo important parts. First, an embedding network takes the input and produces a feature embedding\nas well as temperature parameters. These are used in a second step to compute continuous nearest\nneighbors feature volumes that are aggregated with the input. We interleave N3 blocks with existing\nlocal processing networks to form neural nearest neighbors networks (N3Net) as shown in Fig. 2b. 
In\nthe following, we take a closer look at the components of an N3 block and their design choices.\n\n4\n\n\f(a) N3 block\n\n(b) N3Net\n\nFigure 2. (a) In a neural nearest neighbors (N3) block (shaded box), an embedding network takes the output Y\nof a previous layer and calculates a pairwise distance matrix D between elements in Y as well as a temperature\nparameter (T , red feature layer) for each element. These are used to produce a stack of continuous nearest\nneighbors volumes N1, . . . , Nk (green), which are then concatenated with Y . We build an N3Net (b) by\ninterleaving common local processing networks (e. g., DnCNN [50] or VDSR [20]) with N3 blocks.\n\nEmbedding network. A \ufb01rst branch of the embedding network calculates a feature embedding\nE = fE(Y ). For image data, we use CNNs to parameterize fE; for set input we use multi-layer\nperceptrons. The pairwise distance matrix D can now be obtained by Dij = d(Ei, Ej), where Ei\ndenotes the embedding of the i-th item and d is a differentiable distance function. We found that the\nEuclidean distance works well for the tasks that we consider. In practice, for each query item, we\ncon\ufb01ne the set of potential neighbors to a subset of all items, e. g. all image patches in a certain local\nregion. This allows our N3 block to scale linearly in the number of items instead of quadratically.\nAnother network branch computes a tensor T = fT(Y ) containing the temperature t for each item.\nNote that fE and fT can potentially share weights to some degree. We opted for treating them as\nseparate networks as this allows for an easier implementation.\nContinuous nearest neighbors selection. From the distance matrix D and the temperature tensor T ,\nwe compute k continuous nearest neighbors feature volumes N1, . . . , Nk from the input features Y\nby applying Eqs. (8) to (11) to each item. 
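A minimal sketch of this per-item computation (NumPy; the callables f_E and f_T stand in for the trained embedding and temperature branches, and all items are used as candidates, whereas in practice candidates are confined to a local region):

```python
import numpy as np

def softmax(a, t):
    z = np.exp((a - a.max()) / t)
    return z / z.sum()

def n3_block(Y, f_E, f_T, k):
    """Sketch of an N3 block on set-valued input Y (M items x feature dim).

    f_E and f_T are placeholders for the embedding and temperature networks.
    Returns Y concatenated with the k continuous neighbor volumes N_1..N_k."""
    E = f_E(Y)                                  # feature embedding
    T = f_T(Y)                                  # per-item temperature t_i > 0
    # pairwise distance matrix D_ij = d(E_i, E_j), here Euclidean
    D = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    M = Y.shape[0]
    volumes = [np.empty_like(Y) for _ in range(k)]
    for i in range(M):
        alpha = -D[i]
        alpha[i] = -np.inf                      # exclude the query item itself
        for j in range(k):                      # Eqs. (8)-(11) per item
            w = softmax(alpha, T[i])
            volumes[j][i] = w @ Y
            alpha = alpha + np.log(np.maximum(1.0 - w, 1e-12))
    return np.concatenate([Y] + volumes, axis=-1)

# toy usage: identity embedding, small constant temperature, 4 items, 2 features
Y = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
out = n3_block(Y, f_E=lambda Y: Y, f_T=lambda Y: np.full(len(Y), 1e-3), k=1)
print(out.shape)                                # input features + 1 neighbor volume
```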
Since Y and each Ni have equal dimensionality, we could\nuse any element-wise operation to aggregate the original features Y and the neighbors. However,\na reduction at this stage would mean a very early fusion of features. Hence, we instead simply\nconcatenate Y and the Ni along the feature dimension, which allows further network layers to learn\nhow to fuse the information effectively in a non-linear way.\n\nN3 block for image data. The N3 block described above is very generic and not limited to a certain\ninput domain. We now describe minor technical modi\ufb01cations when applying the N3 block to image\ndata. Traditionally, non-local methods in image processing have been applied at the patch-level, i. e.\nthe items to be matched consist of image patches instead of pixels. This has the advantage of using a\nbroader local context for matching and aggregation. We follow this reasoning and \ufb01rst apply a strided\nim2col operation on E before calculating pairwise distances. The temperature parameter for each\npatch is obtained by taking the corresponding center pixel in T . Each nearest neighbor volume Ni is\nconverted from the patch domain to the image domain by applying a col2im operation, where we\naverage contributions of different patches to the same pixel.\n\n5 Experiments\n\nWe now analyze the properties of our novel N3Net and show its bene\ufb01ts over state-of-the-art baselines.\nWe use image denoising as our main test bed as non-local methods have been well studied there.\nMoreover, we evaluate on single image super-resolution and correspondence classi\ufb01cation.\nGaussian image denoising. We consider the task of denoising a noisy image D, which arises by\ncorrupting a clean image C with additive white Gaussian noise of standard deviation \u03c3:\n\nD = C + N with N \u223c N (0, \u03c32).\n\n(12)\nOur baseline architecture is the DnCNN model of Zhang et al. 
[50], consisting of 16 blocks, each with a sequence of a 3 × 3 convolutional layer with 64 feature maps, batch normalization [17], and a ReLU activation function. In the end, a final 3 × 3 convolution is applied, the output of which is added back to the input through a global skip connection.
We use the DnCNN architecture to create our N3Net for image denoising. Specifically, we use three DnCNNs with six blocks each, cf. Fig. 2b. The first two blocks output 8 feature maps, which are fed into a subsequent N3 block that computes 7 neighbor volumes. The concatenated output again has a depth of 64 feature channels, matching the depth of the other intermediate blocks. The N3 blocks extract 10 × 10 patches with a stride of 5. Patches are matched to other patches in an 80 × 80 region, yielding a total of 224 candidate patches for matching each query patch. More details on the architecture can be found in the supplemental material.

Table 1. PSNR and SSIM [43] on Urban100 for different architectures on gray-scale image denoising (σ=25).

              Model                                  Matching on          PSNR [dB]  SSIM
(i)           1 × DnCNN (d=17)                       –                    29.97      0.879
(ii)          1 × DnCNN (d=18)                       –                    29.92      0.885
(iii)         3 × DnCNN (d=6), KNN block (k=7)       noisy input          30.07      0.891
(iv)          3 × DnCNN (d=6), KNN block (k=7)       DnCNN output (d=17)  30.08      0.890
(v)           3 × DnCNN (d=6), Concrete block (k=7)  learned embedding    29.97      0.889
(ours light)  2 × DnCNN (d=6), N3 block (k=7)        learned embedding    29.99      0.888
(ours full)   3 × DnCNN (d=6), N3 block (k=7)        learned embedding    30.19      0.892

Training details. 
We follow the protocol of Zhang et al. [50] and use the 400 images in the train and\ntest split of the BSD500 dataset for training. Note that these images are strictly separate from the\nvalidation images. For each epoch, we randomly crop 512 patches of size 80 \u00d7 80 from each training\nimage. We use horizontal and vertical \ufb02ipping as well as random rotations \u2208 {0\u25e6, 90\u25e6, 180\u25e6, 270\u25e6}\nas further data augmentation. In total, we train for 50 epochs with a batch size of 32, using the Adam\noptimizer [21] with default parameters \u03b21 = 0.9, \u03b22 = 0.999 to minimize the squared error. The\nlearning rate is initially set to 10\u22123 and exponentially decreased to 10\u22128 over the course of training.\nFollowing the publicly available implementation of DnCNN [50], we apply a weight decay with\nstrength 10\u22124 to the weights of the convolution layers and the scaling of batch normalization layers.\nWe evaluate our full model on three different datasets: (i) a set of twelve commonly used benchmark\nimages (Set12), (ii) the 68 images subset [37] of the BSD500 validation set [32], and (iii) the\nUrban100 [16] dataset, which contains images of urban scenes where repetitive patterns are abundant.\n\n5.1 Ablation study\n\nWe begin by discerning the effectiveness of the individual components. We compare our full N3Net\nagainst several baselines: (i,ii) The baseline DnCNN network with depths 17 (default) and 18\n(matching the depth of N3Net). (iii) A baseline where we replace the N3 blocks with KNN selection\n(k = 7) to obtain neighbors for each patch. Distance calculation is done on the noisy input patches.\n(iv) The same baseline as (iii) but where distances are calculated on denoised patches. Here we\nuse the pretrained 17-layer DnCNN as strong denoiser. The task speci\ufb01c hand-chosen distance\nembedding for this baseline should intuitively yield more sensible nearest neighbors matches than\nwhen matching noisy input patches. 
(v) A baseline where we use Concrete distributions [19, 30] to approximately reparameterize the stochastic nearest neighbors sampling. The resulting Concrete block has an additional network for estimating the annealing parameter of the Concrete distribution.
Table 1 shows the results on the Urban100 test set (σ = 25), from which we can infer four insights: First, the KNN baselines (iii) and (iv) improve upon the plain DnCNN model, showing that allowing the network to access non-local information is beneficial. Second, matching denoised patches (baseline (iv)) does not improve significantly over matching noisy patches (baseline (iii)). Third, learning a patch embedding with our novel N3 block shows a clear improvement over all baselines. We, moreover, evaluate a smaller version of N3Net with only two DnCNN blocks of depth 6 (ours light). This model already outperforms the baseline DnCNN with depth 17 despite having fewer layers (12 vs. 17) and fewer parameters (427k vs. 556k). Fourth, reparameterization with Concrete distributions (baseline (v)) performs worse than our continuous nearest neighbors. This is probably due to the Concrete distribution introducing stochasticity into the forward pass, leading to less stable training. Additional ablations are given in the supplemental material.

Table 2. PSNR (dB) on Urban100 for gray-scale image denoising for varying k.

         k = 1   k = 2   k = 3   k = 4   k = 5   k = 6   k = 7
σ = 25   30.17   30.21   30.15   30.27   30.27   30.22   30.19
σ = 50   26.76   26.81   26.78   26.86   26.83   26.80   26.82

Figure 3. Denoising results (cropped for better display) and PSNR values on an image from Urban100 (σ = 50): (a) Clean, (b) BM3D (25.21 dB), (c) FFDNet (24.92 dB), (d) NN3D (25.00 dB), (e) Noisy (14.16 dB), (f) DnCNN (24.76 dB), (g) UNLNet (25.47 dB), (h) N3Net (25.57 dB).

Next, we compare N3Nets with a varying number of selected neighbors. 
Table 2 shows the results on\nUrban100 with \u03c3 \u2208 {25, 50}. We can observe that, as expected, more neighbors improve denoising\nresults. However, the effect diminishes after roughly four neighbors and accuracy starts to deteriorate\nagain. As we refrain from selecting optimal hyper-parameters on the test set, we will stick to the\narchitecture with k = 7 for the remaining experiments on image denoising and SISR.\n\n5.2 Comparison to the state of the art\n\nWe compare our full N3Net against state-of-the-art local denoising methods, i. e. the DnCNN baseline\n[50], the very deep and wide (30 layers, 128 feature channels) RED30 model [31], and the recent\nFFDNet [52]. Moreover, we compare against competing non-local denoisers. These include the\nclassical BM3D [9], which uses a hand-crafted denoising pipeline, and the state-of-the-art trainable\nnon-local models NLNet [23] and UNLNet [24], both learning to process non-locally aggregated\npatches. We also compare against NN3D [7], which applies a non-local step on top of a pretrained\nnetwork. For fair comparison, we apply a single denoising step for NN3D using our 17-layer baseline\nDnCNN. As a crucial difference to our proposed N3Net, all of the compared non-local methods use\nKNN selection on a \ufb01xed feature space, thus not being able to learn an embedding for matching.\nTable 3 shows the results for three different noise levels. We make three important observations:\nFirst, our N3Net signi\ufb01cantly outperforms the baseline DnCNN network on all tested noise levels\nand all datasets. Especially for higher noise levels the margin is dramatic, e. g. +0.54dB (\u03c3 = 50)\nor +0.79dB (\u03c3 = 70) on Urban100. Even the deeper and wider RED30 model does not reach\nthe accuracy of N3Net. Second, our method is the only trainable non-local model that is able to\noutperform the local models DnCNN, RED30, and FFDNet. The competing models NLNet and\n\nTable 3. 
PSNR (dB) for gray-scale image denoising on different datasets. NLNet does not provide a model for\n\u03c3 = 70 and the publicly available UNLNet model was not trained for \u03c3 = 70. RED30 does not provide a model\nfor \u03c3 = 25 and BSD68 is part of the RED30 training set. Hence, we omit these results.\n\nDataset\n\nSet12\n\nBSD68\n\nUrban100\n\n\u03c3\n25\n50\n70\n25\n50\n70\n25\n50\n70\n\n30.31\n27.04\n\nDnCNN BM3D NLNet UNLNet NN3D RED30\n30.44\n27.19\n25.56\n29.23\n26.23\n24.85\n29.97\n26.28\n24.36\n\n30.45\n27.24\n25.61\n29.19\n26.19\n24.89\n30.09\n26.47\n24.53\n\n29.96\n26.70\n25.21\n28.56\n25.63\n24.46\n29.71\n25.95\n24.27\n\n27.24\n25.71\n\n26.32\n24.63\n\n28.99\n26.07\n\n29.03\n26.07\n\n29.92\n26.15\n\n29.80\n26.14\n\n\u2013\n\n\u2013\n\u2013\n\u2013\n\u2013\n\n\u2013\n\n\u2013\n\n\u2013\n\n30.27\n27.07\n\n\u2013\n\n\u2013\n\n\u2013\n\n7\n\nFFDNet N3Net (ours)\n30.43\n27.31\n25.81\n29.19\n26.29\n25.04\n29.92\n26.52\n24.87\n\n30.55\n27.43\n25.90\n29.30\n26.39\n25.14\n30.19\n26.82\n25.15\n\n\fUNLNet do not reach the accuracy of DnCNN even on Urban100, whereas our N3Net even fares\nbetter than the strongest local denoiser FFDNet. Third, the post-hoc non-local step applied by NN3D\nis very effective on Urban100 where self-similarity can intuitively shine. However, on Set12 the gains\nare noticeably smaller whilst on BDS68 the non-local step can even result in degraded accuracy, e. g.\nNN3D achieves \u22120.04dB compared to DnCNN while N3Net achieves +0.16dB for \u03c3 = 50. This\nhighlights the importance of integrating non-local processing into an end-to-end trainable pipeline.\nFigure 3 shows denoising results for an image from the Urban100 dataset. BM3D and UNLNet\ncan exploit the recurrence of image structures to produce good results albeit introducing artifacts\nin the windows. DnCNN and FFDNet yield even more artifacts due to the limited receptive \ufb01eld\nand NN3D, as a post-processing method, cannot recover from the errors of DnCNN. 
In contrast, our N3Net produces a significantly cleaner image where most of the facade structure is correctly restored.

5.3 Real image denoising

To further demonstrate the merits of our approach, we applied the same N3Net architecture as before to the task of denoising real-world images with realistic noise. To this end, we evaluate on the recent Darmstadt Noise Dataset [34], consisting of 50 noisy images shot with four different cameras at varying ISO levels. Realistic noise can be well explained by a Poisson-Gaussian distribution, which, in turn, can be well approximated by a Gaussian distribution whose variance depends on the image intensity via a linear noise level function [12]. We use this heteroscedastic Gaussian distribution to generate synthetic noise for training. Specifically, we use a broad range of noise level functions covering those that occur on the test images. For training, we use the 400 images of the BSDS training and test splits, 800 images of the DIV2K training set [1], and a training split of 3793 images from the Waterloo database [29]. Before adding synthetic noise, we transform the clean RGB images Y_RGB to Y_RAW such that they more closely resemble images with raw intensity values:

    Y_RAW = f_c · Y(Y_RGB)^{f_e},  with f_c ~ U(0.25, 1) and f_e ~ U(1.25, 10),    (13)

where Y(·) computes luminance values from RGB, the exponentiation with f_e aims at undoing the compression of high image intensities, and the scaling with f_c aims at undoing the effect of white balancing. Further training details can be found in the supplemental material. We train both the DnCNN baseline as well as our N3Net with the same training protocol and evaluate them on the benchmark website. Results are shown in Table 4. N3Net sets a new state of the art for denoising raw images, outperforming DnCNN and BM3D by a significant margin.
Moreover, the PSNR values, when evaluated on developed sRGB images, surpass those of the currently top-performing methods in sRGB denoising, TWSC [46] and CBDNet [15].

Table 4. Results on the Darmstadt Noise Dataset [34].

Method   Raw PSNR   Raw SSIM   sRGB PSNR   sRGB SSIM
BM3D     46.64      0.9724     37.78       0.9308
DnCNN    47.37      0.9760     38.08       0.9357
N3Net    47.56      0.9767     38.32       0.9384
TWSC     –          –          37.94       0.9403
CBDNet   –          –          38.06       0.9421

5.4 Single image super-resolution

We now show that we can also augment recent strong CNN models for SISR with our N3 block. We particularly consider the common task [16, 20] of upsampling a low-resolution image that was obtained from a high-resolution image by bicubic downscaling. We chose the VDSR model [20] as our baseline architecture, since it is conceptually very close to the DnCNN model for image denoising; the only notable difference is that it has 20 layers instead of 17. We derive our N3Net for SISR from the VDSR model by stacking three VDSR networks of depth 7 and inserting two N3 blocks (k = 7) after the first two VDSR networks, cf. Fig. 2b. Following [20], the input to our network is the bicubically upsampled low-resolution image, and we train a single model for super-resolving images with factors 2, 3, and 4. Further details on the architecture and training protocol can be found in the supplemental material. Note that we refrain from building our N3Net for SISR on more recent networks, e.g. MemNet [38], MDSR [25], or WDnCNN [3], since they are too costly to train.

We compare our N3Net against VDSR and MemNet as well as two non-local models: SelfEx [16] and the recent WSD-SR [8]. Table 5 shows results on Set5 [4]. Again, we observe a consistent gain of N3Net over the strong VDSR baseline for all super-resolution factors, e.g. +0.15 dB for ×4 super-resolution. More importantly, the other non-local methods perform worse than our N3Net (e.g. +0.36 dB over WSD-SR for ×2 super-resolution), showing that learning the matching feature space is superior to relying on a hand-defined feature space. Further quantitative and visual results demonstrating the same benefits of N3Net can be found in the supplemental material.

Table 5. PSNR (dB) for single image super-resolution on Set5.

Factor   Bicubic   SelfEx   WSD-SR   MemNet   MDSR    VDSR    N3Net
×2       33.68     36.49    37.21    37.78    38.11   37.53   37.57
×3       30.41     32.58    33.50    34.09    34.66   33.66   33.84
×4       28.43     30.31    31.39    31.74    32.50   31.35   31.50

5.5 Correspondence classification

As a third application, we look at classifying correspondences between image features from two images as either correct or incorrect. Again, we augment a baseline network with our non-local block. Specifically, we build upon the context normalization network [48], which we call CNNet in the following. The input to this network is a set of pairs of image coordinates of putative correspondences, and the output is a probability for each correspondence to be correct.

Table 6. MAP scores for correspondence estimation for different error thresholds and combinations of training and testing set. Higher MAP scores are better.

Train / Test            Method   5°      10°     20°
St. Peter / St. Peter   No Net   0.014   0.030   0.071
                        CNNet    0.271   0.379   0.522
                        N3Net    0.316   0.431   0.574
St. Peter / Reichstag   No Net   0.0     0.038   0.111
                        CNNet    0.173   0.337   0.500
                        N3Net    0.231   0.442   0.601
Brown / Brown           No Net   0.054   0.110   0.232
                        CNNet    0.236   0.333   0.463
                        N3Net    0.293   0.391   0.510

CNNet consists of 12
CNNet consists of 12\nblocks, each comprised of a local fully connected layer with 128 feature channels that processes each\npoint individually, and a context normalization and batch normalization layer that pool information\nacross the whole point set. We augment CNNet by introducing a N3 block after the sixth original\nblock. As opposed to the N3 block for the previous two tasks, where neighbors are searched only in\nthe vicinity of a query patch, here we search for nearest neighbors among all correspondences. We\nwant to emphasize that this is a pure set reasoning task. Image features are used only to determine\nputative correspondences while the network itself is agnostic of any image content.\nFor training we use the publicly available code of [48]. We consider two settings: First, we train on\nthe training set of the outdoor sequence St. Peter and evaluate on the test set of St. Peter and another\noutdoor sequence called Reichstag to test generalization. Second, we train and test on the respective\nsets of the indoor sequence Brown. Table 6 shows the resulting mean average precision (MAP) values\nat different error thresholds (for details on this metric, see [48]). We compare our N3Net to the\noriginal CNNet and a baseline that just uses all putative correspondences for pose estimation. As\ncan be seen, by simply inserting our N3 block we achieve a consistent and signi\ufb01cant gain in all\nconsidered settings, increasing MAP scores by 10% to 30%. This suggests that our N3 block can\nenhance local processing networks in a wide range of applications and data domains.\n\n6 Conclusion\n\nNon-local methods have been well studied, e. g., in image restoration. Existing approaches, however,\napply KNN selection on a hand-de\ufb01ned feature space, which may be suboptimal for the task at hand.\nTo overcome this limitation, we introduced the \ufb01rst continuous relaxation of the KNN selection\nrule that maintains differentiability w. r. t. 
the pairwise distances used for neighbor selection. We integrated continuous nearest neighbors selection into a novel network block, called N3 block, which can be used as a general building block in neural networks. We exemplified its benefit in the context of image denoising, SISR, and correspondence classification, where we outperform state-of-the-art CNN-based methods and non-local approaches. We expect the N3 block to also benefit end-to-end trainable architectures for other input domains, such as text or other sequence-valued data.

Acknowledgments. The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007–2013)/ERC Grant agreement No. 307942. We would like to thank the reviewers for their fruitful comments.

References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR Workshops, pages 126–135, 2017.
[2] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res., 15(1):3563–3593, January 2014.
[3] Woong Bae, Jae Jun Yoo, and Jong Chul Ye. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification. In CVPR Workshops, pages 145–153, 2017.
[4] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pages 135.1–135.10, 2012.
[5] Siavash Arjomand Bigdeli, Matthias Zwicker, Paolo Favaro, and Meiguang Jin. Deep mean-shift priors for image restoration. In NIPS*2017, pages 763–772.
[6] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local algorithm for image denoising. In CVPR, pages 60–65, 2005.
[7] Cristóvão Cruz, Alessandro Foi, Vladimir Katkovnik, and Karen O. Egiazarian. Nonlocality-reinforced convolutional neural networks for image denoising. IEEE Sig. Proc. Letters, 25(8):1216–1220, 2018.
[8] Cristóvão Cruz, Rakesh Mehta, Vladimir Katkovnik, and Karen O. Egiazarian. Single image super-resolution based on Wiener filter in similarity domain. IEEE T. Image Process., 27(2):1376–1389, March 2018.
[9] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising with block-matching and 3D filtering. In Electronic Imaging '06, Proc. SPIE 6064, No. 6064A-30, 2006.
[10] Steven Diamond, Vincent Sitzmann, Stephen Boyd, Gordon Wetzstein, and Felix Heide. Dirty pixels: Optimizing image classification architectures for raw sensor data. arXiv:1701.06487 [cs.CV], 2017.
[11] David L. Donoho. Denoising by soft-thresholding. IEEE T. Info. Theory, 41(3):613–627, May 1995.
[12] Alessandro Foi, Mejdi Trimeche, Vladimir Katkovnik, and Karen Egiazarian. Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE T. Image Process., 17(10):1737–1754, October 2008.
[13] Jacob Goldberger, Geoffrey E. Hinton, Sam T. Roweis, and Ruslan R. Salakhutdinov. Neighbourhood components analysis. In NIPS*2005, pages 513–520.
[14] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv:1410.5401 [cs.NE], 2014.
[15] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. arXiv:1807.04686 [cs.CV], 2018.
[16] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197–5206, 2015.
[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[18] Viren Jain and H. Sebastian Seung. Natural image denoising with convolutional networks. In NIPS*2008, pages 769–776.
[19] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.
[20] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646–1654, 2016.
[21] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[22] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
[23] Stamatios Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In CVPR, pages 5882–5891, 2017.
[24] Stamatios Lefkimmiatis. Universal denoising networks: A novel CNN-based network architecture for image denoising. In CVPR, pages 3204–3213, 2018.
[25] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPR Workshops, pages 136–144, 2017.
[26] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas Huang. Non-local recurrent network for image restoration. In NIPS*2018.
[27] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[28] Or Lotan and Michal Irani. Needle-match: Reliable patch matching under high uncertainty. In CVPR, pages 439–448, 2016.
[29] Kede Ma, Zhengfang Duanmu, Qingbo Wu, Zhou Wang, Hongwei Yong, Hongliang Li, and Lei Zhang. Waterloo Exploration Database: New challenges for image quality assessment models. IEEE T. Image Process., 26(2):1004–1016, February 2017.
[30] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
[31] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS*2016, pages 2802–2810.
[32] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, volume 2, pages 416–423, 2001.
[33] Andriy Mnih and Danilo J. Rezende. Variational inference for Monte Carlo objectives. In ICML, pages 2188–2196, 2016.
[34] Tobias Plötz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In CVPR, pages 1586–1595, 2017.
[35] Weiqiang Ren, Yinan Yu, Junge Zhang, and Kaiqi Huang. Learning convolutional nonlinear features for k nearest neighbor image classification. In ICPR, pages 4358–4363, 2014.
[36] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (RED). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
[37] Stefan Roth and Michael J. Black. Fields of experts. Int. J. Comput. Vision, 82(2):205–229, April 2009.
[38] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. MemNet: A persistent memory network for image restoration. In ICCV, pages 4539–4547, 2017.
[39] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. NTIRE 2017 challenge on single image super-resolution: Methods and results. In CVPR Workshops, pages 114–125, 2017.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS*2017, pages 6000–6010.
[41] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NIPS*2016, pages 3630–3638.
[42] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
[43] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multi-scale structural similarity for image quality assessment. In IEEE Asilomar Conference on Signals, Systems and Computers, volume 2, pages 1398–1402, Pacific Grove, California, November 2003.
[44] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, February 2009.
[45] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, volume 2, pages 451–466, 2016.
[46] Jun Xu, Lei Zhang, and David Zhang. A trilateral weighted sparse coding scheme for real-world image denoising. In ECCV, volume 8, pages 21–38, 2018.
[47] Dong Yang and Jian Sun. BM3D-Net: A convolutional neural network for transform-domain collaborative filtering. IEEE Sig. Proc. Letters, 25(1):55–59, 2018.
[48] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In CVPR, pages 2666–2674, 2018.
[49] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[50] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE T. Image Process., 26(7):3142–3155, 2017.
[51] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In CVPR, pages 2808–2817, 2017.
[52] Kai Zhang, Wangmeng Zuo, and Lei Zhang. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE T. Image Process., 27(9):4608–4622, 2018.
[53] Maria Zontak and Michal Irani. Internal statistics of a single natural image. In CVPR, pages 977–984, 2011.
[54] Maria Zontak, Inbar Mosseri, and Michal Irani. Separating signal from noise using patch recurrence across scales. In CVPR, pages 1195–1202, 2013.