{"title": "High-Quality Self-Supervised Deep Image Denoising", "book": "Advances in Neural Information Processing Systems", "page_first": 6970, "page_last": 6980, "abstract": "We describe a novel method for training high-quality image denoising models based on unorganized collections of corrupted images. The training does not need access to clean reference images, or explicit pairs of corrupted images, and can thus be applied in situations where such data is unacceptably expensive or impossible to acquire. We build on a recent technique that removes the need for reference data by employing networks with a \"blind spot\" in the receptive field, and significantly improve two key aspects: image quality and training efficiency. Our result quality is on par with state-of-the-art neural network denoisers in the case of i.i.d. additive Gaussian noise, and not far behind with Poisson and impulse noise. We also successfully handle cases where parameters of the noise model are variable and/or unknown in both training and evaluation data.", "full_text": "High-Quality Self-Supervised Deep Image Denoising\n\nSamuli Laine\nNVIDIA\u2217\n\nTero Karras\n\nNVIDIA\n\nJaakko Lehtinen\n\nNVIDIA, Aalto University\n\nTimo Aila\nNVIDIA\n\nAbstract\n\nWe describe a novel method for training high-quality image denoising models\nbased on unorganized collections of corrupted images. The training does not need\naccess to clean reference images, or explicit pairs of corrupted images, and can\nthus be applied in situations where such data is unacceptably expensive or im-\npossible to acquire. We build on a recent technique that removes the need for\nreference data by employing networks with a \u201cblind spot\u201d in the receptive \ufb01eld,\nand signi\ufb01cantly improve two key aspects: image quality and training ef\ufb01ciency.\nOur result quality is on par with state-of-the-art neural network denoisers in the\ncase of i.i.d. 
additive Gaussian noise, and not far behind with Poisson and impulse\nnoise. We also successfully handle cases where parameters of the noise model are\nvariable and/or unknown in both training and evaluation data.\n\n1\n\nIntroduction\n\nDenoising, the removal of noise from images, is a major application of deep learning. Several\narchitectures have been proposed for general-purpose image restoration tasks, e.g., U-Nets [23],\nhierarchical residual networks [20], and residual dense networks [31]. Traditionally, the models are\ntrained in a supervised fashion with corrupted images as inputs and clean images as targets, so that\nthe network learns to remove the corruption.\nLehtinen et al. [17] introduced NOISE2NOISE training, where pairs of corrupted images are used as\ntraining data. They observe that when certain statistical conditions are met, a network faced with\nthe impossible task of mapping corrupted images to corrupted images learns, loosely speaking, to\noutput the \u201caverage\u201d image. For a large class of image corruptions, the clean image is a simple\nper-pixel statistic \u2014 such as mean, median, or mode \u2014 over the stochastic corruption process, and\nhence the restoration model can be supervised using corrupted data by choosing the appropriate loss\nfunction to recover the statistic of interest.\nWhile removing the need for clean training images, NOISE2NOISE training still requires at least two\nindependent realizations of the corruption for each training image. While this eases data collection\nsigni\ufb01cantly compared to noisy-clean pairs, large collections of (single) poor images are still much\nmore widespread. This motivates investigation of self-supervised training: how much can we learn\nfrom just looking at corrupted data? While foregoing supervision would lead to the expectation of\nsome regression in performance, can we make up for it by making stronger assumptions about the\ncorruption process? 
In this paper, we show that for several noise models that are i.i.d. between pixels (Gaussian, Poisson, impulse), only minor concessions in denoising performance are necessary. We furthermore show that the parameters of the noise models do not need to be known in advance.
We draw inspiration from the recent NOISE2VOID training technique of Krull et al. [14]. The algorithm needs no image pairs, and uses just individual noisy images as training data, assuming that the corruption is zero-mean and independent between pixels. The method is based on blind-spot networks where the receptive field of the network does not include the center pixel. This allows using the same noisy image as both training input and training target — because the network cannot see the correct answer, using the same image as target is equivalent to using a different noisy realization. This approach is self-supervised in the sense that the surrounding context is used to predict the value of the output pixel without a separate reference image [8].

∗ {slaine, tkarras, jlehtinen, taila}@nvidia.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Top: In our blind-spot network architecture, we effectively construct four denoiser network branches, each having its receptive field restricted to a different direction. A single-pixel offset at the end of each branch separates the receptive field from the center pixel. The results are then combined by 1×1 convolutions. Bottom: In practice, we run four rotated versions of each input image through a single receptive-field-restricted branch, yielding a simpler architecture that performs the same function. This also implicitly shares the convolution kernels between the branches and thus avoids the four-fold increase in the number of trainable weights.

The networks used by Krull et al.
[14] do not have a blind spot by design, but are trained to ignore\nthe center pixel using a masking scheme where only a few output pixels can contribute to the loss\nfunction, reducing training ef\ufb01ciency considerably. We remedy this with a novel architecture that\nallows ef\ufb01cient training without masking. Furthermore, the existence of the blind spot leads to poor\ndenoising quality. We derive a scheme for combining the network output with data in the blind\nspot, bringing the denoising quality on par with, or at least much closer to, conventionally trained\nnetworks.\n\n2 Convolutional blind-spot network architectures\n\nOur convolutional blind-spot networks are designed by combining multiple branches that each have\ntheir receptive \ufb01eld restricted to a half-plane (Figure 1) that does not contain the center pixel. We\ncombine the four branches with a series of 1\u00d71 convolutions to obtain a receptive \ufb01eld that can\nextend arbitrarily far in every direction but does not contain the center pixel. The principle of\nlimiting the receptive \ufb01eld has been previously used in PixelCNN [29, 28, 24] image synthesis\nnetworks, where only pixels synthesized before the current pixel are allowed in the receptive \ufb01eld.2\nThe bene\ufb01t of our architecture compared to the masking-based training of Krull et al. [14] is that all\noutput pixels can contribute to the loss function as in conventional training.\nIn order to transform a restoration network into one with a restricted receptive \ufb01eld, we modify\neach individual layer so that its receptive \ufb01eld is fully contained within one half-plane, including\nthe center row/column. The receptive \ufb01eld of the resulting network includes the center pixel, so we\noffset the feature maps by one pixel before combining them. 
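As a minimal numpy sketch (our own illustration under simple assumptions, not the authors' code), the one-pixel downward offset applied to a feature map before the 1×1 combination layers can be written as:

```python
import numpy as np

def offset_down(fmap, k=1):
    """Shift a (H, W) feature map k pixels downwards: pad k zero rows
    at the top and crop k rows at the bottom, so output row i holds
    input row i - k."""
    h, w = fmap.shape
    zeros = np.zeros((k, w), dtype=fmap.dtype)
    return np.concatenate([zeros, fmap], axis=0)[:h]

fm = np.arange(12.0).reshape(3, 4)
shifted = offset_down(fm)
# shifted[0] is all zeros; shifted[1:] holds the original rows 0..1
```

After this shift, a branch whose receptive field was a half-plane including the center row no longer sees the center pixel.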
Layers that do not extend the receptive field, e.g., concatenation, summation, 1×1 convolution, etc., can be used without modifications.

Convolution layers  To restrict the receptive field of a zero-padding convolution layer to extend only, say, upwards, the easiest solution is to offset the feature maps downwards when performing the convolution operation. For an h × w kernel size, a downwards offset of k = ⌊h/2⌋ pixels is equivalent to using a kernel that is shifted upwards so that all weights below the center row are zero. Specifically, we first append k rows of zeros to the top of the input tensor, then perform the convolution, and finally crop out the k bottom rows of the output.

2 Regrettably the term “blind spot” has a slightly different meaning in PixelCNN literature: van den Oord et al. [28] use it to denote valid input pixels that the network in question fails to see due to poor design, whereas we follow the naming convention of Krull et al. [14] so that a blind spot is always intentional.

Downsampling and upsampling layers  Many image restoration networks involve downsampling and upsampling layers, and by default, these extend the receptive field in all directions. Consider, e.g., a 2 × 2 average downsampling step followed immediately by a nearest-neighbor 2 × 2 upsampling step. The contents of every 2 × 2 pixel block in the output now correspond to the average of this block in the input, i.e., information has been transferred in every direction within the block. We fix this problem by again applying an offset to the data. It is sufficient to restrict the receptive field for the pair of downsampling and upsampling layers, which means that only one of the layers needs to be modified, and we have chosen to attach the offsets to the downsampling layers.
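The pad-then-crop convolution trick described above can be sketched in numpy (a naive single-channel convolution for clarity; this is an illustration under simple assumptions, not the paper's implementation):

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 'same'-size zero-padded 2D cross-correlation."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.empty_like(img, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

def conv2d_upward(img, kernel):
    """Restrict the receptive field to extend upwards only: append
    k = kh//2 zero rows at the top, convolve, crop k bottom rows."""
    k = kernel.shape[0] // 2
    zeros = np.zeros((k, img.shape[1]))
    shifted = np.concatenate([zeros, img], axis=0)
    return conv2d_same(shifted, kernel)[:img.shape[0]]

# Impulse-response check: an impulse at row 4 may only influence
# output rows at the same row or below, never rows above it.
img = np.zeros((9, 9))
img[4, 4] = 1.0
resp = conv2d_upward(img, np.ones((3, 3)))
affected_rows = np.where(resp.any(axis=1))[0]
```

The impulse response confirms that information only flows downwards (i.e., each output pixel sees only its own row and rows above it), as required for the half-plane branches.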
For a 2 \u00d7 2\naverage downsampling layer, we can restrict the receptive \ufb01eld to extend upwards only by padding\nthe input tensor with one row of zeros at top and cropping out the bottom row before performing the\nactual downsampling operation.\n\n3 Self-supervised Bayesian denoising with blind-spot networks\n\nConsider the prediction of the clean value x for a noisy pixel y. As the pixels in an image are\nnot independent, all denoising algorithms assume the clean value depends not only on the noisy\nmeasurement y, but also on the context of neighboring (noisy) pixels that we denote by \u2126y. For our\nconvolutional networks, the context corresponds to the receptive \ufb01eld sans the central pixel. From\nthis point of view, denoising can be thought of as statistical inference on the probability distribution\np(x|y, \u2126y) over the clean pixel value x conditioned with both the context \u2126y and the measurement\ny. Concretely, a standard supervised regression model trained with corrupted-clean pairs and L2\nloss will return an estimate of Ex[p(x|y, \u2126y)], i.e., the mean over all possible clean pixel values\ngiven the noisy pixel and its context.\nAssuming the noise is independent between pixels and independent of the context, the blind-spot\nnetwork introduced by Krull et al. [14] predicts the clean value based purely on the context, using the\nnoisy measurement y as a training target, drawing on the NOISE2NOISE approach [17]. Concretely,\ntheir regressor learns to estimate Ex[p(x|\u2126y)], i.e., the mean of all potential clean values consistent\nwith the context. Batson and Royer [1] present an elegant general formulation for self-supervised\nmodels like this. 
However, methods that ignore the corrupted measurement y at test-time clearly leave useful information unused, potentially leading to reduced performance.
We bring in extra information in the form of an explicit model of the corruption, provided as a likelihood p(y|x) of the observation given the clean value, which we assume to be independent of the context and i.i.d. between pixels. This allows us to connect the observed marginal distribution of the noisy training data to the unobserved distribution of clean data:

p(y|Ωy) = ∫ p(y|x) p(x|Ωy) dx,    (1)

where p(y|Ωy) is the training data distribution, p(y|x) is the noise model, and p(x|Ωy) is the unobserved clean-signal distribution.

This functional relationship suggests that even though we only observe corrupted training data, the known noise model should help us learn to predict a parametric model for the distribution p(x|Ωy). Specifically, we model p(x|Ωy) as a multivariate Gaussian N(μx, Σx) over color components. For many noise models, the marginal likelihood p(y|Ωy) can then be computed in closed form, allowing us to train a neural network to map the context Ωy to the mean μx and covariance Σx by maximizing the likelihood of the data under Equation (1).
The approximate distribution p(x|Ωy) allows us to now apply Bayesian reasoning to include information from y at test-time.
Specifically, the (unnormalized) posterior probability of the clean value x given observations of both the noisy pixel y and its context is given by Bayes’ rule as follows:

p(x|y, Ωy) ∝ p(y|x) p(x|Ωy),    (2)

where p(x|y, Ωy) is the posterior, p(y|x) is the noise model, and p(x|Ωy) is the prior.

From this point of view, the distribution p(x|Ωy) takes the role of the prior, encoding our beliefs on the possible xs before observing y. (Note that even though we represent the prior as a Gaussian, the posterior is generally not Gaussian due to the multiplication with the noise likelihood.) With the posterior at hand, standard Bayesian inference tools become available: for instance, a maximum a posteriori (MAP) estimate would pick the x that maximizes the posterior; we use the posterior mean Ex[p(x|y, Ωy)] for all denoising results as it minimizes MSE and consequently maximizes PSNR.
To summarize, our approach consists of (1) a standard training phase and (2) a two-step testing phase:

(1) Train a neural network to map the context Ωy to the mean μx and covariance Σx of a Gaussian approximation to the prior p(x|Ωy).
(2) At test time, first feed the context Ωy to the neural network to yield μx and Σx; then compute the posterior mean Ex[p(x|y, Ωy)] by closed-form analytic integration.

Looping back to the beginning of this section, we note that the estimate found by standard supervised training with the L2 loss is precisely the same posterior mean Ex[p(x|y, Ωy)] we seek. Unfortunately, this does not imply that our self-supervised technique would be guaranteed to find the same optimum: we approximate the prior distribution with a Gaussian, whereas standard supervised training corresponds to a Gaussian approximation of the posterior.
However, benign noise models, such as additive Gaussian noise or Poisson noise, interact with the prior in a way that the result is almost as good, as demonstrated below.
In concurrent work, Krull et al. [15] describe a similar algorithm for monochromatic data. Instead of an analytical solution, they use a sampling-based method to describe the prior and posterior, and represent an arbitrary noise model as a discretized two-dimensional histogram.

4 Practical experiments

In this section, we detail the implementation of our denoising scheme for Gaussian, Poisson, and impulse noise. In all our experiments, we use a modified version of the five-level U-Net [23] architecture used by Lehtinen et al. [17], to which we append three 1×1 convolution layers. We construct our convolutional blind-spot networks based on this same architecture. Details regarding network architecture, training, and evaluation are provided in the supplement. Our training data comes from the 50k images in the ILSVRC2012 (Imagenet) validation set, and our test datasets are the commonly used KODAK (24 images), BSD300 validation set (100 images), and SET14 (14 images).

4.1 Additive Gaussian noise

Let us now realize the scheme outlined in Section 3 in the context of additive Gaussian noise. We will cover the general case of color images only, but the method simplifies trivially to monochromatic images by replacing all matrices and vectors with scalar values.
The blind-spot network outputs the parameters of a multivariate Gaussian N(μx, Σx) = p(x|Ωy) representing the distribution of the clean signal. We parameterize the covariance matrix as Σx = AxᵀAx, where Ax is an upper triangular matrix. This ensures that Σx is a valid covariance matrix, i.e., symmetric and positive semidefinite.
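A quick numpy check of this parameterization (illustrative values of our own, not actual network outputs):

```python
import numpy as np

# Six hypothetical outputs filling the upper triangle of a 3x3 matrix A_x.
a = np.array([0.9, -0.2, 0.1, 0.7, 0.3, 0.5])
A = np.zeros((3, 3))
A[np.triu_indices(3)] = a

Sigma = A.T @ A  # Sigma_x = A_x^T A_x

# Symmetric and positive semidefinite by construction:
is_symmetric = np.allclose(Sigma, Sigma.T)
min_eigenvalue = np.linalg.eigvalsh(Sigma).min()
```

Together with the three-component mean, the six upper-triangular entries give the nine per-pixel network outputs mentioned in the text.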
Thus we have a total of nine output components per pixel for RGB images: the three-component mean μx and the six nonzero elements of Ax.
Modeling the corruption process is particularly simple with additive zero-mean Gaussian noise. In this case, Eq. 1 performs a convolution of two mutually independent Gaussians, and the covariance of the result is simply the sum of the constituents [2]. Therefore,

μy = μx   and   Σy = Σx + σ²I,    (3)

where σ is the standard deviation of the Gaussian noise. We can either assume σ to be known for each training and validation image, or we can learn to estimate it during training. For a constant, unknown σ, we add σ as one of the trainable parameters. For variable and unknown σ, we learn an auxiliary neural network for predicting it during training. The architecture of this auxiliary network is the same as in the baseline networks except that only one scalar per pixel is produced, and the σ for the entire image is obtained by taking the mean over the output. It is quite likely that a simpler network would have sufficed for the task, but we did not attempt to optimize its architecture. Note that the σ estimation network is not trained with a known noise level as a target, but it learns to predict it as a part of the training process.
To fit N(μy, Σy) to the observed noisy training data, we minimize the corresponding negative log-likelihood loss during training [22, 16, 13]:

loss(y, μy, Σy) = −log f(y; μy, Σy) = ½ (y − μy)ᵀ Σy⁻¹ (y − μy) + ½ log |Σy| + C,    (4)

where C subsumes additive constant terms that can be discarded, and f(y; μy, Σy) denotes the probability density of a multivariate Gaussian distribution N(μy, Σy) at pixel value y. In cases where σ is unknown and needs to be estimated, we add a small regularization term of −0.1σ to the loss. This encourages explaining the observed noise as corruption instead of uncertainty about the clean signal. As long as the regularization is gentle enough, the estimated σ does not overshoot — if it did, Σy = Σx + σ²I would become too large to fit the observed data in easy-to-denoise regions.

Table 1: Image quality results for Gaussian noise. Values of σ are shown in 8-bit units.

Noise type           Method                  σ known?   KODAK   BSD300   SET14   Average
Gaussian σ = 25      Baseline, N2C           no         32.46   31.08    31.26   31.60
                     Baseline, N2N           no         32.45   31.07    31.23   31.58
                     Our                     yes        32.45   31.03    31.25   31.57
                     Our                     no         32.44   31.02    31.22   31.56
                     Our ablated, diag. Σ    yes        31.60   29.91    30.58   30.70
                     Our ablated, diag. Σ    no         31.55   29.87    30.53   30.65
                     Our ablated, μ only     no         30.64   28.65    29.57   29.62
                     CBM3D                   yes        31.82   30.40    30.68   30.96
                     CBM3D                   no         31.81   30.40    30.66   30.96
Gaussian σ ∈ [5, 50] Baseline, N2C           no         32.57   31.29    31.27   31.71
                     Baseline, N2N           no         32.57   31.29    31.26   31.70
                     Our                     yes        32.47   31.19    31.21   31.62
                     Our                     no         32.46   31.18    31.13   31.59
                     Our ablated, diag. Σ    yes        31.59   30.06    30.54   30.73
                     Our ablated, diag. Σ    no         31.58   30.05    30.45   30.69
                     Our ablated, μ only     no         30.54   28.56    29.41   29.50
                     CBM3D                   yes        31.99   30.67    30.78   31.15
                     CBM3D                   no         31.99   30.67    30.72   31.13

At test time, we compute the mean of the posterior distribution. With additive Gaussian noise the product involves two Gaussians, and because both distributions are functions of x, we have

p(y|x) p(x|Ωy) = f(x; y, σ²I) f(x; μx, Σx),    (5)

where we have exploited the symmetry of the Gaussian distribution in the first term to swap x and y.
A product of two Gaussian functions is an unnormalized Gaussian function, whose mean [2] coincides with the desired posterior mean:

Ex[p(x|y, Ωy)] = (Σx⁻¹ + σ⁻²I)⁻¹ (Σx⁻¹ μx + σ⁻² y).    (6)

Note that we do not need to evaluate the normalizing constant (marginal likelihood), as scalar multiplication does not change the mean of a Gaussian.
Informally, the formula can be seen to “mix in” some of the observed noisy pixel color y into the estimated mean μx. When the network is certain about the clean signal (Σx is small), the estimated mean μx dominates the result. Conversely, the larger the uncertainty of the clean signal is compared to σ, the more of the noisy observed signal is included in the result.

Comparisons and ablations  Table 1 shows the output image quality for the various methods and ablations tested. Example result images are shown in Figure 2. All methods are evaluated using the same corrupted input data, and thus the only sources of randomness are the network initialization and training data shuffling during training. Denoiser networks seem to be fairly robust to these effects, e.g., [17] reports ±0.02 dB variation in the averaged results. We expect the same bounds to hold for our results as well.
Let us first consider the case where the amount of noise is fixed (top half of the table). The N2C baseline is trained with clean reference images as training targets, and unsurprisingly produces the best results that can be reached with a given network architecture. N2N [17] matches the results. Our method with a convolutional blind-spot network and posterior mean estimation is virtually as good as the baseline methods. This holds even when the amount of noise is unknown and needs to be estimated as part of the learning process.
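As a numeric illustration of the posterior-mean formula (6) above (all values made up by us for this sketch):

```python
import numpy as np

mu_x = np.array([0.40, 0.50, 0.60])       # prior mean from the network
Sigma_x = 0.001 * np.eye(3)               # network fairly certain here
sigma = 25.0 / 255.0                      # noise std. dev. in [0, 1] units
y = np.array([0.55, 0.35, 0.70])          # observed noisy pixel

prior_prec = np.linalg.inv(Sigma_x)
post_mean = np.linalg.solve(prior_prec + np.eye(3) / sigma**2,
                            prior_prec @ mu_x + y / sigma**2)

# With a small Sigma_x the posterior mean stays close to mu_x; a large
# Sigma_x would pull it towards the noisy observation y instead.
```

The result always lies between the prior mean and the noisy observation, weighted by the two precisions.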
However, when we ablate our method by forcing the covariance matrix Σx to be diagonal, the quality of the results suffers considerably. This setup corresponds to treating each color component of the prior as a univariate, independent distribution, and the bad result quality highlights the need to treat the signal as a true multivariate distribution.

Figure 2: Example result images for methods corresponding to Table 1: Gaussian noise σ = 25 (σ not known). [Panels: test image KODAK-6; noisy input 20.41 dB; N2C (baseline) 31.17 dB; Our (full) 31.17 dB; Our (diag Σ) 30.06 dB; Our (μ only) 29.04 dB; CBM3D 30.59 dB.] PSNRs refer to the individual images. The supplement gives additional result images, and the full images are included as PNG files in the supplementary material.

Table 2: Average output quality for Gaussian noise (σ = 25, known) with smaller training sets.

                                         Training images
Method                           all     10 000  1000    500     300     200     100
Baseline, N2C                    31.60   31.59   31.53   31.44   31.35   31.21   30.84
Our                              31.57   31.58   31.53   31.48   31.40   31.29   31.03
Baseline, N2C + rotation aug.    31.60   31.60   31.57   31.54   31.48   31.38   31.21
Our + rotation aug.              31.58   31.58   31.53   31.47   31.42   31.32   31.08

We can ablate the setup even further by having our blind-spot network architecture predict only the mean μ using standard L2 loss, and using this predicted mean directly as the denoiser output. This corresponds to the setup of Krull et al. [14] in the sense that the center pixel is ignored.
As expected,\nthe image quality suffers greatly due to the inability to extract information from the center pixel.\nSince we do not perform posterior mean estimation in this setup, noise level \u03c3 does not appear in\nthe calculations and knowing it would be of no use.\nFinally, we denoise the same test images using the of\ufb01cial implementation of CBM3D [6], a state-\nof-the-art non-learned image denoising algorithm.3 It uses no training data and relies on the contents\nof each individual test image for recovering the clean signal. With both known and automatically\nestimated (using the method of Chen et al. [5]) noise parameters, CBM3D outperforms our ablated\nsetups but remains far from the quality of our full method and the baseline methods.\nThe lower half of Table 1 presents the same metrics in the case of variable Gaussian noise, i.e.,\nwhen the noise parameters are chosen randomly within the speci\ufb01ed range for each training and\ntest image. The relative ordering of the methods remains the same as with a \ufb01xed amount of noise,\nalthough our method concedes 0.1dB relative to the baseline. Knowing the noise level in advance\ndoes not change the results.\nTable 2 illustrates the relationship between output quality and training set size. Without dataset\naugmentation, our method performs roughly on par with the baseline and surpasses it for very small\ndatasets (<1000 images). For the smaller training sets, rotation augmentation becomes bene\ufb01cial\nfor the baseline method, whereas for our method it only improves the training of 1\u00d71 combination\nlayers. With rotation augmentation enabled, our method therefore loses to the baseline method for\nvery small datasets, although not by much. 
No other training runs in this paper use augmentation, as it provides no benefit when using the full training set.

Comparison to masking-based training  Our “μ only” ablations illustrate the benefits of Bayesian training and posterior mean estimation compared to ignoring the center pixel as in the original NOISE2VOID method. Here, we shall separately estimate the advantages of having an architectural blind spot instead of masking-based training [14]. We trained several networks with our baseline architecture using masking. As recommended by Krull et al., we chose 64 pixels to be masked in each input crop using stratified sampling. Two masking strategies were evaluated: copying from another pixel in a 5×5 neighborhood (denoted COPY) as advocated in [14], and overwriting the pixel with a random color in [0, 1]³ (denoted RANDOM), as done by Batson and Royer [1].

3 Even though (grayscale) WNNM [9] has been shown to be superior to (grayscale) BM3D [7], our experiments with the official implementation of MCWNNM [30], a multi-channel version of WNNM, indicated that CBM3D performs better on our test data where all color channels have the same amount of noise.

Figure 3: Relative training costs for Gaussian noise (σ = 25, known) denoisers using the posterior mean estimation. For comparison, training a convolutional blind-spot network for 0.5M minibatches achieves 32.39 dB in KODAK. For the masking-based methods, the horizontal axis takes into account the approximately 4× cheaper training compared to our convolutional blind-spot networks. For example, at the x-axis position marked “1” they have been trained for 2M minibatches compared to 0.5M minibatches for our method.

Our tests confirmed that the COPY strategy gave better results when the center pixel was ignored, but the RANDOM strategy gave consistently better results in the Bayesian setting.
COPY probably leads to the network learning to leak some of the center pixel value into the output, which may help by sharpening the output a bit even when done in such an ad hoc fashion. However, our Bayesian approach assumes that no such information leaking occurs, and therefore does not tolerate it.
Focusing on the highest-quality setup with posterior mean estimation and RANDOM masking strategy, we estimate that training to a quality matching 0.5M minibatches with our convolutional blind-spot architecture would require at least 20–100× as much computation due to the loss function sparsity. This is based on a 10× longer masking-based training run still not reaching comparable output quality, see Figure 3.

4.2 Poisson noise

In our second experiment we consider Poisson noise which is an interesting practical case as it can be used to model the photon noise in imaging sensors. We denote the maximum event count as λ and implement the noise as yi = Poisson(λxi)/λ where i is the color channel and xi ∈ [0, 1] is the clean color component. For denoising, we follow the common approach of approximating Poisson noise as signal-dependent Gaussian noise [11]. In this setup, the resulting standard deviation is σi = √(xi/λ) and the corruption model is thus

μy = μx   and   Σy = Σx + λ⁻¹ diag(μx).    (7)

Note that there is a second approximation in this approach — the marginalization over x (Eq. 1) is treated as a convolution with a fixed Gaussian even though p(y|x) should be different for each x. In the formula above, we implicitly take this term to be p(y|μx) which is a good approximation in the common case of Σx being small. Aside from a different corruption model, both training and denoising are equivalent to the Gaussian case (Section 4.1).
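A quick Monte-Carlo sanity check of this signal-dependent Gaussian approximation (our own illustration; the values of λ and x are arbitrary):

```python
import numpy as np

# y = Poisson(lam * x) / lam should have mean ~x and variance ~x / lam,
# matching the signal-dependent Gaussian model sigma_i^2 = x_i / lam.
rng = np.random.default_rng(0)
lam, x = 30.0, 0.4
y = rng.poisson(lam * x, size=1_000_000) / lam

empirical_mean = y.mean()   # close to x = 0.4
empirical_var = y.var()     # close to x / lam = 0.0133...
```

The empirical mean and variance match the model to within sampling error, which is why the Gaussian machinery of Section 4.1 carries over.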
For cases where the noise parameters are unknown, we treat λ⁻¹ as the unknown parameter that is either learned directly or estimated via the auxiliary network, depending on whether the amount of noise is fixed or variable, respectively.

Comparisons  Table 3, top half, shows the image quality results with Poisson noise, and Figure 4, top, shows example result images. Note that even though we internally model the noise as signal-dependent Gaussian noise, we apply true Poisson noise to training and test data. In the case of a fixed amount of noise, our method is within 0.1–0.2 dB from the N2C baseline. Curiously, the case where λ is unknown performs slightly better than the case where it is supplied. This is probably a consequence of the approximations discussed above, and the network may be able to fit the observed noisy distribution better when it is free to choose a different ratio between variance and mean.
In the case of variable noise, our method remains roughly as good when the noise parameters are known, but starts to have trouble when they need to be estimated from data. However, it appears that the problems are mainly concentrated to SET14 where there is a 1.2 dB drop whereas the other test sets suffer by only ∼0.1 dB. The lone culprit for this drop is the POWERPOINT clip art image, where our method fails to estimate the noise level correctly, suffering a hefty 13 dB penalty. Nonetheless, comparing to the “μ only” ablation with L2 loss, i.e., ignoring the center pixel, shows that our method with posterior mean estimation still produces much higher output quality.
Anscombe transform [19] is a classical non-learned baseline for denoising Poisson noise, and for reference we include the results for this method as reported in [17].

Table 3: Image quality results for Poisson and impulse noise.

Noise type           Method                      λ/α known?  KODAK   BSD300   SET14   Average
Poisson λ = 30       Baseline, N2C               no          31.81   30.40    30.45   30.89
                     Baseline, N2N               no          31.80   30.39    30.44   30.88
                     Our                         yes         31.65   30.25    30.29   30.73
                     Our                         no          31.70   30.28    30.35   30.78
                     Our ablated, μ only         no          30.22   28.27    29.03   29.17
                     Anscombe [19] (from [17])   yes         29.15   27.56    28.62   28.36
Poisson λ ∈ [5, 50]  Baseline, N2C               no          31.33   29.91    29.96   30.40
                     Baseline, N2N               no          31.32   29.90    29.96   30.39
                     Our                         yes         31.16   29.75    29.82   30.24
                     Our                         no          31.02   29.69    28.65   29.79
                     Our ablated, μ only         no          29.88   27.95    28.67   28.84
Impulse α = 0.5      Baseline, N2C               no          33.32   31.20    31.42   31.98
                     Baseline, N2N               no          32.88   30.85    30.94   31.56
                     Our                         yes         32.98   30.78    31.06   31.61
                     Our                         no          32.93   30.71    31.09   31.57
                     Our ablated, μ only         no          30.82   28.52    29.05   29.46
Impulse α ∈ [0, 1]   Baseline, N2C               no          31.69   30.27    29.77   30.58
                     Baseline, N2N               no          31.53   30.11    29.51   30.38
                     Our                         yes         31.36   30.00    29.47   30.28
                     Our                         no          31.40   29.98    29.51   30.29
                     Our ablated, μ only         no          27.16   25.55    25.56   26.09

4.3 Impulse noise

Our last example involves impulse noise where each pixel is, with probability α, replaced by a uniformly sampled random color in [0, 1]³. This corruption process is more complex than in the previous cases, as both mean and covariance are modified, and there is a Dirac peak at the clean color value.
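A minimal sampler for this corruption process (our own sketch; the clean color and α are arbitrary):

```python
import numpy as np

# Each pixel is replaced, with probability alpha, by a uniform random
# color in [0, 1]^3; otherwise it keeps its clean value.
rng = np.random.default_rng(1)
alpha, n = 0.5, 1_000_000
clean = np.full((n, 3), 0.2)
replaced = rng.random(n) < alpha
noisy = np.where(replaced[:, None], rng.random((n, 3)), clean)

# The corrupted mean is a mixture of the uniform mean 1/2 and the
# clean value: alpha * 0.5 + (1 - alpha) * 0.2 = 0.35.
empirical_mean = noisy.mean(axis=0)
```

The empirical mean matches the mixture prediction, which is the same moment-matching logic used to derive the training loss below.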
To derive the training loss, we again approximate p(y|Ωy) with a Gaussian, and match its first and second raw moments to the data during training. Because the marginal likelihood is a mixture distribution, its raw moments are obtained by linearly interpolating, with parameter α, between the raw moments of p(x|Ωy) and the raw moments of the uniform random distribution. The resulting mean and covariance are

\mu_y = \frac{\alpha}{2} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + (1 - \alpha)\,\mu_x
\quad\text{and}\quad
\Sigma_y = \frac{\alpha}{12} \begin{bmatrix} 4 & 3 & 3 \\ 3 & 4 & 3 \\ 3 & 3 & 4 \end{bmatrix} + (1 - \alpha)\left(\Sigma_x + \mu_x \mu_x^T\right) - \mu_y \mu_y^T.   (8)

This defines the approximate p(y|Ωy) needed for training the denoiser network. As with the previous noise types, in setups where the parameter α is unknown, we add it as a learned parameter or estimate it via a simultaneously trained auxiliary network. The unnormalized posterior is

p(y|x)\, p(x|\Omega_y) = \bigl(\alpha + (1 - \alpha)\,\delta(y - x)\bigr)\, f(x; \mu_x, \Sigma_x)
                       = \alpha f(x; \mu_x, \Sigma_x) + (1 - \alpha)\,\delta(y - x)\, f(x; \mu_x, \Sigma_x),   (9)

from which we obtain the posterior mean:

\mathbb{E}_x[p(x|y, \Omega_y)] = \frac{\alpha \mu_x + (1 - \alpha) f(y; \mu_x, \Sigma_x)\, y}{\alpha + (1 - \alpha) f(y; \mu_x, \Sigma_x)}.   (10)

Looking at this formula, we can see that the result is a linear interpolation between the mean µx predicted by the network and the potentially corrupted observed pixel value y. Informally, we can reason that the less likely the observed value y is to be drawn from the predicted distribution N(µx, Σx), the more likely it is to be corrupted, and therefore its weight is low compared to the predicted mean µx.
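For concreteness, Eq. (8) and the posterior mean of Eq. (10) can be sketched in NumPy for a single pixel (a minimal sketch under our reading of the formulas; function names and test values are ours):

```python
import numpy as np

def impulse_marginal_moments(mu_x, sigma_x, alpha):
    """Eq. (8): interpolate raw moments between the predicted clean-signal Gaussian
    and the uniform distribution on [0, 1]^3, then convert back to mean/covariance."""
    m2_uniform = (np.full((3, 3), 3.0) + np.eye(3)) / 12.0  # E[u u^T] for u ~ U[0, 1]^3
    mu_y = alpha * 0.5 * np.ones(3) + (1.0 - alpha) * mu_x
    sigma_y = (alpha * m2_uniform
               + (1.0 - alpha) * (sigma_x + np.outer(mu_x, mu_x))
               - np.outer(mu_y, mu_y))
    return mu_y, sigma_y

def gaussian_density(y, mu, sigma):
    """Multivariate normal density f(y; mu, sigma)."""
    d = y - mu
    quad = d @ np.linalg.solve(sigma, d)
    norm = np.sqrt((2.0 * np.pi) ** len(y) * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def posterior_mean(y, mu_x, sigma_x, alpha):
    """Eq. (10): interpolate between the predicted mean and the observed pixel,
    weighted by how plausible the observation is under N(mu_x, sigma_x)."""
    w = (1.0 - alpha) * gaussian_density(y, mu_x, sigma_x)
    return (alpha * mu_x + w * y) / (alpha + w)

mu_x = np.array([0.2, 0.4, 0.6])             # hypothetical per-pixel network prediction
sigma_x = 0.01 * np.eye(3)
y_plausible = np.array([0.21, 0.41, 0.59])   # consistent with the prediction
y_outlier = np.array([0.95, 0.05, 0.80])     # likely an impulse

print(posterior_mean(y_plausible, mu_x, sigma_x, 0.5))  # stays close to the observation
print(posterior_mean(y_outlier, mu_x, sigma_x, 0.5))    # falls back to mu_x
```

Setting α = 0 reduces the posterior mean to the observation y, and α = 1 to the prediction µx, matching the interpolation behavior described in the text.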
On the other hand, when the observed pixel value is consistent with the network prediction, it is weighted more heavily in the output color.

Comparisons Table 3, bottom half, shows the image quality results, and example result images are shown in Figure 4, bottom. The N2N baseline has more trouble with impulse noise than with Gaussian or Poisson noise; note that it cannot be trained with the standard L2 loss because the noise is not zero-mean. Lehtinen et al. [17] recommend annealing from L2 loss to L0 loss in these cases. We experimented with several loss function schedules for N2N, and obtained the best results by annealing the loss exponent from 2 to 0.5 during the first 75% of training and holding it there for the remaining training time. Our method loses to the N2C baseline by ∼0.4 dB in the case of fixed noise, and by ∼0.3 dB with the more difficult variable noise. Notably, our method does not suffer from not knowing the noise parameter α in either case. The ablated "µ only" setups were trained with the same loss schedules as the corresponding N2N baselines and lose to the other methods by multiple dB, highlighting the usefulness of the information in the center pixel for this type of noise.

[Figure 4: Example result images for Poisson (top) and impulse (bottom) noise. Poisson, KODAK-14: noisy input 19.48 dB, N2C baseline 30.33 dB, our (full) 30.24 dB, our (µ only) 28.64 dB. Impulse, KODAK-20: noisy input 9.30 dB, N2C baseline 34.90 dB, our (full) 34.55 dB, our (µ only) 32.13 dB. PSNRs refer to the individual images. The supplement gives additional result images, and the full images are included as PNG files in the supplementary material.]

5 Discussion and future work

Applying Bayesian statistics to denoising has a long history.
Non-local means [3], BM3D [7], and WNNM [9] identify a group of similar pixel neighborhoods and estimate the center pixel's color from those. Deep image prior [27] seeks a representation for the input image that is easiest to model with a convolutional network, often encountering a reasonable noise-free representation along the way. As with self-supervised training, these methods need only the noisy images, but while the explicit block-based methods determine a small number of neighborhoods from the input image alone, a deep denoising model may implicitly identify and regress an arbitrarily large number of neighborhoods from a collection of noisy training data.

Stein's unbiased risk estimator has been used for training deep denoisers for Gaussian noise [26, 21], but these methods leave a larger quality gap to supervised training than ours does. Jena [12] corrupts noisy training data further, and trains a network to reduce the amount of noise to the original level. This network can then iteratively restore images with the original amount of noise. Unfortunately, no comparisons against supervised training are given. Finally, FC-AIDE [4] features an interesting combination of supervised and unsupervised training, where a traditionally trained denoiser network is fine-tuned in an unsupervised fashion for each test image individually.

We have shown, for the first time, that deep denoising models trained in a self-supervised fashion can reach similar quality as comparable models trained using clean reference data, as long as the drawbacks imposed by self-supervision are appropriately remedied. Our method assumes pixel-wise independent noise with a known analytic likelihood model, although we have demonstrated that individual parameters of the corruption model can also be successfully deduced from the noisy data.
Real corrupted images rarely follow theoretical models exactly [10, 18, 25], and an important avenue for future work will be to learn as much of the noise model from the data as possible. By basing the learning exclusively on the dataset of interest, we should also be able to alleviate the concern that the training data (e.g., natural images) deviates from the intended use (e.g., medical images). Experiments with such real-life data will be valuable next steps.

Acknowledgements We thank Arno Solin and Samuel Kaski for helpful comments, and Janne Hellsten and Tero Kuosmanen for the compute infrastructure.

References

[1] J. Batson and L. Royer. Noise2Self: Blind denoising by self-supervision. In Proc. International Conference on Machine Learning (ICML), pages 524–533, 2019.

[2] P. A. Bromiley. Products and convolutions of Gaussian distributions. Technical Report 2003-003, www.tina-vision.net, 2003.

[3] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 60–65, 2005.

[4] S. Cha and T. Moon. Fully convolutional pixel adaptive image denoiser. CoRR, abs/1807.07569, 2018.

[5] G. Chen, F. Zhu, and P. Ann Heng. An efficient statistical method for image noise level estimation. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 477–485, 2015.

[6] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space. In Proc. IEEE International Conference on Image Processing, pages 313–316, 2007.

[7] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.

[8] C. Doersch, A. Gupta, and A. A. Efros.
Unsupervised visual representation learning by context prediction. In Proc. International Conference on Computer Vision (ICCV), pages 1422–1430, 2015.

[9] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2862–2869, 2014.

[10] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang. Toward convolutional blind denoising of real photographs. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1712–1722, 2019.

[11] S. W. Hasinoff. Photon, Poisson noise. In K. Ikeuchi, editor, Computer Vision: A Reference Guide, pages 608–610. Springer US, 2014.

[12] R. Jena. An approach to image denoising using manifold approximation without clean images. CoRR, abs/1904.12323, 2019.

[13] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30 (Proc. NIPS), pages 5574–5584, 2017.

[14] A. Krull, T.-O. Buchholz, and F. Jug. Noise2Void – Learning denoising from single noisy images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2129–2137, 2019.

[15] A. Krull, T. Vicar, and F. Jug. Probabilistic Noise2Void: Unsupervised content-aware denoising. CoRR, abs/1906.00651, 2019.

[16] Q. V. Le, A. J. Smola, and S. Canu. Heteroscedastic Gaussian process regression. In Proc. International Conference on Machine Learning (ICML), pages 489–496, 2005.

[17] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise: Learning image restoration without clean data. In Proc. International Conference on Machine Learning (ICML), 2018.

[18] B. Liu, X. Shu, and X. Wu.
Deep learning with inaccurate training data for image restoration. CoRR, abs/1811.07268, 2018.

[19] M. Mäkitalo and A. Foi. Optimal inversion of the Anscombe transformation in low-count Poisson image denoising. IEEE Transactions on Image Processing, 20(1):99–109, 2011.

[20] X. Mao, C. Shen, and Y. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems 29 (Proc. NIPS), pages 2802–2810, 2016.

[21] C. A. Metzler, A. Mousavi, R. Heckel, and R. G. Baraniuk. Unsupervised learning with Stein's unbiased risk estimator. CoRR, abs/1805.10531, 2018.

[22] D. A. Nix and A. S. Weigend. Estimating the mean and variance of the target probability distribution. In Proc. IEEE International Conference on Neural Networks (ICNN), pages 55–60, 1994.

[23] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 9351:234–241, 2015.

[24] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In Proc. International Conference on Learning Representations (ICLR), 2017.

[25] A. Shocher, N. Cohen, and M. Irani. "Zero-shot" super-resolution using deep internal learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3118–3126, 2018.

[26] S. Soltanayev and S. Y. Chun. Training deep learning based denoisers without ground truth data. In Advances in Neural Information Processing Systems 31 (Proc. NeurIPS), pages 3257–3267, 2018.

[27] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Deep image prior. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9446–9454, 2018.
[28] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems 29 (Proc. NIPS), pages 4790–4798, 2016.

[29] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proc. International Conference on Machine Learning (ICML), pages 1747–1756, 2016.

[30] J. Xu, L. Zhang, D. Zhang, and X. Feng. Multi-channel weighted nuclear norm minimization for real color image denoising. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1105–1113, 2017.

[31] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image restoration. CoRR, abs/1812.10477, 2018.