{"title": "Shepard Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 901, "page_last": 909, "abstract": "Deep learning has recently been introduced to the field of low-level computer vision and image processing. Promising results have been obtained in a number of tasks including super-resolution, inpainting, deconvolution, filtering, etc. However, previously adopted neural network approaches such as convolutional neural networks and sparse auto-encoders are inherently with translation invariant operators. We found this property prevents the deep learning approaches from outperforming the state-of-the-art if the task itself requires translation variant interpolation (TVI). In this paper, we draw on Shepard interpolation and design Shepard Convolutional Neural Networks (ShCNN) which efficiently realizes end-to-end trainable TVI operators in the network. We show that by adding only a few feature maps in the new Shepard layers, the network is able to achieve stronger results than a much deeper architecture. Superior performance on both image inpainting and super-resolution is obtained where our system outperforms previous ones while keeping the running time competitive.", "full_text": "Shepard Convolutional Neural Networks\n\nJimmy SJ. Ren\u2217\n\nSenseTime Group Limited\n\nrensijie@sensetime.com\n\nLi Xu\n\nSenseTime Group Limited\nxuli@sensetime.com\n\nQiong Yan\n\nSenseTime Group Limited\n\nWenxiu Sun\n\nSenseTime Group Limited\n\nyanqiong@sensetime.com\n\nsunwenxiu@sensetime.com\n\nAbstract\n\nDeep learning has recently been introduced to the \ufb01eld of low-level computer\nvision and image processing. Promising results have been obtained in a num-\nber of tasks including super-resolution, inpainting, deconvolution, \ufb01ltering, etc.\nHowever, previously adopted neural network approaches such as convolutional\nneural networks and sparse auto-encoders are inherently with translation invariant\noperators. 
We found that this property prevents deep learning approaches from outperforming the state of the art when the task itself requires translation variant interpolation (TVI). In this paper, we draw on Shepard interpolation and design Shepard Convolutional Neural Networks (ShCNN), which efficiently realize end-to-end trainable TVI operators in the network. We show that by adding only a few feature maps in the new Shepard layers, the network is able to achieve stronger results than a much deeper architecture. Superior performance on both image inpainting and super-resolution is obtained: our system outperforms previous ones while keeping the running time competitive.

1 Introduction

In the past few years, deep learning has been very successful in addressing many visual perception problems such as image classification, object detection, and face recognition [1, 2, 3], to name a few. Inspired by these breakthroughs in high-level computer vision, several attempts have been made very recently to apply deep learning methods to low-level vision and image processing tasks. Encouraging results have been obtained in a number of tasks including image super-resolution [4], inpainting [5], denoising [6], image deconvolution [7], dirt removal [8], and edge-aware filtering [9]. Powerful models with multiple layers of nonlinearity, such as convolutional neural networks (CNNs) and sparse auto-encoders, were used in these previous studies. Notwithstanding the rapid progress and promising performance, we notice that the building blocks of these models are inherently translation invariant when applied to images. This property makes the network architecture less efficient in handling translation variant operators, exemplified by the image interpolation operation.

Figure 1 illustrates the problem of image inpainting, a typical translation variant interpolation (TVI) task. 
The black region in figure 1(a) indicates the missing region; four selected patches with missing parts are visualized in figure 1(b). The interpolation for the central pixel of each patch is carried out by four different weighting functions, shown at the bottom of figure 1(b). This process cannot be modeled by a single kernel because of its inherent spatially varying nature.

In fact, TVI operations are common in many vision applications. Image super-resolution, which aims to interpolate a high resolution image from a low resolution observation, suffers from the same problem: different local patches have different patterns of anchor points. We will show that it is therefore suboptimal to use a traditional convolutional neural network to perform the translation variant operations in the super-resolution task.

∗Project page: http://www.deeplearning.cc/shepardcnn

Figure 1: Illustration of translation variant interpolation. (a) The application of inpainting; the black regions indicate the missing part. (b) Four selected patches; the bottom row shows the kernels for interpolating the central pixel of each patch.

In this paper, we draw on the Shepard method [10] and devise a novel CNN architecture named Shepard Convolutional Neural Networks (ShCNN), which efficiently equips a conventional CNN with the ability to learn translation variant operations for irregularly spaced data. By adding only a few feature maps in the new Shepard layer and optimizing a more powerful TVI procedure in an end-to-end fashion, the network is able to achieve stronger results than a much deeper architecture. 
We demonstrate that the resulting system is general enough to benefit a number of applications involving TVI operations.

2 Related Work

Deep learning methods have recently been introduced to the area of low-level computer vision and image processing. Burger et al. [6] used a simple multi-layer neural network to directly learn a mapping between noisy and clean image patches. Xie et al. [5] adopted a sparse auto-encoder and demonstrated its ability to perform blind image inpainting. A three-layer CNN was used in [8] to tackle the problem of raindrops and dirt, demonstrating the ability of CNNs to blindly handle translation variant problems in real-world settings.

Xu et al. [7] advocated the use of generative approaches to guide the design of CNNs for deconvolution tasks. In [9], it was shown that edge-aware filters can be well approximated by CNNs. While it is feasible to use translation invariant operators, such as convolution, to obtain translation variant results within a deep neural network, it is less effective for achieving high quality results in interpolation operations. The first attempt to use a CNN for image super-resolution [4] connected the CNN approach to sparse coding ones, but it failed to beat the state-of-the-art super-resolution system [11]. In this paper, we focus on the design of a deep neural network layer that better fits translation variant interpolation tasks. We note that TVI is an essential step in a wide range of low-level vision applications, including inpainting, dirt removal, noise suppression, and super-resolution.

3 Analysis

Deep learning approaches without an explicit TVI mechanism have generated reasonable results in a few tasks requiring the translation variant property. To some extent, a deep architecture with multiple layers of nonlinearity is expressive enough to approximate certain TVI operations given a sufficient amount of training data. 
It is, however, non-trivial to beat non-CNN based approaches while maintaining high efficiency and simplicity.

To see this, we experimented with the CNN architectures in [4] and [8] and trained a CNN with three convolutional layers on 1 million synthetic corrupted/clean image pairs. Network and training details, as well as concrete statistics of the data, are covered in the experiment section. Typical test images are shown in the left column of figure 2, and the results of this model are displayed in the mid-left column of the same figure. We found that the results are visually very similar to those in [5]: obvious residues of the text are still left in the images. We also experimented with a much deeper network by adding more convolutional layers, effectively replicating the network in [8] 2, 3, and 4 times. Although slight visual differences appear in the results, no fundamental improvement in the missing regions is observed; residue still remains.

A sensible next step is to explicitly inform the network where the missing pixels are, so that the network has the opportunity to find more plausible solutions for TVI operations. For many applications, the underlying mask indicating the regions to process can be detected or is known in advance. Sample applications include image completion/inpainting, image matting, and dirt/impulse noise removal. Other applications, such as sparse point propagation and super-resolution, have masks for the unknown regions by nature.

One way to incorporate the mask into the network is to treat it as an additional channel of the input. We tested this idea with the same network and experimental settings as in the previous trial. The results showed that this additional piece of information did bring improvement, but still remained far from satisfactory in removing the residues. Results are visualized in the mid-right column of figure 2. 
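As a minimal sketch of the mask-as-channel baseline described above (the helper name and shapes are our own illustration, not the paper's implementation), the corrupted image and its binary validity mask can simply be stacked into a two-channel input for an ordinary, translation invariant CNN:

```python
import numpy as np

def mask_as_channel_input(corrupted, mask):
    """Stack a corrupted image and its binary mask (1 = known pixel)
    into a (2, H, W) input tensor for a regular CNN.
    Illustrative helper only; names/shapes are assumptions."""
    assert corrupted.shape == mask.shape
    return np.stack([corrupted, mask.astype(corrupted.dtype)], axis=0)

rng = np.random.default_rng(0)
image = rng.random((48, 48), dtype=np.float32)          # clean 48x48 patch
mask = (rng.random((48, 48)) > 0.2).astype(np.float32)  # 1 = known pixel
x = mask_as_channel_input(image * mask, mask)           # unknowns zeroed
print(x.shape)  # (2, 48, 48)
```

The convolution kernels that consume this input are still applied identically at every location, which is why, as noted above, the extra channel helps but does not fully remove the residues.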
To learn a tractable TVI model, we devise in the next section a novel architecture with an effective mechanism to exploit the information contained in the mask.

4 Shepard Convolutional Neural Networks

We initiate the attempt to leverage the traditional interpolation framework to guide the design of a neural network architecture for TVI. We turn to the Shepard framework [10], which weighs known pixels differently according to their spatial distances to the pixel being processed. Specifically, the Shepard method can be re-written in a convolution form

    J_p = (K ∗ I)_p / (K ∗ M)_p   if M_p = 0,
    J_p = I_p                     if M_p = 1,        (1)

where I and J are the input and output images, respectively, and p indexes the image coordinates. M is the binary indicator: M_p = 0 indicates that the pixel value is unknown. ∗ is the convolution operation. K is the kernel function, with weights inversely proportional to the distance between a pixel with M_p = 1 and the pixel to process. The element-wise division between the convolved image and the convolved mask naturally controls how pixel information is propagated across regions. It enables interpolation for irregularly spaced data and makes the operation translation variant. The key element of the Shepard method affecting the interpolation result is the definition of the convolution kernel. We thus propose a new convolutional layer in the light of the Shepard method, but allow for a more flexible, data-driven kernel design. The layer is referred to as the Shepard interpolation layer.

Figure 2: Comparison between ShCNN and CNN in image inpainting. Input images (Left). Results from a regular CNN (Mid-left). Results from a regular CNN trained with masks (Mid-right). 
Our results (Right).

4.1 The Shepard Interpolation Layer

The feed-forward pass of the trainable interpolation layer can be described by the following equation,

    F^n_i(F^{n-1}, M^n) = σ( Σ_j (K^n_ij ∗ F^{n-1}_j) / (K^n_ij ∗ M^n_j) + b^n ),   n = 1, 2, 3, ...    (2)

where n is the index of layers, the subscript i in F^n_i is the index of feature maps in layer n, and j in F^{n-1}_j indexes the feature maps in layer n−1. F^{n-1} and M^n are the input and the mask of the current layer, respectively; F^{n-1} represents all the feature maps in layer n−1. The K^n_ij are the trainable kernels, shared between the numerator and denominator of the fraction: the same K^n_ij is convolved both with the activations of the previous layer in the numerator and with the mask M^n of the current layer in the denominator. F^{n-1} can be the output feature maps of regular CNN layers, such as a convolutional layer or a pooling layer. It can also be a previous Shepard interpolation layer, which is itself a function of F^{n-2} and M^{n-1}; Shepard interpolation layers can thus be stacked to form a highly nonlinear interpolation operator. b^n is the bias term and σ is the nonlinearity imposed on the network. F is a smooth and differentiable function, so standard back-propagation can be used to train the parameters.

Figure 3 illustrates our neural network architecture with Shepard interpolation layers. The inputs of a Shepard interpolation layer are images/feature maps as well as masks indicating where interpolation should occur. Note that the interpolation layer can be applied repeatedly to construct more complex interpolation functions with multiple layers of nonlinearity. The mask is a binary map with value one in the known area and zero in the missing area. The same kernel is applied to both the image and the mask. 
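The feed-forward rule in Eq. (2) can be sketched in a few lines of NumPy. This is a didactic, single-output-map sketch rather than the authors' implementation: the zero padding, the small eps guard, and all names are our own assumptions, and unknown pixels in the input maps are assumed to be zero-filled.

```python
import numpy as np

def corr2d_same(x, k):
    """Plain 'same'-size 2-D correlation with zero padding (helper)."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def shepard_layer(F_prev, M, kernels, bias=0.0, eps=1e-8):
    """One output feature map of a Shepard interpolation layer, Eq. (2):
    sigma( sum_j (K_ij * F_j) / (K_ij * M_j) + b ), with ReLU as sigma.
    The same kernel K_ij filters both the feature map and its mask."""
    acc = np.zeros_like(F_prev[0], dtype=float)
    for Fj, Mj, Kij in zip(F_prev, M, kernels):
        num = corr2d_same(Fj, Kij)        # K * F (unknown pixels are zeros)
        den = corr2d_same(Mj, Kij)        # K * M
        acc += num / (den + eps)          # element-wise division
    return np.maximum(acc + bias, 0.0)    # ReLU nonlinearity

# A constant image with one missing pixel: the normalization fills the hole.
M0 = np.ones((8, 8)); M0[4, 4] = 0.0
out = shepard_layer([M0 * 1.0], [M0], [np.ones((3, 3))])
```

In this toy run the input equals its own mask, so numerator and denominator coincide and the missing pixel at (4, 4) is filled with the constant value of its neighbors, illustrating how the mask-normalized convolution adapts to each local pattern of known pixels.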
We note that the mask for layer n + 1 can be generated automatically from the previously convolved mask K^n ∗ M^n, by zeroing out insignificant values and thresholding it. This is important for tasks with relatively large missing areas, such as inpainting, where sophisticated propagation schemes can be learned from data by multi-stage Shepard interpolation layers with nonlinearity. It also offers a flexible way to balance the kernel size against the depth of the network. We refer to a convolutional neural network with Shepard interpolation layers as a Shepard convolutional neural network (ShCNN).

Figure 3: Illustration of ShCNN architecture for multiple layers of interpolation.

4.2 Discussion

Although standard back-propagation can be used, F is a function of K in both the numerator and denominator of the fraction, so the matrix form of the quotient rule for derivatives needs to be used when deriving the back-propagation equations of the interpolation layer. To make the implementation efficient, we unroll the two convolution operations K ∗ F and K ∗ M into two matrix multiplications, denoted W · I and W · M, where I and M are the unrolled versions of F and M, and W is the rearrangement of the kernels with each kernel listed in a single row. E is the error function measuring the distance between the network output and the ground truth; the L2 norm is used to compute this distance. We also denote Z^n = (K^n ∗ F^{n-1}) / (K^n ∗ M^n). The derivative of the error function E with respect to Z^n, δ^n = ∂E/∂Z^n, can be computed in the same way as in previous CNN papers [12, 1]. Once this value is computed, the derivative of E with respect to the kernel W^n_ij connecting the jth node in layer n−1 to the ith node in layer n can be computed by

    ∂E/∂W^n_ij = Σ_m [ ((W^n_ij · M_jm) I_jm − (W^n_ij · I_jm) M_jm) / (W^n_ij · M_jm)^2 ] δ_im,    (3)

where m is the column index in I, M and δ.
The denominator of each element in the outer summation in Eq. 3 is different, so the numerator of each summation element has to be computed separately. While this operation can still be efficiently parallelized by vectorization, it requires significantly more memory and computation than regular CNNs. Although this brings extra workload during training, the new interpolation layer adds only a fraction more computation at test time. As can be seen from Eq. 2, the only added operations are the convolution of the mask with K and the point-wise division. Because the two convolutions share the same kernel, they can be implemented efficiently by convolving samples with a batch size of 2. The computation of the Shepard interpolation layer therefore remains competitive compared to the traditional convolution layer.

We note that it is also natural to integrate the interpolation layer into any previous CNN architecture, because the new layer only adds a mask input to the convolutional layer while keeping all other interfaces the same. The layer can also degenerate to a fully connected layer, since the unrolled version of Eq. 2 merely contains matrix multiplications in the fraction. Therefore, wherever TVI operators are needed in an architecture, and regardless of the type of layer before or after it, the interpolation layer can be seamlessly plugged in.

Last but not least, the interpolation kernels in the layer are learned from data rather than hand-crafted, and are therefore more flexible and potentially more powerful than pre-designed kernels. 
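The quotient-rule gradient in Eq. (3) is easy to sanity-check numerically on the unrolled form Z = (W · I)/(W · M). The following sketch uses our own toy sizes and positive weights (to keep the denominators away from zero) and compares the analytic gradient for one kernel row against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 9, 5                           # unrolled 3x3 kernel, 5 output columns
w = rng.random(k) + 0.5               # one kernel row (positive, for stability)
I = rng.normal(size=(k, m))           # unrolled input patches
M = (rng.random((k, m)) > 0.3) + 0.1  # mask columns, bounded away from zero
delta = rng.normal(size=m)            # upstream gradient dE/dZ

def E(w):
    z = (w @ I) / (w @ M)             # z_m = (w . I_m) / (w . M_m)
    return z @ delta                  # linear in z, so dE/dz = delta

# Analytic gradient, term by term as in Eq. (3):
wM, wI = w @ M, w @ I
grad = ((wM * I - wI * M) / wM**2) @ delta

# Central finite differences for comparison:
h, num = 1e-6, np.zeros(k)
for i in range(k):
    e = np.zeros(k); e[i] = h
    num[i] = (E(w + e) - E(w - e)) / (2 * h)

print(np.max(np.abs(grad - num)))     # tiny: the two gradients agree
```

Note how the per-column denominator (w · M_m)^2 appears once per output position, which is exactly why the paper observes that each summation element must be normalized separately.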
On the other hand, it is end-to-end trainable, so the learned interpolation operators are embedded in the overall optimization objective of the model.

5 Experiments

We conducted experiments on two applications involving TVI: inpainting and super-resolution. The training data was generated by randomly sampling 1 million patches from 1000 natural images scraped from Flickr. Grayscale patches of size 48x48 were used for both tasks to facilitate comparison with previous studies. All PSNR comparisons in the experiments are based on grayscale results; our model can be directly extended to process color images.

5.1 Inpainting

The natural images are contaminated by masks containing text of different sizes and fonts, as shown in figure 2. We assume the binary masks indicating missing regions are known in advance. The ShCNN for inpainting consists of five layers, two of which are Shepard interpolation layers. We use the ReLU function [1] to impose nonlinearity in all our experiments. 4x4 filters are used in the first Shepard layer to generate 8 feature maps, followed by another Shepard interpolation layer with 4x4 filters. The rest of the ShCNN is a conventional CNN architecture: the filters of the third layer are of size 9x9x8 and generate 128 feature maps, 1x1x128 filters are used in the fourth layer, and 8x8 filters carry out the reconstruction of image details. Visual results are shown in the last column of figure 2. The comparison results are generated using the architecture in [8]. More examples are provided on the project webpage.

Figure 4: Visual comparison. Factor 4 upscaling of the butterfly image in Set5 [14]. (a) Ground Truth / PSNR; (b) Bicubic / 22.10dB; (c) KSVD / 23.57dB; (d) NE+LLE / 23.38dB; (e) ANR / 23.52dB; (f) A+ / 24.42dB; (g) SRCNN / 25.07dB; (h) ShCNN / 25.63dB.

5.2 Super Resolution

The quantitative evaluation of super-resolution is conducted on synthetic data, where high resolution images are first downscaled by a factor to generate low resolution patches. To perform super-resolution, we upscale the low resolution patches and zero out the pixels in the upscaled images, leaving one copy of the pixels from the low resolution images. In this regard, super-resolution can be seen as a special form of inpainting with a repeated pattern of missing areas.

Table 1: PSNR comparison on the Set14 [13] image set for upscaling factors 2, 3 and 4, over the images baboon, barbara, bridge, coastguard, comic, face, flowers, foreman, lenna, man, monarch, pepper, ppt3 and zebra. Methods compared: Bicubic, K-SVD [13], NE+NNLS [14], NE+LLE [15], ANR [16], A+ [11], SRCNN [4], and our ShCNN. (The per-image columns of the original table are not recoverable from the extracted text; the labeled average rows give A+ 32.28/29.13/27.32dB, SRCNN 32.18/29.00/27.20dB, and ShCNN 32.48/29.39/27.51dB for factors 2/3/4, respectively.)

We use one Shepard interpolation layer at the top, with kernel size 8x8 and 16 feature maps. The other configurations of the network are the same as in our inpainting network. During training, weights were randomly initialized from a Gaussian distribution with zero mean and standard deviation 0.03. AdaGrad [17] was used in all experiments with a learning rate of 0.001 and a fudge factor of 1e-6. Table 1 shows the quantitative results of our ShCNN on a widely used super-resolution data set [13] for upscaling images by factors of 2, 3 and 4. We compared our method with 7 methods, including the two current state-of-the-art systems [11, 4]. Clear improvement over the state-of-the-art systems can be observed. Visual comparisons between our method and previous methods are illustrated in figure 4 and figure 5.

6 Conclusions

In this paper, we disclosed the limitation of previous CNN architectures in image processing tasks that require translation variant interpolation. A new architecture based on Shepard interpolation was proposed and successfully applied to image inpainting and super-resolution. The effectiveness of
The effectiveness of\n\n7\n\n\f(a) Ground Truth / PSNR\n\n(b) Bicubic / 36.81dB\n\n(c) KSVD / 39.93dB\n\n(d) NE+LLE / 40.00dB\n\n(e) ANR / 40.04dB\n\n(f) A+ / 41.12dB\n\n(g) SRCNN / 40.64dB\n\n(h) ShCNN / 41.30dB\n\nFigure 5: Visual comparison. Factor 2 upscaling of the bird image in Set5 [14].\n\nthe ShCNN with Shepard interpolation layers have been demonstrated by the state-of-the-art perfor-\nmance.\n\nReferences\n\n[1] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In: NIPS. (2012) 1106\u20131114\n\n[2] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,\n\nRabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)\n\n[3] Sun, Y., Liang, D., Wang, X., Tang, X.: Deepid3: Face recognition with very deep neural\n\nnetworks. In: arXiv:1502.00873. (2015)\n\n[4] Dong, C., Loy, C.C., He, K., , Tang, X.: Learning a deep convolutional network for image\n\nsuper-resolution. In: ECCV. (2014)\n\n[5] Xie, J., Xu, L., Chen, E.:\n\nNIPS. (2012)\n\nImage denoising and inpainting with deep neural networks. In:\n\n[6] Burger, H.C., Schuler, C.J., Harmeling, S.:\n\ncompete with bm3d? In: CVPR. (2012)\n\nImage denoising: Can plain neural networks\n\n[7] Xu, L., Ren, J.S., Liu, C., Jia, J.: Deep convolutional neural network for image deconvolution.\n\nIn: NIPS. (2014)\n\n[8] Eigen, D., Krishnan, D., Fergus, R.: Restoring an image taken through a window covered with\n\ndirt or rain. In: ICCV. (2013)\n\n[9] Xu, L., Ren, J.S., Yan, Q., Liao, R., Jia, J.: Deep edge-aware \ufb01lters. In: ICML. (2015)\n[10] Shepard, D.: A two-dimensional interpolation function for irregularly-spaced data. In: 23rd\n\nACM national conference. (1968)\n\n[11] Timofte, R., Smet, V.D., Gool, L.V.: A+: Adjusted anchored neighborhood regression for fast\n\nsuper-resolution. In: ACCV. 
(2014)\n\n[12] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document\n\nrecognition. In: Proceedings of IEEE. (1998)\n\n[13] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations.\n\nCurves and Surfaces 6920 (2012) 711\u2013730\n\n8\n\n\f[14] Bevilacqua, M., Roumy, A., Guillemot, C., Morel, M.L.A.: Low-complexity single-image\n\nsuper-resolution based on nonnegative neighbor embedding. In: BMVC. (2012)\n\n[15] Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding. In: CVPR.\n\n(2004)\n\n[16] Timofte, R., Smet, V.D., Gool, L.V.: Anchored neighborhood regression for fast example-\n\nbased super-resolution. In: ICCV. (2013)\n\n[17] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and s-\n\ntochastic optimization. Journal of Machine Learning Research 12 (2011) 2121\u20132159\n\n9\n\n\f", "award": [], "sourceid": 588, "authors": [{"given_name": "Jimmy", "family_name": "Ren", "institution": "SenseTime Group Limited"}, {"given_name": "Li", "family_name": "Xu", "institution": "SenseTime Group Limited"}, {"given_name": "Qiong", "family_name": "Yan", "institution": "SenseTime Group Limited"}, {"given_name": "Wenxiu", "family_name": "Sun", "institution": "SenseTime Group Limited"}]}