{"title": "Image Inpainting via Generative Multi-column Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 331, "page_last": 340, "abstract": "In this paper, we propose a generative multi-column network for image inpainting. This network synthesizes different image components in a parallel manner within one stage. To better characterize global structures, we design a confidence-driven reconstruction loss while an implicit diversified MRF regularization is adopted to enhance local details. The multi-column network combined with the reconstruction and MRF loss propagates local and global information derived from context to the target inpainting regions. Extensive experiments on challenging street view, face, natural objects and scenes manifest that our method produces visual compelling results even without previously common post-processing.", "full_text": "Image Inpainting via Generative Multi-column\n\nConvolutional Neural Networks\n\nYi Wang1 Xin Tao1,2 Xiaojuan Qi1 Xiaoyong Shen2\n\nJiaya Jia1,2\n\n1The Chinese University of Hong Kong\n\n2YouTu Lab, Tencent\n\n{yiwang, xtao, xjqi, leojia}@cse.cuhk.edu.hk\n\ngoodshenxy@gmail.com\n\nAbstract\n\nIn this paper, we propose a generative multi-column network for image inpainting.\nThis network synthesizes different image components in a parallel manner within\none stage. To better characterize global structures, we design a con\ufb01dence-driven\nreconstruction loss while an implicit diversi\ufb01ed MRF regularization is adopted to\nenhance local details. The multi-column network combined with the reconstruction\nand MRF loss propagates local and global information derived from context to the\ntarget inpainting regions. 
Extensive experiments on challenging street view, face, natural objects and scenes manifest that our method produces visually compelling results even without previously common post-processing.\n\n1\n\nIntroduction\n\nImage inpainting (also known as image completion) aims to estimate suitable pixel information to fill holes in images. It serves various applications such as object removal, image restoration, and image denoising, to name a few. Though studied for many years, it remains an open and challenging problem since it is highly ill-posed. In order to generate realistic structures and textures, researchers resort to auxiliary information, from either surrounding image areas or external data.\nA typical inpainting method exploits pixels under certain patch-wise similarity measures, addressing three important problems: (1) extracting suitable features to evaluate patch similarity; (2) finding neighboring patches; and (3) aggregating the auxiliary information.\n\nFeatures for Inpainting Suitable feature representations are very important for building connections between missing and known areas. In contrast to traditional patch-based methods using hand-crafted features, recent learning-based algorithms learn features from data. From the model perspective, inpainting requires understanding of global information. For example, only by seeing the entire face can the system determine the positions of the eyes and nose, as shown in the top-right of Figure 1. On the other hand, pixel-level details are crucial for visual realism, e.g., the texture of the skin/facade in Figure 1.\nRecent CNN-based methods utilize encoder-decoder networks [18, 25, 24, 9, 26] to extract features and achieve impressive results. 
But there is still much room to treat features as a group of different components and to combine global semantics with local textures.\n\nReliable Similar Patches\nIn both exemplar-based [7, 8, 4, 21, 10, 11, 2] and recent learning-based methods [18, 25, 24, 9, 26], explicit nearest-neighbor search is one of the key components for generating realistic details. When missing areas originally contain structures different from the context, the found neighbors may harm the generation process. Moreover, nearest-neighbor search during testing is time-consuming. Unlike these solutions, in this paper we apply the search only in the training phase, with an improved similarity measure. Testing is very efficient without the need for post-processing.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Our inpainting results on building, face, and natural scene.\n\nSpatial-variant Constraints Another important issue is that inpainting can take multiple candidates to fill holes. Thus, optimal results should be constrained in a spatially variant way \u2013 pixels close to the area boundary have few choices, while the central part can be less constrained. In fact, adversarial loss has already been used in recent methods [18, 25, 24, 9, 26] to learn this multi-modality. Various weights are applied to the loss [18, 25, 26] for boundary consistency. In this paper, we design a new spatially variant weight to better handle this issue.\nThe overall framework is a Generative Multi-column Convolutional Neural Network (GMCNN) for image inpainting. The multi-column structure [3, 27, 1] is used since it can decompose images into components with different receptive fields and feature resolutions. 
Unlike multi-scale or coarse-to-fine strategies [24, 12] that use resized images, the branches in our multi-column network directly use the full-resolution input to characterize multi-scale feature representations of global and local information. A new implicit diversified Markov random field (ID-MRF) term is proposed and used in the training phase only. Rather than directly using the matched features, which may lead to visual artifacts, we incorporate this term as a regularization.\nAdditionally, we design a new confidence-driven reconstruction loss that constrains the generated content according to spatial location. With all these improvements, the proposed method can produce high-quality results with boundary consistency, structure suitability and texture similarity, without any post-processing operations. Example inpainting results are given in Figure 1.\n\n2 Related Work\n\nExemplar-based Inpainting Among traditional methods, exemplar-based inpainting [7, 8, 4, 21, 10, 11, 2] copies and pastes matching patches in a pre-defined order. To preserve structure, patch priority computation specifies the patch filling order [4, 7, 8, 21]. With only low-level information, these methods cannot produce high-quality semantic structures that do not exist in the examples, e.g., faces and facades.\nCNN Inpainting Since the seminal context-encoder work [18], deep CNNs have achieved significant progress. Pathak et al. [18] proposed training an encoder-decoder CNN by minimizing pixel-wise reconstruction loss and adversarial loss. Built upon the context encoder, in [9], global and local discriminators helped improve the adversarial loss, where a fully convolutional encoder-decoder structure was adopted. Besides encoder-decoders, U-net-like structures were also used [23].\nYang et al. [24] and Yu et al. [26] introduced coarse-to-fine CNNs for image inpainting. 
To generate more plausible and detailed textures, a combination of a CNN and a Markov random field [24] was used as a post-process to improve the inpainting results of the coarse CNN. It is inevitably slow due to iterative MRF inference. Lately, Yu et al. [26] conducted nearest-neighbor search in a deep feature space, which brings clearer texture to the filling regions compared with previous single-forward-pass strategies.\n\n3 Our Method\n\nOur inpainting system is trainable in an end-to-end fashion. It takes an image X and a binary region mask M (with value 0 for known pixels and 1 otherwise) as input; unknown regions in image X are filled with zeros. It outputs a complete image \u02c6Y. We detail our network design below.\n\nFigure 2: Our framework.\n\n3.1 Network Structure\n\nOur proposed Generative Multi-column Convolutional Neural Network (GMCNN), shown in Figure 2, consists of three sub-networks: a generator to produce results, global and local discriminators for adversarial training, and a pretrained VGG network [20] to calculate the ID-MRF loss. In the testing phase, only the generator network is used.\nThe generator network consists of n (n = 3) parallel encoder-decoder branches that extract different levels of features from the input X with mask M, and a shared decoder module that transforms the deep features into the natural image space \u02c6Y. We choose various receptive fields and spatial resolutions for these branches, as shown in Figure 2, which capture different levels of information. The branches are denoted as {fi(\u00b7)} (i \u2208 {1, 2, ..., n}) and are trained in a data-driven manner to generate better feature components than a handcrafted decomposition.\nThen these components are up-sampled (bilinearly) to the original resolution and concatenated into a feature map F. We further transform F into image space via a shared decoding module with 2 convolutional layers, denoted as d(\u00b7). The output is \u02c6Y = d(F). 
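As a minimal sketch of this fusion step (NumPy only; nearest-neighbor upsampling stands in for the bilinear upsampling used in the paper, and the helper names are ours, not the authors'):

```python
import numpy as np

def upsample_nn(x, factor):
    # Nearest-neighbor upsampling of an (H, W, C) feature map.
    # The paper upsamples bilinearly; nearest-neighbor keeps the sketch
    # dependency-free and behaves the same shape-wise.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_branches(branch_features, full_res):
    # Upsample each branch output f_i(X) to the input resolution and
    # concatenate along channels into the feature map F.
    ups = [upsample_nn(f, full_res // f.shape[0]) for f in branch_features]
    return np.concatenate(ups, axis=-1)  # F, to be decoded by d(.)

# Toy branch outputs at three spatial resolutions (channel counts illustrative).
feats = [np.ones((4, 4, 2)), np.ones((8, 8, 3)), np.ones((16, 16, 4))]
F = fuse_branches(feats, full_res=16)
print(F.shape)  # (16, 16, 9)
```

The shared decoder d(.) then maps the concatenated channels back to image space, which is why the branches influence one another during training.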
Minimizing the difference between \u02c6Y and Y makes {fi(\u00b7)}i=1,...,n capture appropriate components of X for inpainting, and d(\u00b7) further transforms such deep features into our desired result. Note that although the fi(\u00b7) seem independent of each other, they are mutually influenced during training due to d(\u00b7).\n\nAnalysis Our framework is by nature different from the commonly used one-stream encoder-decoder structure and the coarse-to-fine architecture [24, 26, 12]. The encoder-decoder transforms the image into a common feature space with a single-size receptive field, ignoring the fact that inpainting involves different levels of representation. The multi-branch encoders in our GMCNN, by contrast, do not have this problem. Our method also overcomes the limitation of the coarse-to-fine architecture, which paints the missing pixels from small to large scales, where errors at the coarse level already influence the refinement. Our GMCNN incorporates different structures in parallel; they complement each other instead of simply inheriting information.\n\n3.2\n\nID-MRF Regularization\n\nHere, we address the aforementioned issues of semantic structure matching and computationally heavy iterative MRF optimization. Our scheme is to apply an MRF-like regularization only in the training phase, named implicit diversified Markov random fields (ID-MRF). The proposed network is optimized to minimize the difference between generated content and the corresponding nearest neighbors from the ground truth in a feature space. Since we only use it in training, complete ground truth images make it possible to find high-quality nearest neighbors and give appropriate constraints to the network.\nTo calculate the ID-MRF loss, it is possible to simply use a direct similarity measure (e.g., cosine similarity) to find the nearest neighbors for patches in the generated content. 
But this procedure tends to yield smooth structure, as a flat region easily connects to similar patterns and quickly reduces structure variety, as shown in Figure 3(a).\n\nFigure 3: Using different similarity measures to search for the nearest neighbors. (a) Inpainting results using cosine similarity. (b) Inpainting results using our relative similarity. (c) Ground truth image, where the red rectangle highlights the filling region (best viewed in original resolution and with color).\n\nWe instead adopt a relative distance measure [17, 16, 22] to model the relation between local features and a target feature set. It can restore subtle details, as illustrated in Figure 3(b).\nSpecifically, let \u02c6Yg be the generated content for the missing regions, and let \u02c6Y^L_g and Y^L be the features generated by the Lth feature layer of a pretrained deep model. For neural patches v and s extracted from \u02c6Y^L_g and Y^L respectively, the relative similarity from v to s is defined as\n\nRS(v, s) = exp((\u00b5(v, s) / (max_{r \u2208 \u03c1_s(Y^L)} \u00b5(v, r) + \u03b5)) / h),  (1)\n\nwhere \u00b5(\u00b7,\u00b7) is the cosine similarity, r \u2208 \u03c1_s(Y^L) means r belongs to Y^L excluding s, and h and \u03b5 are two positive constants. If v is more like s than the other neural patches in Y^L, RS(v, s) turns large.\nNext, RS(v, s) is normalized as\n\nRS\u0304(v, s) = RS(v, s) / \u03a3_{r \u2208 \u03c1_s(Y^L)} RS(v, r).  (2)\n\nFinally, with Eq. (2), the ID-MRF loss between \u02c6Y^L_g and Y^L is defined as\n\nLM(L) = \u2212log((1/Z) \u03a3_{s \u2208 Y^L} max_{v \u2208 \u02c6Y^L_g} RS\u0304(v, s)),  (3)\n\nwhere Z is a normalization factor. For each s \u2208 Y^L, \u02c6v = arg max_{v \u2208 \u02c6Y^L_g} RS\u0304(v, s) means \u02c6v is closer to s than the other neural patches in \u02c6Y^L_g are. In the extreme case that all neural patches in \u02c6Y^L_g are close to one patch s, the other patches r have their max_v RS\u0304(v, r) small. 
So LM(L) is large.\nOn the other hand, when the patches in \u02c6Y^L_g are close to different candidates in Y^L, each r in Y^L has its unique nearest neighbor in \u02c6Y^L_g. The resulting max_{v \u2208 \u02c6Y^L_g} RS\u0304(v, r) is thus big and LM(L) becomes small. We show one example in the supplementary file. From this perspective, minimizing LM(L) encourages each v in \u02c6Y^L_g to approach different neural patches in Y^L, diversifying the neighbors, as shown in Figure 3(b).\nAn obvious benefit of this measure is to improve the similarity between the feature distributions of \u02c6Y^L_g and Y^L. By minimizing the ID-MRF loss, not only do local neural patches in \u02c6Y^L_g find corresponding candidates from Y^L, but the feature distributions also come closer, helping capture the variation in complicated textures.\nOur final ID-MRF loss is computed on several feature layers of VGG19. Following common practice [5, 14], we use conv4_2 to describe image semantic structures. Then conv3_2 and conv4_2 are utilized to describe image texture as\n\nLmrf = LM(conv4_2) + \u03a3_{t=3}^{4} LM(convt_2).  (4)\n\nMore Analysis During training, ID-MRF regularizes the generated content based on the reference. It has a strong ability to create realistic texture both locally and globally. We note the fundamental difference from the methods of [24, 26], where nearest-neighbor search via networks is employed in the testing phase. Our ID-MRF regularization exploits both reference and contextual information inside and outside the filling regions, and thus yields high diversity in inpainting structure generation.\n\n3.3\n\nInformation Fusion\n\nSpatial Variant Reconstruction Loss Pixel-wise reconstruction loss is important for inpainting [18, 25, 26]. 
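The relative-similarity computation of Eqs. (1)-(3) can be condensed into a small NumPy sketch. The values of h and \u03b5 and the choice Z = |Y^L| below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def id_mrf_loss(V, S, h=0.5, eps=1e-5):
    # V: (nv, d) neural patches from the generated features.
    # S: (ns, d) neural patches from the ground-truth features.
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    mu = Vn @ Sn.T                       # cosine similarity mu(v, s)

    # Eq. (1): denominator is max over r != s of mu(v, r); where s itself is
    # the best match, fall back to the second-best value.
    srt = np.sort(mu, axis=1)
    best, second = srt[:, -1:], srt[:, -2:-1]
    denom = np.where(np.isclose(mu, best), second, best) + eps
    RS = np.exp((mu / denom) / h)

    # Eq. (2): normalize each RS(v, s) over r != s.
    RS_bar = RS / (RS.sum(axis=1, keepdims=True) - RS)

    # Eq. (3): -log of the per-s best responses, with Z = ns as an assumption.
    return -np.log(RS_bar.max(axis=0).sum() / S.shape[0])

S = np.array([[1.0, 0.2, 0.0], [0.0, 1.0, 0.2], [0.2, 0.0, 1.0]])
diverse = id_mrf_loss(S, S)                        # each patch has its own match
collapsed = id_mrf_loss(np.tile(S[0], (3, 1)), S)  # all patches match one target
assert diverse < collapsed                         # diversity lowers the loss
```

The final assertion illustrates the point of the section: when generated patches collapse onto a single ground-truth patch, the loss is higher than when they spread over distinct nearest neighbors.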
To exert constraints based on spatial location, we design the confidence-driven reconstruction loss, where unknown pixels close to the filling boundary are more strongly constrained than those away from it. We set the confidence of known pixels to 1 and that of unknown ones according to the distance to the boundary. To propagate the confidence of known pixels to unknown ones, we use a Gaussian filter g to convolve M to create a loss weight mask Mw as\n\nM^i_w = (g \u2217 M\u0304^i) \u2299 M, where M\u0304^i = 1 \u2212 M + M^{i\u22121}_w and M^0_w = 0,  (5)\n\nwhere g is of size 64 \u00d7 64 with standard deviation 40, and \u2299 is the Hadamard product operator. Eq. (5) is repeated several times to generate Mw. The final reconstruction loss is\n\nLc = ||(Y \u2212 G([X, M]; \u03b8)) \u2299 Mw||1,  (6)\n\nwhere G([X, M]; \u03b8) is the output of our generative model G, and \u03b8 denotes the learnable parameters. Compared with the reconstruction loss used in [18, 25, 26], ours exploits spatial locations and their relative order by considering confidence on both known and unknown pixels. It has the effect of gradually shifting the learning focus from the filling border to the center and of smoothing the learning curve.\n\nAdversarial Loss Adversarial loss is a catalyst for filling missing regions and has become common in many generation tasks. Similar to [9, 26], we apply the improved Wasserstein GAN [6] and use local and global discriminators. 
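The confidence mask of Eq. (5) can be sketched as follows. The kernel size 64 and standard deviation 40 follow the text, while the iteration count and the separable-convolution implementation are our assumptions:

```python
import numpy as np

def gaussian_1d(size, sigma):
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, k):
    # Separable 2-D convolution with 'same' padding (rows, then columns).
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, tmp)

def confidence_mask(M, iters=5, size=64, sigma=40.0):
    # Eq. (5): M_w^i = (g * (1 - M + M_w^{i-1})) (Hadamard) M, with M_w^0 = 0.
    # M is 1 on unknown pixels and 0 on known ones.
    k = gaussian_1d(size, sigma)
    Mw = np.zeros_like(M, dtype=float)
    for _ in range(iters):
        Mw = blur(1.0 - M + Mw, k) * M
    return Mw

M = np.zeros((128, 128))
M[32:96, 32:96] = 1.0                  # a 64 x 64 hole
Mw = confidence_mask(M)
assert Mw[0, 0] == 0                   # weights live only on unknown pixels
assert Mw[32, 64] > Mw[64, 64] > 0     # hole boundary is constrained more
```

Each iteration diffuses confidence from the known region one step further into the hole, which produces the border-to-center shift of learning focus described above.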
For the generator, the adversarial loss is defined as\n\nLadv = \u2212E_{X\u223cP_X}[D(G([X, M]; \u03b8))] + \u03bb_gp E_{\u02c6X\u223cP_\u02c6X}[(||\u2207_\u02c6X D(\u02c6X) \u2299 Mw||_2 \u2212 1)^2],  (7)\n\nwhere \u02c6X = tG([X, M]; \u03b8) + (1 \u2212 t)Y and t \u2208 [0, 1].\n\n3.4 Final Objective\n\nWith the confidence-driven reconstruction loss, the ID-MRF loss, and the adversarial loss, the model objective of our net is defined as\n\nL = Lc + \u03bbmrf Lmrf + \u03bbadv Ladv,  (8)\n\nwhere \u03bbmrf and \u03bbadv are used to balance the effects of local structure regularization and adversarial training.\n\n3.5 Training\n\nWe first train our model with only the confidence-driven reconstruction loss, setting \u03bbmrf and \u03bbadv to 0, to stabilize the later adversarial training. After our model G converges, we set \u03bbmrf = 0.05 and \u03bbadv = 0.001 for fine-tuning until convergence. The training procedure is optimized using the Adam solver [13] with learning rate 1e\u22124. We set \u03b21 = 0.5 and \u03b22 = 0.9. The batch size is 16.\nFor an input image Y, a binary image mask M (with value 0 for known and 1 for unknown pixels) is sampled at a random location. The input image X is produced as X = Y \u2299 (1 \u2212 M). Our model G takes the concatenation of X and M as input. The final prediction is \u02c6Y = Y \u2299 (1 \u2212 M) + G([X, M]) \u2299 M. 
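The mask sampling, input construction, and final compositing described above can be sketched as follows (a minimal NumPy version with hypothetical helper names):

```python
import numpy as np

def make_training_pair(Y, hole_h, hole_w, rng):
    # Sample a random rectangular hole mask M (1 = unknown), build
    # X = Y (Hadamard) (1 - M), and return the network input [X, M].
    H, W, _ = Y.shape
    top = rng.integers(0, H - hole_h + 1)
    left = rng.integers(0, W - hole_w + 1)
    M = np.zeros((H, W, 1), dtype=Y.dtype)
    M[top:top + hole_h, left:left + hole_w] = 1.0
    X = Y * (1.0 - M)                   # unknown pixels filled with zeros
    return np.concatenate([X, M], axis=-1), M

def composite(Y, G_out, M):
    # Final prediction: known pixels from Y, generated content in the hole.
    return Y * (1.0 - M) + G_out * M

rng = np.random.default_rng(0)
Y = np.ones((256, 256, 3))                     # stand-in for a [-1, 1] image
net_in, M = make_training_pair(Y, 128, 128, rng)
pred = composite(Y, -np.ones_like(Y), M)       # stand-in generator output
assert net_in.shape == (256, 256, 4)           # 3 image channels + 1 mask channel
assert pred[M[..., 0] == 1].min() == -1 and pred[M[..., 0] == 0].max() == 1
```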
All input and output are linearly scaled to the range [\u22121, 1].\n\nTable 1: Quantitative results on the testing datasets (PSNR / SSIM).\n\nMethod | Paris street view-100 | ImageNet-200 | Places2-2K | CelebA-HQ-2K\nCE [18] | 23.49 / 0.8732 | 23.56 / 0.9105 | \u2212 | \u2212\nMSNPS [24] | 24.44 / 0.8477 | 20.62 / 0.7217 | \u2212 | \u2212\nCA [26] | 23.78 / 0.8588 | 22.44 / 0.8917 | 20.03 / 0.8539 | 23.98 / 0.9441\nOurs | 24.65 / 0.8650 | 22.43 / 0.8939 | 20.16 / 0.8617 | 25.70 / 0.9546\n\n4 Experiments\n\nWe evaluate our method on five datasets: Paris street view [18], Places2 [28], ImageNet [19], CelebA [15], and CelebA-HQ [12].\n\n4.1 Experimental Settings\n\nWe train our models on the training sets and evaluate them on the testing set (for Paris street view) or validation sets (for Places2, ImageNet, CelebA, and CelebA-HQ). In training, we use images of resolution 256 \u00d7 256 with a largest hole size of 128 \u00d7 128 at random positions. For Paris street view, Places2, and ImageNet, 256 \u00d7 256 images are randomly cropped and scaled from the full-resolution images. For the CelebA and CelebA-HQ face datasets, images are scaled to 256 \u00d7 256. None of the results given in this paper are post-processed.\nOur implementation uses Tensorflow v1.4.1, CUDNN v6.0, and CUDA v8.0. The hardware is an Intel E5 CPU (2.60GHz) and a TITAN X GPU. Our model costs 49.37ms and 146.11ms per image on the GPU for testing images of size 256 \u00d7 256 and 512 \u00d7 512, respectively. Using ID-MRF in the training phase costs 784ms more per batch (with 16 images of 256 \u00d7 256 \u00d7 3 pixels). The total number of parameters of our generator network is 12.562M.\n\n4.2 Qualitative Evaluation\n\nAs shown in Figures 8 and 10, compared with other methods, ours gives obvious visual improvement in plausible image structures and crisp textures. 
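For reference, the PSNR figures in Table 1 follow the standard definition; a minimal sketch (the peak value and the test values are illustrative):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    # Peak signal-to-noise ratio in dB; assumes a != b so that mse > 0.
    mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 10.0)      # mse = 100
print(round(psnr(a, b), 2))    # 28.13
```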
The more reasonable generated structures mainly stem from the multi-column architecture and the confidence-driven reconstruction loss. The realistic textures are created via ID-MRF regularization and adversarial training, by leveraging the contextual and corresponding textures.\nIn Figure 9, we show sample results of our method and CA [26] on the CelebA and CelebA-HQ face datasets. Since we do not apply the MRF in a non-parametric manner, visual artifacts are much reduced. It is notable that finding suitable patches for these faces is challenging; our ID-MRF regularization remedies the problem. Even face shadows and reflectance can be generated, as shown in Figure 9.\nAlso, our model is trained with square masks of arbitrary location and size. It thus generalizes to completing regions of different shapes, as shown in Figures 10 and 1. More inpainting results are on our project website.\n\n4.3 Quantitative Evaluation\n\nAlthough generation tasks are not well suited to evaluation by peak signal-to-noise ratio (PSNR) or structural similarity (SSIM), for completeness we still report them on the testing or validation sets of four of the datasets for reference. For ImageNet, only 200 images are randomly chosen for evaluation, since MSNPS [24] takes minutes to complete a 256 \u00d7 256 image. As shown in Table 1, our method produces decent results with comparable or better PSNR and SSIM.\nWe also conduct user studies, as shown in Table 2. The protocol is based on large batches of blind randomized A/B tests deployed on the Google Forms platform. Each survey involves a batch of 40 pairwise comparisons, where each pair contains two images completed from the same corrupted input by two different methods. 40 participants were invited for the user study. The participants are asked to select the more realistic image in each pair. The images are all shown at the same resolution (256\u00d7256). 
The comparisons are randomized across conditions, and the left-right order is randomized.\n\nTable 2: Results of the user study. Each entry is the percentage of cases where results by our approach are judged more realistic than those of another solution.\n\n | Paris street view | ImageNet | Places2 | CelebA | CelebA-HQ\nGMCNN > CE [18] | 98.1% | 88.3% | - | - | -\nGMCNN > MSNPS [24] | 94.4% | 86.5% | - | - | -\nGMCNN > CA [26] | 84.2% | 78.5% | 69.6% | 99.0% | 93.8%\n\nAll images are shown for an unlimited time, and the participant is free to spend as much time as desired on each pair. In all conditions, our method outperforms the baselines.\n\n4.4 Ablation Study\n\nFigure 4: Visual comparison of CNNs with different structures. (a) Input image. (b) Single encoder-decoder. (c) Coarse-to-fine structure [26]. (d) GMCNN with a fixed receptive field in all branches. (e) GMCNN with varied receptive fields.\n\nSingle Encoder-Decoder vs. Coarse-to-Fine vs. GMCNN We evaluate our multi-column architecture by comparing it with a single encoder-decoder and a coarse-to-fine network with two sequential encoder-decoders (the same as in [26] except without the contextual layer). The single encoder-decoder is just the same as our branch three (B3). To minimize the influence of model capacity, we triple the filter sizes in the single encoder-decoder architecture to make its parameter size as close to ours as possible. The loss for these three structures is the same, including the confidence-driven reconstruction loss, the ID-MRF loss, and the WGAN-GP adversarial loss. The corresponding hyper-parameters are the same. The testing results are shown in Figure 4. Our GMCNN structure with varied receptive fields in each branch predicts reasonable image structure and texture compared with the single encoder-decoder and the coarse-to-fine structure. 
Additional quantitative experiments are given in Table 3, showing that the proposed structure is beneficial for restoring image fidelity.\n\nTable 3: Quantitative results of different structures on the Paris street view dataset (ED: encoder-decoder, -f/-v: fixed/varied receptive fields).\n\nModel | ED | Coarse-to-fine | GMCNN-f | GMCNN-v w/o ID-MRF | GMCNN-v\nPSNR | 23.75 | 23.63 | 24.36 | 24.62 | 24.65\nSSIM | 0.8580 | 0.8597 | 0.8644 | 0.8657 | 0.8650\n\nVaried Receptive Fields vs. Fixed Receptive Field We then validate the necessity of using varied receptive fields in the branches. The GMCNN with the same receptive field in each branch uses three identical copies of the third branch in Figure 2 with filter size 5 \u00d7 5. Figure 4 shows that, within the GMCNN structure, branches with varied receptive fields give visually more appealing results.\nSpatial Discounted Reconstruction Loss vs. Confidence-Driven Reconstruction Loss We compare our confidence-driven reconstruction loss with the alternative spatial discounted reconstruction loss [26]. We use a single-column CNN trained with only the respective reconstruction loss on the Paris street view dataset. The testing results are given in Figure 5. Our confidence-driven reconstruction loss works better.\nWith and without ID-MRF Regularization We train a complete GMCNN on the Paris street view dataset with all losses, and one model that does not involve ID-MRF. As shown in Figure 6, ID-MRF can significantly enhance local details. Table 4 and Figure 7 show qualitatively and quantitatively how \u03bbmrf affects inpainting performance. Empirically, \u03bbmrf = 0.02 \u223c 0.05 strikes a good balance.\n\nFigure 5: Visual comparisons of different reconstruction losses. (a) Input image. (b) Spatial discounted loss [26]. 
(c) Confidence-driven reconstruction loss.\n\nFigure 6: Visual comparison of results with and without ID-MRF. (a) Input image. (b) Results using ID-MRF. (c) Results without ID-MRF.\n\nFigure 7: Visual comparison of results using ID-MRF with different \u03bbmrf. (a) Input image. (b) \u03bbmrf = 2. (c) \u03bbmrf = 0.2. (d) \u03bbmrf = 0.02. (e) \u03bbmrf = 0.002.\n\nTable 4: Quantitative results on how ID-MRF regularizes inpainting performance.\n\n\u03bbmrf | 2 | 0.2 | 0.02 | 0.002\nPSNR | 24.62 | 24.53 | 24.64 | 24.36\nSSIM | 0.8659 | 0.8652 | 0.8654 | 0.8640\n\n5 Conclusion\n\nWe have primarily addressed the important problems of representing visual context and using it to generate and constrain unknown regions in inpainting. We have proposed a generative multi-column neural network for this task and shown its ability to model different image components and extract multi-level features. Additionally, the ID-MRF regularization, with its new similarity measure, is very helpful for modeling realistic texture. Our confidence-driven reconstruction loss also accounts for spatially variant constraints. Our future work will explore other constraints based on location and content.\n\nLimitations Similar to other generative neural networks for inpainting [18, 24, 26, 25], our method still has difficulties dealing with large-scale datasets containing thousands of diverse object and scene categories, such as ImageNet. Our method works best when the data fall into a few categories, since the ambiguity in structure and texture can be removed in these cases.\n\nFigure 8: Visual comparisons on Paris street view (top) and ImageNet (bottom). (a) Input image. (b) CE [18]. (c) MSNPS [24]. (d) CA [26]. 
(e) Our results (best viewed in higher resolution).\n\nFigure 9: Visual comparisons on CelebA (left) and CelebA-HQ (right). (a) Input image. (b) CA [26]. (c) Our results.\n\nFigure 10: Visual comparisons on Places2 for 512 \u00d7 680 images with random masks. (a) Input image. (b) Results by CA [26]. (c) Our results.\n\nReferences\n[1] F. Agostinelli, M. R. Anderson, and H. Lee. Adaptive multi-column deep neural networks with application to robust image denoising. In NIPS, pages 1493\u20131501, 2013.\n[2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. TOG, 28(3):24, 2009.\n[3] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, pages 3642\u20133649. IEEE, 2012.\n[4] A. Criminisi, P. P\u00e9rez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. TIP, 13(9):1200\u20131212, 2004.\n[5] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, pages 2414\u20132423. IEEE, 2016.\n[6] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, pages 5769\u20135779, 2017.\n[7] K. He and J. Sun. Statistics of patch offsets for image completion. In ECCV, pages 16\u201329. Springer, 2012.\n[8] K. He and J. Sun. Image completion approaches using the statistics of similar patches. TPAMI, 36(12):2423\u20132435, 2014.\n[9] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. TOG, 36(4):107, 2017.\n[10] J. Jia and C.-K. Tang. Image repairing: Robust image synthesis by adaptive nd tensor voting. In CVPR, volume 1, pages I\u2013I. IEEE, 2003.\n[11] J. Jia and C.-K. Tang. Inference of segmented color and texture description by tensor voting. TPAMI, 26(6):771\u2013786, 2004.\n[12] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.\n[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[14] C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR, pages 2479\u20132486, 2016.\n[15] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, pages 3730\u20133738, 2015.\n[16] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor. Learning to maintain natural image statistics. arXiv preprint arXiv:1803.04626, 2018.\n[17] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. arXiv preprint arXiv:1803.02077, 2018.\n[18] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536\u20132544, 2016.\n[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211\u2013252, 2015.\n[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n[21] J. Sun, L. Yuan, J. Jia, and H.-Y. Shum. Image completion with structure propagation. In TOG, volume 24, pages 861\u2013868. ACM, 2005.\n[22] I. Talmi, R. Mechrez, and L. Zelnik-Manor. Template matching with deformable diversity similarity. In CVPR, pages 175\u2013183, 2017.\n[23] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan. Shift-Net: Image inpainting via deep feature rearrangement. arXiv preprint arXiv:1801.09392, 2018.\n[24] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. 
High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, volume 1, page 3, 2017.\n[25] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In CVPR, pages 5485\u20135493, 2017.\n[26] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018.\n[27] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, pages 589\u2013597, 2016.\n[28] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.\n", "award": [], "sourceid": 210, "authors": [{"given_name": "Yi", "family_name": "Wang", "institution": "Chinese University of Hong Kong"}, {"given_name": "Xin", "family_name": "Tao", "institution": "CUHK"}, {"given_name": "Xiaojuan", "family_name": "Qi", "institution": "CUHK"}, {"given_name": "Xiaoyong", "family_name": "Shen", "institution": "CUHK"}, {"given_name": "Jiaya", "family_name": "Jia", "institution": "CUHK"}]}