{"title": "Rethinking the CSC Model for Natural Images", "book": "Advances in Neural Information Processing Systems", "page_first": 2274, "page_last": 2284, "abstract": "Sparse representation with respect to an overcomplete dictionary is often used when regularizing inverse problems in signal and image processing. In recent years, the Convolutional Sparse Coding (CSC) model, in which the dictionary consists of shift invariant filters, has gained renewed interest. While this model has been successfully used in some image processing problems, it still falls behind traditional patch-based methods on simple tasks such as denoising.\n  In this work we provide new insights regarding the CSC model and its capability to represent natural images, and suggest a Bayesian connection between this model and its patch-based ancestor. Armed with these observations, we suggest a novel feed-forward network that follows an MMSE approximation process to the CSC model, using strided convolutions. The performance of this supervised architecture is shown to be on par with state of the art methods while using much fewer parameters.", "full_text": "Rethinking the CSC Model for Natural Images\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nDror Simon\n\nTechnion, Israel\n\ndror.simon@cs.technion.ac.il\n\nMichael Elad\n\nTechnion, Israel\n\nelad@cs.technion.ac.il\n\nAbstract\n\nSparse representation with respect to an overcomplete dictionary is often used\nwhen regularizing inverse problems in signal and image processing. In recent\nyears, the Convolutional Sparse Coding (CSC) model, in which the dictionary\nconsists of shift invariant \ufb01lters, has gained renewed interest. While this model\nhas been successfully used in some image processing problems, it still falls behind\ntraditional patch-based methods on simple tasks such as denoising. 
In this work we\nprovide new insights regarding the CSC model and its capability to represent natural\nimages, and suggest a Bayesian connection between this model and its patch-based\nancestor. Armed with these observations, we suggest a novel feed-forward network\nthat follows an MMSE approximation process to the CSC model, using strided\nconvolutions. The performance of this supervised architecture is shown to be on\npar with state of the art methods while using much fewer parameters.\n\n1 Introduction\n\nThe field of image restoration deals with the recovery of degraded images. Popular forms of\ndegradation include additive noise, a blurring kernel, missing pixels, and more. Retrieving an\nimage from its degraded version is typically an ill-posed problem. Therefore, to enable the inversion\ntask, it is necessary to include prior information on the original signal. An image prior, also referred\nto as an image model, is a mathematical description of the true distribution of images. In the past\n2-3 decades, many such models have been suggested and deployed. Some of these include reliance\non spatial smoothness, self-similarity, and sparse representation [1\u20134]. The latter is the focus of this\nwork.\nThe sparse representation model has been successfully incorporated in various signal and image\nprocessing applications [2, 5\u20139]. This model assumes that a signal $X \\in \\mathbb{R}^N$ is formed by a linear\ncombination of only a few atoms, taken from the dictionary $D \\in \\mathbb{R}^{N \\times M}$, i.e. $X = D\\Gamma$, where\n$\\Gamma \\in \\mathbb{R}^M$ is sparse. When a noisy signal $Y = X + V \\in \\mathbb{R}^N$ is at hand ($V$ is a bounded energy noise:\n$\\|V\\|_2 \\le \\epsilon$), seeking its sparse representation $\\hat{\\Gamma}$ leads to an estimate of the original signal via\n$\\hat{X} = D\\hat{\\Gamma}$. Finding $\\hat{\\Gamma}$ is commonly referred to as sparse-coding or a pursuit, formulated as\n\n$$\\min_{\\Gamma} \\|\\Gamma\\|_0 \\quad \\text{s.t.} \\quad \\|D\\Gamma - Y\\|_2 \\le \\epsilon, \\tag{1}$$\n\nwhere the $\\ell_0$ pseudo-norm counts the number of non-zeros in the vector.\u00b9 Sparse coding is NP-hard\nin general [10], hence approximation methods are used. A common approach replaces the $\\ell_0$ pseudo-norm\nwith the $\\ell_1$ norm, leading to a convex problem termed Basis-Pursuit (BP) [11]. The BP method has\nbeen theoretically analyzed [12] and shown to successfully recover a solution close to the original sparse\nrepresentation, depending on properties of the dictionary and the cardinality of the sought solution.\n\n\u00b9 $\\ell_0$ is not formally a norm since it does not satisfy the homogeneity property.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe dictionary is an important ingredient in the formation of this prior, as its atoms characterize the\nsignals that this model can represent sparsely. Learning the dictionary from the corrupted signal\nitself, or from an external dataset, has been shown to be quite effective, leading to the development\nof various dictionary learning algorithms and their use [13\u201316]. Unfortunately, due to the curse of\ndimensionality, these algorithms are applicable only for reasonably sized signals. Many algorithms\novercome this limitation by dividing the complete signal (e.g. a complete image) into fully overlapping\nsmall patches, treating each independently [2, 14, 15, 17]. This treatment consists of imposing the\nsparse representation model on the patches using a local dictionary $D_L \\in \\mathbb{R}^{n \\times m}$,\n\n$$\\forall i: \\quad \\min_{\\alpha_i} \\|\\alpha_i\\|_0 \\quad \\text{s.t.} \\quad \\|D_L\\alpha_i - P_i Y\\|_2 \\le \\epsilon, \\tag{2}$$\n\nwhere $P_i \\in \\mathbb{R}^{n \\times N}$ extracts the $i$-th patch from $Y$ ($n \\ll N$), and its representation $\\alpha_i \\in \\mathbb{R}^m$ is\nassumed to be sparse. Once clean estimates of the patches are found, this process proceeds by Patch\nAveraging (PA), i.e. 
merging all the refined patches together to form a final global estimate of the\nclean image:\n\n$$\\hat{X} = \\frac{1}{n} \\sum_{i} P_i^T D_L \\alpha_i, \\tag{3}$$\n\nwhere $P_i^T$ places $D_L\\alpha_i$ in the $i$-th location in the constructed image. Intuitively, operating independently\non patches must be sub-optimal, since the dependencies between the patches are falsely\nneglected [18]. To overcome this flaw, past work suggested enforcing the local prior on the patches of\nthe merged image [17, 19], leveraging the self-similarity between different patches [20], and more.\nRecently, there has been a renewed interest in global models that may overcome this local-global\ndichotomy. The Convolutional Sparse Coding (CSC) prior [21, 22] replaces the traditional patch-based\nmodel with a global shift-invariant one. Instead of operating on patches, it suggests a global\ndictionary constrained by a specific structure \u2013 a concatenation of banded circulant matrices\u00b2, limiting\nthe degrees of freedom introduced by the general sparsity-based model. Various algorithms have been\nsuggested to efficiently handle the global pursuit [23\u201327]. These methods have been augmented by\nefficient dictionary learning algorithms [26, 28\u201330]. A recent work provided theoretical guarantees\nfor the CSC model and its corresponding global pursuit results [31].\nThe CSC model has shown great success in several natural image processing tasks such as image\nseparation, image fusion, and super-resolution, matching or outperforming local-based methods\n[26, 28, 32\u201334]. Interestingly, one can find two common properties to all these success stories. The\nfirst is the fact that the CSC is merely used as a complementary component, modeling only the texture\npart of the image, after stripping its low-frequencies. Second, these applications assume noiseless\nimages, and thus the CSC cannot fail in over-fitting the data. 
Indeed, when brought to other classical\ntasks, such as image denoising or other inverse problems that involve an additive noise, the CSC has\nbeen shown to fail utterly.\nThe main contribution of this work is in providing novel insights regarding the CSC for modeling\nnatural images, and extending the applicability of this prior while tying it to deep-learning. We\npropose an explanation for the inability of this model to represent natural images reliably,\nand show that PA can be perceived as a Minimum Mean Square Error (MMSE) approximation to\nthe CSC. Building on this, we suggest improving this approximation to obtain a CSC estimation\nprocess that operates directly on an image without any pre-processing steps. Finally, we leverage\nthese observations to implement a feed-forward Convolutional Neural Network (CNN) whose layers\nstrictly correspond to each step in the processing flow of sparse-coding based image denoising. Our\nresults are on par with current state of the art supervised methods while drastically reducing the\nnumber of parameters.\n\n\u00b2 These represent convolutions with small support filters.\n\n2 Background: Convolutional Sparse Coding\n\n2.1 The CSC Model\nThe CSC model considers a shift-invariant property in the signal, by assuming that\u00b3 $X \\in \\mathbb{R}^N$ is\nconstructed by a sum of $m$ convolutions of sparse feature maps $\\{Z_i\\}_{i=1}^m \\in \\mathbb{R}^N$ by filters $\\{d_i\\}_{i=1}^m$ of\nlength $n \\ll N$. CSC then refers to solving the following optimization problem:\n\n$$\\min_{\\{Z_i\\}_{i=1}^m} \\sum_{i=1}^m \\|Z_i\\|_0 \\quad \\text{s.t.} \\quad X = \\sum_{i=1}^m d_i * Z_i. \\tag{4}$$\n\n\u00b3 For simplicity of the description, and without loss of generality, we describe the CSC throughout this paper\nas operating on 1D signals.\n\nFigure 1: The CSC model and its components.\n\nEquivalently, we define a single global sparse representation vector $\\Gamma \\in \\mathbb{R}^{Nm}$, constructed by\ninterlacing the sparse feature maps $\\{Z_i\\}_{i=1}^m$. A global dictionary is then composed as follows: Let\n$D_L \\in \\mathbb{R}^{n \\times m}$ represent a local dictionary whose columns are the filters $\\{d_i\\}_{i=1}^m$; then $D$ contains $N$\nshifts of this local dictionary \u2013 see Figure 1. Under this description, (4) is equivalent to\n\n$$\\min_{\\Gamma} \\|\\Gamma\\|_0 \\quad \\text{s.t.} \\quad X = D\\Gamma. \\tag{5}$$\n\nWhen noisy measurements $Y$ are at hand, (5) is modified to allow for variations in the signal, leading\nto the problem defined in Equation (1).\nWe introduce additional definitions, taken from [31], which will aid in our exposition. The sparse\nrepresentation vector $\\Gamma$ can be thought of as $N$ concatenated vectors $\\alpha_i \\in \\mathbb{R}^m$, termed needles. Each\ndescribes the contribution of the $m$ filters when aligned to the $i$-th element in $X$, i.e.\n\n$$X = D\\Gamma = \\sum_{i=1}^N P_i^T D_L \\alpha_i = \\sum_{i=1}^N P_i^T s_i, \\tag{6}$$\n\nwhere we have defined the slice $s_i = D_L\\alpha_i$. Observe the resemblance between this and the patch-averaging\nin Equation (3). This may suggest that the CSC is in fact a global model extending PA.\nIf this were true, one could have expected the CSC to be at least as good as the local model on any\nimage processing task. Is this the case? Keep reading.\nSimilarly, $P_i X = P_i D\\Gamma$ is a patch of size $n$ extracted from $X$. Equivalently, one may write\n$P_i D\\Gamma = \\Omega\\gamma_i$, where a stripe $\\gamma_i$ concatenates $2n - 1$ needles and $\\Omega$ is termed the stripe dictionary.\nFigure 1 demonstrates these definitions. An analysis of the convolutional sparse coding problem\nis proposed in [31], showing that when all the stripes $\\gamma_i$ are sparse,\u2074 the solution to (5) is unique.\nMoreover, under the same conditions, BP is guaranteed to retrieve this solution. 
An analysis was also\ngiven for the noisy case, showing that under similar conditions, the solution attained by BP is stable.\n\n\u2074 This is measured via an $\\ell_{0,\\infty}$ pseudo-norm, defined as $\\|\\Gamma\\|_{0,\\infty} = \\max_i \\|\\gamma_i\\|_0$.\n\n2.2 CSC in Practice\n\nThe first application we mention is cartoon-texture separation, where the goal is to blindly decompose\nan image into its texture and cartoon parts. Recent papers have achieved successful results by\nincorporating the CSC model [28, 34]. Curiously, these algorithms model the cartoon image via the\nTotal-Variation smoothness assumption, while using the CSC to model only the texture.\nA second application where the CSC achieves satisfactory results is image fusion [26, 33]. Here\nthe goal is to integrate complementary information from multiple source images of the same scene.\nThe results obtained by integrating the CSC model surpass those achieved by patch-based methods\non various metrics [33]. In such an algorithm, each image is first decomposed into smooth and detail\nlayers. The fusion itself is obtained by computing the convolutional sparse representation of the\ndetail layers, and merging these by a pixel-wise max-pooling strategy.\nAnother application where CSC has been demonstrated to perform very well is single-image super-resolution.\nHere, the objective is a high resolution (HR) image, obtained from a low resolution (LR)\none. In [32] a CSC scheme is suggested, leading to superior results over patch-based methods. As\nin the image fusion task, the algorithm separates the LR image into a smooth and a residual image.\nWhile the smooth part is simply interpolated, the residual is coded using CSC, and the final details\nimage is recovered by applying a set of filters on the obtained sparse representations.\nIn all these successful applications, the input image is first separated into smooth and non-smooth\nimages. 
Then, without a formal reasoning, the CSC only models the detail-rich content of the image,\nhinting to its limitations. Why does the CSC perform well only on non-smooth signals? We answer\nthis question in the following sections. Another common feature to these successful applications is\nthe fact that the data is assumed to be noiseless. Is this a coincidence? Can the CSC be of bene\ufb01t on\nnoisy natural images? In order to answer these questions let us refer to an unsuccessful use of the\nCSC: image denoising. Applying the CSC model directly on the noisy image leads to disappointing\nresults, falling far behind PA [25, 35, 36]. We emphasize that the same is true for other applications\nwhere noise cannot be neglected, such as deblurring and other inverse problems.\nTo date, no CSC denoising algorithm competes favorably with the PA method on natural images, with\nthe exception of [37]. This work extends the concept of Learned Iterative Soft Thresholding (LISTA)\n[38\u201340] to CSC, unfolding the pursuit algorithm into a recurrent network. The results obtained are on\npar with the K-SVD algorithm [3].5 The last part of our work is closely related to [37]. 
By adopting\nour insights on the CSC model and its MMSE approximation, we offer a CSC deployment that leads\nto enhanced denoising performance that is on par with the most recent supervised methods.\n\n3 Why Does the CSC Model Denoise Natural Images Poorly?\n\n3.1 Poor Coherence\n\nThe work in [31] has shown that the theoretical uniqueness and stability guarantees for the convolutional\nsparse coding problem are conditioned on the maximum number of local non-zero elements in $\\Gamma$:\n\n$$\\|\\Gamma\\|_{0,\\infty} < \\frac{1}{2}\\left(1 + \\frac{1}{\\mu(D)}\\right), \\tag{7}$$\n\nwhere $\\mu(D)$ is the mutual coherence of $D$, i.e. the max absolute normalized cross-correlation\nbetween its columns.\u2076 Moreover, under the same conditions, BP is guaranteed to recover this solution.\nThis bound implies that to allow for a large number of active filters in the signal while keeping the\nsolution accessible, the filters and all their shifts must have low cross-correlations. Specifically, the\nauto-correlation of the filters should be low as well. Unfortunately, this property does not align\nwith the characteristics of natural images. For the most part, these consist of piece-wise smooth\nregions and occasional textures. This structure is key in many denoising and compression methods\n[1, 41]. Hence, to allow for a sparse representation, a convolutional dictionary must contain smooth\nor piece-wise smooth filters. That said, the auto-correlation of these filters decays slowly, leading to\nhighly correlated atoms in the global dictionary, restricting the number of non-zero elements allowed\nin each stripe $\\gamma_i$ while satisfying the above bound. 
For example, if a dictionary contains the constant\n(DC) filter, the maximum number of non-zeros allowed in a stripe to assure uniqueness is 1, enforcing\nunpainted pixels in the signal.\nGenerally, a CSC representation of a natural image imposes a contradiction between the cardinality of\nthe sparse representation and the coherence of the global dictionary. To assure a sparse representation,\nthe former requires piecewise smooth filters, whereas the latter demands low global mutual-coherence,\nwhich counters these slow-changing filters. In its current form, the CSC cannot satisfy the two\ndemands simultaneously, making it unsuitable for natural images. Note that the successful applications\nthat were mentioned in Section 2.2 apply the CSC model only on the texture rich part of the image,\nleading to Gabor-like non-smooth filters, thus avoiding the described conflict.\n\n\u2075 Improved performance is reported in their follow-up thesis.\n\u2076 The $\\ell_{0,\\infty}$ pseudo-norm is defined as $\\|\\Gamma\\|_{0,\\infty} = \\max_i \\|\\gamma_i\\|_0$.\n\n3.2 A Bayesian Standpoint\n\nFrom a Bayesian point of view, the solution to the problem posed in Equation (1) (or its Lagrangian\nform) corresponds to the Maximum A-posteriori Probability (MAP) estimator under a sparse prior [42\u201345].\nClearly, this solution is inferior to the Minimum MSE (MMSE) estimator in terms of MSE when\nthe two differ. As we show next, as opposed to a convolutional pursuit (being a MAP approximation),\nPA performs a restrained approximation to the CSC MMSE estimator w.r.t. the entire image,\nexplaining its superiority. To do so, we first formalize the PA approach, then we present the CSC\nMMSE estimator and finally, we show their connection.\nPA obtains a clean estimate for each patch and averages overlapping estimates together. Specifically,\nunder a (local) sparse prior with a dictionary $D_L$, each patch participates in a local pursuit\nindependently, leading to $N$ independent optimization problems,\u2077\n\n$$\\left\\{ \\hat{\\alpha}_i = \\arg\\min_{\\alpha_i} \\lambda_i \\|\\alpha_i\\|_1 + \\frac{1}{2} \\|D_L\\alpha_i - P_i Y\\|_2^2 \\right\\}_{i=1}^N. \\tag{8}$$\n\nOnce $\\hat{\\alpha}_i$ are found, the clean patches are synthesized by $\\hat{x}_i = D_L\\hat{\\alpha}_i$, and these are placed in the\nsignal while averaging overlapping elements from different patches \u2013 see Equation (3).\nWe move now to discuss the MMSE estimation under a sparsity-promoting prior [44, 45]. Using\nmarginalization, the MMSE estimate of the global convolutional sparse representation vector can be written as\n\n$$\\hat{\\Gamma}_{\\mathrm{MMSE}} = E\\{\\Gamma | Y\\} = E_S\\{E\\{\\Gamma | Y, S\\}\\} = \\sum_{S \\in \\Theta} P(S)\\, E\\{\\Gamma | Y, S\\} = \\sum_{S \\in \\Theta} P(S)\\, \\hat{\\Gamma}_S, \\tag{9}$$\n\nwhere $S$ stands for the support of $\\Gamma$, $P(S)$ is the prior probability of such a support (assumed to\npromote sparse vectors) and $\\Theta$ is the set of all possible supports. Furthermore, $\\hat{\\Gamma}_S = E\\{\\Gamma | Y, S\\}$ is\nthe MMSE estimator of the sparse representation vector given the support and the noisy measurements,\nknown as the oracle estimator [44]. Equation (9) suggests that the MMSE estimator is actually a\ndense vector consisting of a weighted average of all the possible oracle estimators, where the weight\nof each is its prior probability, $P(S)$. Note that computing the MMSE is an exhaustive task that\nsweeps through all the possible supports, and therefore approximation methods are needed. A natural\nstrategy in this context is to sample a sufficient number of supports from $P(S)$, and replace the\nexpectation with a sample mean over these.\nIndeed, consider the case where the sampled supports are such that $D\\Gamma$ results in non-overlapping\ntangent slices. Overall, there are $n$ different slice arrangements that uphold this assumption, differing\nonly in the location of the first slice on the image. 
Equivalently, the $k$-th arrangement can be described\nby a convolutional strided dictionary, where the stride equals the size of the filter $n$, and the first\nnon-zero needle in $\\Gamma$ is located in the $k$-th index, $1 \\le k \\le n$. We shall denote this strided dictionary\nby $D_k$ and its corresponding representation as $\\Gamma_k$. Sparse coding the $k$-th shift can be done using BP,\nwhich under these constraints can be written as\n\n$$\\hat{\\Gamma}_k = \\arg\\min_{\\Gamma} \\lambda\\|\\Gamma\\|_1 + \\frac{1}{2}\\|D_k\\Gamma - Y\\|_2^2 \\tag{10}$$\n\n$$= \\arg\\min_{\\Gamma = [\\alpha_k; \\alpha_{k+n}, \\ldots]} \\sum_{i=0}^{\\frac{N}{n}-1} \\left\\{ \\lambda\\|\\alpha_{k+in}\\|_1 + \\frac{1}{2}\\|D_L\\alpha_{k+in} - P_{k+in}Y\\|_2^2 \\right\\}. \\tag{11}$$\n\nThis can be solved for each needle separately,\n\n$$\\alpha_{k+in} = \\arg\\min_{\\alpha} \\lambda\\|\\alpha\\|_1 + \\frac{1}{2}\\|D_L\\alpha - P_{k+in}Y\\|_2^2, \\tag{12}$$\n\nwhile zeroing all the other needles. Thus, under the constraint of non-overlapping slices, estimating\nthe CSC representation $\\hat{\\Gamma}_S$ is equivalent to $\\frac{N}{n}$ independent local pursuits. Clearly, this estimation\nresults in a \u201cblockified\u201d image, due to the lack of overlaps.\n\n\u2077 We assume the use of the BP in its Lagrangian form.\n\nRepeating this estimation process $n$ times, each time forcing a different shift $1 \\le k \\le n$, leads to\na set of estimates $\\{\\hat{\\Gamma}_1, \\hat{\\Gamma}_2, \\ldots, \\hat{\\Gamma}_n\\}$, where $\\hat{\\Gamma}_k$ denotes the estimate obtained using the $k$-th shift.\nLooking back into the MMSE estimator in Equation (9), the MMSE can be approximated as\n\n$$\\hat{\\Gamma}_{\\mathrm{MMSE}} \\approx \\sum_{i=1}^{n} P(S)\\, \\hat{\\Gamma}_i \\approx \\frac{1}{n} \\sum_{i=1}^{n} \\hat{\\Gamma}_i, \\tag{13}$$\n\nif we further assume that all the estimates are a-priori equally likely. Since each $\\hat{\\Gamma}_i$ is obtained by\na local non-overlapping pursuit (11), the result in (13) is exactly the above outlined PA procedure.\nHence, PA can be perceived as an MMSE approximation of the CSC model, explaining its superior\nMSE performance when compared to a single global CSC pursuit.\u2078 Armed with this insight, can we\npropose better CSC MMSE estimates? This takes us to the next section.\n\n4 The Proposed Approach\n\n4.1 Generalizing the MMSE Approximation Using Strided Convolutions\n\nWe suggest generalizing the non-overlapping slices assumption and allowing for a smaller constant\nstride. Formally, in the non-overlapping case, the stride between adjacent slices was of the same size\nas the filters themselves, leading to $n$ such estimates. We suggest using a stride $q$, where $1 \\le q < n$,\nleading to $q$ estimates in a 1D signal or $q^2$ in a 2D one \u2013 each originating from a different initial shift\nin the signal. Finally, we average these together, as suggested in Equation (9). Note that when $q < n$,\neach estimate allows for overlapping slices, implying that the pursuit must be done globally on all\nthe involved slices together. This necessarily leads to a global agreement between these slices, as\nopposed to PA where each patch (slice) is estimated separately. Furthermore, when $q$ is sufficiently\nlarge, the mutual coherence of the global dictionary can be preserved even for smooth filters, in\ncontrast to the standard CSC pursuit ($q = 1$), since the filters only partially overlap.\nFor a preliminary evaluation of our approach, we perform a denoising experiment on images from the\nSet12 dataset contaminated with white Gaussian noise with standard deviation $\\sigma \\in \\{15, 25, 50, 75\\}$.\nWe use both the standard PA algorithm, and the proposed strided CSC using various strides $1 \\le q < n = 11$,\nfollowed by an averaging operation. 
BP in its error-bounded form followed by a\ndebiasing step is used to sparse code the signals both in the convolutional and the PA cases. The\ntwice over-redundant DCT dictionary of size 11 \u00d7 11 is chosen as the local dictionary $D_L$. Note\nthat in the strided case, pixels may now have a different number of slices (filters) overlapping them,\ndepending on their position in the image and the stride. To compensate for this, we normalize the\nfilters appropriately for each stride.\nA summary of the results of this experiment is presented in Table 1 (per-image results can be found\nin the supplementary material). As expected, when using CSC with a stride of 1, i.e. standard CSC,\nthe denoising performance is poor and the PA method is substantially better. We attribute this to the\nhigh coherence of the global dictionary, making the estimated image overfit the noise. However, the\nbest results are achieved when the stride is large but smaller than the size of the filters, restraining the\ncoherence, while allowing the filters to overlap, leading to a global consensus in each of the estimates.\nInterestingly, the CSC achieved better results even though its error constraint ($\\|D_k\\Gamma - Y\\|_2 \\le \\epsilon$) is\nglobal, as opposed to the much more detailed local constraint used by PA.\n\n4.2 CSCNet \u2013 a Supervised Denoising Model\n\nA popular method to solve BP, i.e. 
$\\hat{\\Gamma} = \\arg\\min_{\\Gamma} \\frac{1}{2}\\|D\\Gamma - Y\\|_2^2 + \\lambda\\|\\Gamma\\|_1$, is the ISTA algorithm,\nwhich operates iteratively as follows:\n\n$$\\Gamma_{k+1} = S_{\\lambda/c}\\left(\\Gamma_k + \\frac{1}{c} D^T (Y - D\\Gamma_k)\\right), \\tag{14}$$\n\nwhere $c \\ge \\sigma_{\\max}(D^T D)$, and $S_\\tau$ is the soft-thresholding operator extended to operate in an element-wise\nfashion.\u2079 Often times, convergence requires a large number of iterations, making this process\ninefficient.\n\n\u2078 The term approximation refers to considering only a small subset of supports in the averaging process.\n\u2079 $\\sigma_{\\max}(\\cdot)$ represents the largest eigenvalue, and $S_\\tau(\\cdot)$ is defined as $S_\\tau(y) = \\mathrm{sign}(y) \\cdot \\max(|y| - \\tau, 0)$.\n\nTable 1: Average Set12 denoising results (PSNR) using PA and CSC with various strides (q). CSC\nresults that surpass PA are marked in blue. Best results are bold.\n\n\u03c3   CSC q=1  q=2    q=3    q=4    q=5    q=6    q=7    q=8    q=9    q=10   PA\n15  28.99    29.27  30.01  30.66  31.06  31.21  31.31  31.39  31.45  31.46  31.23\n25  25.78    26.11  26.94  27.72  28.26  28.50  28.64  28.75  28.84  28.88  28.73\n50  21.49    22.11  23.17  23.83  24.52  24.86  25.05  25.29  25.47  25.56  25.32\n75  18.83    19.58  20.95  21.81  22.43  22.75  22.97  23.25  23.51  23.66  23.28\n\nFigure 2: The CSCNET architecture.\n\nTo overcome this burden, the LISTA algorithm [38] has been proposed to approximate\nthe sparse coding process, by learning the parameters of a non-linear recurrent encoder that strictly\nfollows $L$ iterations of the iterative process described in Equation (14). This concept has been\nextended to the convolutional setting in [37] as follows:\n\n$$\\Gamma_{k+1} = S_\\tau\\left(\\Gamma_k + \\frac{1}{c} A (Y - B\\Gamma_k)\\right), \\tag{15}$$\n\nwhere $A$ stands for a convolution operator and $B$ a transposed-convolution one. 
Once the sparse\nvector is at hand, the estimated clean image is then obtained by a linear transposed-convolutional\ndecoder, i.e. $\\hat{X} = C\\Gamma_L$. The matrices $A$, $B$ and $C$ are structured as a set of support bounded shift\ninvariant filters, and together with the thresholds vector $\\tau$, are learned in a supervised manner. Note\nthat the number of parameters does not grow with $L$, the number of unrolled iterations.\nFollowing the CSC MMSE approximation introduced in Section 4.1, we propose to use a strided\nconvolutional structure on the learned matrices using a constant stride $q$. To obtain an estimate for\neach possible shift, we duplicate the input image $q^2$ times, where each duplicate is a shifted version of\nthe original image. Following Equation (13) the estimated image is a simple average of the estimates\nof all the shifts. A diagram of the proposed architecture is presented in Figure 2.\n\n4.3 Experiments\n\nTo train the proposed model, we prepare a training set of input-output pairs. The clean images are\ntaken from the Waterloo Exploration Dataset [46] and 432 images from BSD [47]. The noisy inputs\nare obtained by adding white Gaussian noise with a constant standard deviation $\\sigma$. In each iteration,\na random patch of size 128 is cropped from an image and a random realization of noise is sampled.\nWe train 4 models, one for each noise level {15, 25, 50, 75}. For each model we learn 175 filters of\nsize 11 \u00d7 11, use a stride $q = 8$ and set $L = 12$. To learn the parameters of the model, we employ the\nADAM optimizer [48] and minimize the $\\ell_2$ loss, i.e. $\\mathcal{L}(X, \\hat{X}) = \\|X - \\hat{X}\\|_2^2$. We use a learning\nrate of $10^{-4}$ and decrease it by a factor of 0.7 every 50 epochs and iterate over 250 epochs. To avoid\ndivergence, we set the $\\epsilon$ parameter of the optimizer to $10^{-3}$. We evaluate the performance of the\nmodels using the BSD68 dataset that was excluded from the training set. 
Additional experiments and\ninformation can be found in the supplementary material and on https://github.com/drorsimon/CSCNet.\nTable 2 presents the results of our models compared to other leading methods, and Figure 3 shows\nsome of the learned filters, taken from $C$. The proposed model outperforms BM3D [2], TNRD [49],\nWNNM [50] and MLP [51], while being on par with DnCNN [52] and FFDNet [53]. That said, we\nmention two differences between the proposed model and the other two leading methods:\n\n1. Number of parameters \u2013 The number of parameters in the proposed approach does not\ngrow with the depth of the model. Hence, it uses far fewer parameters compared to other\nmodern methods, as demonstrated in Table 3.\n\n2. Batch Normalization (BN) \u2013 The other two leading denoising methods are based on general\ndeep learning techniques, and therefore employ BN which is known to improve the performance\nand convergence rate of the trained model [54]. As our presented method relies only\non the CSC prior, we did not include such operators.\n\nTable 2: Denoising performance (PSNR) on the BSD68 dataset.\n\n\u03c3   BM3D   WNNM   TNRD   MLP    DnCNN  FFDNet  CSCNet\n15  31.07  31.37  31.42  \u2013      31.72  31.63   31.57\n25  28.57  28.83  28.92  28.96  29.22  29.19   29.11\n50  25.62  25.87  25.97  26.03  26.23  26.29   26.24\n75  24.21  24.40  \u2013      24.59  24.64  24.79   24.77\n\nFigure 3: CSCNet filters.\n\nFigure 4: CSCNet test error for various strides.\n\nTo study the effect of the stride-size, we have trained 6 models, each one with a different stride. 
The\nresults in Figure 4, referring to noise level $\\sigma = 25$, show the same tendency as in Table 1, namely,\nsetting the stride too high ($q = 11$) results in independent patch based processing with weaker\nperformance (28.9 dB); setting it to $q = 1$ leads to a regular (non-MMSE) deployment of the CSC\nwith highly correlated atoms and the weakest estimate (28.74 dB), which is still better than BM3D.\nThe best results are obtained for $q = 7$ or $q = 8$ (29.11 dB).\n\nTable 3: Comparison of number of parameters in leading denoising architectures.\n\nModel   First layer      Mid layers                          Last layer            Total\nDnCNN   3 \u00d7 3 \u00d7 1 \u00d7 64   (3 \u00d7 3 \u00d7 64 \u00d7 64 + 128) \u00d7 15        3 \u00d7 3 \u00d7 64 \u00d7 1        556,032\nFFDNet  3 \u00d7 3 \u00d7 5 \u00d7 64   (3 \u00d7 3 \u00d7 64 \u00d7 64 + 128) \u00d7 13        3 \u00d7 3 \u00d7 64 \u00d7 4        486,080\nCSCNet  \u2013                (11 \u00d7 11 \u00d7 175 \u00d7 1) \u00d7 2 + 175       11 \u00d7 11 \u00d7 175 \u00d7 1     63,700\n\nWe further test our approach on color image denoising. We perform similar experiments to those\ndescribed earlier, where this time the input image and each filter have 3 channels (RGB). The\ndenoising performance of our architecture is presented in Table 4. As before, our results are on par\nwith other leading methods (DnCNN, FFDNet) while using far fewer parameters.\n\n5 Conclusions\n\nThis work exposed the limitations of the CSC model in representing natural images in the presence\nof noise. Investigation of the patch-averaging scheme and the origins of its success has led us to\noffer an MMSE approximation pursuit that overcomes these limitations effectively. 
A feed-forward\narchitecture based on our insights was shown to be on par with the best supervised denoising\nalgorithms in the literature. Our future work will focus on further improvements of this scheme\nby considering (i) the addition of batch-normalization; (ii) a multi-scale architecture, as proposed\nin recent leading methods; (iii) adoption of a local error constraint as in PA; and (iv) exploiting\nself-similarity, as practiced in [20] and more recently in [55]. In all these directions, our prime goal\nis to incorporate these ideas while maintaining the purity of the CSC model, so as to preserve intact the\ntheoretical justification of the proposed architecture.\n\nTable 4: Denoising performance (PSNR, in dB) on the color-BSD68 dataset.\n\n\u03c3   CBM3D  CDnCNN  FFDNet  CSCNet\n15  33.52  33.89   33.87   33.83\n25  30.71  31.23   31.21   31.18\n50  27.38  27.92   27.96   28.00\n75  25.74  24.47   26.24   26.32\n\nAcknowledgement\n\nThe research leading to these results has received funding from the Technion Hiroshi Fujiwara Cyber\nSecurity Research Center and the Israel Cyber Directorate.\n\nReferences\n\n[1] L. I. Rudin, S. Osher, and E. Fatemi, \u201cNonlinear total variation based noise removal algorithms,\u201d Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259\u2013268, 1992.\n\n[2] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, \u201cImage denoising by sparse 3-D transform-domain collaborative filtering,\u201d IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080\u20132095, 2007.\n\n[3] M. Elad and M. Aharon, \u201cImage denoising via sparse and redundant representations over learned dictionaries,\u201d IEEE Transactions on Image Processing, vol. 15, no. 12, pp. 3736\u20133745, 2006.\n\n[4] A. Buades, B. Coll, and J.-M. 
Morel, \u201cA non-local algorithm for image denoising,\u201d in Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 60\u201365, IEEE, 2005.\n\n[5] W. Dong, L. Zhang, G. Shi, and X. Wu, \u201cImage deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization,\u201d IEEE Transactions on Image Processing, vol. 20, no. 7, pp. 1838\u20131857, 2011.\n\n[6] J. Salmon, Z. Harmany, C.-A. Deledalle, and R. Willett, \u201cPoisson noise reduction with non-local PCA,\u201d Journal of Mathematical Imaging and Vision, vol. 48, no. 2, pp. 279\u2013294, 2014.\n\n[7] W. Dong, L. Zhang, G. Shi, and X. Li, \u201cNonlocally centralized sparse representation for image restoration,\u201d IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1620\u20131630, 2012.\n\n[8] S. Ravishankar and Y. Bresler, \u201cMR image reconstruction from highly undersampled k-space data by dictionary learning,\u201d IEEE Transactions on Medical Imaging, vol. 30, no. 5, pp. 1028\u20131041, 2011.\n\n[9] M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, \u201cSparse representations in audio and music: from coding to source separation,\u201d Proceedings of the IEEE, vol. 98, no. 6, pp. 995\u20131005, 2009.\n\n[10] B. K. Natarajan, \u201cSparse approximate solutions to linear systems,\u201d SIAM Journal on Computing, vol. 24, no. 2, pp. 227\u2013234, 1995.\n\n[11] S. Chen and D. Donoho, \u201cBasis pursuit,\u201d in Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 41\u201344, IEEE, 1994.\n\n[12] M. Elad, Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010.\n\n[13] K. Engan, S. O. Aase, and J. H. Husoy, \u201cMethod of optimal directions for frame design,\u201d in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 
2443\u20132446, IEEE, 1999.\n\n[14] M. Aharon, M. Elad, A. Bruckstein, et al., \u201cK-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,\u201d IEEE Transactions on Signal Processing, vol. 54, no. 11, p. 4311, 2006.\n\n[15] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, \u201cOnline dictionary learning for sparse coding,\u201d in International Conference on Machine Learning (ICML), pp. 689\u2013696, ACM, 2009.\n\n[16] I. Tosic and P. Frossard, \u201cDictionary learning: What is the right representation for my signal?,\u201d IEEE Signal Processing Magazine, vol. 28, pp. 27\u201338, 2011.\n\n[17] D. Zoran and Y. Weiss, \u201cFrom learning models of natural image patches to whole image restoration,\u201d in International Conference on Computer Vision (ICCV), pp. 479\u2013486, IEEE, 2011.\n\n[18] D. Batenkov, Y. Romano, and M. Elad, \u201cOn the global-local dichotomy in sparsity modeling,\u201d in Compressed Sensing and its Applications, pp. 1\u201353, Springer, 2017.\n\n[19] J. Sulam and M. Elad, \u201cExpected patch log likelihood with a sparse prior,\u201d in International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), pp. 99\u2013111, Springer, 2015.\n\n[20] J. Mairal, F. R. Bach, J. Ponce, G. Sapiro, and A. Zisserman, \u201cNon-local sparse models for image restoration,\u201d in International Conference on Computer Vision (ICCV), vol. 29, pp. 54\u201362, IEEE, 2009.\n\n[21] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng, \u201cShift-invariant sparse coding for audio classification,\u201d in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pp. 149\u2013158, 2007.\n\n[22] A. Szlam, K. Kavukcuoglu, and Y. LeCun, \u201cConvolutional matching pursuit and dictionary training,\u201d arXiv preprint arXiv:1010.0422, 2010.\n\n[23] F. Heide, W. Heidrich, and G. 
Wetzstein, \u201cFast and flexible convolutional sparse coding,\u201d in Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5135\u20135143, IEEE, 2015.\n\n[24] G. Silva, J. Quesada, P. Rodr\u00edguez, and B. Wohlberg, \u201cFast convolutional sparse coding with separable filters,\u201d in International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6035\u20136039, IEEE, 2017.\n\n[25] E. Plaut and R. Giryes, \u201cA greedy approach to 0,infinity based convolutional sparse coding,\u201d SIAM Journal on Imaging Sciences, vol. 12, no. 1, pp. 186\u2013210, 2019.\n\n[26] E. Zisselman, J. Sulam, and M. Elad, \u201cA local block coordinate descent algorithm for the convolutional sparse coding model,\u201d in Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2019.\n\n[27] B. Wohlberg, \u201cEfficient algorithms for convolutional sparse representations,\u201d IEEE Transactions on Image Processing, vol. 25, pp. 301\u2013315, Jan 2016.\n\n[28] V. Papyan, Y. Romano, M. Elad, and J. Sulam, \u201cConvolutional dictionary learning via local processing,\u201d in International Conference on Computer Vision (ICCV), pp. 5306\u20135314, IEEE, 2017.\n\n[29] C. Garcia-Cardona and B. Wohlberg, \u201cConvolutional dictionary learning: A comparative review and new algorithms,\u201d IEEE Transactions on Computational Imaging, 2018.\n\n[30] I. Y. Chun and J. A. Fessler, \u201cConvolutional dictionary learning: Acceleration and convergence,\u201d IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1697\u20131712, 2018.\n\n[31] V. Papyan, J. Sulam, and M. Elad, \u201cWorking locally thinking globally: Theoretical guarantees for convolutional sparse coding,\u201d IEEE Transactions on Signal Processing, vol. 65, no. 21, pp. 5687\u20135701, 2017.\n\n[32] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. 
Zhang, \u201cConvolutional sparse coding for image super-resolution,\u201d in International Conference on Computer Vision (ICCV), pp. 1823\u20131831, IEEE, 2015.\n\n[33] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, \u201cImage fusion with convolutional sparse representation,\u201d IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1882\u20131886, 2016.\n\n[34] I. Rey-Otero, J. Sulam, and M. Elad, \u201cVariations on the CSC model,\u201d arXiv preprint arXiv:1810.01169, 2018.\n\n[35] D. Carrera, G. Boracchi, A. Foi, and B. Wohlberg, \u201cSparse overcomplete denoising: aggregation versus global optimization,\u201d IEEE Signal Processing Letters, vol. 24, no. 10, pp. 1468\u20131472, 2017.\n\n[36] B. Wohlberg, \u201cConvolutional sparse coding with overlapping group norms,\u201d arXiv preprint arXiv:1708.09038, 2017.\n\n[37] H. Sreter and R. Giryes, \u201cLearned convolutional sparse coding,\u201d in International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2191\u20132195, IEEE, 2018.\n\n[38] K. Gregor and Y. LeCun, \u201cLearning fast approximations of sparse coding,\u201d in International Conference on Machine Learning (ICML), pp. 399\u2013406, Omnipress, 2010.\n\n[39] J. T. Rolfe and Y. LeCun, \u201cDiscriminative recurrent sparse auto-encoders,\u201d in International Conference on Learning Representations (ICLR), 2013.\n\n[40] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, \u201cDeep networks for image super-resolution with sparse prior,\u201d in International Conference on Computer Vision (ICCV), pp. 370\u2013378, IEEE, 2015.\n\n[41] A. S. Lewis and G. Knowles, \u201cImage compression using the 2-D wavelet transform,\u201d IEEE Transactions on Image Processing, vol. 1, no. 2, pp. 244\u2013250, 1992.\n\n[42] E. G. Larsson and Y. Sel\u00e9n, \u201cLinear regression with a sparse parameter vector,\u201d IEEE Transactions on Signal Processing, vol. 55, no. 2, pp. 451\u2013460, 2007.\n\n[43] P. 
Schniter, L. C. Potter, J. Ziniel, et al., \u201cFast Bayesian matching pursuit: Model uncertainty and parameter estimation for sparse linear models,\u201d IEEE Transactions on Signal Processing, pp. 326\u2013333, 2008.\n\n[44] M. Elad and I. Yavneh, \u201cA plurality of sparse representations is better than the sparsest one alone,\u201d IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 1\u201335, 2009.\n\n[45] D. Simon, J. Sulam, Y. Romano, Y. M. Lu, and M. Elad, \u201cMMSE approximation for sparse coding algorithms using stochastic resonance,\u201d IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4597\u20134610, 2019.\n\n[46] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, \u201cWaterloo exploration database: New challenges for image quality assessment models,\u201d IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 1004\u20131016, 2017.\n\n[47] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, \u201cContour detection and hierarchical image segmentation,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 898\u2013916, May 2011.\n\n[48] D. P. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d International Conference on Learning Representations (ICLR), 2014.\n\n[49] Y. Chen and T. Pock, \u201cTrainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1256\u20131272, 2017.\n\n[50] S. Gu, L. Zhang, W. Zuo, and X. Feng, \u201cWeighted nuclear norm minimization with application to image denoising,\u201d in Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2862\u20132869, IEEE, 2014.\n\n[51] H. C. Burger, C. J. Schuler, and S. Harmeling, \u201cImage denoising: Can plain neural networks compete with BM3D?,\u201d in Conference on Computer Vision and Pattern Recognition (CVPR), pp. 
2392\u20132399, IEEE, 2012.\n\n[52] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, \u201cBeyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,\u201d IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142\u20133155, 2017.\n\n[53] K. Zhang, W. Zuo, and L. Zhang, \u201cFFDNet: Toward a fast and flexible solution for CNN-based image denoising,\u201d IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4608\u20134622, 2018.\n\n[54] S. Ioffe and C. Szegedy, \u201cBatch normalization: accelerating deep network training by reducing internal covariate shift,\u201d in International Conference on Machine Learning (ICML), pp. 448\u2013456, 2015.\n\n[55] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, \u201cNon-local recurrent network for image restoration,\u201d in Advances in Neural Information Processing Systems (NeurIPS), pp. 1673\u20131682, 2018.\n", "award": [], "sourceid": 1346, "authors": [{"given_name": "Dror", "family_name": "Simon", "institution": "Technion"}, {"given_name": "Michael", "family_name": "Elad", "institution": "Technion"}]}