{"title": "Supervised Sparse Analysis and Synthesis Operators", "book": "Advances in Neural Information Processing Systems", "page_first": 908, "page_last": 916, "abstract": "In this paper, we propose a new and computationally efficient framework for learning sparse models. We formulate a unified approach that contains as particular cases models promoting sparse synthesis and analysis type of priors, and mixtures thereof. The supervised training of the proposed model is formulated as a bilevel optimization problem, in which the operators are optimized to achieve the best possible performance on a specific task, e.g., reconstruction or classification. By restricting the operators to be shift invariant, our approach can be thought as a way of learning analysis+synthesis sparsity-promoting convolutional operators. Leveraging recent ideas on fast trainable regressors designed to approximate exact sparse codes, we propose a way of constructing feed-forward neural networks capable of approximating the learned models at a fraction of the computational cost of exact solvers. In the shift-invariant case, this leads to a principled way of constructing task-specific convolutional networks. We illustrate the proposed models on several experiments in music analysis and image processing applications.", "full_text": "Ef\ufb01cient Supervised Sparse Analysis and Synthesis\n\nOperators\n\nPablo Sprechmann\nDuke University\n\npablo.sprechmann@duke.edu\n\nRoee Litman\n\nTel Aviv University\n\nroeelitman@post.tau.ac.il\n\nTal Ben Yakar\n\nTel Aviv University\n\ntalby10@gmail.com\n\nAlex Bronstein\n\nTel Aviv University\n\nbron@eng.tau.ac.il\n\nGuillermo Sapiro\nDuke University\n\nguillermo.sapiro@duke.edu\n\n\u2217\n\nAbstract\n\nIn this paper, we propose a new computationally ef\ufb01cient framework for learn-\ning sparse models. We formulate a uni\ufb01ed approach that contains as particular\ncases models promoting sparse synthesis and analysis type of priors, and mixtures\nthereof. 
The supervised training of the proposed model is formulated as a bilevel\noptimization problem, in which the operators are optimized to achieve the best\npossible performance on a speci\ufb01c task, e.g., reconstruction or classi\ufb01cation. By\nrestricting the operators to be shift invariant, our approach can be thought as a\nway of learning sparsity-promoting convolutional operators. Leveraging recent\nideas on fast trainable regressors designed to approximate exact sparse codes, we\npropose a way of constructing feed-forward networks capable of approximating\nthe learned models at a fraction of the computational cost of exact solvers. In the\nshift-invariant case, this leads to a principled way of constructing a form of task-\nspeci\ufb01c convolutional networks. We illustrate the proposed models on several\nexperiments in music analysis and image processing applications.\n\n1\n\nIntroduction\n\nParsimony, preferring a simple explanation to a more complex one, is probably one of the most in-\ntuitive principles widely adopted in the modeling of nature. The past two decades of research have\nshown the power of parsimonious representation in a vast variety of applications from diverse do-\nmains of science. Parsimony in the form of sparsity has been shown particularly useful in the \ufb01elds\nof signal and image processing and machine learning. Sparse models impose sparsity-promoting\npriors on the signal, which can be roughly categorized as synthesis or analysis. Synthesis priors are\ngenerative, asserting that the signal is approximated well as a superposition of a small number of\nvectors from a (possibly redundant) synthesis dictionary. Analysis priors, on the other hand, assume\nthat the signal admits a sparse projection onto an analysis dictionary. 
Many classes of signals, in particular speech, music, and natural images, have been shown to be sparsely representable in overcomplete wavelet and Gabor frames, which have been successfully adopted as synthesis dictionaries in numerous applications [14]. Analysis priors involving differential operators, of which total variation is a popular instance, have also been shown very successful in regularizing ill-posed image restoration problems [19].

∗Work partially supported by ARO, BSF, NGA, ONR, NSF, NSSEFF, and Israel-US Binational.

Despite the spectacular success of these axiomatically constructed synthesis and analysis operators, significant empirical evidence suggests that better performance is achieved when a data- or problem-specific dictionary is used instead of a predefined one. The works [1, 16], followed by many others, demonstrated that synthesis dictionaries can be constructed to best represent training data by solving what is essentially a matrix factorization problem. Despite the lack of convexity, many efficient dictionary learning procedures have been proposed.
This unsupervised or data-driven approach to synthesis dictionary learning is well-suited for reconstruction tasks such as image restoration. For example, synthesis models with learned dictionaries have achieved excellent results in denoising [9, 13]. However, in discriminative tasks such as classification, good data reconstruction is not necessarily required or even desirable.
Attempts to replicate the success of sparse models in discriminative tasks led to the recent interest in supervised or task- rather than data-driven dictionary learning, which appeared to be a significantly more difficult modeling and computational problem than its unsupervised counterpart [6].
Supervised learning also seems to be the only practical option for learning unstructured non-generative analysis operators, for which no simple unsupervised alternatives exist. While supervised analysis operator learning has mainly been used as a regularization in inverse problems, e.g., denoising [5], we argue that it is often better suited for classification tasks than its synthesis counterpart, since feature learning and reconstruction are separated. Recent works proposed to address the supervised learning of ℓ1-norm synthesis [12] and analysis [5, 17] priors via bilevel optimization [8], in which the minimization of a task-specific loss with respect to a dictionary depends in turn on the minimizer of a representation pursuit problem using that dictionary.
For the synthesis case, the task-oriented bilevel optimization problem is smooth and can be efficiently solved using stochastic gradient descent (SGD) [12]. However, [12] heavily relies on the separability of the proximal operator of the ℓ1 norm, and thus cannot be extended to the analysis case, where the ℓ1 term is not separable. The approach proposed in [17] formulates an analysis model with a smoothed ℓ1-type prior and uses implicit differentiation to obtain its gradients with respect to the dictionary, required for the solution of the bilevel problem. However, such approximate priors are known to produce inferior results compared to their exact counterparts.
Main contributions.
This paper focuses on supervised learning of synthesis and analysis priors, making three main contributions:
First, we consider a more general sparse model encompassing analysis and synthesis priors as particular cases, and formulate its supervised learning as a bilevel optimization problem. We propose a new analysis technique, for which the (almost everywhere) smoothness of the proposed bilevel problem is shown, and its exact subgradients are derived. We also show that the model can be extended to include a sensing matrix and a non-Euclidean metric in the data term, both of which can be learned as well. We relate the learning of the latter metric matrix to task-driven metric learning techniques.
Second, we show a systematic way of constructing fast fixed-complexity approximations to the solution of the proposed exact pursuit problem by unrolling a few iterations of the exact iterative solver into a feed-forward network, whose parameters are learned in the supervised regime. The idea of deriving a fast approximation of sparse codes from an iterative algorithm has recently been successfully advocated in [11] for the synthesis model. We present an extension of this line of research to the various settings of analysis-flavored sparse models.
Third, we dedicate special attention to the shift-invariant particular case of our model. The fast approximation in this case assumes the form of a convolutional neural network.

2 Analysis, synthesis, and mixed sparse models

We consider a generalization of the Lasso-type [21, 22] pursuit problem

min_y (1/2)||M1 x − M2 y||_2^2 + λ1 ||Ω y||_1 + (λ2/2)||y||_2^2,    (1)

where x ∈ R^n, y ∈ R^k, M1 and M2 are m × n and m × k, respectively, Ω is r × k, and λ1, λ2 > 0 are parameters. Pursuit problem (1) encompasses many important particular cases that have been extensively studied in the literature: by setting M1 = I, Ω = I, and M2 = D to be a column-overcomplete dictionary (k > m), the standard sparse synthesis model is obtained, which attempts to represent the data vector x as a sparse linear combination of the atoms of D. The case where the data are unavailable directly, but rather through a set of (usually fewer, m < n) linear measurements, is handled by supplying x ∈ R^m and setting M2 = ΦD, with Φ being an m × n sensing matrix. Such a case arises frequently in compressed sensing applications as well as in general inverse problems. On the other hand, by setting M1 = M2 = I and Ω a row-overcomplete dictionary (r > k), the standard sparse analysis model is obtained, which attempts to approximate the data vector x by another vector y in the same space admitting a sparse projection on Ω. For example, setting Ω to be the matrix of discrete derivatives leads to total variation regularization, which has been shown extremely successful in numerous signal processing applications.

Algorithm 1: Alternating direction method of multipliers (ADMM).
input: Data x, matrices M1, M2, Ω, weights λ1, λ2, parameter ρ > 0.
output: Sparse code y.
Initialize μ^0 = 0, z^0 = 0
for j = 1, 2, . . . until convergence do
    y^{j+1} = (M2^T M2 + ρ Ω^T Ω + λ2 I)^{−1} (M2^T M1 x + ρ Ω^T (z^j − μ^j))
    z^{j+1} = σ_{λ1/ρ}(Ω y^{j+1} + μ^j)
    μ^{j+1} = μ^j + Ω y^{j+1} − z^{j+1}
end
Here, σ_t(z) = sign(z) · max{|z| − t, 0} denotes the element-wise soft thresholding (the proximal operator of ℓ1).
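As a concrete illustration, Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal sketch rather than an optimized implementation (the function and variable names are ours, and a dense inverse is used where a cached factorization would be preferable):

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def admm_pursuit(x, M1, M2, Omega, lam1, lam2, rho=1.0, n_iter=200):
    """Sketch of Algorithm 1: ADMM for the generalized pursuit problem (1)."""
    k = M2.shape[1]
    r = Omega.shape[0]
    z = np.zeros(r)
    mu = np.zeros(r)
    # The linear system matrix is fixed across iterations, so invert it once.
    A_inv = np.linalg.inv(M2.T @ M2 + rho * Omega.T @ Omega + lam2 * np.eye(k))
    b = M2.T @ (M1 @ x)
    for _ in range(n_iter):
        y = A_inv @ (b + rho * Omega.T @ (z - mu))       # y-update
        z = soft_threshold(Omega @ y + mu, lam1 / rho)   # z-update
        mu = mu + Omega @ y - z                          # dual update
    return y, z
```

For M1 = M2 = Ω = I the problem reduces to elastic-net denoising, whose closed-form solution (soft thresholding) the iterates recover.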
The analysis model can also be extended by adding an m × k sensing operator M2 = Φ, assuming that x is given in the m-dimensional measurement space. This leads to popular analysis formulations of image deblurring, super-resolution, and other inverse problems.
Keeping both the analysis and the synthesis dictionaries and setting M2 = D, Ω = [Ω′D; I], leads to the mixed model. Note that the reconstructed data vector is now obtained by x̂ = Dy with sparse y; as a result, the ℓ1 term is extended to make sparse the projection of x̂ on the analysis dictionary Ω′, as well as to impose sparsity of y. A sensing matrix can be incorporated in this setting as well, by setting M1 = Φ and M2 = ΦD. Alternatively, we can interpret Φ as the projection matrix parametrizing a Φ^T Φ Mahalanobis metric, thus generalizing the traditional Euclidean data term.
A particularly important family of analysis operators is obtained when the operator is restricted to be shift-invariant. In this case, the operator can be expressed as a convolution with a filter, γ ∗ y, whose impulse response γ ∈ R^f is generally of a much smaller dimension than y. A straightforward generalization is to consider an analysis operator consisting of q filters,

Ω(γ1, . . . , γq) = [Ω1(γ1); · · · ; Ωq(γq)]  with  Ωi y = γi ∗ y,  1 ≤ i ≤ q.    (2)

This model includes as a particular case the isotropic total variation prior, for which q = 2 and the filters correspond to the discrete horizontal and vertical derivatives. In general, the exact form of the operator depends on the dimension of the convolution and the type of boundary conditions.
One of the most attractive properties of pursuit problem (1) is its convexity, which becomes strict for λ2 > 0.
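To make the construction in (2) concrete, the following minimal NumPy sketch materializes a 1-D shift-invariant analysis operator as a stack of per-filter convolution matrices. Circular boundary conditions and the forward-difference (TV-like) filter are our illustrative choices, not prescribed by the text:

```python
import numpy as np

def conv_matrix(gamma, k):
    """k x k circular matrix realizing Omega_i y = gamma * y (up to the usual
    correlation/convolution flip, on which conventions vary)."""
    f = len(gamma)
    Om = np.zeros((k, k))
    for row in range(k):
        for j in range(f):
            Om[row, (row + j) % k] = gamma[j]
    return Om

def analysis_operator(filters, k):
    """Stack the per-filter matrices into Omega = [Omega_1; ...; Omega_q], as in (2)."""
    return np.vstack([conv_matrix(g, k) for g in filters])

# Forward-difference filter: rows of Omega_tv compute y[i+1] - y[i] (circularly).
Omega_tv = analysis_operator([np.array([-1.0, 1.0])], k=4)
```

In 2-D the same idea applies per filter, with the boundary handling deciding the exact matrix structure.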
While for Ω = I, (1) can be solved efficiently using popular proximal methods [15] (such as FISTA [2]), this is no longer an option in the case of a non-trivial Ω, as ||Ωy||_1 no longer has a closed-form proximal operator. A way to circumvent this difficulty is to introduce an auxiliary variable z = Ωy and solve the constrained convex program

min_{y,z} (1/2)||M1 x − M2 y||_2^2 + λ1 ||z||_1 + (λ2/2)||y||_2^2   s.t.  z = Ωy,    (3)

with an unscaled ℓ1 term. This leads to the family of so-called split-Bregman methods; the application of augmented Lagrangian techniques to solve (3) is known in the literature as the alternating direction method of multipliers (ADMM) [4], summarized in Algorithm 1. Particular instances may be solved more efficiently with alternative algorithms (e.g., proximal splitting methods).

3 Bilevel sparse models

A central focus of this paper is to develop a framework for supervised learning of the parameters in (1), collectively denoted by Θ = {M1, M2, D, Ω}, to achieve the best possible performance in a specific task such as reconstruction or classification. Supervised schemes arise very naturally when dealing with analysis operators. In sharp contrast to generative synthesis models, where data reconstruction can be enforced unsupervisedly, there is no trivial way to train analysis operators unsupervisedly without restricting them to satisfy some external, frequently arbitrary, constraints. Clearly, unconstrained minimization of (1) over Ω would lead to the trivial solution Ω = 0. The ideas proposed in [12] fit very well here, and were in fact used in [5, 17] for learning unstructured analysis operators. However, in both cases the authors used a smoothed version of the ℓ1 penalty, which is known to produce inferior results.
In this work we extend these ideas without smoothing the penalty. Formally, given an observed variable x ∈ R^n coming from a certain distribution P_X, we aim at predicting a corresponding latent variable y ∈ R^k. The latter can be discrete, representing a label in a classification task, or continuous, as in regression or reconstruction problems. As noted before, when λ2 > 0, problem (1) is strictly convex and, consequently, has a unique minimizer. The solution of the pursuit problem therefore defines an unambiguous deterministic map from the space of the observations to the space of the latent variables, which we denote by y*_Θ(x). The map depends on the model parameters Θ. The goal of supervised learning is to select the Θ that minimizes the expectation over P_X of some problem-specific loss function ℓ. In practice, the distribution P_X is usually unknown, and the expected loss is substituted by an empirical loss computed on a training set of pairs (x, y) ∈ (X, Y). The task-driven model learning problem becomes [12]

min_Θ (1/|X|) Σ_{(x,y)∈(X,Y)} ℓ(y, x, y*_Θ(x)) + φ(Θ),    (4)

where φ(Θ) denotes a regularizer on the model parameters added to stabilize the solution. Problem (4) is a bilevel optimization problem [8], as we need to optimize the loss function ℓ, which in turn depends on the minimizer of (1).
As an example, let us examine the generic class of signal reconstruction problems, in which, as explained in Section 2, the matrix M2 = Φ plays the role of a linear degradation (e.g., blur and sub-sampling in the case of image super-resolution problems), producing the degraded and, possibly, noisy observation x = Φy + n from the latent clean signal y. The goal of the model learning problem is to select the model parameters Θ yielding the most accurate inverse operator, y*_Θ(Φy) ≈ y.
Assuming a simple white Gaussian noise model, this can be achieved through the loss

ℓ(y, x, y*) = (1/2)||y − y*||_2^2.    (5)

While the supervised learning of analysis operators has been considered for solving denoising problems [5, 17], here we address more general scenarios. In particular, we argue that, when used along with metric learning, it is often better suited for classification tasks than its synthesis counterpart, because the non-generative nature of analysis models is more suitable for feature learning. For simplicity, we consider the case of a linear binary classifier of the form sign(w^T z + b) operating on the "feature vector" z = Ω y*_Θ(x). Using a loss of the form ℓ(y, x, z) = f(−y(w^T z + b)), with f being, e.g., the logistic regression function f(t) = log(1 + e^{−t}), we train the model parameters Θ simultaneously with the classifier parameters w, b. In this context, the learning of Θ can be interpreted as feature learning.
The generalization to multi-class classification problems is straightforward, using a matrix W and a vector b instead of w and b. It is worth noting that more stable classifiers are obtained by adding a regularization of the form φ = ||W||_F^2 to the learning problem (4).
Optimization. A local minimizer of the non-convex model learning problem (4) can be found via stochastic optimization [8, 12, 17], by performing gradient descent steps on each of the variables in Θ, with the pair (x, y) each time drawn at random from the training set. Specifically, the parameters at iteration i + 1 are obtained by

Θ^{i+1} ← Θ^i − η_i ∇_Θ ℓ(x, y, y*_{Θ^i}(x)),    (6)

where 0 ≤ η_i ≤ η is a decreasing sequence of step sizes.
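The update (6) amounts to a standard SGD loop. A minimal sketch follows; `grad_loss` is a hypothetical placeholder for the gradient computation derived in the sequel, and the annealed step size min(η, η·i0/i) is the rule used in our experiments:

```python
import random

def sgd_train(theta, training_pairs, grad_loss, eta=0.1, i0=100, n_iter=1000):
    """Stochastic gradient descent for the bilevel learning problem (4).

    theta       : dict of model parameters (arrays or scalars).
    grad_loss   : placeholder; grad_loss(x, y, theta) must return a dict with
                  the gradient of the task loss for each parameter in theta.
    """
    for i in range(1, n_iter + 1):
        x, y = random.choice(training_pairs)          # draw a training pair
        eta_i = min(eta, eta * i0 / i)                # fixed step, then 1/i decay
        g = grad_loss(x, y, theta)
        for name in theta:
            theta[name] = theta[name] - eta_i * g[name]
    return theta
```

The loop updates `theta` in place; in practice a mini-batch version and a projection/regularization step for φ(Θ) would be added.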
Following [12], we use a step size of the form η_i = min(η, η i0/i) in all our experiments, which means that a fixed step size is used during the first i0 iterations, after which it decays according to the 1/i annealing strategy. Note that the learning requires the gradient ∇_Θ ℓ, which in turn relies on the gradient of y*_Θ(x) with respect to Θ. Even though y*_Θ(x) is obtained by solving a non-smooth optimization problem, we will show that it is almost everywhere differentiable, and that one can compute its gradient with respect to Θ = {M1, M2, D, Ω} explicitly and in closed form. In the next section, we briefly summarize the derivation of the gradients ∇_{M2} ℓ and ∇_Ω ℓ, as these two are the most interesting cases. The gradients needed for the remaining model settings described in Section 2 can be obtained straightforwardly from ∇_{M2} ℓ and ∇_Ω ℓ.
Gradient computation. To obtain the gradients of the cost function with respect to the matrices M2 and Ω, we consider a version of (3) in which the equality constraint is relaxed by a penalty,

min_{z,y} (1/2)||M1 x − M2 y||_2^2 + λ1 ||z||_1 + (t/2)||Ω y − z||_2^2 + (λ2/2)||y||_2^2,    (8)

with t > 0 being the penalty parameter. We denote by y*_t and z*_t the unique minimizers of this strongly convex optimization problem with t, x, M1, M2, and Ω fixed. Naturally, y*_t and z*_t are functions of x and Θ, the same way as y*_Θ(x); throughout this section, we omit this dependence to simplify notation. The first-order optimality conditions of (8) lead to the equalities

M2^T (M2 y*_t − M1 x) + t Ω^T (Ω y*_t − z*_t) + λ2 y*_t = 0,    (9)
t (z*_t − Ω y*_t) + λ1 (sign(z*_t) + α) = 0,    (10)

where the sign of zero is defined as zero, and α is a vector in R^r such that α_Λ = 0 and |α_{Λc}| ≤ 1. Here, α_Λ denotes the sub-vector of α whose rows are restricted to Λ, the set of non-zero coefficients (active set) of z*_t.
It has been shown that the solutions of the synthesis [12], analysis [23], and generalized Lasso [22] regularization problems are all piecewise affine functions of the observations and the regularization parameter. This means that the active set of the solution is constant on intervals of the regularization parameter λ1. Moreover, the number of transition points (values of λ1 at which, for a given observation x, the active set of the solution changes) is finite and thus negligible. It can be shown that if λ1 is not a transition point of x, then a small perturbation in Ω, M1, or M2 leaves Λ and the signs of the coefficients in the solution unchanged [12]. Applying this result to (8), we can state that sign(z*_t) = sign(Ω y*_t).
Let I_Λ be the projection onto Λ, and let P_Λ = I_Λ^T I_Λ = diag{|sign(z*)|} denote the matrix setting to zero the rows corresponding to Λc. Multiplying the second optimality condition by P_Λ, we have z*_t = P_Λ Ω y*_t − (λ1/t) sign(z*_t), where we used the fact that P_Λ sign(z*_t) = sign(z*_t).
We can plug the latter result into (9), obtaining

y*_t = Q_t (M2^T M1 x − λ1 Ω^T sign(z*_t)),    (11)

where Q_t = (t Ω^T P_{Λc} Ω + B)^{−1} and B = M2^T M2 + λ2 I. By using the first-order Taylor expansion of (11), we can obtain expressions for the gradients of ℓ(y*_t) with respect to Ω and M2,

∇_Ω ℓ(y*_t) = −λ1 sign(z*_t) β_t^T − P_{Λc} Ω (t y*_t β_t^T + t β_t y*_t^T),    (12)
∇_{M2} ℓ(y*_t) = M2 (y*_t β_t^T + β_t y*_t^T),    (13)

where β_t = Q_t ∇_{y*} ℓ(y*_t).
Note that since the (unique) solution of (8) can be made arbitrarily close to the (unique) solution of (1) by increasing t, we can obtain the exact gradients of y* by taking the limit t → ∞ in the above expressions. First, observe that

Q_t = (t Ω^T P_{Λc} Ω + B)^{−1} = (B (t B^{−1} Ω^T P_{Λc} Ω + I))^{−1} = (tC + I)^{−1} B^{−1},

where C = B^{−1} Ω^T P_{Λc} Ω. Note that B is invertible if M2 is full-rank or if λ2 > 0. Let C = U H U^{−1} be the eigen-decomposition of C, with H a diagonal matrix with elements h_i, 1 ≤ i ≤ k. Then, Q_t = U H_t U^{−1} B^{−1}, where H_t is diagonal with 1/(t h_i + 1) on the diagonal. In the limit, t h_i → 0 if h_i = 0, and t h_i → ∞ otherwise, yielding

Q = lim_{t→∞} Q_t = U H′ U^{−1} B^{−1}  with  H′ = diag{h′_i},  h′_i = 0 if h_i ≠ 0, and h′_i = 1 if h_i = 0.    (14)

The optimum of (1) is given by y* = Q (M2^T M1 x − λ1 Ω^T sign(z*)). Analogously, we take the limit in the expressions describing the gradients in (12) and (13). We summarize our main result in Proposition 1 below, for which we define

Q̃ = lim_{t→∞} t Q_t = U H″ U^{−1} B^{−1}  with  H″ = diag{h″_i},  h″_i = 1/h_i if h_i ≠ 0, and h″_i = 0 if h_i = 0.    (15)

Proposition 1. The functional y* = y*_Θ(x) in (1) is almost everywhere differentiable for λ2 > 0, and its gradients satisfy

∇_Ω ℓ(y*) = −λ1 sign(Ω y*) β^T − P_{Λc} Ω (ỹ* β^T + β̃ y*^T),
∇_{M2} ℓ(y*) = M2 (y* β^T + β y*^T),

where the vectors β, β̃, and ỹ* in R^k are defined as β = Q ∇_{y*} ℓ(x, Θ), β̃ = Q̃ ∇_{y*} ℓ(x, Θ), and ỹ* = Q̃ (M2^T M1 x − λ1 Ω^T sign(z*)), with Q and Q̃ given by (14) and (15), respectively.

Figure 1: ADMM neural network encoder. The network comprises K identical layers parameterized by the matrices A and B and the threshold vector t, and one output layer parameterized by the matrices U and V. The initial values of the learned parameters are given by ADMM (see Algorithm 1) according to U = (M2^T M2 + ρ Ω^T Ω + λ2 I)^{−1} M2^T M1, V = ρ (M2^T M2 + ρ Ω^T Ω + λ2 I)^{−1} Ω^T, A = ΩU, H = 2ΩV − I, G = 2I − ΩV, F = ΩV − I, and t = (λ1/ρ) 1.

In addition to being a useful analytic tool, the relationship between (1) and its relaxed version (8) also has practical implications. Obtaining the exact gradients given in Proposition 1 requires computing the eigendecomposition of C, which is in general computationally expensive.
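Assuming the minimizer y* and the auxiliary variable z* = Ωy* of (1) are available, the gradients of Proposition 1 can be sketched directly in NumPy. This is a naive illustration using a dense eigendecomposition; the numerical zero-threshold and all names are our choices:

```python
import numpy as np

def prop1_gradients(x, y_star, z_star, grad_y, M1, M2, Omega, lam1, lam2):
    """Sketch of the exact gradients of Proposition 1.

    grad_y is the gradient of the task loss with respect to y*.
    Returns the gradients of the loss with respect to Omega and M2.
    """
    k = M2.shape[1]
    s = np.sign(z_star)
    P_Lc = np.diag((s == 0).astype(float))     # projection onto the inactive set
    B = M2.T @ M2 + lam2 * np.eye(k)
    B_inv = np.linalg.inv(B)
    C = B_inv @ Omega.T @ P_Lc @ Omega
    h, U = np.linalg.eig(C)
    U_inv = np.linalg.inv(U)
    nz = np.abs(h) > 1e-10                     # numerically non-zero eigenvalues
    Hp = np.diag(np.where(nz, 0.0, 1.0))                           # H'
    Hpp = np.diag(np.where(nz, 1.0 / np.where(nz, h, 1.0), 0.0))   # H''
    Q = U @ Hp @ U_inv @ B_inv
    Q_tilde = U @ Hpp @ U_inv @ B_inv
    rhs = M2.T @ M1 @ x - lam1 * Omega.T @ s
    beta = Q @ grad_y
    beta_tilde = Q_tilde @ grad_y
    y_tilde = Q_tilde @ rhs
    grad_Omega = (-lam1 * np.outer(s, beta)
                  - P_Lc @ Omega @ (np.outer(y_tilde, beta)
                                    + np.outer(beta_tilde, y_star)))
    grad_M2 = M2 @ (np.outer(y_star, beta) + np.outer(beta, y_star))
    return np.real(grad_Omega), np.real(grad_M2)
```

As the text notes, the eigendecomposition makes this exact route expensive; the finite-t approximation is cheaper in practice.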
In practice, we approximate the gradients using the expressions in (12) and (13) with a fixed, sufficiently large value of t. The supervised model learning framework can be straightforwardly specialized to the shift-invariant case, in which the filters γ_i in (2) are learned instead of a full matrix Ω. The gradients of ℓ with respect to the filter coefficients are obtained using Proposition 1 and the chain rule.

4 Fast approximation

The discussed sparse models rely on an iterative optimization scheme such as ADMM, required to solve the pursuit problem (1). This has relatively high computational complexity and latency, which is furthermore data-dependent. ADMM typically requires hundreds or thousands of iterations to converge, depending greatly on the problem and the input. While classical optimization theory provides worst-case (data-independent) convergence rate bounds for many families of iterative algorithms, very little is known about their behavior on specific data coming, e.g., from a distribution supported on a low-dimensional manifold, characteristics often exhibited by real data. The common practice of sparse modeling concentrates on creating sophisticated data models, and then relies on computational and analytic techniques that are totally agnostic of the data structure.
Such\na discrepancy hides a (possibly dramatic) potential of computational improvement [11].\nFrom the perspective of the pursuit process, the minimization of (1) is merely a proxy to obtaining\na highly non-linear map between the data vector x and the representation vector y (which can also\nbe the \u201cfeature\u201d vector \u2126Dy or the reconstructed data vector Dy, depending on the application).\nAdopting ADMM, such a map can be expressed by unrolling a suf\ufb01cient number K of iterations into\na feed-forward network comprising K (identical) layers depicted in Figure 1, where the parameters\nA, B, U, V, and t, collectively denoted as \u03a8, are prescribed by the ADMM iteration. Fixing K, we\nobtain a \ufb01xed-complexity and latency encoder \u02c6yK,\u03a8(x), parameterized by \u03a8.\nNote that for a suf\ufb01ciently large K, \u02c6yK,\u03a8(x) \u2248 y\u2217(x), with the latter denoting the exact minimizer\nof (1) given the input x. However, when complexity budget constraints require K to be truncated\nat a small \ufb01xed number, the output of \u02c6yK,\u03a8 is usually unsatisfactory, and the worst-case analysis\nprovided by the classical optimization theory is of little use. 
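Concretely, the fixed-complexity encoder ŷ_{K,Ψ} obtained by unrolling K ADMM iterations can be sketched as follows. The parameterization shown (one matrix applied to the input, one to the running variable) is one natural choice initialized from the exact solver, not necessarily the exact layer structure of Figure 1, and training by back-propagation is omitted:

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

class UnrolledADMMEncoder:
    """Fixed-complexity encoder obtained by unrolling K ADMM iterations.

    The parameters are initialized from the model matrices as prescribed by
    Algorithm 1, and would then be refined by supervised back-propagation.
    """
    def __init__(self, M1, M2, Omega, lam1, lam2, rho=1.0, K=10):
        k = M2.shape[1]
        P = np.linalg.inv(M2.T @ M2 + rho * Omega.T @ Omega + lam2 * np.eye(k))
        # Learnable parameters, initialized from the exact solver:
        self.W = P @ M2.T @ M1        # acts on the input x
        self.S = rho * P @ Omega.T    # acts on the running variable (z - mu)
        self.Omega = Omega.copy()
        self.t = lam1 / rho           # threshold vector/scalar
        self.K = K

    def encode(self, x):
        r = self.Omega.shape[0]
        z = np.zeros(r)
        mu = np.zeros(r)
        Wx = self.W @ x               # computed once per input
        for _ in range(self.K):       # K identical layers
            y = Wx + self.S @ (z - mu)
            z = soft(self.Omega @ y + mu, self.t)
            mu = mu + self.Omega @ y - z
        return y
```

With the ADMM initialization, the output of `encode` approaches the exact minimizer of (1) as K grows.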
However, within the family of functions {ŷ_{K,Ψ} : Ψ}, there might exist parameters for which ŷ performs better on relevant input data. Such parameters can be obtained via learning, as described in the sequel.
Similar ideas were first advocated by [11], who considered Lasso sparse synthesis models, and showed that by unrolling iterative shrinkage-thresholding algorithms (ISTA) into a neural network,
There is no principled way of choosing\nthe number of layers K and in practice this is done via cross-validation. In Section 5 we discuss the\nselection of K for a particular example.\nIn the particular setting of a shift-invariant analysis model, the described neural network encoder\nassumes a structure resembling that of a convolutional network. The matrices A, B, U, and V\nparameterizing the network in Figure 1 are replaced by a set of \ufb01lter coef\ufb01cients. The initial inverse\nkernels of the form (\u03c1\u2126T\u2126+(1+\u03bb2)I)\u22121 prescribed by ADMM are approximated by \ufb01nite-support\n\ufb01lters, which are computed using a standard least squares procedure.\n\n5 Experimental results and discussion\n\nIn what follows, we illustrate the proposed approaches on two experiments: single-image super-\nresolution (demonstrating a reconstruction problem), and polyphonic music transcription (demon-\nstrating a classi\ufb01cation problem). Additional \ufb01gures are provided in the supplementary materials.\n\nSingle-image super-resolution. Single-image super-resolution is an inverse problem in which\na high-resolution image is reconstructed from its blurred and down-sampled version lacking the\nhigh-frequency details. Low-resolution images were created by blurring the original ones with an\nanti-aliasing \ufb01lter, followed by down-sampling operator. In [25], it has been demonstrated that pre-\n\ufb01ltering a high resolution image with a Gaussian kernel with \u03c3 = 0.8s guarantees that the following\ns \u00d7 s sub-sampling generates an almost aliasing-free low resolution image. This models very well\npractical image decimation schemes, since allowing a certain amount of aliasing improves the visual\nperception. Super-resolution consists in inverting both the blurring and sub-sampling together as a\ncompound operator. 
Since the amount of aliasing is limited, a bi-cubic spline interpolation is more accurate than lower-order interpolations for restoring the images to their original size. As shown in [26], up-sampling the low-resolution image in this way produces an image that is very close to the pre-filtered high-resolution counterpart. The problem then reduces to deconvolution with a Gaussian kernel. In all our experiments we used the scaling factor s = 2. A shift-invariant analysis model was tested in three configurations: a TV prior created using horizontal and vertical derivative filters; a bank of 48 non-constant 7×7 DCT filters (referred to henceforth as A-DCT); and a combination of the former two settings tuned using the proposed supervised scheme with the loss function (5). The training set consisted of random image patches from [24]. We also tested a convolutional neural network approximation of the third model, trained under similar conditions. The pursuit problem was solved using ADMM with ρ = 1, requiring about 100 iterations to converge. Table 1 reports the obtained PSNR results on seven standard images used in super-resolution experiments. Visual results are shown in the supplementary materials. We observe that, on average, the supervised model outperforms A-DCT and TV by 1-3 dB PSNR. While performing slightly inferior to the exact supervised model, the neural network approximation is about ten times faster.

Automatic polyphonic music transcription. The goal of automatic music transcription is to obtain a musical score from an input audio signal. This task is particularly difficult when the audio signal is polyphonic, i.e., contains multiple pitches present simultaneously.
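The two fixed analysis operators above can be instantiated as small filter banks. A numpy sketch of the TV derivative filters and the 48 non-constant 7×7 DCT filters follows; the low-frequency-first ordering is one plausible choice, not specified in the text:

```python
import numpy as np

def dct_filter_bank(size=7, keep=48):
    """Build the 2-D separable DCT filters of support size x size, dropping the
    constant (DC) atom and keeping the `keep` lowest-frequency ones.
    With size=7, the 48 remaining filters match the A-DCT configuration."""
    n = np.arange(size)
    # 1-D orthonormal DCT-II basis: column k is the k-th frequency
    C = np.cos(np.pi * (n[:, None] + 0.5) * n[None, :] / size)
    C[:, 0] /= np.sqrt(2)
    C *= np.sqrt(2.0 / size)
    filters, freqs = [], []
    for i in range(size):
        for j in range(size):
            if i == 0 and j == 0:
                continue                      # skip the constant filter
            filters.append(np.outer(C[:, i], C[:, j]))
            freqs.append(i + j)
    order = np.argsort(freqs, kind="stable")  # low frequencies first
    return [filters[k] for k in order[:keep]]

# TV prior: horizontal and vertical first-order derivative filters
tv_filters = [np.array([[1.0, -1.0]]), np.array([[1.0], [-1.0]])]
bank = dct_filter_bank()                      # 48 zero-mean 7x7 filters
```

Each filter has zero mean (the DC atom is excluded), so the analysis coefficients respond only to local image structure, as a sparsity-promoting prior requires.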
Like the majority of music and speech analysis techniques, music transcription typically operates on the magnitude of an audio time-frequency representation such as the short-time Fourier transform or the constant-Q transform (CQT) [7] (adopted here). Given a spectral frame x at some time, the transcription problem consists of producing a binary label vector p ∈ {−1, +1}^k, whose i-th element indicates the presence (+1) or absence (−1) of the i-th pitch at that time.

method           mean ± std. dev.   man     woman   barbara  boats   lena    house   peppers
Bicubic          29.51 ± 4.39       28.52   38.22   24.02    27.38   30.77   29.75   27.95
TV               29.04 ± 3.51       30.23   33.39   24.25    29.44   31.75   29.91   24.31
A-DCT            31.06 ± 4.84       29.85   40.23   24.32    28.89   32.72   31.68   29.71
SI-ADMM          32.03 ± 4.84       31.05   40.62   24.55    30.06   34.06   32.91   30.93
SI-NN (K = 10)   31.53 ± 5.03       30.42   40.99   24.53    29.12   33.58   31.82   30.21

Table 1: PSNR in dB of different image super-resolution methods: bicubic interpolation (Bicubic), shift-invariant analysis models with TV and DCT priors (TV and A-DCT), supervised shift-invariant analysis model (SI-ADMM), and its fast approximation with K = 10 layers (SI-NN).

Figure 2: Left: Accuracy of the proposed analysis model (Analysis-ADMM) and its fast approximation (Analysis-NN) as a function of the number of iterations or layers K. For reference, the accuracy of a non-negative synthesis model as well as of two leading methods [3, 18] is shown. Right: Precision-recall curve.
We use k = 88, corresponding to the span of the standard piano keyboard (MIDI pitches 21-108). We used an analysis model with a square dictionary Ω and a square metric matrix M1 = M2 to produce the feature vector z = Ωy, which was then fed to a classifier of the form p = sign(Wz + b). The parameters Ω, M2, W, and b were trained using the logistic loss on the MAPS Disklavier dataset [10], containing examples of polyphonic piano recordings with time-aligned groundtruth. The testing was performed on another annotated real piano dataset from [18]. Transcription was performed frame-by-frame, and the output of the classifier was temporally filtered using the hidden Markov model proposed in [3]. For comparison, we show the performance of a supervised non-negative synthesis model and of two leading methods [3, 18] evaluated in the same settings.

Performance was measured using the standard precision-recall curve depicted in Figure 2 (right); in addition, we used the accuracy measure Acc = TP/(FP + FN + TP), where TP (true positives) is the number of correctly predicted pitches, and FP (false positives) and FN (false negatives) are the numbers of pitches incorrectly transcribed as ON or OFF, respectively. This measure is frequently used in the music analysis literature [3, 18]. The supervised analysis model outperforms leading pitch transcription methods. Figure 2 (left) shows that replacing the exact ADMM solver by the fast approximation described in Section 4 achieves comparable performance at significantly lower complexity. In this example, ten layers suffice for a good representation, and the improvement obtained by adding further layers becomes marginal beyond this point.

Conclusion. We presented a bilevel optimization framework for the supervised learning of a superset of sparse analysis and synthesis models.
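The classifier and the Acc measure described above admit a direct implementation; a minimal numpy sketch, with array shapes and function names purely illustrative:

```python
import numpy as np

def classify(z, W, b):
    """Linear pitch classifier p = sign(Wz + b); with k = 88 pitches,
    W is 88 x dim(z) and b has 88 entries. (np.sign returns 0 on an
    exact zero score, which in practice should be mapped to -1.)"""
    return np.sign(W @ z + b)

def frame_accuracy(p_pred, p_true):
    """Frame-level accuracy Acc = TP / (FP + FN + TP), with pitch
    activity encoded as +1 (ON) / -1 (OFF)."""
    tp = np.sum((p_pred == 1) & (p_true == 1))
    fp = np.sum((p_pred == 1) & (p_true == -1))
    fn = np.sum((p_pred == -1) & (p_true == 1))
    return tp / float(fp + fn + tp)
```

For instance, one missed pitch and one spurious pitch against one correct detection give Acc = 1/3, reflecting that the measure penalizes both error types symmetrically.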
We also showed that in applications requiring low complexity or latency, a fast approximation to the exact solution of the pursuit problem can be achieved by a feed-forward architecture derived from truncated ADMM. The obtained fast regressor can be initialized with the model parameters trained through the supervised bilevel framework, and tuned similarly to the training and adaptation of neural networks. We observed that the structure of the network essentially becomes a convolutional network in the case of shift-invariant models. The generative setting of the proposed approach was demonstrated on an image restoration experiment, while the discriminative setting was tested in a polyphonic piano transcription experiment. In the former we obtained a very good and fast solution, while in the latter the results are comparable or superior to the state of the art.

References

[1] M. Aharon, M. Elad, and A. Bruckstein. k-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Sig. Proc., 54(11):4311-4322, 2006.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2:183-202, March 2009.

[3] E. Benetos and S. Dixon. Multiple-instrument polyphonic music transcription using a convolutive probabilistic model. In Sound and Music Computing Conference, pages 19-24, 2011.

[4] D.P. Bertsekas. Nonlinear Programming. 1999.

[5] H. Bischof, Y. Chen, and T. Pock. Learning ℓ1-based analysis and synthesis sparsity priors using bi-level optimization. NIPS Workshop, 2012.

[6] M. M. Bronstein, A. M. Bronstein, M. Zibulevsky, and Y. Y. Zeevi.
Blind deconvolution of images using optimal sparse representations. IEEE Trans. Im. Proc., 14(6):726-736, 2005.

[7] J. C. Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89:425, 1991.

[8] B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of Operations Research, 153(1):235-256, 2007.

[9] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Im. Proc., 54(12):3736-3745, 2006.

[10] V. Emiya, R. Badeau, and B. David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio, Speech, and Language Proc., 18(6):1643-1654, 2010.

[11] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, pages 399-406, 2010.

[12] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Trans. PAMI, 34(4):791-804, 2012.

[13] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Trans. Im. Proc., 17(1):53-69, 2008.

[14] S. Mallat. A Wavelet Tour of Signal Processing, Second Edition. Academic Press, 1999.

[15] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE, Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2007.

[16] B.A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, 1996.

[17] G. Peyré and J. Fadili. Learning analysis sparsity priors. SAMPTA'11, 2011.

[18] G. E. Poliner and D. Ellis. A discriminative model for polyphonic piano transcription. EURASIP J. Adv. in Sig. Proc., 2007, 2006.

[19] L.I. Rudin, S. Osher, and E. Fatemi.
Nonlinear total variation-based noise removal algorithms. Physica D, 60(1-4):259-268, 1992.

[20] P. Sprechmann, A. M. Bronstein, and G. Sapiro. Learning efficient sparse and low rank models. arXiv preprint arXiv:1212.3631, 2012.

[21] R. Tibshirani. Regression shrinkage and selection via the LASSO. J. Royal Stat. Society: Series B, 58(1):267-288, 1996.

[22] R. J. Tibshirani. The solution path of the generalized lasso. Stanford University, 2011.

[23] S. Vaiter, G. Peyré, C. Dossal, and J. Fadili. Robust sparse analysis regularization. IEEE Trans. Information Theory, 59(4):2001-2016, 2013.

[24] J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution as sparse representation of raw image patches. In Proc. CVPR, pages 1-8. IEEE, 2008.

[25] G. Yu and J.-M. Morel. On the consistency of the SIFT method. Inverse Problems and Imaging, 2009.

[26] G. Yu, G. Sapiro, and S. Mallat. Solving inverse problems with piecewise linear estimators: from Gaussian mixture models to structured sparsity. IEEE Trans. Im. Proc., 21(5):2481-2499, 2012.