{"title": "Reconciling \"priors\" & \"priors\" without prejudice?", "book": "Advances in Neural Information Processing Systems", "page_first": 2193, "page_last": 2201, "abstract": "There are two major routes to address linear inverse problems. Whereas regularization-based approaches build estimators as solutions of penalized regression optimization problems, Bayesian estimators rely on the posterior distribution of the unknown, given some assumed family of priors. While these may seem radically different approaches, recent results have shown that, in the context of additive white Gaussian denoising, the Bayesian conditional mean estimator is always the solution of a penalized regression problem. The contribution of this paper is twofold. First, we extend the additive white Gaussian denoising results to general linear inverse problems with colored Gaussian noise. Second, we characterize conditions under which the penalty function associated to the conditional mean estimator can satisfy certain popular properties such as convexity, separability, and smoothness. This sheds light on some tradeoff between computational efficiency and estimation accuracy in sparse regularization, and draws some connections between Bayesian estimation and proximal optimization.", "full_text": "Reconciling \u201cpriors\u201d & \u201cpriors\u201d without prejudice?\n\nR\u00b4emi Gribonval \u2217\n\nInria\n\nPierre Machart\n\nInria\n\nCentre Inria Rennes - Bretagne Atlantique\n\nCentre Inria Rennes - Bretagne Atlantique\n\nremi.gribonval@inria.fr\n\npierre.machart@inria.fr\n\nAbstract\n\nThere are two major routes to address linear inverse problems. Whereas\nregularization-based approaches build estimators as solutions of penalized regres-\nsion optimization problems, Bayesian estimators rely on the posterior distribution\nof the unknown, given some assumed family of priors. 
While these may seem\nradically different approaches, recent results have shown that, in the context of\nadditive white Gaussian denoising, the Bayesian conditional mean estimator is\nalways the solution of a penalized regression problem. The contribution of this\npaper is twofold. First, we extend the additive white Gaussian denoising results\nto general linear inverse problems with colored Gaussian noise. Second, we char-\nacterize conditions under which the penalty function associated to the conditional\nmean estimator can satisfy certain popular properties such as convexity, separa-\nbility, and smoothness. This sheds light on some tradeoff between computational\nef\ufb01ciency and estimation accuracy in sparse regularization, and draws some con-\nnections between Bayesian estimation and proximal optimization.\n\n1\n\nIntroduction\n\nLet us consider a fairly general linear inverse problem, where one wants to estimate a parameter\nvector z \u2208 RD , from a noisy observation y \u2208 Rn, such that y = Az + b, where A \u2208 Rn\u00d7D\nis sometimes referred to as the observation or design matrix, and b \u2208 Rn represents an additive\nGaussian noise with a distribution PB \u223c N (0, \u03a3). When n < D, it turns out to be an ill-posed\nproblem. However, leveraging some prior knowledge or information, a profusion of schemes have\nbeen developed in order to provide an appropriate estimation of z. In this abundance, we will focus\non two seemingly very different approaches.\n\n1.1 Two families of approaches for linear inverse problems\n\nOn the one hand, Bayesian approaches are based on the assumption that z and b are drawn from\nprobability distributions PZ and PB respectively. 
From that point, a straightforward way to estimate z is to build, for instance, the minimum mean squared error (MMSE) estimator, sometimes referred to as the Bayesian least squares, conditional expectation or conditional mean estimator, defined as:\n\n\u03c8MMSE(y) := E(Z|Y = y). (1)\n\nThis estimator has the nice property of being optimal (in a least squares sense) but suffers from its explicit reliance on the prior distribution, which is usually unknown in practice. Moreover, its computation involves a tedious integral that generally cannot be evaluated explicitly.\n\n\u2217The authors are with the PANAMA project-team at IRISA, Rennes, France.\n\nOn the other hand, regularization-based approaches have been at the centre of a tremendous amount of work from a wide community of researchers in machine learning, signal processing, and more generally in applied mathematics. These approaches focus on building estimators (also called decoders) with no explicit reference to the prior distribution. Instead, these estimators are built as an optimal trade-off between a data fidelity term and a term promoting some regularity on the solution. Among these, we will focus on a widely studied family of estimators \u03c8 of the form:\n\n\u03c8(y) := argmin_{z \u2208 RD} 1/2 \u2016y \u2212 Az\u2016\u00b2 + \u03c6(z). (2)\n\nFor instance, the specific choice \u03c6(z) = \u03bb\u2016z\u2016\u00b2\u2082 gives rise to a method often referred to as ridge regression [1], while \u03c6(z) = \u03bb\u2016z\u2016\u2081 gives rise to the famous Lasso [2].\nThe \u2113\u2081 decoder associated to \u03c6(z) = \u03bb\u2016z\u2016\u2081 has attracted particular attention, for the associated optimization problem is convex, and generalizations to other \u201cmixed\u201d norms are being intensively studied [3, 4]. 
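To make (2) concrete, here is a minimal numerical sketch (our own illustration, not code from the paper; the problem sizes and regularization weights are arbitrary): the ridge penalty admits a closed-form solution, while the \u2113\u2081 penalty can be handled by proximal gradient (ISTA) iterations built on soft thresholding.

```python
import numpy as np

def ridge(y, A, lam):
    """Minimize 0.5*||y - A z||^2 + lam*||z||_2^2 (closed form)."""
    D = A.shape[1]
    # stationarity: A^T (A z - y) + 2*lam*z = 0
    return np.linalg.solve(A.T @ A + 2 * lam * np.eye(D), A.T @ y)

def soft(v, t):
    """Proximal operator of t*||.||_1: coordinate-wise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(y, A, lam, n_iter=500):
    """Minimize 0.5*||y - A z||^2 + lam*||z||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = soft(z - A.T @ (A @ z - y) / L, lam / L)
    return z
```

For A = I and weight lam, the Lasso minimizer is the soft threshold soft(y, lam), which the iteration reproduces, while ridge reduces to the linear shrinkage y/(1 + 2*lam).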
Several facts explain the popularity of such approaches: a) these penalties have well-understood geometric interpretations; b) they are known to be sparsity promoting (the minimizer has many zeroes); c) this can be exploited in active set methods for computational efficiency [5]; d) convexity offers a comfortable framework to ensure both a unique minimum and a rich toolbox of efficient and provably convergent optimization algorithms [6].\n\n1.2 Do they really provide different estimators?\n\nRegularization and Bayesian estimation seemingly yield radically different viewpoints on inverse problems. In fact, they are underpinned by distinct ways of defining signal models or \u201cpriors\u201d. The \u201cregularization prior\u201d is embodied by the penalty function \u03c6(z), which promotes certain solutions, somehow carving an implicit signal model. In the Bayesian framework, the \u201cBayesian prior\u201d is embodied by where the mass of the signal distribution PZ lies.\n\nThe MAP quid pro quo A quid pro quo between these distinct notions of priors has crystallized around the notion of maximum a posteriori (MAP) estimation, leading to a long-lasting incomprehension between two worlds. In fact, a simple application of Bayes rule shows that under a Gaussian noise model b \u223c N (0, I) and a Bayesian prior PZ(z \u2208 E) = \u222bE pZ(z)dz, E \u2282 RD, MAP estimation1 yields the optimization problem (2) with regularization prior \u03c6Z(z) := \u2212 log pZ(z). By a trivial identification, the optimization problem (2) with regularization prior \u03c6(z) is now routinely called \u201cMAP with prior exp(\u2212\u03c6(z))\u201d. With the \u2113\u2081 penalty, it is often called \u201cMAP with a Laplacian prior\u201d. 
As an unfortunate consequence of an erroneous \u201creverse reading\u201d of this fact, this identification has given rise to the common myth that the optimization approach is particularly well adapted when the unknown is distributed as exp(\u2212\u03c6(z)). As a striking counter-example to this myth, it has recently been proved [7] that when z is drawn i.i.d. Laplacian and A \u2208 Rn\u00d7D is drawn from the Gaussian ensemble, the \u2113\u2081 decoder \u2013 and indeed any sparse decoder \u2013 will be outperformed by the least squares decoder \u03c8LS(y) := pinv(A)y, unless n \u2273 0.15D.\nIn fact, [8] warns us that the MAP estimate is only one of several possible Bayesian interpretations of (2), even though it is the most straightforward one. Furthermore, to dispel that erroneous conception, a deeper connection is established, showing that in the more restricted context of (white) Gaussian denoising, for any prior, there exists a regularizer \u03c6 such that the MMSE estimator can be expressed as the solution of problem (2). This result essentially exhibits a regularization-oriented formulation for which two radically different interpretations can be made. It highlights the following important fact: the specific choice of a regularizer \u03c6 does not, alone, induce an implicit prior on the supposed distribution of the unknown; besides a prior PZ, a Bayesian estimator also involves the choice of a loss function. For certain regularizers \u03c6, there can in fact exist (at least two) different priors PZ for which the optimization problem (2) yields the optimal Bayesian estimator, associated to (at least) two different losses (e.g., the 0/1 loss for the MAP, and the quadratic loss for the MMSE).\n\n1.3 Main contributions\n\nA first major contribution of this paper is to extend the aforementioned result [8] to a more general linear inverse problem setting. 
Our first main results can be introduced as follows:\n\n1which is the Bayesian optimal estimator in a 0/1 loss sense, for discrete signals.\n\nTheorem (Flavour of the main result). For any non-degenerate2 prior PZ, any non-degenerate covariance matrix \u03a3 and any design matrix A with full rank, there exists a regularizer \u03c6A,\u03a3,PZ such that the MMSE estimator of z \u223c PZ given the observation y = Az + b with b \u223c N (0, \u03a3),\n\n\u03c8A,\u03a3,PZ (y) := E(Z|Y = y), (3)\n\nis a minimizer of z \u21a6 1/2 \u2016y \u2212 Az\u2016\u00b2\u03a3 + \u03c6A,\u03a3,PZ (z).\nRoughly, it states that for the considered inverse problem, for any prior on z, the MMSE estimate under Gaussian noise is also the solution of a regularization-based problem (the converse is not true). In addition to this result, we further characterize properties of the penalty function \u03c6A,\u03a3,PZ (z) in the case where A is invertible, by showing that: a) it is convex if and only if the probability density function of the observation y, pY (y) (often called the evidence), is log-concave; b) when A = I, it is a separable sum \u03c6(z) = \u2211n i=1 \u03c6i(zi), where z = (z1, . . . , zn), if and only if the evidence is multiplicatively separable: pY (y) = \u220fn i=1 pYi(yi).\n\n1.4 Outline of the paper\n\nIn Section 2, we develop the main result of our paper, which we just introduced. To do so, we review an existing result from the literature and make explicit the different steps that make it possible to generalize it to the linear inverse problem setting. In Section 3, we provide further results that shed some light on the connections between MMSE and regularization-oriented estimators. Namely, we introduce some commonly desired properties of the regularizing function, such as separability and convexity, and show how they relate to the priors in the Bayesian framework. 
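In the purely Gaussian scalar case, the theorem stated above can be verified by hand (a sanity check of ours, not an example from the paper): for a prior z ~ N(0, sigma2) and noise b ~ N(0, 1), the MMSE estimator is the linear shrinkage sigma2*y/(sigma2 + 1), and it is exactly the minimizer of 1/2 (y - z)^2 + phi(z) with the quadratic penalty phi(z) = z^2/(2*sigma2).

```python
import numpy as np

sigma2 = 2.0      # prior variance: z ~ N(0, sigma2)
y_obs = 1.3       # an arbitrary observation y = z + b, with b ~ N(0, 1)

# posterior mean E(Z | Y = y) computed by numerical integration on a grid
z = np.linspace(-10.0, 10.0, 200001)
posterior = np.exp(-z**2 / (2 * sigma2)) * np.exp(-(y_obs - z)**2 / 2)
mmse = np.sum(z * posterior) / np.sum(posterior)

# closed form: linear shrinkage sigma2/(sigma2 + 1) * y
assert np.isclose(mmse, sigma2 / (sigma2 + 1) * y_obs, atol=1e-6)

# the same point minimizes the penalized objective 0.5*(y - z)^2 + z^2/(2*sigma2)
objective = 0.5 * (y_obs - z)**2 + z**2 / (2 * sigma2)
assert abs(z[np.argmin(objective)] - mmse) < 1e-3
```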
Finally, in Section 4, we conclude and offer some perspectives for extending the present work.\n\n2 Main steps to the main result\n\nWe begin by highlighting some intermediate results that build into steps towards our main theorem.\n\n2.1 An existing result for white Gaussian denoising\n\nAs a starting point, let us recall the existing results in [8] (Lemma II.1 and Theorem II.2) dealing with the additive white Gaussian denoising problem, A = I, \u03a3 = I.\nTheorem 1 (Reformulation of the main results of [8]). For any non-degenerate prior PZ, we have:\n\n1. \u03c8I,I,PZ is one-to-one;\n2. \u03c8I,I,PZ and its inverse are C\u221e;\n3. \u2200y \u2208 Rn, \u03c8I,I,PZ (y) is the unique global minimum and unique stationary point of z \u21a6 1/2 \u2016y \u2212 Iz\u2016\u00b2 + \u03c6(z), with:\n\n\u03c6(z) = \u03c6I,I,PZ (z) := \u22121/2 \u2016\u03c8\u22121 I,I,PZ (z) \u2212 z\u2016\u00b2 \u2212 log pY [\u03c8\u22121 I,I,PZ (z)], for z \u2208 Im \u03c8I,I,PZ ; +\u221e, for z \u2209 Im \u03c8I,I,PZ ; (4)\n\n4. The penalty function \u03c6I,I,PZ is C\u221e;\n5. Any penalty function \u03c6(z) such that \u03c8I,I,PZ (y) is a stationary point of (4) satisfies \u03c6(z) = \u03c6I,I,PZ (z) + C for some constant C and all z.\n\n2We only need to assume that Z does not intrinsically live almost surely in a lower dimensional hyperplane. The results easily generalize to this degenerate situation by considering appropriate projections of y and z. Similar remarks are in order for the non-degeneracy assumptions on \u03a3 and A.\n\n2.2 Non-white noise\n\nSuppose now that B \u2208 Rn is a centred non-degenerate Gaussian variable with a (positive definite) covariance matrix \u03a3. 
Using a standard noise whitening technique, \u03a3\u22121/2B \u223c N (0, I). This makes our denoising problem equivalent to y\u03a3 = z\u03a3 + b\u03a3, with y\u03a3 := \u03a3\u22121/2y, z\u03a3 := \u03a3\u22121/2z and b\u03a3 := \u03a3\u22121/2b, which is drawn from a Gaussian distribution with an identity covariance matrix. Finally, let \u2016.\u2016\u03a3 be the norm induced by the scalar product \u27e8x, y\u27e9\u03a3 := \u27e8x, \u03a3\u22121y\u27e9.\nCorollary 1 (non-white Gaussian noise). For any non-degenerate prior PZ, any non-degenerate \u03a3, Y = Z + B, we have:\n\n1. \u03c8I,\u03a3,PZ is one-to-one.\n2. \u03c8I,\u03a3,PZ and its inverse are C\u221e.\n3. \u2200y \u2208 Rn, \u03c8I,\u03a3,PZ (y) is the unique global minimum and stationary point of\n\nz \u21a6 1/2 \u2016y \u2212 Iz\u2016\u00b2\u03a3 + \u03c6I,\u03a3,PZ (z), with \u03c6I,\u03a3,PZ (z) := \u03c6I,I,P\u03a3\u22121/2Z (\u03a3\u22121/2z).\n\n4. \u03c6I,\u03a3,PZ is C\u221e.\n\nAs with white noise, up to an additive constant, \u03c6I,\u03a3,PZ is the only penalty with these properties.\n\nProof. First, we introduce a simple lemma that is pivotal throughout each step of this section.\nLemma 1. 
For any function f : Rn \u2192 R and any M \u2208 Rn\u00d7p, we have:\n\nM argmin_{v \u2208 Rp} f (M v) = argmin_{u \u2208 range(M) \u2286 Rn} f (u).\n\nNow, the linearity of the (conditional) expectation makes it possible to write\n\n\u03a3\u22121/2E(Z|Y = y) = E(\u03a3\u22121/2Z|\u03a3\u22121/2Y = \u03a3\u22121/2y)\n\u21d4 \u03a3\u22121/2\u03c8I,\u03a3,PZ (y) = \u03c8I,I,P\u03a3\u22121/2Z (\u03a3\u22121/2y).\n\nUsing Theorem 1, it follows that:\n\n\u03c8I,\u03a3,PZ (y) = \u03a31/2\u03c8I,I,P\u03a3\u22121/2Z (\u03a3\u22121/2y).\n\nFrom this property and Theorem 1, it is clear that \u03c8I,\u03a3,PZ is one-to-one and C\u221e, as well as its inverse. Now, using Lemma 1 with M = \u03a31/2, we get:\n\n\u03c8I,\u03a3,PZ (y) = \u03a31/2 argmin_{z\u2032 \u2208 Rn} { 1/2 \u2016\u03a3\u22121/2y \u2212 z\u2032\u2016\u00b2 + \u03c6I,I,P\u03a3\u22121/2Z (z\u2032) }\n= argmin_{z \u2208 Rn} { 1/2 \u2016\u03a3\u22121/2y \u2212 \u03a3\u22121/2z\u2016\u00b2 + \u03c6I,I,P\u03a3\u22121/2Z (\u03a3\u22121/2z) }\n= argmin_{z \u2208 Rn} { 1/2 \u2016y \u2212 z\u2016\u00b2\u03a3 + \u03c6I,\u03a3,PZ (z) },\n\nwith \u03c6I,\u03a3,PZ (z) := \u03c6I,I,P\u03a3\u22121/2Z (\u03a3\u22121/2z). This definition also makes it clear that \u03c6I,\u03a3,PZ is C\u221e, and that this minimizer is unique (and is the only stationary point).\n\n2.3 A simple under-determined problem\n\nAs a step towards handling the more generic linear inverse problem y = Az + b, we will investigate the particular case where A = [I 0]. For the sake of readability, for any two (column) vectors u, v, let us denote [u; v] the concatenated (column) vector. 
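The whitening argument of Corollary 1 can be sanity-checked numerically. In the toy example below (our own illustration; a quadratic penalty lam*||z||^2 is chosen so that both problems have closed-form minimizers), the minimizer of the Sigma-weighted objective coincides with that of the whitened, identity-covariance objective:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
S = rng.normal(size=(n, n))
Sigma = S @ S.T + n * np.eye(n)           # a non-degenerate covariance matrix
Sigma_inv = np.linalg.inv(Sigma)
y = rng.normal(size=n)
lam = 0.7                                 # illustrative quadratic penalty weight

# direct problem: minimize 0.5*<y - z, Sigma^{-1}(y - z)> + lam*||z||^2
z_direct = np.linalg.solve(Sigma_inv + 2 * lam * np.eye(n), Sigma_inv @ y)

# whitened problem: with W = Sigma^{-1/2}, minimize 0.5*||W y - W z||^2 + lam*||z||^2
w, V = np.linalg.eigh(Sigma_inv)
W = V @ np.diag(np.sqrt(w)) @ V.T         # symmetric square root of Sigma^{-1}
z_white = np.linalg.solve(W.T @ W + 2 * lam * np.eye(n), W.T @ (W @ y))

assert np.allclose(z_direct, z_white)     # same minimizer, as Corollary 1 predicts
```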
First and foremost, let us decompose the MMSE estimator into two parts, composed of the first n and last (D \u2212 n) components:\n\n\u03c8[I 0],\u03a3,PZ (y) := [\u03c81(y); \u03c82(y)].\n\nCorollary 2 (simple under-determined problem). For any non-degenerate prior PZ, any non-degenerate \u03a3, we have:\n\n1. \u03c81(y) = \u03c8I,\u03a3,PZ (y) is one-to-one and C\u221e. Its inverse and \u03c6I,\u03a3,PZ are also C\u221e;\n2. \u03c82(y) = (pB \u22c6 g)(y)/(pB \u22c6 pZ1)(y) (with g(z1) := E(Z2|Z1 = z1)pZ1(z1)) is C\u221e;\n3. \u03c8[I 0],\u03a3,PZ is injective.\n\nMoreover, let h : R(D\u2212n) \u00d7 R(D\u2212n) \u2192 R+ be any function such that h(x1, x2) = 0 \u21d2 x1 = x2. Then:\n\n4. \u2200y \u2208 Rn, \u03c8[I 0],\u03a3,PZ (y) is the unique global minimum and stationary point of\n\nz \u21a6 1/2 \u2016y \u2212 [I 0]z\u2016\u00b2\u03a3 + \u03c6h [I 0],\u03a3,PZ (z), with \u03c6h [I 0],\u03a3,PZ (z) := \u03c6I,\u03a3,PZ (z1) + h(z2, \u03c82 \u25e6 \u03c8\u22121 1 (z1)) and z = [z1; z2];\n\n5. \u03c6[I 0],\u03a3,PZ is C\u221e if and only if h is.\n\nProof. The expression of \u03c82(y) is obtained by Bayes rule in the integral defining the conditional expectation. The smoothing effect of the convolution with the Gaussian pB(b) implies the C\u221e nature of \u03c82. Let Z1 = [I 0]Z. Using again the linearity of the expectation, we have:\n\n[I 0]\u03c8[I 0],\u03a3,PZ (y) = E([I 0]Z|Y = y) = E(Z1|Y = y) = \u03c8I,\u03a3,PZ (y).\n\nHence, \u03c81(y) = \u03c8I,\u03a3,PZ (y). 
Given the properties of h, we have:\n\n\u03c82(y) = argmin_{z2 \u2208 R(D\u2212n)} h(z2, \u03c82 \u25e6 \u03c8\u22121 1 (\u03c81(y))).\n\nIt follows that:\n\n\u03c8[I 0],\u03a3,PZ (y) = argmin_{z=[z1;z2] \u2208 RD} 1/2 \u2016y \u2212 z1\u2016\u00b2\u03a3 + \u03c6I,\u03a3,PZ (z1) + h(z2, \u03c82 \u25e6 \u03c8\u22121 1 (z1)).\n\nFrom the definitions of \u03c8[I 0],\u03a3,PZ and h, it is clear, using Corollary 1, that \u03c8[I 0],\u03a3,PZ is injective, that its value is the unique minimizer and stationary point, and that \u03c6[I 0],\u03a3,PZ is C\u221e if and only if h is.\n\n2.4 Inverse Problem\n\nWe are now equipped to generalize our result to an arbitrary full rank matrix A. Using the Singular Value Decomposition, A can be factored as:\n\nA = U [\u2206 0]V \u22a4 = \u02dcU [I 0]V \u22a4, with \u02dcU = U \u2206.\n\nOur problem is now equivalent to y\u2032 := \u02dcU\u22121y = [I 0]V \u22a4z + \u02dcU\u22121b =: z\u2032 + b\u2032. Let \u02dc\u03a3 = \u02dcU\u22121\u03a3 \u02dcU\u2212\u22a4. Note that B\u2032 \u223c N (0, \u02dc\u03a3).\nTheorem 2 (Main result). Let h : R(D\u2212n) \u00d7 R(D\u2212n) \u2192 R+ be any function such that h(x1, x2) = 0 \u21d2 x1 = x2. For any non-degenerate prior PZ, any non-degenerate \u03a3 and A, we have:\n\n1. \u03c8A,\u03a3,PZ is injective.\n\n2. \u2200y \u2208 Rn, \u03c8A,\u03a3,PZ (y) is the unique global minimum and stationary point of z \u21a6 1/2 \u2016y \u2212 Az\u2016\u00b2\u03a3 + \u03c6h A,\u03a3,PZ (z), with \u03c6h A,\u03a3,PZ (z) := \u03c6h [I 0],\u02dc\u03a3,P V \u22a4Z (V \u22a4z).\n\n3. \u03c6A,\u03a3,PZ is C\u221e if and only if h is.\n\nProof. 
First note that:\n\nV \u22a4\u03c8A,\u03a3,PZ (y) = V \u22a4E(Z|Y = y) = E(Z\u2032|Y \u2032 = y\u2032) = \u03c8[I 0],\u02dc\u03a3,PZ\u2032 (y\u2032)\n= argmin_{z\u2032} 1/2 \u2016\u02dcU\u22121y \u2212 [I 0]z\u2032\u2016\u00b2 \u02dc\u03a3 + \u03c6h [I 0],\u02dc\u03a3,P V \u22a4Z (z\u2032),\n\nusing Corollary 2. Now, using Lemma 1, we have:\n\n\u03c8A,\u03a3,PZ (y) = argmin_z 1/2 \u2016\u02dcU\u22121(y \u2212 \u02dcU [I 0]V \u22a4z)\u2016\u00b2 \u02dc\u03a3 + \u03c6h [I 0],\u02dc\u03a3,P V \u22a4Z (V \u22a4z)\n= argmin_z 1/2 \u2016y \u2212 Az\u2016\u00b2\u03a3 + \u03c6h [I 0],\u02dc\u03a3,P V \u22a4Z (V \u22a4z).\n\nThe other properties come naturally from those of Corollary 2.\nRemark 1. If A is invertible (hence square), \u03c8A,\u03a3,PZ is one-to-one. It is also C\u221e, as well as its inverse and \u03c6A,\u03a3,PZ .\n\n3 Connections between the MMSE and regularization-based estimators\n\nEquipped with the results from the previous sections, we can now have a clearer idea about how MMSE estimators and those produced by a regularization-based approach relate to each other. This is the object of the present section.\n\n3.1 Obvious connections\n\nSome simple observations about the main theorem can already shed some light on those connections. First, for any prior, and as long as A is invertible, we have shown that there exists a corresponding regularizing term (which is unique up to an additive constant). This simply means that the set of MMSE estimators in linear inverse problems with Gaussian noise is a subset of the set of estimators that can be produced by a regularization approach with a quadratic data-fitting term.\nSecond, since the corresponding penalty is necessarily smooth, it is in fact only a strict subset of such regularization estimators. In other words, for some regularizers, there cannot be any interpretation in terms of an MMSE estimator. 
For instance, as pinpointed by [8], all the non-C\u221e regularizers belong to that category. Among them, all the sparsity-inducing regularizers (the \u2113\u2081 norm, among others) fall within this scope. This means that when it comes to solving a linear inverse problem (with an invertible A) under Gaussian noise, sparsity-inducing penalties are necessarily suboptimal (in a mean squared error sense).\n\n3.2 Relating desired computational properties to the evidence\n\nLet us now focus on the MMSE estimators (which can also be written as regularization-based estimators). As reported in the introduction, one of the reasons explaining the success of the optimization-based approaches is that one can have better control on the computational efficiency of the algorithms via some appealing properties of the functional to minimize. An interesting question then is: can we relate these properties of the regularizer to the Bayesian priors, when interpreting the solution as an MMSE estimate?\nFor instance, when the regularizer is separable, one may easily rely on coordinate descent algorithms [9]. Here is a more formal definition:\nDefinition 1 (Separability). A vector-valued function f : Rn \u2192 Rn is separable if there exists a set of functions f1, . . . , fn : R \u2192 R such that \u2200x \u2208 Rn, f (x) = (fi(xi))n i=1. A scalar-valued function g : Rn \u2192 R is additively separable (resp. multiplicatively separable) if there exists a set of functions g1, . . . , gn : R \u2192 R such that \u2200x \u2208 Rn, g(x) = \u2211n i=1 gi(xi) (resp. g(x) = \u220fn i=1 gi(xi)).\n\nEspecially when working with high-dimensional data, coordinate descent algorithms have proven to be very efficient and have been extensively used for machine learning [10, 11].\nEven more evidently, when solving optimization problems, dealing with convex functions ensures that many algorithms will provably converge to the global minimizer [6]. 
As a consequence, it would be interesting to be able to characterize the set of priors for which the MMSE estimate can be expressed as a minimizer of a convex function.\nThe following lemma precisely addresses these issues. For the sake of simplicity and readability, we focus on the specific case where A = I and \u03a3 = I.\nLemma 2 (Convexity and Separability). For any non-degenerate prior PZ, Theorem 1 says that \u2200y \u2208 Rn, \u03c8I,I,PZ (y) is the unique global minimum and stationary point of z \u21a6 1/2 \u2016y \u2212 Iz\u2016\u00b2 + \u03c6I,I,PZ (z). Moreover, the following results hold:\n\n1. \u03c6I,I,PZ is convex if and only if pY (y) := pB \u22c6 PZ(y) is log-concave;\n2. \u03c6I,I,PZ is additively separable if and only if pY (y) is multiplicatively separable.\n\nProof of Lemma 2. From Lemma II.1 in [8], the Jacobian matrix J[\u03c8I,I,PZ ](y) is positive definite, hence invertible. Differentiating \u03c6I,I,PZ [\u03c8I,I,PZ (y)] from its definition in Theorem 1, we get:\n\nJ[\u03c8I,I,PZ ](y)\u2207\u03c6I,I,PZ [\u03c8I,I,PZ (y)] = \u2207[ \u22121/2 \u2016y \u2212 \u03c8I,I,PZ (y)\u2016\u00b2 \u2212 log pY (y) ]\n= \u2212 (In \u2212 J[\u03c8I,I,PZ ](y)) (y \u2212 \u03c8I,I,PZ (y)) \u2212 \u2207 log pY (y)\n= (In \u2212 J[\u03c8I,I,PZ ](y))\u2207 log pY (y) \u2212 \u2207 log pY (y)\n= \u2212J[\u03c8I,I,PZ ](y)\u2207 log pY (y).\n\nHence:\n\n\u2207\u03c6I,I,PZ [\u03c8I,I,PZ (y)] = \u2212\u2207 log pY (y).\n\nDifferentiating this expression once more, we get:\n\nJ[\u03c8I,I,PZ ](y)\u22072\u03c6I,I,PZ [\u03c8I,I,PZ (y)] = \u2212\u22072 log pY (y).\n\nThen:\n\n\u22072\u03c6I,I,PZ [\u03c8I,I,PZ (y)] = \u2212J\u22121[\u03c8I,I,PZ ](y)\u22072 log pY (y).\n\nHence:\n\n\u03c6I,I,PZ convex \u21d4 \u22072\u03c6I,I,PZ [\u03c8I,I,PZ (y)] \u2ab0 0 \u21d4 \u2212J\u22121[\u03c8I,I,PZ ](y)\u22072 log pY (y) \u2ab0 0.\n\nAs \u03c8I,I,PZ is one-to-one, \u03c6I,I,PZ is convex if and only if \u03c6I,I,PZ [\u03c8I,I,PZ ] is. 
It also follows that, as J[\u03c8I,I,PZ ](y) = In + \u22072 log pY (y), the matrices \u22072 log pY (y), J[\u03c8I,I,PZ ](y), and J\u22121[\u03c8I,I,PZ ](y) are simultaneously diagonalisable. In particular, the matrices J\u22121[\u03c8I,I,PZ ](y) and \u22072 log pY (y) commute. Now, as J[\u03c8I,I,PZ ](y) is positive definite, we have:\n\n\u2212J\u22121[\u03c8I,I,PZ ](y)\u22072 log pY (y) \u2ab0 0 \u21d4 \u22072 log pY (y) \u2aaf 0.\n\nIt follows that \u03c6I,I,PZ is convex if and only if pY (y) := pB \u22c6 PZ(y) is log-concave.\nBy its definition (II.3) in [8], it is clear that:\n\n\u03c6I,I,PZ is additively separable \u21d4 \u03c8I,I,PZ is separable.\n\nUsing now equation (II.2) in [8], we have:\n\n\u03c8I,I,PZ is separable \u21d4 \u2207 log pY is separable \u21d4 log pY is additively separable \u21d4 pY is multiplicatively separable.\n\nRemark 2. This lemma focuses on the specific case where A = I and the noise is white Gaussian. However, considering the transformations induced by a shift to an arbitrary invertible matrix A and to an arbitrary non-degenerate covariance matrix \u03a3, which are depicted throughout Section 2, it is easy to see that the result on convexity carries over. An analogous (but more complicated) result could also be derived regarding separability. We leave that part to the interested reader.\n\nThese results provide a precise characterization of conditions on the Bayesian priors so that the MMSE estimator can be expressed as the minimizer of a convex or separable function. Interestingly, those conditions are expressed in terms of the probability density function (pdf, in short) of the observations pY , which is sometimes referred to as the evidence. The fact that the evidence plays a key role in Bayesian estimation has been observed in many contexts, see for example [12]. Given that we assume that the noise is Gaussian, its pdf pB is always log-concave. 
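Condition 1 of Lemma 2 can be explored numerically. In the sketch below (our own illustration; the Gaussian-mixture prior and the finite-difference check are arbitrary choices), a mixture prior with small separation yields a log-concave evidence (hence, by Lemma 2, a convex penalty), while a widely separated mixture yields an evidence that is not log-concave:

```python
import numpy as np

def evidence(y, a, tau=1.0):
    """p_Y = p_B * p_Z for z ~ 0.5*N(-a, tau) + 0.5*N(a, tau), b ~ N(0, 1)."""
    s2 = tau + 1.0   # convolving two Gaussians adds their variances
    return (np.exp(-(y - a)**2 / (2 * s2)) + np.exp(-(y + a)**2 / (2 * s2))) \
           * 0.5 / np.sqrt(2 * np.pi * s2)

y = np.linspace(-12.0, 12.0, 4001)
h = y[1] - y[0]

def d2_log_evidence(a):
    """Finite-difference second derivative of log p_Y on the grid."""
    return np.diff(np.log(evidence(y, a)), 2) / h**2

# small separation: evidence log-concave, so the associated penalty is convex
assert np.all(d2_log_evidence(0.5) < 1e-6)
# large separation: evidence not log-concave, so no convex interpretation exists
assert np.any(d2_log_evidence(4.0) > 1e-4)
```

Note that the prior itself is log-concave in neither case, which illustrates that log-concavity of the prior is sufficient but not necessary for a convex penalty.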
Thanks to a simple property of the convolution of log-concave functions, it is sufficient that the prior on the signal pZ is log-concave to ensure that pY also is. However, it is not a necessary condition. This means that there are some priors pZ that are not log-concave such that the associated MMSE estimator can still be expressed as the minimizer of a functional with a convex regularizer. For a more detailed analysis of this last point, in the specific context of Bernoulli-Gaussian priors (which are not log-concave), please refer to the technical report [13].\nFrom this result, one may also draw an interesting negative result. If the density pY of the observation y is not log-concave, then the MMSE estimate cannot be expressed as the solution of a convex regularization-oriented formulation. This means that, with a quadratic data-fitting term, a convex approach to signal estimation cannot be optimal (in a mean squared error sense).\n\n4 Prospects\n\nIn this paper we have extended a result stating that, in the context of linear inverse problems with Gaussian noise, for any Bayesian prior, there exists a regularizer \u03c6 such that the MMSE estimator can be expressed as the solution of the regularized regression problem (2). This result is a generalization of a result in [8]. However, we think it could be extended in many respects. For instance, our proof of the result naturally builds on elementary bricks that combine in a way that is imposed by the definition of the linear inverse problem. However, by developing more bricks and combining them in different ways, it may be possible to derive analogous results for a variety of other problems.\nMoreover, in the situation where A is not invertible (i.e. the problem is under-determined), there is a large set of admissible regularizers (i.e. up to the choice of a function h in Corollary 2). 
This additional degree of freedom might be leveraged in order to provide some additional desirable properties, from an optimization perspective for instance.\nAlso, our result relies heavily on the choice of a quadratic loss for the data-fitting term and on a Gaussian model for the noise. Naturally, investigating other choices (e.g. logistic or hinge loss, Poisson noise, to name a few) is a question of interest. But a careful study of the proofs in [8] suggests that there is a peculiar connection between the Gaussian noise model on the one hand and the quadratic loss on the other hand. However, further investigations should be conducted to get a deeper understanding of how these really interact at a higher level.\nFinally, we have stated a number of negative results regarding the non-optimality of sparse decoders or of convex formulations for handling observations drawn from a distribution that is not log-concave. It would be interesting to develop a metric in the space of estimators in order to quantify, for instance, how \u201cfar\u201d one arbitrary estimator is from an optimal one, or, in other words, what is the intrinsic cost of convex relaxations.\n\nAcknowledgements\n\nThis work was supported in part by the European Research Council, PLEASE project (ERC-StG-2011-277906).\n\nReferences\n[1] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69\u201382, 1970.\n[2] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n[3] Matthieu Kowalski. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303\u2013324, 2009.\n[4] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. 
Foundations and Trends in Machine Learning, 4(1):1\u2013106, 2012.\n[5] Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Active set algorithm for structured sparsity-inducing norms. In OPT 2009: 2nd NIPS Workshop on Optimization for Machine Learning, 2009.\n[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[7] R\u00e9mi Gribonval, Volkan Cevher, and Mike E. Davies. Compressible distributions for high-dimensional statistics. IEEE Transactions on Information Theory, 2012.\n[8] R\u00e9mi Gribonval. Should penalized least squares regression be interpreted as maximum a posteriori estimation? IEEE Transactions on Signal Processing, 59(5):2405\u20132410, 2011.\n[9] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Papers, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2010.\n[10] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408\u2013415, 2008.\n[11] Pierre Machart, Thomas Peel, Liva Ralaivola, Sandrine Anthoine, and Herv\u00e9 Glotin. Stochastic low-rank kernel learning for regression. In 28th International Conference on Machine Learning, 2011.\n[12] Martin Raphan and Eero P. Simoncelli. Learning to be Bayesian without supervision. In Advances in Neural Information Processing Systems (NIPS), MIT Press, 2007.\n[13] R\u00e9mi Gribonval and Pierre Machart. Reconciling \u201cpriors\u201d & \u201cpriors\u201d without prejudice? Research report RR-8366, INRIA, September 2013.\n", "award": [], "sourceid": 1070, "authors": [{"given_name": "Remi", "family_name": "Gribonval", "institution": "INRIA"}, {"given_name": "Pierre", "family_name": "Machart", "institution": "INRIA"}]}