{"title": "Learning with Compressible Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 261, "page_last": 269, "abstract": "We describe probability distributions, dubbed compressible priors, whose independent and identically distributed (iid) realizations result in compressible signals. A signal is compressible when sorted magnitudes of its coefficients exhibit a power-law decay so that the signal can be well-approximated by a sparse signal. Since compressible signals live close to sparse signals, their intrinsic information can be stably embedded via simple non-adaptive linear projections into a much lower dimensional space whose dimension grows logarithmically with the ambient signal dimension. By using order statistics, we show that N-sample iid realizations of generalized Pareto, Student\u2019s t, log-normal, Frechet, and log-logistic distributions are compressible, i.e., they have a constant expected decay rate, which is independent of N. In contrast, we show that generalized Gaussian distribution with shape parameter q is compressible only in restricted cases since the expected decay rate of its N-sample iid realizations decreases with N as 1/[q log(N/q)]. We use compressible priors as a scaffold to build new iterative sparse signal recovery algorithms based on Bayesian inference arguments. We show how tuning of these algorithms explicitly depends on the parameters of the compressible prior of the signal, and how to learn the parameters of the signal\u2019s compressible prior on the fly during recovery.", "full_text": "Learning with Compressible Priors\n\nVolkan Cevher\nRice University\n\nvolkan@rice.edu\n\nAbstract\n\nWe describe a set of probability distributions, dubbed compressible priors, whose\nindependent and identically distributed (iid) realizations result in p-compressible\nsignals. 
A signal x ∈ R^N is called p-compressible with magnitude R if its sorted coefficients exhibit a power-law decay as |x|_(i) ≲ R · i^{−d}, where the decay rate d is equal to 1/p. p-compressible signals live close to K-sparse signals (K ≪ N) in the ℓr-norm (r > p) since their best K-sparse approximation error decreases as O(R · K^{1/r − 1/p}). We show that the membership of the generalized Pareto, Student's t, log-normal, Fréchet, and log-logistic distributions to the set of compressible priors depends only on the distribution parameters and is independent of N. In contrast, we demonstrate that the membership of the generalized Gaussian distribution (GGD) depends both on the signal dimension and the GGD parameters: the expected decay rate of N-sample iid realizations from the GGD with shape parameter q is given by 1/[q log(N/q)]. As stylized examples, we show via experiments that the wavelet coefficients of natural images are 1.67-compressible whereas their pixel gradients are 0.95 log(N/0.95)-compressible, on average. We also leverage the connections between compressible priors and sparse signals to develop new iterative re-weighted sparse signal recovery algorithms that outperform standard ℓ1-norm minimization. 
Finally, we describe how to learn the hyperparameters of compressible priors in underdetermined regression problems by exploiting the geometry of their order statistics during signal recovery.

1 Introduction

Many problems in signal processing, machine learning, and communications can be cast as a linear regression problem where an unknown signal x ∈ R^N is related to its observations y ∈ R^M via

y = Φx + n.   (1)

In (1), the observation matrix Φ ∈ R^{M×N} is a non-adaptive measurement matrix with random entries in compressive sensing (CS), an over-complete dictionary of features in sparse Bayesian learning (SBL), or a code matrix in communications [1, 2]. The vector n ∈ R^M usually accounts for physical noise with a partially or fully known distribution, or it models bounded perturbations in the measurement matrix or the signal.

Because of its theoretical and practical interest, we focus on instances of (1) where there are more unknowns than equations, i.e., M < N. Hence, determining x from y in (1) is ill-posed: for every v ∈ kernel(Φ), x + v defines a solution that produces the same observations y. Prior information is therefore necessary to distinguish the true x among the infinitely many possible solutions. For instance, the CS and SBL frameworks assume that the signal x belongs to the set of sparse signals. By sparse, we mean that at most K of the N signal coefficients are nonzero, where K ≪ N. CS and SBL algorithms then regularize the solution space by signal priors that promote sparseness, and they have been extremely successful in practice in a number of applications even when M ≪ N [1–3]. Unfortunately, prior information by itself is not sufficient to recover x from noisy y. 
Two more key ingredients are required: (i) the observation matrix Φ must stably embed (or encode) the set of signals x into the space of y, and (ii) a tractable decoding algorithm must exist to map y back to x. By stable embedding, we mean that Φ is bi-Lipschitz, where the encoding x → Φx is one-to-one and the inverse mapping Δ = {Δ(Φx) → x} is smooth. The bi-Lipschitz property of Φ is crucial to ensure stability in decoding x by controlling the amount by which perturbations of the observations are amplified [1, 4]. Tractable decoding is important for practical reasons, as we have limited time and resources, and it can clearly restrict the class of usable signal priors.

In this paper, we describe compressible prior distributions whose independent and identically distributed (iid) realizations result in compressible signals. A signal is compressible when the sorted magnitudes of its coefficients exhibit a power-law decay. For certain decay rates, compressible signals live close to the sparse signals, i.e., they can be well-approximated by sparse signals. It is well known that the set of K-sparse signals has stable and tractable encoder-decoder pairs (Φ, Δ) for M as small as O(K log(N/K)) [1, 5]. Hence, an N-dimensional compressible signal with the proper decay rate inherits the encoder-decoder pairs of its K-sparse approximation for a given approximation error, and can be stably embedded into dimensions logarithmic in N.

Compressible priors analytically summarize the set of compressible signals and shed new light on underdetermined linear regression problems by building upon the literature on sparse signal recovery. 
Our main results are summarized as follows:

1) Using order statistics, we show that the compressibility of the iid realizations of the generalized Pareto, Student's t, Fréchet, and log-logistic distributions is independent of the signal dimension. These distributions are natural members of the set of compressible priors: they truly support logarithmic dimensionality reduction and have important parameter-learning guarantees from finite sample sizes. We demonstrate that probabilistic models for the wavelet coefficients of natural images must also be natural members of the set of compressible priors.

2) We point out a common misconception about the generalized Gaussian distribution (GGD): the GGD generates signals that lose their compressibility as N grows. For instance, special cases of the GGD, e.g., the Laplacian distribution, are commonly used as sparsity-promoting priors in CS and SBL problems where M is assumed to grow logarithmically with N [1–3, 6]. We show that signals generated from the Laplacian distribution can only be stably embedded into lower dimensions that grow proportionally to N. Hence, we identify an inconsistency between the decoding algorithms motivated by the GGD and their sparse solutions.

3) We use compressible priors as a scaffold to build new decoding algorithms based on Bayesian inference arguments. The objective of these algorithms is to approximate the signal realization from a compressible prior, as opposed to pragmatically producing sparse solutions. Some of these new algorithms are variants of the popular iterative re-weighting schemes [3, 6–8]. We show how the tuning of these algorithms explicitly depends on the compressible prior parameters, and how to learn the parameters of the signal's compressible prior on the fly while recovering the signal.

The paper is organized as follows. Section 2 provides the necessary background on sparse signal recovery. 
Section 3 mathematically describes compressible signals and ties them to the order statistics of distributions to introduce compressible priors. Section 4 defines compressible priors, identifies common misconceptions about the GGD, and examines natural images as instances of compressible priors. Section 5 derives new decoding algorithms for underdetermined linear regression problems. Section 6 describes an algorithm for learning the parameters of compressible priors. Section 7 provides simulation results and is followed by our conclusions.

2 Background on Sparse Signals

Any signal x ∈ R^N can be represented in terms of N coefficients α ∈ R^N in a basis Ψ ∈ R^{N×N} via x = Ψα. The signal x has a sparse representation if only K ≪ N entries of α are nonzero. To account for sparse signals in an appropriate basis, (1) should be modified as y = Φx + n = ΦΨα + n. Let Σ_K denote the set of all K-sparse signals. When Φ in (1) satisfies the so-called restricted isometry property (RIP), it can be shown that ΦΨ defines a bi-Lipschitz embedding of Σ_K into R^M [1, 4, 5]. Moreover, the RIP implies the recovery of K-sparse signals to within a given error bound, and the best attainable lower bounds for M are related to the Gelfand width of Σ_K, which is logarithmic in the signal dimension, i.e., M = O(K log(N/K)) [5]. Without loss of generality, we restrict our attention in the sequel to canonically sparse signals and assume that Ψ = I (the N × N identity matrix) so that x = α.

With the sparsity prior and RIP assumptions, inverse maps can be obtained by solving the following convex problems:

Δ1(y) = arg min ‖x′‖1 s.t. y = Φx′,
Δ2(y) = arg min ‖x′‖1 s.t. ‖y − Φx′‖2 ≤ ε,
Δ3(y) = arg min ‖x′‖1 + τ‖y − Φx′‖2²,   (2)

where ε and τ are constants, and ‖x‖r ≜ (Σ_i |x_i|^r)^{1/r}. The decoders Δi (i = 1, 2) are known as basis pursuit (BP) and basis pursuit denoising (BPDN), respectively, and Δ3 is a scalarization of BPDN [1, 9]. They also have the following deterministic worst-case guarantee when Φ has the RIP:

‖x − Δ(y)‖2 ≤ C1 ‖x − xK‖1 / √K + C2 ‖n‖2,   (3)

where C1 and C2 are constants, xK is the best K-term approximation, i.e., xK = arg min_{‖x′‖0 ≤ K} ‖x − x′‖r for r ≥ 1, and ‖x‖0 is a pseudo-norm that counts the number of nonzeros of x [1, 4, 5]. Note that the error guarantee (3) is adaptive to each given signal x because of the definition of xK. Moreover, the guarantee does not assume that the signal is sparse.

3 Compressible Signals, Order Statistics and Quantile Approximations

We define a signal x as p-compressible if it lives close to the shell of the weak-ℓp ball of radius R (swℓp(R), pronounced "swell p"). Defining x̄_i = |x_i|, we arrange the signal coefficients x_i in decreasing order of magnitude as

x̄_(1) ≥ x̄_(2) ≥ . . . ≥ x̄_(N).   (4)

Then, when x ∈ swℓp(R), the i-th ordered entry x̄_(i) in (4) obeys

x̄_(i) ≲ R · i^{−1/p},   (5)

where ≲ means "less than or approximately equal to." We deliberately substitute ≲ for ≤ in the p-compressibility definition of [1] to reduce the ambiguity of multiple feasible R and p values. 
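To make the definition in (5) concrete, the following minimal sketch (with arbitrary illustrative choices of R, p, and N, not values from the paper) builds a signal whose sorted magnitudes follow the power law exactly and checks that its ℓ1 energy concentrates in the few largest entries:

```python
import numpy as np

# Illustrative sketch: a signal whose sorted magnitudes obey (5) with
# equality; its l1 tail (the best K-term approximation error) shrinks
# quickly. R, p, N below are arbitrary demo values.
R, p, N = 1.0, 0.5, 10_000
i = np.arange(1, N + 1)
x_sorted = R * i ** (-1.0 / p)     # already in decreasing order

for K in (10, 100, 1000):
    tail = x_sorted[K:].sum()      # ||x - x_K||_1 for this signal
    print(K, tail / x_sorted.sum())
```

For p = 0.5 the tail behaves like 1/K, so keeping even 1% of the coefficients leaves only a small fraction of the ℓ1 energy unexplained.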
In Section 6, we describe a geometric approach to learn R and p so that R · i^{−1/p} ≈ x̄_(i). Signals in swℓp(R) can be well-approximated by sparse signals, as the best K-term approximation error decays rapidly to zero:

‖x − xK‖r ≲ (r/p − 1)^{−1/r} R K^{1/r − 1/p}, when p < r.   (6)

Given M, a good rule of thumb is to set K = M/[C log(N/M)] (C ≈ 4 or 5) and use (6) to predict the approximation error for the decoders Δi in Section 2. Since the decoding guarantees are bounded by the best K-term approximation error in ℓ1 (i.e., r = 1; cf. (3)), we restrict our attention to x ∈ swℓp with p < 1. Including p = 1 adds a logarithmic factor to the approximation errors, which is not severe; however, it is not considered in this paper to avoid a messy discussion.

Suppose now that the individual entries x_i of the signal x are random variables (RVs) drawn iid from a probability density function (pdf) f(x), i.e., x_i ∼ f(x) for i = 1, . . . , N. Then the x̄_(i)'s in (4) are also RVs and are known as the order statistics (OS) of yet another pdf f̄(x̄), which is related to f(x) in a straightforward manner: f̄(x̄) = f(x̄) + f(−x̄). Note that even though the RVs x_i (hence, x̄_i) are iid, the RVs x̄_(i) are statistically dependent.

The concept of OS enables us to create a link between signals summarized by pdfs and their compressibility, which is a deterministic property once the signals are realized. The key to establishing this link turns out to be the parameterized form of the quantile function of the pdf f̄(x̄). Let F̄(x̄) = ∫_0^x̄ f̄(v) dv be the cumulative distribution function (CDF) and u = F̄(x̄). The quantile function F̄*(u) of f̄(x̄) is then given by the inverse of its CDF: F̄*(u) = F̄^{−1}(u). 
We will refer to F̄*(u) as the magnitude quantile function (MQF) of f(x).

A well-known quantile approximation to the expected OS of a pdf is given by [10]:

E[x̄_(i)] = F̄*(1 − i/(N + 1)),   (7)

where E[·] denotes expectation. Moreover, we have the following moment-matching approximation:

x̄_(i) ∼ N( E[x̄_(i)], (i/N)(1 − i/N) / ( N [f(E[x̄_(i)])]² ) ),   (8)

which can be used to quantify how much the actual realizations x̄_(i) deviate from E[x̄_(i)]. For instance, these deviations for i > K can be used to bound the statistical variations of the best K-term approximation error. In practice, the deviations are relatively small for compressible priors. In Sections 4–6, we will use the quantile approximation in (7) as our basis to motivate the set of compressible priors, derive recovery algorithms for x, and learn the parameters of compressible priors during recovery.

Table 1: Example distributions and the swℓp(R) parameters of their iid realizations

Distribution | pdf | R | p
Generalized Pareto | (q/(2λ)) (1 + |x|/λ)^{−(q+1)} | λ N^{1/q} | q
Student's t | [Γ((q+1)/2) / (√(2π) λ Γ(q/2))] (1 + x²/λ²)^{−(q+1)/2} | λ N^{1/q} [2Γ((q+1)/2) / (√(πq) Γ(q/2))]^{1/q} | q
Fréchet | (q/λ) (x/λ)^{−(q+1)} e^{−(x/λ)^{−q}} | λ N^{1/q} | q
Log-logistic | (q/λ) (x/λ)^{q−1} / [1 + (x/λ)^q]² | λ N^{1/q} | q
Generalized Gaussian | (q/(2λΓ(1/q))) e^{−(|x|/λ)^q} | λ max{1, Γ(1 + 1/q)} log^{1/q}(N/q) | q log(N/q)
Weibull | (q/λ) (x/λ)^{q−1} e^{−(x/λ)^q} | λ log^{1/q} N | q log N
Gamma | (1/(λΓ(q))) (x/λ)^{q−1} e^{−x/λ} | λ max{1, Γ(1 + 1/q) q} log(qN) | log(qN)
Log-normal | (q/(√(2π) x)) e^{−(q log(x/λ))²/2} | λ e^{√(2 log N)/q} | √(2 log N) q

4 Compressible Priors

A compressible prior f(x; θ) in ℓr is a pdf with parameters θ whose MQF satisfies

F̄*(1 − i/(N + 1)) ≲ R(N, θ) · i^{−1/p(N,θ)}, where R > 0 and p < r.   (9)

Table 1 lists example pdfs, parameterized by θ = (q, λ) ≻ 0, and the swℓp(R) parameters of their N-sample iid realizations. In this paper, we fix r = 1 (cf. Section 3); hence, the example pdfs are compressible priors whenever p < 1. In (9), we make it explicit that the swℓp(R) parameters can depend on the parameters θ of the specific compressible prior as well as the signal dimension N. The dependence of the parameter p on N is of particular interest since it has important implications in signal recovery as well as parameter learning from finite sample sizes, as discussed below.

We define natural p-compressible priors as the set N_p of compressible priors such that p = p(θ) < 1 is independent of N for all f(x; θ) ∈ N_p. It is possible to prove that we can capture most of the ℓ1-energy in an N-sample iid realization from a natural p-compressible prior by using a constant K, i.e., ‖x − xK‖1 ≤ ε‖x‖1 for any desired 0 < ε ≪ 1, by choosing K = ⌈(p/ε)^{p/(1−p)}⌉. Hence, N-sample iid signal realizations from the compressible priors in N_p can be truly embedded into dimensions M that grow logarithmically with N, with tractable decoding guarantees due to (3). 
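As an illustration of this dimension independence, the sketch below works with the generalized Pareto entry of Table 1. The magnitude quantile function used here, F̄*(u) = λ((1 − u)^{−1/q} − 1), is our own derivation from the GPD magnitude CDF 1 − (1 + x̄/λ)^{−q} and should be read as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
q, lam, N = 1.0, 1.0, 100_000

def mqf(u):
    # Assumed magnitude quantile function of the generalized Pareto
    # prior, inverted from the magnitude CDF 1 - (1 + x/lam)^(-q).
    return lam * ((1.0 - u) ** (-1.0 / q) - 1.0)

i = np.arange(1, N + 1)
expected_os = mqf(1.0 - i / (N + 1.0))    # E[x_(i)] via the quantile rule (7)
empirical_os = np.sort(mqf(rng.uniform(size=N)))[::-1]  # inverse-transform draws

# Log-log slope over a mid-range of ordered indices; it should sit
# near -1/q for both curves, independently of N.
slope = np.polyfit(np.log(i[9:1000]), np.log(expected_os[9:1000]), 1)[0]
print(slope)
```

The fitted slope stays near −1/q no matter how large N is chosen, which is exactly the dimension-independence property that characterizes membership in N_p.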
Members of N_p include the generalized Pareto (GPD), Fréchet (FD), and log-logistic (LLD) distributions.

It may come as a surprise that the generalized Gaussian distribution (GGD) is not a natural p-compressible prior, since its iid realizations lose their compressibility as N grows (cf. Table 1). While it is common practice to use a GGD prior with q ≤ 1 for sparse signal recovery, we have no recovery guarantees for signals generated from the GGD when M grows logarithmically with N in (1).¹ In fact, to be p-compressible, the shape parameter of a GGD prior should satisfy q = N e^{W_{−1}(−p/N)}, where W_{−1}(·) is the alternate branch of the Lambert W-function. As a result, GGD parameters learned from dimensionality-reduced data will in general depend on the dimension and may not generalize to other dimensions. Along with the GGD, Table 1 shows how the Weibull, gamma, and log-normal distributions are dimension-restricted in their membership to the set of compressible priors.

Wavelet coefficients of natural images provide a stylized example of why we should care about the dimensional independence of the parameter p.² As brief background, we first note that research in natural image modeling to date has followed two distinct approaches, one focusing on deterministic explanations and the other pursuing probabilistic models [12]. Deterministic approaches operate under the assumption that natural images belong to Besov spaces, having a bounded number of derivatives between edges. Unsurprisingly, wavelet thresholding is provably near-optimal for representing and denoising Besov space images. As the simplest example, the magnitude-sorted discrete wavelet coefficients w̄_(i) of a Besov q-image should satisfy w̄_(i) = R · i^{−1/q}. 
The probabilistic approaches, on the other hand, exploit the power-law decay of the power spectra of images and fit various pdfs, such as the GGD and Gaussian scale mixtures, to the histograms of wavelet coefficients, while trying to simultaneously capture the dependencies observed in the marginal and joint distributions of natural image wavelet coefficients. Probabilistic approaches are quite important in image compression because optimal compressors quantize the wavelet coefficients according to the estimated distributions, dictating the image compression limits via Shannon's coding theorem.

We conjecture that probabilistic models that summarize the wavelet coefficients of natural images belong to the set of natural (non-iid) p-compressible priors. We base our claim on two observations: 1) Due to the multiscale nature of the wavelet transform, the decay profile of the magnitude-sorted wavelet coefficients is scale-invariant, i.e., preserved at different resolutions, where lower resolutions inherit the highest resolution. 

¹To illustrate the issues with the compressibility of the GGD, consider the Laplacian distribution (LD: GGD with q = 1), which is the conventional convex prior for promoting sparsity. Via order statistics, it is possible to show that x̄_(i) ≈ λ log(N/i) for x_i ∼ GGD(1, λ). Without loss of generality, let us judiciously pick λ = 1/log N so that R = 1. Then, we have ‖x‖1 ≈ N − 1 and ‖x − xK‖1 ≈ N − K log(N/K) − K. When we only have K terms to capture (1 − ε) of the ℓ1 energy (ε ≪ 1) in the signal x, we need K ≈ (1 − √ε)N.

²Here, we assume that the reader is familiar with the discrete wavelet transform and its properties [11]. 
Hence, probabilistic models that explain the wavelet transform of any signal should exhibit this decay-profile inheritance property. 2) The magnitude-sorted wavelet coefficients of natural images exhibit a constant decay rate, as expected of Besov space images. Section 7.2 demonstrates these ideas using natural images from the Berkeley natural images database.

5 Signal Decoding Algorithms

Convex problems to recover sparse or compressible signals in (2) are usually motivated by Bayesian inference. In a similar fashion, we formalize two new decoding algorithms below by assuming prior distributions on the signal x and the noise n, and then asking inference questions given y in (1).

5.1 Fixed-point continuation for a non-iid compressible prior

The multivariate Lomax distribution (MLD) provides an elementary example of a non-iid compressible prior. Its pdf is given by MLD(x; q, λ) ∝ (1 + Σ_{i=1}^N λ_i^{−1} |x_i|)^{−q−N} [13]. For the MLD, the marginal distribution of the signal coefficients is GPD, i.e., x_i ∼ GPD(x; q, λ_i). Moreover, given n realizations x_{1:n} of the MLD (n ≤ N), the joint marginal distribution of x_{n+1:N} is MLD(x_{n+1:N}; q + n, λ_{n+1:N} (1 + Σ_{i=1}^n λ_i^{−1} |x_i|)^{−1}). In the sequel, we assume λ_i = λ for all i, for which it can be proved that the MLD is compressible with p = 1 [14]. For now, we will only demonstrate this property via simulations in Section 7.1. With the MLD prior on x, we focus on only two optimization problems below, one based on BP and the other based on maximum a posteriori (MAP) estimation. Other convex formulations, such as BPDN (Δ2 in (2)) and the LASSO [15], follow trivially.

1) BP decoder: When there is no noise, the observations are given by y = Φx, which has infinitely many solutions, as discussed in Section 1. In this case, we can exploit the MLD likelihood function to regularize the solution space. 
For instance, when we ask for the solution that maximizes the MLD likelihood given y, it is easy to see that we obtain the BP decoder formulation, i.e., Δ1(y) in (2).

2) MAP decoder: Suppose that the noise coefficients n_i in (1) are iid Gaussian with zero mean and variance σ², n_i ∼ N(n; 0, σ²). Although many inference questions are possible, here we seek the mode of the posterior distribution to obtain a point estimate, also known as the MAP estimate. Since we have f(y|x) = N(y − Φx; 0, σ² I_{M×M}) and f(x) = MLD(x; q, λ), the MAP estimate can be derived using the Bayes rule as x̂_MAP = arg max_{x′} f(y|x′) f(x′), which is explicitly given by

x̂_MAP = arg min_{x′} ‖y − Φx′‖2² + 2σ²(q + N) log(1 + λ^{−1} ‖x′‖1).   (10)

Unfortunately, we stumble upon a non-convex problem in (10) during our quest for the MAP estimate. We circumvent the non-convexity in (10) using a majorization-minimization idea, where we iteratively obtain a tractable upper bound on the log-term in (10) via the inequality log u ≤ log v + u/v − 1, which holds for all u, v ∈ (0, ∞). After some straightforward calculus, we obtain the iterative decoder below, indexed by k, where x̂_{k} is the k-th iteration estimate (x̂_{0} = 0):

x̂_{k} = arg min_{x′} ‖y − Φx′‖2² + ν_k ‖x′‖1, where ν_k = 2σ²(q + N) / (λ + ‖x̂_{k−1}‖1).   (11)

The decoding approach in (11) can be viewed as a continuation (or homotopy) algorithm where a fixed point is obtained at each iteration, similar to [16]. 
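A minimal sketch of the continuation decoder in (11) follows, using ISTA as the inner ℓ1 solver (an arbitrary choice on our part; any ℓ1-regularized least-squares solver would do) and placeholder parameter values rather than settings from the paper:

```python
import numpy as np

def ista(Phi, y, nu, n_iter=500):
    # Inner solver for min ||y - Phi x||_2^2 + nu ||x||_1 via iterative
    # soft thresholding; any l1 solver could be substituted here.
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz bound for the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        g = x - Phi.T @ (Phi @ x - y) / L    # gradient step of size 1/(2L)
        x = np.sign(g) * np.maximum(np.abs(g) - nu / (2 * L), 0.0)
    return x

def mld_map_decode(Phi, y, q=1.0, lam=1.0, sigma2=1e-3, n_outer=10):
    # Outer majorization-minimization loop of (11): each step solves a
    # convex l1 problem whose weight nu_k shrinks as ||x||_1 grows.
    N = Phi.shape[1]
    x = np.zeros(N)
    for _ in range(n_outer):
        nu = 2.0 * sigma2 * (q + N) / (lam + np.abs(x).sum())
        x = ista(Phi, y, nu)
    return x
```

With hypothetical sizes M = 50, N = 100 and a 5-sparse signal, the loop typically locks onto the sparse solution within a few outer iterations; as ‖x̂‖1 grows across iterations, ν_k decreases, which matches the continuation behavior described above.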
This decoding scheme has provable linear convergence guarantees when ‖x̂_{k}‖1 is strictly increasing (equivalently, ν_k is strictly decreasing) [16].

5.2 Iterative ℓs-decoding for iid scale mixtures of the GGD

We consider a generalization of the GPD and Student's t distribution, which we denote the generalized Gaussian gamma scale mixture distribution (SMD, in short), whose pdf is given by SMD(x; q, λ, s) ∝ (1 + |x|^s / λ^s)^{−(q+1)/s}. The additional parameter s of the SMD modulates its OS near the origin. It can be proved that the SMD is p-compressible with p = q [14]. The SMD, for instance, arises through the following interaction of the gamma distribution and the GGD: x = a^{−1/s} b, with a ∼ Gamma(a; q/s, λ^{−s}) and b ∼ GGD(b; s, 1). Given a, the distribution of x is a scaled GGD: f(x|a) ∼ GGD(x; s, a^{−1}). Marginalizing a out of f(x|a), we reach the SMD as the true underlying distribution of x. SMDs arise in multiple contexts, such as the SBL framework that exploits Student's t (i.e., s = 2) for learning problems [2], and the Laplacian and Gaussian scale mixtures (i.e., s = 1 and 2, respectively) that model natural images [17, 18].

Due to lack of space, we focus only on noiseless observations in (1). We assume that x is an N-sample iid realization from SMD(x; q, λ, s) with known parameters (q, λ, s) ≻ 0 and choose a solution x̂ that maximizes the SMD likelihood to find the true vector x among the solutions differing by the kernel of Φ:

x̂ = arg max_{x′} SMD(x′; q, λ, s) = arg min_{x′} Σ_i log(1 + λ^{−s} |x_i′|^s), s.t. y = Φx′.   (12)

The majorization-minimization trick in Section 5.1 also circumvents the non-convexity in (12):

x̂_{k} = arg min_{x′} Σ_i w_{i,{k}} |x_i′|^s, s.t. y = Φx′; where w_{i,{k}} = (λ^s + |x_{i,{k}}|^s)^{−1}.   (13)

The decoding scheme in (13) is well known as the family of iterative re-weighted ℓs algorithms [7, 19–21].

6 Parameter Learning for Compressible Distributions

While deriving the decoding algorithms in Section 5, we assumed that the signal coefficients x_i are generated from a compressible prior f(x; θ) and that θ is known. We now relax the latter assumption and discuss how to simultaneously estimate x and learn the parameters θ.

When we visualize the joint estimation of x and θ from y in (1) as a graphical model, we immediately realize that x creates a Markov blanket for θ. Hence, to determine θ, we have to estimate the signal coefficients. When Φ has the stable embedding property, we know that the decoding algorithms can obtain x with approximation guarantees, such as (3). Then, given x, we can choose an estimator for θ via standard Bayesian inference arguments. Unfortunately, this argument leads to one important road block: estimation of the signal x without knowing the prior parameters θ.

A naïve approach to overcoming this road block is to split the optimization space and alternate between x and θ while optimizing the Bayesian objective. Unfortunately, there is one important and unrecognized bug in this argument: the estimated signal values are in general not iid; hence, we would be minimizing the wrong Bayesian objective to determine θ. To see this, we first note that the recovered signals x̂ in general consist of M ≪ N nonzero coefficients that mimic the best K-term approximation xK of the signal plus some other coefficients that explain the small tail energy. We then recall from Section 3 that the coefficients of xK are statistically dependent. Hence, at least
Hence, at least\n\nrecovered signals(cid:98)x in general consist of M (cid:28) N non-zero coef\ufb01cients that mimic the best K-term\npartially, the signi\ufb01cant coef\ufb01cients of(cid:98)x are also dependent. One way to overcome this dependency\nin \ufb01tting the sw(cid:96)p(R) parameters via the auxiliary signal estimates(cid:98)x{k} during iterative recovery.\nlog(cid:12)(cid:12)(cid:98)xi,{k}(cid:12)(cid:12) = log R(N, \u03b8) \u2212\n\nissue is to treat the recovered signals as if they are drawn iid from a censored GPD. However, the\noptimization becomes complicated and the approach does not provide any additional guarantees.\nAs an alternative, we propose to exploit geometry and use the consensus among the coef\ufb01cients\n\nTo do this, we employ Fischler and Bolles\u2019 probabilistic random sampling consensus (RANSAC)\nalgorithm [22] to \ufb01t a line, whose y-intercept is log R(N, \u03b8) and whose slope is 1/p(N, \u03b8):\n\n1\n\np(N, \u03b8)\n\nlog i, for i = 1, . . . , K; where K = M/[C log(N/M )],\n\n(14)\nwhere C \u2248 4, 5 as discussed in Section. 3. RANSAC provides excellent results with high probability\neven if the data contains signi\ufb01cant outliers. Because of its probabilistic nature, it is computationally\nef\ufb01cient. The RANSAC algorithm requires a threshold to gate the observations and count how much\na proposed solution is supported by the observations [22]. We determine this threshold by bounding\nthe tail probability that the OS of a compressible prior will be out of bounds. For the pseudo-code\nand further details of the RANSAC algorithm, cf. [22].\n7 Experiments\n7.1 Order Statistics\nTo demonstrate the sw(cid:96)p(R) decay pro\ufb01le of p-compressible priors, we generated iid realizations\nof GGD with q = 1 (LD) and GPD with q = 1, and (non-iid) realizations of MLD with q = 1 of\nvarying signal dimensions N = 10j, where j = 2, 3, 4, 5. 
We sorted the magnitudes of the signal coefficients and normalized them by the corresponding value of R. We then plotted the results on a log-log scale in Fig. 1. At http://dsp.rice.edu/randcs, we provide a MATLAB routine (randcs.m) that makes it easy to repeat the same experiment for the rest of the distributions in Table 1.

(a) LD (iid)  (b) GPD (iid)  (c) MLD
Figure 1: Numerical illustration of the swℓp(R) decay profile of three different pdfs.

To live in swℓp(1) with 0 < p ≤ 1, the slope of the resulting curve must be less than or equal to −1. Figure 1(a) illustrates that the iid LD slope is much greater than −1 and, moreover, grows logarithmically with N. In contrast, Fig. 1(b) shows that the iid GPD with q = 1 exhibits a constant slope of −1 that is independent of N. The MLD with q = 1 also delivers such a slope (Fig. 1(c)). The latter two distributions thus produce compressible signal realizations, while the Laplacian does not.

7.2 Natural Images

We investigate the images from the Berkeley natural images database in the context of p-compressible priors. We randomly sample 100 image patches of varying sizes N = 2^j × 2^j (j = 3, . . . , 8), take their wavelet transforms (scaling filter: daub2), and plot the average of their magnitude-ordered wavelet coefficients in Figs. 2(a) and (b) (solid lines). Figure 2(c) also illustrates the OS of the pixel gradients, which are of particular interest in many applications.

Along with the wavelet coefficients, Fig. 2(a) superposes the expected OS of the GPD with q = 1.67 and λ = 10 (dashed line), given by x̄_(i){GPD(q, λ)} = λ[(N + 1)^{1/q} i^{−1/q} − 1] (i = 1, . . . , N). Although wavelet coefficients of natural images do not follow an iid distribution, they exhibit a constant decay rate, which can be well-approximated by an iid GPD distribution. 
This apparent constant decay rate is well explained by the decay profile inheritance of the wavelet transform across different resolutions and supports the Besov space assumption used in the deterministic approaches. The GPD rate of q = 1.67 implies a disappointing O(K^{−0.1}) approximation rate in the ℓ2-norm vs. the theoretical O(K^{−0.5}) rate [23]. Moreover, we lose all the guarantees in the ℓ1-norm.

Figure 2: Approximation of the order statistics and histograms of natural images with GPD and GGD: (a) wavelet coefficients, (b) wavelet coefficients, (c) pixel gradients.

In contrast, Fig. 2(b) demonstrates the GGD histogram fits to the wavelet coefficients, where the GGD exponent q ∈ [0.5, 1] depends on the particular dimension and decreases as N increases. Histogram matching is common practice in the existing probabilistic approaches (e.g., [18]) to determine pdfs that explain the statistics of natural images; typically, least-squares error metrics or Kullback-Leibler (KL) divergence measures are used. Although the GGD fit via histogram matching in Fig. 2(b) deceptively appears to fit only a small number of coefficients, we emphasize the log-log scale of the plots: there is a significant number of coefficients in the narrow range where the GGD distribution is a good fit.
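For reference, the O(K^{−0.1}) and O(K^{−0.5}) rates above follow from a standard integral comparison for signals in swℓp(R) (this computation is ours, consistent with the O(R · K^{1/r − 1/p}) bound from the abstract with r = 2):

```latex
\sigma_K(x)_2 \;=\; \Big(\textstyle\sum_{i > K} |x|_{(i)}^2\Big)^{1/2}
\;\lesssim\; \Big(\int_K^{\infty} R^2\, t^{-2/p}\, \mathrm{d}t\Big)^{1/2}
\;=\; \frac{R}{\sqrt{2/p - 1}}\, K^{1/2 - 1/p}, \qquad 0 < p < 2.
```

With p = 1.67, the exponent 1/2 − 1/p ≈ −0.1 gives the O(K^{−0.1}) rate, while p = 1 recovers the theoretical O(K^{−0.5}) rate.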
Unfortunately, these approaches approximate the wavelet coefficients of natural images that have almost no approximation power for the overall image. Moreover, the learned GGD distribution is dimension dependent, assigns lower probability to the large coefficients that explain the image well, and predicts a mismatched OS of natural images (cf. Fig. 2(b)).

Figure 2(c) compares the magnitude-ordered pixel gradients of the images (solid lines) with the expected OS of the GGD (dashed line). From the figure, it appears that the natural image pixel gradients lose their compressibility as the image dimensions grow, similar to the GGD, Weibull, gamma, and log-normal distributions. In the figure, the GGD parameters are (q, λ) = (0.95, 25).

Figure 3: Improvements afforded by re-weighted ℓ1 decoding (a) with known parameters θ and (b) with learning; (c) the learned swℓp exponent of the GPD distribution with q = 0.4, via the RANSAC algorithm.

7.3 Iterative ℓ1 Decoding

We repeat the compressible signal recovery experiment in Section 3.2 of [7] to demonstrate the performance of our iterative ℓs decoder with s = 1 in (13).
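A minimal sketch of such a reweighted ℓ1 decoder, posed as a linear program: the generic weight rule w_i = 1/(|x_i| + ε) of [7] stands in for the prior-matched weights of (13), and the toy problem sizes are ours, not the paper's:

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1(Phi, y, w):
    """min_x sum_i w_i |x_i|  s.t.  Phi x = y, as an LP over z = [x; t]."""
    M, N = Phi.shape
    I = np.eye(N)
    c = np.concatenate([np.zeros(N), w])              # objective: w^T t
    A_ub = np.block([[I, -I], [-I, -I]])              # encodes |x_i| <= t_i
    A_eq = np.hstack([Phi, np.zeros((M, N))])
    bounds = [(None, None)] * N + [(0, None)] * N     # x free, t >= 0
    res = linprog(c, A_ub, np.zeros(2 * N), A_eq, y, bounds=bounds,
                  method="highs")
    return res.x[:N]

def reweighted_l1(Phi, y, iters=4, eps=0.1):
    """Generic reweighting w_i = 1/(|x_i| + eps), in the spirit of [7]."""
    x = weighted_l1(Phi, y, np.ones(Phi.shape[1]))    # standard l1 decoder
    for _ in range(iters):
        x = weighted_l1(Phi, y, 1.0 / (np.abs(x) + eps))
    return x

rng = np.random.default_rng(0)
N, M, K = 64, 32, 5                                   # toy sizes (not 256/128)
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
xhat = reweighted_l1(Phi, Phi @ x)
print(np.linalg.norm(x - xhat))
```

In the paper's setting, the weights and ε would instead be tied to the GPD parameters (q, λ), which are either known or learned on the fly via the RANSAC fit of (14).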
We first randomly sample a signal x ∈ R^N (N = 256) whose coefficients are iid from the GPD with q = 0.4 and λ = (N + 1)^{−1/q}, so that E[x̄_(1)] ≈ 1. We set M = 128 and draw a random M × N matrix Φ with iid Gaussian entries to obtain y = Φx. We then decode signals via (13), with the maximum number of iterations set to 5, both with knowledge of the signal parameters and with learning. During the learning phase, we use log(2) as the threshold for the RANSAC algorithm and set the maximum iteration count of RANSAC to 500.

The results of a Monte Carlo run with 100 independent realizations are illustrated in Fig. 3. In Figs. 3(a) and (b), the plots summarize the average improvement over the standard decoder Δ1(y) via the histograms of ‖x − x̂{4}‖2 / ‖x − Δ1(y)‖2, which have mean and standard deviation (0.7062, 0.1380) when we know the parameters of the GPD (a) and (0.7101, 0.1364) when we learn the parameters of the GPD via RANSAC (b). The learned swℓp exponent is summarized by the histogram in Fig. 3(c), which has mean and standard deviation (0.3757, 0.0539). Hence, we conclude that our alternative learning approach via the RANSAC algorithm is competitive with knowing the actual prior parameters that generated the signal. Moreover, the computational time of learning is insignificant compared to the time required by the state-of-the-art SPGL1 algorithm [24].

8 Conclusions

Compressible priors create a connection between probabilistic and deterministic models for signal compressibility. The bridge between these two seemingly different modeling frameworks turns out to be the concept of order statistics. We demonstrated that when the p-parameter of a compressible prior is independent of the ambient dimension N, it is possible to have a truly logarithmic embedding of its iid signal realizations.
Moreover, the learned parameters of such compressible priors are dimension agnostic. In contrast, we showed that when the p-parameter depends on N, we face many restrictions in signal embedding and recovery, as well as in parameter learning. We illustrated that the wavelet coefficients of natural images can be well approximated by the generalized Pareto prior, which in turn predicts a disappointing approximation rate for image coding with the naïve sparse model and for CS image recovery from measurements that grow only logarithmically with the image dimension. We motivated many of the existing sparse signal recovery algorithms as instances of a corresponding compressible prior and discussed parameter learning for these priors from dimensionality-reduced data. We hope that the iid compressibility view taken in this paper will pave the way for a better understanding of probabilistic non-iid and structured compressibility models.

3 We thank R. G. Baraniuk, M. Wakin, M. Davies, J. Haupt, and J. P. Slavinsky for useful discussions. Supported by ONR N00014-08-1-1112, DARPA N66001-08-1-2065, and ARO W911NF-09-1-0383 grants.

References

[1] E. J. Candès. Compressive sampling. In Proc. International Congress of Mathematicians, volume 3, pages 1433–1452, Madrid, Spain, 2006.

[2] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1:211–244, 2001.

[3] D. P. Wipf and B. D. Rao. Sparse Bayesian learning for basis selection. IEEE Transactions on Signal Processing, 52(8):2153–2164, 2004.

[4] T. Blumensath and M. E. Davies. Sampling theorems for signals from the union of linear subspaces. IEEE Trans. Info. Theory, 2009.

[5] A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.

[6] I. F. Gorodnitsky, J. S.
George, and B. D. Rao. Neuromagnetic source imaging with FOCUSS: a recursive weighted minimum norm algorithm. Electroenceph. and Clin. Neurophys., 95(4):231–251, 1995.

[7] E. J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.

[8] D. P. Wipf and S. Nagarajan. Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions. In SPARS09, Rennes, France, 2009.

[9] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, pages 129–159, 2001.

[10] H. A. David and H. N. Nagaraja. Order Statistics. Wiley-Interscience, 2004.

[11] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.

[12] H. Choi and R. G. Baraniuk. Wavelet statistical models and Besov spaces. Lecture Notes in Statistics, pages 9–30, 2003.

[13] T. K. Nayak. Multivariate Lomax distribution: properties and usefulness in reliability theory. Journal of Applied Probability, pages 170–177, 1987.

[14] V. Cevher. Compressible priors. IEEE Trans. on Information Theory, in preparation, 2010.

[15] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, pages 267–288, 1996.

[16] E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19:1107, 2008.

[17] P. J. Garrigues. Sparse Coding Models of Natural Images: Algorithms for Efficient Inference and Learning of Higher-Order Structure. PhD thesis, EECS Department, University of California, Berkeley, May 2009.

[18] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In NIPS, 2000.

[19] D. Wipf and S. Nagarajan. A new view of automatic relevance determination.
In NIPS, volume 20, 2008.

[20] I. Daubechies, R. DeVore, M. Fornasier, and S. Gunturk. Iteratively re-weighted least squares minimization for sparse recovery. Commun. Pure Appl. Math, 2009.

[21] R. Chartrand and W. Yin. Iteratively reweighted algorithms for compressive sensing. In ICASSP, pages 3869–3872, 2008.

[22] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[23] E. J. Candès and D. L. Donoho. Curvelets and curvilinear integrals. Journal of Approximation Theory, 113(1):59–90, 2001.

[24] E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008.