{"title": "Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2295, "page_last": 2303, "abstract": "Non-parametric Bayesian techniques are considered for learning dictionaries for sparse image representations, with applications in denoising, inpainting and compressive sensing (CS). The beta process is employed as a prior for learning the dictionary, and this non-parametric method naturally infers an appropriate dictionary size. The Dirichlet process and a probit stick-breaking process are also considered to exploit structure within an image. The proposed method can learn a sparse dictionary in situ; training images may be exploited if available, but they are not required. Further, the noise variance need not be known, and can be non-stationary. Another virtue of the proposed method is that sequential inference can be readily employed, thereby allowing scaling to large images. Several example results are presented, using both Gibbs and variational Bayesian inference, with comparisons to other state-of-the-art approaches.", "full_text": "Non-Parametric Bayesian Dictionary Learning for\n\nSparse Image Representations\n\nMingyuan Zhou, Haojun Chen, John Paisley, Lu Ren, 1Guillermo Sapiro and Lawrence Carin\n\nDepartment of Electrical and Computer Engineering\nDuke University, Durham, NC 27708-0291, USA\n\n1Department of Electrical and Computer Engineering\nUniversity of Minnesota, Minneapolis, MN 55455, USA\n\n{mz1,hc44,jwp4,lr,lcarin}@ee.duke.edu, {guille}@umn.edu\n\nAbstract\n\nNon-parametric Bayesian techniques are considered for learning dictionaries for\nsparse image representations, with applications in denoising, inpainting and com-\npressive sensing (CS). The beta process is employed as a prior for learning the\ndictionary, and this non-parametric method naturally infers an appropriate dic-\ntionary size. 
The Dirichlet process and a probit stick-breaking process are also\nconsidered to exploit structure within an image. The proposed method can learn\na sparse dictionary in situ; training images may be exploited if available, but they\nare not required. Further, the noise variance need not be known, and can be\nnon-stationary. Another virtue of the proposed method is that sequential inference can\nbe readily employed, thereby allowing scaling to large images. Several example\nresults are presented, using both Gibbs and variational Bayesian inference, with\ncomparisons to other state-of-the-art approaches.\n\n1 Introduction\nThere has been significant recent interest in sparse signal expansions in several settings. For\nexample, such algorithms as the support vector machine (SVM) [1], the relevance vector machine\n(RVM) [2], Lasso [3] and many others have been developed for sparse regression (and\nclassification). A sparse representation has several advantages, including the fact that it encourages a simple\nmodel, and therefore over-training is often avoided. The inferred sparse coefficients also often have\nbiological/physical meaning, of interest for model interpretation [4].\nOf relevance for the current paper, there has recently been significant interest in sparse\nrepresentations in the context of denoising, inpainting [5–10], compressive sensing (CS) [11, 12], and\nclassification [13]. All of these applications exploit the fact that most images may be sparsely represented\nin an appropriate dictionary. Most of the CS literature assumes “off-the-shelf” wavelet and DCT\nbases/dictionaries [14], but recent denoising and inpainting research has demonstrated the\nsignificant advantages of learning an often over-complete dictionary matched to the signals of interest\n(e.g., images) [5–10, 12, 15]. 
The purpose of this paper is to perform dictionary learning using\nnew non-parametric Bayesian technology [16, 17] that offers several advantages not found in earlier\napproaches, which have generally sought point estimates.\nThis paper makes four main contributions:\n• The dictionary is learned using a beta process construction [16, 17], and therefore the number of\ndictionary elements and their relative importance may be inferred non-parametrically.\n• For the denoising and inpainting applications, we do not have to assume a priori knowledge of the\nnoise variance (it is inferred within the inversion). The noise variance can also be non-stationary.\n• The spatial inter-relationships between different components in images are exploited by use of the\nDirichlet process [18] and a probit stick-breaking process [19].\n• Using learned dictionaries, inferred off-line or in situ, the proposed approach yields CS\nperformance that is markedly better than existing standard CS methods as applied to imagery.\n\n2 Dictionary Learning with a Beta Process\nIn traditional sparse coding tasks, one considers a signal x ∈ ℝ^n and a fixed dictionary D =\n(d1, d2, . . . , dM), where each dm ∈ ℝ^n. We wish to impose that any x ∈ ℝ^n may be represented\napproximately as x̂ = Dα, where α ∈ ℝ^M is sparse, and our objective is to also minimize the ℓ2\nerror ‖x̂ − x‖2. With a proper dictionary, a sparse α often manifests robustness to noise (the model\ndoesn’t fit noise well), and the model also yields effective inference of α even when x is partially\nor indirectly observed via a small number of measurements (of interest for inpainting, interpolation\nand compressive sensing [5, 7]). 
To the authors’ knowledge, all previous work in this direction has\nbeen performed in the following manner: (i) if D is given, the sparse vector α is estimated via a\npoint estimate (without a posterior distribution), typically based on orthogonal matching pursuit\n(OMP), basis pursuit or related methods, for which the stopping criterion is defined by assuming\nknowledge (or off-line estimation) of the noise variance or the sparsity level of α; and (ii) when\nthe dictionary D is to be learned, the dictionary size M must be set a priori, and a point estimate\nis achieved for D (in practice one may infer M via cross-validation, with this step avoided in the\nproposed method). In many applications one may not know the noise variance or an appropriate\nsparsity level of α; further, one may be interested in the confidence of the estimate (e.g., “error\nbars” on the estimate of α). To address these goals, we propose a non-parametric\nBayesian formulation of this problem, in terms of the beta process, which allows one to infer the\nappropriate values of M and ‖α‖0 (sparsity level) jointly, and which yields a full posterior density\nfunction on the learned D and the inferred α (for a particular x), providing a measure of confidence\nin the inversion. As discussed further below, the non-parametric Bayesian formulation also allows\none to relax other assumptions that have been made in the field of learning D and α for denoising,\ninpainting and compressive sensing. Further, other goals are readily addressed within\nthe non-parametric Bayesian paradigm, e.g., designing D for joint compression and classification.\n\n2.1 Beta process formulation\nWe desire the model x = Dα + ε, where x ∈ ℝ^n and D ∈ ℝ^{n×M}, and we wish to learn D and in so\ndoing infer M. 
Toward this end, we consider a dictionary D ∈ ℝ^{n×K}, with K → ∞; by inferring\nthe number of columns of D that are required for accurate representation of x, the appropriate\nvalue of M is implicitly inferred (work has been considered in [20, 21] for the related but distinct\napplication of factor analysis). We wish to also impose that α ∈ ℝ^K is sparse, and therefore only\na small fraction of the columns of D are used for representation of a given x. Specifically, assume\nthat we have a training set 𝒟 = {xi, yi}i=1,N, where xi ∈ ℝ^n and yi ∈ {1, 2, . . . , Nc}, where\nNc ≥ 2 represents the number of classes from which the data arise; when learning the dictionary we\nignore the class labels yi, and later discuss how they may be considered in the learning process.\nThe two-parameter beta process (BP) was developed in [17], to which the reader is referred for\nfurther details; we here only provide those details of relevance for the current application. The BP\nwith parameters a > 0 and b > 0, and base measure H0, is represented as BP(a, b, H0), and a draw\nH ∼ BP(a, b, H0) may be represented as\n\nH(ψ) = Σ_{k=1}^{K} π_k δ_{ψ_k}(ψ),   π_k ∼ Beta(a/K, b(K − 1)/K),   ψ_k ∼ H_0   (1)\n\nwith this a valid measure as K → ∞. The expression δ_{ψ_k}(ψ) equals one if ψ = ψk and is zero\notherwise. Therefore, H(ψ) represents a vector of K probabilities, with each associated with a\nrespective atom ψk. In the limit K → ∞, H(ψ) corresponds to an infinite-dimensional vector of\nprobabilities, and each probability has an associated atom ψk drawn i.i.d. from H0.\nUsing H(ψ), we may now draw N binary vectors, the ith of which is denoted zi ∈ {0, 1}^K,\nand the kth component of zi is drawn zik ∼ Bernoulli(πk). 
These N binary column vectors are\nused to constitute a matrix Z ∈ {0, 1}^{K×N}, with ith column corresponding to zi; the kth row of\nZ is associated with atom ψk, drawn as discussed above. For our problem the atoms ψk ∈ ℝ^n\nwill correspond to candidate members of our dictionary D, and the binary vector zi defines which\nmembers of the dictionary are used to represent sample xi ∈ 𝒟.\nLet Ψ = (ψ1, ψ2, . . . , ψK), and we may consider the limit K → ∞. A naive form of our model,\nfor representation of sample xi ∈ 𝒟, is xi = Ψzi + εi. However, this is highly restrictive, as it\nimposes that the coefficients of the dictionary expansion must be binary. To address this, we draw\nweights wi ∼ N(0, γw^{−1} I_K), where γw is the precision or inverse variance; the dictionary weights\nare now αi = zi ◦ wi, and xi = Ψαi + εi, where ◦ represents the Hadamard (element-wise)\nmultiplication of two vectors. Note that, by construction, α is sparse; this imposition of sparseness\nis distinct from the widely used Laplace shrinkage prior [3], which imposes that many coefficients\nare small but not necessarily exactly zero.\nFor simplicity we assume that the dictionary elements, defined by the atoms ψk, are drawn from a\nmultivariate Gaussian base H0, and the components of the error vectors εi are drawn i.i.d. from a\nzero-mean Gaussian. The hierarchical form of the model may now be expressed as\n\nx_i = Ψα_i + ε_i,   α_i = z_i ◦ w_i\nΨ = (ψ_1, ψ_2, . . . , ψ_K),   ψ_k ∼ N(0, n^{−1} I_n)\nw_i ∼ N(0, γ_w^{−1} I_K),   ε_i ∼ N(0, γ_ε^{−1} I_n)\nz_i ∼ ∏_{k=1}^{K} Bernoulli(π_k),   π_k ∼ Beta(a/K, b(K − 1)/K)   (2)\n\nNon-informative gamma hyper-priors are typically placed on γw and γε. 
Consecutive elements\nin the above hierarchical model are in the conjugate exponential family, and therefore\ninference may be implemented via a variational Bayesian [22] or Gibbs-sampling analysis, with\nanalytic update equations (all inference update equations, and the software, can be found at\nhttp://people.ee.duke.edu/~lihan/cs/). After performing such inference, we retain those columns\nof Ψ that are used in the representation of the data in 𝒟, thereby inferring D and hence M.\nTo impose our desire that the vector of dictionary weights α is sparse, one may adjust the parameters\na and b. Particularly, as discussed in [17], in the limit K → ∞, the number of elements of zi that\nare non-zero is a random variable drawn from Poisson(a/b). In Section 3.1 we discuss the fact that\nthese parameters are in general non-informative and the sparsity is intrinsic to the data.\n\n2.2 Accounting for a classification task\nThere are problems for which it is desired that x is sparsely rendered in D, and the associated\nweight vector α may be employed for other purposes beyond representation. For example, one may\nperform a classification task based on α. If one is interested in joint compression and classification,\nboth goals should be accounted for when designing D. For simplicity, we assume that the number\nof classes is Nc = 2 (binary classification), with this readily extended [23] to Nc > 2.\nFollowing [9], we may define a linear or bilinear classifier based on the sparse weights α and the\nassociated data x (in the bilinear case), with this here implemented in the form of a probit classifier.\nWe focus on the linear model, as it is simpler (has fewer parameters), and the results in [9]\ndemonstrated that it was often as good as or better than the bilinear classifier. 
To account for classification,\nthe model in (2) remains unchanged, and the following may be added to the top of the hierarchy:\nyi = 1 if θ^T α̂ + ν > 0, yi = 2 if θ^T α̂ + ν < 0, θ ∼ N(0, γθ^{−1} I_{K+1}), and ν ∼ N(0, γ0^{−1}), where\nα̂ ∈ ℝ^{K+1} is the same as α ∈ ℝ^K with an appended one, to account for the classifier bias. Again,\none typically places (non-informative) gamma hyper-priors on γθ and γ0. With the added layers for\nthe classifier, the conjugate-exponential character of the model is retained, sustaining the ability to\nperform VB or MCMC inference with analytic update equations. Note that the model in (2) may\nbe employed for unlabeled data, and the extension above may be employed for the available labeled\ndata; consequently, all data (labeled and unlabeled) may be processed jointly to infer D.\n\n2.3 Sequential dictionary learning for large training sets\nIn the above discussion, we implicitly assumed all data 𝒟 = {xi, yi}i=1,N are used together to\ninfer the dictionary D. However, in some applications N may be large, and therefore such a “batch”\napproach is undesirable. To address this issue one may partition the data as 𝒟 = 𝒟1 ∪ 𝒟2 ∪\n. . . ∪ 𝒟J−1 ∪ 𝒟J, with the data processed sequentially. This issue has been considered for point\nestimates of D [8], in which considerations are required to assure algorithm convergence. It is\nof interest to briefly note that sequential inference is handled naturally via the proposed Bayesian\nanalysis.\nSpecifically, let p(D|𝒟, Θ) represent the posterior on the desired dictionary, with all other model\nparameters marginalized out (e.g., the sample-dependent coefficients α); the vector Θ represents\nthe model hyper-parameters. 
In a Bayesian analysis, rather than evaluating p(D|𝒟, Θ) directly, one\nmay employ the same model (prior) to infer p(D|𝒟1, Θ). This posterior may then serve as a prior\nfor D when considering next 𝒟2, inferring p(D|𝒟1 ∪ 𝒟2, Θ). When doing variational Bayesian\n(VB) inference we have an analytic approximate representation for posteriors such as p(D|𝒟1, Θ),\nwhile for Gibbs sampling we may use the inferred samples. When presenting results in Section 5,\nwe discuss additional means of sequentially accelerating a Gibbs sampler.\n\n3 Denoising, Inpainting and Compressive Sensing\n3.1 Image Denoising and Inpainting\nAssume we are given an image I ∈ ℝ^{Ny×Nx} with additive noise and missing pixels; we here assume\na monochrome image for simplicity, but color images are also readily handled, as demonstrated\nwhen presenting results. As is done typically [6, 7], we partition the image into NB = (Ny −\nB + 1) × (Nx − B + 1) overlapping blocks {xi}i=1,NB, for each of which xi ∈ ℝ^{B^2} (B = 8 is\ntypically used). If there is only additive noise but no missing pixels, then the model in (2) can be\nreadily applied for simultaneous dictionary learning and image denoising. If there are both noise\nand missing pixels, instead of directly observing xi, we observe a subset of the pixels in each xi.\nNote that here Ψ and {αi}i=1,NB, which are used to recover the original noise-free and complete\nimage, are directly inferred from the data under test; one may also employ an appropriate training\nset 𝒟 with which to learn a dictionary D offline, or for initialization of in situ learning.\nIn denoising and inpainting studies of this type (see for example [6, 7] and references therein), it\nis often assumed that either the variance is known and used as a “stopping” criterion, or that the\nsparsity level is pre-determined and fixed for all i ∈ {1, . . . , NB}. 
While these may be practical in\nsome applications, we feel it is more desirable to not make these assumptions. In (2) the noise\nprecision (inverse variance), γε, is assumed drawn from a non-informative gamma distribution, and\na full posterior density function is inferred for γε (and all other model parameters). In addition,\nthe problems of addressing spatially nonuniform noise as well as nonuniform noise across color\nchannels are of interest [7]; they are readily handled in the proposed model by drawing a separate\nprecision γε for each color channel in each B × B block, each of which is drawn from a shared\ngamma prior.\nThe sparsity level of the representation in our model, i.e., {‖αi‖0}i=1,N, is influenced by the\nparameters a and b in the beta prior in (2). Examining the posterior p(πk|−) ∼ Beta(a/K +\nΣ_{i=1}^{N} zik, b(K − 1)/K + N − Σ_{i=1}^{N} zik), conditioned on all other parameters, we find that most\nsettings of a and b tend to be non-informative, especially in the case of sequential learning\n(discussed further in Section 5). Therefore, the average sparsity level of the representation is inferred from\nthe data itself and each sample xi has its own unique sparse representation based on the posterior,\nwhich provides much more flexibility than enforcing the same sparsity level for each sample.\n\n3.2 Compressive sensing\nWe consider CS in the manner employed in [12]. Assume our objective is to measure an image\nI ∈ ℝ^{Ny×Nx}, with this image composed of the 8 × 8 blocks {xi}i=1,NB. Rather than measuring\nthe xi directly, pixel-by-pixel, in CS we perform the projection measurement vi = Φxi, where\nvi ∈ ℝ^{Np}, with Np representing the number of projections, and Φ ∈ ℝ^{Np×64} (assuming that xi\nis represented by a 64-dimensional vector). There are many (typically random) ways in which Φ\nmay be constructed, with the reader referred to [24]. 
Our goal is to have Np ≪ 64, thereby yielding\ncompressive measurements. Based on the CS measurements {vi}i=1,NB, our objective is to recover\n{xi}i=1,NB.\nConsider a potential dictionary Ψ, as discussed in Section 2. It is assumed that for each of the\n{xi}i=1,NB from the image under test xi = Ψαi + εi, for sparse αi and relatively small error\n‖εi‖2. The number of projections Np needed for accurate estimation of αi is proportional\nto ‖αi‖0 [11], with this underscoring the desirability of learning a dictionary in which very sparse\nrepresentations are manifested (as compared to using an “off-the-shelf” wavelet or DCT basis).\nFor CS inversion, the model in (2) is employed, and therefore the appropriate dictionary D is learned\njointly while performing CS inversion, in situ on the image under test. When performing CS\nanalysis, in (2), rather than observing xi, we observe vi = ΦDαi + εi, for i = 1, . . . , NB (the likelihood\nfunction is therefore modified slightly).\nAs discussed when presenting results, one may also learn the CS dictionary in advance, off-line,\nwith appropriate training images (using the model in (2)). However, the unique opportunity for joint\nCS inversion and learning of an appropriate parsimonious dictionary is deemed to be a significant\nadvantage, as it does not presuppose that one would know an appropriate training set in advance.\nThe inpainting problem may be viewed as a special case of CS, in which each row of Φ corresponds\nto a delta function, locating a unique pixel on the image at which useful (unobscured) data are\nobserved. Those pixels that are unobserved, or that are contaminated (e.g., by superposed text [7]),\nare not considered when inferring the αi and D. A CS camera designed around an inpainting\nconstruction has several advantages, from the standpoint of simplicity. 
As observed from the results\nin Section 5, an inpainting-based CS camera would simply observe a subset of the usual pixels,\nselected at random.\n\n4 Exploiting Spatial Structure\nFor the applications discussed above, the {xi}i=1,NB come from the single image under test, and\nconsequently there is underlying (spatial) structure that should ideally be exploited. Rather than\nre-writing the entire model in (2), we focus on the following equations in the hierarchy: zi ∼\n∏_{k=1}^{K} Bernoulli(πk), and π ∼ ∏_{k=1}^{K} Beta(a/K, b(K − 1)/K). Instead of having a single vector\nπ = {π1, . . . , πK} that is shared for all {xi}i=1,NB, it is expected that there may be a mixture of π\nvectors, corresponding to different segments in the image. Since the number of mixture components\nis not known a priori, this mixture is modeled via a Dirichlet process [18]. We may therefore\nemploy, for i = 1, . . . , NB,\n\nz_i ∼ ∏_{k=1}^{K} Bernoulli(π_ik),   π_i ∼ G,   G ∼ DP(β, ∏_{k=1}^{K} Beta(a/K, b(K − 1)/K))   (3)\n\nAlternatively, we may cluster the zi directly, yielding zi ∼ G, G ∼ DP(β, ∏_{k=1}^{K} Bernoulli(πk)),\nπ ∼ ∏_{k=1}^{K} Beta(a/K, b(K − 1)/K), where the zi are drawn i.i.d. from G. In practice we\nimplement such DP constructions via a truncated stick-breaking representation [25], again retaining the\nconjugate-exponential structure of interest for analytic VB or Gibbs inference. In such an analysis\nwe place a non-informative gamma prior on the precision β.\nThe construction in (3) clusters the blocks, and therefore it imposes structure not present in the\nsimpler model in (2). However, the DP still assumes that the members of {xi}i=1,NB are\nexchangeable. 
Space limitations preclude discussing this matter in detail here, but we have also considered\nreplacement of the DP framework above with a probit stick-breaking process (PSBP) [19], which\nexplicitly imposes that it is more likely for proximate blocks to be in the same cluster, relative to\ndistant blocks. When presenting results, we show examples in which PSBP has been used, with\nits relative effectiveness compared to the simpler DP construction. The PSBP again retains full\nconjugate-exponential character within the hierarchy, of interest for efficient inference, as discussed\nabove.\n\n5 Example Results\n\nFor the denoising and inpainting results, we observed that the Gibbs sampler provided better\nperformance than associated variational Bayesian inference. For denoising and inpainting we may exploit\nshifted versions of the data, which accelerates convergence substantially (discussed in detail\nbelow). Therefore, all denoising and inpainting results are based on efficient Gibbs sampling. For CS\nwe cannot exploit shifted images, and therefore to achieve fast inversion variational Bayesian (VB)\ninference [22] is employed; for this application VB has proven to be quite effective, as discussed\nbelow. The same set of model hyper-parameters is used across all our denoising, inpainting and\nCS examples (no tuning was performed): all gamma priors are set as Gamma(10^{−6}, 10^{−6}), along\nthe lines suggested in [2], and the beta distribution parameters are set with a = K and b = N/8\n(many other settings of a and b yield similar results).\n\n5.1 Denoising\nWe consider denoising a 256 × 256 image, with comparison of the proposed approach to K-SVD [6]\n(for which the noise variance is assumed known and fixed); the true noise standard deviation is\nset at 15, 25 and 50 in the examples below. 
We show results for three algorithms: (i) mismatched\nK-SVD (with noise standard deviation of 30), (ii) K-SVD when the standard deviation is properly\nmatched, and (iii) the proposed BP approach. For (iii) a non-informative prior is placed on the\nnoise precision, and the same BP model is run for all three noise levels (with the underlying noise\nlevels inferred). The BP and K-SVD employed no a priori training data. Figure 1 shows\nthe noisy images at the three different noise levels, as well as the reconstructions via BP and\nK-SVD. A preset large dictionary size K = 256 is used for both algorithms, and for the BP results\nwe inferred that approximately M = 196, 128, and 34 dictionary elements were important for noise\nstandard deviations 15, 25, and 50, respectively; the remaining elements of the dictionary were used\nless than 0.1% of the time. As seen within the bottom portion of the right part of Figure 1, the\nunused dictionary elements appear as random draws from the prior, since they are not used and\nhence not influenced by the data.\nNote that K-SVD works well when the set noise variance is at or near truth, but the method is\nundermined by mismatch. The proposed BP approach is robust to changing noise levels. Quantitative\nperformance is summarized in Table 1. 
The BP denoiser estimates a full posterior density\nfunction on the noise standard deviation; for the examples considered here, the modes of the inferred\nstandard-deviation posteriors were 15.57, 25.35, and 48.12, for true standard deviations 15, 25, and\n50, respectively.\nTo achieve these BP results, we employ a sequential implementation of the Gibbs sampler (a batch\nimplementation converges to the same results but with higher computational cost); this is discussed\nin further detail below, when presenting inpainting results.\n\nFigure 1: Left: Representative denoising results, with the top through bottom rows corresponding to noise\nstandard deviations of 15, 25 and 50, respectively. The second and third columns represent K-SVD [6] results\nwith assumed standard deviation equal to 30 and the ground truth, respectively. The fourth column represents\nthe proposed BP reconstructions. The noisy images are in the first column. Right: Inferred BP dictionary\nelements for noise standard deviation 25, in order of importance (probability to be used) from the top-left.\n\nTable 1: Peak signal-to-reconstructed image measure (PSNR) for the data in Figure 1, for K-SVD [6] and the\nproposed BP method. The true standard deviation was 15, 25 and 50, respectively, from the top to the bottom\nrow. For the mismatched K-SVD results, the noise standard deviation was fixed at 30.\n\nTrue noise std | Original noisy image (dB) | K-SVD denoising, mismatched variance (dB) | K-SVD denoising, matched variance (dB) | Beta process denoising (dB)\n15 | 24.58 | 30.67 | 34.32 | 34.44\n25 | 20.19 | 31.52 | 32.15 | 32.17\n50 | 14.56 | 19.60 | 27.95 | 28.08\n\n5.2 Inpainting\nOur inpainting and denoising results were achieved by using the following sequential procedure.\nConsider any pixel [p, j], where p, j ∈ [1, B], and let this pixel constitute the left-bottom pixel in\na new B × B block. 
Further, consider all B × B blocks with left-bottom pixels at {p + ℓB, j +\nmB} ∪ δ(p − 1){Ny − B + 1, j + mB} ∪ δ(j − 1){p + ℓB, Nx − B + 1} for ℓ and m that satisfy\np + ℓB ≤ Ny − B + 1 and j + mB ≤ Nx − B + 1. This set of blocks is denoted data set 𝒟pj,\nand considering 1 ≤ p ≤ B and 1 ≤ j ≤ B, there are a total of B^2 such shifted data sets. In the\nfirst iteration of learning Ψ, we employ the blocks in 𝒟11, and for this first round we initialize Ψ\nand αi based on a singular value decomposition (SVD) of the blocks in 𝒟11 (we achieved similar\nresults when Ψ was initialized randomly). We do several Gibbs iterations with 𝒟11 and then stop\nthe Gibbs algorithm, retaining the last sample of Ψ and αi from the previous step. These Ψ and αi\nare then used to initialize the Gibbs sampler in the second round, now applied to the B × B blocks\nin 𝒟11 ∪ 𝒟21 (for 𝒟21 the neighboring αi is used for initialization). The Gibbs sampler is now run\non this expanded data for several iterations, the last sample is retained, and the data set is augmented\nagain. This is done B^2 = 64 times until at the end all shifted blocks are processed simultaneously.\nThis sequential process may be viewed as a sequential Gibbs burn-in, after which all of the shifted\nblocks are processed.\n\nFigure 2: Inpainting results. The curve shows the PSNR as a function of the B^2 = 64 Gibbs learning rounds.\nThe left figure is the test image, with 80% of the RGB pixels missing, the middle figure is the result after 64\nGibbs rounds (final result), and the right figure is the original uncontaminated image.\n\nTheoretically, one would expect to need thousands of Gibbs iterations to achieve convergence. 
However, our experience is that even a single iteration in each of the above B^2 rounds yields good results.\nIn Figure 2 we show the PSNR as a function of each of the B^2 = 64 rounds discussed above. For\nGibbs rounds 16, 32 and 64 the corresponding PSNR values were 26.78 dB, 28.46 dB and 29.31 dB.\nFor this example we used K = 256. This example was considered in [7] (we obtained similar results\nfor the “New Orleans” image, also considered in [7]); the best results reported there were a PSNR of\n29.65 dB. However, to achieve those results a training data set was employed for initialization [7];\nthe BP results are achieved with no a priori training data. Concerning computational costs, the\ninpainting and denoising algorithms scale linearly as a function of the block size, the dictionary size,\nthe sparsity level, and the number of training samples; all results reported here were run efficiently\nin Matlab on PCs, with costs comparable to those of K-SVD.\n\n5.3 Compressive sensing\nWe consider a CS example, in which the image is divided into 8 × 8 patches, with these constituting\nthe underlying data {xi}i=1,NB to be inferred. For each of the NB blocks, a vector of CS\nmeasurements vi = Φxi is measured, where the number of projections per patch is Np, and the total number\nof CS projections is NpNB. In this example the elements of Φ were constructed randomly as draws\nfrom N(0, 1), but many other projection classes may be considered [11, 24]. Each xi is assumed\nrepresented in terms of a dictionary xi = Dαi + εi, and three constructions for D were considered:\n(i) a DCT expansion; (ii) learning of D using the beta process construction, using training images;\n(iii) using the beta process to perform joint CS inversion and learning of D. 
For (ii), the training\ndata consisted of 4000 8 × 8 patches chosen at random from 100 images selected from the Microsoft\ndatabase (http://research.microsoft.com/en-us/projects/objectclassrecognition). The dictionary size was\nset to K = 256, and the offline beta process inferred a dictionary of size M = 237.\nRepresentative CS reconstruction results are shown in Figure 3, for a gray-scale version of the\n“castle” image. The inversion results at left are based on a learned dictionary; except for the “online\nBP” results, all of these results employ the same dictionary D learned off-line as above, and the\nalgorithms are distinguished by different ways of estimating {αi}i=1,NB. A range of CS-inversion\nalgorithms is considered from the literature, and several BP-based constructions are considered as\nwell for CS inversion. The online BP results are quite competitive with those inferred off-line.\nOne also notes that the results based on a learned dictionary (left in Figure 3) are markedly better\nthan those based on the DCT (right in Figure 3); similar results were achieved when the DCT was\nreplaced by a wavelet representation. For the DCT-based results, note that the DP- and PSBP-based\nBP CS inversion results are significantly better than those of all other CS inversion algorithms.\nThe results reported here are consistent with tests we performed using over 100 images from the\naforementioned Microsoft database, not reported here in detail for brevity.\nNote that CS inversion using the DP-based BP algorithm (as discussed in Section 4) yields the best\nresults, significantly better than BP results not based on the DP, and better than all competing CS\ninversion algorithms (for both learned dictionaries and the DCT). 
The DP-based results are very\nsimilar to those generated by the probit stick-breaking process (PSBP) [19], which enforces spatial\ninformation more explicitly; this suggests that the simpler DP-based construction is adequate, at least\nfor the wide class of examples considered. Note that we also considered the DP and PSBP for\nthe denoising and inpainting examples above (those results were omitted, for brevity). The DP and\nPSBP denoising and inpainting results were similar to BP results without DP/PSBP (those presented\nabove); this is attributed to the fact that when performing denoising/inpainting we may consider\nmany shifted versions of the same image (as discussed when presenting the inpainting results).\nConcerning computational costs, all CS inversions were run efficiently on PCs, with the specific\ncomputational times dictated by the detailed Matlab implementation and the machine used. A\nrough ranking of the computational speeds, from fastest to slowest, is as follows: StOMP-CFAR,\nFast BCS, OMP, BP, LARS/Lasso, Online BP, DP BP, PSBP BP, VB BCS, Basis Pursuit; in this\nlist, algorithms BP through Basis Pursuit have approximately the same computational costs. The\nDP-based BP CS inversion algorithm scales as O(N_B \u00b7 N_p \u00b7 B^2).\n\nFigure 3: CS performance (fraction of \u2113_2 error) based on learned dictionaries (left) and based on the DCT\n(right). For the left results, the \u201cOnline BP\u201d results simultaneously learned the dictionary and did CS inversion;\nthe remainder of the left results are based on a dictionary learned offline on a training set. A DCT dictionary\nis used for the results on the right. The underlying image under test is shown at right. Matlab code for Basis\nPursuit, LARS/Lasso, OMP and StOMP is available at http://sparselab.stanford.edu/, and code for BCS and Fast\nBCS is available at http://people.ee.duke.edu/~lihan/cs/. The horizontal axis represents the total number of\nCS projections, N_p N_B. 
The total number of pixels in the image is 480 \u00d7 320 = 153,600. 99.9% of the signal\nenergy is contained in 33,500 DCT coefficients.\n\n6 Conclusions\nThe non-parametric beta process has been presented for dictionary learning with the goal of image\ndenoising, inpainting and compressive sensing, with very encouraging results relative to the state\nof the art. The framework may also be applied to joint compression-classification tasks. In the\ncontext of noisy underlying data, the noise variance need not be known in advance, and it need not\nbe spatially uniform. The proposed formulation also allows unique opportunities to leverage known\nstructure in the data, such as relative spatial locations within an image; this framework was used to\nachieve marked improvements in CS-inversion quality.\n\nAcknowledgement\nThe research reported here was supported in part by ARO, AFOSR, DOE, NGA and ONR.\n\nReferences\n[1] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.\n[2] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 2001.\n[3] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 1994.\n[4] B.A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1998.\n[5] M. Aharon, M. Elad, and A. M. Bruckstein. 
K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing, 54, 2006.\n[6] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Processing, 15, 2006.\n[7] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Trans. Image Processing, 17, 2008.\n[8] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proc. International Conference on Machine Learning, 2009.\n[9] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In Proc. Neural Information Processing Systems, 2008.\n[10] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Proc. Neural Information Processing Systems, 2006.\n[11] E. Cand\u00e8s and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Information Theory, 52, 2006.\n[12] J.M. Duarte-Carvajalino and G. Sapiro. Learning to sense sparse signals: Simultaneous sensing matrix and sparsifying dictionary optimization. IMA Preprint Series 2211, 2008.\n[13] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. Pattern Analysis Machine Intelligence, 31, 2009.\n[14] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Trans. Signal Processing, 56, 2008.\n[15] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: transfer learning from unlabeled data. In Proc. International Conference on Machine Learning, 2007.\n[16] R. Thibaux and M.I. Jordan. Hierarchical beta processes and the Indian buffet process. In Proc. International Conference on Artificial Intelligence and Statistics, 2007.\n[17] J. Paisley and L. Carin. 
Nonparametric factor analysis with beta process priors. In Proc. International Conference on Machine Learning, 2009.\n[18] T. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1, 1973.\n[19] A. Rodriguez and D.B. Dunson. Nonparametric Bayesian models through probit stick-breaking processes. Univ. California Santa Cruz Technical Report, 2009.\n[20] D. Knowles and Z. Ghahramani. Infinite sparse factor analysis and infinite independent components analysis. In Proc. International Conference on Independent Component Analysis and Signal Separation, 2007.\n[21] P. Rai and H. Daum\u00e9 III. The infinite hierarchical factor regression model. In Proc. Neural Information Processing Systems, 2008.\n[22] M.J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.\n[23] M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18, 2006.\n[24] R.G. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24, 2007.\n[25] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4, 1994.", "award": [], "sourceid": 190, "authors": [{"given_name": "Mingyuan", "family_name": "Zhou", "institution": null}, {"given_name": "Haojun", "family_name": "Chen", "institution": null}, {"given_name": "Lu", "family_name": "Ren", "institution": null}, {"given_name": "Guillermo", "family_name": "Sapiro", "institution": null}, {"given_name": "Lawrence", "family_name": "Carin", "institution": null}, {"given_name": "John", "family_name": "Paisley", "institution": null}]}