{"title": "On Sparsity and Overcompleteness in Image Models", "book": "Advances in Neural Information Processing Systems", "page_first": 89, "page_last": 96, "abstract": "Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly overcomplete.", "full_text": "On Sparsity and Overcompleteness in Image Models\n\nPietro Berkes, Richard Turner, and Maneesh Sahani\n\nGatsby Computational Neuroscience Unit, UCL\n\nAlexandra House, 17 Queen Square, London WC1N 3AR\n\nAbstract\n\nComputational models of visual cortex, and in particular those based on sparse\ncoding, have enjoyed much recent attention. Despite this currency, the question\nof how sparse or how over-complete a sparse representation should be, has gone\nwithout principled answer. Here, we use Bayesian model-selection methods to ad-\ndress these questions for a sparse-coding model based on a Student-t prior. Hav-\ning validated our methods on toy data, we \ufb01nd that natural images are indeed best\nmodelled by extremely sparse distributions; although for the Student-t prior, the\nassociated optimal basis size is only modestly over-complete.\n\n1 Introduction\n\nComputational models of visual cortex, and in particular those based on sparse coding, have re-\ncently enjoyed much attention. 
The basic assumption behind sparse coding is that natural scenes are\ncomposed of structural primitives (edges or lines, for example) and, although there are a potentially\nlarge number of these primitives, typically only a few are active in a single natural scene (hence the\nterm sparse, [1, 2]). The claim is that cortical processing uses these statistical regularities to shape\na representation of natural scenes, and in particular converts the pixel-based representation at the\nretina to a higher-level representation in terms of these structural primitives.\n\nTraditionally, research has focused on determining the characteristics of the structural primitives and\ncomparing their representational properties with those of V1. This has been a successful enterprise,\nbut as a consequence other important questions have been neglected. The two we focus on here\nare: How large is the set of structural primitives best suited to describe all natural scenes (how\nover-complete), and how many primitives are active in a single scene (how sparse)? We will also be\ninterested in the coupling between sparseness and over-completeness. The intuition is that, if there\nare a great number of structural primitives, they can be very speci\ufb01c and only a small number will\nbe active in a visual scene. Conversely if there are a small number they have to be more general and\na larger number will be active on average. We attempt to map this coupling by evaluating models\nwith different over-completenesses and sparsenesses and discover where natural scenes live along\nthis trade-off (see Fig. 1).\n\nIn order to test the sparse coding hypothesis it is necessary to build algorithms that both learn the\nprimitives and decompose natural scenes in terms of them. There have been many ways to derive\nsuch algorithms, but one of the more successful is to regard the task of building a representation\nof natural scenes as one of probabilistic inference. 
More specifically, the unknown activities of the structural primitives are viewed as latent variables that must be inferred from the natural scene data. Commonly the inference is carried out by writing down a generative model (although see [3] for an alternative), which formalises the assumptions made about the data and latent variables. The rules of probability are then used to derive inference and learning algorithms.\n\nUnfortunately the assumption that natural scenes are composed of a small number of structural primitives is not sufficient to build a meaningful generative model. Other assumptions must therefore be made and typically these are that the primitives occur independently, and combine linearly.\n\nFigure 1: Schematic showing the space of possible sparse coding models in terms of sparseness (increasing in the direction of the arrow) and over-completeness. For reference, complete models lie along the dashed black line. Ideally every model could be evaluated (e.g. via their marginal likelihood or cross-validation) and the grey contours illustrate what we might expect to discover if this were possible: The solid black line illustrates the hypothesised trade-off between over-completeness and sparsity, whilst the star shows the optimal point in this trade-off.\n\nThese are drastic approximations and it is an open question to what extent this affects the results of sparse coding. The distribution over the latent variables x_{t,k} is chosen to be sparse and typical choices are Student-t, a Mixture of Gaussians (with zero means), and the Generalised Gaussian (which includes the Laplace distribution). 
The output y_t is then given by a linear combination of the K D-dimensional structural primitives g_k, weighted by their activities, plus some additive Gaussian noise (the model reduces to independent components analysis in the absence of this noise [4]),\n\np(x_{t,k} | \alpha) = p_{\mathrm{sparse}}(x_{t,k}; \alpha)   (1)\np(y_t | x_t, G) = \mathcal{N}_{y_t}(G x_t, \Sigma_y) .   (2)\n\nThe goal of this paper will be to learn the optimal dimensionality of the latent variables (K) and the optimal sparseness of the prior (\u03b1). In order to do this a notion of optimality has to be defined. One option is to train many different sparse-coding models and find the one which is most \u201csimilar\u201d to visual processing. (Indeed this might be a fair characterisation of much of the current activity in the field.) However, this is fraught with difficulty, not least as it is unclear how recognition models map to neural processes. We believe the more consistent approach is, once again, to use the Bayesian framework and view this as a problem of probabilistic inference. In fact, if the hypothesis is that the visual system is implementing an optimal generative model, then questions of over-completeness and sparsity should be addressed in this context.\n\nUnfortunately, this is not a simple task and quite sophisticated machine-learning algorithms have to be harnessed in order to answer these seemingly simple questions. In the first part of this paper we describe these algorithms and then validate them using artificial data. Finally, we present results concerning the optimal sparseness and over-completeness for natural image patches in the case that the prior is a Student-t distribution.\n\n2 Model\n\nAs discussed earlier, there are many variants of sparse-coding. 
Here, we focus on the Student-t prior for the latent variables x_{t,k}:\n\np(x_{t,k} | \alpha, \lambda) = \frac{\Gamma\left(\frac{\alpha+1}{2}\right)}{\lambda \sqrt{\alpha\pi}\, \Gamma\left(\frac{\alpha}{2}\right)} \left( 1 + \frac{1}{\alpha} \left( \frac{x_{t,k}}{\lambda} \right)^2 \right)^{-\frac{\alpha+1}{2}}   (3)\n\nThere are two main reasons for this choice: The first is that this is a widely used model [1]. The second is that by implementing the Student-t prior using an auxiliary variable, all the distributions in the generative model become members of the exponential family [5]. This means it is easy to derive efficient approximate inference schemes like variational Bayes and Gibbs sampling.\n\nThe auxiliary variable method is based on the observation that a Student-t distribution is a continuous mixture of zero-mean Gaussians, whose mixing proportions are given by a Gamma distribution over the precisions. This indicates that we can exchange the Student-t prior for a two-step prior in which we first draw a precision from a Gamma distribution and then draw an activation from a Gaussian with that precision,\n\np(u_{t,k} | \alpha, \lambda) = \mathcal{G}_{u_{t,k}}\left( \frac{\alpha}{2}, \frac{2}{\alpha\lambda^2} \right) ,   (4)\np(x_{t,k} | u_{t,k}) = \mathcal{N}_{x_{t,k}}\left( 0, u_{t,k}^{-1} \right) ,   (5)\np(y_t | x_t, G) = \mathcal{N}_{y_t}(G x_t, \Sigma_y) ,   (6)\n\Sigma_y := \mathrm{diag}\left( \sigma_y^2 \right) .   (7)\n\nThis model produces data which are often near zero, but occasionally highly non-zero. These non-zero elements form star-like patterns, where the points of the star are determined by the direction of the weights (e.g., Fig. 2).\n\nOne of the major technical difficulties posed by sparse-coding is that, in the over-complete regime, the posterior distribution of the latent variables p(X|Y, \u03b8) is often complex and multi-modal. Approximation schemes are therefore required, but we must be careful to ensure that the scheme we choose does not bias the conclusions we are trying to draw. 
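The two-step construction of Eqs. 4-7 is easy to simulate. The following sketch is a minimal illustration assuming NumPy; the weight matrix G and the values \u03b1 = 8, \u03bb = 1 and \u03c3_y = 0.1 are arbitrary choices for the demonstration, not values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings: alpha controls sparseness, lam the scale,
# G is an arbitrary 2x3 weight matrix (D = 2 observations, K = 3 latents).
alpha, lam, sigma_y = 8.0, 1.0, 0.1
G = np.array([[1.0, 0.5, -0.5],
              [0.0, 0.8,  0.8]])
T, K = 500_000, G.shape[1]

# Step 1 (Eq. 4): draw precisions from a Gamma distribution.
# NumPy's Gamma takes (shape, scale); scale = 2/(alpha*lam^2) is rate alpha*lam^2/2.
u = rng.gamma(shape=alpha / 2.0, scale=2.0 / (alpha * lam**2), size=(T, K))

# Step 2 (Eq. 5): draw activations from zero-mean Gaussians with those precisions.
x = rng.normal(0.0, 1.0 / np.sqrt(u))

# Step 3 (Eq. 6): linear combination plus Gaussian observation noise.
y = x @ G.T + rng.normal(0.0, sigma_y, size=(T, G.shape[0]))

# Marginally, each x[t, k] is Student-t with alpha d.o.f. and scale lam,
# so its variance should be close to lam^2 * alpha / (alpha - 2).
print(x.var(), lam**2 * alpha / (alpha - 2.0))
```

A scatter plot of y would show the star-like pattern described above: most points concentrated near the origin, with rare large excursions along the directions of the columns of G.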
This is true for any application of sparse coding, but is particularly pertinent for our problem as we will be quantitatively comparing different sparse-coding models.\n\n3 Bayesian Model Comparison\n\nA possible strategy for investigating the sparseness/over-completeness coupling would be to tile the space with models and learn the parameters at each point (as schematised in Fig. 1). A model comparison criterion could then be used to rank the models, and to find the optimal sparseness/over-completeness. One such criterion would be to use cross validation and evaluate the likelihoods on some held-out test data. Another is to use (approximate) Bayesian Model Comparison, and it is on this method that we focus.\n\nTo evaluate the plausibility of two alternative versions of a model M, each with a different setting of the hyperparameters \Xi_1 and \Xi_2, in the light of some data Y, we compute the evidence [6]:\n\n\frac{p(M, \Xi_1 | Y)}{p(M, \Xi_2 | Y)} = \frac{p(Y | M, \Xi_1)\, P(M, \Xi_1)}{p(Y | M, \Xi_2)\, P(M, \Xi_2)} .   (8)\n\nSince we do not have any reason a priori to prefer one particular configuration of hyperparameters to another, we take the prior terms P(M, \Xi_i) to be equal, which leaves us with the ratio of the marginal-likelihoods (or Bayes Factor),\n\n\frac{P(Y | M, \Xi_1)}{P(Y | M, \Xi_2)} .   (9)\n\nThe marginal-likelihoods themselves are hard to compute, being formed from high dimensional integrals over the latent variables V and parameters \Theta,\n\np(Y | M, \Xi_i) = \int dV\, d\Theta\; p(Y, V, \Theta | M, \Xi_i)   (10)\n= \int dV\, d\Theta\; p(Y, V | \Theta, M, \Xi_i)\, p(\Theta | M, \Xi_i) .   (11)\n\nOne concern in model comparison might be that the more complex models (those which are more over-complete) have a larger number of parameters and therefore \u2018fit\u2019 any data set better. However, the Bayes factor (Eq. 
9) implicitly implements a probabilistic version of Occam\u2019s razor that penalises more complex models and mitigates this effect [6]. This makes the Bayesian method appealing for determining the over-completeness of a sparse-coding model.\n\nUnfortunately computing the marginal-likelihood is computationally intensive, and this precludes tiling the sparseness/over-completeness space. However, an alternative is to learn the optimal over-completeness at a given sparseness using automatic relevance determination (ARD) [7, 8]. The advantage of ARD is that it changes a hard and lengthy model comparison problem (i.e., computing the marginal-likelihood for many models of differing dimensionalities) into a much simpler inference problem. In a nutshell, the idea is to equip the model with many more components than are believed to be present in the data, and to let it prune out the weights which are unnecessary. Practically this involves placing a (Gaussian) prior over the components which favours small weights, and then inferring the scale of this prior. In this way the scale of the superfluous weights is driven to zero, removing them from the model. The necessary ARD hyper-priors are\n\np(g_k | \gamma_k) = \mathcal{N}_{g_k}\left( 0, \gamma_k^{-1} \right) ,   (12)\np(\gamma_k) = \mathcal{G}_{\gamma_k}(\theta_k, l_k) .   (13)\n\n4 Determining the over-completeness: Variational Bayes\n\nIn the previous two sections we described a generative model for sparse coding that is theoretically able to learn the optimal over-completeness of natural scenes. We have two distinct uses for this model: The first, and computationally more demanding task, is to learn the over-completeness at a variety of different, fixed, sparsenesses (that is, to find the optimal over-completeness in a vertical slice through Fig. 
1); The second is to determine the optimal point on this trade-off by evaluating the (approximate) marginal-likelihood (that is, evaluating points along the trade-off line in Fig. 1 to find the optimal model, the star). It turns out that no single method is able to solve both these tasks, but that it is possible to develop a pair of approximate algorithms to solve them separately. The first approximation scheme is Variational Bayes (VB), and it excels at the first task, but is severely biased in the case of the second. The second scheme is Annealed Importance Sampling (AIS), which is prohibitively slow for the first task, but much more accurate on the second. We describe them in turn, starting with VB.\n\nThe quantity required for learning is the marginal-likelihood,\n\n\log p(Y | M, \Xi) = \log \int dV\, d\Theta\; p(Y, V, \Theta | M, \Xi) .   (14)\n\nComputing this integral is intractable (for reasons similar to those given in Sec. 2), but a lower-bound can be constructed by introducing any distribution over the latent variables and parameters, q(V, \Theta), and using Jensen\u2019s inequality,\n\n\log p(Y | M, \Xi) \geq \int dV\, d\Theta\; q(V, \Theta) \log \frac{p(Y, V, \Theta | M, \Xi)}{q(V, \Theta)} =: \mathcal{F}(q(V, \Theta))   (15)\n= \log p(Y | M, \Xi) - \mathrm{KL}(q(V, \Theta)\,||\,p(V, \Theta | Y))   (16)\n\nThis lower-bound is called the free-energy, and the idea is to repeatedly optimise it with respect to the distribution q(V, \Theta) so that it becomes as close to the true marginal likelihood as possible. Clearly the optimal choice for q(V, \Theta) is the (intractable) true posterior. However, by constraining this distribution headway can be made. In particular if we assume that the set of parameters and set of latent variables are independent in the posterior, so that q(V, \Theta) = q(V)q(\Theta), then we can sequentially optimise the free-energy with respect to each of these distributions. 
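The decomposition in Eqs. 15-16 (free energy equals log evidence minus a KL divergence, and hence lower-bounds the evidence) can be checked numerically on a toy discrete model. The two-state latent variable and all probabilities below are made up purely for illustration:

```python
import numpy as np

# Hypothetical two-state toy: prior p(z) and likelihood p(y|z) for one observed y.
p_z = np.array([0.7, 0.3])          # prior over the latent z
p_y_given_z = np.array([0.2, 0.9])  # likelihood of the observed y under each z

p_joint = p_z * p_y_given_z          # p(y, z)
log_evidence = np.log(p_joint.sum()) # log p(y)
posterior = p_joint / p_joint.sum()  # p(z | y)

q = np.array([0.5, 0.5])             # an arbitrary approximating distribution

# Free energy (Eq. 15): F(q) = sum_z q(z) log [ p(y, z) / q(z) ]
F = np.sum(q * (np.log(p_joint) - np.log(q)))
# KL decomposition (Eq. 16): F = log p(y) - KL(q || p(z|y))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

print(F, log_evidence - kl)  # the two expressions coincide
print(F <= log_evidence)     # the bound holds: True
```

Maximising F over q therefore tightens the bound exactly insofar as q approaches the true posterior, which is the source of the sparsity-dependent bias discussed below.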
For large hierarchical models, including the one described in this paper, it is often necessary to introduce further factorisations within these two distributions in order to derive the updates. Their general form is,\n\nq(V_i) \propto \exp \left\langle \log p(Y, V, \Theta) \right\rangle_{q(\Theta) \prod_{j \neq i} q(V_j)}   (17)\nq(\Theta_i) \propto \exp \left\langle \log p(Y, V, \Theta) \right\rangle_{q(V) \prod_{j \neq i} q(\Theta_j)} .   (18)\n\nAs the Bayesian Sparse Coding model is composed of distributions from the exponential family, the functional form of these updates is the same as that of the corresponding priors. So, for example, the latent variables have the following form: q(x_t) is Gaussian and q(u_{t,k}) is Gamma distributed.\n\nAlthough this approximation is good at discovering the over-completeness of data at fixed sparsities, it provides an estimate of the marginal-likelihood (the free-energy) which is biased toward regions of low sparsity. The reason is simple to understand. The difference between the free energy and the true likelihood is given by the KL divergence between the approximate and true posterior. Thus, the free-energy bound is tightest in regions where q(V, \Theta) is a good match to the true posterior, and loosest in regions where it is a poor match. At high sparsities, the true posterior is multimodal and highly non-Gaussian. In this regime q(V, \Theta) \u2013 which is always uni-modal \u2013 is a poor approximation. At low sparsities the prior becomes Gaussian-like and the posterior also becomes a uni-modal Gaussian. In this regime q(V, \Theta) is an excellent approximation. This leads to a consistent bias in the peak of the free-energy toward regions of low sparsity. One might also be concerned with another potential source of bias: The number of modes in the posterior increases with the number of components in the model, which gives a worse match to the variational approximation for more over-complete models. 
However, because of the sparseness of the prior distribution, most of the modes are going\nto be very shallow for typical inputs, so that this effect should be small. We verify this claim on\narti\ufb01cial data in Section 6.2.\n\n5 Determining the sparsity: AIS\n\nAn approximation scheme is required to estimate the marginal-likelihood, but without a sparsity-\ndependent bias. Any scheme which uses a uni-modal approximation to the posterior will inevitably\nfall victim to such biases. This rules out many alternate variational schemes, as well as methods\nlike the Laplace approximation, or Expectation Propagation. One alternative might be to use a\nvariational method which has a multi-modal approximating distribution (e.g. a mixture model). The\napproach taken here is to use Annealed Importance Sampling (AIS) [9] which is one of the few\nmethods for evaluating normalising constants of intractable distributions. The basic idea behind\nAIS is to estimate the marginal-likelihood using importance sampling. The twist is that the proposal\ndistribution for the importance sampler is itself generated using an MCMC method. Brie\ufb02y, this\ninner loop starts by drawing samples from the model\u2019s prior distribution and continues to sample\nas the prior is deformed into the posterior, according to an annealing schedule. Both the details of\nthis schedule, and having a quick-mixing MCMC method, are critical for good results. In fact it is\nsimple to derive a quick-mixing Gibbs sampler for our application and this makes AIS particularly\nappealing.\n\n6 Results\n\nBefore tackling natural images, it is necessary to verify that the approximations can discover the\ncorrect degree of over-completeness and sparsity in the case where the data are drawn from the\nforward model. 
This is done in two stages: Firstly we focus on a very simple, low-dimensional example that is easy to visualise and which helps explicate the learning algorithms, allowing them to be tuned; Secondly, we turn to a larger scale example designed to be as similar to the tests on natural data as possible.\n\n6.1 Verification using simple artificial data\n\nIn the first experiment the training data are produced as follows: Two-dimensional observations are generated by three Student-t sources with degrees of freedom chosen to be 2.5. The generative weights are fixed to be 60 degrees apart from one another, as shown in Figure 2.\n\nA series of VB simulations were then run, differing only in the sparseness level (as measured by the degrees of freedom of the Student-t distribution over x_t). Each simulation consisted of 500 VB iterations performed on a set of 3000 data points randomly generated from the model. We initialised the simulations with K = 7 components. To improve convergence, we started the simulations with weights near the origin (drawn from a normal distribution with mean 0 and standard deviation 10^{-8}) and a relatively large input noise variance, and annealed the noise variance between the iterations of VBEM. The annealing schedule was as follows: we started with \sigma_y^2 = 0.3 for 100 iterations, reduced this linearly down to \sigma_y^2 = 0.1 in 100 iterations, and finally to \sigma_y^2 = 0.01 in a further 50 iterations. During the annealing process, the weights typically grew from the origin and spread in all directions to cover the input space. After an initial growth period, where the representation usually became as over-complete as allowed by the model, some of the weights rapidly shrank again and collapsed to the origin. At the same time, the corresponding precision hyperparameters grew and effectively pruned the unnecessary components. We performed 7 blocks of simulations at different sparseness levels. In every block we performed 3 runs of the algorithm and retained the result with the highest free energy.\n\nFigure 2: Left: Test data drawn from the simple artificial model. Centre: Free energy of the models learned by VBEM in the artificial data case. Right: Estimated log marginal likelihood. Error bars are 3 times the estimated standard deviation.\n\nThe marginal likelihoods of the selected results were then estimated using AIS. We derived the importance weights using a fixed data set with 2500 data points, 250 samples, and 300 intermediate distributions. Following the recommendations in [9], the annealing schedule was chosen to be linear initially (with 50 inverse temperatures spaced uniformly from 0 to 0.01), followed by a geometric section (250 inverse temperatures spaced geometrically from 0.01 to 1). This meant that there were a total of 300 distributions between the prior and posterior.\n\nThe results indicate that the combination of the two methods is successful at learning both the over-completeness and sparseness. In particular the VBEM algorithm was able to recover the correct dimensionality for all sparseness levels, except for the sparsest case \u03b1 = 2.1, where it preferred a model with 5 significant components. As expected, however, Figure 2 shows that the maximum free energy is biased toward the more Gaussian models. 
In contrast to this, the marginal likelihood estimated by AIS (Fig. 2), which is strictly greater than the free-energy as expected, favours sparseness levels close to the true value.\n\n6.2 Verification using complex artificial data\n\nAlthough it is necessary that the inference scheme passes simple tests like those in the previous section, such tests are not sufficient to give us confidence that it will perform successfully on natural data. One pertinent criticism is that the regime in which we tested the algorithms in the previous section (two-dimensional observations, and three hidden latents) is quite different from that required to model natural data. To that end, in this section we first learn a sparse model for natural images with fixed over-completeness levels using a Maximum A Posteriori (MAP) algorithm [2] (degrees of freedom 2.5). These solutions are then used to generate artificial data as in the previous section. The goal is to validate the model on data which has a content and scale similar to the natural images case, but with a controlled number of generative components.\n\nThe image data comprised patches of size 9 \u00d7 9 pixels, taken at random positions from 36 natural images randomly selected from the van Hateren database (preprocessed as described in [10]). The patches were whitened and their dimensionality reduced from 81 to 36 by principal component analysis. The MAP solution was trained for 500 iterations, with every iteration performed on a new batch of 1440 patches (40 patches per image).\n\nThe model was initialised with a 3-times over-complete number of components (K = 108). As above, the weights were initialised near the origin, and the input noise was annealed linearly from \u03c3_d = 0.5 to \u03c3_d = 0.2 in the first 300 iterations, remaining constant thereafter. Every run consisted of 500 VBEM iterations, with every iteration performed on 3600 patches generated from the MAP solution. 
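The pruning behaviour relied on here, in which superfluous components are removed as their precision hyperparameters grow, can be sketched in a much simpler setting than the sparse-coding model: ARD for Bayesian linear regression with one precision per weight, updated by MacKay-style fixed-point iterations rather than the paper's full variational treatment. All dimensions, values, and thresholds below are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: 10 candidate features, of which only 3 generate the data.
N, K = 200, 10
Phi = rng.normal(size=(N, K))
w_true = np.zeros(K)
w_true[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = Phi @ w_true + rng.normal(0.0, 0.1, size=N)

beta = 1.0 / 0.1**2    # noise precision, assumed known for simplicity
gamma = np.ones(K)     # one ARD precision per weight, analogous to Eq. 12

for _ in range(100):
    # Posterior over weights given the current precisions (Bayesian regression).
    S = np.linalg.inv(beta * Phi.T @ Phi + np.diag(gamma))
    mu = beta * S @ Phi.T @ y
    # MacKay fixed-point update: gamma_k <- (1 - gamma_k * S_kk) / mu_k^2.
    eff = 1.0 - gamma * np.diag(S)
    gamma = np.minimum(eff / (mu**2 + 1e-12), 1e6)  # cap runaway precisions

pruned = gamma > 1e4   # huge precision pins the weight to zero
print(np.where(~pruned)[0])  # surviving components
```

As in the sparse-coding runs, the precisions of the unnecessary components diverge and the corresponding weights collapse to the origin, while the genuinely used components keep moderate precisions and accurate posterior means.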
We performed several simulations for over-completeness levels between 0.5 and 4.5, and retained the solutions with the highest free energy.\n\nThe results are summarised in Figure 3: The model is able to recover the underlying dimensionality for data between 0.5 and 2 times over-complete, and correctly saturates to 3 times over-complete (the maximum attainable level here) when the data over-completeness exceeds 3. In the regime between 2.5 and 3 times over-complete data, the model returns solutions with a smaller number of components, which is possibly due to the bias described at the end of Section 4. However, these values are still far above the highest over-completeness learned from natural images (see Section 6.3), so that we believe that the bias does not invalidate our conclusions.\n\nFigure 3: True versus inferred over-completeness from data drawn from the forward model trained on natural images. If inference were perfect, the true over-completeness would be recovered (black line). This straight line saturates when we hit the number of latent variables with which ARD was initialised (three times over-complete). The results using multiple runs of ARD are close to this line (open circles; simulations with the highest free-energy are shown as closed circles). The maximal and best over-completeness inferred from natural scenes is shown by the dotted line, and lies well below the over-completenesses we are able to infer.\n\n6.3 Natural images\n\nHaving established that the model performs as expected, at least when the data is drawn from the forward model, we now turn to natural image data and examine the optimal over-completeness ratio and sparseness degree for natural scene statistics.\n\nThe image data for this simulation and the model initialisation and annealing procedure are identical to the ones in the experiments in the preceding section. We performed 20 simulations with different sparseness levels, especially concentrated on the sparser values. Every run comprised 500 VBEM iterations, with every iteration performed on a new batch of 3600 patches.\n\nAs shown in Figure 4, the free energy increased almost monotonically until \u03b1 = 5 and then stabilised and started to decrease for more Gaussian models. The algorithm learnt models that were only slightly over-complete: the over-completeness ratio was distributed between 1 and 1.3, with a trend for being more over-complete at high sparseness levels (Fig. 4). Although this general trend accords with the intuition that sparseness and over-completeness are coupled, both the magnitude of the effect and the degree of over-completeness are smaller than might have been anticipated. Indeed, this result suggests that highly over-complete models with a Student-t prior may very well be overfitting the data.\n\nFinally we performed AIS using the same annealing schedule as in Section 6.1, using 250 samples for the first 6 sparseness levels and 50 for the subsequent 14. 
The estimates obtained for the log marginal likelihood, shown in Figure 4, were monotonically increasing with increasing sparseness (decreasing \u03b1). This indicates that sparse models are indeed optimal for natural scenes. Note that this is exactly the opposite trend to that of the free energy, indicating that it is also biased for natural scenes. Figure 4 shows the basis vectors learned in the simulation with \u03b1 = 2.09, which had maximal marginal likelihood. The weights resemble Gabor wavelets, typical of sparse codes for natural images [1].\n\nFigure 4: Natural images results. a) Free energy. b) Marginal likelihood. c) Estimated over-completeness. d) Basis vectors.\n\n7 Discussion\n\nOur results suggest that the optimal sparse-coding model for natural scenes is indeed one which is very sparse, but only modestly over-complete. The anticipated coupling between the degree of sparsity and the over-completeness in the model is visible, but is weak.\n\nOne crucial question is how far these results will generalise to other prior distributions; and indeed, which of the various possible sparse-coding priors is best able to capture the structure of natural scenes. One indication that the Student-t might not be optimal is its behaviour as the degree-of-freedom parameter moves towards sparser values. 
The distribution puts a very small amount of mass at a very great distance from the mean (for example, the kurtosis is undefined for \u03b1 \u2264 4). It is not clear that data with such extreme values will be encountered in typical data sets, and so the model may become distorted at high sparseness values.\n\nFuture work will be directed towards more general prior distributions. The formulation of the Student-t in terms of a random precision Gaussian is computationally helpful. While no longer within the exponential family, other distributions on the precision (such as a uniform one) may be approximated using a similar approach.\n\nAcknowledgements\n\nThis work has been supported by the Gatsby Charitable Foundation. We thank Yee Whye Teh, Iain Murray, and David MacKay for fruitful discussions.\n\nReferences\n\n[1] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607\u2013609, 1996.\n\n[2] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311\u20133325, 1997.\n\n[3] Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235\u20131260, 2003.\n\n[4] A.J. Bell and T.J. Sejnowski. The \u2018independent components\u2019 of natural scenes are edge filters. Vision Research, 37(23):3327\u20133338, 1997.\n\n[5] S. Osindero, M. Welling, and G.E. Hinton. Topographic product models applied to natural scene statistics. Neural Computation, 18:381\u2013414, 2006.\n\n[6] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415\u2013447, 1992.\n\n[7] C.M. Bishop. Variational principal components. In ICANN 1999 Proceedings, pages 509\u2013514, 1999.\n\n[8] M.J. Beal. Variational Algorithms for Approximate Bayesian Inference. 
PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.\n\n[9] R.M. Neal. Annealed importance sampling. Statistics and Computing, 11:125\u2013139, 2001.\n\n[10] J.H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265:359\u2013366, 1998.\n", "award": [], "sourceid": 1048, "authors": [{"given_name": "Pietro", "family_name": "Berkes", "institution": null}, {"given_name": "Richard", "family_name": "Turner", "institution": null}, {"given_name": "Maneesh", "family_name": "Sahani", "institution": null}]}