{"title": "Select-and-Sample for Spike-and-Slab Sparse Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 3934, "page_last": 3942, "abstract": "Probabilistic inference serves as a popular model for neural processing. It is still unclear, however, how approximate probabilistic inference can be accurate and scalable to very high-dimensional continuous latent spaces. Especially as typical posteriors for sensory data can be expected to exhibit complex latent dependencies including multiple modes. Here, we study an approach that can efficiently be scaled while maintaining a richly structured posterior approximation under these conditions. As example model we use spike-and-slab sparse coding for V1 processing, and combine latent subspace selection with Gibbs sampling (select-and-sample). Unlike factored variational approaches, the method can maintain large numbers of posterior modes and complex latent dependencies. Unlike pure sampling, the method is scalable to very high-dimensional latent spaces. Among all sparse coding approaches with non-trivial posterior approximations (MAP or ICA-like models), we report the largest-scale results. In applications we firstly verify the approach by showing competitiveness in standard denoising benchmarks. Secondly, we use its scalability to, for the first time, study highly-overcomplete settings for V1 encoding using sophisticated posterior representations. 
More generally, our study shows that very accurate probabilistic inference for multi-modal posteriors with complex dependencies is tractable, functionally desirable and consistent with models for neural inference.", "full_text": "Select-and-Sample for Spike-and-Slab Sparse Coding\n\nAbdul-Saboor Sheikh\n\nTechnical University of Berlin, Germany,\n\nand Cluster of Excellence Hearing4all\nUniversity of Oldenburg, Germany,\n\nand SAP Innovation Center Network, Berlin\n\nsheikh.abdulsaboor@gmail.com\n\nJ\u00f6rg L\u00fccke\n\nResearch Center Neurosensory Science\nand Cluster of Excellence Hearing4all\n\nand Dept. of Medical Physics and Acoustics\n\nUniversity of Oldenburg, Germany\n\njoerg.luecke@uol.de\n\nAbstract\n\nProbabilistic inference serves as a popular model for neural processing.\nIt is\nstill unclear, however, how approximate probabilistic inference can be accurate\nand scalable to very high-dimensional continuous latent spaces. Especially as\ntypical posteriors for sensory data can be expected to exhibit complex latent\ndependencies including multiple modes. Here, we study an approach that can\nef\ufb01ciently be scaled while maintaining a richly structured posterior approximation\nunder these conditions. As example model we use spike-and-slab sparse coding for\nV1 processing, and combine latent subspace selection with Gibbs sampling (select-\nand-sample). Unlike factored variational approaches, the method can maintain\nlarge numbers of posterior modes and complex latent dependencies. Unlike pure\nsampling, the method is scalable to very high-dimensional latent spaces. Among all\nsparse coding approaches with non-trivial posterior approximations (MAP or ICA-\nlike models), we report the largest-scale results. In applications we \ufb01rstly verify the\napproach by showing competitiveness in standard denoising benchmarks. 
Secondly,\nwe use its scalability to, for the \ufb01rst time, study highly-overcomplete settings for\nV1 encoding using sophisticated posterior representations. More generally, our\nstudy shows that very accurate probabilistic inference for multi-modal posteriors\nwith complex dependencies is tractable, functionally desirable and consistent with\nmodels for neural inference.\n\n1\n\nIntroduction\n\nThe sensory data that enters our brain through our sensors has a high intrinsic dimensionality and\nit is complex and ambiguous. Image patches or small snippets of sound, for instance, often do\nnot contain suf\ufb01cient information to identify edges or phonemes with high degrees of certainty.\nProbabilistic models are therefore very well suited to maintain uncertainty encodings. Given an\nimage patch, for instance, high probabilities for an edge in one location impacts the probabilities\nfor other components resulting in complex dependencies commonly known as \"explaining-away\"\neffects. Such dependencies in general include (anti-)correlations, higher-order dependencies and\nmultiple posterior modes (i.e., alternative interpretations of a patch). Furthermore, sensory data\nis typically composed of many different elementary constituents (e.g., an image patch contains\nsome of a potentially very large number of components) resulting in sparse coding models aiming\nat increasing overcompleteness [1]. If sensory data gives rise to complex posterior dependencies\nand has high intrinsic dimensionality, how can we study inference and learning in such settings?\nTo date most studies, e.g. of V1 encoding models, have avoided the treatment of complex latent\ndependencies by assuming standard sparse models with Laplace priors [2, 3, 1]; high-dimensional\nproblems can then be addressed by applying maximum a-posteriori (MAP) approximations for the\nresulting mono-modal posteriors. 
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Other scalable approaches such as independent component analysis (ICA) or singular value decomposition (K-SVD) [4, 5] do not encode for data uncertainty, which avoids posterior estimations altogether. For advanced data models, which we expect to be required, e.g., for visual data, neither MAP nor a non-probabilistic treatment can be expected to be sufficient. A number of studies have shown, for example, that sparse coding models with more flexible spike-and-slab priors are (A) more closely aligned with the true generative process, e.g., for images, and (B) result in improved functional performance [6, 7]. Spike-and-slab priors do, however, result in posteriors with complex dependencies including many modes [8, 7]. Inference for spike-and-slab sparse coding is therefore well suited for studying efficient inference and learning with complex posteriors in high dimensions in general. Results for spike-and-slab sparse coding are, furthermore, of direct interest for other important models such as hierarchical communities of experts [9], deep Boltzmann machines (see [6]), or convolutional neural networks [10]. For these typically deep systems, too, very high-dimensional inference and learning is of crucial importance.
So far, intractable inference for spike-and-slab sparse coding was approximated using sampling or factored variational approaches. While sampling approaches can in principle model any dependencies including multiple modes, they have been found challenging to train at scale, with the largest-scale applications reaching a few hundred latents [11, 12].
Compared to sampling, approaches using factored variational approximations cannot model as complex posterior dependencies because they assume posterior independence (no correlations etc.); however, they can capture multiple modes and are scalable to several hundreds up to thousands of latents [8, 6]. In this work we combine the accuracy of sampling approaches and the scalability of variational approaches by applying select-and-sample [13] to scale spike-and-slab sparse coding to very high latent dimensions. In contrast to using a factored approximation, we here select low-dimensional subspaces of the continuous hidden space, and then apply sampling to approximate posteriors within these lower-dimensional spaces.

2 The Spike-and-Slab Sparse Coding Model and Parameter Optimization

The spike-and-slab sparse coding model (see [8, 6] and citations therein) used for our study assumes a Bernoulli prior over all $H$ components of the binary latent vector $\vec{b} \in \{0,1\}^H$, with a Gaussian prior (the 'slab') for the continuous latent vector $\vec{z} \in \mathbb{R}^H$:

p(\vec{b}\,|\,\Theta) = \prod_h \pi^{b_h} (1-\pi)^{1-b_h}, \qquad p(\vec{z}\,|\,\Theta) = \prod_h \mathcal{N}(z_h; \mu_h, \psi_h^2),   (1)

where $\pi$ defines the probability of $b_h$ being equal to one and where $\vec{\mu} \in \mathbb{R}^H$ and $\vec{\psi} \in \mathbb{R}^H$ parameterize the Gaussian slab. A spike-and-slab hidden variable $\vec{s} \in \mathbb{R}^H$ is then generated by a pointwise multiplication: $\vec{s} = \vec{b} \odot \vec{z}$, i.e., $s_h = b_h z_h$. Given the hidden variable $\vec{s}$, we follow standard sparse coding by linearly superimposing a set of latent components (i.e., $W\vec{s} = \sum_h \vec{W}_h s_h$) to set the mean of a Gaussian noise model:

p(\vec{y}\,|\,\vec{s}, \Theta) = \prod_d \mathcal{N}\big(y_d;\ \textstyle\sum_h W_{dh} s_h,\ \sigma_d^2\big),   (2)

which then generates the observed data $\vec{y} \in \mathbb{R}^D$. Here the columns of the matrix $W \in \mathbb{R}^{D \times H}$ are each a latent component $\vec{W}_h$ that is associated with a spike-and-slab latent variable $s_h$. We use $\vec{\sigma} \in \mathbb{R}^D$ to parameterize the observation noise. The parameters of the generative model (1) to (2) are together denoted by $\Theta = (\pi, \vec{\mu}, \vec{\psi}, W, \vec{\sigma})$. To find the values of $\Theta$, we seek to maximize the data likelihood $\mathcal{L} = \prod_{n=1}^{N} p(\vec{y}^{(n)}\,|\,\Theta)$ under the spike-and-slab data model and given a set of $N$ data points $\{\vec{y}^{(n)}\}_{n=1,\ldots,N}$. To derive a learning algorithm, we apply expectation maximization (EM) in its free-energy formulation. In our case the free-energy is given by:

\mathcal{F}(q, \Theta) = \sum_{n=1}^{N} \Big[ \big\langle \log p(\vec{y}^{(n)}, \vec{s}\,|\,\Theta) \big\rangle_n + \mathcal{H}(q^{(n)}) \Big], \quad \text{where} \quad \langle f(\vec{s}) \rangle_n = \int q^{(n)}(\vec{s})\, f(\vec{s})\, d\vec{s}

is the expectation under $q^{(n)}$, a distribution over the latent space, and $\mathcal{H}(\cdot)$ denotes the Shannon entropy. Given the free-energy, the parameter updates are canonically derived by setting the partial derivatives of $\mathcal{F}(q, \Theta)$ w.r.t. the parameters to zero.
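As a concrete illustration of the generative process (1) to (2), the following is a minimal sketch of ancestral sampling from the model. The function name and array layout are ours, not from the paper's implementation:

```python
import numpy as np

def sample_spike_and_slab(W, pi, mu, psi, sigma, N, rng):
    """Draw N data points from the spike-and-slab model (1)-(2).

    W: (D, H) dictionary, pi: spike probability, mu/psi: (H,) slab mean/std,
    sigma: (D,) observation noise std."""
    D, H = W.shape
    b = rng.random((N, H)) < pi                        # Bernoulli spikes b_h
    z = mu + psi * rng.standard_normal((N, H))         # Gaussian slab z_h
    s = b * z                                          # pointwise product, s_h = b_h z_h
    y = s @ W.T + sigma * rng.standard_normal((N, D))  # linear superposition + noise
    return y, s
```

With $\pi \ll 1$, most coordinates of each $\vec{s}$ come out exactly zero, which is the sparsity that the subspace selection of Section 3 exploits.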
For the spike-and-slab sparse coding model (1) and (2), we obtain (similar to [8, 6, 7]) the following closed-form M-step equations:

\pi = \frac{1}{NH} \sum_n \langle |\vec{b}| \rangle_n\,, \qquad \psi_h^2 = \frac{\sum_n \langle (s_h - \mu_h b_h)^2 \rangle_n}{\sum_n \langle b_h \rangle_n}\,, \qquad \vec{\mu} = \frac{\sum_n \langle \vec{s}\, \rangle_n}{\sum_n \langle \vec{b}\, \rangle_n}\,,   (3)

W = \Big( \sum_n \vec{y}^{(n)} \langle \vec{s}\, \rangle_n^{\mathrm{T}} \Big) \Big( \sum_n \langle \vec{s}\,\vec{s}^{\,\mathrm{T}} \rangle_n \Big)^{-1}, \qquad \sigma_d^2 = \frac{1}{N} \sum_n \Big\langle \big( \textstyle\sum_h W_{dh} s_h - y_d^{(n)} \big)^2 \Big\rangle_n\,,   (4)

with $s_h = b_h z_h$ as defined above, $|\vec{x}| = \sum_h x_h$, and the division for $\vec{\mu}$ taken elementwise.

3 Approximate Inference With Select-and-Sample

The optimal choices for the distributions $q^{(n)}(\vec{s})$ for the expectations in (3) and (4) are the posteriors $p(\vec{s}\,|\,\vec{y}^{(n)}, \Theta)$, but neither the posteriors nor their corresponding expectation values are computationally tractable in high dimensions. However, a crucial observation that we exploit in our work is that for observed data such as natural sensory input or data generated by a sparse coding model, the activity of latent components (or causes) can be expected to be concentrated in low-dimensional subspaces. In other words, for a given observed data point, all except for a very small fraction of the latent components can be assumed to be non-causal or irrelevant, hence the corresponding latent space can be neglected for the integration over $\vec{s}$. For a sparse instantiation (i.e., $\pi \ll 1$) of the spike-and-slab model (1) to (2), we consider such low-dimensional subspaces to be spanned by a few (approximately $\pi H$) of the $H$ latent space coordinates.
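The M-step updates (3) and (4) can be written down directly once the expectations are replaced by sample averages. Below is a minimal sketch under our own conventions (posterior samples stacked as an `(N, M, H)` array, with zeros outside the selected subspace; the function name is ours and no safeguards, e.g. against never-active latents, are included):

```python
import numpy as np

def m_step(Y, S):
    """Closed-form M-step (3)-(4) with expectations as sample averages.

    Y: (N, D) data; S: (N, M, H) posterior samples of s per data point."""
    N, M, H = S.shape
    B = (S != 0.0)                                # b_h = 1 wherever s_h != 0
    Eb = B.mean(axis=1)                           # <b>_n, shape (N, H)
    Es = S.mean(axis=1)                           # <s>_n, shape (N, H)
    EssT = np.einsum('nmh,nmk->nhk', S, S) / M    # <s s^T>_n, shape (N, H, H)

    pi = Eb.sum() / (N * H)
    mu = Es.sum(axis=0) / Eb.sum(axis=0)          # elementwise ratio
    Es2 = (S ** 2).mean(axis=1)
    # <(s_h - mu_h b_h)^2> = <s^2> - 2 mu <s> + mu^2 <b>, since s_h b_h = s_h
    psi2 = (Es2 - 2 * mu * Es + mu ** 2 * Eb).sum(axis=0) / Eb.sum(axis=0)
    # W = (sum_n y <s>^T)(sum_n <s s^T>)^{-1}, solved instead of inverted
    W = np.linalg.solve(EssT.sum(axis=0).T, Es.T @ Y).T
    resid = np.einsum('dh,nmh->nmd', W, S) - Y[:, None, :]
    sigma2 = (resid ** 2).mean(axis=1).mean(axis=0)   # per-dimension noise variance
    return pi, mu, psi2, W, sigma2
```

Using `solve` on the symmetric matrix $\sum_n \langle \vec{s}\vec{s}^{\mathrm{T}}\rangle_n$ avoids forming an explicit inverse.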
If we denote by $J^{(n)}$ the subspace containing the large majority of posterior mass for a given data point $\vec{y}^{(n)}$, an approximation to $p(\vec{s}\,|\,\vec{y}^{(n)}, \Theta)$ is then given by the following truncated distribution:

q^{(n)}(\vec{s}; \Theta) = \frac{p(\vec{s}\,|\,\vec{y}^{(n)}, \Theta)}{\int_{\vec{s}' \in J^{(n)}} p(\vec{s}'\,|\,\vec{y}^{(n)}, \Theta)\, d\vec{s}'}\ \delta(\vec{s} \in J^{(n)}),   (5)

where $\delta(\vec{s} \in J^{(n)})$ is an indicator function, taking the value $\delta(\vec{s} \in J^{(n)}) = 1$ only if $\vec{s} \in J^{(n)}$ and zero otherwise. Truncated approximations have previously been shown to work efficiently and accurately for challenging data models [14, 15, 16]. Latents were restricted to be binary, however, and scalability was previously limited by the combinatorics within the selected latent subsets. For our aim of very large-scale applications, we therefore apply the select-and-sample approach [13] and use a sampling approximation that operates within the subspaces $J^{(n)}$. Unlike [13], who used binary latents, we here apply the approximation to the continuous latent space of spike-and-slab sparse coding. Formally, this means that we first use the posterior approximation $q^{(n)}(\vec{s})$ in Eqn. 5 and then approximate the expectation values w.r.t. $q^{(n)}(\vec{s})$ using sampling (see illustration of Alg. 1):

\langle f(\vec{s}) \rangle_n = \int q^{(n)}(\vec{s})\, f(\vec{s})\, d\vec{s} \ \approx\ \frac{1}{M} \sum_{m=1}^{M} f(\vec{s}^{(m)}), \quad \text{where } \vec{s}^{(m)} \sim q^{(n)}(\vec{s}),   (6)

$M$ is the number of samples and $f(\vec{s})$ can be any argument of the expectation values in (3) and (4).
It remains to be shown how difficult sampling from $q^{(n)}(\vec{s})$ is compared to directly sampling from the full posterior $p(\vec{s}\,|\,\vec{y}^{(n)}, \Theta)$. The indicator function $\delta(\vec{s} \in J^{(n)})$ means that we can clamp all coordinates of $\vec{s}$ outside the subspace to zero, but we have to answer the question how the remaining $s_h$ are sampled.
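Operationally, the clamping in (5) and the Monte Carlo average in (6) amount to embedding the subspace samples into $\mathbb{R}^H$ with zeros outside the selected index set. A small sketch (names are ours, not from the paper):

```python
import numpy as np

def truncated_expectation(f, samples_I, I, H):
    """Approximate <f(s)>_n as in Eq. (6): samples live in the selected
    subspace; coordinates outside the index set I are clamped to zero."""
    M = samples_I.shape[0]
    S = np.zeros((M, H))
    S[:, I] = samples_I            # embed subspace samples into the full space
    return sum(f(s) for s in S) / M
```

Any argument of the expectations in (3) and (4), e.g. $f(\vec{s}) = \vec{s}$ or $f(\vec{s}) = \vec{s}\vec{s}^{\mathrm{T}}$, can be plugged in as `f`.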
A closer analysis of the problem shows that the distribution to sample in the reduced space is given by the posterior w.r.t. a truncated generative model. To show this, let us first introduce some notation: let us denote by $I$ a subset of the indices of the latent variables $\vec{s}$, i.e., $I \subseteq \{1, \ldots, H\}$, and let us use $H \backslash I$ as an abbreviation for $\{1, \ldots, H\} \backslash I$. The vector $\vec{s}_I$ w.r.t. $I$ is then, as customary, a vector in $\mathbb{R}^{|I|}$ defined by those entries $s_h$ with $h \in I$. In analogy, we take a matrix $W_I \in \mathbb{R}^{D \times |I|}$ to be defined by the row vectors $(\vec{w}_d^{\mathrm{T}})_I$, where $\vec{w}_d^{\mathrm{T}}$ are the row vectors of $W \in \mathbb{R}^{D \times H}$.

Proposition 1. Consider the spike-and-slab generative model (1) to (2) with parameters $\Theta$, and let $\Theta_{I^{(n)}} = (\pi, \vec{\mu}_{I^{(n)}}, \vec{\psi}_{I^{(n)}}, W_{I^{(n)}}, \vec{\sigma})$ be the parameters of a truncated spike-and-slab model with an $H' = \dim(I^{(n)})$ dimensional latent space. Then sampling from the truncated distribution in (5) is equivalent to sampling from the posterior $p(\vec{s}_{I^{(n)}}\,|\,\vec{y}^{(n)}, \Theta_{I^{(n)}})$ of the truncated spike-and-slab model, while all values $s_h$ with $h \notin I^{(n)}$ are clamped to zero.

Proof. If $I^{(n)}$ denotes the indices of those latents $s_h$ that span the subspace in which the posterior mass of $p(\vec{s}\,|\,\vec{y}^{(n)}, \Theta)$ is concentrated, then these subsets are given by $J^{(n)} = \{\vec{s} \in \mathbb{R}^H \,|\, \vec{s}_{H \backslash I^{(n)}} = \vec{0}\}$, i.e., $\delta(\vec{s} \in J^{(n)})$ can be rewritten as $\prod_{h \notin I^{(n)}} \delta(s_h = 0)$. Considering (5), we can therefore set the corresponding values $\vec{s}_{H \backslash I^{(n)}} = \vec{0}$. We now drop the superscript $n$ for readability and first derive:

p(\vec{s}_I, \vec{s}_{H\backslash I} = \vec{0}, \vec{y}\,|\,\Theta) = \mathcal{N}(\vec{y};\, W_I \vec{s}_I + W_{H\backslash I}\,\vec{0},\, \vec{\sigma}) \Big[ \prod_{h \in I} \mathrm{Bern}(b_h; \pi)\, \mathcal{N}(z_h; \mu_h, \psi_h) \Big] \Big[ \prod_{h \notin I} \mathrm{Bern}(b_h = 0; \pi)\, \mathcal{N}(z_h; \mu_h, \psi_h) \Big]
= p(\vec{s}_I, \vec{y}\,|\,\Theta_I)\, U(\vec{s}_{H\backslash I} = \vec{0}, \Theta), \quad \text{with } U(\vec{s}_{H\backslash I}, \Theta) = p(\vec{s}_{H\backslash I}\,|\,\Theta_{H\backslash I}),

i.e., the joint with $\vec{s}_{H\backslash I} = \vec{0}$ is given by the joint of the truncated model multiplied by a term not depending on $\vec{s}_I$, such that:

q(\vec{s}; \Theta) = \frac{p(\vec{s}_I, \vec{s}_{H\backslash I} = \vec{0}, \vec{y}\,|\,\Theta)\, \delta(\vec{s} \in J)}{\int_{\vec{s}' \in J} p(\vec{s}'_I, \vec{s}'_{H\backslash I} = \vec{0}, \vec{y}\,|\,\Theta)\, d\vec{s}'} = \frac{p(\vec{s}_I, \vec{y}\,|\,\Theta_I)\, U(\vec{s}_{H\backslash I} = \vec{0}, \Theta)\, \delta(\vec{s} \in J)}{\int p(\vec{s}'_I, \vec{y}\,|\,\Theta_I)\, d\vec{s}'\ U(\vec{s}_{H\backslash I} = \vec{0}, \Theta)}
= \frac{p(\vec{s}_I, \vec{y}\,|\,\Theta_I)}{\int p(\vec{s}'_I, \vec{y}\,|\,\Theta_I)\, d\vec{s}'} \prod_{h \notin I} \delta(s_h = 0) = p(\vec{s}_I\,|\,\vec{y}, \Theta_I) \prod_{h \notin I} \delta(s_h = 0).   (7)

Following the proof, Proposition 1 applies for any generative model $p(\vec{s}, \vec{y}\,|\,\Theta)$ for which $p(\vec{s}_I, \vec{s}_{H\backslash I} = \vec{0}, \vec{y}\,|\,\Theta) = p(\vec{s}_I, \vec{y}\,|\,\Theta_I)\, U(\vec{s}_{H\backslash I} = \vec{0}, \vec{y}, \Theta)$ holds. This includes a large class of models such as linear and non-linear spike-and-slab models, and potentially hierarchical models such as SBNs. Proposition 1 does not apply in general, however (we exploit specific model properties).
Sampling.
In previous work [7], posteriors for spike-and-slab sparse coding have been evaluated exhaustively within selected $I^{(n)}$, which caused scalability to be strongly limited by the dimensionality of $I^{(n)}$. Based on Proposition 1, we can now overcome this bottleneck by using sampling approximations within the subspaces $J^{(n)}$, and we have shown that such sampling is equivalent to sampling w.r.t. a much lower-dimensional spike-and-slab model. The dimensionality of $J^{(n)}$ is still non-trivial, however, and we use a Markov chain Monte Carlo (MCMC) approach, namely Gibbs sampling, for efficient scalability. Following Proposition 1 we derive a sampler for the spike-and-slab model (1) to (2) and later apply it for the needed (low) dimensionality.
While the result of sampling from posteriors of truncated models applies for a broad class of spike-and-slab models (Proposition 1), we can here exploit a further specific property of the model (1) to (2). As has previously been observed and exploited in different contexts [8, 12, 17], the Gaussian slab and the Gaussian noise model can be combined using Gaussian identities such that integrals over the continuous latents $\vec{z}$ are solvable analytically. Here we can use this observation for the derivation of a Gibbs sampler. For this we first devise a latent variable Markov chain such that its target density is given by the following conditional posterior distribution:

p(s_h\,|\,\vec{s}_{H\backslash h}, \vec{y}, \theta) \propto p(s_h\,|\,\theta) \prod_d p(y_d\,|\,s_h, \vec{s}_{H\backslash h}, \theta) = \Big[ (1-\pi)\, \tilde{\delta}(s_h) + \pi\, \mathcal{N}(s_h; \mu_h, \psi_h^2) \Big] \prod_d \mathcal{N}(s_h; \nu_d, \phi_d^2),   (8)

where $\tilde{\delta}(\cdot)$ is the Dirac delta representing the spike at zero, and where $\nu_d = (y_d - \sum_{h' \neq h} W_{dh'} s_{h'}) / W_{dh}$ and $\phi_d^2 = \sigma_d^2 / W_{dh}^2$. Using Gaussian identities we obtain:

p(s_h\,|\,\vec{s}_{H\backslash h}, \vec{y}, \theta) \propto \Big[ (1-\pi)\, \mathcal{N}(s_h; \upsilon, \phi^2)\, \tilde{\delta}(s_h) + \pi\, \mathcal{N}(s_h; \tau, \omega^2) \Big],   (9)

where $\upsilon = \phi^2 \sum_d \nu_d / \phi_d^2$ and $\phi^2 = (\sum_d 1/\phi_d^2)^{-1}$, whereas $\tau = \omega^2 (\upsilon/\phi^2 + \mu_h/\psi_h^2)$ and $\omega^2 = (1/\phi^2 + 1/\psi_h^2)^{-1}$. We can observe that the conditional posterior (9) of $s_h$ retains the form of a spike-and-slab distribution. We can therefore simply compute the cumulative distribution function (CDF) of (9) to simulate $s_h$ from the exact conditional distribution ($s_h \sim p(s_h\,|\,\vec{s}_{H\backslash h}, \vec{y}, \theta)$) by means of inverse transform sampling.
Selecting. The Gibbs sampler can now be applied to generate posterior samples for a truncated spike-and-slab model (defined using parameters $\Theta_{I^{(n)}}$). We also obtain a valid approximation, of course, without selection ($I = \{1, \ldots, H\}$), but MCMC samplers in very high-dimensional spaces with complex posterior structure are known to be challenging (convergence to target distributions can be very slow). The problems typically increase superlinearly with hidden dimensionality, but for intermediate dimensions a Gibbs sampler can be very fast and accurate.

Algorithm 1: Select-and-sample for spike-and-slab sparse coding (S5C)
  init $\Theta$;
  repeat
    for $(n = 1, \ldots, N)$ do
      for $(h = 1, \ldots, H)$ do
        compute $S_h(\vec{y}^{(n)})$ as in (10);
      define $I^{(n)}$ as in (11);
      for $(m = 1, \ldots, M)$ do
        draw $\vec{s}^{(m)}_{I^{(n)}} \sim p(\vec{s}_{I^{(n)}}\,|\,\vec{y}^{(n)}, \Theta_{I^{(n)}})$ using (9);
      compute $\langle f(\vec{s}) \rangle_n = \frac{2}{M} \sum_{m=M/2+1}^{M} f(\vec{s}^{(m)})$;
    compute M-step with arguments $f(\vec{s})$ as in (3) and (4);
  until $\Theta$ has converged;
(Illustration of general application.)
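For illustration, a minimal sketch of one coordinate update implied by Eqs. (8) and (9). The paper simulates $s_h$ via the CDF of (9) and inverse transform sampling; this sketch instead first draws the spike/slab choice and then the slab value, with mixture weights obtained from the standard Gaussian product identity (our derivation, and the function name is ours):

```python
import numpy as np

def gibbs_update_sh(h, s, y, W, pi, mu, psi2, sigma2, rng):
    """One Gibbs step for s_h given all other coordinates (cf. Eqs. (8)-(9)).

    The conditional is again spike-and-slab: a point mass at zero plus a
    Gaussian slab N(tau, omega^2)."""
    resid = y - W @ s + W[:, h] * s[h]          # y_d - sum_{h' != h} W_dh' s_h'
    # Combine the D likelihood Gaussians into N(s_h; upsilon, phi^2);
    # this form avoids dividing by W_dh (robust when W_dh = 0)
    prec = np.sum(W[:, h] ** 2 / sigma2)        # 1 / phi^2
    phi2 = 1.0 / prec
    upsilon = phi2 * np.sum(W[:, h] * resid / sigma2)
    # Combine with the slab prior N(mu_h, psi_h^2): slab posterior N(tau, omega^2)
    omega2 = 1.0 / (1.0 / phi2 + 1.0 / psi2[h])
    tau = omega2 * (upsilon / phi2 + mu[h] / psi2[h])

    def normpdf(x, m, v):
        return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

    w_spike = (1.0 - pi) * normpdf(0.0, upsilon, phi2)          # evidence for s_h = 0
    w_slab = pi * normpdf(upsilon, mu[h], psi2[h] + phi2)       # evidence for active s_h
    p_slab = w_slab / (w_spike + w_slab)
    if rng.random() < p_slab:
        return tau + np.sqrt(omega2) * rng.standard_normal()
    return 0.0
```

Cycling this update over the $H' = \dim(I^{(n)})$ coordinates of the truncated model yields the posterior samples required by Alg. 1.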
Using subspaces $J^{(n)}$ with intermediate dimensionality therefore results in very efficient and accurate sampling approximations within these spaces. An overall very accurate approximation is then obtained if the subspaces are well selected and if they do contain the large majority of posterior mass. Using exact EM, it was indeed previously shown for spike-and-slab sparse coding [7] that almost all posterior mass, e.g., for naturally mixed sound sources, is concentrated in collections of low-dimensional subspaces (also compare [18]). To define a subspace $J^{(n)}$ given a data point $\vec{y}^{(n)}$, we follow earlier approaches [15, 14] and first define an efficiently computable selection function to choose those latents that are the most likely to have generated the data point. We use the selection function in [7], which is given by:

S_h(\vec{y}^{(n)}, \Theta) = \prod_d \mathcal{N}\big(y_d^{(n)};\ W_{dh}\mu_h,\ \sigma_d^2 + W_{dh}^2 \psi_h^2\big) \ \propto\ p(\vec{y}^{(n)}\,|\,\vec{b} = \vec{b}^h, \Theta),   (10)

where $\vec{b}^h$ represents a singleton state with only component $h$ being equal to one. The subsets are then defined as follows:

I^{(n)} \text{ is the set of } H' \text{ indices such that } \forall h \in I^{(n)}\ \forall h' \notin I^{(n)}:\ S_h(\vec{y}^{(n)}, \Theta) > S_{h'}(\vec{y}^{(n)}, \Theta).   (11)

We then use $J^{(n)} = \{\vec{s}\,|\,\vec{s}_{H \backslash I^{(n)}} = \vec{0}\}$ as above. In contrast to previous approaches with $H'$ typically $< 10$, $H'$ can be chosen relatively large here because the Gibbs sampler is still very efficient and precise for $H' > 10$ (we will go up to $H' = 40$).
By combining the selection procedure and the Gibbs sampler using Proposition 1, we obtain the efficient approximate EM algorithm summarized in Alg. 1. It will be referred to as S5C (see Alg.
1 caption). Note that we will, for all experiments, always discard the first half of the drawn samples as burn-in.

4 Numerical Experiments

In all the experiments, the initial values of $\pi$ were drawn from a uniform distribution on the interval $[0.1, 0.5]$ (i.e., intermediately sparse), $\vec{\mu}$ was initialized with normally distributed random values, $\psi_h$ was set to 1, and $\sigma_d$ was initialized with the standard deviation of $y_d$. The elements of $W$ were drawn iid from a normal distribution with zero mean and a standard deviation of 5.0. We used a multi-core parallelized implementation and executed the algorithm on up to 1000 CPU cores.
Verification of functional accuracy. We first investigate the accuracy and convergence properties of our method on ground-truth data which was generated by the spike-and-slab data model (1) and (2) itself. We used $H = 10$ hidden variables and $D = 5 \times 5$ observed dimensions, and generative fields $\vec{W}_h$ in the form of five horizontal and five vertical bars. As is customary for such bars-like data (e.g., [15] and cites therein), we take each field to contribute to a data point with probability $\pi = \frac{2}{H}$. We then randomly make each of the 5 vertical and 5 horizontal bars positive or negative by assigning them a value of 5 or $-5$, while the non-bar pixels are assigned zero value. The parameters of the latent slabs $\mu_h$ and $\psi_h$ are set to 0.0 and 1.0, respectively, and we set the observation noise to $\sigma_d = 1.0$. We generate $N = 5000$ data points with this setting (see Fig. 1A for examples).

Figure 1: Functional accuracy of S5C. A Artificial ground-truth data. B Likelihoods during learning (Alg. 1) for different $H'$. C Denoising performance of S5C on the 'house' benchmark as used for other methods (MTMKL [8], K-SVD [4], Beta process [11] and GSC-ET [7]). Bold values highlight the best performing algorithm. *Value not bold-faced as noise variance is assumed known a-priori [4]. D Top: Noisy image with $\sigma = 25$. Bottom: State-of-the-art denoising result after S5C was applied.

We apply the S5C algorithm (Alg. 1) with $H = 10$ latents and $M = 40$ samples per data point and use two settings for preselection: (A) no preselection ($H' = H = 10$) and (B) subspace preselection using $H' = 5$. We did ten runs per setting using different initializations per run as above. For setting (A), i.e. pure Gibbs sampling, the algorithm recovered the generating bars in 2 of the 10 runs after 150 EM iterations. For setting (B) convergence was faster, and in 9 of the 10 runs all bars were recovered after 50 EM iterations. Fig. 1B shows the likelihoods during learning for all 20 runs (likelihoods are still tractable for $H = 10$). These empirical results show the same effect for a continuous latent variable model as was previously reported for non-continuous latents [19, 20]: preselection helps to avoid local optima (presumably because poor non-sparse solutions are destabilized by subspace selection).
After having verified the functioning of S5C on artificial data, we turned to verifying the approach on a denoising benchmark, which is standard for sparse coding. We applied S5C to a noisy "house" image [following 11, 4, 8, 7]. We used three different levels of added Gaussian noise ($\sigma = 15, 25, 50$). For each setting we extract $8 \times 8$ patches from the $256 \times 256$ noisy image, visiting a whole grid of $249 \times 249$ positions by shifting (vertically and horizontally) 1 pixel at a time. In total we obtained $N = 62{,}001$ overlapping image patches as data points. We applied the S5C algorithm with $H = 256$, selected subspaces with $H' = 40$, and used $M = 100$ samples per subspace. Fig. 1C,D show the obtained results and a comparison to alternative approaches.
As can be observed, S5C is competitive with other approaches and results in higher peak signal-to-noise ratios (PSNRs) (see [7] for details) than, e.g., K-SVD or factored variational EM approaches (MTMKL) for $\sigma = 25$ and 50. Even though S5C uses the additional sampling approximation in the selected subspaces, it is also competitive with GSC-ET [7], which is less efficient as it sums exhaustively within subspaces. For $\sigma = 25$, S5C even outperforms GSC-ET, presumably because S5C allows for selecting larger subspaces. In general we observed increased improvement with the number of samples, but improvements with $H$ saturated after about $H = 256$.
Large-scale application and V1 encoding. Since sparse coding was first suggested as a coding model for primary visual cortex [21], a main goal has been its application to very high latent dimensions because V1 is believed to be highly overcomplete [1]. Furthermore, for very large hidden dimensions, non-standard generative fields were observed [1], a finding which is of significant relevance for the ongoing debate of how and where increasingly complex structures in the visual system may be processed. Here we applied S5C with $H = 10\,000$ hidden dimensions to demonstrate scalability of the method, and to study highly-overcomplete V1 encoding based on a posterior approximation capturing rich structure. For our application we used the standard van Hateren database [22], extracted $N = 10^6$ image patches of size $16 \times 16$, and applied pseudo-whitening following [21]. We applied S5C for 50 EM iterations to the data using $H' = 20$ dimensional subspaces and $M = 50$ samples per data point. After learning we observed a large number of generative fields specialized to image components. As in recent large-scale applications of standard sparse coding [1], we found fields that did not specialize (about 1% in [1] and about 12% for S5C). The higher percentage for S5C may be due to the five-fold higher dimensionality used here. For the fields specialized to components, we observed a large number of Gabor-like fields including ridgelets and gratings (names follow [1]). Furthermore, we observed globular fields that have been observed experimentally [23] and are the subject of a number of recent theoretical studies (e.g., [14, 3]). Notably, we also observed a number of curved fields and fields sensitive to corner-like structures (Fig. 2 shows some examples). Curved fields have so far only been described to emerge from sparse coding once before [1], and for convolutional sparse coding in two cases [24, 25] (to the knowledge of the authors), but have been suggested for technical applications much earlier [26] (a link that was not made, so far). Corner-like structures have previously not been observed for sparse coding, presumably because of lower-dimensional latent spaces (also not in [1], but compare convolutional extensions [24, 16, 25]). The numbers of curved (a few hundred) and corner-like fields (a few tens) are small, but we almost exclusively find those fields among the 20% most frequently used fields (we order according to average approximate posterior, see supplement).

Figure 2: Selection of different types of generative fields as learned by S5C using $H = 10{,}000$ latent dimensions (see Suppl. for all fields). Gabor-like fields are the most frequent type (Gabors, ridgelets, gratings), followed by globular fields, curved fields and corner-like fields. We also observed textures other than gratings. Gabors, curved and corner fields were almost all among the 30% most frequently activated fields. Ridgelets, globular fields and gratings were typically among the 30-80% most used fields.
Neural responses to corner-like sensitivities are typically associated with higher-level processing in the visual pathway. Our results may be evidence for such structures to emerge together with, e.g., Gabors for very large latent dimensionality (as expected for V1). In general, the statistics of generative field shapes can be influenced by many factors including preprocessing details, sparsity, local optima, or details of the learning algorithms. However, because of the applied approximation, S5C can avoid the choice of a sparsity penalty that is required for MAP-based approaches [1]. Instead, we statistically infer the sparsity level, which is well interpretable for hard sparsity, and which corresponds for our application to $H\pi = 6.2$ components per patch (also compare [14, 20]). In the supplement we provide the full set of the $H = 10\,000$ learned generative fields.

Figure 3: The y-axis shows the highest reported latent dimensionality for different sparse coding algorithms (cont. latents), and the x-axis the accuracy of posterior approximations. Within each column, entries are ordered (left-to-right) w.r.t. the publication year. 1st column: Sparse coding systems using one latent state for inference (e.g., MAP-like [27, 28, 1] or SAILnet [3] or K-SVD [4, 5]). 2nd: Approximate posteriors in the form of factored variational distributions that can capture multiple modes but assume posterior independence among the latents $s_h$ (MTMKL [8], S3C [6]). 3rd: Sampling-based approximations [11, 12] and truncated approximations (ssMCA [20], GSC-ET [7]) that capture multiple posterior modes and complex latent dependencies. Following [6] we also included ssRBM for comparison. 4th: Full posterior with exact EM [17].

5 Discussion

In this study we have applied a select-and-sample approach [13] to derive and study an approximate EM algorithm applicable to models with very large-scale latent spaces.
Select-and-sample combines sampling with subspace preselection [15, 14] and has previously been applied as a model for neural inference using binary latents [13]. Furthermore, it has been used to overcome analytical intractabilities of a non-linear sparse coding model [20]. Here, for the first time, we apply select-and-sample to scale a standard linear sparse coding model with spike-and-slab prior up to very large hidden dimensions. Spike-and-slab sparse coding is not only more expressive than standard Laplace or binary priors [8, 12, 7, 20] but also results in properties that we can exploit for our approximation. We have thus analytically shown (Proposition 1) that select-and-sample is applicable to a large class of models with hard sparsity (giving justification also to earlier applications [20]).
Empirically, we have, firstly, shown that select-and-sample for spike-and-slab sparse coding (S5C) maintains the functional competitiveness of alternative approaches (Fig. 1). Secondly, we demonstrated efficiency by scaling S5C up to very high-dimensional latent spaces (we go up to 10 000). For comparison, Fig. 3 shows the largest reported latent spaces of different sparse coding approaches depending on the posterior structure that can be captured. Non-probabilistic approaches (e.g., K-SVD [4, 5]) are known to scale relatively well, and, likewise, approaches using MAP approximations [2, 3, 1] have been shown to be applicable to large scales. None of these approaches, however, captures posterior dependencies or multiple posterior modes given a data point. Factored variational approaches can be scaled to very high-dimensional latent spaces and can capture multiple posterior modes. However, no latent dependencies in the posterior are modeled, which has previously been reported to result in disadvantageous behavior (e.g., [29, 7]).
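The select-and-sample scheme discussed above (preselect a small latent subspace per data point, then Gibbs-sample within it) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: it assumes a unit-variance Gaussian slab, uses a simple correlation-based selection score, and runs collapsed Gibbs over the binary spikes with the slab integrated out; all function and variable names are hypothetical.

```python
import numpy as np

def select_and_sample(y, W, pi, sigma2, H_sel=20, n_samples=50, rng=None):
    """Approximate posterior sampling for spike-and-slab sparse coding,
    y ~ N(W @ (s * z), sigma2 * I) with spikes s_h ~ Bern(pi) and unit
    Gaussian slab z_h ~ N(0, 1), via select-and-sample:
      1) preselect the H_sel latents with the highest selection score,
      2) Gibbs-sample the spikes s within that subspace only.
    Illustrative sketch; the selection score and all names are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    D, H = W.shape
    # 1) Selection: score each latent by its correlation with the data point
    #    and keep the H_sel best (a cheap stand-in for a learned selection).
    scores = np.abs(W.T @ y)
    idx = np.argsort(scores)[-H_sel:]           # indices of selected subspace
    Ws = W[:, idx]                              # (D, H_sel) subspace dictionary
    # 2) Collapsed Gibbs over spikes s in the selected subspace. With the
    #    slab integrated out, y | s ~ N(0, sigma2*I + W_s W_s^T).
    s = rng.random(H_sel) < pi                  # random binary initialization
    samples = np.zeros((n_samples, H_sel))
    for t in range(n_samples):
        for h in range(H_sel):
            log_p = np.empty(2)
            for val in (0, 1):                  # try s_h = 0 and s_h = 1
                s[h] = bool(val)
                Wa = Ws[:, s]                   # currently active columns
                Sigma = sigma2 * np.eye(D) + Wa @ Wa.T
                _, logdet = np.linalg.slogdet(Sigma)
                log_lik = -0.5 * (logdet + y @ np.linalg.solve(Sigma, y))
                log_prior = val * np.log(pi) + (1 - val) * np.log(1 - pi)
                log_p[val] = log_lik + log_prior
            p1 = 1.0 / (1.0 + np.exp(log_p[0] - log_p[1]))
            s[h] = rng.random() < p1            # resample spike h
        samples[t] = s
    return idx, samples                         # subspace indices, spike samples
```

The selected indices together with the spike samples form the truncated posterior approximation; expectations needed in the M-step are then estimated as averages over the drawn samples, restricted to the selected subspace.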
In contrast to MAP-based or factored approaches, sampling approaches can model both multiple posterior modes and complex latent dependencies. Some models additionally include a more Bayesian treatment of parameters [11, 12] (also compare [8]), which can be considered more general than the approaches followed in other work (see Fig. 3). The scalability of sampling-based approaches has been limited, however. Among those models capturing the crucial posterior structure, S5C shows, to the knowledge of the authors, the largest-scale applicability. This is even the case if approaches using factored posteriors are included. Notably, little has been reported for very large hidden dimensions for MAP-based or deterministic approaches either (compare, e.g., [5]), although scalability should be less of an issue there. In general it may well be that a method is scalable to larger latent spaces than reported but that such increases do not result in functional benefits.
For probabilistic approaches, the requirement for approximations with high accuracy has been identified also in other very promising work [30, 31], which uses different approaches that have, so far, been applied at much smaller scales. For the select-and-sample method and the spike-and-slab sparse coding model, the high-dimensional applicability means that this or similar approaches are promising candidates for models such as DBNs, SBNs or CNNs because of their close relation to spike-and-slab models and their typically similarly large-scale settings. Here we have studied an application of S5C to standard image patches, primarily to demonstrate scalability. The obtained non-standard generative fields may, however, by themselves be of relevance for V1 encoding (Fig. 2), and they show that spike-and-slab models may be very suitable generalized V1 models.
From a probabilistic view on neural processing, the accuracy that can be provided by select-and-sample inference is very desirable and is consistent, e.g., with sampling-based interpretations of neural variability [32]. Here we have shown that such probabilistic approximations are also functionally competitive and scalable to very large hidden dimensions.

Acknowledgements. We thank E. Guiraud for help with Alg. 1 (illustration) and acknowledge funding by the DFG: Cluster of Excellence EXC 1077/1 (Hearing4all) and grant LU 1196/5-1.

References
[1] B. Olshausen. Highly overcomplete sparse coding. In Proc. SPIE, 8651, 2013.
[2] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[3] J. Zylberberg, J. Murphy, and M. DeWeese. A Sparse Coding Model with Synaptically Local Plasticity and Spiking Neurons Can Account for the Diverse Shapes of V1 Simple Cell Receptive Fields. PLoS Comp. Bio., 7(10):e1002250, 2011.
[4] H. Li and F. Liu. Image denoising via sparse and redundant representations over learned dictionaries in wavelet domain. In ICIG, pages 754–758, 2009.
[5] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In ICML, 2016.
[6] I. J. Goodfellow, A. Courville, and Y. Bengio. Scaling up spike-and-slab models for unsupervised feature learning. TPAMI, 35(8):1902–1914, 2013.
[7] A. S. Sheikh, J. A. Shelton, and J. Lücke. A truncated EM approach for spike-and-slab sparse coding. JMLR, 15:2653–2687, 2014.
[8] M. Titsias and M. Lazaro-Gredilla. Spike and slab variational inference for multi-task and multiple kernel learning. In NIPS, pages 2339–2347, 2011.
[9] G. E. Hinton, B. Sallans, and Z. Ghahramani. A hierarchical community of experts. In Learning in graphical models, pages 479–494.
Springer, 1998.
[10] A. B. Patel, T. Nguyen, and R. G. Baraniuk. A probabilistic theory of deep learning. In NIPS, 2016. In press; preprint arXiv:1504.00641.
[11] M. Zhou, H. Chen, J. Paisley, L. Ren, G. Sapiro, and L. Carin. Non-parametric Bayesian dictionary learning for sparse image representations. In NIPS, pages 2295–2303, 2009.
[12] S. Mohamed, K. Heller, and Z. Ghahramani. Evaluating Bayesian and L1 approaches for sparse unsupervised learning. In ICML, 2012.
[13] J. Shelton, J. Bornschein, A.-S. Sheikh, P. Berkes, and J. Lücke. Select and sample — a model of efficient neural inference and learning. In NIPS, pages 2618–2626, 2011.
[14] G. Puertas, J. Bornschein, and J. Lücke. The maximal causes of natural scenes are edge filters. In NIPS, volume 23, pages 1939–1947, 2010.
[15] J. Lücke and J. Eggert. Expectation truncation and the benefits of preselection in training generative models. JMLR, 11:2855–2900, 2010.
[16] Z. Dai, G. Exarchakis, and J. Lücke. What are the invariant occlusive components of image patches? A probabilistic generative approach. In NIPS 26, pages 243–251, 2013.
[17] J. Lücke and A.-S. Sheikh. Closed-form EM for sparse coding and its application to source separation. In LVA, pages 213–221, 2012.
[18] K. Schnass. Local identification of overcomplete dictionaries. JMLR, 16:1211–1242, 2015.
[19] G. Exarchakis, M. Henniges, J. Eggert, and J. Lücke. Ternary sparse coding. In LVA/ICA, pages 204–212, 2012.
[20] J. A. Shelton, A. S. Sheikh, J. Bornschein, P. Sterne, and J. Lücke. Nonlinear spike-and-slab sparse coding for interpretable image encoding. PLoS ONE, 10:e0124088, 2015.
[21] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature, 381:607–609, 1996.
[22] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265:359–366, 1998.
[23] D. L. Ringach. Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88:455–463, 2002.
[24] P. Jost, P. Vandergheynst, S. Lesage, and R. Gribonval. MoTIF: an efficient algorithm for learning translation invariant dictionaries. In IEEE Int. Conf. Acoustics Speech and Sig. Processing, volume 5, 2006.
[25] J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85–283, 2014.
[26] N. Krüger and G. Peters. Object recognition with banana wavelets. In Eur. Symp. ANNs, 1997.
[27] P. Garrigues and B. A. Olshausen. Learning horizontal connections in a sparse coding model of natural images. In NIPS, 2007.
[28] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proc. ICML, pages 921–928, 2011.
[29] A. Ilin and H. Valpola. On the effect of the form of the posterior approximation in variational learning of ICA models. Neural Processing Letters, 22(2):183–204, 2005.
[30] T. Salimans, D. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
[31] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.
[32] P. Berkes, G. Orban, M. Lengyel, and J. Fiser. Spontaneous Cortical Activity Reveals Hallmarks of an Optimal Internal Model of the Environment.
Science, 331(6013):83–87, January 2011.