{"title": "On Prior Distributions and Approximate Inference for Structured Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 676, "page_last": 684, "abstract": "We present a general framework for constructing prior distributions with structured variables. The prior is defined as the information projection of a base distribution onto distributions supported on the constraint set of interest. In cases where this projection is intractable, we propose a family of parameterized approximations indexed by subsets of the domain. We further analyze the special case of sparse structure. While the optimal prior is intractable in general, we show that approximate inference using convex subsets is tractable, and is equivalent to maximizing a submodular function subject to cardinality constraints. As a result, inference using greedy forward selection provably achieves within a factor of (1-1/e) of the optimal objective value. Our work is motivated by the predictive modeling of high-dimensional functional neuroimaging data. For this task, we employ the Gaussian base distribution induced by local partial correlations and consider the design of priors to capture the domain knowledge of sparse support. Experimental results on simulated data and high dimensional neuroimaging data show the effectiveness of our approach in terms of support recovery and predictive accuracy.", "full_text": "On Prior Distributions and Approximate Inference\n\nfor Structured Variables\n\nOluwasanmi Koyejo\n\nPsychology Dept., Stanford\nsanmi@stanford.edu\n\nRajiv Khanna\n\nECE Dept., UT Austin\n\nrajivak@utexas.edu\n\nJoydeep Ghosh\n\nECE Dept., UT Austin\n\nghosh@ece.utexas.edu\n\nRussell A. Poldrack\n\nPsychology Dept., Stanford\n\npoldrack@stanford.edu\n\nAbstract\n\nWe present a general framework for constructing prior distributions with struc-\ntured variables. 
The prior is defined as the information projection of a base distribution onto distributions supported on the constraint set of interest. In cases where this projection is intractable, we propose a family of parameterized approximations indexed by subsets of the domain. We further analyze the special case of sparse structure. While the optimal prior is intractable in general, we show that approximate inference using convex subsets is tractable, and is equivalent to maximizing a submodular function subject to cardinality constraints. As a result, inference using greedy forward selection provably achieves within a factor of (1 − 1/e) of the optimal objective value. Our work is motivated by the predictive modeling of high-dimensional functional neuroimaging data. For this task, we employ the Gaussian base distribution induced by local partial correlations and consider the design of priors to capture the domain knowledge of sparse support. Experimental results on simulated data and high dimensional neuroimaging data show the effectiveness of our approach in terms of support recovery and predictive accuracy.

1 Introduction

Data in scientific and commercial disciplines are increasingly characterized by high dimensions and relatively few samples. For such cases, a-priori knowledge gleaned from expertise and experimental evidence is invaluable for recovering meaningful models. In particular, knowledge of restricted degrees of freedom such as sparsity or low rank has become an important design paradigm, enabling the recovery of parsimonious and interpretable results, and improving storage and prediction efficiency for high dimensional problems. In Bayesian models, such restricted degrees of freedom can be captured by incorporating structural constraints on the design of the prior distribution. 
Prior distributions for structured variables can be designed by combining conditional distributions, each capturing a portion of the problem structure, into a hierarchical model. In other cases, researchers design special-purpose prior distributions to match the application at hand. In the case of sparsity, an example of the former approach is the spike and slab prior [1, 2], and an example of the latter approach is the horseshoe prior [3].
We describe a framework for designing prior distributions when the a-priori information includes structural constraints. Our framework follows the maximum entropy principle [4, 5]. The distribution is chosen as one that incorporates known information, but is as difficult as possible to discriminate from the base distribution with respect to relative entropy. The maximum entropy approach has been especially successful with domain knowledge expressed as expectation constraints. In such cases, the solution is given by a member of the exponential family [6, 7], e.g. quadratic constraints result in the Gaussian distribution. 
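As a brief worked illustration of this standard exponential-family fact (our own addition, not text from the paper): imposing first and second moment constraints and minimizing relative entropy to a base density p gives, via Lagrange duality,

```latex
\min_{q \in \mathcal{P}} \; \mathrm{KL}(q \,\|\, p)
\quad \text{s.t.} \quad
\mathbb{E}_q[x] = \mu, \;\;
\mathbb{E}_q[x x^\top] = \Sigma + \mu\mu^\top
\;\;\Longrightarrow\;\;
q^*(x) \;\propto\; p(x)\,\exp\!\left( \lambda^\top x - \tfrac{1}{2}\, x^\top \Lambda\, x \right),
```

an exponential-family density; for a flat (improper) base p, i.e. maximizing Shannon entropy, q∗ is exactly the N(μ, Σ) Gaussian with Λ = Σ⁻¹ and λ = Σ⁻¹μ.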
Our work extends this framework to the design of prior distributions when the a-priori information includes domain constraints.
Our main technical contributions are as follows:

• We show that under standard assumptions, the information projection of a base density to domain constraints is given by its restriction (Section 2).
• We show the equivalence between relative entropy inference with data observation constraints and Bayes rule for continuous variables.
• When such restriction is intractable, we propose a family of parameterized approximations indexed by subsets of the domain (Section 2.1).

We consider approximate inference in the special case of sparse structure:

• We characterize the restriction precisely, showing that it is given by a conditional distribution (Section 3).
• We show that the approximate sparse support estimation problem is submodular. As a result, greedy forward selection is efficient and guarantees (1 − 1/e) factor optimality (Section 3.1).

Our work is motivated by the predictive modeling of high-dimensional functional neuroimaging data, measured by cognitive neuroscientists for analyzing the human brain. The data are represented using hundreds of thousands of variables. Yet due to real world constraints, most experimental datasets contain only a few data samples [8]. The proposed approach is applied to predictive modeling of simulated data and high-dimensional neuroimaging data, and is compared to Bayesian hierarchical models and non-probabilistic sparse predictive models, showing superior support recovery and predictive accuracy (Section 4). Due to space constraints, all proofs are provided in the supplement.

1.1 Preliminaries

This section includes notation and a few basic definitions. Vectors are denoted by lower case x and matrices by capital X. xi,j denotes the (i, j)th entry of the matrix X. 
xi,: denotes the ith row of X and x:,j denotes the jth column. Let |X| denote the determinant of X. Sets are denoted by sans serif e.g. S. The reals are denoted by R. [n] denotes the set of integers {1, . . . , n}, and ℘(n) denotes the power set of [n]. Let X be either a countable set, or a complete separable metric space equipped with the standard Borel σ-algebra of measurable sets. Let P denote the set of probability densities on X. For the remainder of this paper, we make the following assumption:
Assumption 1. All distributions P are absolutely continuous with respect to the dominating measure ν, so there exists a density p ∈ P that satisfies dP = p dν.
To simplify notation, we use the standard dν = dx. We also assume that all densities are bounded. As a consequence of Assumption 1, the relative entropy is given in terms of the densities as:

KL(q‖p) = ∫_X q(x) log ( q(x) / p(x) ) dx.

The relative entropy is strictly convex with respect to its first argument. The information projection of a probability density p to a constraint set A is given by the solution of:

inf_{q∈P} KL(q‖p) s.t. q ∈ A.

We will only consider projections where A is a closed convex set, so the infimum is achieved. The delta functional, denoted by δ(·), is a generalized set functional that satisfies ∫_X δA(x)f(x)dx = ∫_A f(x)dx and ∫_X δA(x)dx = 1, for some A ⊆ X. The set of domain restricted densities, denoted by FA for A ⊂ X, is the set of probability density functions supported on A, i.e. FA = {q ∈ P | q(x) = 0 ∀ x ∉ A} ∪ {δ{x} ∀ x ∈ A} ⊂ P = FX. Further, note that FA is closed and convex for any A ⊆ X (including nonconvex A).

Restriction is a standard approach for defining distributions on subsets A ⊆ X. 
An important special case we will consider is when A is a measure zero subset of X. The common conditional density is one such example, the existence of which follows from the disintegration theorem [9]. Restrictions of measure require extensive technical tools in the general case [10]. We will employ the following simplifying condition for the remainder of this manuscript:
Condition 2. The sample space X is a subset of Euclidean space with ν given by the Lebesgue measure. Alternatively, X is a countable set with ν given by the counting measure.

Let P be a probability distribution on X. Under Assumption 1 and Condition 2, the restriction of the density p to the set A ⊂ X, if it exists, is given by:

q(x) = p(x) / ∫_A p(x)dx  if x ∈ A,  and  q(x) = 0  otherwise.

2 Priors for structured variables

We assume a-priori information identifying the structure of X via the sub-domain A ⊂ X. We also assume a pre-defined base distribution P with associated density p. Without loss of generality, let p have support everywhere¹ on X, i.e. p(x) > 0 ∀ x ∈ X. Following the principle of minimum discrimination information, we select the prior as the information projection of the base density p to FA. Our first result identifies the equivalence between information projection subject to domain constraints and density restriction.
Theorem 3. Under Condition 2, the information projection of the density p to the constraint set FA, if it exists, is the restriction of p to the domain A.

Theorem 3 gives principled justification for the domain restriction approach to structured prior design. Examples of density restriction in the literature include the truncated Gaussian, Beta and Gamma densities [11], and the restriction of the matrix-variate Gaussian to the manifold of low rank matrices [12]. 
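To make Theorem 3 concrete, here is a small self-contained sketch (our own toy example with a counting-measure base, per Condition 2's countable case; the code and names `restrict` and `kl` are ours, not from the paper): restricting a discrete density to a subset A and renormalizing attains the projection value KL(q∗‖p) = −log P(A).

```python
import math

def restrict(p, A):
    """Restriction of a discrete density p (dict: x -> p(x)) to the set A:
    zero outside A, renormalized by Z = P(A) (cf. Theorem 3)."""
    Z = sum(px for x, px in p.items() if x in A)
    return {x: (px / Z if x in A else 0.0) for x, px in p.items()}

def kl(q, p):
    """Relative entropy KL(q || p) for discrete densities."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)

p = {0: 0.5, 1: 0.3, 2: 0.2}   # base density on X = {0, 1, 2}
A = {0, 1}                     # constraint set
q_star = restrict(p, A)        # information projection of p onto F_A
assert abs(q_star[0] - 0.625) < 1e-12 and q_star[2] == 0.0
# the attained projection value is -log P(A):
assert abs(kl(q_star, p) - (-math.log(0.8))) < 1e-12
```

Any other density supported on A (e.g. the uniform density on {0, 1}) gives a strictly larger KL, consistent with the strict convexity noted in the preliminaries.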
Various properties of the restriction, such as its shape and tail behavior (up to re-scaling), follow directly from the base density. Thus the properties of the resulting prior are more amenable to analysis when the base measure is well understood. Next, we consider a corollary of Theorem 3 that was introduced by Williams [13].
Corollary 4. Consider the product space X = W × Y. Let the domain constraint be given by W × {ŷ} for some ŷ ∈ Y. Under Condition 2, the information projection of p to FW×{ŷ} is given by p(w|ŷ)δŷ.
In the Bayesian literature, p(w) is known as the prior, p(y|w) is the likelihood and p(w|ŷ) is the posterior density given the observation y = ŷ. Corollary 4 considers the information projection of the joint density p(w, y) given observed data, and shows that the solution recovers the Bayesian posterior. Williams [13] considered a generalization of Corollary 4, but did not consider projection to data constraints². While Corollary 4 has been widely applied in the literature e.g. [14], to the best of our knowledge, the presented result is the first formal proof.

2.1 Approximate inference for structured variables via tractable subsets

For many structural constraints of interest, restriction requires the computation of an intractable normalization constant. In theory, rejection sampling and Markov Chain Monte Carlo (MCMC) inference methods [15] do not require normalized probabilities. However, as many structured sub-domains are measure zero sets with respect to the dominating measure, random samples generated from the base distribution are unlikely to lie in the constrained domains, e.g. random samples from a multivariate Gaussian are not sparse. Hence rejection sampling fails, and MCMC suffers from low acceptance probabilities. As a result, inference on such structured sub-domains typically requires specialized methods e.g. [11, 12]. 
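A quick numerical illustration of why rejection sampling fails here (our own sketch, not code from the paper): draws from a continuous base density essentially never satisfy an exact sparsity constraint, so every proposal is rejected.

```python
import random

random.seed(0)
d, k = 10, 2          # dimension and target sparsity (illustrative values)
accepted = 0
for _ in range(10_000):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    # exact membership test for the k-sparse constraint set
    if sum(1 for xi in x if xi != 0.0) <= k:
        accepted += 1
print(accepted)       # 0: the constraint set has measure zero under the base
```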
In the following, we propose a class of variational approximations based on an inner representation of the structured subdomain. Let {Si ⊆ A} represent a (possibly overlapping) partitioning of A into subsets. We define the domain restricted density sets generated by these partitions as FSi, and their union D = ∪ FSi. Note that by definition each FSi ⊆ D ⊆ FA ⊆ FX. Our approach is to approximate the optimization over densities in FA by optimizing over D, a smaller subset of tractable densities.

¹When this condition is violated, we simply redefine X as the subdomain supporting p.
²Specifically, Williams [13] noted “Relative information has been defined only for unconditional distributions, which say nothing about the relative probabilities of events of probability zero.”

Figure 1: (a) Gaussian density and restriction to diagonal line shown. (b) Illustration of Theorem 5; the sequence of information projections P → FA → FC and P → FA∩C are equivalent.

Approximate inference is generally most successful when the approximation accounts for observed data. Inspired by the results of Corollary 4, we consider such a projection. Let pA(w, y) be the information projection of the joint distribution p(w, y) to the set FA×{ŷ}. We propose approximate inference via the following rule:

pS∗,ŷ = arg min_{q ∈ D×F{ŷ}} KL(q(w, y) ‖ pA(w, y)) = arg min_S [ min_{q ∈ FS×{ŷ}} KL(q(w, y) ‖ pA(w, y)) ].   (1)

Our proposed approach may be decomposed into two steps. The inner step is solved by estimating a parameterized set of prior densities {qS} corresponding to choices of S, and the outer step is solved by the selection of the optimal subset S∗. 
The solution is given by pS∗,ŷ(w, y) = pS∗(w|ŷ)δŷ (Corollary 4), with the associated approximate posterior given by pS∗(w|ŷ).
The following theorem considers the effect of a sequence of domain constrained information projections (see Fig. 1b), which will be useful for subsequent results.
Theorem 5. Let π : [n] ↦ [n] be a permutation function and {Cπ(i) | Cπ(i) ⊂ X} represent a sequence of sets with non-empty intersection B = ∩ Ci ≠ ∅. Given a base density p, let q0 = p, and define the sequence of information projections:

qi = arg min_{q ∈ FCπ(i)} KL(q ‖ qi−1).

Under Condition 2, q∗ = qn is independent of π. Further, q∗ = arg min_{q ∈ FB} KL(q ‖ p).

We apply Theorem 5 to formulate equivalent solutions of (1) that may be simpler to solve.
Corollary 6. Let pS∗,ŷ(w, y) be the solution of (1), then the posterior distribution pS∗(w|ŷ) is given by:

pS∗(w|ŷ) = arg min_{q ∈ D} KL(q(w) ‖ pA(w|ŷ)) = arg min_{q ∈ D} KL(q(w) ‖ p(w|ŷ)).   (2)

Corollary 6 implies that we can estimate the approximate structured posterior directly as the information projection of the unstructured posterior distribution p(w|ŷ). Upon further examination, Corollary 6 also suggests that the proposed approximation is most useful when there exist subsets of A such that the restriction of the base density to each subset leads to tractable inference. Further, the result is most accurate when one of the subsets S∗ ⊆ A captures most of the posterior probability mass. When the optimal subset S∗ is known, the structured prior density associated with the structured posterior can be computed as shown in the following corollary.
Corollary 7. Let pS∗,ŷ(w, y) be the solution of (1). 
Define the density pS∗(w) as:

pS∗(w) = arg min_{q ∈ FS∗} KL(q(w) ‖ pA(w)) = arg min_{q ∈ FS∗} KL(q(w) ‖ p(w)),   (3)

then pS∗(w) is the prior distribution corresponding to the Bayesian posterior pS∗(w|ŷ).

3 Priors for sparse structure

We now consider a special case of the proposed framework for sparse structured variables. A d dimensional variable x ∈ X is k-sparse if d − k of its entries take a default value of ci, i.e. |{i | xi = ci}| = d − k. In Euclidean space X = Rd, and in most cases ci = 0 ∀ i. Similarly, the distribution P on the domain X is k-sparse if all random variables X ∼ P are at most k-sparse. The support of x ∈ X is the set supp(x) = {i | xi ≠ ci} ∈ ℘(d). Let S ⊂ X denote the set of variables with support s, i.e. S = {x ∈ X s.t. supp(x) = s}. We will use the notation xS = {xi | i ∈ s}, and its complement xS′ = {xi | i ∈ s′}, where s′ = [d]\s. The domain of k sparse vectors is given by the union of all d!/((d−k)!k!) possible sparse support sets as A = ∪ Si. While the sparse domain A is non-convex, each subset S is a convex set, in fact given by linear subspaces with basis {ei | i ∈ s}. Further, while the information projection of a base density p to A is generally intractable, the information projection to its convex subsets S turns out to be computationally tractable. We investigate the application of the proposed approximation scheme using these subsets.
Consider the information projection of an arbitrary probability measure P with density³ p to the set D = ∪ FSi, given by:

min_{q ∈ D} KL(q ‖ p) = min_{S ∈ {Si}} [ min_{q ∈ FS} KL(q ‖ p) ] = min_{S ∈ {Si}} KL(pS ‖ p).

Applying Theorem 3, we can compute that pS = p(x)δS(x)/Z, where Z is a normalization factor:

Z = ∫_S p(x) = ∫_X p(xS, xS′)δS(x) = ∫_X p(xS|xS′)p(xS′)δS(x) = p(xS′ = cS′).

Thus, the normalization factor is a marginal density at xS′ = cS′. We may now compute the restriction explicitly:

pS(x) = p(xS|xS′)p(xS′)δS(x) / p(xS′ = cS′) = p(xS|xS′ = cS′)δS(x).   (4)

In other words, the information projection to a sparse support domain is the density of xS conditioned on xS′ = cS′. The resulting gap is:

KL(pS ‖ p) = ∫_S pS(x) log ( pS(x) / p(x) ) dx = − log p(xS′ = cS′),

since pS(x)/p(x) = δS(x)/p(xS′ = cS′) on S. Thus, for a given target sparsity k, we solve:

s∗ = arg max_{|s|=k} J(s), where J(s) = log p(xS′ = cS′).   (5)

3.1 Submodularity and Efficient Inference

In this section, we show that the cost function J(s) is monotone submodular, and describe the greedy forward selection algorithm for efficient inference. Let F : ℘(d) ↦ R represent a set function. F is normalized if F(∅) = 0. A bounded F can be normalized as F̃(s) = F(s) − F(∅) with no effect on optimization. 
F is monotonic if for all subsets u ⊆ v ⊆ [d] it holds that F(u) ≤ F(v). F is submodular if for all subsets u, v ⊆ [d] it holds that F(u ∪ v) + F(u ∩ v) ≤ F(u) + F(v). Submodular functions have a diminishing returns property [16], i.e. the marginal gain of adding elements decreases with the size of the set.
Theorem 8. Let J : ℘(d) ↦ R, J(s) = log p(xS′ = cS′), and define J̃(s) = J(s) − J(∅), then J̃(s) is normalized and monotone submodular.

While constrained maximization of submodular functions is generally NP-hard, a simple greedy forward selection heuristic has been shown to perform almost as well as the optimal in practice, and is known to have strong theoretical guarantees.
Theorem 9 (Nemhauser et al. [16]). In the case of any normalized, monotonic submodular function F, the set s∗ obtained by the greedy algorithm achieves at least a constant fraction (1 − 1/e) of the objective value obtained by the optimal solution, i.e. F(s∗) ≥ (1 − 1/e) max_{|s|≤k} F(s).

In addition, no polynomial time algorithm can provide a better approximation guarantee unless P = NP [17]. An additional benefit of the greedy approach is that it does not require the decision of the support size k to be made at training time. As an anytime algorithm, training can be stopped at any k based on computational constraints, while still returning meaningful results. An interesting special case occurs when the base density takes a product form.
Corollary 10. Let J(s) be defined as in Theorem 8 and suppose the base density is of product form, i.e. p(x) = ∏_{i=1}^d p(xi), then J(s) is linear.

In particular, define h = {p(xi = 0) ∀ i ∈ [d]}, then the solution of (5) is given by the set of dimensions associated with the smallest k values of h.

³Where p may represent the conditional densities as in Section 2.1. To simplify the discussion, we suppress the dependence on ŷ.

4 Experiments

We present experimental results comparing the proposed sparse approximate inference projection to other sparsity inducing models. We performed experiments to test the models' ability to estimate the support of the reconstructed targets and the predictive regression accuracy. The regression accuracy was measured using the coefficient of determination R2 = 1 − ∑(ŷ − y)² / ∑(y − ȳ)², where y is the target response with sample mean ȳ and ŷ is the predicted response. R2 measures the gain in predictive accuracy compared to a mean model and has a maximum value of 1. The support recovery was measured using the AUC of the recovered support with respect to the true s∗.
The baseline models are: (i) regularized least squares (Ridge), (ii) least absolute shrinkage and selection (Lasso) [18], (iii) automatic relevance determination (ARD) [19], and (iv) Spike and Slab [1, 2]. Ridge and Lasso were optimized using implementations from the scikit-learn python package [20]. While Ridge does not return sparse weights, it was included as a baseline for regression performance. We implemented ARD using iterative re-weighted Lasso as suggested by Wipf and Nagarajan [19]. The noise variance hyperparameters for Ridge and ARD were selected from the set 10^{−4,−3,...,4}. Lasso was evaluated using the default scikit-learn implementation, where the hyperparameter is selected from 100 logarithmically spaced values based on the maximum correlation between the features and the response. For each of these models, the hyperparameter was selected in an inner 5-fold cross validation loop. 
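The greedy maximization of (5) from Section 3.1 can be sketched as follows (our own illustrative code, not the paper's implementation, for a zero-mean Gaussian base with c = 0; the names `log_marginal_at_zero` and `greedy_support` are ours): J(s) = log p(x_{s′} = 0) is a Gaussian marginal evaluated at zero, and each greedy step adds the coordinate whose inclusion most increases J.

```python
import numpy as np

def log_marginal_at_zero(Sigma, comp):
    """J(s) = log N(x_{s'} = 0; 0, Sigma_{s's'}) for a zero-mean Gaussian base,
    where comp lists the complement s' of the candidate support s."""
    sub = Sigma[np.ix_(comp, comp)]
    sign, logdet = np.linalg.slogdet(sub)
    return -0.5 * (len(comp) * np.log(2 * np.pi) + logdet)

def greedy_support(Sigma, k):
    """Greedy forward selection for eq. (5) under cardinality constraint k."""
    d, s = Sigma.shape[0], []
    for _ in range(k):
        rest = [i for i in range(d) if i not in s]
        # gain of adding i: J evaluated with i removed from the complement
        gains = [log_marginal_at_zero(Sigma, [j for j in rest if j != i])
                 for i in rest]
        s.append(rest[int(np.argmax(gains))])
    return s

Sigma = np.diag([1.0, 9.0, 0.25, 4.0])   # product-form (diagonal) base
print(greedy_support(Sigma, 2))          # [1, 3]: largest-variance coordinates
```

For this diagonal base the result matches Corollary 10: the selected coordinates are exactly those with the smallest marginal density p(xi = 0), i.e. the largest variances.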
For speed and scalability, we used a publicly available implementation of Spike and Slab [21], which uses a mean field variational approximation. In addition to the weights, Spike and Slab estimates the probability that each dimension is non-zero. As Spike and Slab does not return sparse estimates, sparsity was estimated by thresholding this posterior at 0.5 for each dimension (SpikeSlab0.5); we also tested the full spike and slab posterior prediction for regression performance alone (SpikeSlabFull).
The proposed projection approach is designed to be applicable to any probabilistic model. Thus, we applied the projection approach as additional post-processing for the two Bayesian model baselines. The first method is a projection of the standard Gaussian regression posterior (Sparse-G) (more details in the supplement). The second is a projection of the spike and slab approximate posterior (SpikeSlabKL). We note that since the spike and slab approximate posterior uses the mean field approximation, the posterior distribution is in product form and the projection is straightforward using Corollary 10. Support size selection: The selection of the hyperparameter k, specifying the sparsity, can be solved by standard model selection routines such as cross-validation. We found that support size selection using sequential Bayes factors [22] was particularly effective, thus the support size was selected as the first k where log p(y|Sk+1) − log p(y|Sk) < ε.

4.1 Simulated Data

We generated random high dimensional feature vectors ai ∈ Rd with ai,j ∼ N(0, 1). The response was generated as yi = w⊤ai + νi, where νi represents independent additive noise with νi ∼ N(0, σ2) for all i ∈ [n]. 
We set σ2 implicitly via the signal to noise ratio (SNR) as SNR = var(y)/σ2, where var(y) is the variance of y. In each experiment, we sampled a sparse weight vector w by sampling k dimensions at random from [d], then we sampled values wi ∼ N(0, 1) and set the other dimensions to zero. We performed a series of tests to investigate the performance of the model in different scenarios. Each experiment was run 10 times with separate training and test sets. We present the average results on the test set.

Figure 2: Simulated data performance: support recovery (AUC) and regression (R2). (a) AUC as a function of n:k ratio. (b) R2 as a function of n:k ratio. (c) AUC as a function of SNR. (d) R2 as a function of SNR.

Our first experiment tested the performance of all models with limited samples. Here we set k = 20, d = 10,000 and an SNR of 20dB. The number of training samples was varied from n = 100, . . . , 400, with 200 test samples. Fig. 2a shows the model performance in terms of support recovery. With limited training samples, Sparse-G outperformed all the baselines including Lasso. We also found that SpikeSlabKL consistently outperformed SpikeSlab0.5. We speculate that the significant gap between Sparse-G and SpikeSlabKL may be partly due to the mean field assumption in the underlying Spike and Slab. Fig. 2b shows the corresponding regression performance. Again, we found that Sparse-G outperformed all other baselines, with Ridge achieving the worst performance.
Our second experiment tested the performance of all models with high levels of noise. Here we set k = 20, d = 10,000 and n = 200, with 200 test samples. We varied the SNR from 40dB to −10dB (note that σ2 increases as SNR is decreased). Fig. 2c shows the support recovery performance of the different models. We found a performance gap between Sparse-G and Lasso, more pronounced than in the small sample test. 
SpikeSlab0.5 was the worst performing model, but its performance was improved by SpikeSlabKL. Only Sparse-G achieved perfect support recovery at low noise (high SNR) levels. The regression performance is shown in Fig. 2d. While ARD and Lasso matched Sparse-G at low noise levels (high SNR), their performance degraded much faster at higher noise levels (low SNR).

4.2 Functional Neuroimaging Data

Functional magnetic resonance imaging (fMRI) is an important tool for the non-invasive study of brain activity. fMRI studies involve measurements of blood oxygenation (which are sensitive to the amount of local neuronal activity) while the participant is presented with a stimulus or cognitive task. Neuroimaging signals are then analyzed to identify which brain regions exhibit a systematic response to the stimulation, and thus to infer the functional properties of those brain regions [23].

Figure 3: Support selected by Sparse-G applied to fMRI data with 100,000 voxels. Slices are across the vertical dimension. Selected voxels are in red.

Functional neuroimaging datasets typically consist of a relatively small number of correlated high dimensional brain images. 
Hence, capturing the inherent structural properties of the imaging data is critical for robust inference.
fMRI data were collected from 126 participants while the subjects performed a stop-signal task [24]. For each subject, contrast images were computed for “go” trials and successful “stop” trials using a general linear model with the FMRIB Software Library (FSL), and these contrast images were used for regression against estimated stop-signal reaction times. We used the normalized Laplacian of the 3-dimensional spatial graph of the brain image voxels to define the precision matrix. This corresponds to the observation that nearby voxels tend to have similar functional activation. We present the 10-fold cross validation performance of all models tested on this data. We tested all models using the high dimensional 100,000 voxel brain image and measured the average predictive R2. The results are: Sparse-G (0.051), Lasso (-0.271), Ridge (-0.473), ARD (-0.478). The negative test R2 for the baseline models shows worse predictive performance than the test mean predictor, and indicates the difficulty of this task. Even with the mean field variational inference, the Spike and Slab models did not scale to this dataset. Only Sparse-G achieved a positive R2. The support selected by Sparse-G with all 100,000 voxels is shown in Fig. 3, sliced across the vertical dimension. The recovered voxels show biologically plausible brain locations including the orbitofrontal cortex, dorsolateral prefrontal cortex, putamen, anterior cingulate, and parietal cortex, which are correlated with the observed response. Further neuroscientific interpretation and validation will be included in an extended version of the paper.

5 Conclusion

We present a principled approach for enforcing structure in Bayesian models via structured prior selection based on the maximum entropy principle. 
The prior is defined by the information projection of the base measure to the set of distributions supported on the constraint domain. We focus on the case of sparse structure. While the optimal prior is intractable in general, we show that approximate inference using selected convex subsets is equivalent to maximizing a submodular function subject to cardinality constraints, and propose an efficient greedy forward selection procedure which is guaranteed to achieve within a (1 − 1/e) factor of the global optimum. For future work, we plan to explore applications of our approach with other structural constraints such as low rank and structured sparsity for matrix-variate sample spaces. We also plan to explore more complicated base distributions on other sample spaces.
Acknowledgments: fMRI data was provided by the Consortium for Neuropsychiatric Phenomics (NIH Roadmap for Medical Research grants UL1-DE019580, RL1MH083269, RL1DA024853, PL1MH083271).

References
[1] T.J. Mitchell and J.J. Beauchamp. Bayesian variable selection in linear regression. JASA, 83(404):1023–1032, 1988.
[2] H. Ishwaran and J.S. Rao. Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, pages 730–773, 2005.
[3] C.M. Carvalho, N.G. Polson, and J.G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.
[4] E.T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
[5] S. Kullback. Information Theory and Statistics. Dover, 1959.
[6] D. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[7] O. Koyejo and J. Ghosh. A representation approach for relative entropy minimization with expectation constraints. In ICML WDDL workshop, 2013.
[8] R.A. Poldrack. 
Inferring mental states from neuroimaging data: From reverse inference to large-scale decoding. Neuron, 72(5):692–697, 2011.
[9] J.T. Chang and D. Pollard. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997.
[10] A.N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, New York, 1933.
[11] P. Damien and S.G. Walker. Sampling truncated normal, beta, and gamma densities. Journal of Computational and Graphical Statistics, 10(2), 2001.
[12] M. Park and J. Pillow. Bayesian inference for low rank spatiotemporal neural receptive fields. In NIPS, pages 2688–2696, 2013.
[13] P. Williams. Bayesian conditionalisation and the principle of minimum information. The British Journal for the Philosophy of Science, 31(2):131–144, 1980.
[14] O. Koyejo and J. Ghosh. Constrained Bayesian inference for low rank multitask learning. In UAI, 2013.
[15] C.P. Robert and G. Casella. Monte Carlo Statistical Methods, volume 58. Springer, New York, 1999.
[16] G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[17] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634–652, 1998.
[18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, pages 267–288, 1996.
[19] D. Wipf and S. Nagarajan. A new view of automatic relevance determination. In NIPS, pages 1625–1632, 2007.
[20] F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
[21] M.K. Titsias and M. Lázaro-Gredilla. Spike and slab variational inference for multi-task and multiple kernel learning. In NIPS, volume 24, pages 2339–2347, 2011.
[22] R.E. Kass and A.E. Raftery. Bayes factors. JASA, 90(430):773–795, 1995.
[23] T.M. 
Mitchell, R. Hutchinson, R.S. Niculescu, F. Pereira, X. Wang, M. Just, and S. Newman. Learning to decode cognitive states from brain images. Machine Learning, 57(1-2):145–175, 2004.
[24] C.N. White, E. Congdon, J.A. Mumford, K.H. Karlsgodt, F.W. Sabb, N.B. Freimer, E.D. London, T.D. Cannon, R.M. Bilder, and R.A. Poldrack. Decomposing decision components in the stop-signal task: A model-based approach to individual differences in inhibitory control. Journal of Cognitive Neuroscience, 2014.
", "award": [], "sourceid": 477, "authors": [{"given_name": "Oluwasanmi", "family_name": "Koyejo", "institution": "Stanford University"}, {"given_name": "Rajiv", "family_name": "Khanna", "institution": "University of Texas at Austin"}, {"given_name": "Joydeep", "family_name": "Ghosh", "institution": "UT Austin"}, {"given_name": "Russell", "family_name": "Poldrack", "institution": "University of Texas"}]}