{"title": "Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1232, "page_last": 1240, "abstract": "In large-scale applications of undirected graphical models, such as social networks and biological networks, similar patterns occur frequently and give rise to similar parameters. In this situation, it is beneficial to group the parameters for more efficient learning. We show that even when the grouping is unknown, we can infer these parameter groups during learning via a Bayesian approach. We impose a Dirichlet process prior on the parameters. Posterior inference usually involves calculating intractable terms, and we propose two approximation algorithms, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with stripped Beta approximation (Gibbs_SBA). Simulations show that both algorithms outperform conventional maximum likelihood estimation (MLE). Gibbs_SBA's performance is close to Gibbs sampling with exact likelihood calculation. Models learned with Gibbs_SBA also generalize better than the models learned by MLE on real-world Senate voting data.", "full_text": "Bayesian Estimation of Latently-grouped Parameters\n\nin Undirected Graphical Models\n\nJie Liu\n\nDavid Page\n\nDept of CS, University of Wisconsin\n\nDept of BMI, University of Wisconsin\n\nMadison, WI 53706\n\njieliu@cs.wisc.edu\n\nMadison, WI 53706\n\npage@biostat.wisc.edu\n\nAbstract\n\nIn large-scale applications of undirected graphical models, such as social networks\nand biological networks, similar patterns occur frequently and give rise to simi-\nlar parameters. In this situation, it is bene\ufb01cial to group the parameters for more\nef\ufb01cient learning. We show that even when the grouping is unknown, we can in-\nfer these parameter groups during learning via a Bayesian approach. We impose a\nDirichlet process prior on the parameters. 
Posterior inference usually involves cal-\nculating intractable terms, and we propose two approximation algorithms, namely\na Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling al-\ngorithm with \u201cstripped\u201d Beta approximation (Gibbs SBA). Simulations show that\nboth algorithms outperform conventional maximum likelihood estimation (MLE).\nGibbs SBA\u2019s performance is close to Gibbs sampling with exact likelihood cal-\nculation. Models learned with Gibbs SBA also generalize better than the models\nlearned by MLE on real-world Senate voting data.\n\nIntroduction\n\n1\nUndirected graphical models, a.k.a. Markov random \ufb01elds (MRFs), have many real-world applica-\ntions such as social networks and biological networks. In these large-scale networks, similar kinds\nof relations can occur frequently and give rise to repeated occurrences of similar parameters, but the\ngrouping pattern among the parameters is usually unknown. For a social network example, suppose\nthat we collect voting data over the last 20 years from a group of 1,000 people who are related to each\nother through different types of relations (such as family, co-workers, classmates, friends and so on),\nbut the relation types are usually unknown. If we use a binary pairwise MRF to model the data, each\nbinary node denotes one person\u2019s vote, and two nodes are connected if the two people are linked\nin the social network. Eventually we want to estimate the pairwise potential functions on edges,\nwhich can provide insights about how the relations between people affect their decisions. This can\nbe done via standard maximum likelihood estimation (MLE), but the latent grouping pattern among\nthe parameters is totally ignored, and the model can be overparametrized. Therefore, two questions\nnaturally arise. Can MRF parameter learners automatically identify these latent parameter groups\nduring learning? 
Will this further abstraction make the model generalize better, analogous to the lessons we have learned from hierarchical modeling [9] and topic modeling [5]?

This paper shows that it is feasible and potentially beneficial to identify the latent parameter groups during MRF parameter learning. Specifically, we impose a Dirichlet process prior on the parameters to accommodate our uncertainty about the number of the parameter groups. Posterior inference can be done by Markov chain Monte Carlo with proper approximations. We propose two approximation algorithms, a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with stripped Beta approximation (Gibbs SBA). Algorithmic details are provided in Section 3 after we review related parameter estimation methods in Section 2. In Section 4, we evaluate our Bayesian estimates and the classical MLE on different models, and both algorithms outperform classical MLE. The Gibbs SBA algorithm performs very close to the Gibbs sampling algorithm with exact likelihood calculation. Models learned with Gibbs SBA also generalize better than the models learned by MLE on real-world Senate voting data in Section 5. We finally conclude in Section 6.

2 Maximum Likelihood Estimation and Bayesian Estimation for MRFs

Let X = {0, 1, ..., m-1} be a discrete space. Suppose that we have an MRF defined on a random vector X ∈ X^d described by an undirected graph G(V, E) with d nodes in the node set V and r edges in the edge set E. The probability of one sample x from the MRF parameterized by θ is

P(x; θ) = P̃(x; θ)/Z(θ),    (1)

where Z(θ) is the partition function, P̃(x; θ) = ∏_{c∈C(G)} φc(x; θc) is the unnormalized measure, C(G) is some subset of cliques in G, and φc is the potential function defined on clique c, parameterized by θc. In this paper, we consider binary pairwise MRFs for simplicity, i.e. C(G) = E and m = 2. We also assume that each potential function φc is parameterized by a single parameter θc, namely φc(X; θc) = θc^I(Xu=Xv) (1-θc)^I(Xu≠Xv), where I(Xu=Xv) indicates whether the two nodes u and v connected by edge c take the same value, and 0 < θc < 1 for all c = 1, ..., r. Thus, θ = {θ1, ..., θr}. Suppose that we have n independent samples X = {x1, ..., xn} from (1), and we want to estimate θ.

Maximum Likelihood Estimate: The MLE of θ maximizes the log-likelihood function L(θ|X), which is concave w.r.t. θ. Therefore, we can use gradient ascent to find the global maximum of the likelihood function and thereby the MLE of θ. The partial derivative of L(θ|X) with respect to θi is ∂L(θ|X)/∂θi = (1/n) Σ_{j=1}^n ψi(xj) - E_θ ψi = E_X ψi - E_θ ψi, where ψi is the sufficient statistic corresponding to θi after we rewrite the density in exponential family form, and E_θ ψi is the expectation of ψi under the distribution specified by θ. However, exact computation of E_θ ψi takes time exponential in the treewidth of G. A few sampling-based methods have been proposed, differing in how they generate particles and compute E_θ ψ from the particles, including MCMC-MLE [11, 34], particle-filtered MCMC-MLE [1], contrastive divergence [15] and its variations such as persistent contrastive divergence (PCD) [29] and fast PCD [30]. Note that contrastive divergence
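For intuition, the moment-matching form of this gradient can be checked by brute force on a toy model. The sketch below uses a hypothetical 3-node chain (not one of the paper's experimental models), enumerates all 2^d states to compute Z(θ) exactly, and rescales E_X ψi - E_θ ψi by 1/(θi(1-θi)) to account for the θ-parameterization of the potentials, which yields the gradient of the average log-likelihood:

```python
import itertools
import math

# Toy binary pairwise MRF on a hypothetical 3-node chain:
# phi_c(x) = theta_c^{I(x_u = x_v)} * (1 - theta_c)^{I(x_u != x_v)}
EDGES = [(0, 1), (1, 2)]
STATES = list(itertools.product((0, 1), repeat=3))

def unnorm(x, theta):
    # P~(x; theta): product of edge potentials
    p = 1.0
    for c, (u, v) in enumerate(EDGES):
        p *= theta[c] if x[u] == x[v] else 1.0 - theta[c]
    return p

def partition(theta):
    # Z(theta): brute-force sum over all 2^d states (feasible only for tiny d)
    return sum(unnorm(x, theta) for x in STATES)

def loglik(theta, data):
    # average log-likelihood (1/n) sum_j log P(x^j; theta)
    Z = partition(theta)
    return sum(math.log(unnorm(x, theta) / Z) for x in data) / len(data)

def grad(theta, data):
    # d/d theta_c of the average log-likelihood:
    # (E_X psi_c - E_theta psi_c) / (theta_c (1 - theta_c)),
    # with sufficient statistic psi_c = I(x_u = x_v) for edge c
    Z = partition(theta)
    g = []
    for c, (u, v) in enumerate(EDGES):
        e_data = sum(x[u] == x[v] for x in data) / len(data)
        e_model = sum((x[u] == x[v]) * unnorm(x, theta) for x in STATES) / Z
        g.append((e_data - e_model) / (theta[c] * (1.0 - theta[c])))
    return g
```

On realistic graphs this enumeration of Z(θ) is exactly what is infeasible, which is why the particle-based methods listed above are needed.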
Note that contrastive divergence\nis related to pseudo-likelihood [4], ratio matching [17, 16], and together with other MRF parameter\nestimators [13, 31, 12] can be uni\ufb01ed as minimum KL contraction [18].\nBayesian Estimate: Let \u03c0(\u03b8) be a prior of \u03b8; then its posterior is P (\u03b8|X) \u221d \u03c0(\u03b8) \u02dcP (X; \u03b8)/Z(\u03b8).\nThe Bayesian estimate of \u03b8 is its posterior mean. Exact sampling from P (\u03b8|X) is known as doubly-\nintractable for general MRFs [21]. If we use the Metropolis-Hastings algorithm, then Metropolis-\nHastings ratio is\n\n= 1\nn\n\n\u2202\u03b8i\n\n\u2217|\u03b8) =\n\na(\u03b8\n\n\u2217\n\n\u2217\n\n) \u02dcP (X; \u03b8\n\n\u03c0(\u03b8\n\u03c0(\u03b8) \u02dcP (X; \u03b8)Q(\u03b8\n\n)Q(\u03b8|\u03b8\n\n\u2217\n\n\u2217\n)/Z(\u03b8\n\u2217|\u03b8)/Z(\u03b8)\n\n)\n\n,\n\n(2)\n\n\u2217\n\n\u2217\n\n,y\u2217)=Q(\u03b8|\u03b8\n\n\u2217, and with probability min{1, a(\u03b8\n\n\u2217|\u03b8) is some proposal distribution from \u03b8 to \u03b8\n\n\u2217. The real hurdle is that we have to evaluate the intractable Z(\u03b8)/Z(\u03b8\n\n\u2217|\u03b8)} we\nwhere Q(\u03b8\n\u2217\naccept the move from \u03b8 to \u03b8\n)\nin the ratio. In [20], M\u00f8ller et al. introduce one auxiliary variable y on the same space as x, and\nthe state variable is extended to (\u03b8, y). They set the new proposal distribution for the extended\nstate Q(\u03b8, y|\u03b8\n) in (2). Therefore by ignoring\ny, we can generate the posterior samples of \u03b8 via Metropolis-Hastings. Technically, this auxiliary\nvariable approach requires perfect sampling [25], but [20] pointed out that other simpler Markov\nchain methods also work with the proviso that it converges adequately to the equilibrium distribution.\n3 Bayesian Parameter Estimation for MRFs with Dirichlet Process Prior\nIn order to model the latent parameter groups, we impose a Dirichlet process prior on \u03b8, which\naccommodates our uncertainty about the number of groups. 
Then, the generating model is

G ~ DP(α0, G0)
θi | G ~ G, i = 1, ..., r
xj | θ ~ F(θ), j = 1, ..., n,    (3)

where F(θ) is the distribution specified by (1), G0 is the base distribution (e.g. Unif(0, 1)), and α0 is the concentration parameter. With probability 1.0, the distribution G drawn from DP(α0, G0) is discrete, and places its mass on a countably infinite collection of atoms drawn from G0. In this model, X = {x1, ..., xn} is observed, and we want to perform posterior inference for θ = (θ1, θ2, ..., θr) and regard its posterior mean as its Bayesian estimate. We propose two Markov chain Monte Carlo (MCMC) methods. One is a Metropolis-Hastings algorithm with auxiliary variables, as introduced in Section 3.1. The second is a Gibbs sampling algorithm with stripped Beta approximation, as introduced in Section 3.2. In both methods, the state of the Markov chain is specified by two vectors, c and φ. In vector c = (c1, ..., cr), ci denotes the group to which θi belongs. φ = (φ1, ..., φk) records the k distinct values in {θ1, ..., θr}, with φci = θi for i = 1, ..., r. This way of specifying the Markov chain is more efficient than setting the state variable directly to be (θ1, θ2, ..., θr) [22].

3.1 Metropolis-Hastings (MH) with Auxiliary Variables

In the MH algorithm (see Algorithm 1), the initial state of the Markov chain is set by performing K-means clustering on the MLE of θ (e.g. from the PCD algorithm [29]) with K = ⌊α0 ln r⌋. The Markov chain resembles Algorithm 5 in [22], and it is ergodic. We move the Markov chain forward for T steps. In each step, we update c first and then update φ. We update each element of c in turn; when resampling ci, we fix c-i, all elements in c other than ci. When updating ci, we repeatedly, M times, propose a new value c*i according to proposal Q(c*i|ci) and accept the move with probability min{1, a(c*i|ci)}, where a(c*i|ci) is the MH ratio. After we update every element of c in the current iteration, we draw a posterior sample of φ according to the current grouping c. We iterate T times, and get T posterior samples of θ.

Algorithm 1 The Metropolis-Hastings algorithm
Input: observed data X = {x1, ..., xn}
Output: θ̂(1), ..., θ̂(T); T samples of θ|X
Procedure:
  Perform the PCD algorithm to get θ̃, the MLE of θ
  Initialize c and φ via K-means on θ̃; K = ⌊α0 ln r⌋
  for t = 1 to T do
    for i = 1 to r do
      for l = 1 to M do
        Draw a candidate c*i from Q(c*i|ci)
        If c*i ∉ c, draw a value for φc*i from G0
        Set ci = c*i with prob min{1, a(c*i|ci)}
      end for
    end for
    Draw a posterior sample of φ according to current c, and set θ̂(t)i = φci for i = 1, ..., r
  end for

Unlike the tractable Algorithm 5 in [22], we need to introduce auxiliary variables to bypass the MRF's intractable likelihood in two places, namely calculating the MH ratio (in Section 3.1.1) and drawing samples of φ|c (in Section 3.1.2).

3.1.1 Calculating the Metropolis-Hastings Ratio

The MH ratio of proposing a new value c*i for ci according to proposal Q(c*i|ci) is

a(c*i|ci) = [π(c*i, c-i) P(X; θ*i) Q(ci|c*i)] / [π(ci, c-i) P(X; θ) Q(c*i|ci)]
          = [π(c*i|c-i) P̃(X; θ*i) Q(ci|c*i) / Z(θ*i)] / [π(ci|c-i) P̃(X; θ) Q(c*i|ci) / Z(θ)],

where θ*i is the same as θ except that its i-th element is replaced with φc*i. The conditional prior π(c*i|c-i) is

π(ci = c|c-i) = n-i,c / (r - 1 + α0), if c ∈ c-i;
π(ci = c|c-i) = α0 / (r - 1 + α0), if c ∉ c-i,

where n-i,c is the number of cj with j ≠ i and cj = c. We choose the proposal Q(c*i|ci) to be the conditional prior π(c*i|c-i), and the Metropolis-Hastings ratio can be further simplified as a(c*i|ci) = P̃(X; θ*i) Z(θ) / [P̃(X; θ) Z(θ*i)]. However, Z(θ)/Z(θ*i) is intractable. Similar to [20], we introduce an auxiliary variable Z on the same space as X, and the state variable is extended to (c, Z). When proposing a move, we propose c*i first and then propose Z* with proposal P(Z; θ*i) to cancel the intractable Z(θ)/Z(θ*i). We set the target distribution of Z to be P(Z; θ̃), where θ̃ is some estimate of θ (e.g. from PCD [29]). Then, the MH ratio with the auxiliary variable is

a(c*i, Z*|ci, Z) = [P(Z*; θ̃) P̃(X; θ*i) P̃(Z; θ)] / [P(Z; θ̃) P̃(X; θ) P̃(Z*; θ*i)] = [P̃(Z*; θ̃) P̃(X; θ*i) P̃(Z; θ)] / [P̃(Z; θ̃) P̃(X; θ) P̃(Z*; θ*i)].

Thus, the intractable computation of the MH ratio is replaced by generating particles Z* and Z under θ*i and θ respectively.
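The conditional prior π(ci = c|c-i) used as the proposal above is the usual Chinese-restaurant-process predictive rule; a minimal sketch (the group labels and the value of α0 are arbitrary):

```python
import random
from collections import Counter

def conditional_prior(c_minus_i, alpha0):
    # pi(c_i = c | c_-i): an existing group c gets mass n_{-i,c}/(r-1+alpha0);
    # a brand-new group gets alpha0/(r-1+alpha0), where r-1 = len(c_minus_i)
    denom = len(c_minus_i) + alpha0
    probs = {c: n / denom for c, n in Counter(c_minus_i).items()}
    probs["new"] = alpha0 / denom
    return probs

def propose(c_minus_i, alpha0, rng):
    # draw one candidate group label from the conditional prior
    probs = conditional_prior(c_minus_i, alpha0)
    u = rng.random()
    for c, p in probs.items():
        u -= p
        if u <= 0:
            return c
    return "new"
```

Using the conditional prior as the proposal is what makes the prior and proposal terms cancel in the simplified ratio above.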
Ideally, we should use perfect sampling [25], but it is intractable for general MRFs. As a compromise, we use standard Gibbs sampling with long runs to generate these particles.

3.1.2 Drawing Posterior Samples of φ|c

We draw posterior samples of φ under grouping c via the MH algorithm, again following [20]. The state of the Markov chain is φ. The initial state of the Markov chain is set by running PCD [29] with parameters tied according to c. The proposal Q(φ*|φ) is a k-variate Gaussian N(φ, σQ²Ik), where σQ²Ik is the covariance matrix. The auxiliary variable Y is on the same space as X, and the state is extended to (φ, Y). The proposal distribution for the extended state variable is Q(φ, Y|φ*, Y*) = Q(φ|φ*) P̃(Y; φ)/Z(φ). We set the target distribution of Y to be P(Y; φ̃), where φ̃ is some estimate of φ, such as the estimate from the PCD algorithm [29]. Then, the MH ratio for the extended state is

a(φ*, Y*|φ, Y) = I(φ* ∈ Θ) [P̃(Y*; φ̃) P̃(X; φ*) P̃(Y; φ)] / [P̃(Y; φ̃) P̃(X; φ) P̃(Y*; φ*)],

where I(φ* ∈ Θ) indicates that every dimension of φ* is in the domain of G0. We set the state to the new values with probability min{1, a(φ*, Y*|φ, Y)}. We move the Markov chain for S steps, and get S samples of φ by ignoring Y. Eventually we draw one sample from them randomly.

3.2 Gibbs Sampling with Stripped Beta Approximation

In the Gibbs sampling algorithm (see Algorithm 2), the initialization of the Markov chain is exactly the same as in the MH algorithm in Section 3.1. The Markov chain resembles Algorithm 2 in [22] and it can be shown to be ergodic.

Algorithm 2 The Gibbs sampling algorithm
Input: observed data X = {x1, x2, ..., xn}
Output: θ̂(1), ..., θ̂(T); T posterior samples of θ|X
Procedure:
  Perform the PCD algorithm to get the MLE θ̃
  Initialize c and φ via K-means on θ̃; K = ⌊α0 ln r⌋
  for t = 1 to T do
    for i = 1 to r do
      If current ci is unique in c, remove φci from φ
      Update ci according to (4)
      If new ci ∉ c, draw a value for φci and add it to φ
    end for
    Draw a posterior sample of φ according to current c, and set θ̂(t)i = φci for i = 1, ..., r
  end for

We move the Markov chain forward for T steps. In each of the T steps, we update c first and then update φ. When we update c, we fix the values in φ, except that we may add one new value to φ or remove a value from φ. We update each element of c in turn. When we update ci, we first examine whether ci is unique in c. If so, we remove φci from φ first. We then update ci by assigning it to an existing group or a new group with a probability proportional to a product of two quantities, namely

P(ci = c|c-i, X, φc-i) ∝ [n-i,c / (r - 1 + α0)] P(X; φc, φc-i), if c ∈ c-i;
P(ci = c|c-i, X, φc-i) ∝ [α0 / (r - 1 + α0)] ∫ P(X; θi, φc-i) dG0(θi), if c ∉ c-i.    (4)

The first quantity is n-i,c, the number of members already in group c; for starting a new group, this quantity is α0. The second quantity is the likelihood of X after assigning ci to the new value c, conditional on φc-i. When considering a new group, we integrate the likelihood w.r.t. G0. After ci is resampled, it is either set to be an existing group or a new group. If a new group is assigned, we draw a new value for φci, and add it to φ.
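Update rule (4) can be sketched for a generic likelihood oracle. Here `lik` is a caller-supplied stand-in for evaluating P(X; ·) with edge i's parameter varied (in the paper this evaluation is itself approximated, as described in Section 3.2.1), and a Monte Carlo average over draws from G0 = Unif(0, 1) handles the new-group integral; the common denominator r - 1 + α0 cancels and is omitted:

```python
import random
from collections import Counter

def update_ci(i, c, phi, alpha0, lik, rng, m=20):
    # One Gibbs update of c[i] following the product rule in (4).
    # c: list of group labels; phi: dict label -> parameter value;
    # lik(theta): likelihood of X with edge i's parameter set to theta,
    # all other parameters held fixed (a stand-in supplied by the caller).
    c_minus = c[:i] + c[i + 1:]
    if c[i] not in c_minus:              # c[i] was unique: retire its value
        phi.pop(c[i])
    counts = Counter(c_minus)
    labels = list(counts)
    # existing group c: weight n_{-i,c} * P(X; phi_c, ...)
    weights = [counts[g] * lik(phi[g]) for g in labels]
    # new group: alpha0 times a Monte Carlo estimate of the G0-integral
    draws = [rng.random() for _ in range(m)]     # theta_i ~ G0 = Unif(0, 1)
    weights.append(alpha0 * sum(lik(t) for t in draws) / m)
    labels.append(None)                          # None marks "open a new group"
    u = rng.random() * sum(weights)
    for g, w in zip(labels, weights):
        u -= w
        if u <= 0:
            break
    if g is None:                                # start a new group
        g = max(phi) + 1 if phi else 0
        phi[g] = rng.random()                    # its value is drawn from G0
    c[i] = g
```

Sweeping this update over i = 1, ..., r keeps the invariant that the set of labels in c matches the keys of φ, which is how groups are born and die during sampling.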
After updating every element of c in the current iteration, we draw a posterior sample of φ under the current grouping c. In total, we run T iterations, and get T posterior samples of θ. This Gibbs sampling algorithm involves two intractable calculations, namely (i) calculating P(X; φc, φc-i) and ∫ P(X; θi, φc-i) dG0(θi) in (4) and (ii) drawing posterior samples for φ. We use a stripped Beta approximation in both places, as in Sections 3.2.1 and 3.2.2.

3.2.1 Calculating P(X; φc, φc-i) and ∫ P(X; θi, φc-i) dG0(θi) in (4)

In Formula (4), we evaluate P(X; φc, φc-i) for different φc values with φc-i fixed and X = {x1, x2, ..., xn} observed. For ease of notation, we rewrite this quantity as a likelihood function of θi, L(θi|X, θ-i), where θ-i = {θ1, ..., θi-1, θi+1, ..., θr} is fixed. Suppose that edge i connects variables Xu and Xv, and denote by X-uv the variables other than Xu and Xv. Then

L(θi|X, θ-i) = ∏_{j=1}^n P(x^j_u, x^j_v | x^j_-uv; θi, θ-i) P(x^j_-uv; θi, θ-i)
             ≈ ∏_{j=1}^n P(x^j_u, x^j_v | x^j_-uv; θi, θ-i) P(x^j_-uv; θ-i)
             ∝ ∏_{j=1}^n P(x^j_u, x^j_v | x^j_-uv; θi, θ-i).

Above we approximate P(x^j_-uv; θi, θ-i) with P(x^j_-uv; θ-i) because the density of X-uv mostly depends on θ-i. The term P(x^j_-uv; θ-i) can then be dropped since θ-i is fixed, and we only have to consider P(x^j_u, x^j_v | x^j_-uv; θi, θ-i). Since θ-i is fixed and we are conditioning on x^j_-uv, they together can be regarded as a fixed potential function telling how likely the rest of the graph thinks Xu and Xv should take the same value. Suppose that this fixed potential function (the message from the rest of the network x^j_-uv) is parameterized as ηi (0 < ηi < 1). Then

∏_{j=1}^n P(x^j_u, x^j_v | x^j_-uv; θi, θ-i) ∝ ∏_{j=1}^n λ^I(x^j_u = x^j_v) (1-λ)^I(x^j_u ≠ x^j_v) = λ^{Σ_j I(x^j_u = x^j_v)} (1-λ)^{n - Σ_j I(x^j_u = x^j_v)},    (5)

where λ = θi ηi / {θi ηi + (1-θi)(1-ηi)}. The end of (5) resembles a Beta distribution with parameters (Σ_j I(x^j_u = x^j_v) + 1, n - Σ_j I(x^j_u = x^j_v) + 1), except that only part of λ, namely θi, is random. We therefore use a Beta distribution to approximate the likelihood with respect to θi, removing the contribution of ηi and keeping only the contribution from θi. We choose Beta(⌊nθ̃i⌋ + 1, n - ⌊nθ̃i⌋ + 1), where θ̃i is the MLE of θi (e.g. from the PCD algorithm). This approximation is named the stripped Beta approximation. The simulation results in Section 4.2 indicate that the performance of the stripped Beta approximation is very close to using exact calculation. Also, this approximation only requires as much computation as in tractable tree-structured MRFs, and it does not require generating expensive particles as in the MH algorithm with auxiliary variables. The integral ∫ P(X; θi, φc-i) dG0(θi) in (4) can be calculated via Monte Carlo approximation: we draw a number of samples of θi from G0, evaluate P(X; θi, φc-i) for each, and take the average.

3.2.2 Drawing Posterior Samples of φ|c

The stripped Beta approximation also allows us to draw posterior samples from φ|c approximately. Suppose that there are k groups according to c, and we have estimates for φ, denoted φ̂ = (φ̂1, ..., φ̂k). We denote the numbers of elements in the k groups by m = {m1, ..., mk}. For group i, we draw a posterior sample for φi from Beta(⌊mi n φ̂i⌋ + 1, mi n - ⌊mi n φ̂i⌋ + 1).

4 Simulations

We investigate the performance of our Bayesian estimators on three models: (i) a tree-MRF, (ii) a small grid-MRF whose likelihood is tractable, and (iii) a large grid-MRF whose likelihood is intractable. We first set the ground truth of the parameters, and then generate training and testing samples. On the training data, we apply our grouping-aware Bayesian estimators and two baseline estimators, namely a grouping-blind estimator and an oracle estimator. The grouping-blind estimator does not know that groups exist in the parameters, and estimates the parameters in the normal MLE fashion. The oracle estimator knows the ground truth of the groupings, ties the parameters from the same group, and estimates them via MLE. For the tree-MRF, our Bayesian estimator is exact since the likelihood is tractable. For the small grid-MRF, we have three variations of the Bayesian estimator, namely Gibbs sampling with exact likelihood computation, MH with auxiliary variables, and Gibbs sampling with stripped Beta approximation.
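The stripped Beta approximation of Sections 3.2.1 and 3.2.2 reduces each intractable likelihood term to a Beta density; a minimal sketch (the MLE θ̃i and the group statistics below are hypothetical inputs):

```python
import math
import random

def log_beta_pdf(x, a, b):
    # log density of Beta(a, b) at x in (0, 1)
    return ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))

def stripped_beta_loglik(theta_i, theta_tilde_i, n):
    # Section 3.2.1: approximate the likelihood of theta_i (up to a constant)
    # by Beta(floor(n*theta_tilde_i)+1, n - floor(n*theta_tilde_i)+1),
    # where theta_tilde_i is an MLE of theta_i (e.g. from PCD)
    a = math.floor(n * theta_tilde_i) + 1
    b = n - math.floor(n * theta_tilde_i) + 1
    return log_beta_pdf(theta_i, a, b)

def draw_phi_given_c(phi_hat, group_sizes, n, rng):
    # Section 3.2.2: for each group i of size m_i, draw phi_i from
    # Beta(floor(m_i*n*phi_hat_i)+1, m_i*n - floor(m_i*n*phi_hat_i)+1)
    out = []
    for ph, m in zip(phi_hat, group_sizes):
        a = math.floor(m * n * ph) + 1
        b = m * n - math.floor(m * n * ph) + 1
        out.append(rng.betavariate(a, b))
    return out
```

Note how pooling a group of size m_i sharpens the Beta: its effective sample size is m_i n rather than n, which is exactly where the grouping pays off.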
For the large grid-MRF, the computational burden only allows us to apply Gibbs sampling with stripped Beta approximation.

We compare the estimators by three measures. The first is the average absolute error of the estimate, (1/r) Σ_{i=1}^r |θi - θ̂i|, where θ̂i is the estimate of θi. The second measure is the log-likelihood of the testing data, or the log pseudo-likelihood [4] of the testing data when the exact likelihood is intractable. Thirdly, we evaluate how informative the grouping yielded by the Bayesian estimator is. We use the variation of information metric [19] between the inferred grouping Ĉ and the ground-truth grouping C, namely VI(Ĉ, C). Since VI(Ĉ, C) is sensitive to the number of groups in Ĉ, we contrast it with VI(C̄, C), where C̄ is a random grouping with the same number of groups as Ĉ. Eventually, we evaluate Ĉ via the VI difference, namely VI(C̄, C) - VI(Ĉ, C). A larger VI difference indicates a more informative grouping yielded by our Bayesian estimator. Because we have one grouping in each of the T MCMC steps, we average the VI difference over the T steps.

4.1 Simulations on Tree-structure MRFs

For the structure of the MRF, we choose a perfect binary tree of height 12 (i.e. 8,191 nodes and 8,190 edges). We assume there are 25 groups among the 8,190 parameters. The base distribution G0 is Unif(0, 1). We first generate the true parameters for the 25 groups from Unif(0, 1). We then randomly assign each of the 8,190 parameters to one of the 25 groups. We then generate 1,000 testing samples and n training samples (n = 100, 200, ..., 1,000). Eventually, we apply the grouping-blind MLE, the oracle MLE, and our grouping-aware Bayesian estimator on the training samples. For tree-structure MRFs, both MLE and Bayesian estimation have a closed-form solution. For the Bayesian estimator, we set the number of Gibbs sampling steps to 500 and set α0 = 1.0. We replicate the experiment 500 times, and the averaged results are in Figure 1.

Figure 1: Performance of the grouping-blind MLE, the oracle MLE and our Bayesian estimator on tree-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and random grouping.

Our grouping-aware Bayesian estimator has a lower estimate error and a higher log-likelihood of test data, compared with the grouping-blind MLE, demonstrating the "blessing of abstraction". Our Bayesian estimator performs worse than the oracle MLE, as we expect. In addition, as the training sample size increases, the performance of our Bayesian estimator approaches that of the oracle MLE. The VI difference in Figure 1(c) indicates that the Bayesian estimator also recovers the latent grouping to some extent, and the inferred groupings become more and more reliable as the training size increases. The number of groups inferred by the Bayesian estimator and its running time are shown in Figure 2. We also investigate the asymptotic performance of the estimators and their performance when there are no parameter groups. The results are provided in the supplementary materials.

Figure 2: Number of groups inferred by the Bayesian estimator and its run time.

4.2 Simulations on Small Grid-MRFs

For the structure of the MRF, we choose a 4×4 grid with 16 nodes and 24 edges. Exact likelihood is tractable in this small model, which allows us to investigate how good the two types of approximation are.
We apply the grouping-blind MLE (the PCD algorithm), the oracle MLE (the PCD algorithm with the parameters from the same group tied) and three Bayesian estimators: Gibbs sampling with exact likelihood computation (Gibbs ExactL), Metropolis-Hastings with auxiliary variables (MH AuxVar), and Gibbs sampling with stripped Beta approximation (Gibbs SBA). We assume there are five parameter groups. The base distribution is Unif(0, 1). We first generate the true parameters for the five groups from Unif(0, 1). We then randomly assign each of the 24 parameters to one of the five groups. We then generate 1,000 testing samples and n training samples (n = 100, 200, ..., 1,000). For Gibbs ExactL and Gibbs SBA, we set the number of Gibbs sampling steps to 100. For MH AuxVar, we set the number of MH steps to 500 and its proposal number M to 5. The parameter σQ in Section 3.1.2 is set to 0.001 and the parameter S is set to 100. For all three Bayesian estimators, we set α0 = 1.0. We replicate the experiment 50 times, and the averaged results are in Figure 4.

Our grouping-aware Bayesian estimators have a lower estimate error and a higher log-likelihood of test data, compared with the grouping-blind MLE, demonstrating the blessing of abstraction. All three Bayesian estimators perform worse than the oracle MLE, as we expect. The VI difference in Figure 4(c) indicates that the Bayesian estimators also recover the grouping to some extent, and the inferred groupings become more and more reliable as the training size increases. In Figure 3, we provide the boxplots of the number of groups inferred by Gibbs ExactL, MH AuxVar and Gibbs SBA.
All three methods recover a reasonable number of groups, and Gibbs SBA slightly over-estimates the number of groups.

Figure 3: The number of groups inferred by Gibbs ExactL, MH AuxVar and Gibbs SBA.

Figure 4: Performance of grouping-blind MLE, oracle MLE, Gibbs ExactL, MH AuxVar, and Gibbs SBA on the small grid-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimators and random grouping.

Figure 5: Performance of the grouping-blind MLE, the oracle MLE and the Bayesian estimator (Gibbs SBA) on large grid-structure MRFs in terms of (a) error of estimate and (b) log-likelihood of test data. Subfigure (c) shows the VI difference between the grouping yielded by our Bayesian estimator and random grouping.

Table 1: The run time (in seconds) of Gibbs ExactL, MH AuxVar and Gibbs SBA when training size is n.

Among the three Bayesian estimators, Gibbs ExactL has the lowest estimate error and the highest log-likelihood of test data.
Gibbs SBA also performs considerably well, and its performance is close to that of Gibbs ExactL. MH AuxVar works slightly worse, especially when there is less training data. However, MH AuxVar recovers better groupings than Gibbs SBA when there are more training data. The run times of the three Bayesian estimators are listed in Table 1. Gibbs ExactL has a computational complexity that is exponential in the dimensionality d, and cannot be applied when d > 20. MH AuxVar is also computationally intensive because it has to generate expensive particles. Gibbs SBA runs fast, with its burden mainly from running PCD under a specific grouping in each Gibbs sampling step, and it scales well.

              n=100       n=500       n=1,000
Gibbs ExactL  88,136.3    91,055.0    92,503.4
MH AuxVar        540.2     3,342.2     4,546.7
Gibbs SBA          8.1        10.8        14.2

4.3 Simulations on Large Grid-MRFs

The large grid consists of 30 rows and 30 columns (i.e. 900 nodes and 1,740 edges). Exact likelihood is intractable for this large model, and we cannot run Gibbs ExactL. The high dimension also prohibits MH AuxVar. Therefore, we only run the Gibbs SBA algorithm on this large grid-structure MRF. We assume that there are 10 groups among the 1,740 parameters. We also evaluate the estimators by the log pseudo-likelihood of testing data. The other settings of the experiments stay the same as in Section 4.2. We replicate the experiment 50 times, and the averaged results are in Figure 5.

For all 10 training sets, our Bayesian estimator Gibbs SBA has a lower estimate error and a higher log-likelihood of test data, compared with the grouping-blind MLE (via the PCD algorithm). Gibbs SBA has a higher estimate error and a lower pseudo-likelihood of test data than the oracle MLE. The VI difference in Figure 5(c) indicates that Gibbs SBA gradually recovers the grouping as the training size increases.
The number of groups inferred by Gibbs_SBA and its running time are provided in Figure 6. Similarly to the observation in Section 4.2, Gibbs_SBA over-estimates the number of groups. Gibbs_SBA finishes the simulations on 900 nodes and 1,740 edges in hundreds of minutes (depending on the training size), which is fast for an MRF of this size.

Figure 6: Number of groups inferred by Gibbs_SBA and its run time.

Table 2: Log pseudo-likelihood (LPL) of training and test data from MLE (PCD) and the Bayesian estimate (Gibbs_SBA), the number of groups inferred by Gibbs_SBA, and its run time in the Senate voting experiments.

         LPL-TRAIN                LPL-TEST
         MLE        GIBBS_SBA     MLE        GIBBS_SBA    # GROUPS   RUNTIME (MINS)
EXP1     -10716.75  -10721.34     -9022.01   -8989.87     7.89       204
EXP2     -8306.17   -8322.34      -11490.47  -11446.45    7.29       183

5 Real-world Application

We apply the Gibbs_SBA algorithm to US Senate voting data from the 109th Congress (available at
www.senate.gov). The 109th Congress has two sessions, the first in 2005 and the second in 2006. There are 366 votes and 278 votes in the two sessions, respectively. There are 100 senators in both sessions, but Senator Corzine only served in the first session and Senator Menendez only served in the second session, so we remove them. In total, we have 99 senators in our experiments, and we treat the votes of the 99 senators as the 99 variables in the MRF. We only consider contested votes; namely, we remove the votes with fewer than ten or more than ninety supporters. This leaves 292 votes and 221 votes in the two sessions, respectively. The structure of the MRF is from Figure 13 in [2]; there are in total 279 edges. The votes are coded as -1 for no and 1 for yes. We replace all missing votes with -1, staying consistent with [2]. We perform two experiments. First, we train the MRF on the first-session data and test on the second-session data. Then, we train on the second session and test on the first session. We compare our Bayesian estimator (via Gibbs_SBA) and MLE (via PCD) by the log pseudo-likelihood of test data, since the exact likelihood is intractable. We set the number of Gibbs sampling steps to 3,000. Both experiments finish in around three hours on a single CPU. The results are summarized in Table 2. In the first experiment, the log pseudo-likelihood of test data is -9022.01 from MLE, whereas it is -8989.87 from our Bayesian estimate. In the second experiment, the log pseudo-likelihood of test data is -11490.47 from MLE, whereas it is -11446.45 from our Bayesian estimate. The increase in log pseudo-likelihood is comparable to the increase in log (pseudo-)likelihood we gain in the simulations (please refer to Figures 1b, 4b and 5b at the points where we simulate 200 and 300 training samples).
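The preprocessing just described (dropping uncontested votes, coding yes/no as ±1, and treating missing votes as no) can be sketched as follows; the input format, function name, and thresholds-as-parameters are our own illustrative choices:

```python
def preprocess_votes(raw_votes, min_support=10, max_support=90):
    """Keep only contested roll calls and code them as ±1: votes with
    fewer than `min_support` or more than `max_support` supporters are
    dropped, and missing votes are treated as 'no' (-1).  `raw_votes`
    is a list of dicts mapping senator name -> 'yes'/'no'/None, one
    dict per roll call (a hypothetical input format)."""
    kept = []
    for vote in raw_votes:
        supporters = sum(1 for v in vote.values() if v == "yes")
        if min_support <= supporters <= max_support:
            # both missing (None) and 'no' are coded as -1
            kept.append({s: (1 if v == "yes" else -1) for s, v in vote.items()})
    return kept
```

The thresholds default to the paper's values (ten and ninety supporters out of the 99 senators); each retained roll call then supplies one ±1 observation of the 99 MRF variables.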
Both experiments indicate that the models trained with the Gibbs_SBA algorithm generalize considerably better than the models trained with MLE. Gibbs_SBA also infers that there are around eight different types of relations among the senators. The two trained models are provided in the supplementary materials, and the estimated parameters in the two models are consistent.

6 Discussion

Bayesian nonparametric approaches [23, 10], such as the Dirichlet process [7], provide an elegant way of modeling mixtures with an unknown number of components. These approaches have yielded advances in different areas of machine learning, such as infinite Gaussian mixture models [26], the infinite mixture of Gaussian processes [27], infinite HMMs [3, 8], infinite HMRFs [6], DP-nonlinear models [28], DP-mixture GLMs [14], infinite SVMs [33, 32], and the infinite latent attribute model [24]. In this paper, we play the same trick of replacing the prior distribution with a prior stochastic process to accommodate our uncertainty about the number of parameter groups. To the best of our knowledge, this is the first time a Bayesian nonparametric approach has been applied to models whose likelihood is intractable. Accordingly, we propose two types of approximation, namely a Metropolis-Hastings algorithm with auxiliary variables and a Gibbs sampling algorithm with stripped Beta approximation. Both algorithms show superior performance over conventional MLE, and Gibbs_SBA also scales well to large MRFs. The Markov chains in both algorithms are ergodic, but may not satisfy detailed balance because we rely on approximation. Thus, we can guarantee that both algorithms converge for general MRFs, but they may not converge exactly to the target distribution. In this paper, we only consider the situation where the potential functions are pairwise and there is only one parameter in each potential function.
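The Dirichlet process prior used here can be viewed through its Chinese restaurant process representation, under which a partition with an unbounded number of groups is drawn sequentially. A minimal illustrative draw (our own sketch, not the paper's inference procedure):

```python
import random

def crp_partition(n, alpha, seed=0):
    """Draw a random partition of n items from the Chinese restaurant
    process with concentration alpha -- the partition implied by a
    Dirichlet process prior.  The number of groups is not fixed in
    advance; it grows roughly as alpha * log(n)."""
    rng = random.Random(seed)
    counts = []   # counts[k] = size of group k so far
    labels = []   # labels[i] = group assignment of item i
    for i in range(n):
        # item i joins existing group k with prob counts[k] / (i + alpha),
        # or starts a new group with prob alpha / (i + alpha)
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                labels.append(k)
                break
        else:
            labels.append(len(counts))
            counts.append(1)
    return labels
```

Because the "new group" probability never vanishes, the prior places mass on every possible number of parameter groups, which is what lets posterior inference infer the grouping rather than fix it in advance.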
For graphical models with more than one parameter in the potential functions, it is appropriate to group the parameters at the level of potential functions. A more sophisticated base distribution G0 (such as a multivariate distribution) would then be needed. In this paper, we also assume the structures of the MRFs are given. When the structures are unknown, we still need to perform structure learning. Allowing structure learners to automatically identify structural modules is another interesting topic for future research.

Acknowledgements

The authors acknowledge the support of NIGMS R01GM097618-01 and NLM R01LM011028-01.

References

[1] A. U. Asuncion, Q. Liu, A. T. Ihler, and P. Smyth. Particle filtered MCMC-MLE with connections to contrastive divergence. In ICML, 2010.
[2] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 9:485-516, 2008.
[3] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In NIPS, 2002.
[4] J. Besag. Statistical analysis of non-lattice data. JRSS-D, 24(3):179-195, 1975.
[5] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
[6] S. P. Chatzis and G. Tsechpenakis. The infinite hidden Markov random field model. In ICCV, 2009.
[7] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230, 1973.
[8] J. V. Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In ICML, 2008.
[9] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York, 2007.
[10] S. J. Gershman and D. M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1):1-12, 2012.
[11] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics, pages 156-163, 1991.
[12] M. Gutmann and J. Hirayama. Bregman divergence as general framework to estimate unnormalized statistical models. In UAI, pages 283-290, 2011.
[13] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
[14] L. A. Hannah, D. M. Blei, and W. B. Powell. Dirichlet process mixtures of generalized linear models. JMLR, 12:1923-1953, 2011.
[15] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.
[16] A. Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529-1531, 2007.
[17] A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499-2512, 2007.
[18] S. Lyu. Unifying non-maximum likelihood learning objectives with minimum KL contraction. In NIPS, 2011.
[19] M. Meila. Comparing clusterings by the variation of information. In COLT, 2003.
[20] J. Møller, A. Pettitt, R. Reeves, and K. Berthelsen. An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93(2):451-458, 2006.
[21] I. Murray, Z. Ghahramani, and D. J. C. MacKay. MCMC for doubly-intractable distributions. In UAI, 2006.
[22] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249-265, 2000.
[23] P. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. Springer, 2010.
[24] K. Palla, D. A. Knowles, and Z. Ghahramani. An infinite latent attribute model for network data. In ICML, 2012.
[25] J. G. Propp and D. B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9(1-2):223-252, 1996.
[26] C. E. Rasmussen. The infinite Gaussian mixture model. In NIPS, 2000.
[27] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In NIPS, 2001.
[28] B. Shahbaba and R. Neal. Nonlinear models using Dirichlet process mixtures. JMLR, 10:1829-1850, 2009.
[29] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
[30] T. Tieleman and G. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.
[31] D. Vickrey, C. Lin, and D. Koller. Non-local contrastive objectives. In ICML, 2010.
[32] J. Zhu, N. Chen, and E. P. Xing. Infinite latent SVM for classification and multi-task learning. In NIPS, 2011.
[33] J. Zhu, N. Chen, and E. P. Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In ICML, 2011.
[34] S. C. Zhu and X. Liu. Learning in Gibbsian fields: How accurate and how fast can it be? IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1001-1006, 2002.