{"title": "Optimistic Distributionally Robust Optimization for Nonparametric Likelihood Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 15872, "page_last": 15882, "abstract": "The likelihood function is a fundamental component in Bayesian statistics. However, evaluating the likelihood of an observation is computationally intractable in many applications. In this paper, we propose a non-parametric approximation of the likelihood that identifies a probability measure which lies in the neighborhood of the nominal measure and that maximizes the probability of observing the given sample point. We show that when the neighborhood is constructed by the Kullback-Leibler divergence, by moment conditions or by the Wasserstein distance, then our optimistic likelihood can be determined through the solution of a convex optimization problem, and it admits an analytical expression in particular cases. We also show that the posterior inference problem with our optimistic likelihood approximation enjoys strong theoretical performance guarantees, and it performs competitively in a probabilistic classification task.", "full_text": "Optimistic Distributionally Robust Optimization\nfor Nonparametric Likelihood Approximation\n\nViet Anh Nguyen\nSoroosh Sha\ufb01eezadeh-Abadeh\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, Switzerland\n{viet-anh.nguyen, soroosh.shafiee}@epfl.ch\n\nMan-Chung Yue\n\nThe Hong Kong Polytechnic University, Hong Kong\n\nmanchung.yue@polyu.edu.hk\n\nDaniel Kuhn\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, Switzerland\n\ndaniel.kuhn@epfl.ch\n\nImperial College Business School, United Kingdom\n\nWolfram Wiesemann\n\nww@imperial.ac.uk\n\nAbstract\n\nThe likelihood function is a fundamental component in Bayesian statistics. How-\never, evaluating the likelihood of an observation is computationally intractable\nin many applications. 
In this paper, we propose a non-parametric approximation of the likelihood that identifies a probability measure which lies in the neighborhood of the nominal measure and that maximizes the probability of observing the given sample point. We show that when the neighborhood is constructed by the Kullback-Leibler divergence, by moment conditions or by the Wasserstein distance, then our optimistic likelihood can be determined through the solution of a convex optimization problem, and it admits an analytical expression in particular cases. We also show that the posterior inference problem with our optimistic likelihood approximation enjoys strong theoretical performance guarantees, and it performs competitively in a probabilistic classification task.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Bayesian statistics is a versatile mathematical framework for estimation and inference, with applications in bioinformatics [1], computational biology [40, 41], neuroscience [50], natural language processing [24, 34], computer vision [21, 25], robotics [13], machine learning [28, 46], etc. A Bayesian inference model is composed of an unknown parameter θ from a known parameter space Θ, an observed sample point x from a sample space X ⊆ R^m, a likelihood measure (or conditional density) p(·|θ) over X and a prior distribution π(·) over Θ. The key objective of Bayesian statistics is the computation of the posterior distribution p(·|x) over Θ upon observing x.

Unfortunately, computing the posterior is a challenging task in practice. Bayes' theorem, which relates the posterior to the prior [42, Theorem 1.31], requires the evaluation of both the likelihood function p(·|θ) and the evidence p(x). Evaluating the likelihood p(·|θ) at an observation x ∈ X is an intractable problem in many situations.
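When Θ is finite and both the likelihood and the prior can be evaluated, Bayes' theorem itself is straightforward to apply; the obstacle discussed above is obtaining the likelihood and evidence values in the first place. A minimal sketch of the finite case (the numeric inputs in the usage example are invented for illustration):

```python
def discrete_posterior(prior, likelihood_x):
    """Bayes' theorem on a finite parameter space: the posterior is the
    normalized product of prior and likelihood; the normalizing constant
    is the evidence p(x) = sum over theta of p(x|theta) * pi(theta)."""
    joint = [p * l for p, l in zip(prior, likelihood_x)]
    evidence = sum(joint)
    return [j / evidence for j in joint]
```

For instance, discrete_posterior([0.5, 0.5], [0.2, 0.8]) yields [0.2, 0.8]; the entire difficulty in the settings considered below is that the second argument cannot be evaluated exactly.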
For example, the statistical model may contain hidden variables ζ, and the likelihood p(x|θ) can only be computed by marginalizing out the hidden variables, p(x|θ) = ∫ p(x, ζ|θ) dζ [32, p. 322]. In the g-and-k model, the density function does not exist in closed form and can only be expressed in terms of the derivatives of quantile functions, which implies that p(x|θ) needs to be computed numerically for each individual observation x [18]. Likewise, evaluating the evidence p(x) is intractable whenever the evaluation of the likelihood p(x|θ) is. To avoid calculating p(x) in the process of constructing the posterior, the variational Bayes approach [8] maximizes the evidence lower bound (ELBO), which is tantamount to solving

    min_{Q∈Q} KL(Q ‖ π) − E_Q[log p(x|θ)],   (1)

where KL(Q ‖ π) denotes the Kullback-Leibler (KL) divergence from Q to π. One can show that if the feasible set Q contains all probability measures supported on Θ, then the optimal solution Q⋆ of (1) coincides with the true posterior distribution. Consequently, inferring the posterior is equivalent to solving the convex optimization problem (1), which depends only on the prior distribution π and the likelihood p(x|θ). There are scalable algorithms to solve the ELBO maximization problem [19], and the variational Bayes approach has been successfully applied in inference tasks [15, 16], reinforcement learning [20, 30], dimensionality reduction [33] and the training of deep neural networks [22]. Nevertheless, the variational Bayes approach requires both perfect knowledge and a tractable representation of the likelihood p(x|θ), which is often not available in practice.

While the likelihood p(x|θ) may be intractable to compute, we can approximate p(x|θ) from available data in many applications. For example, in the classification task where Θ = {θ1, . . . , θC} denotes the class labels, the class conditional probabilities p(x|θi) and the prior distribution π(θi) can be inferred from the training data, and a probabilistic classifier can be constructed by assigning x to each class randomly under the posterior distribution [7, p. 43]. Approximating the intractable likelihood from available samples is also the key ingredient of approximate Bayesian computation (ABC), a popular statistical method for likelihood-free inference that has gained widespread success in various fields [2, 12, 47]. The sampling-based likelihood algorithm underlying ABC assumes that we have access to a simulation device that can generate N i.i.d. samples x̂1, . . . , x̂N from p(·|θ), and it approximates the likelihood p(x|θ) by the surrogate p_h(x|θ) defined as

    p_h(x|θ) = ∫_X K_h(d(x, x̂)) p(x̂|θ) dx̂ ≈ (1/N) ∑_{j=1}^N K_h(d(x, x̂j)),   (2)

where K_h is a kernel function with kernel width h, d(·,·) is a distance on X, and the approximation is due to the reliance upon finitely many samples [37, 39].

In this paper, we propose an alternative approach to approximate the likelihood p(x|θ). We assume that the sample space X is countable, and hence p(·|θ) is a probability mass function. We model the decision maker's nominal belief about p(·|θ) by a nominal probability mass function ν̂_θ supported on X, which in practice typically represents the empirical distribution supported on the (possibly simulated) training samples. We then approximate the likelihood p(x|θ) by the optimal value of the following non-parametric optimistic likelihood problem

    sup_{ν ∈ B_θ(ν̂_θ)} ν(x),   (3)

where B_θ(ν̂_θ) is a set that contains all probability mass functions in the vicinity of ν̂_θ. In the distributionally robust optimization literature, the set B_θ(ν̂_θ) is referred to as the ambiguity set [3, 29, 49]. In contrast to the distributionally robust optimization paradigm, which would look for a worst-case measure that minimizes the probability of observing x among all measures contained in B_θ(ν̂_θ), the optimistic likelihood problem (3) determines a best-case measure that maximizes this quantity. Thus, problem (3) is closely related to the literature on practicing optimism upon facing ambiguity, which has been shown to be beneficial in multi-armed bandit problems [10], planning [31], classification [6], image denoising [17], Bayesian optimization [9, 45], etc.

The choice of the set B_θ(ν̂_θ) in (3) directly impacts the performance of the optimistic likelihood approach. In the limiting case where B_θ(ν̂_θ) approaches a singleton {ν̂_θ}, the optimistic likelihood problem recovers the nominal estimate ν̂_θ(x). Since this approximation is only reasonable when ν̂_θ(x) > 0, which is often violated when ν̂_θ is estimated from few training samples, a strictly positive size of B_θ(ν̂_θ) is preferred. Ideally, the shape of B_θ(ν̂_θ) is chosen so that problem (3) is computationally tractable and at the same time offers a promising approximation quality. We explore in this paper three different constructions of B_θ(ν̂_θ): the Kullback-Leibler divergence [3], a description based on moment conditions [14, 27] and the Wasserstein distance [23, 35, 38, 43, 44].

The contributions of this paper may be summarized as follows.

1. We show that when B_θ(ν̂_θ) is constructed using the KL divergence, the optimistic likelihood (3) reduces to a finite convex program, which in specific cases admits an analytical solution. However, this approach does not satisfactorily approximate p(x|θ) for previously unseen samples x.

2. We demonstrate that when B_θ(ν̂_θ) is constructed using moment conditions, the optimistic likelihood (3) can be computed in closed form. However, since strikingly different distributions can share the same lower-order moments, this approach is often not flexible enough to accurately capture the tail behavior of ν̂_θ.

3. We show that when B_θ(ν̂_θ) is constructed using the Wasserstein distance, the optimistic likelihood (3) coincides with the optimal value of a linear program that can be solved using a greedy heuristic. Interestingly, this variant of the optimistic likelihood results in a likelihood approximation whose decay pattern resembles that of an exponential kernel approximation.

4. We use our optimistic likelihood approximation in the ELBO problem (1) for posterior inference. We prove that the resulting posterior inference problems under the KL divergence and the Wasserstein distance enjoy strong theoretical guarantees, and we illustrate their promising empirical performance in numerical experiments.

While this paper focuses on the non-parametric approximation of the likelihood p(x|θ), we emphasize that the optimistic likelihood approach can also be applied in the parametric setting. More specifically, if p(·|θ) belongs to the family of Gaussian distributions, then the optimistic likelihood approximation can be solved efficiently using geodesically convex optimization [36].

The remainder of the paper is structured as follows.
We study the optimistic likelihood problem under the KL ambiguity set, under moment conditions and under the Wasserstein distance in Sections 2–4, respectively. Section 5 provides a performance guarantee for the posterior inference problem using our optimistic likelihood. All proofs and additional material are relegated to the Appendix. In Sections 2–4, the development of the theoretical results is generic, and hence the dependence of ν̂_θ and B_θ(ν̂_θ) on θ is omitted to avoid clutter.

Notation. We denote by M(X) the set of all probability mass functions supported on X, and we refer to the support of ν ∈ M(X) as supp(ν). For any z ∈ X, δ_z is the delta-Dirac measure at z. For any N ∈ N₊, we use [N] to denote the set {1, . . . , N}. 1_x(·) is the indicator function at x, i.e., 1_x(ξ) = 1 if ξ = x, and 1_x(ξ) = 0 otherwise.

2 Optimistic Likelihood using the Kullback-Leibler Divergence

We first consider the optimistic likelihood problem where the ambiguity set is constructed using the KL divergence. The KL divergence is the starting point of the ELBO maximization problem (1), and thus it is natural to explore its potential in our likelihood approximation.

Definition 2.1 (KL divergence). Let ν1, ν2 be two probability mass functions on X such that ν1 is absolutely continuous with respect to ν2.
The KL divergence between ν1 and ν2 is defined as

    KL(ν1 ‖ ν2) ≜ ∑_{z∈X} f(ν1(z)/ν2(z)) ν2(z),

where f(t) = t log(t) − t + 1.

We now consider the KL divergence ball B_KL(ν̂, ε) centered at the empirical distribution ν̂ with radius ε ≥ 0, that is,

    B_KL(ν̂, ε) = {ν ∈ M(X) : KL(ν̂ ‖ ν) ≤ ε}.   (4)

Moreover, we assume that the nominal distribution ν̂ is supported on N distinct points x̂1, . . . , x̂N, that is, ν̂ = ∑_{j∈[N]} ν̂_j δ_{x̂j} with ν̂_j > 0 for all j ∈ [N] and ∑_{j∈[N]} ν̂_j = 1.

The set B_KL(ν̂, ε) is not weakly compact because X can be unbounded, and thus the existence of a probability measure that optimizes the optimistic likelihood problem (3) over the feasible set B_KL(ν̂, ε) is not immediate. The next proposition asserts that the optimal solution exists, and it provides structural insights about the support of the optimal measure.

Proposition 2.2 (Existence of optimizers; KL ambiguity). For any ε ≥ 0 and x ∈ X, there exists a measure ν⋆_KL ∈ B_KL(ν̂, ε) such that

    sup_{ν∈B_KL(ν̂,ε)} ν(x) = ν⋆_KL(x).   (5)

Moreover, ν⋆_KL is supported on at most N + 1 points satisfying supp(ν⋆_KL) ⊆ supp(ν̂) ∪ {x}.

Proposition 2.2 suggests that the optimistic likelihood problem (5), inherently an infinite-dimensional problem whenever X is infinite, can be formulated as a finite-dimensional problem. The next theorem provides a finite convex programming reformulation of (5).

Theorem 2.3 (Optimistic likelihood; KL ambiguity).
For any ε ≥ 0 and x ∈ X,

• if x ∈ supp(ν̂), then problem (5) can be reformulated as the finite convex optimization problem

    sup_{ν∈B_KL(ν̂,ε)} ν(x) = max { ∑_{j∈[N]} y_j 1_x(x̂j) : y ∈ R^N_{++}, ∑_{j∈[N]} ν̂_j log(ν̂_j/y_j) ≤ ε, e⊤y = 1 },

where e is the vector of all ones;

• if x ∉ supp(ν̂), then problem (5) has the optimal value 1 − exp(−ε).

Theorem 2.3 indicates that the determining factor in the KL optimistic likelihood approximation is whether the observation x belongs to the support of the nominal measure ν̂ or not. If x ∉ supp(ν̂), then the optimal value of (5) does not depend on x, and the KL divergence approach assigns a flat likelihood. Interestingly, in Appendix B.2 we prove a similar result for the wider class of f-divergences, which contains the KL divergence as a special case. While this flat likelihood behavior may be useful in specific cases, one would expect the relative distance of x to the atoms of ν̂ to influence the optimal value of the optimistic likelihood problem, similar to the neighborhood-based intuition reflected in the kernel approximation approach. Unfortunately, the lack of an underlying metric in its definition implies that the f-divergence family cannot capture this intuition, and thus f-divergence ambiguity sets are not an attractive option to approximate the likelihood of an observation x that does not belong to the support of the nominal measure ν̂.

Remark 2.4 (On the order of the measures).
An alternative construction of the KL ambiguity set, which has been widely used in the literature [3], is

    B̂_KL(ν̂, ε) = {ν ∈ M(X) : KL(ν ‖ ν̂) ≤ ε},

where the two measures ν and ν̂ change roles. However, in this case the KL divergence imposes that all ν ∈ B̂_KL(ν̂, ε) are absolutely continuous with respect to ν̂. In particular, if x ∉ supp(ν̂), then ν(x) = 0 for all ν ∈ B̂_KL(ν̂, ε), and B̂_KL(ν̂, ε) is not able to approximate the likelihood of x in a meaningful way.

3 Optimistic Likelihood using Moment Conditions

In this section we study the optimistic likelihood problem (3) when the ambiguity set B(ν̂) is specified by moment conditions. For tractability purposes, we focus on ambiguity sets B_MV(ν̂) that contain all distributions which share the same mean μ̂ ∈ R^m and covariance matrix Σ̂ ∈ S^m_{++} with the nominal distribution ν̂. Formally, this moment ambiguity set B_MV(ν̂) can be expressed as

    B_MV(ν̂) = {ν ∈ M(X) : E_ν[x̃] = μ̂, E_ν[x̃x̃⊤] = Σ̂ + μ̂μ̂⊤}.

The optimistic likelihood (3) over the ambiguity set B_MV(ν̂) is a moment problem that is amenable to a well-known reformulation as a polynomial-time solvable semidefinite program [5]. Surprisingly, in our case the optimal value of the optimistic likelihood problem is available in closed form. This result was first discovered in [26], and a proof using optimization techniques can be found in [4].

Theorem 3.1 (Optimistic likelihood; mean-variance ambiguity [4, 26]). Suppose that ν̂ has the mean vector μ̂ ∈ R^m and the covariance matrix Σ̂ ∈ S^m_{++}. For any x ∈ X, the optimistic likelihood problem (3) over the moment ambiguity set B_MV(ν̂) has the optimal value

    sup_{ν∈B_MV(ν̂)} ν(x) = 1 / (1 + (x − μ̂)⊤Σ̂^{−1}(x − μ̂)) ∈ (0, 1].   (6)

The optimal value (6) of the optimistic likelihood problem depends on the location of the observed sample point x, and hence the moment ambiguity set captures the behavior of the likelihood function in a more realistic way than the KL divergence ambiguity set from Section 2. Moreover, the moment ambiguity set B_MV(ν̂) does not depend on any hyper-parameters that need to be tuned. However, since the construction of B_MV(ν̂) only relies on the first two moments of the nominal distribution ν̂, it fails to accurately capture the tail behavior of ν̂, see Appendix B.3. This motivates us to look further for an ambiguity set that faithfully accounts for the tail behavior of ν̂.

4 Optimistic Likelihood using the Wasserstein Distance

We now study a third construction for the ambiguity set B(ν̂), which is based on the type-1 Wasserstein distance (also commonly known as the Monge-Kantorovich distance), see [48]. Contrary to the KL divergence, the Wasserstein distance inherently depends on the ground metric of the sample space X.

Definition 4.1 (Wasserstein distance).
The type-1 Wasserstein distance between two measures ν1, ν2 ∈ M(X) is defined as

    W(ν1, ν2) ≜ inf_{λ∈Λ(ν1,ν2)} E_λ[d(x1, x2)],

where Λ(ν1, ν2) denotes the set of all distributions on X × X with the first and second marginal distributions being ν1 and ν2, respectively, and d is the ground metric of X.

The Wasserstein ball B_W(ν̂, ε) centered at the nominal distribution ν̂ with radius ε ≥ 0 is

    B_W(ν̂, ε) = {ν ∈ M(X) : W(ν, ν̂) ≤ ε}.   (7)

We first establish a structural result for the optimistic likelihood problem over the Wasserstein ambiguity set. This is the counterpart to Proposition 2.2 for the KL divergence.

Proposition 4.2 (Existence of optimizers; Wasserstein ambiguity). For any ε ≥ 0 and x ∈ X, there exists a measure ν⋆_W ∈ B_W(ν̂, ε) such that

    sup_{ν∈B_W(ν̂,ε)} ν(x) = ν⋆_W(x).   (8)

Furthermore, ν⋆_W is supported on at most N + 1 points satisfying supp(ν⋆_W) ⊆ supp(ν̂) ∪ {x}.

Leveraging Proposition 4.2, we can show that the optimistic likelihood estimate over the Wasserstein ambiguity set coincides with the optimal value of a linear program whose number of decision variables equals the number of atoms N of the nominal measure ν̂.

Theorem 4.3 (Optimistic likelihood; Wasserstein ambiguity).
For any ε ≥ 0 and x ∈ X, problem (8) is equivalent to the linear program

    sup_{ν∈B_W(ν̂,ε)} ν(x) = max { ∑_{j∈[N]} T_j : T ∈ R^N_+, ∑_{j∈[N]} d(x, x̂j) T_j ≤ ε, T_j ≤ ν̂_j for all j ∈ [N] }.   (9)

The currently best complexity bound for solving a general linear program with N decision variables is O(N^{2.37}) [11], which may be prohibitive when N is large. Fortunately, the linear program (9) can be solved to optimality using a greedy heuristic in quasilinear time.

Proposition 4.4 (Optimal solution via greedy heuristic). The linear program (9) can be solved to optimality by a greedy heuristic in time O(N log N).

Example 4.5 (Qualitative comparison with kernel methods). Let m = 1, d(x, x̂) = ‖x − x̂‖_1 and ν̂ = 0.5 δ_{−1} + 0.5 δ_1. Figure 1 compares the approximation of p(x|θ) by the Wasserstein optimistic likelihood with those of the finite sample kernel approximations (2) with K_h(u) = K(h^{−1}u), where the kernel K is exponential with K(y) = exp(−y), uniform with K(y) = 1[|y| ≤ 1], or Epanechnikov with K(y) = (3/4)(1 − y²) 1[|y| ≤ 1].
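The text of Proposition 4.4 does not spell out the greedy scheme itself. The following sketch gives our reading of the linear program (9) for a one-dimensional sample space: transport mass to x from the nearest atoms of ν̂ first, until the Wasserstein budget ε is spent.

```python
def wasserstein_optimistic_likelihood(x, atoms, weights, eps):
    """Greedy solution of the linear program (9): move mass from the atoms
    of the nominal measure to x, cheapest (nearest) atoms first, subject to
    the caps T_j <= weight_j and the total transport budget eps."""
    budget, value = float(eps), 0.0
    for d, w in sorted((abs(a - x), w) for a, w in zip(atoms, weights)):
        if d == 0.0:
            value += w              # mass already sitting at x is free
            continue
        move = min(w, budget / d)   # moving one unit of mass costs d
        value += move
        budget -= move * d
        if budget <= 0.0:
            break
    return value
```

On the nominal measure ν̂ = 0.5 δ_{−1} + 0.5 δ_1 of Example 4.5 with ε = 0.2, this returns 0.2 at x = 0 (all budget spent at distance 1) and 0.6 at x = 1 (the local atom plus 0.1 units moved across distance 2).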
While both the uniform and the Epanechnikov kernel may produce an approximation value of 0 when x is far away from the support of ν̂, the Wasserstein approximation always returns a positive likelihood when ε > 0 (see Corollary A.2). Qualitatively, the Wasserstein approximation exhibits a decay pattern similar to that of the finite sample average exponential kernel approximation.

Figure 1: Comparison between the Wasserstein approximation (ε = 0.2) and the sample average kernel approximations (h = 1) of p(x|θ).

On one hand, the similarity between the optimistic likelihood over the Wasserstein ambiguity set and the exponential kernel approximation suggests that the kernel approximation can potentially be interpreted in the light of our optimistic distributionally robust optimization framework. On the other hand, and perhaps more importantly, this similarity suggests that there are possibilities to design novel and computationally efficient kernel-like approximations using advanced optimization techniques. Even though the assumption that p(·|θ) is a probability mass function is fundamental for our approximation, we believe that our approach can be utilized in the ABC setting even when p(·|θ) is a probability density function. We leave these ideas for future research.

Appendix B.3 illustrates further how the Wasserstein ambiguity set offers a better tail approximation of the nominal measure ν̂ than the ambiguity set based on moment conditions. Interestingly, the Wasserstein approximation can also be generalized to approximate the log-likelihood of a batch of i.i.d. observations, see Appendix B.4.

5 Application to the ELBO Problem

Motivated by the fact that the likelihood p(x|θ) is intractable to compute in many practical applications, we use our optimistic likelihood approximation (3) as a surrogate for p(x|θ) in the ELBO problem (1).
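As noted in the introduction, when the feasible set contains all probability measures on Θ, the optimum of the ELBO problem (1) is the true posterior, i.e. the normalized product of prior and likelihood. On a finite parameter space this claim can be checked numerically; the sketch below evaluates the finite-dimensional ELBO objective directly (the prior and likelihood values are invented for illustration, and the surrogate approach below simply swaps exact likelihoods for optimistic ones).

```python
import math

def elbo_objective(q, prior, lik):
    """Finite-dimensional ELBO objective: KL(q || prior) minus the
    expected log-likelihood under q, i.e. sum_i q_i*log(q_i/(pi_i*l_i))."""
    return sum(qi * (math.log(qi / pi) - math.log(li))
               for qi, pi, li in zip(q, prior, lik) if qi > 0)

# Claimed minimizer over the whole simplex: q_i proportional to pi_i * l_i.
prior, lik = [0.3, 0.7], [0.5, 0.1]
weights = [p * l for p, l in zip(prior, lik)]
q_star = [w / sum(weights) for w in weights]
```

Scanning other points of the simplex confirms that q_star attains the smallest objective value, which is minus the log-evidence.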
In this section, we will focus on the KL divergence and the Wasserstein ambiguity sets, and we will impose the following assumptions.

Assumption 5.1 (Finite parameter space). We assume that Θ = {θ1, . . . , θC} for some C ≥ 2.

Assumption 5.2 (I.i.d. sampling and empirical distribution). For every i ∈ [C], we have N_i i.i.d. samples x̂_{ij}, j ∈ [N_i], from the conditional probability p(·|θ_i). Furthermore, each nominal distribution ν̂_i is given by the empirical distribution ν̂^{N_i}_i = N_i^{−1} ∑_{j∈[N_i]} δ_{x̂_{ij}} on the samples x̂_{ij}.

Assumption 5.1 is necessary for our approach because we approximate p(x|θ) separately for every θ ∈ Θ. Under this assumption, the prior distribution π can be expressed by the C-dimensional vector π ∈ R^C_+, and the ELBO program (1) becomes the finite-dimensional convex optimization problem

    J^true = min_{q∈Q} ∑_{i∈[C]} q_i (log q_i − log π_i) − ∑_{i∈[C]} q_i log p(x|θ_i),   (10)

where by a slight abuse of notation, Q is now a subset of the C-dimensional simplex. Assumption 5.2, on the other hand, is a standard assumption in the nonparametric setting, and it allows us to study the statistical properties of our optimistic likelihood approximation.

We approximate p(x|θ_i) for each θ_i by the optimal value of the optimistic likelihood problem (3):

    p(x|θ_i) ≈ sup_{ν_i ∈ B^{N_i}_i(ν̂^{N_i}_i)} ν_i(x).   (11)

Here, B^{N_i}_i(ν̂^{N_i}_i) is the KL divergence or Wasserstein ambiguity set centered at the empirical distribution ν̂^{N_i}_i. Under Assumptions 5.1 and 5.2, a surrogate model of the ELBO problem (1) is then

    Ĵ_{B^N} = min_{q∈Q} ∑_{i∈[C]} q_i (log q_i − log π_i) − ∑_{i∈[C]} q_i log ( sup_{ν_i∈B^{N_i}_i(ν̂^{N_i}_i)} ν_i(x) ),   (12)

where we use B^N to denote the collection of ambiguity sets {B^{N_i}_i(ν̂^{N_i}_i)}_{i=1}^C with N = ∑_i N_i.

We now study the statistical properties of problem (12). We first present an asymptotic guarantee for the KL divergence. Towards this end, we define the disappointment as P^∞(J^true < Ĵ_{B^N}).

Theorem 5.3 (Asymptotic guarantee; KL ambiguity). Suppose that Assumptions 5.1 and 5.2 hold. For each i ∈ [C], let B^{N_i}_i(ν̂^{N_i}_i) = B_KL(ν̂^{N_i}_i, ε_i) for some ε_i > 0, and set n ≜ min{N_1, . . . , N_C}. We then have

    lim sup_{n→∞} (1/n) log P^∞(J^true < Ĵ_{B^N}) ≤ − min_{i∈[C]} ε_i < 0.

Theorem 5.3 shows that as the number of training samples N_i for each i ∈ [C] grows, the disappointment decays exponentially at a rate of at least min_i ε_i.

We next study the statistical properties of problem (12) when each B^{N_i}_i(ν̂^{N_i}_i) is a Wasserstein ball. To this end, we additionally impose the following assumption, which essentially requires that the tail of each distribution p(·|θ_i), i ∈ [C], decays at an exponential rate.

Assumption 5.4 (Light-tailed conditional distribution).
For each i ∈ [C], there exists an exponent a_i > 1 such that A_i ≜ E[exp(‖x‖^{a_i})] < ∞, where the expectation is taken with respect to p(·|θ_i).

Theorem 5.5 (Finite sample guarantee; Wasserstein ambiguity). Suppose that Assumptions 5.1, 5.2 and 5.4 hold, and fix any β ∈ (0, 1). Assume that m ≠ 2 and that B^{N_i}_i(ν̂^{N_i}_i) = B_W(ν̂^{N_i}_i, ε_i(β, C, N_i)) for every i ∈ [C] with

    ε_i(β, C, N_i) ≜ ( log(k_{i1} C β^{−1}) / (k_{i2} N_i) )^{1/max{m,2}}   if N_i ≥ log(k_{i1} C β^{−1}) / k_{i2},
    ε_i(β, C, N_i) ≜ ( log(k_{i1} C β^{−1}) / (k_{i2} N_i) )^{1/a_i}         if N_i < log(k_{i1} C β^{−1}) / k_{i2},

and k_{i1}, k_{i2} are positive constants that depend on a_i, A_i and m. We then have P^N(J^true < Ĵ_{B^N}) ≤ β.

Theorem 5.5 provides a finite sample guarantee for the disappointment of problem (12) under a specific choice of radii for the Wasserstein balls.

Theorem 5.6 (Asymptotic guarantee for Wasserstein). Suppose that Assumptions 5.1, 5.2 and 5.4 hold. For each i ∈ [C], let β_{N_i} ∈ (0, 1) be a sequence such that ∑_{N_i=1}^∞ β_{N_i} < ∞ and B^{N_i}_i(ν̂^{N_i}_i) = B_W(ν̂^{N_i}_i, ε_i(β_{N_i}, C, N_i)), where ε_i is defined as in Theorem 5.5. Then Ĵ_{B^N} → J^true as N_1, . . . , N_C → ∞ almost surely.

Theorem 5.6 offers an asymptotic guarantee which asserts that as the numbers of training samples N_i grow, the optimal value of (12) converges to that of the ELBO problem (10).

6 Numerical Experiments

We first showcase the performance guarantees from the previous section on a synthetic dataset in Section 6.1. Afterwards, Section 6.2 benchmarks the performance of the different likelihood approximations in a probabilistic classification task on standard UCI datasets.
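Before turning to the experiments, the two-regime radius ε_i(β, C, N_i) of Theorem 5.5 can be read as follows: the exponent switches from 1/max{m, 2} to 1/a_i exactly where the parenthesized ratio crosses one. A sketch of this reading, with placeholder values standing in for the unspecified constants k_{i1}, k_{i2} (which depend on a_i, A_i and m and are not given explicitly here):

```python
import math

def wasserstein_radius(beta, C, N, k1, k2, a, m):
    """Two-regime radius schedule from Theorem 5.5; k1 and k2 are
    placeholder inputs for the theorem's unspecified constants."""
    t = math.log(k1 * C / beta) / (k2 * N)
    if N >= math.log(k1 * C / beta) / k2:   # large-sample regime, t <= 1
        return t ** (1.0 / max(m, 2))
    return t ** (1.0 / a)                   # small-sample regime, t > 1
```

With made-up constants, the radius shrinks as the sample size N grows, which is the behavior Theorems 5.5 and 5.6 rely on.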
The source code, including our algorithm and all tests implemented in Python, is available from https://github.com/sorooshafiee/Nonparam_Likelihood.

6.1 Synthetic Dataset: Beta-Binomial Inference

We consider the beta-binomial problem in which the prior π, the likelihood p(x|θ), and the posterior distribution q(θ|x) have the following forms:

π(θ) = Beta(θ|α, β),   p(x|θ) = Bin(x|M, θ),   q(θ|x) = Beta(θ|x + α, M − x + β).

Figure 2: Average KL divergence between q̂ that solves (12) and the discretized posterior q_discretize(·|x) as a function of ε and N_i. (a) KL divergence; (b) Wasserstein distance.

Figure 3: Optimally tuned performance of different approximation schemes with varying N_i.

We emphasize that in this setting, the posterior distribution is known in closed form, and the main goal is to study the properties of the optimistic ELBO problem (12) and the convergence of the solution of problem (12) to the true posterior distribution. We impose a uniform prior distribution π by setting α = β = 1. The finite parameter space Θ = {θ_1, . . . , θ_C} contains C = 20 equidistant discrete points in the range (0, 1). For simplicity, we set N_1 = . . . = N_C in this experiment.

We conduct the following experiment for different training set sizes N_i ∈ {1, 2, 4, 8, 10} and different ambiguity set radii ε. For each parameter setting, our experiment consists of 100 repetitions. In each repetition, we randomly generate an observation x from a binomial distribution with M = 20 trials and success probability θ_true = 0.6. We then find the distribution q̂ that solves problem (12) using both the KL and the Wasserstein approximation. In a similar way, we find q̂ by solving (10), where p(x|θ) is approximated using the exponential kernel of the likelihood (2) with varying kernel width. We evaluate the quality of the computed posteriors q̂ from the different approximations based on the KL divergences of q̂ to the true discretized posterior q_discretize(θ_i|x) ∝ Beta(θ_i|x + α, M − x + β).

Figures 2(a) and 2(b) depict the average quality of q̂ with different radii. One can readily see that the optimal size of the ambiguity set that minimizes KL(q̂ ‖ q_discretize(·|x)) decreases as N_i increases for both the KL and the Wasserstein approximation. Figure 3 depicts the performance of the optimally tuned approximations with different sample sizes N_i. We notice that the optimistic likelihood over the Wasserstein ambiguity set is comparable to the exponential kernel approximation.

6.2 Real World Dataset: Classification

We now consider a probabilistic classification setting with C = 2 classes. For each class i = 1, 2, we have access to N_i observations denoted by {x̂_ij}_{j∈[N_i]}. The nominal class-conditional probability distributions are the empirical measures, that is, ν̂_i = N_i^{−1} ∑_{j∈[N_i]} δ_{x̂_ij} for i = 1, 2. The prior distribution π is also estimated from the training data as π(θ_i) = N_i/N, where N = N_1 + N_2 is the total number of training samples. Upon observing a test sample x, the goal is to compute the posterior distribution q̂ by solving the optimization problem (12) using different approximation schemes. We subsequently use the posterior q̂ as a probabilistic classifier. In this experiment, we exclude the KL divergence approximation because x ∉ supp(ν̂_i) most of the time.

In our experiments involving the Wasserstein ambiguity set, we randomly select 75% of the available data as training set and the remaining 25% as test set. We then use the training samples to tune the radii ε_i ∈ {a√m · 10^b : a ∈ {1, . . . , 9}, b ∈ {−3, −2, −1}}, i = 1, 2, of the Wasserstein balls by a stratified 5-fold cross validation. For the moment based approximation, there is no hyper-parameter to tune, and all data is used as training set. We compare the performance of the classifiers from our optimistic likelihood approximation against the classifier selected by the exponential kernel approximation as a benchmark.

Table 1 presents the results on standard UCI benchmark datasets. All results are averages across 10 independent trials. The table shows that our optimistic likelihood approaches often outperform the exponential kernel approximation in classification tasks.

Acknowledgments   We gratefully acknowledge financial support from the Swiss National Science Foundation under grant BSCGI0_157733 as well as the EPSRC grants EP/M028240/1, EP/M027856/1 and EP/N020030/1.

Table 1: Average area under the precision-recall curve for various UCI benchmark datasets.
Bold numbers correspond to the best performances.

Dataset                   Exponential   Moment   Wasserstein
Banknote Authentication        100.00    99.05         99.99
Blood Transfusion               68.23    64.91         71.28
Breast Cancer                   97.99    97.58         99.26
Climate Model                   93.40    93.80         81.94
Cylinder                        86.23    76.74         75.00
Fourclass                      100.00    99.95         82.77
German Credit                   75.11    67.58         75.50
Haberman                        71.10    70.82         70.20
Heart                           75.86    78.77         86.87
Housing                         82.04    75.62         81.89
ILPD                            69.88    71.54         72.95
Ionosphere                      98.79    91.02         97.05
Mammographic Mass               87.86    83.46         86.53
Pima                            80.48    79.61         82.37
QSAR                            90.21    84.44         90.85
Seismic Bumps                   65.89    74.81         75.68
Sonar                           93.85    85.66         83.49
Thoracic Surgery                56.32    54.84         64.73
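For reference, the metric reported in Table 1, the area under the precision-recall curve, can be computed as an average-precision score over the ranking induced by the classifier's posterior probabilities. The snippet below is an illustrative pure-Python sketch of such a score, not the paper's own evaluation code (which lives in the linked repository).

```python
def average_precision(y_true, scores):
    """Area under the precision-recall curve, accumulated step-wise
    over the ranking induced by descending classifier scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(y_true)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:  # a true positive at this rank
            tp += 1
            precision = tp / rank
            recall = tp / total_pos
            ap += (recall - prev_recall) * precision
            prev_recall = recall
    return ap
```

Here `scores` would be the test-set posterior probabilities q̂(θ_1|x) of the positive class; a perfect ranking attains an area of 1.0.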