Paper ID: 9324

Title: Optimistic Distributionally Robust Optimization for Nonparametric Likelihood Approximation

Detailed Comments:

Originality: To the best of my knowledge, the distributionally optimistic approach to approximating the likelihood function, based on constructing an uncertainty set and picking the most likely distribution in this set, is a novel and interesting idea. The idea borrows from the robust optimization community as well as the principle of optimism under uncertainty, and it is intuitively appealing.

Clarity: The paper is for the most part well written and well organized.

Quality/Significance: To the best of my knowledge, the mathematical analysis and proofs are correct. The paper importantly demonstrates that for several important choices of uncertainty sets (e.g., those based on KL and Wasserstein divergences), the optimistic likelihood formulation reduces to a convex optimization problem. Moreover, in the KL and Wasserstein cases, the paper shows that applying this technique within the ELBO problem for posterior inference yields good theoretical asymptotic guarantees. These two classes of results, on reformulations and on asymptotic properties, are the most important to establish for this type of methodology, and the paper does a good job of establishing them. Finally, while the numerical experiments are largely illustrative on toy datasets, they do provide some justification for applying this methodology to posterior inference.

Overall, I found this paper to be a nice read. It lays out the motivation for the problem and then illustrates how one can apply the idea to various notions of a "close distribution," e.g., the KL divergence, the Wasserstein metric, and distributions that match the first and second empirical moments.

One strange thing about this approach is that the optimistic probabilities found at the end may not integrate to 1 (whereas, for example, a kernel density estimate does integrate to 1). For this reason, the optimistic likelihood does not appear to be a likelihood in any traditional sense. Because of this property, I would like to understand better how this new notion of likelihood behaves. Does it relate to any other already-studied notions of convergence? E.g., there are results showing that kernel density estimates converge to the true density under certain regularity assumptions as the number of samples grows and the bandwidth decreases. Is some analog true here?

One area I found a bit confusing was how this could actually be used to solve the problem of Bayesian inference referenced at the beginning of the paper. For example, in Section 6.1, how does one come up with the \hat{v}_i? Does one have to know how to sample from p to produce these distributions? While the ideas presented in the paper are interesting, I am still unsure whether this approach is better than simpler methods like kernel density estimation. I would be more convinced if this method were tried on more practical models, e.g., those where ABC would be necessary.

Detailed comments:
- L74: Should say "denoising."
- L127: Should the v* be v*_{KL}?
- L133-134: What is e in e^T y = 1?
- Figure 1: Is it true that p(x) is unbounded near the poles -1 and 1? Is this desired?
- L224: Is it clear that this approximation really is an approximation? From Figure 1, it appears the density can be unbounded near the x_i.
- L269-275: How does one construct \hat{v}_i in this case?
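To make the question about the constraint e^T y = 1 and normalization concrete: here is a toy sketch of a discrete optimistic likelihood over a KL ball (my own construction, not necessarily the paper's exact formulation). We maximize the probability mass assigned to a new atom x, subject to the candidate distribution staying within KL radius rho of the uniform empirical distribution over the observed atoms; the radius rho and all names are hypothetical. For uniform empirical weights, symmetry gives the closed-form optimum 1 - e^{-rho}.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy setup (not the paper's exact formulation): n observed
# atoms with uniform empirical weights p_hat, plus one new atom x that
# carries no empirical mass.
n, rho = 5, 0.1
p_hat = np.full(n, 1.0 / n)

def neg_mass_at_x(q):
    return -q[-1]  # maximize the probability assigned to the new atom

def kl_slack(q):
    # rho - KL(p_hat || q restricted to observed atoms); feasible when >= 0
    return rho - np.sum(p_hat * np.log(p_hat / np.clip(q[:n], 1e-12, None)))

cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0},  # e^T q = 1
        {"type": "ineq", "fun": kl_slack}]
q0 = np.full(n + 1, 1.0 / (n + 1))
res = minimize(neg_mass_at_x, q0, bounds=[(0.0, 1.0)] * (n + 1),
               constraints=cons)

# For uniform p_hat, the optimum puts 1 - exp(-rho) of mass on the new atom.
print(round(res.x[-1], 4), round(1 - np.exp(-rho), 4))
```

Note that the normalization e^T q = 1 is enforced inside a single optimization; the reviewers' concern is precisely that running such a problem separately per evaluation point loses this joint normalization.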
Originality: The paper presents a new approach for fitting posterior distributions in a non-parametric fashion. One limitation is that the distributions must be discrete, and they can be expensive to use as the dimension and the data size increase.

Quality: The paper appears to be correct, although I have not checked the proofs from Section 5.

Clarity: The paper is quite clear and nice to read. There are a few parts where some details are omitted (see the comments above), but overall it was easy to follow.

Significance: This is the biggest area preventing a higher score on this work. Based on the limited empirical examples, the fact that Theta must be finite, and the fact that the optimistic likelihood is not a true probability distribution, I am unsure whether this method is truly better than something simpler.
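For reference on the "something simpler" baseline both reviews mention: a kernel density estimate is a few lines of code and, unlike the optimistic likelihood discussed above, always integrates to 1. A minimal sketch (the toy data and bandwidth are my own choices, not from the paper):

```python
import numpy as np

def gaussian_kde(samples, x, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the points x."""
    diffs = (x[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(size=500)          # toy data from N(0, 1)

grid = np.linspace(-8.0, 8.0, 4001)
density = gaussian_kde(samples, grid, bandwidth=0.3)

# Unlike the optimistic likelihood, the KDE is a proper density:
# its total mass is (numerically) 1.
mass = density.sum() * (grid[1] - grid[0])
print(round(mass, 3))  # ≈ 1.0
```

Under standard regularity assumptions, this estimate converges to the true density as the sample size grows and the bandwidth shrinks, which is the convergence analog the first review asks about.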

*Lines 55-63* are confusing. Why does p(\cdot|\theta) represent a discrete distribution? For the classification task the authors mention, p(x|\theta_i) can be supported on some measure that is absolutely continuous (w.r.t. the Lebesgue measure in a finite-dimensional space). One possible explanation is that the observation \hat{x} is discrete, so the empirical measure of \hat{x} is used as the base measure, which seems consistent with the rest of the paper (especially Assumption 2.2); but the authors should make the meaning of the notation p(\cdot|\theta) clear.

*Lines 67-76*: the authors connect their proposed optimistic likelihood estimation with existing optimism-in-the-face-of-uncertainty methods. However, those methods always consider optimistic feedback (e.g., UCB in bandits, planning, and Bayesian optimization), not an optimistic likelihood. Can the authors comment more on this?

*Theorem 2.4*: missing definition of *e*.

The authors only discuss the problem where we have some observations S and want the likelihood of one point x inside or outside S; but how does one deal with the situation where we want to get the likelihood of several evaluation points simultaneously (a likelihood for each observation, not just the sum of log-likelihoods as in Appendix B.4)? This can be handled simply by traditional methods like the KDE and ABC methods introduced in Section 1, and such cases do arise in the real world: for example, when we approximate the ELBO with Monte Carlo estimators, we need to evaluate the likelihood at the samples of q at the same time. We cannot apply this optimistic likelihood approximation individually to each sample, because that may leave the measure unnormalized (since, if I understand correctly, each optimization optimistically assigns density to S \cup {x}).

Overall, this paper proposes a novel optimistic non-parametric likelihood estimation.
The authors provide practical estimators with ambiguity sets constructed from f-divergences, moment conditions, and the Wasserstein distance, and all of the authors' claims come with strong theoretical guarantees. The experiments, however, seem less supportive.
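The Monte Carlo ELBO situation raised above can be sketched generically. In the hypothetical conjugate-Gaussian model below (my own toy example, not from the paper), the likelihood must be evaluated at every sample of q simultaneously; when q is the exact posterior, every per-sample term equals log p(x), so the estimator recovers the evidence exactly:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical conjugate model (toy, not from the paper):
#   z ~ N(0, 1),  x | z ~ N(z, 1)
# Exact posterior: z | x ~ N(x/2, 1/2); evidence: p(x) = N(x; 0, 2).
x = 1.5  # a single observation
mu_q, sd_q = x / 2, np.sqrt(0.5)

rng = np.random.default_rng(1)
z = rng.normal(mu_q, sd_q, size=10_000)   # samples from q

# The likelihood p(x | z_s) is needed at every sample z_s at once.
log_joint = norm.logpdf(x, loc=z, scale=1) + norm.logpdf(z, loc=0, scale=1)
log_q = norm.logpdf(z, loc=mu_q, scale=sd_q)
elbo = np.mean(log_joint - log_q)         # Monte Carlo ELBO estimate

log_evidence = norm.logpdf(x, loc=0, scale=np.sqrt(2))
print(round(elbo, 4), round(log_evidence, 4))  # equal: q is the exact posterior
```

The point of the sketch is the batched `log_joint` evaluation: if each p(x | z_s) term were instead replaced by a separately optimized optimistic likelihood, the per-sample optimism would, as the review notes, break the joint normalization.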