Reviews: Direct Optimization through $\arg \max$ for Discrete Variational Auto-Encoder

The authors present a novel approach for optimizing discrete latent variable models. The approach is a straightforward combination of the recently introduced direct loss minimization technique (originally designed for structured prediction) with the Gumbel-max re-parameterization. It avoids the approximation to the arg max that other methods employ, making it conceptually attractive. The authors apply the method to several datasets and a few use cases (semi-supervised learning, learning structured latent distributions) and show competitive performance compared to existing methods.

Optimizing discrete latent variable models is a basic problem with broad applicability, and methods for optimizing such models are an area of active research. Given the originality of the approach, this work is likely to be of interest to many in the NeurIPS community. The submission appears technically sound in all of its derivations, and the results are competitive with or superior to those of existing methods.

The clarity of the work is decent but could certainly be improved. Since the approach is a relatively straightforward application of two existing methods (Gumbel-max and direct loss minimization), a more thorough and clear summary of those methods as background would increase the clarity of the submission and allow a more general audience to fully appreciate and understand the work. At present, the reader has to consult the referenced methods in depth, even when fairly acquainted with the general approach.

The originality of the approach, its technical quality, and its significance make the paper worthy of acceptance, but the clarity of the presentation detracts from it. Given that the approach appears to be a straightforward application and combination of two existing ideas, the onus falls on the authors to distill and explain these ideas to the reader.
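For readers who want the background this review asks for, the Gumbel-max trick itself can be sketched in a few lines. This is a minimal NumPy illustration of the general trick, not code from the submission; the function name `gumbel_max_sample`, the example logits, and the sample count are all hypothetical choices made for the sketch.

```python
import numpy as np

def gumbel_max_sample(logits, n_samples, rng):
    """Draw categorical samples as the argmax over Gumbel-perturbed logits."""
    logits = np.asarray(logits, dtype=float)
    # One independent standard Gumbel variable per sample and per category.
    gumbel = rng.gumbel(size=(n_samples, logits.shape[0]))
    return np.argmax(logits + gumbel, axis=1)

def softmax(logits):
    """Numerically stable softmax of a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.0, -1.0])  # hypothetical unnormalized log-probabilities
samples = gumbel_max_sample(logits, 200_000, rng)
freq = np.bincount(samples, minlength=len(logits)) / len(samples)

# The empirical frequencies should be close to softmax(logits).
print("empirical:", np.round(freq, 3))
print("softmax:  ", np.round(softmax(logits), 3))
```

The argmax here is exact, which is the property the review highlights: nothing is relaxed at sampling time, so any approximation has to enter through the gradient estimator instead.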

Originality
===========

To the best of my knowledge, the insight that the Gumbel-max trick can be combined with direct loss minimization in the VAE setting is novel. Moreover, the work that had to be done to get the ideas to play well together seems significant and original.

Quality
===========

I have some reservations about the method presented, which is why I've given a slightly negative overall score. However, it seems very plausible that an author response could clarify my concerns and cause me to revise my score upward.

Figure 1:
- Epsilon and tau are different variables in quite different settings. It seems odd to have them share an axis. I'd find the chart more intuitive if one chart showed the bias and variance for GSM and one for Direct.
- I would appreciate some error bars for the standard error of your bias and variance estimates.
----The authors have satisfied me entirely on this suggestion, though I would suggest rewording some of the text based on the very wide error bars on Figure 1.----

Theorem 1:
- I haven't compared to the direct loss paper [28] in immense detail, but I notice that in their proof there are no gradients inside the expectation on the RHS. Do you have an intuitive explanation for why you have gradients here and they don't? You imply a correspondence in "The above theorem closely relates to ..." but as far as I can see plugging this loss into their paper would not give your estimator.
----The authors have corrected my misunderstanding of this point. Thank you.----
- I'm slightly puzzled by the final step of your proof. You set two terms equal to each other by setting epsilon=0, but in the same breath you take the limit of epsilon to 0. I'm not sure you can do both at the same time.
----I think I'm happy with this on reflection.----
- I would appreciate more discussion of the nature of the bias that you introduce by using epsilon != 0.
----I still think that developing this further would improve the paper, but it is good enough without it.----

Experiments comparing loss, e.g., Table 1 and Figure 2:
- I'm very puzzled by the fact that your unbiased estimator is beaten by direct and even GSM as k goes up. You say that this is due to the relative complexity of FashionMNIST, but the effect exists in all the other datasets at k=50, and FashionMNIST isn't more complex than Omniglot. You say it may be due to slower convergence, but did the effect go away when you ran it longer? The models look fairly converged in your plot. You say it may be due to the non-convexity of the bound, but why would this affect direct less than unbiased? Can you explain this remark a little more? My concern is that something odd may be going on that casts some doubt on the experimental evidence.
- I'm also puzzled that the losses for direct and GSM are higher for k=50 than for k=40 on MNIST and Omniglot. Do you have an explanation for that?
----I'm still a bit unsure about what's going on here, but the results seem important and sufficiently solid despite that.----
- Nice-to-have: I'd be interested to see the effect of changing epsilon on your results.
----This would still be nice, but is fine.----
- I'd appreciate standard errors and averaging over multiple random seeds.
----Thank you for including this.----

The experiments with semi-supervised loss or structured encoders seem good. I would love to have some comparisons for Figures 4 and 5 to your baselines, be it GSM or unbiased. Not including them makes me wonder whether there is a detectable improvement, which in turn makes me wonder whether the result is sufficiently significant.
----I understand that these are qualitative and that there are comparisons elsewhere. I was and am still curious about what the practical significance of the new method was relative to baselines on these results. What is already here is obviously fine as far as it goes, but comparing to baselines could make the point stronger.----

Clarity
===========

I think you could cut almost all of the 2nd paragraph on p1, since it is repeated in the related work section.

I found your notation for z slightly confusing. Sometimes you use it as a random variable, and sometimes you use it as an index or as the values the random variable might take on. This also makes your notation in Section 3 inconsistent with your later work, where z is a set of binary random variables. I think you could improve the legibility of your paper a bit by being clearer about these things.

Most graphs have text that is far too small.

Small typos:
- p1 "applied to structure setting[s]"
- p4 "since the gradients" probably should have an apostrophe

In the future, could you please use the submission option of the NeurIPS package? This gives reviewers line numbers, which make it easier to give feedback.

Significance
===========

Low-variance estimators for gradients with discrete latent variables are of significant interest to the community. I am satisfied by the experimental work. There are some slightly unexpected behaviours with the unbiased baseline, but I'm satisfied that they don't reflect a hidden weakness of the presented approach.
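The question raised under Quality about the bias introduced by epsilon != 0 can be made concrete on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' method or code: it assumes an estimator of the form (1/eps) * E[grad h(z_direct) - grad h(z_opt)], with z_opt = argmax(h + gumbel) and z_direct = argmax(eps * f + h + gumbel), and it assumes the simplest possible parameterization h_phi(z) = phi[z], so that the gradient of h with respect to phi is a one-hot vector. Under those assumptions the expectation of the estimator has the closed form (softmax(phi + eps * f) - softmax(phi)) / eps, which can be compared with the exact gradient of E[f(z)]. All names and numeric values are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax of a 1-D array of logits."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def exact_gradient(phi, f):
    """Gradient w.r.t. phi of E_{z ~ softmax(phi)}[f(z)]."""
    p = softmax(phi)
    return p * (f - p @ f)

def expected_direct_estimate(phi, f, eps):
    """Closed-form expectation of the assumed finite-eps estimator (toy setting)."""
    return (softmax(phi + eps * f) - softmax(phi)) / eps

def sampled_direct_estimate(phi, f, eps, n_samples, rng):
    """Monte Carlo version: both argmaxes share the same Gumbel noise."""
    k = phi.shape[0]
    gumbel = rng.gumbel(size=(n_samples, k))
    z_opt = np.argmax(phi + gumbel, axis=1)
    z_direct = np.argmax(eps * f + phi + gumbel, axis=1)
    # With h_phi(z) = phi[z], grad h is one-hot, so the estimate is a count difference.
    diff = np.bincount(z_direct, minlength=k) - np.bincount(z_opt, minlength=k)
    return diff / (n_samples * eps)

phi = np.array([0.5, 0.0, -0.5])  # hypothetical logits
f = np.array([1.0, -2.0, 0.5])    # hypothetical per-category objective values
grad = exact_gradient(phi, f)
rng = np.random.default_rng(0)

for eps in (1.0, 0.1, 0.01):
    bias = np.linalg.norm(expected_direct_estimate(phi, f, eps) - grad)
    mc = sampled_direct_estimate(phi, f, eps, 200_000, rng)
    print(f"eps={eps:<5} bias norm={bias:.4f}  MC estimate={np.round(mc, 3)}")
```

In this toy setting the bias shrinks as eps decreases, while the two argmaxes differ less and less often, so the sampled estimate becomes noisier for a fixed sample budget. That is one way to read the bias and variance trade-off the review asks the authors to discuss; whether it carries over to the full model is not something this sketch can show.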

Paper ID:	3347
Title:	Direct Optimization through $\arg \max$ for Discrete Variational Auto-Encoder
