NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 8930
Title: Explicitly disentangling image content from translation and rotation with spatial-VAE

Reviewer 1

Originality: The formulation of the VAE's decoder as an explicit function of spatial coordinates is novel. The application to electron microscopy data is also novel.

Quality: The work appears technically sound from what I could tell. This work is a bit out of my area of expertise.

Clarity: The writing is clear, with sufficient background.

Significance: The model is novel and adds to the repertoire of generative models for images with explicit, disentangled continuous representations. It opens the door to many applications and further improvements.

Reviewer 2

1. Mentioning some solid motivations and practical uses of rotation- and translation-invariant latent variables would be helpful.

2. It is not very clear where the MLP in Figure 1 is inserted within the VAE. A diagram of the overall architecture would be useful (see the sketch after this review for one possible reading).

3. The method description is a bit too high-level and sparse. It would be great to have some key implementation details.

4. Is a Gaussian approximation appropriate for theta (i.e., an angle)?

5. For the MNIST experiments, were three spatial-VAEs built, one for each (transformed) MNIST dataset? How sensitive are the results to the choice of prior values, e.g., what happens if the same value is used for the prior on theta? Also, is setting D to 2, 3, and 5 reasonable? Based on the ELBO alone, it seems that using a higher D will do the trick. Hence, the usefulness of rotation- and translation-invariant latent variables needs to be shown.

6. All the nicely aligned images in Figures 4 to 6 are great, but what exactly can we extract from them? Some quantitative evaluation, and again a demonstration of usefulness, are highly needed.

------

My comments have been satisfactorily addressed, especially the point regarding quantitative evaluation. I have thus raised the score.
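To make point 2 concrete, here is a minimal sketch of one way such an MLP could sit inside the decoder; the class name, layer sizes, and activations below are illustrative assumptions, not the authors' reported architecture. The shared MLP consumes a pixel coordinate concatenated with the latent vector and emits that pixel's intensity.

```python
import torch
import torch.nn as nn

class SpatialDecoder(nn.Module):
    # Illustrative sketch, not the authors' code: one shared MLP maps a
    # (pixel coordinate, latent vector) pair to that pixel's intensity,
    # replacing the usual image-sized decoder output layer.
    def __init__(self, z_dim, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + z_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # one intensity per queried coordinate
        )

    def forward(self, coords, z):
        # coords: (batch, n_pixels, 2) Cartesian grid; z: (batch, z_dim)
        z = z.unsqueeze(1).expand(-1, coords.size(1), -1)
        return self.mlp(torch.cat([coords, z], dim=-1))  # (batch, n_pixels, 1)
```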

Reviewer 3

The stated contribution of the paper is an unsupervised method on image data (potentially other signals on a 2D plane) for learning latent factors disentangled from orientation-preserving isometries of the plane (compositions of rotation and translation). The authors take the approach of using a single function to model the per-pixel decoder/likelihood function, parameterized by a global latent state and the pixel position (represented by Cartesian coordinates), as opposed to the typical approach of separate functions per pixel. They disentangle orientation-preserving isometries by adding latent variables parameterizing rotation/translation, which are applied to the pixel position before evaluation of the decoder/likelihood function (see the sketch after this review). These latent variables are given Gaussian priors and approximate posteriors. They demonstrate improvements in ELBO for the spatial-VAE vs. a vanilla VAE with the same dimension of global latent variables (excluding spatial-VAE's additional latent dimensions for rotation/translation -- we discuss this further below). They show on a few datasets how image orientation is disentangled in generated images when the rotation/translation variables are fixed.

Overall, we liked the ideas and the work presented. We admit we were somewhat surprised that similar solutions haven't been tried in this area, but the authors list related work, so we conclude the approach is indeed novel and appears effective. We particularly liked Fig. 3 for its improved interpretability. We did have several issues, listed below:

(1) The exposition of the model in section 2.1 should be edited for improved clarity. In particular, the contrast between the vanilla VAE and the spatial-VAE could be better highlighted, and the notation used to represent different parts of the problem should be explained before it is used. Specifically:
(a) The second sentence of this section (lines 60-68) could be broken up for clarity.
(b) Image coordinates should be defined before they are used as part of an inline description of a function. It is not immediately clear from this sentence that there are $n$ pixels indexed by $i$.
(c) It is unclear from the current description that $x^i$ represents a point on the plane rather than a scalar value. Possible ways to address this would be to explicitly parameterize it at first as an ordered pair, or to typeset it in bold.
(d) Figure 1 could separate the components of the first layer coming from the latent variables vs. the spatial coordinates, to show how they go into the MLP.

(2) In Figure 2:
(a) The caption is missing a description distinguishing solid vs. dotted lines.
(b) For comparisons of ELBO, we believe it would be more meaningful to consider the effective total number of latent variables in the model rather than the number of "unconstrained" latent variables. For example, we think that the spatial-VAE with Z-D=2 (first column of the current figure) would best be compared to the vanilla/fixed spatial-VAE with Z-D=5 (third column of the current figure), to account for the 3 latent factors representing rotation/translation. We believe that the discussion/figure should be updated to consider these differences.

(3) In section 2.2, lines 102-104, it is stated that the prior and approximate posterior of $\theta$ are modeled using a Gaussian distribution with zero mean. This conflicts with the definition of the approximate posterior in line 115. We suspect that the intention is for only the prior to be constrained to be mean zero -- this can easily be fixed by removing the statement about the mean from that sentence and subsequently constraining the prior alone in the following sentence.

(4) While we are not vision experts, it seems the work presented here suffers from several limitations compared to approaches taken in that field. First, the orientations/translations are global, while in many vision problems several objects, each with its own transformations, are involved. Second, there is a difference between transformations of the object and transformations of the camera/viewer (pitch/yaw/roll). The problems discussed here are much simpler. This is worth mentioning/discussing.
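As a companion to the mechanism summarized in the first paragraph of this review, here is a minimal sketch, under assumed tensor shapes and names (transform_coords is not from the paper), of how sampled rotation/translation latents can act on the coordinate grid before the decoder is queried; the closing comment restates the fix suggested in point (3).

```python
import torch

def transform_coords(coords, theta, dx):
    # Illustrative sketch, not the authors' code.
    # coords: (batch, n_pixels, 2) Cartesian grid
    # theta:  (batch,) rotation angles; dx: (batch, 2) translations
    # Only the transformed coordinates reach the per-pixel decoder, so the
    # global latent z is free of orientation information.
    cos, sin = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin, cos], dim=-1)], dim=-2)  # (batch, 2, 2)
    return coords @ rot.transpose(-1, -2) + dx.unsqueeze(1)

# Per the fix suggested in point (3): only the prior is zero-mean,
# p(theta) = N(0, sigma_prior^2), while the approximate posterior
# q(theta | x) = N(mu(x), sigma(x)^2) keeps a free, data-dependent mean.
```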