__ Summary and Contributions__: This paper presents a control function approach for causal inference with an instrumental variable. The proposed method trades the typical assumptions on the outcome process for assumptions on the treatment process instead; estimation proceeds in two stages, the first using a variational autoencoder to estimate the control function and the second using maximum likelihood for outcome modeling.
-----
Update post-rebuttal: I was hoping for more comparison against previous identification results than the brief description in Section 2.2. It remains unclear how the authors will expand on this in the updated version. I look forward to seeing how the authors address the multiple suggestions to make the contributions and writing clearer.

__ Strengths__: The claims made appear sound. It seems worthwhile to consider settings where usual functional assumptions may hold for the treatment but not outcome.

__ Weaknesses__: The main theoretical contribution is the identification result in Theorem 1, but it is not explained how this relates to standard identification results for control functions, and I cannot immediately see this from the result itself (for example, it is not clear to me how the assumptions map to those of Guo & Small's 2016 paper). For a new identification result, I would expect a comparison and contrast with previous results, for example stating which conditions are stronger or weaker, etc.
I also found the assumptions required for identification to be difficult to gauge in a real data analysis.
The method proposed in Section 2.2 seemed somewhat opaque to me, and no analysis of its properties was given, beyond what I felt was somewhat limited evidence from simulation experiments in Section 3.

__ Correctness__: I did not see errors in the claims or methodology.

__ Clarity__: I found this paper somewhat difficult to follow, despite being quite familiar with the IV literature. I felt it was lacking some useful description and background throughout.

__ Relation to Prior Work__: Beyond the first point made in the Weaknesses section above, it appears the paper cites sufficient relevant work.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper combines ideas from variational inference and the control function approach to instrumental variable estimation to produce a method for instrumental variable estimation that allows weaker constraints on how unobserved confounders affect the response than the common y = f(t,x) + u setting, at the expense of stronger constraints on how the unobserved confounders affect the treatment (though slightly weaker than the original control function setting).
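For concreteness, here is a toy sketch of my own (not the paper's method) of the classical control-function recipe in the additive setting y = f(t) + u that the paper generalizes, with a linear f for simplicity: stage 1 regresses the treatment on the instrument and keeps the residual as a proxy for the unobserved confounder; stage 2 includes that residual as a control.

```python
import numpy as np

# Toy illustration of the classical control-function approach in the
# additive setting (linear f); all variable names are hypothetical.
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)                                # instrument
u = rng.normal(size=n)                                # unobserved confounder
t = 1.5 * z + u + 0.1 * rng.normal(size=n)            # treatment, additive in u
y = 2.0 * t + 3.0 * u + 0.1 * rng.normal(size=n)      # confounded outcome

# Naive OLS of y on t is biased because u drives both t and y.
naive_slope = np.polyfit(t, y, 1)[0]

# Stage 1: t ~ z; the residual uhat estimates the confounder.
X1 = np.column_stack([z, np.ones(n)])
uhat = t - X1 @ np.linalg.lstsq(X1, t, rcond=None)[0]

# Stage 2: y ~ t + uhat; the coefficient on t recovers the causal effect.
X2 = np.column_stack([t, uhat, np.ones(n)])
effect = np.linalg.lstsq(X2, y, rcond=None)[0][0]
print(naive_slope, effect)  # naive slope is biased upward; effect is near 2.0
```

The paper's contribution is to relax the additivity baked into this recipe on the outcome side, at the cost of the treatment-side constraints discussed above.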

__ Strengths__: This is creative work - it shows how the control function approach, which attempts to estimate the unobserved confounder so that you can use it as a control in the downstream estimation task, can be generalized beyond the additive treatment and additive response model (eqn 1). Theorem 1 summarizes the necessary conditions that allow the authors to weaken these assumptions to additive or monotonic treatment relationships without the constraints on the outcome model. These conditions are, however, tricky to satisfy, so much of the remainder of the paper focuses on designing a methodology that meets the requirements of the theorem. Theorem 1 requires independence between the control function and the instrument - which the authors achieve via their variational decoupling method, which minimizes the mutual information between these two variables. Because they're optimizing an upper bound on this quantity & don't appear to be enforcing this constraint exactly (the Lagrange multiplier appears fixed throughout the optimization), it seems likely that this mutual information minimization is not driven to 0. The authors give bounds on the estimation error that follow from this problem. They also do a decent job of explaining the subtleties around the joint independence requirements that follow from their theorem.
On the empirical side - the performance of the method appears to be strong relative to TSLS & DeepIV - particularly in the cases where the TSLS & DeepIV assumptions fail (as one would expect). This is somewhat comforting because, as I discuss below - there are a lot of moving pieces in this model, so it's not clear a priori that it should work.

__ Weaknesses__: - the method needs both zero mutual information between \hat{z} and \epsilon and perfect reconstruction of the treatment for the conditions of Theorem 1 to apply. The estimation error section gives some discussion of bounds that suggests things don't get too bad, but I really would have liked to see this evaluated experimentally.
- some of the sections are clearly written but the overall structure could be better: I would bring lines 187 to 206 before section 2.2 so that section 2.1 deals with all the conditions & assumptions that have to hold for the method to work. Then presenting variational decoupling as your approach to attempting to satisfy those conditions + being explicit about the ways it might fail would make me less skeptical of the work. In particular, as mentioned above, the error bounds section seems to be an important part of the contribution, but as it's currently positioned it feels tacked on...
More minor:
- I know economists love using the weather as instruments, but a hurricane feels like a poor example - it seems likely that many response variables would also be affected by a hurricane (so it would fail the exclusion assumption).
- Consider swapping \epsilon and z in your notation... z is widely used as the symbol for an instrument & it's not unusual to use \epsilon as a confounder, so as someone who reads a lot of IV literature I kept getting confused about which variable was which... it felt a bit like you were using y for features and x for the target variable.

__ Correctness__: As far as I can tell the claims are correct.

__ Clarity__: See above - it could definitely be a lot clearer.

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: Post rebuttal - increased my score to account for the new experiments & proposed changes.

__ Summary and Contributions__: UPDATE:
I remain at a weak accept. Without seeing the result of improved presentation and clarity, I wouldn't increase my score, since I already took into account that the presentation might improve. The rebuttal was too brief to really get the message across. I agree with R1's comment that the identification results should be better discussed and compared to related work, and this is reflected in some of my review. That said, the results do appear to have enough novelty. Overall I think this paper could be a strong contribution; based on the provided information I'm just not convinced enough to give a clear accept.
The paper introduces instrumental variable methods for causal inference in models that are non-separable, i.e. where the outcome is not a (weighted) sum of the treatment and confounder. Additionally, other linearity assumptions are relaxed. This task requires specific conditions on the causal graph, which are derived and explained. The paper formulates a variational objective whose optimization leads to one of these conditions.

__ Strengths__: IV regression is an important area of causal inference, complementing methods that adjust for observed confounders. Existing IV regression methods require strong assumptions on the functional forms generating the treatment and outcome. Here, these are replaced with different and potentially weaker assumptions such as a monotone relationship between the unobserved confounder and the treatment. A sound theoretical analysis shows which independence assumptions are necessary and an elegant constrained optimization (VDE) is developed to satisfy them. Toy experiments show that other methods can fail when their assumptions are violated.

__ Weaknesses__: 1) The experiments are simple toy simulations. I don’t know if this is standard practice in IV regression, but it would be desirable to use benchmarks if available. However, the high-dimensional experiment, taken from prior work, is a more challenging/convincing test. The slave export dataset is unconvincing, since it only shows that the method gives a similar estimate to another method that requires stronger assumptions, but we do not know whether that estimate is correct.
2) The methods section and introduction could be written more clearly. Before publication, I would recommend giving them a thorough rewrite.
3) A1 and A2 are not clearly explained. Furthermore, it should be discussed how realistic these assumptions are.
Although its formulation should be improved, in my view Theorem 1 is a useful result and leads to a well-justified algorithm, VDE. As far as I can see, these are novel insights, so I lean towards acceptance.

__ Correctness__: I am not aware of any errors.

__ Clarity__: I would recommend finding a more intuitive structure for section 2 if possible, e.g. some related work has a clear separation between stage 1 and stage 2. Section 3 is clearly structured and easier to follow.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: Kernel instrumental variable regression may be an appropriate additional baseline as it also relaxes linearity assumptions. DeepIV already does this, but the performance of neural nets can be a bit unpredictable.
Does VDE work if z is multidimensional, and what would be the necessary assumption - monotonicity/additivity in all components?
Typos:
“While VDE creates control functions, that are independent of the instrument.”, “when its observed”, “is can put”, “GCFN is also more robust to confounding than DeepIV when the additive outcome process assumption.”, “and guarantees positivity”