NeurIPS 2020

### Review 1

Summary and Contributions: A general framework for incorporating output constraints in BNNs, which remains a novel and important contribution

Strengths: The experiments are excellent

Weaknesses: The method description is technical and difficult to follow

Correctness: No issues

Clarity: Well written, but the presentation is very technical

Relation to Prior Work: Yes

Reproducibility: Yes

### Review 2

Summary and Contributions: The paper proposes to incorporate output constraints into BNN learning. It first gives a formal definition of the output constraints (deterministic and probabilistic), and then proposes two methods for inference. One is to use some soft data likelihoods that encourage the satisfaction of the constraints and then do posterior inference; the other is to directly optimize the prior parameters to obey the constraints to a maximum extent. The experiments show the usage of the proposed methods.
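The first of the two inference methods summarized above (a soft data likelihood encouraging constraint satisfaction) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the box constraint `[lower, upper]`, the penalty strength `tau`, and both function names are hypothetical.

```python
import numpy as np

def constraint_violation(y, lower, upper):
    # How far each predicted output falls outside the allowed region [lower, upper];
    # zero for predictions that satisfy the constraint.
    return np.maximum(0.0, lower - y) + np.maximum(0.0, y - upper)

def soft_constraint_log_likelihood(y, lower, upper, tau=4.0):
    # Soft "constraint likelihood": zero when all predictions satisfy the
    # constraint, and decreasing linearly with the total violation. Adding this
    # term to the log-posterior encourages (but does not guarantee) satisfaction.
    return -tau * float(np.sum(constraint_violation(y, lower, upper)))

# Predictions mostly inside [0, 1]; one prediction violates the bound by 0.25.
y = np.array([0.5, 0.75, 1.25])
print(soft_constraint_log_likelihood(y, 0.0, 1.0))  # -1.0
```

Because the penalty is finite, posterior mass can remain on constraint-violating functions; this is exactly the "soft, no guarantee" property Review 2 criticizes below.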

Strengths: 1. Good motivation. Incorporating prior knowledge/constraints helps improve the interpretability of BNNs, and specifying output constraints is more intuitive and easier than specifying priors over weights. 2. The definition of the output constraints is formal and rigorous.

Weaknesses: 1. The methods are too straightforward and not novel enough. While the first approach is wrapped in jargon from measure theory and stochastic-process construction and appears deep, the key idea is simply to introduce a soft data likelihood that encourages consistency with the constraint. The second idea is to directly optimize the definition. Incidentally, this should not be called a "variational approximation", because no variational representation is involved. 2. Both inference methods are soft and cannot guarantee that the constraints are actually satisfied (i.e., at the \epsilon level). These are in essence regularization methods, which are common strategies for incorporating existing knowledge. 3. I believe Definition 4.2 is wrong: you need to integrate out y as well. Otherwise, how is it consistent with Eq. (8)?

Correctness: Mostly correct, except for the point mentioned above.

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I have read the rebuttal. I do think the work is useful. I like the formalization of the output constraint better than the proposed algorithms (which are not novel enough to me). However, the gap between the constraint definition and how the algorithm satisfies the constraints should be addressed. Some theoretical guarantee for the proposed inference algorithm, e.g., an upper bound on epsilon as a function of the number of data points, would make the work more solid and complete. I will maintain my score.

### Review 3

Summary and Contributions: The paper outlines an approach to incorporating output constraints into Bayesian neural networks, forming the OC-BNN, with the ability to perform inference in function space rather than parameter space. The method is tested on several different real-world datasets.

Post Rebuttal: The two main discussion points were clarity and novelty. Assuming the authors make the proposed changes, the clarity of the paper should improve. The novelty lies in producing a more general framework (than, e.g., [18]) and a solid experimental setup. This is a solid (but not outstanding) contribution. My score remains the same (marginal accept), but I have increased my confidence in this score.

Strengths:
- Tackles the important problem of more interpretable priors for BNNs
- Empirical validation of the method
- An entire framework has been developed

Weaknesses:
- No discussion or experimental results varying epsilon (soft vs. hard constraints)
- Not quite as novel as claimed

Correctness: In figure 2, the prior predictive sometimes falls outside of the constraint regions for the OC-BNN (whereas it doesn’t for the baseline methods) - why does this happen? Similarly, in figure 3, some of the posterior draws appear to violate the constraint region. Presumably this is because the constraint is a soft rather than hard constraint. Can you make it clear that this is the case, what the value of epsilon is, and show the comparison to hard constraints?
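The soft-versus-hard distinction the reviewer asks about can be illustrated with a toy sketch. This is an assumption-laden illustration, not the paper's method: the constraint region `[0, 1]`, the Gaussian draws, and the penalty strength `10.0` are all hypothetical. A hard constraint rejects posterior draws outside the region outright, while a soft constraint keeps every draw but down-weights violators, which is why some draws can still fall outside the region.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of a function value at one test input,
# under the output constraint that values lie in [0, 1].
draws = rng.normal(loc=0.5, scale=0.5, size=1000)

# Hard constraint: reject every draw that leaves the constraint region.
hard_draws = draws[(draws >= 0.0) & (draws <= 1.0)]

# Soft constraint: keep all draws, but down-weight violators with an
# exponential penalty on the violation magnitude.
violation = np.maximum(0.0, -draws) + np.maximum(0.0, draws - 1.0)
weights = np.exp(-10.0 * violation)
weights /= weights.sum()  # normalized importance weights over all draws

# With N(0.5, 0.5) draws, a substantial fraction falls outside [0, 1]
# and is removed by the hard constraint but merely down-weighted by the soft one.
print(hard_draws.size, draws.size)
```

Under the soft scheme, constraint-violating draws retain positive weight, matching the behavior seen in Figures 2 and 3.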

Clarity: The paper is mostly clear, although I found the derivation of the prior itself in §4.1 a bit unclear. For example, the authors say "(4) can loosely be viewed as an application of Bayes' Rule, where the realization ΦCx of the stochastic process CP is the evidence (likelihood) of C being satisfied by w". This looks more like a product of independent stochastic processes, so I don't understand the link here. Also, is there an assumption that the constraints are independent and don't interact with each other?

Relation to Prior Work: When comparing to [18], the authors say that "Similar to OC-BNNs, constraint (C) satisfaction is modeled as the conditional p(Y, C|x), e.g. Gaussian or logistic. The main differences are: (i) their method is applied to deep Gaussian processes…". Note that the DGP is in fact a stacked Bayesian linear model using random features, which is *exactly* a BNN with a particular form of nonlinearity. Extending their approach to the classification setting would be relatively trivial as well (DGPs have been used in this setting, and the constraint formulation carries over). The point about amortisation holds, and the framework outlined here is more general, but the difference is not as clear as the authors make out.

Reproducibility: Yes

Additional Feedback: Check capitalisation in references