NeurIPS 2020

### Review 1

Summary and Contributions: The main contribution of this paper is to propose a certifiable training method with a tight outer bound propagation in a highly efficient manner. To my best of knowledge, this is the first non-trivial determinist verification procedure/method that can be applied on Tiny ImageNet.

Strengths: Overall, the paper is well-written and easy to follow the logic. The idea is simple and yet effective in practice, which is well motivated by the geometric interpretation, i.e., get a tighter outer bound (image space). I really like Figure 1. All the technical proof and induction is correct and sound. Extensive experiments are conducted to support their proposed methods.

Weaknesses: It would be great if you can give a simple example like figure 2 to showcase the gap between the proposed method and other baselines (dual relaxation-based, SDP). A tighter or looser region you developed? or do you sacrifice something? the trade-off between efficiency and accuracy?

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I just have some minor comments. 1. For the problem (11), it can be recast into a one-dimensional search problem, (e.g., see the associated lagrangian multiplier w.r.t ball constraints). Naturally, it can be done by some modified secant method to solve it. I am not sure which one is better. Maybe you can have a try. 2. prop 1, why do you cite [39]? 3. line 125, 0? 4. line 216, as you mentioned figure 2(d) at first, it's better to change it to (a). #Update: Thanks for your response.

### Review 2

Summary and Contributions: This paper proposed a method named Boxed constraint propagation (BCP) for both verification and certified robust training. The key idea is to simply combine interval bound propagation as the Linf box constraint and additional L2 ball constraint whose radius is estimated by Lipschitz constant (it appears to me that the lipschitz constant here is global lipschitz constant since the authors use power iterations to get rho. The authors didn't explain clearly, but it appears to me that they use power iteration to get maximum eigen-value of weight matrix which should correspond to global lips const when input perturbation is L2 norm. Please clarify if this is not what you are doing in the paper). The results compared with Wong etal, which is not state-of-the-art in the certified defense method; meanwhile, the improvement of BCP is marginal. Also, the evaluation on the tightness of the bounds is problematic: removing the box constraint would naturally lead to loose bound due to the way that the authors estimate rho. I think a reasonable comparison would be comparing BCP (i.e. equation 11) with IBP on L2 input perturbation and see if the tightness is improved. Overall, the novelty is low and the performance of the method is marginal.

Strengths: -

Weaknesses: 1. Evaluation is problematic and does not compare to state-of-the-art in the certified training (which should be at least IBP, Gowal 2019), please see the "summary and contributions" part for more details. 2. Novelty is relatively low and the improved performance is marginal

Correctness: Evaluation is problematic

Clarity: The notation is a bit dense

Relation to Prior Work: It discussed some certified training methods

Reproducibility: Yes

Additional Feedback: Other questions: 1. Line 125: what does "0 method" mean? 2. Line 204: LMT is not STOA, should compare at least with IBP 3. Fig 2: not clear what it is plotting. Why w/o BCP goes outside the decision boundary? 4. Tab 1: improvement is marginal, and Wong is not STOA ===== post rebuttal ====== The additional experiments compared with IBP on L2 norm shows large improvement. Please include this experiment in the revised version. Also, it is suggested to add all the clarifications as well as simplifying the math derivation in the revision to improve the manuscript.

### Review 3

Summary and Contributions: The paper proposes a certifiable training algorithm based on Lipschitz analysis and interval arithmetic. They show that BCP achieves a tighter outer bound than the global Lipschitz-based outer bound. The computation time has been improved.

Strengths: The paper is well-organized. The authors reasonably explain their motivation and methodology. The significance and novelty are median.

Weaknesses: I have concerns for Algorithm 1 and find it not easy to follow. Particularly to the $\rho^{K-1}$, computed with product sequence. Also, c = W_1-W_0, while z is a vector as described in the notation. How can they have the same dimension for substraction p = z-\\rho^{K-1} c/||c||? In the experiments, the paper stated inline 157 that their methods using the l_infity case could be considered as IBP (interval bound propagation). Although they have compared with IBP given in the appendix, I wonder why they did not put it to the main text.

Correctness: The value $\rho^{K-1}$ could be significantly small or large. I have checked the code, which they set " r = eps*torch.ones(x.size()[0]).cuda() # b, if args.linfty: r *= np.sqrt(x.reshape(x.size(0),-1).size(1))" They set linfty=False meaning that by default one should not use this estimation (the description of the algorithm should be consistent with what is implemented ). Perhaps the author could use normalization methods (batch normalization/ layer normalization) to stabilize the Lipchitz constant. I think the author should give more descriptions in their algorithm. It's not easy to follow and difficult to find the right hyper-parameters. ============================== I have read their rebuttal, which clarifies my concerns and question. Althought the clarity need to be improved, I am willing to increase the score to borderline accept. I would suggest the author make the algorithm in more details without referring to other paragraphs.

Clarity: Median

Relation to Prior Work: Median

Reproducibility: No