Reviews: Learning Perceptual Inference by Contrasting

Originality: I find the originality to be good. There is a nice translation of contrastive learning from cognitive science, and I believe the authors implemented it well in a machine learning setting. To the best of my knowledge, the contrast module is novel.

Quality: I find the submission to be technically sound. The ablation study clearly shows the impact of the authors’ contrast module.

Clarity: I found the article to be clear with respect to its own model, but lacking details when it came to the WReN permutation sensitivity.

Significance: I find the significance of the authors’ empirical results to be high. RPM is a task that is well established at top-tier conferences, and obtaining such strong SOTA results is very impressive.

- Overall the authors make a valid point that a machine learning reasoning model should exhibit some sort of invariance with respect to the order of the multiple choice answers. In Lines 64-68 the authors claim that they removed positional tagging from one model, WReN [14], and that this decreased performance by 28%.

- For WReN: How exactly was this removal of positional tagging performed? Some more details would be very helpful here, even if they are relegated to the Supplementary Materials. For example: exactly what are the WReN representations, how did you remove the tags, did you retrain, and if so what was your training scheme and what type of hyperparameter search did you perform?

- How exactly is WReN permutation sensitive? Please correct me if I am wrong, but it is my understanding that WReN independently processes 8 different combinations O_1 = {O union a_1}, …, O_8 = {O union a_8}, where O is the sequence of 8 given context panels and the a_i are the multiple choice answers. Each set O_i contains 9 panels, and each panel in O_i is given a one-hot positional embedding. But if this is the case, then a_i is always given the positional embedding [0,…,0,1], i.e., all zeros except a 1 in the 9th coordinate. I quote from [14]: “The input vector representations were produced by processing each panel independently through a small CNN and tagging it with a panel label, similar to the LSTM processing described above…” and, for the LSTM processing it refers to: “Since LSTMs are designed to process inputs sequentially, we first passed each panel (context panels and multiple choice panels) sequentially and independently through a small 4-layer CNN, and tagged the CNN’s output with a one-hot label indicating the panel’s position (the top left PGM panel is tagged with label 1, the top-middle PGM panel is tagged with label 2, etc.) and passed the resulting sequence of labelled embeddings to the LSTM.” If my understanding is correct, then I do not see how permuting the answer choices affects the WReN model in the spirit of Lines 118-120. Can the authors clarify here?

- For [12], I believe models that stack all the answer choices along a channel axis should be evaluated via their permutation equivariance, not permutation invariance.

- For completeness, the authors are missing a reference to “Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations”, which also benchmarks the Raven’s Progressive Matrices task.

- The contrast module in Figure 1c is well diagrammed and intuitive to understand. One benefit of the contrast module over the RN model that immediately comes to mind is that the contrast module seems to scale linearly in the number of answer choices, whereas the RN produces a quadratic set.

- The benchmarks on RPM are impressive, and on PGM the authors are able to exhibit a very healthy gap in performance over “permutation invariant”-ized baseline models. The ablation studies show how significant the contrast module is.
- Overall I view this submission in the following manner: there is clear architectural novelty, the motivation for permutation invariance is obvious, and the results are quite strong. I’m convinced that the authors’ model clearly beats the WReN model, but I believe the authors may have misrepresented the WReN model’s permutation sensitivity (see my point above about the tagging of the panels). Specifically, I find that the authors’ removal of the positional tagging misrepresents the permutation sensitivity of the WReN model. I would be happy if the authors corrected any misunderstanding on my part, because I do believe the empirical results are strong and the hypothesis is intuitive and well-motivated.

Grammar and expository-oriented comments:

- Line 8: I don’t think this is the proper use of “i.e.”, which is typically used like “in other words”. Instead it could be: “In this work, we study how to improve machines’ reasoning ability on one challenging task of this kind: the Raven’s Progressive Matrices (RPM).”

- Lines 306-307 are a bit too strong.
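
To make the WReN tagging question above concrete, here is a minimal sketch of my reading of [14]. It is not the WReN implementation: `toy_score` is a hypothetical stand-in for the CNN, relation network, and scoring head, and the panel embeddings are random lists. The only assumption it encodes is the one I describe above, namely that each candidate set is scored independently and the answer panel always receives the 9th positional tag.

```python
import random

NUM_CONTEXT = 8      # context panels in one RPM problem
NUM_POSITIONS = 9    # 8 context slots plus the single answer slot


def one_hot(index, size=NUM_POSITIONS):
    """Return a one-hot list with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def tag_candidate_set(context, answer):
    """Build O_i = O union a_i; the answer panel always lands in slot 9."""
    panels = list(context) + [answer]
    return [emb + one_hot(pos) for pos, emb in enumerate(panels)]


def toy_score(tagged_panels):
    """Hypothetical scorer: any deterministic function of the tagged panels."""
    return sum((i + 1) * (j + 1) * value
               for i, panel in enumerate(tagged_panels)
               for j, value in enumerate(panel))


def score_answers(context, answers):
    """Score every candidate independently of the other candidates."""
    return [toy_score(tag_candidate_set(context, a)) for a in answers]


rng = random.Random(0)
context = [[rng.random() for _ in range(4)] for _ in range(NUM_CONTEXT)]
answers = [[rng.random() for _ in range(4)] for _ in range(8)]
perm = list(range(8))
rng.shuffle(perm)

scores = score_answers(context, answers)
permuted_scores = score_answers(context, [answers[p] for p in perm])

# Shuffling the answers only shuffles the scores in the same way.
assert permuted_scores == [scores[p] for p in perm]
```

Under this reading the scores are permutation equivariant, so the selected panel does not depend on the order of the choices. If the variant the authors evaluated instead tags each answer with its index among the 8 choices, the final assertion would no longer be expected to hold, and that is the detail I would like clarified.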

Paper ID:	634
Title:	Learning Perceptual Inference by Contrasting
