__ Summary and Contributions__: The paper proposes an interesting method for weakly supervised semantic segmentation based on a causal intervention model. The context prior plays an important role in semantic segmentation. The model uses a context prior derived from the object mask distribution predicted over the whole dataset to help learn better pseudo semantic labels. Experiments are conducted on both the Pascal VOC 2012 and COCO datasets and show the effectiveness of the proposed approach.

__ Strengths__: The pros:
- The paper is clearly motivated and very well written, and is easy to follow as well.
- The idea of using a context prior to iteratively improve the learning of semantic pseudo labels is interesting and novel. The ablation study shows that the proposed scheme effectively boosts performance over two very strong baseline models, with clear gains.
- The final results of the model show state-of-the-art performance on both the Pascal VOC and COCO datasets.

__ Weaknesses__: The cons:
- The calculation of the context prior is not very clearly described. How is it derived from the whole prediction distribution of the classification model?
- The effectiveness of the context prior is not convincingly demonstrated to the reviewer. How about disabling the context prior and directly concatenating the predicted masks with the input image X? This seems to be an important baseline to run.
- The context prior actually helps to improve the prediction of the pseudo labels. Another interesting baseline is to refine the predicted semantic label map with a CRF model and then concatenate the refined map with the input image X. This would show whether the context prior is indeed more beneficial than another simple way of providing a semantic prior.
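
The concatenation baseline suggested above could look roughly like the following. This is only a minimal sketch: the channels-last layout, the array sizes, and the helper name `concat_mask` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def concat_mask(image, mask):
    """Stack a per-class mask onto an image along the channel axis.

    image: (H, W, 3) array; mask: (H, W, n) array of per-class scores.
    Both layouts are assumptions made for this sketch.
    """
    if image.shape[:2] != mask.shape[:2]:
        raise ValueError("image and mask must share spatial size")
    return np.concatenate([image, mask], axis=-1)

# Hypothetical sizes: an 8x8 image and 21 classes.
rng = np.random.default_rng(0)
X = rng.random((8, 8, 3))
predicted_mask = rng.random((8, 8, 21))
X_aug = concat_mask(X, predicted_mask)   # shape (8, 8, 24)
```

The CRF variant would only change how `predicted_mask` is produced before the same concatenation step.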

__ Correctness__: correct

__ Clarity__: yes

__ Relation to Prior Work__: yes

__ Reproducibility__: Yes

__ Additional Feedback__: Please refer to the weakness part for the rebuttal.

__ Summary and Contributions__: The paper addresses the problem of explicitly modelling spatial context in weakly-supervised semantic segmentation (supervision by image-level labels). Most modern methods rely on implicit learning of context, which may result in learning spurious correlations inherent to a specific dataset, e.g. that a horse can only be present when a person is also present. The paper uses Pearl’s causal inference framework to debias the training data. Experiments explore different ways to model the context and evaluate design choices, in addition to improving SOTA in weakly-supervised segmentation on PASCAL VOC and MS-COCO.

__ Strengths__: Novelty. The paper is the first one to apply causal inference to semantic segmentation in order to explicitly leverage the context.
Motivation. While, as the paper mentions, all modern methods have to take advantage of the context, the correlations in the training data may not be representative of the deployment scenario, so a causal model is reasonable to use (I think the paper undersells this contribution; see below for suggestions to test better generalisation). Causal inference is even more helpful in the weakly-supervised case, where the model cannot even learn spatial correlations implicitly. From the vision side, the paper is motivated by Marr’s work on visual psychology and by later work on modelling context, such as DPMs.
Relevance. The paper is relevant to a large part of NeurIPS community, i.e. both to causal-inference and computer-vision people.
Experiments. They are convincing. The paper improves SOTA on two datasets, tests different versions of the system on them, and uses different segmentation backbones. The backdoor adjustment consistently improves the results by 1–2 p.p. Specifically, the Q1 check (line 254) is interesting; it proves that the improvement is not just due to a better segmentation result.
The paper is generally clear and seems correct (see below for more details).

__ Weaknesses__: Not really a weakness, but something that can make the paper even stronger. Since the paper claims to remove spurious correlation inherent to the dataset, it would be nice to test if the model generalises better in the transfer-learning scenario: train on one dataset, and predict on another.
See the following sections for the comments on correctness and clarity.
UPD. The rebuttal addressed my main concerns. Re Q2, by no means did I suggest using Rubin’s framework; it is simply the one I am more familiar with. What I think is that it would be a valuable addition to the paper to state the assumptions of the causal framework of choice and explain what they mean in the context of the task. I imagine the paper will influence future computer-vision research attempting to use causal inference; it will be important for followers to understand the assumptions.

__ Correctness__: The paper does not clearly state the assumptions related to applying the backdoor adjustment. I am more familiar with Rubin’s framework, which has the overlap assumption; I think something similar should be required for the backdoor adjustment: in order to re-weight horses without riders, do you need to have at least one example of a horse without a person in the training data? Stating the assumptions explicitly would be nice.
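
To make the overlap question concrete, here is the textbook form of the backdoor adjustment together with the positivity condition it is usually paired with. This is a sketch in generic notation (treatment X, outcome Y, confounder C), not the paper's exact formulation; how it maps onto horses and riders is precisely what the paper would need to spell out.

```latex
% Backdoor adjustment over a confounder C that blocks all backdoor paths from X to Y
P(Y \mid do(X = x)) = \sum_{c} P(Y \mid X = x, C = c)\, P(C = c)

% Positivity (overlap): each confounder value must co-occur with the treatment value
P(X = x \mid C = c) > 0 \quad \text{for all } c \text{ with } P(C = c) > 0
```

Without positivity, the term P(Y | X = x, C = c) is undefined for some c, which is the analogue of having no horse-without-person example to re-weight.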
In line 199, the paper refers to the Appendix for the assumptions. There is a derivation there, but the assumptions are not explicitly stated. Also, is the derivation an example for 2 classes rather than the general case?
Figure 6 shows examples with better masks, but I don’t see how they illustrate the hypothesis. It would be good to inspect cases where the objects occur in an unusual context. E.g. a cow on grass is probably not a case where the proposed method shines.

__ Clarity__: The paper is generally clear and easy to read.
I can’t follow the claim in lines 141–144 that no context can contribute to Y when training P(Y | X). The dataset biases are present in X, so can the model not still learn the context from them? Please clarify this claim.
The description of Step 2 – Pseudo-Mask Generation feels vague; a more formal description would help.
In Table 1, it is not clear what the pseudo-mask column refers to. Is it the inferred mask in the last round? Should it converge to the segmentation mask and have the same accuracy in the last rounds?

__ Relation to Prior Work__: Related work is comprehensive.

__ Reproducibility__: Yes

__ Additional Feedback__: Typos:
line 36: As you might concern ← you might be concerned (also this sounds a bit manipulative);
line 65: random control trial ← randomised control trial;
line 68: in an unified ← in a unified;
line 71: the prohibitively “physical” intervention ← the prohibitively expensive “physical” intervention?;
line 195–: the use of X_m for segmentation masks may be confusing, as X is used for input images.
In the experiments, please use percentage points (p.p.) to describe changes in percentages, to avoid ambiguity.
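
To illustrate the ambiguity, a small sketch with hypothetical numbers (they are not results from the paper): the same change can be reported as an absolute difference in percentage points or as a relative change in percent, and the two differ substantially.

```python
def describe_change(before_pct, after_pct):
    """Return (absolute change in p.p., relative change in %)."""
    pp = after_pct - before_pct          # percentage points
    rel = 100.0 * pp / before_pct        # relative change, in percent
    return pp, rel

# Hypothetical scores: 50.0% -> 51.0%.
pp, rel = describe_change(50.0, 51.0)
# pp is 1.0 (p.p.), rel is 2.0 (%): "improves by 2%" and "improves by 1 p.p."
# would describe the same result.
```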

__ Summary and Contributions__: 1. This paper analyzes three problems in pseudo-mask generation for weakly supervised semantic segmentation, namely object ambiguity, incomplete background, and incomplete foreground, and points out that these problems are due to the context prior in the dataset.
2. This paper proposes context adjustment to remove the confounding effect of the context prior and adopts an iterative procedure to generate high-quality seed areas.

__ Strengths__:
1. This paper analyzes the phenomenon of context bias in WSSS and the problems it causes in detail, which is somewhat novel.
2. The proposed method achieves great performance improvement and works well over different methods and datasets.

__ Weaknesses__: 1. What are the results if $M_t$ is calculated using only $C$, without $X$? This experiment should be added to demonstrate the necessity of $X$.
2. The description of Equation 3 is not very clear. $X_m \in \mathbb{R}^{hw \times n}$ and $C \in \mathbb{R}^{hw \times n}$, so the term inside the softmax function has shape $n \times n$. Along which dimension is the softmax applied? Why scale by $\sqrt{n}$? If all $P(c) = 1/n$, then $\sigma_c$ and $P(c)$ can be removed, but the output $M_{t+1}$ is no longer a single-channel map. Are $W_1$ and $W_2$ learned or manually set? Equation 3 should therefore be rewritten carefully; otherwise it is confusing.
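
The shape concern can be checked mechanically. The sketch below follows the reviewer's reading of the shapes only; it is not the paper's implementation. The sizes are hypothetical, $W_1$ and $W_2$ are omitted because their shapes are part of what is unclear, and the softmax axis is an assumption.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

h, w, n = 4, 5, 3                        # hypothetical spatial size and class count
rng = np.random.default_rng(0)
X_m = rng.random((h * w, n))             # segmentation mask, shape (hw, n)
C = rng.random((h * w, n))               # confounder set, shape (hw, n)
P_c = np.full(n, 1.0 / n)                # uniform prior P(c) = 1/n

scores = X_m.T @ C / np.sqrt(n)          # (n, n): one score per (mask channel, confounder)
attn = softmax(scores, axis=-1)          # assumption: normalise over the confounder index
out = C @ (attn * P_c).T                 # (hw, n): still n channels, not a single map
```

Under this reading, `out` has `n` channels, so some further reduction would be needed to obtain a single-channel $M_{t+1}$.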
-----------------------
Reviewer has read the author response. I would like to stick to my original score.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: Please refer to the weakness

__ Summary and Contributions__: This paper tackles the challenging weakly-supervised semantic segmentation task. More specifically, the authors establish a structural causal model to handle the negative impact caused by the scene context. The proposed method is model-agnostic, showing consistent improvement on top of several methods on VOC and COCO datasets.

__ Strengths__: 1. Experiments
1) The proposed method establishes a new state-of-the-art on both VOC and COCO datasets
2) The proposed method serves as a plug-and-play module, consistently improving several strong baselines in a model-agnostic way
3) Through a detailed ablation study, the authors validate their design choices well
2. Novelty
Although the causal intervention method (backdoor adjustment) is not new, this paper is the first to introduce causal inference into the weakly-supervised semantic segmentation task. I think this paper is novel.
3. Relevance
Causal inference is definitely relevant to the NeurIPS community. And this paper well demonstrates how to apply causal inference to tackle a challenging computer vision problem.

__ Weaknesses__: 1. Questions about the structural causal model
1) I feel that the confounder set C can be interpreted as “object shapes and where to place them”. But I still do not have an intuitive way to interpret the image-specific context representation M.
2) Why is X -> M instead of M -> X? From my understanding, we sample object shapes and their locations to get M. And then later we sample object appearance (e.g., texture, lighting, etc.) to get X.
2. Implementation
1) Since the images in both VOC and COCO have different sizes and ratios, I wonder how the authors construct the confounder set C.
2) Is the segmentation mask X_m (L195) logits or probabilities?
3) I feel a bit confused about Eqn. (3). It seems that W_1 and W_2 are used as projection matrices, reducing the dimension from the original spatial size (hw) to the number of classes (n). I wonder if this is reasonable. Also, I think the projected embedding space could have any dimension, not necessarily n?
4) Why do the authors choose P(c) to be uniform? Using the actual object frequencies in the dataset to represent P(c) might be better?
3. Experiments
1) For Q1 in Table 1, more details are required. How exactly is the segmentation mask used in the network? What is its dimension? Is it a soft mask with probabilities/logits, or a binary mask with one-hot labels? What if the authors constructed a self-attention mask similar to Eqn. (3)?
2) Since the proposed method requires iterative refinement, I think it should also be compared with Noisy Student training [A1]. For example, after first training the segmentation model, the authors can use it to generate pseudo labels and then use those pseudo labels to re-train the segmentation model. Comparing with this baseline would show whether the performance gain comes from causal intervention or simply from the iterative refinement of the segmentation model itself.
3) In Table, it seems that the proposed method has a smaller gain when using a stronger feature backbone. Does this mean that a stronger network can better handle the context (e.g., effectively exploiting its advantages while discarding its negative impact)?
Reference
[A1] Xie et al. CVPR 2020. Self-training with Noisy Student improves ImageNet classification
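
The iterative baseline described in point 3.2 has the following general structure. This is a toy sketch only: a nearest-centroid classifier on 1-D points stands in for the segmentation model, all names and data are hypothetical, and the noise component of Noisy Student is left out, so it shows plain pseudo-label re-training rather than the method of [A1].

```python
import numpy as np

def fit_centroids(x, y, n_classes):
    # "Train" the stand-in model: one centroid per class.
    # Assumes every class has at least one sample.
    return np.stack([x[y == k].mean(axis=0) for k in range(n_classes)])

def predict(centroids, x):
    # Assign each sample to its nearest centroid.
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def self_train(x, y_init, n_classes, rounds=3):
    y = y_init.copy()
    for _ in range(rounds):
        model = fit_centroids(x, y, n_classes)   # re-train on current pseudo labels
        y = predict(model, x)                    # regenerate pseudo labels
    return model, y

# Toy data: two well-separated clusters with two noisy initial labels.
x = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
y_noisy = np.array([0, 0, 1, 1, 1, 0])
model, y_refined = self_train(x, y_noisy, n_classes=2)
```

Running the proposed method and such a loop for the same number of rounds would separate the effect of the intervention from the effect of iteration alone.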

__ Correctness__: My only concern is the causal link direction between M and X in the structural causal model. Please check the weakness section.

__ Clarity__: Yes. Overall, I find this paper easy to read and understand.

__ Relation to Prior Work__: Yes. The proposed method can be regarded as a plug-and-play module to be incorporated with existing state-of-the-art weakly-supervised semantic segmentation methods.

__ Reproducibility__: Yes

__ Additional Feedback__: Please address the raised issues in the weakness section.
Update: The rebuttal addresses most of the raised issues. Thus, I am willing to increase my rating and recommend acceptance.