Summary and Contributions: The authors address an essential problem in domain generalization: learning discriminative domain-invariant features. They combine an entropy regularization loss with the usual losses, i.e., cross-entropy loss for classification and adversarial loss for domain discrimination. They compare their method against state-of-the-art methods on various domain generalization datasets.
Strengths: 1) The authors clearly articulate the problem of learning discriminative domain-invariant features. The simulated data shown in Figure 1 clarifies the problem they are trying to solve. 2) The authors show results on 2 simulated datasets and 2 real-world datasets, all of which are standard benchmarks for domain generalization methods. 3) Each loss term in Equation 10 is explained, and intuition is given for why each term is important.
Weaknesses: 1) Can the authors give the architecture size/number of parameters to learn for the state-of-the-art methods and the proposed method? 2) How were hyperparameters selected? Was there a train, validation, and test split? 3) Some state-of-the-art methods (especially ones with sound theory) were completely ignored. Can the authors add these papers at least to the discussion, if not to the experiments [1-3]? 4) The dataset presented in Figure 1 is a really good motivation. Could the authors report results on such a dataset for the various state-of-the-art methods and the proposed method?
[1] Blanchard, Gilles, et al. "Domain generalization by marginal transfer learning." arXiv preprint arXiv:1711.07910 (2017). This is an extended version of their earlier NIPS 2011 paper.
[2] Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.
[3] Deshmukh, Aniket Anand, et al. "A Generalization Error Bound for Multi-class Domain Generalization." arXiv preprint arXiv:1905.10392 (2019).
The authors have addressed my concerns and clarified most of the doubts I had. I am changing my score from 6 to 7.
Correctness: Looks correct.
Relation to Prior Work: Some of the theoretically sound papers were omitted from the discussion entirely.
Summary and Contributions: Domain generalization is a challenging problem. This paper proposes a regularization approach using an entropy loss alongside a few conventional loss functions to generalize across domain shift. In addition to a cross-entropy loss for classification and an adversarial loss for domain discrimination, the overall objective is to learn conditionally invariant features across all source domains. The proposed method thereby learns classifiers with better generalization capabilities. Experimental results on many datasets are provided.
Strengths: The formulation is theoretically sound and supported by well-reasoned theorems. The authors have compared with several existing methods on multiple datasets. Ablation studies are also strong.
Weaknesses: I am old school, so I am not happy when small percentage improvements are marketed as "outperforming," "superior," etc. If you really look at the tables, the results are somewhat mixed: the proposed method is not uniformly better than existing methods. For example, in Table 3 the proposed method performs better only on Pascal VOC; for the other datasets, the improvements are either small or non-existent. Likewise, in Table 4, the proposed method does not improve on the SOTA for the cartoon and photo domains.
Correctness: No and yes. The claims of outperforming and superiority are not fully justified.
Relation to Prior Work: Yes
Additional Feedback: This is a good paper, although the performance improvement is not uniformly better than the SOTA. I suggest that the authors tone down their claims! I read the other reviews and the authors' rebuttal. I am satisfied with the rebuttal and will keep my current recommendation.
Summary and Contributions: The authors propose to tackle the problem of domain generalization by finding domain-invariant feature representations with invariant conditional distributions across domains. The work is an extension of the adversarial learning approach to domain generalization. The authors accomplish this by minimizing the Jensen-Shannon divergence among the conditional distributions, under the presumption that this is equivalent to entropy regularization. They show that this approach improves domain generalization on several datasets.
Strengths: - The paper is well written and easy to follow. - The approach is theoretically sound. - The authors perform an exhaustive evaluation across multiple datasets. - The results show strong performance benefits on some of the tested datasets from using entropy regularization over basic adversarial learning. - The authors provided code to reproduce the experiments.
Weaknesses: - The entire approach relies on the assumption that minimizing the JSD is equivalent to the intended entropy regularization. This is only true when the classes are perfectly balanced. It would be interesting to see how much the performance is affected by class imbalance. - The approach requires an additional classifier per domain/dataset, which is a bit worrisome to me. How much of the improvement comes from the extra capacity provided by the extra classifiers? How does the model perform if you remove them? How well do you do if the last term of the loss in Equation 10 is removed? What happens when the number of domains is large? - For many of the datasets tested, the improvement over other approaches, or even over the general adversarial approach, is marginal. Post Rebuttal: The authors clarified most of my concerns, so I am raising my score.
Correctness: The claims, approach, and empirical methodology are correct.
Clarity: The paper is very well written and easy to follow. I found minor grammatical mistakes on lines 31, 60/61.
Relation to Prior Work: The authors properly addressed related work and how their approach differs from previously proposed methods.
Additional Feedback: I like the way the authors presented their work and how they tested the approach on several benchmarks. I think it is important to show how imbalance affects the approach and to what extent patch re-weighting alleviates the issue. Showing how much of the performance gain comes from minimizing the KL divergence versus having additional classifiers (more capacity) is also important.