NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3472
Title: Domain Generalization via Model-Agnostic Learning of Semantic Features

Reviewer 1

My main concern is that the problem of domain generalization is not well defined. The underlying assumption of the proposed method needs to be stated explicitly. "Directly generalize to target domains with unknown statistics", as stated in the abstract, would be impossible if the target domain were very different from the training domains. In comparison, MAML assumes a task distribution from which all tasks (including the target) are sampled. If MASF makes the same assumption, then the problem is not really domain generalization but meta-learning from multiple tasks, as in MAML. The experiments focus on learning from multiple domains and then applying the model to *only one* target domain, which is in fact more closely related to multi-source to single-target adaptation than to domain generalization. For example, the papers below [a,b,c] are theoretically justified, and it would make sense to discuss and compare to them.

[a] Mansour, Y., Mohri, M., and Rostamizadeh, A. (2009). Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems, pages 1041-1048.
[b] Zhao, H., Zhang, S., Wu, G., Moura, J. M., Costeira, J. P., and Gordon, G. J. (2018). Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pages 8559-8570.
[c] Hoffman, J., Mohri, M., and Zhang, N. (2018). Algorithms and theory for multiple-source adaptation. In Advances in Neural Information Processing Systems, pages 8246-8256.

The necessity of global class alignment (Sec. 3.2) is not very convincing: certain design choices are not clearly explained. It is not clear why the soft labels need to be computed from average class features. Alternatively, one could directly compute the average soft labels over the data from a particular class (i.e., average over the final predictions instead of the features). It is also not clear why the symmetrized KL divergence is used instead of the Jensen-Shannon divergence.
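
To make the comparison of the two divergences concrete, here is a minimal sketch for discrete soft-label distributions. It is not taken from the paper: the function names, the 1/2 weighting of the symmetrized KL, and the example vectors are assumptions, and both inputs are assumed to be strictly positive (as softmax outputs are).

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetrized_kl(p, q):
    """0.5 * [KL(p || q) + KL(q || p)] (the 0.5 factor is an assumption)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def jensen_shannon(p, q):
    """0.5 * [KL(p || m) + KL(q || m)] with the mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))


# Hypothetical soft labels for one class, as seen from two source domains.
soft_a = [0.90, 0.05, 0.05]
soft_b = [0.05, 0.90, 0.05]

print("symmetrized KL:", symmetrized_kl(soft_a, soft_b))
print("Jensen-Shannon:", jensen_shannon(soft_a, soft_b))
```

Both quantities are symmetric and vanish when the two distributions coincide. The Jensen-Shannon divergence is bounded above by log 2 (in nats), whereas the symmetrized KL grows without bound as the distributions separate, which is one practical difference between the two choices.
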
Any comments or explanations on these alternatives would be helpful. What is the "linear-sized random subset" in L188?

Experiments
- The VLCS, PACS and MRI results in Tables 1, 2 and 4 have no error bars. Are these results from a single run? Besides, it is not clear what the error bars in Table 3 mean. Are they standard deviations, standard errors or something else? Are these results statistically significant?
- Why is gradient clipping needed? This suggests that the proposed algorithm is not very stable or easy to train.
- How is the margin parameter \xi selected in the experiments, and on what criterion is the selection based?
- Compared to Table 1, some alternative methods are not included in Table 2. Why?

Minors:
- L95, samples are "drawn from" a dataset -> "drawn to form"?
- L219, the highest

==================================== Update after rebuttal ====================================

The rebuttal resolves some of my concerns about the underlying assumption and design choices. It is essential to state the assumptions of the proposed method explicitly. The rebuttal also indicates that several hyperparameters are chosen heuristically. Without a proper selection strategy or a sensitivity analysis for these hyperparameters, it is difficult to tell whether the improvements are due to a better objective function or to extensive hyperparameter tuning. Overall, this is a borderline paper.
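
On the error-bar question raised under Experiments, the following minimal sketch shows how the two common choices differ for the same set of runs. The accuracies are hypothetical placeholders, not results from the paper, and the function name `summarize_runs` is an assumption made for illustration.

```python
import math
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample std, n - 1 in the denominator
    sem = std / math.sqrt(n)            # standard error of the mean
    return mean, std, sem


# Hypothetical accuracies (%) from five independent runs.
runs = [70.1, 71.3, 69.8, 70.9, 70.4]
mean, std, sem = summarize_runs(runs)
print(f"mean={mean:.2f}  std={std:.2f}  sem={sem:.2f}")
```

Because the standard error is the standard deviation divided by the square root of the number of runs, the same runs give smaller bars when the standard error is reported, so a results table needs to state which of the two it uses and over how many runs.
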

Reviewer 2

Originality/Significance: Although the proposed losses are established in other application areas and problem settings, and although their use in this application is somewhat “obvious in retrospect”, these losses have not previously been used in the DG setting together with meta-learning, so it is good to highlight their efficacy in this problem setting. Otherwise the novelty is limited, as the paper uses the same meta-learning pipeline that was proposed for few-shot learning in MAML [10] and extended to domain generalisation in MLDG [23].

Engineering Issues:
(i) Many hyperparameters are introduced (e.g. Algo 1). Tuning these is not straightforward in DG problems.
(ii) L_global and L_local both trigger second-order gradients, which makes training slow.
(iii) L_global seems to compute all pairs of available meta-train and meta-test domains, which is slow and scales badly with the number of domains.

Empirical:
(i) Recent DG papers [1] used ResNet rather than the outdated AlexNet. If this paper is accepted, ResNet experiments should be included so that the paper is not out of date before it is even published.

Minor:
- Some recent methods like [28] are missing from the comparison table.
- L.219 highes -> highest
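
The scaling concern in Engineering Issues (iii) can be illustrated with a minimal sketch. It assumes one alignment term per (meta-train, meta-test) domain pair, which is the reviewer's reading of L_global and is not confirmed here; the function names and the domain labels are hypothetical.

```python
from itertools import product


def alignment_pairs(meta_train, meta_test):
    """All (meta-train, meta-test) domain pairs, one alignment term each (assumed)."""
    return list(product(meta_train, meta_test))


def num_pairs(num_domains, num_meta_train):
    """Number of pairs when num_meta_train of num_domains go to meta-train."""
    return num_meta_train * (num_domains - num_meta_train)


# Three source domains split 2 / 1, with hypothetical labels.
print(alignment_pairs(["A", "B"], ["C"]))

# Pair count for an even split as the number of source domains grows.
for d in (4, 8, 16, 32):
    print(d, num_pairs(d, d // 2))
```

Under this assumption an even split of D domains yields D^2 / 4 pairs, so the number of alignment terms per episode grows quadratically with the number of source domains.
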
Overall assessment: It’s quite a “vision style” paper, not really a fundamental machine learning development. However, the motivation, explanation, ablation, and numerical performance are all quite good, and the real medical application is the icing on the cake. So it could be acceptable for NIPS.

----- Update.

I have read the author feedback and the other reviews. Apart from the somewhat vision-oriented style, I don't see any real flaws. The added ResNet experiments will benefit the longevity and relevance of the paper. Hopefully the authors will also share code so others can build on it.

Reviewer 3

[After the author response] Thank you for reporting the performance of global alignment with JiGen as the baseline, which shows a consistent performance boost. My intention was to check whether the proposed method works only with DeepAll. I conjecture that the proposed method works well with other methods too, not only JiGen. It would also be better if the performance of the local loss were reported in a future version of the paper.

==============================================================

I am positive about this paper because I believe that episodic training is an important topic for learning schemes. The proposed algorithm consists of three parts: episodic training, global alignment, and a local objective. The main contribution of this paper is that global alignment and the local objective are adopted within the episodic training procedure. While each component seems to be closely related to other methods, combining all components under the umbrella of episodic training is a plausible way to address the domain generalization problem. Although the proposed algorithm is validated on several benchmarks, only a single baseline algorithm is used. I think that validating the proposed components, such as global alignment and the local objective, on other baseline algorithms would strengthen the paper.