NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 99
Title: Unsupervised Domain Adaptation with Residual Transfer Networks

Reviewer 1

Summary

This paper focuses on unsupervised domain adaptation and on the existence of a source-target classifier mismatch in addition to the difference between the marginal distributions of source and target. The difference between the classifiers is expressed as a perturbation function (as in [25,26]): this is learned with a new deep residual network which also integrates feature learning and feature adaptation to reduce the marginal distribution shift.

Qualitative Assessment

This paper smartly builds on the deep residual network that won the ILSVRC 2015 challenge, proposing to use it for domain adaptation. The explanation of how the residual block is added to the CNN architecture and used to estimate the perturbation function of the target classifier is very clear and constitutes an interesting novel contribution. Moreover, to the best of my knowledge, the entropy minimization principle [28] had not been integrated into a domain adaptation network before, and the MK-MMD is also a new variant of the one used in [5]. On the experimental side, the obtained results are quite convincing, but there might be a need for tests on more challenging testbeds. It might also be useful to clarify on the basis of which measure figures 2-c,d appear better than 2-a,b: indicating the average class-to-class distance, for example, could help to clarify the claimed improvement -- the figure is not fully self-explanatory.
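For concreteness, the following is a minimal sketch of how I understand the residual block that produces the source classifier from the target classifier; the layer sizes and names are mine, not the authors'.

```python
# Sketch only: my reading of the residual classifier, f_S(x) = f_T(x) + delta_f(f_T(x)).
# Layer names and sizes are hypothetical, not taken from the paper's code.
import torch
import torch.nn as nn

class ResidualClassifier(nn.Module):
    def __init__(self, feat_dim=256, num_classes=31):
        super().__init__()
        self.f_T = nn.Linear(feat_dim, num_classes)   # target classifier
        self.delta_f = nn.Sequential(                 # small residual block (perturbation)
            nn.Linear(num_classes, num_classes),
            nn.ReLU(),
            nn.Linear(num_classes, num_classes),
        )

    def forward(self, features):
        f_t = self.f_T(features)        # target-classifier output
        f_s = f_t + self.delta_f(f_t)   # source output = target output + residual
        return f_s, f_t
```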

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

The paper presents a new deep learning architecture for the unsupervised adaptation problem, i.e. when one has labeled training data for the source domain and only unlabeled data for the target domain. The approach combines ideas from deep adaptation networks (ref. 5), residual networks (ref. 8) and entropy minimization for unsupervised learning (ref. 28) into an accurate and principled method.

Qualitative Assessment

The proposed architecture is a smart mix of previously published ideas: the use of MK-MMD from deep adaptation networks (ref. 5), the inclusion of residual network blocks from (ref. 8) to relate the classifiers of the source and target domains, and entropy minimization (ref. 28) to guide the final learning in the target domain, where there are no labeled training samples. This allows the authors to outperform state-of-the-art methods on standard benchmarks. The use of residual learning combined with an entropy minimization criterion to jointly learn the source and target classifiers is new and smart, and it seems to work well in practice.

Yet although the residual learning idea seems complementary to the MK-MMD of (ref. 5) (in the experimental section), this work looks more like an incremental extension of (ref. 5) than the brand new model, very different from (ref. 5), that the authors claim. I do not fully understand the difference between the use of MK-MMD here and what is proposed in (ref. 5). What is different? Why is it better here? And what is the motivation? This part is not fully convincing, especially since there is no experimental comparison between the method proposed in (ref. 5) and the variant proposed here.

It is said multiple times that the residual learning framework guarantees that the residual part will be small, but I am not sure such guarantees actually exist. It is more an experimental finding, and in a very different setting. Why should it be true here?

Experimental results are provided for the two benchmarks of (ref. 5) and show quite convincing improvements of the proposed method over the state of the art, and they demonstrate the actual complementarity of the MMD criterion and the residual learning idea. As said above, a comparison with the same MMD as used in (ref. 5) would help in assessing whether this variant is effective here. There are a number of results that differ from those in (ref. 5), although the experimental protocol described in 4.1 looks strictly the same. This is the case for DAN, the method of (ref. 5), which reaches 72.9 average accuracy on the Office-31 dataset there while it is reported at 70.0 here (cf. Table 1). Comments on such differences should be given. Moreover, results for baseline methods like TCA and GFK are much better here than those reported in (ref. 5) (~66% instead of ~27%), which might be due to the use of deep features as input whereas these methods were run on raw features in (ref. 5)? Please provide more precision on that.

Overall the paper is nice and easy to read. It adds a new brick to previously designed deep architectures for unsupervised adaptation. It is therefore partially incremental in my opinion, but it is still a nice use of the residual learning idea.
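To make my question about the MK-MMD variant concrete, this is roughly what I understand by a multi-kernel MMD penalty between batches of source and target features; the Gaussian bandwidths below are illustrative placeholders, not the values used in this paper or in (ref. 5).

```python
# Rough sketch of a multi-kernel MMD^2 estimate between source and target
# feature batches; bandwidth values are illustrative placeholders.
import numpy as np

def gaussian_kernel(a, b, sigma):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mk_mmd2(source, target, sigmas=(1.0, 2.0, 4.0)):
    mmd2 = 0.0
    for sigma in sigmas:
        k_ss = gaussian_kernel(source, source, sigma).mean()
        k_tt = gaussian_kernel(target, target, sigma).mean()
        k_st = gaussian_kernel(source, target, sigma).mean()
        mmd2 += k_ss + k_tt - 2.0 * k_st
    return mmd2 / len(sigmas)

# Example: mk_mmd2(np.random.randn(32, 256), np.random.randn(32, 256))
```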

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 3

Summary

The paper proposes an approach to domain adaptation of deep neural networks based on the following: 1) multi-kernel maximum mean discrepancy (MK-MMD) to enhance the similarity of the features between the target domain and the source domain; 2) using the entropy of the predictions to adapt the classifier to the target domain; 3) the main contribution: the classifier of the source domain is modeled as the classifier of the target domain plus a residual function. Overall, the technique appears to work well, and there is an extensive comparison with related work.
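As a concrete reading of point 2), the entropy term on unlabeled target predictions would look roughly like the following sketch; the variable names are mine, not the paper's.

```python
# Sketch of an entropy-minimisation term on unlabeled target predictions.
import torch
import torch.nn.functional as F

def target_entropy(target_logits):
    # Mean Shannon entropy of the predicted class distributions; minimising it
    # pushes the classifier toward confident decisions on target samples.
    p = F.softmax(target_logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
```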

Qualitative Assessment

I think the paper is interesting and could become a very good paper if the experimental evaluation were improved. The main issue I have with the paper is that it does not properly isolate/evaluate the contribution of the residual function. It is not clear to me whether this residual block is actually needed. For example, I think the residual block could be replaced by simple L2 regularisation that keeps the source and target classifiers similar, as sketched below. Figure 2 also fails to convince me: I cannot see a qualitative difference between the DAN and the RTN predictions. Additionally, I found the mixing of H and F (what is the residual and what is the original function) confusing at first. It would be better to stick with one notation and keep it consistent.
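To be explicit about the alternative I have in mind, something like the following L2 tie between two separate classifiers could serve as a baseline for the residual block; the dimensions and the penalty weight are hypothetical.

```python
# Sketch of the suggested baseline: two separate classifiers tied by an L2
# penalty ||W_S - W_T||^2 instead of a residual block. Sizes are hypothetical.
import torch
import torch.nn as nn

f_S = nn.Linear(256, 31)   # source classifier
f_T = nn.Linear(256, 31)   # target classifier

def l2_tie_loss(f_S, f_T, weight=1e-2):
    # Penalise the squared difference of weights and biases to keep the
    # source and target classifiers close to each other.
    return weight * (torch.sum((f_S.weight - f_T.weight) ** 2)
                     + torch.sum((f_S.bias - f_T.bias) ** 2))
```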

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

This paper proposes an unsupervised adaptation technique using residual connections to express the target classifier as the sum of the source classifier and a residual function. This is distinct from recent deep adaptation methods, which learn a joint source and target model. This paper also proposes using the kernel MMD to align the representations of the two domains, just as done in the related DAN and DDC methods.

Qualitative Assessment

The idea of learning a residual difference to produce a target classifier is novel, simple, and interesting. The paper is well written and the method is well explained. The main issue stems from the fact that there are three components to the approach: the entropy loss on the final target scores, the MMD loss for domain invariance, and the residual connection. From the experiments it appears that the entropy loss offers the most benefit, followed by MMD, and finally by the residual connection. The MMD loss is not novel, and the entropy loss is not as well explained as the residual connection. The method appears useful; it is simply unclear whether the novel portions of the approach actually contribute much to the overall system.
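For reference, my reading of how the three components combine into a single objective is roughly the following; the trade-off weights are placeholders, not the values used in the paper.

```python
# Sketch of the overall objective as I understand it: supervised loss on the
# labeled source batch + entropy of target predictions + MMD between features.
# The weights lam and gamma are placeholders, not the paper's values.
import torch
import torch.nn.functional as F

def total_loss(source_logits, source_labels, target_logits,
               source_feat, target_feat, mmd_fn, lam=0.3, gamma=0.3):
    cls = F.cross_entropy(source_logits, source_labels)       # source classification
    p_t = F.softmax(target_logits, dim=1)
    ent = -(p_t * torch.log(p_t + 1e-8)).sum(dim=1).mean()    # target entropy
    mmd = mmd_fn(source_feat, target_feat)                    # feature discrepancy
    return cls + gamma * ent + lam * mmd
```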

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)