NeurIPS 2020

Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID

Review 1

Summary and Contributions: This paper proposes a self-paced contrastive learning framework that utilises a hybrid memory to jointly distinguish source-domain classes, target-domain clusters, and un-clustered target-domain instances. Significant improvements of the proposed method over the baselines are shown on multiple benchmarks.

Strengths: 1. The paper proposes an innovative way to fully utilize all data during domain adaptation for re-ID, while other methods discard source-domain knowledge and target-domain outliers. 2. To include different sources for training, the authors define a unified contrastive loss that jointly considers three sources of supervision, with appropriate adaptation on the source domain to match semantics. 3. The hybrid memory design provides centroids/instances for the unified contrastive loss, while live updates for the source domain and self-paced clustering in the target domain refresh the memory. 4. The self-paced learning mechanism helps form more reliable cluster centroids by introducing two metrics, independence and compactness, to make the clustering process self-adaptive. 5. Abundant studies show the effectiveness of the designed components, together with an oracle setup to reveal a possible upper bound.
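The unified contrastive loss summarised above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the memory layout, feature dimensions, and temperature value are assumptions; the key point is that one softmax is taken over all prototypes (source class centroids, target cluster centroids, and un-clustered target instances) concatenated in a single hybrid memory.

```python
import numpy as np

def unified_contrastive_loss(f, memory, pos_idx, temp=0.05):
    """Contrastive loss of one query feature f against every entry in the
    hybrid memory (class centroids + cluster centroids + outlier instances,
    stacked row-wise). pos_idx is the row of f's positive prototype."""
    logits = memory @ f / temp              # similarity to all prototypes
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[pos_idx])

# Toy hybrid memory: 3 class centroids + 2 cluster centroids + 2 outliers.
rng = np.random.default_rng(0)
memory = rng.normal(size=(7, 8))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)   # L2-normalise rows
f = memory[4] + 0.1 * rng.normal(size=8)                  # query near row 4
f /= np.linalg.norm(f)
loss = unified_contrastive_loss(f, memory, pos_idx=4)
```

Because the query lies close to its positive prototype, the loss against row 4 is much smaller than against any other row; lowering the temperature sharpens this contrast.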

Weaknesses: 1. The poor result of the reimplemented MoCo in Table 4 (Section 4.3) needs further explanation and reasoning. 2. A clearer description is needed of the difference between the 'src. class + tgt. cluster (w/ self-paced)' setup in ablation study Table 5 and the full model. If the only difference is the target instances, where does the ~10-point difference in mAP come from?

Correctness: The claims and method are correct. The empirical methodology raises no doubts.

Clarity: Most parts of this paper are easy to follow and clear for practice.

Relation to Prior Work: Clearly discussed in the related work section.

Reproducibility: Yes

Additional Feedback: Comments after the rebuttal: I agree with the concerns brought up by R6 and R9 that the paper shares similar ideas with memory banks and momentum contrast. Thus, I decrease my score to "marginally below the acceptance threshold".

Review 2

Summary and Contributions: This paper proposes a self-paced contrastive learning framework for object re-identification. The authors learn feature representations by contrasting each feature against the source classes, target clusters (unsupervised), and remaining target samples (which in effect act as clusters with a single member). They also use a memory bank to keep prototype representations for each source class, target cluster, and target outlier. These prototypes are updated through momentum updates. They claim state-of-the-art results on object re-identification benchmarks.
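The momentum update of the memory prototypes mentioned in the summary can be sketched as below. This is a minimal illustration, not the authors' code: the momentum coefficient and the L2 re-normalisation step are assumptions about typical practice.

```python
import numpy as np

def momentum_update(prototype, feature, m=0.2):
    """Momentum update of a stored prototype (class/cluster centroid or
    outlier instance feature) with the new encoder feature, followed by
    L2 re-normalisation so prototypes stay on the unit hypersphere."""
    new = m * prototype + (1.0 - m) * feature
    return new / np.linalg.norm(new)

proto = np.array([1.0, 0.0])            # stored prototype
feat = np.array([0.0, 1.0])             # fresh feature from the encoder
proto = momentum_update(proto, feat)    # drifts toward the new feature
```

The momentum term keeps the memory slowly moving, so the contrastive targets remain consistent across iterations.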

Strengths: + Smoothly combines the ideas of [13] and [45] in a unified framework specifically targeting the object re-identification problem. + They present state-of-the-art results on several object re-identification benchmarks. + The authors provide a decent ablation of their components.

Weaknesses: - As the system keeps features for all instances in the target domain in a memory bank, it may not scale well with a large number of unlabelled instances. - The cluster reliability measures are a bit ad hoc and should be explained better. They also do not seem to have too large an effect on the final results.

Correctness: Appears to be so.

Clarity: It is in a decent form but presentation can be improved. There are many repetitions.

Relation to Prior Work: The paper shares similar ideas with [45] (instance discrimination with memory banks) and [13] (MoCo, momentum contrast), though the proposed method is sufficiently different from both papers and operates on an entirely different problem (object re-ID). That said, there is value in acknowledging and explaining these relations. For instance, the loss for target outliers is quite similar to instance discrimination; the main difference is also having the class and cluster centroids. Keeping all these prototypes in a hybrid memory also resembles the memory bank of [A]. The way the prototypes and features are updated also has loose similarities with the momentum contrast idea. Establishing these links would increase the quality and readability of the paper. [45] Zhirong Wu, Yuanjun Xiong, Stella Yu, Dahua Lin, "Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination". [13] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick, "Momentum Contrast for Unsupervised Visual Representation Learning".

Reproducibility: No

Additional Feedback: Update after the rebuttal: After reading the other reviews and the rebuttal, I am a bit concerned about the ethical issues. I was not aware of it before, but the general ethical concern about the DukeMTMC dataset and Duke University's removal of it should be a good enough reason not to report on this dataset in any scientific paper. The fact that this is overlooked and not even mentioned in the ethics and broader impact section raises serious concerns about this paper. Accepting the paper in its current form would also encourage future usage of this dataset in follow-up papers. In light of these ethical concerns, I would suggest removing the results on DukeMTMC, following the decision of Duke University (the publisher of the dataset). Unfortunately, that would make the paper a bit weaker than it currently is.

Review 3

Summary and Contributions: The paper addresses the problem of unsupervised domain adaptation (UDA) with a strong emphasis on the task of re-identification. The contributions include: the extension of the contrastive softmax with a specific selection of class centroids from the labeled source domain along with cluster centroids and outlier instances of the target domain, and a hybrid memory, which can be considered a non-parametric model used to maintain and update the state of clusters and outliers between epochs. A form of self-paced learning is proposed which considers the cluster reliability of pseudo labels to remove difficult/noisy samples from the target domain each epoch, for smoother and more reliable training of domain transfer. This is predominantly an empirical paper without theoretical novelty. That said, the combination of pseudo-label clustering and network training within a self-paced framework can be considered a novel contribution that leads the proposed method to outstanding results.
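The cluster-reliability criterion summarised above can be paraphrased in code. This is a hypothetical sketch, not the paper's exact criteria: the IoU formulation, the thresholds alpha/beta, and the helper names are my assumptions. The intuition is that a cluster is independent if loosening the DBSCAN radius does not merge it into a larger one, and compact if tightening the radius does not split it apart.

```python
def iou(a, b):
    """Jaccard overlap of two sample-index sets."""
    return len(a & b) / len(a | b)

def is_reliable(cluster, looser_cluster, tighter_clusters, alpha=0.9, beta=0.9):
    """cluster: set of sample ids at the working DBSCAN radius eps.
    looser_cluster: the cluster containing those samples at radius eps + d.
    tighter_clusters: the sub-clusters they fall into at radius eps - d.
    Independence: the cluster barely grows when the radius is loosened.
    Compactness: most points stay together when the radius is tightened."""
    independence = iou(cluster, looser_cluster)
    compactness = max(iou(cluster, t) for t in tighter_clusters)
    return independence >= alpha and compactness >= beta

c = {1, 2, 3, 4}
ok = is_reliable(c, looser_cluster={1, 2, 3, 4},
                 tighter_clusters=[{1, 2, 3, 4}])          # stable cluster
bad = is_reliable(c, looser_cluster={1, 2, 3, 4, 5, 6, 7, 8},
                  tighter_clusters=[{1, 2}, {3, 4}])       # merges and splits
```

Samples in unreliable clusters are then returned to the pool of un-clustered instances for the next epoch, which is what makes the curriculum "self-paced".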

Strengths: The paper is written reasonably well with a clear structure and presentation. An extensive evaluation is given for several domain transfer scenarios across seven different datasets. In all experiments the proposed method significantly outperforms other UDA methods. The authors have provided links to their code. Despite not yet having run the code to check for reproducibility, the code appears to be complete (also see the DukeMTMC note in the weaknesses). The UDA re-ID problem is interesting from the perspective of learning representations, generalising across domains, and moving towards unsupervised learning.

Weaknesses: I was not able to download the DukeMTMC dataset used in most experiments; the link in the reference no longer works. It would appear that this dataset has been taken down by Duke University due to a potential privacy infringement since June 2019. This would make it difficult to ethically reproduce many of the experiments in this paper. There are enough other datasets used in this work to validate the proposed method without DukeMTMC. Given that DukeMTMC was taken down so long ago, I do not know why it is used in the experiments, especially for Tables 3, 4, and 5, where MSMT17 could be used instead. For a NeurIPS paper I would expect more of a theoretical grounding for the various design choices. However, the paper does motivate several choices using common intuition, e.g., measuring cluster reliability due to the uneven density in the latent space.

Correctness: No novel or strong theoretical claims are made that require a proof. The empirical methodology uses common metrics and appears consistent with comparative papers.

Clarity: Overall the paper is well written with a clear and illustrative description of the proposed method. Some grammatical and clarity issues remain and should be addressed. Line 145 should read ", the performance drops significantly." "the target-domain instance features {v} are only initialized once at the beginning of the whole learning algorithm" on line 165 is contradicted in lines 166, 179 and with equation 4.

Relation to Prior Work: While the appendix attempts to highlight the differences between the hybrid memory and the memory used in ECN, more could be done to express the differences between the hybrid memory updating procedure and the MoCo [13] method. Other than the use of centroids in a non-parametric fashion, there appears to be significant similarity, which should be acknowledged in Section 3.1.2. While this paper differs in its use or application of measuring cluster stability for self-paced learning, it also overlooks some earlier work in cluster stability analysis. Namely, more could be done to motivate thresholding cluster compactness and independence using selected hyper-parameters (e.g., density and alpha/beta) against other measures of dissimilarity in hierarchical clustering (e.g., [A]) or forms of consensus clustering [B]. [A] R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander (2015), "Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection", ACM Transactions on Knowledge Discovery from Data. [B] A. Strehl and J. Ghosh (2002), "Cluster ensembles – a knowledge reuse framework for combining multiple partitions", Journal of Machine Learning Research.

Reproducibility: Yes

Additional Feedback: More discussion on the sensitivity of hyper-parameters and the choice of clustering algorithm should be made with regard to their effect on the UDA problem. While experiments with DukeMTMC are valuable for comparing with prior work, it is important to respect the privacy of the people in the dataset and acknowledge that this dataset is no longer available. More emphasis should be placed on the other experiments with publicly available datasets. Overall the paper is very strong on the empirical side, with very impressive results. Update: In the rebuttal the authors continue to use Duke as an empirical datum in responses to other reviewers and have not acknowledged that this dataset has officially been decommissioned, meaning that other researchers cannot access this dataset for reproducibility and, more importantly, that any storage or distribution of said dataset is considered unethical as it is known to breach privacy standards. See: The act of disregarding this takedown notice stands against the comments in the Broader Impact section around infringement of people's privacy. While this dataset is common for re-ID, it has been removed for over 12 months, and the camera-ready version of this paper should acknowledge this and, as suggested, remove or reduce the use of Duke. The additional results support the effectiveness of this approach and I am sure a revised version of the paper would do just as well on other datasets as shown, but this would require significant changes.

Review 4

Summary and Contributions: This work addresses the task of unsupervised domain adaptation for object re-ID. It proposes a contrastive learning framework with source-domain class-level, target-domain cluster-level and target-domain instance-level supervision. It also defines two criteria, independence and compactness, to help obtain reliable clusters for learning. Experiments are conducted on person and vehicle re-ID, and some ablation studies are also presented.

Strengths: + The task of unsupervised domain adaptation is interesting and challenging. + Multiple datasets are used for evaluations. + Related works are appropriately discussed and compared.

Weaknesses:
- The main idea of this method is unified contrastive learning. However, the strategy of jointly learning the source and target domains is not new, although different methods implement it with different losses (e.g., in [57,58]). It is also natural that performance on the source domain with joint learning of the source and target domains is higher than with fine-tuning on target data only. Besides, the form of non-parametric contrastive learning is widely used in general unsupervised visual representation learning methods (such as MoCo and SimCLR) and is not new in this method.
- The proposed unified contrastive learning assumes that the source domain has classes disjoint from the target domain, as it needs to collect cross-domain samples as negatives. This may hold for current UDA benchmarks, but the generality of a method based on such an assumption is limited in real-world practical application scenarios where no prior knowledge is available on the target data. Existing methods that optimize the source and target domains separately thus show more advantages in this aspect.
- It is not clear why optimizing class-level and instance-level contrastive losses simultaneously works. Class-level supervision differs from instance-level supervision as an optimization target. The MoCo experiments on UDA do not work, which also implies that instance-level supervision is not suitable for distinguishing semantic classes in object re-ID tasks. The paper lacks sufficient explanations and corresponding ablation studies to clarify this point; it is hard to be convinced that such a contrastive loss can work given the current content and experiments.
- The ablation studies are not clear and sufficient enough. (1) What are the differences between "src class + tgt class (w/o self-paced)" and "ours w/o self-paced r_comp & r_indep"? It is not clear which algorithmic components self-paced learning contains, and detailed descriptions of these experimental settings are missing. (2) Ablation studies of different combinations of class-level, cluster-level and instance-level supervision are not presented. Since unified contrastive learning is the core idea of this method, these experiments are necessary but unfortunately missing. (3) I am also confused about the difference between w/o self-paced and Delta_d=0. (4) Why did using learnable classifiers perform worse than using class centroids for the source domain? Necessary theoretical analysis and explanations are missing.
- The independence and compactness strategies for clusters seem tricky and incremental. They require multiple manual parameters on top of DBSCAN clustering. From Table 5, on the Market-to-Duke task, mAP drops only 0.8% w/o r_indep and only 1.3% w/o r_comp, which implies incremental contributions of these strategies.
- In the unified contrastive learning (Eq. 1), if f is a target-domain un-clustered outlier, it is not clear how to collect its corresponding positive samples.
- Softmax-based losses and triplet losses are widely used in object re-ID tasks. It is necessary to compare them with the contrastive loss, but these comparisons and analyses are missing from this work.
- The parameter analysis experiments show that tuning the temperature parameter has a large impact on final re-ID performance, e.g., 68.8% mAP with 0.05 vs. 57.4% with 0.09. Such a large gap (11.4% mAP) is even bigger than the effect of other major algorithmic components. This may imply that the method is sensitive to this parameter and not robust enough to extend to other tasks, and it raises the concern of whether the improvement mainly comes from hyper-parameter tuning.
- In Figure 3, the metric of cluster number is not good enough to show the quality of clustering. A better way is to use quantitative metrics (e.g., NMI or F-measure) to check how good or bad the clusters a method obtains are.
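For reference, the NMI metric suggested in the last point can be computed as follows. This is a self-contained sketch using arithmetic-mean normalisation (other normalisation variants exist); it is not tied to the paper's code.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings: MI divided by
    the arithmetic mean of the two label entropies. Invariant to
    permutations of cluster ids, unlike raw cluster counts."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(nij / n * math.log(n * nij / (ct[i] * cp[j]))
             for (i, j), nij in joint.items())
    h = lambda c: -sum(v / n * math.log(v / n) for v in c.values())
    denom = (h(ct) + h(cp)) / 2
    return mi / denom if denom else 1.0

perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # permuted ids, same partition
```

Here `perfect` is 1.0 because the second labeling is the same partition with permuted ids, while two independent labelings score near 0.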

Correctness: It needs more clarification and experiments on the method design, e.g., why does combining instance-level and class-level supervision work?

Clarity: The sentences are well written. But it lacks some necessary ablation experiments and analysis.

Relation to Prior Work: It lacks some necessary ablation studies and detailed descriptions of some experimental settings.

Reproducibility: Yes

Additional Feedback: See above. Update after the rebuttal: I have read the other reviews as well as the rebuttal. The rebuttal addressed part of the raised issues. However, I still have concerns about the technical novelty, the theoretical basis and the ethical problem. About 80% of the experiments rely on the questionable Duke dataset. I think this paper needs major improvement and thus keep my original rating.